Christian Reiss
2022-Sep-14 07:47 UTC
[Gluster-users] Weird Issue / Gluster not really synced
Hey folks,

I am having a weird issue here. I am running a 3-node gluster setup with these versions:

glusterfs-selinux-2.0.1-1.el8s.noarch
glusterfs-9.6-1.el8s.x86_64
centos-release-gluster9-1.0-1.el8.noarch
libglusterfs0-9.6-1.el8s.x86_64
libglusterd0-9.6-1.el8s.x86_64
glusterfs-cli-9.6-1.el8s.x86_64
glusterfs-server-9.6-1.el8s.x86_64
glusterfs-client-xlators-9.6-1.el8s.x86_64
glusterfs-fuse-9.6-1.el8s.x86_64

My volume info:

Volume Name: web-dir
Type: Replicate
Volume ID: 4ff57154-6ccb-45b0-97da-c12b8b5afa2b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: wc-srv01.eulie.de:/var/lib/gluster/brick01
Brick2: wc-srv02.eulie.de:/var/lib/gluster/brick01
Brick3: wc-srv03.eulie.de:/var/lib/gluster/brick01
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.readdir-ahead: on
performance.parallel-readdir: on
performance.nl-cache: on
performance.nl-cache-timeout: 600
performance.nl-cache-positive-entry: on
performance.qr-cache-timeout: 600
performance.cache-size: 4096MB
performance.cache-max-file-size: 512KB
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
performance.io-cache: on
performance.io-thread-count: 16
server.allow-insecure: on
cluster.lookup-optimize: on
client.event-threads: 8
server.event-threads: 4
cluster.readdir-optimize: on
performance.write-behind-window-size: 32MB

and all bricks are online:

Status of volume: web-dir
Gluster process                                    TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick wc-srv01.eulie.de:/var/lib/gluster/brick01   49152     0          Y       2671
Brick wc-srv02.eulie.de:/var/lib/gluster/brick01   49152     0          Y       2614
Brick wc-srv03.eulie.de:/var/lib/gluster/brick01   49152     0          Y       3223
Self-heal Daemon on localhost                      N/A       N/A        Y       2679
Self-heal Daemon on wc-srv02.dc-dus.dalason.net    N/A       N/A        Y       41537
Self-heal Daemon on wc-srv03.dc-dus.dalason.net    N/A       N/A        Y       78473

Task Status of Volume web-dir
------------------------------------------------------------------------------
There are no active volume tasks

SELinux is set to permissive. The system is AlmaLinux 8 with current patches (as of today). The three servers wc-srv01, wc-srv02 and wc-srv03 are connected via 10Gbit, can see each other, and no connection issues arise. Network speed is close to 10Gbit, tested.

I mounted the volume on each server via itself:

wc-srv01 fstab:
wc-srv01.eulie.de:/web-dir /var/www glusterfs defaults,_netdev 0 0
wc-srv02 fstab:
wc-srv02.eulie.de:/web-dir /var/www glusterfs defaults,_netdev 0 0
wc-srv03 fstab:
wc-srv03.eulie.de:/web-dir /var/www glusterfs defaults,_netdev 0 0

Mounting works, and the size is correct across all servers:

# df -h /var/www/
Filesystem                  Size  Used Avail Use% Mounted on
wc-srv01.eulie.de:/web-dir  100G   31G   70G  31% /var/www

Here is the weird issue:

wc01: while sleep 1; do date > testfile; done

wc02: while sleep 1; do date; cat testfile; done
Wed 14 Sep 09:43:47 CEST 2022
Wed 14 Sep 09:43:45 CEST 2022
Wed 14 Sep 09:43:48 CEST 2022
Wed 14 Sep 09:43:45 CEST 2022
Wed 14 Sep 09:43:49 CEST 2022
Wed 14 Sep 09:43:45 CEST 2022
Wed 14 Sep 09:43:50 CEST 2022
Wed 14 Sep 09:43:45 CEST 2022
Wed 14 Sep 09:43:51 CEST 2022
Wed 14 Sep 09:43:45 CEST 2022
Wed 14 Sep 09:43:52 CEST 2022
Wed 14 Sep 09:43:45 CEST 2022

wc03: while sleep 1; do date; cat testfile; done
Wed 14 Sep 09:43:43 CEST 2022
Wed 14 Sep 09:41:12 CEST 2022
Wed 14 Sep 09:43:45 CEST 2022
Wed 14 Sep 09:41:12 CEST 2022
Wed 14 Sep 09:43:46 CEST 2022
Wed 14 Sep 09:41:12 CEST 2022
Wed 14 Sep 09:43:47 CEST 2022
Wed 14 Sep 09:41:12 CEST 2022
Wed 14 Sep 09:43:48 CEST 2022
Wed 14 Sep 09:41:12 CEST 2022
Wed 14 Sep 09:43:49 CEST 2022
Wed 14 Sep 09:41:12 CEST 2022

So the file exists, and on the initial write the timestamps are correct. From the second write onward, each of the three servers sees a different version of the file. Deleting the file is instant on all nodes, and editing the file in vim (doing :w) also instantly updates it everywhere.

# gluster volume heal web-dir info
Brick wc-srv01.eulie.de:/var/lib/gluster/brick01
Status: Connected
Number of entries: 0

Brick wc-srv02.eulie.de:/var/lib/gluster/brick01
Status: Connected
Number of entries: 0

Brick wc-srv03.eulie.de:/var/lib/gluster/brick01
Status: Connected
Number of entries: 0

What... Why... How? :-)

I need a synced three-way active-active-active cluster with consistent data across all nodes. Any pointers from you gurus?

-- 
with kind regards,
mit freundlichen Gruessen,

Christian Reiss
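The two-terminal loops above can also be scripted. Below is a minimal sketch (the helper name `check_consistency` and the directory paths are my own placeholders, not gluster tooling) that hashes the same file as seen through several directories and flags the first divergence:

```shell
#!/bin/sh
# Minimal consistency probe: hash the same file as seen through several
# mount points (or directly on the bricks) and report the first divergence.
# Usage: check_consistency FILE DIR1 DIR2 [DIR3 ...]
check_consistency() {
    file=$1; shift
    ref=""
    for dir in "$@"; do
        # Hash the file as visible in this directory.
        sum=$(md5sum "$dir/$file" | cut -d' ' -f1)
        if [ -z "$ref" ]; then
            ref=$sum
        elif [ "$sum" != "$ref" ]; then
            echo "DIVERGED at $dir/$file"
            return 1
        fi
    done
    echo "consistent"
}
```

Pointed at the fuse mounts it reproduces the behaviour above; pointed directly at the brick paths it would separate brick-level divergence from client-side caching.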