Jon Sime
2015-Aug-26 18:46 UTC
[Gluster-users] Missing/Duplicate files on Gluster 3.6.5 distributed-replicate volume
We have a v3.6.5 two-node cluster with a distributed-replicate volume (2x2 bricks, everything formatted with ext4 on CentOS 6.6) which regularly omits some files from directory listings on the client side, and also regularly duplicates the listing of some other files.

Summary of the issue and the steps we've tried so far:

- There is only one client system connected to this volume.

- That client populates files in this volume by copying them from a local filesystem into the gluster mount point via 'cp' within a single process (it is a single-threaded Python script that invokes call() to run cp via a subprocess shell; a simplified sketch of the copy loop appears below this list). We therefore believe we have ruled out any concurrency or race-condition problems: there is only one source of writes, and the files are copied sequentially.

- The two Gluster servers provide 7 volumes in total, but only one of the volumes has been observed with this behavior.

- There are no errors or warnings in the Gluster logs, on either client or server.

- We have tried clearing all the extended attributes on all the bricks, but that did not resolve the problem.

- We have deleted everything on the brick filesystems (including .glusterfs/), but copying the files over again (via the gluster mount point on the client) results in the same missing and duplicated files.

- We ran a rebalance/fix-layout on the volume, but that did not resolve the problem.

- Interestingly, the set of files which are missing from the directory listings is the same each time we delete everything and try again with an empty directory, and the set of files which are duplicated in the listing output is also the same each time.

- When all of the files have been copied over to the gluster volume, running an 'ls' from the client shows most, but not all, of the files. Examining the bricks directly shows that all of the files are present (and properly distributed and replicated). If an 'rm *' is then done from the client, all of the files which were visible are deleted, but the files which had not been visible on the client are now shown by 'ls', and some of them are shown twice in the output. Examining the bricks directly again shows that all of the files in the client's ls output are present, with no improper duplicates (only the correctly-replicated copies that should be there). Running another 'rm *' correctly deletes all of the files, both from the client's view and from the underlying bricks.

As requested in IRC, the following is the getfattr output for a file which was missing from the initial directory listing on the client, as well as the getfattr output for its parent directory (I've included the same directory from all four bricks, though in this distributed-replicate layout the file was only, and properly, located on the bricks in each gluster host's /export/zones1). As for an example of a file which appeared fine from the beginning, I'll need to follow up with that in a bit, once I can get the client I'm doing this for to repeat the test, pausing after the initial copy and before deleting the set of visible files. FWIW, these files were copied to an empty volume after a rebalance operation had been run.
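The copy loop mentioned in the list above is essentially the following (a simplified sketch, not the actual script: the source directory is made up and the error handling is trimmed, but the structure is the same, with one blocking cp per file):

    # Simplified sketch of the single-threaded copy loop (Python 2).
    # SRC is a made-up stand-in for the real local source directory.
    import os
    import pipes
    from subprocess import call

    SRC = '/data/staging'                                   # hypothetical
    DST = '/opt/edware/zones/landing/arrivals/xx/xx_user1'  # gluster mount

    for name in sorted(os.listdir(SRC)):
        src = os.path.join(SRC, name)
        if not os.path.isfile(src):
            continue
        # call() blocks until cp exits, so copies are strictly sequential;
        # pipes.quote() protects the spaces/parens in the filenames
        rc = call('cp %s %s/' % (pipes.quote(src), pipes.quote(DST)),
                  shell=True)
        if rc != 0:
            raise RuntimeError('cp failed for %s (exit %d)' % (src, rc))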
(Host gluster-001)

-bash-4.1# getfattr -d -e hex -m . /export/zones1/brick/landing/arrivals/xx/xx_user1/G03_Interim\ ELA\ PT\ Beetles\ \(IAB\)_2015-08-11.tar.gz.gpg
getfattr: Removing leading '/' from absolute path names
# file: export/zones1/brick/landing/arrivals/xx/xx_user1/G03_Interim ELA PT Beetles (IAB)_2015-08-11.tar.gz.gpg
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.zones-client-0=0x000000000000000000000000
trusted.afr.zones-client-1=0x000000000000000000000000
trusted.gfid=0x8823094f0ea14f049bbc4f98895f7192

-bash-4.1# getfattr -d -e hex -m . /export/zones1/brick/landing/arrivals/xx/xx_user1
getfattr: Removing leading '/' from absolute path names
# file: export/zones1/brick/landing/arrivals/xx/xx_user1
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.zones-client-0=0x000000000000000000000000
trusted.afr.zones-client-1=0x000000000000000000000000
trusted.gfid=0xdc7b9acea4084541a830935e48f4a2a1
trusted.glusterfs.dht=0x0000000100000000000000007fffd0ea

-bash-4.1# getfattr -d -e hex -m . /export/zones2/brick/landing/arrivals/xx/xx_user1
getfattr: Removing leading '/' from absolute path names
# file: export/zones2/brick/landing/arrivals/xx/xx_user1
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.zones-client-2=0x000000000000000000000000
trusted.afr.zones-client-3=0x000000000000000000000000
trusted.gfid=0xdc7b9acea4084541a830935e48f4a2a1
trusted.glusterfs.dht=0x00000001000000007fffd0ebffffffff

(Host gluster-002)

-bash-4.1# getfattr -d -e hex -m . /export/zones1/brick/landing/arrivals/xx/xx_user1/G03_Interim\ ELA\ PT\ Beetles\ \(IAB\)_2015-08-11.tar.gz.gpg
getfattr: Removing leading '/' from absolute path names
# file: export/zones1/brick/landing/arrivals/xx/xx_user1/G03_Interim ELA PT Beetles (IAB)_2015-08-11.tar.gz.gpg
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.zones-client-0=0x000000000000000000000000
trusted.afr.zones-client-1=0x000000000000000000000000
trusted.gfid=0x8823094f0ea14f049bbc4f98895f7192

-bash-4.1# getfattr -d -e hex -m . /export/zones1/brick/landing/arrivals/xx/xx_user1
getfattr: Removing leading '/' from absolute path names
# file: export/zones1/brick/landing/arrivals/xx/xx_user1
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.zones-client-0=0x000000000000000000000000
trusted.afr.zones-client-1=0x000000000000000000000000
trusted.gfid=0xdc7b9acea4084541a830935e48f4a2a1
trusted.glusterfs.dht=0x0000000100000000000000007fffd0ea

-bash-4.1# getfattr -d -e hex -m . /export/zones2/brick/landing/arrivals/xx/xx_user1
getfattr: Removing leading '/' from absolute path names
# file: export/zones2/brick/landing/arrivals/xx/xx_user1
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.zones-client-2=0x000000000000000000000000
trusted.afr.zones-client-3=0x000000000000000000000000
trusted.gfid=0xdc7b9acea4084541a830935e48f4a2a1
trusted.glusterfs.dht=0x00000001000000007fffd0ebffffffff
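In case it saves anyone a step, this is how I am reading those trusted.glusterfs.dht values (a rough sketch; I am assuming the value is four big-endian 32-bit words with the assigned hash range in the last two, and the field names are my guesses):

    # Rough decoder for the trusted.glusterfs.dht values above (Python 2).
    # Assumption: four big-endian 32-bit words, with the start/stop of the
    # subvolume's assigned hash range in the last two.
    import struct

    def decode_dht(hexval):
        raw = hexval[2:].decode('hex')  # drop the leading '0x'
        cnt, _, start, stop = struct.unpack('>4I', raw)
        return cnt, start, stop

    for label, val in [('zones1', '0x0000000100000000000000007fffd0ea'),
                       ('zones2', '0x00000001000000007fffd0ebffffffff')]:
        cnt, start, stop = decode_dht(val)
        print '%s: count=%d range=0x%08x-0x%08x' % (label, cnt, start, stop)

If I'm decoding those correctly, zones1 owns 0x00000000-0x7fffd0ea and zones2 owns 0x7fffd0eb-0xffffffff: no gap and no overlap, so the layout on this directory itself appears complete.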
Volume configuration server-side:

-bash-4.1# mount | grep zones
/dev/mapper/vg.zones1-lv.zones1 on /export/zones1 type ext4 (rw,noatime)
/dev/mapper/vg.zones2-lv.zones2 on /export/zones2 type ext4 (rw,noatime)

-bash-4.1# gluster volume info zones

Volume Name: zones
Type: Distributed-Replicate
Volume ID: 53ff45b1-8dc7-47ef-8a26-3245414e4990
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.1.1.122:/export/zones1/brick
Brick2: 10.1.1.121:/export/zones1/brick
Brick3: 10.1.1.122:/export/zones2/brick
Brick4: 10.1.1.121:/export/zones2/brick
Options Reconfigured:
client.ssl: off
server.ssl: off
performance.cache-size: 256MB
auth.ssl-allow: *

-bash-4.1# gluster volume status zones
Status of volume: zones
Gluster process                                 Port    Online  Pid
---------------------------------------------------------------------------
Brick 10.1.1.122:/export/zones1/brick           49165   Y       25189
Brick 10.1.1.121:/export/zones1/brick           49164   Y       697
Brick 10.1.1.122:/export/zones2/brick           49166   Y       25194
Brick 10.1.1.121:/export/zones2/brick           49161   Y       703
NFS Server on localhost                         2049    Y       25213
Self-heal Daemon on localhost                   N/A     Y       25222
NFS Server on 10.1.1.121                        2049    Y       719
Self-heal Daemon on 10.1.1.121                  N/A     Y       736

Task Status of Volume zones
---------------------------------------------------------------------------
Task       : Rebalance
ID         : 75f0b7ae-ed26-417b-a285-9ad81e40073c
Status     : completed

Mountpoint on client side:

-bash-4.1# mount | grep zones
10.1.1.122:/zones on /opt/edware/zones type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)