Saravanakumar Arumugam
2015-Dec-21 07:08 UTC
[Gluster-users] geo-replication 3.6.7 - no trusted.gfid on some slave nodes - stale file handle
Hi,
Replies inline..

Thanks,
Saravana

On 12/18/2015 10:02 PM, Dietmar Putz wrote:
> Hello again...
>
> after having some big trouble with an xfs issue in kernel 3.13.0-x and
> 3.19.0-39, which has been 'solved' by downgrading to 3.8.4
> (http://comments.gmane.org/gmane.comp.file-systems.xfs.general/71629),
> we decided to start a new geo-replication attempt from scratch...
> we have deleted the former geo-replication session and started a new
> one as described in:
> http://www.gluster.org/community/documentation/index.php/Upgrade_to_3.6
>
> both master and slave are distributed replicated volumes running on
> gluster 3.6.7 / ubuntu 14.04.
> the setup worked as described, but unfortunately geo-replication isn't
> syncing files and remains in the status shown below.
>
> in ~geo-replication-slaves/...gluster.log I find messages like the
> following on all slave nodes:
>
> [2015-12-16 15:06:46.837748] W [dht-layout.c:180:dht_layout_search]
> 0-aut-wien-01-dht: no subvolume for hash (value) = 1448787070
> [2015-12-16 15:06:46.837789] W [fuse-bridge.c:1261:fuse_err_cbk]
> 0-glusterfs-fuse: 74203: SETXATTR()
> /.gfid/d4815ee4-3348-4105-9136-d0219d956ed8 => -1 (No such file or directory)
> [2015-12-16 15:06:47.090212] I [dht-layout.c:663:dht_layout_normalize]
> 0-aut-wien-01-dht: Found anomalies in (null) (gfid =
> d4815ee4-3348-4105-9136-d0219d956ed8). Holes=1 overlaps=0
>
> [2015-12-16 20:25:55.327874] W [fuse-bridge.c:1967:fuse_create_cbk]
> 0-glusterfs-fuse: 199968: /.gfid/603de79d-8d41-44bd-845e-3727cf64a617
> => -1 (Operation not permitted)
> [2015-12-16 20:25:55.617016] W [fuse-bridge.c:1967:fuse_create_cbk]
> 0-glusterfs-fuse: 199971: /.gfid/8622fb7d-8909-42de-adb5-c67ed6f006c0
> => -1 (Operation not permitted)

Please check whether selinux is enabled on both master and slave. I
remember seeing such errors when selinux is enabled. (A quick way to
check this on each node is sketched at the end of this mail.)

> this is found only on gluster-wien-03-int, which is in 'Hybrid Crawl':
> [2015-12-16 17:17:07.219939] W [fuse-bridge.c:1261:fuse_err_cbk]
> 0-glusterfs-fuse: 123841: SETXATTR()
> /.gfid/00000000-0000-0000-0000-000000000001 => -1 (File exists)
> [2015-12-16 17:17:07.220658] W
> [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-aut-wien-01-client-3:
> remote operation failed: File exists. Path: /2301
> [2015-12-16 17:17:07.220702] W
> [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-aut-wien-01-client-2:
> remote operation failed: File exists. Path: /2301

Some errors like "file exists" can be ignored.

> But first of all I would like to have a look at this message, found
> about 6000 times on gluster-wien-05-int and ~07-int, which are in
> 'History Crawl':
> [2015-12-16 13:03:25.658359] W [fuse-bridge.c:483:fuse_entry_cbk]
> 0-glusterfs-fuse: 119569: LOOKUP()
> /.gfid/d4815ee4-3348-4105-9136-d0219d956ed8/.dstXXXfDyaP9 => -1 (Stale
> file handle)
>
> The gfid d4815ee4-3348-4105-9136-d0219d956ed8
> (1050="d4815ee4-3348-4105-9136-d0219d956ed8") belongs, as shown, to the
> folder 1050 in the brick directory.
>
> every brick in the master volume looks like this one:
> Host : gluster-ger-ber-12-int
> # file: gluster-export/1050
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.ger-ber-01-client-0=0x000000000000000000000000
> trusted.afr.ger-ber-01-client-1=0x000000000000000000000000
> trusted.afr.ger-ber-01-client-2=0x000000000000000000000000
> trusted.afr.ger-ber-01-client-3=0x000000000000000000000000
> trusted.gfid=0xd4815ee4334841059136d0219d956ed8
> trusted.glusterfs.6a071cfa-b150-4f0b-b1ed-96ab5d4bd671.1c31dc4d-7ee3-423b-8577-c7b0ce2e356a.stime=0x56606290000c7e4e
> trusted.glusterfs.6a071cfa-b150-4f0b-b1ed-96ab5d4bd671.xtime=0x567428e000042116
> trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
>
> on the slave volume only the bricks of wien-02 and wien-03 have the
> same trusted.gfid:
> Host : gluster-wien-03
> # file: gluster-export/1050
> trusted.afr.aut-wien-01-client-0=0x000000000000000000000000
> trusted.afr.aut-wien-01-client-1=0x000000000000000000000000
> trusted.gfid=0xd4815ee4334841059136d0219d956ed8
> trusted.glusterfs.6a071cfa-b150-4f0b-b1ed-96ab5d4bd671.xtime=0x5638bfb5000379c0
> trusted.glusterfs.dht=0x00000001000000000000000055555554
>
> none of the nodes in 'History Crawl' has this trusted.gfid assigned:
> Host : gluster-wien-05
> # file: gluster-export/1050
> trusted.afr.aut-wien-01-client-2=0x000000000000000000000000
> trusted.afr.aut-wien-01-client-3=0x000000000000000000000000
> trusted.glusterfs.6a071cfa-b150-4f0b-b1ed-96ab5d4bd671.xtime=0x5638bfb5000379c0
> trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
>
> I'm not sure if this is normal or if that trusted.gfid should have been
> assigned on all slave nodes by the slave-upgrade.sh script.

As per the doc, it applies the gfid on all slave nodes. (A quick way to
compare that xattr across the slave bricks is sketched at the end of
this mail.)

> bash slave-upgrade.sh localhost:<aut-wien-01> /tmp/master_gfid_file.txt
> $PWD/gsync-sync-gfid was run on wien-02, which has passwordless login
> to all other slave nodes.
> as I could see in the process list, slave-upgrade.sh was running on
> each slave node and starts, as far as I remember, with a 'rm -rf
> ~/.glusterfs/...'
> so the mentioned gfid should have been removed by slave-upgrade.sh, but
> should the trusted.gfid also be re-assigned by the script?
> ...I'm confused,
> is the 'Stale file handle' message caused by the missing trusted.gfid
> for /gluster-export/1050/ on the nodes where the message appears?
> does it make sense to stop geo-rep and start the slave-upgrade.sh
> script on the affected nodes, without having access to the other nodes,
> to fix this?
>
> currently I'm not sure if the 'stale file handle' messages prevent us
> from getting geo-replication running, but I guess the best way is to
> try to get it working step by step...
> any help is appreciated.
>
> best regards
> dietmar
>
>
> [ 14:45:42 ] - root at gluster-ger-ber-07
> /var/log/glusterfs/geo-replication/ger-ber-01
> $ gluster volume geo-replication ger-ber-01 gluster-wien-02::aut-wien-01 status detail
>
> MASTER NODE           MASTER VOL    MASTER BRICK       SLAVE                               STATUS     CHECKPOINT STATUS    CRAWL STATUS     FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> gluster-ger-ber-07    ger-ber-01    /gluster-export    gluster-wien-07-int::aut-wien-01    Active     N/A                  History Crawl    -6500          0                0                5                  6500
> gluster-ger-ber-12    ger-ber-01    /gluster-export    gluster-wien-06-int::aut-wien-01    Passive    N/A                  N/A              0              0                0                0                  0
> gluster-ger-ber-11    ger-ber-01    /gluster-export    gluster-wien-03-int::aut-wien-01    Active     N/A                  Hybrid Crawl     0              8191             0                0                  0
> gluster-ger-ber-09    ger-ber-01    /gluster-export    gluster-wien-05-int::aut-wien-01    Active     N/A                  History Crawl    -5792          0                0                0                  5793
> gluster-ger-ber-10    ger-ber-01    /gluster-export    gluster-wien-02-int::aut-wien-01    Passive    N/A                  N/A              0              0                0                0                  0
> gluster-ger-ber-08    ger-ber-01    /gluster-export    gluster-wien-04-int::aut-wien-01    Passive    N/A                  N/A              0              0                0                0                  0
>
> [ 14:45:46 ] - root at gluster-ger-ber-07
> /var/log/glusterfs/geo-replication/ger-ber-01 $
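For reference, a minimal way to check the selinux / apparmor state on each node could look like the following; this is only a sketch, assuming passwordless ssh between the nodes and using two hostnames from this thread as examples:

    # print SELinux and AppArmor state on one master and one slave node
    for h in gluster-ger-ber-07 gluster-wien-02-int; do
        echo "== $h =="
        ssh root@$h 'getenforce 2>/dev/null || echo "selinux tools not installed"'
        ssh root@$h 'cat /sys/module/apparmor/parameters/enabled 2>/dev/null || echo "apparmor module not loaded"'
    done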
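Similarly, to see at a glance which slave bricks carry the trusted.gfid for the directory discussed above, something along these lines could be used; again only a sketch, with the hostnames and the /gluster-export brick path taken from this thread:

    # compare the trusted.gfid xattr of directory 1050 on every slave brick
    for h in gluster-wien-02-int gluster-wien-03-int gluster-wien-04-int \
             gluster-wien-05-int gluster-wien-06-int gluster-wien-07-int; do
        echo "== $h =="
        ssh root@$h 'getfattr --absolute-names -n trusted.gfid -e hex /gluster-export/1050 2>&1'
    done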
Dietmar Putz
2015-Dec-22 10:47 UTC
[Gluster-users] geo-replication 3.6.7 - no trusted.gfid on some slave nodes - stale file handle
Hi Saravana,

thanks for your reply...
all gluster nodes are running ubuntu 14.04 with apparmor. Even though it is
running without any configuration, I have unloaded the module to rule out
any influence.

I have stopped and deleted geo-replication one more time and started
slave-upgrade.sh again, this time on gluster-wien-07; geo-replication is
currently not started again.
the result is the same as before, and more comprehensive than what I first
identified...

I have checked all directories in the root of each brick for a trusted.gfid
(567 dirs). only on subvolume aut-wien-01-replicate-0 does every directory
have a trusted.gfid assigned. on subvolumes ~replicate-1 and ~replicate-2,
186 resp. 206 of the 567 directories have a trusted.gfid assigned (a rough
sketch of this check is appended at the end of this mail).

for example the directory /gluster-export/1050, which already showed up in
the geo-replication logs... the screenlog of slave-upgrade.sh shows a
'failed' for setxattr on 1050, but this folder exists and contains data /
folders on each subvolume.

[ 09:50:43 ] - root at gluster-wien-07 /usr/share/glusterfs/scripts
$ grep 1050 screenlog.0 | head -3
setxattr on ./1050="d4815ee4-3348-4105-9136-d0219d956ed8" failed (No such file or directory)
setxattr on 1050/recordings="6056c887-99bc-4fcc-bf39-8ea2478bb780" failed (No such file or directory)
setxattr on 1050/recordings/REC_22_3619210_63112.mp4="63d127a3-a387-4cb6-bb4b-792dc422ebbf" failed (No such file or directory)
[ 09:50:53 ] - root at gluster-wien-07 /usr/share/glusterfs/scripts $

[ 10:11:01 ] - root at gluster-wien-07 /gluster-export
$ getfattr -m . -d -e hex 1050
# file: 1050
trusted.glusterfs.6a071cfa-b150-4f0b-b1ed-96ab5d4bd671.xtime=0x5638bfb5000379c0
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9

[ 10:11:10 ] - root at gluster-wien-07 /gluster-export
$ ls -li | grep 1050
17179869881 drwxr-xr-x 72 1009 admin 4096 Dec 2 21:34 1050

[ 10:11:21 ] - root at gluster-wien-07 /gluster-export
$ du -hs 1050
877G    1050
[ 10:11:29 ] - root at gluster-wien-07 /gluster-export $

as far as I understand, folder 1050 and many other folders should have a
unique trusted.gfid assigned, just like on all master nodes resp. on
subvolume aut-wien-01-replicate-0.

does it make sense to start geo-replication again, or does this issue need
to be fixed before starting another attempt...?
...and if yes, does anybody know how to fix the missing trusted.gfid?
just restarting slave-upgrade.sh did not help.

any help is appreciated.

best regards
dietmar


volume aut-wien-01-client-0
  remote-host gluster-wien-02-int
volume aut-wien-01-client-1
  remote-host gluster-wien-03-int
volume aut-wien-01-client-2
  remote-host gluster-wien-04-int
volume aut-wien-01-client-3
  remote-host gluster-wien-05-int
volume aut-wien-01-client-4
  remote-host gluster-wien-06-int
volume aut-wien-01-client-5
  remote-host gluster-wien-07-int
volume aut-wien-01-replicate-0
  subvolumes aut-wien-01-client-0 aut-wien-01-client-1
volume aut-wien-01-replicate-1
  subvolumes aut-wien-01-client-2 aut-wien-01-client-3
volume aut-wien-01-replicate-2
  subvolumes aut-wien-01-client-4 aut-wien-01-client-5
volume glustershd
  type debug/io-stats
  subvolumes aut-wien-01-replicate-0 aut-wien-01-replicate-1 aut-wien-01-replicate-2
end-volume
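The per-subvolume check mentioned above (which of the 567 top-level directories carry a trusted.gfid) could be scripted roughly like this on each brick; a sketch only, assuming the brick root is /gluster-export as shown in this thread:

    # count top-level brick directories without a trusted.gfid xattr
    cd /gluster-export
    total=0; missing=0
    for d in */; do
        total=$((total+1))
        # getfattr exits non-zero when the requested xattr is absent
        if ! getfattr --absolute-names -n trusted.gfid "$d" >/dev/null 2>&1; then
            missing=$((missing+1))
            echo "no trusted.gfid: $d"
        fi
    done
    echo "$missing of $total directories without trusted.gfid"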
On 21.12.2015 at 08:08, Saravanakumar Arumugam wrote:
> [...]