Martin Schenker
2011-Apr-29 15:09 UTC
[Gluster-users] Server outage, file sync/self-heal doesn't sync ALL files?!
Hi all!

We have another incident over here. One of the servers (pserver12) in a pair (12 & 13) has been rebooted. After the 2h outage, pserver13 showed 63 files not in sync. Both servers are clients as well.

Starting pserver12 brought up the self-heal mechanism, but only 39 files were triggered within the first 10 min. Now the system seems dormant and 24 files are left hanging. On the other three servers no inconsistencies are seen.

Tail of client log file:

[2011-04-29 14:48:23.820022] I [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 0-storage0-replicate-2: diff self-heal on /pserver13-17: 1960 blocks of 22736 were different (8.62%)
[2011-04-29 14:48:23.887651] E [afr-common.c:110:afr_set_split_brain] 0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:23.887740] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 0-storage0-replicate-2: background data self-heal completed on /pserver13-17
[2011-04-29 14:48:24.272220] I [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 0-storage0-replicate-2: diff self-heal on /pserver13-19: 1960 blocks of 22744 were different (8.62%)
[2011-04-29 14:48:24.341868] E [afr-common.c:110:afr_set_split_brain] 0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:24.341959] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 0-storage0-replicate-2: background data self-heal completed on /pserver13-19
[2011-04-29 14:48:24.758131] I [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 0-storage0-replicate-2: diff self-heal on /pserver13-23: 1952 blocks of 22752 were different (8.58%)
[2011-04-29 14:48:24.766054] E [afr-common.c:110:afr_set_split_brain] 0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:24.766137] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 0-storage0-replicate-2: background data self-heal completed on /pserver13-23
[2011-04-29 14:48:24.884613] I [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done] 0-storage0-replicate-2: diff self-heal on /pserver13-10: 1952 blocks of 22760 were different (8.58%)
[2011-04-29 14:48:24.895631] E [afr-common.c:110:afr_set_split_brain] 0-storage0-replicate-2: invalid argument: inode
[2011-04-29 14:48:24.895721] I [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk] 0-storage0-replicate-2: background data self-heal completed on /pserver13-10

0 root@pserver13:/var/log/glusterfs # date
Fri Apr 29 15:08:18 UTC 2011

Search for mismatch:

0 root@pserver13:~ # getfattr -R -d -e hex -m "trusted.afr." /mnt/gluster/brick?/storage | grep -v 0x000000000000000000000000 | grep -B1 -A1 trusted | grep -c file
getfattr: Removing leading '/' from absolute path names
*24*

0 root@pserver13:~ # getfattr -R -d -e hex -m "trusted.afr." /mnt/gluster/brick?/storage | grep -v 0x000000000000000000000000 | grep -B1 trusted
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-33
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-26
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/images/1959/cd55c5f3-9aa1-bfd9-99a0-01c13a7d8559/hdd-images
trusted.afr.storage0-client-4=0x000000000000001600000001
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-24
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-8
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-21
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-22
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-30
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-20
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-9
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-38
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-18
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-2
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-23
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-4
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-3
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-34
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-37
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-12
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-27
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/images/1831/9a039a81-60fe-5fa3-f562-8f6d3828382b/hdd-images/13169
trusted.afr.storage0-client-6=0x100000020000000000000000
--
# file: mnt/gluster/brick1/storage/images/1959/cd55c5f3-9aa1-bfd9-99a0-01c13a7d8559/hdd-images
trusted.afr.storage0-client-6=0x000000000000001600000002
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-25
trusted.afr.storage0-client-6=0x270000010000000000000000
--
# file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-7
trusted.afr.storage0-client-6=0x270000010000000000000000

I could trigger manually, but why isn't the sync/self-heal working on all files shown as inconsistent? Or am I assuming something wrongly here?!?

Best, Martin
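For anyone reading the hex values above, here is a minimal Python sketch that parses `getfattr -d -e hex` output and splits each `trusted.afr.*` value into its counters. It assumes the usual AFR changelog layout of three big-endian unsigned 32-bit integers (data, metadata, entry pending counts); that layout is an assumption on my side and should be checked against the GlusterFS version in use. The function names are hypothetical helpers, not part of any Gluster tooling.

```python
import struct


def parse_getfattr_dump(text):
    """Parse `getfattr -d -e hex` output into {path: {xattr_name: hex_value}}."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# file: "):
            current = line[len("# file: "):]
            result[current] = {}
        elif current is not None and "=" in line:
            name, value = line.split("=", 1)
            result[current][name] = value
    return result


def decode_afr_changelog(hex_value):
    """Split a 12-byte trusted.afr.* value into (data, metadata, entry).

    Assumption: three big-endian unsigned 32-bit counters.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    return struct.unpack(">III", raw)


def pending_files(text):
    """Return {path: {xattr_name: counters}} for every non-zero changelog."""
    pending = {}
    for path, attrs in parse_getfattr_dump(text).items():
        for name, value in attrs.items():
            if name.startswith("trusted.afr."):
                counters = decode_afr_changelog(value)
                if any(counters):
                    pending.setdefault(path, {})[name] = counters
    return pending


# Two entries copied from the output above.
SAMPLE = """\
# file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-33
trusted.afr.storage0-client-4=0x270000010000000000000000
--
# file: mnt/gluster/brick0/storage/images/1959/cd55c5f3-9aa1-bfd9-99a0-01c13a7d8559/hdd-images
trusted.afr.storage0-client-4=0x000000000000001600000001
"""

if __name__ == "__main__":
    for path, attrs in pending_files(SAMPLE).items():
        for name, (data, metadata, entry) in attrs.items():
            print(path, name, "data=%d metadata=%d entry=%d" % (data, metadata, entry))
```

Under that assumed layout, the `hdd-images` directory carries only metadata and entry counts, while the `de-dc1-c1-pserver5-*` files carry a non-zero data count.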
Pranith Kumar. Karampuri
2011-Apr-29 17:30 UTC
[Gluster-users] Server outage, file sync/self-heal doesn't sync ALL files?!
hi Martin,

Could you please send the output of -m "trusted*" instead of "trusted.afr" for the remaining 24 files from both the servers. I would like to see the gfids of these files on both the machines.

Pranith.
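Once the `trusted*` dumps from both servers are saved to text, the gfid comparison could be scripted along these lines. This is only a sketch: it assumes the gfid is stored in an xattr named `trusted.gfid`, that both servers use the same brick paths, and the helper names and the sample gfid values are hypothetical (made up for illustration, not taken from this thread).

```python
def parse_getfattr_dump(text):
    """Parse `getfattr -d -e hex` output into {path: {xattr_name: hex_value}}."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# file: "):
            current = line[len("# file: "):]
            result[current] = {}
        elif current is not None and "=" in line:
            name, value = line.split("=", 1)
            result[current][name] = value
    return result


def compare_gfids(dump_a, dump_b, xattr="trusted.gfid"):
    """Return {path: (gfid_a, gfid_b)} for paths whose gfids differ.

    A value of None means the path or the xattr is missing on that side.
    """
    a = parse_getfattr_dump(dump_a)
    b = parse_getfattr_dump(dump_b)
    mismatches = {}
    for path in sorted(set(a) | set(b)):
        gfid_a = a.get(path, {}).get(xattr)
        gfid_b = b.get(path, {}).get(xattr)
        if gfid_a != gfid_b:
            mismatches[path] = (gfid_a, gfid_b)
    return mismatches


# Hypothetical dumps: the gfid values below are invented for illustration.
DUMP_SERVER_A = """\
# file: mnt/gluster/brick0/storage/example-file-1
trusted.gfid=0x00112233445566778899aabbccddeeff
--
# file: mnt/gluster/brick0/storage/example-file-2
trusted.gfid=0x0123456789abcdef0123456789abcdef
"""

DUMP_SERVER_B = """\
# file: mnt/gluster/brick0/storage/example-file-1
trusted.gfid=0x00112233445566778899aabbccddeeff
--
# file: mnt/gluster/brick0/storage/example-file-2
trusted.gfid=0xfedcba9876543210fedcba9876543210
"""

if __name__ == "__main__":
    for path, (left, right) in compare_gfids(DUMP_SERVER_A, DUMP_SERVER_B).items():
        print(path, left, right)
```

Paths that appear in the result have different (or missing) gfids on the two servers and are the ones worth looking at first.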