Ramesh Natarajan
2014-Sep-05 18:22 UTC
[Gluster-users] split-brain on glusterfs running with quorum on server and client
I have a replicated GlusterFS setup on 3 bricks (replica = 3). I have client and server quorum turned on. I rebooted one of the 3 bricks. When it came back up, the client started throwing error messages that one of the files went into split-brain.

When I check the file sizes and sha1sums on the bricks, 2 of the 3 bricks have the same value. So by quorum logic the first brick should have healed from that copy, but I don't see that happening. Can someone please tell me if this is expected behavior, or if I have things misconfigured?

thanks
Ramesh

My config is as below.

[root at ip-172-31-12-218 ~]# gluster volume info

Volume Name: PL1
Type: Replicate
Volume ID: a7aabae0-c6bc-40a9-8b26-0498d488ee39
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol1/gluster-data
Brick2: 172.31.16.220:/data/vol1/gluster-data
Brick3: 172.31.12.218:/data/vol1/gluster-data
Options Reconfigured:
performance.cache-size: 2147483648
nfs.addr-namelookup: off
network.ping-timeout: 12
cluster.server-quorum-type: server
nfs.enable-ino32: on
cluster.quorum-type: auto
cluster.server-quorum-ratio: 51%

Volume Name: PL2
Type: Replicate
Volume ID: fadb3671-7a92-40b7-bccd-fbacf672f6dc
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol2/gluster-data
Brick2: 172.31.16.220:/data/vol2/gluster-data
Brick3: 172.31.12.218:/data/vol2/gluster-data
Options Reconfigured:
performance.cache-size: 2147483648
nfs.addr-namelookup: off
network.ping-timeout: 12
cluster.server-quorum-type: server
nfs.enable-ino32: on
cluster.quorum-type: auto
cluster.server-quorum-ratio: 51%
[root at ip-172-31-12-218 ~]#

I have 2 clients, each mounting one of the volumes. At no time is the same volume mounted by more than 1 client.

mount -t glusterfs -o defaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256 172.31.16.220:/PL2 /mnt/vm

I restarted brick 1 (172.31.38.189), and when it came up, one of the files on the PL2 volume went into split-brain:

[2014-09-05 17:59:42.997308] W [afr-open.c:209:afr_open] 0-PL2-replicate-0: failed to open as split brain seen, returning EIO
[2014-09-05 17:59:42.997350] W [fuse-bridge.c:2209:fuse_writev_cbk] 0-glusterfs-fuse: 3359683: WRITE => -1 (Input/output error)
[2014-09-05 17:59:42.997476] W [fuse-bridge.c:690:fuse_truncate_cbk] 0-glusterfs-fuse: 3359684: FTRUNCATE() ERR => -1 (Input/output error)
[2014-09-05 17:59:42.997647] W [fuse-bridge.c:2209:fuse_writev_cbk] 0-glusterfs-fuse: 3359686: WRITE => -1 (Input/output error)
[2014-09-05 17:59:42.997783] W [fuse-bridge.c:1214:fuse_err_cbk] 0-glusterfs-fuse: 3359687: FLUSH() ERR => -1 (Input/output error)
[2014-09-05 17:59:44.009187] E [afr-self-heal-common.c:233:afr_sh_print_split_brain_log] 0-PL2-replicate-0: Unable to self-heal contents of '/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00' (possible split-brain). Please delete the file from all but the preferred subvolume.- Pending matrix: [ [ 0 1 1 ] [ 3398 0 0 ] [ 3398 0 0 ] ]
[2014-09-05 17:59:44.011116] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-PL2-replicate-0: backgroung data self heal failed, on /apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
[2014-09-05 17:59:44.011480] W [afr-open.c:209:afr_open] 0-PL2-replicate-0: failed to open as split brain seen, returning EIO

Starting time of crawl: Fri Sep 5 17:55:32 2014
Ending time of crawl: Fri Sep 5 17:55:33 2014
Type of crawl: INDEX
No. of entries healed: 4
No. of entries in split-brain: 1
No. of heal failed entries: 0

[root at ip-172-31-16-220 ~]# gluster volume heal PL2 info
Brick ip-172-31-38-189:/data/vol2/gluster-data/
/apache_cp_mm1/logs/mm1.access_log.2014-09-05-17_00_00
Number of entries: 1

Brick ip-172-31-16-220:/data/vol2/gluster-data/
/apache_cp_mm1/logs/mm1.access_log.2014-09-05-17_00_00
Number of entries: 1

Brick ip-172-31-12-218:/data/vol2/gluster-data/
/apache_cp_mm1/logs/mm1.access_log.2014-09-05-17_00_00
Number of entries: 1

BRICK 1
=======
[root at ip-172-31-38-189 ~]# sha1sum access_log.2014-09-05-17_00_00
aa72d0f3949700f67b61d3c58fdbc75b772d607b  access_log.2014-09-05-17_00_00

[root at ip-172-31-38-189 ~]# ls -al
total 12760
dr-xr-x---  3 root root     4096 Sep  5 17:42 .
dr-xr-xr-x 24 root root     4096 Sep  5 17:34 ..
-rw-r-----  1 root root 13019808 Sep  5 17:42 access_log.2014-09-05-17_00_00

[root at ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x000000000000000000000000
trusted.afr.PL2-client-1=0x000000010000000000000000
trusted.afr.PL2-client-2=0x000000010000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

BRICK 2
=======
[root at ip-172-31-16-220 ~]# sha1sum access_log.2014-09-05-17_00_00
0f7b72f77a792b5c2b68456c906cf7b93287f0d6  access_log.2014-09-05-17_00_00

[root at ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

BRICK 3
=======
[root at ip-172-31-12-218 ~]# sha1sum access_log.2014-09-05-17_00_00
0f7b72f77a792b5c2b68456c906cf7b93287f0d6  access_log.2014-09-05-17_00_00

[root at ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
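Each trusted.afr value above packs three big-endian 32-bit counters -- the data, metadata and entry changelogs, in that order -- so, assuming that standard layout, the 0x00000d46 that bricks 2 and 3 hold against brick 1 decodes to 3398 pending data operations. A quick shell check:

    # decode trusted.afr.PL2-client-0 as seen on bricks 2 and 3
    v=00000d460000000000000000
    printf 'data=%d metadata=%d entry=%d\n' "0x${v:0:8}" "0x${v:8:8}" "0x${v:16:8}"
    # -> data=3398 metadata=0 entry=0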
Jeff Darcy
2014-Sep-05 23:23 UTC
[Gluster-users] split-brain on glusterfs running with quorum on server and client
> I have a replicate glusterfs setup on 3 Bricks ( replicate = 3 ). I have
> client and server quorum turned on. I rebooted one of the 3 bricks. When it
> came back up, the client started throwing error messages that one of the
> files went into split brain.

This is a good example of how split brain can happen even with all kinds of quorum enabled. Let's look at those xattrs. BTW, thank you for a very nicely detailed bug report which includes those.

> BRICK 1
> =======
> [root at ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x000000000000000000000000
> trusted.afr.PL2-client-1=0x000000010000000000000000
> trusted.afr.PL2-client-2=0x000000010000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> BRICK 2
> =======
> [root at ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> BRICK 3
> =======
> [root at ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

Here, we see that brick 1 shows a single pending operation for the other two, while they show 0xd46 (3398) pending operations for brick 1. Here's how this can happen.

(1) There is exactly one pending operation.
(2) Brick1 completes the write first, and says so.
(3) Client sends messages to all three, saying to decrement brick1's count.
(4) All three bricks receive and process that message.
(5) Brick1 fails.
(6) Brick2 and brick3 complete the write, and say so.
(7) Client tells all bricks to decrement remaining counts.
(8) Brick2 and brick3 receive and process that message.
(9) Brick1 is dead, so its counts for brick2/3 stay at one.
(10) Brick2 and brick3 have quorum, with all-zero pending counters.
(11) Client sends 0xd46 more writes to brick2 and brick3.

Note that at no point did we lose quorum. Note also the tight timing required. If brick1 had failed an instant earlier, it would not have decremented its own counter. If it had failed an instant later, it would have decremented brick2's and brick3's as well. If brick1 had not finished first, we'd be in yet another scenario. If delayed changelog had been operative, the messages at (3) and (7) would have been combined to leave us in yet another scenario. As far as I can tell, we would have been able to resolve the conflict in all those cases.

*** Key point: quorum enforcement does not totally eliminate split brain. It only makes the frequency a few orders of magnitude lower. ***
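To see how those steps produce exactly the pending matrix from the log, here is a toy walk-through in plain shell (not anything GlusterFS actually runs; each array is just one brick's pending counters for [brick1 brick2 brick3]):

    # After the pre-op for the one in-flight write, every brick records one
    # pending operation for every replica:
    b1=(1 1 1); b2=(1 1 1); b3=(1 1 1)

    # (2)-(4) brick1 finishes first; all three bricks decrement brick1's slot
    b1[0]=$((b1[0]-1)); b2[0]=$((b2[0]-1)); b3[0]=$((b3[0]-1))

    # (5) brick1 dies -- its counters are now frozen at (0 1 1)

    # (6)-(9) bricks 2 and 3 finish and decrement the remaining slots;
    # brick1 never sees that message
    b2[1]=$((b2[1]-1)); b2[2]=$((b2[2]-1))
    b3[1]=$((b3[1]-1)); b3[2]=$((b3[2]-1))

    # (11) 0xd46 = 3398 further writes land only on bricks 2 and 3, each one
    # leaving brick1's slot incremented because brick1 cannot acknowledge
    b2[0]=$((b2[0]+3398)); b3[0]=$((b3[0]+3398))

    echo "brick1: ${b1[*]}"   # 0 1 1
    echo "brick2: ${b2[*]}"   # 3398 0 0
    echo "brick3: ${b3[*]}"   # 3398 0 0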
So, is there any way to prevent this completely? Some AFR enhancements, such as the oft-promised "outcast" feature[1], might have helped. NSR[2] is immune to this particular problem. "Policy based split brain resolution"[3] might have resolved it automatically instead of merely flagging it. Unfortunately, those are all in the future. For now, I'd say the best approach is to resolve the conflict manually and try to move on. Unless there's more going on than meets the eye, recurrence should be very unlikely.

[1] http://www.gluster.org/community/documentation/index.php/Features/outcast
[2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
[3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr
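For reference, one common way to do that manual resolution, assuming brick 1 (172.31.38.189) holds the copy to discard since bricks 2 and 3 agree; back the file up first if in any doubt:

    # Run on brick 1 (172.31.38.189) only; paths and gfid taken from the report above.
    BRICK=/data/vol2/gluster-data
    FILE=apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
    GFID=ea950263-977e-46bf-89a0-ef631ca139c2        # from trusted.gfid

    cp "$BRICK/$FILE" /root/access_log.brick1.bad    # optional safety copy
    rm "$BRICK/$FILE"                                # remove the stale copy
    rm "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"   # and its gfid hard link

    # Then, from any node, let self-heal copy the good version back:
    gluster volume heal PL2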
Pranith Kumar Karampuri
2014-Sep-06 12:29 UTC
[Gluster-users] split-brain on glusterfs running with quorum on server and client
What is the glusterfs version where you ran into this issue?

Pranith

On 09/05/2014 11:52 PM, Ramesh Natarajan wrote:
> I have a replicate glusterfs setup on 3 Bricks ( replicate = 3 ). I have
> client and server quorum turned on. I rebooted one of the 3 bricks. When it
> came back up, the client started throwing error messages that one of the
> files went into split brain.
> [...]
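For reference, the installed version can be read on any node with:

    gluster --version       # management CLI / glusterd package version
    glusterfs --version     # client / brick daemon version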