Ramesh Natarajan
2014-Sep-05 18:22 UTC
[Gluster-users] split-brain on glusterfs running with quorum on server and client
I have a replicated GlusterFS setup on 3 bricks (replica = 3). I have client and server quorum turned on. I rebooted one of the 3 bricks. When it came back up, the client started throwing error messages that one of the files went into split-brain.

When I check the file sizes and sha1sums on the bricks, 2 of the 3 bricks have the same value. So by quorum logic the first brick should have healed from that copy, but I don't see that happening. Can someone please tell me if this is expected behavior, or if I have things misconfigured?

thanks
Ramesh

My config is as below.

[root at ip-172-31-12-218 ~]# gluster volume info

Volume Name: PL1
Type: Replicate
Volume ID: a7aabae0-c6bc-40a9-8b26-0498d488ee39
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol1/gluster-data
Brick2: 172.31.16.220:/data/vol1/gluster-data
Brick3: 172.31.12.218:/data/vol1/gluster-data
Options Reconfigured:
performance.cache-size: 2147483648
nfs.addr-namelookup: off
network.ping-timeout: 12
cluster.server-quorum-type: server
nfs.enable-ino32: on
cluster.quorum-type: auto
cluster.server-quorum-ratio: 51%

Volume Name: PL2
Type: Replicate
Volume ID: fadb3671-7a92-40b7-bccd-fbacf672f6dc
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol2/gluster-data
Brick2: 172.31.16.220:/data/vol2/gluster-data
Brick3: 172.31.12.218:/data/vol2/gluster-data
Options Reconfigured:
performance.cache-size: 2147483648
nfs.addr-namelookup: off
network.ping-timeout: 12
cluster.server-quorum-type: server
nfs.enable-ino32: on
cluster.quorum-type: auto
cluster.server-quorum-ratio: 51%
[root at ip-172-31-12-218 ~]#

I have 2 clients, each mounting one of the volumes. At no time is the same volume mounted by more than 1 client.

mount -t glusterfs -o defaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256 172.31.16.220:/PL2 /mnt/vm

I restarted brick 1 (172.31.38.189), and when it came up, one of the files on the PL2 volume went into split-brain:

[2014-09-05 17:59:42.997308] W [afr-open.c:209:afr_open] 0-PL2-replicate-0: failed to open as split brain seen, returning EIO
[2014-09-05 17:59:42.997350] W [fuse-bridge.c:2209:fuse_writev_cbk] 0-glusterfs-fuse: 3359683: WRITE => -1 (Input/output error)
[2014-09-05 17:59:42.997476] W [fuse-bridge.c:690:fuse_truncate_cbk] 0-glusterfs-fuse: 3359684: FTRUNCATE() ERR => -1 (Input/output error)
[2014-09-05 17:59:42.997647] W [fuse-bridge.c:2209:fuse_writev_cbk] 0-glusterfs-fuse: 3359686: WRITE => -1 (Input/output error)
[2014-09-05 17:59:42.997783] W [fuse-bridge.c:1214:fuse_err_cbk] 0-glusterfs-fuse: 3359687: FLUSH() ERR => -1 (Input/output error)
[2014-09-05 17:59:44.009187] E [afr-self-heal-common.c:233:afr_sh_print_split_brain_log] 0-PL2-replicate-0: Unable to self-heal contents of '/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00' (possible split-brain). Please delete the file from all but the preferred subvolume.- Pending matrix: [ [ 0 1 1 ] [ 3398 0 0 ] [ 3398 0 0 ] ]
[2014-09-05 17:59:44.011116] E [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-PL2-replicate-0: backgroung data self heal failed, on /apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
[2014-09-05 17:59:44.011480] W [afr-open.c:209:afr_open] 0-PL2-replicate-0: failed to open as split brain seen, returning EIO

Starting time of crawl: Fri Sep 5 17:55:32 2014
Ending time of crawl: Fri Sep 5 17:55:33 2014
Type of crawl: INDEX
No. of entries healed: 4
No. of entries in split-brain: 1
No. of heal failed entries: 0

[root at ip-172-31-16-220 ~]# gluster volume heal PL2 info
Brick ip-172-31-38-189:/data/vol2/gluster-data/
/apache_cp_mm1/logs/mm1.access_log.2014-09-05-17_00_00
Number of entries: 1

Brick ip-172-31-16-220:/data/vol2/gluster-data/
/apache_cp_mm1/logs/mm1.access_log.2014-09-05-17_00_00
Number of entries: 1

Brick ip-172-31-12-218:/data/vol2/gluster-data/
/apache_cp_mm1/logs/mm1.access_log.2014-09-05-17_00_00
Number of entries: 1

BRICK 1
=======
[root at ip-172-31-38-189 ~]# sha1sum access_log.2014-09-05-17_00_00
aa72d0f3949700f67b61d3c58fdbc75b772d607b  access_log.2014-09-05-17_00_00

[root at ip-172-31-38-189 ~]# ls -al
total 12760
dr-xr-x---  3 root root     4096 Sep  5 17:42 .
dr-xr-xr-x 24 root root     4096 Sep  5 17:34 ..
-rw-r-----  1 root root 13019808 Sep  5 17:42 access_log.2014-09-05-17_00_00

[root at ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x000000000000000000000000
trusted.afr.PL2-client-1=0x000000010000000000000000
trusted.afr.PL2-client-2=0x000000010000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

BRICK 2
=======
[root at ip-172-31-16-220 ~]# sha1sum access_log.2014-09-05-17_00_00
0f7b72f77a792b5c2b68456c906cf7b93287f0d6  access_log.2014-09-05-17_00_00

[root at ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

BRICK 3
=======
[root at ip-172-31-12-218 ~]# sha1sum access_log.2014-09-05-17_00_00
0f7b72f77a792b5c2b68456c906cf7b93287f0d6  access_log.2014-09-05-17_00_00

[root at ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
getfattr: Removing leading '/' from absolute path names
# file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
trusted.afr.PL2-client-0=0x00000d460000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
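Each trusted.afr value above packs three big-endian 32-bit counters -- the data, metadata and entry changelogs, in that order -- so, assuming that standard layout, the 0x00000d46 that bricks 2 and 3 hold against brick 1 decodes to 3398 pending data operations. A quick shell check:

    # decode trusted.afr.PL2-client-0 as seen on bricks 2 and 3
    v=00000d460000000000000000
    printf 'data=%d metadata=%d entry=%d\n' "0x${v:0:8}" "0x${v:8:8}" "0x${v:16:8}"
    # -> data=3398 metadata=0 entry=0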
Jeff Darcy
2014-Sep-05 23:23 UTC
[Gluster-users] split-brain on glusterfs running with quorum on server and client
> I have a replicate glusterfs setup on 3 Bricks ( replicate = 3 ). I have
> client and server quorum turned on. I rebooted one of the 3 bricks. When it
> came back up, the client started throwing error messages that one of the
> files went into split brain.

This is a good example of how split brain can happen even with all kinds of quorum enabled. Let's look at those xattrs. BTW, thank you for a very nicely detailed bug report which includes those.

> BRICK 1
> =======
> [root at ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x000000000000000000000000
> trusted.afr.PL2-client-1=0x000000010000000000000000
> trusted.afr.PL2-client-2=0x000000010000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> BRICK 2
> =======
> [root at ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
>
> BRICK 3
> =======
> [root at ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> getfattr: Removing leading '/' from absolute path names
> # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
> trusted.afr.PL2-client-0=0x00000d460000000000000000
> trusted.afr.PL2-client-1=0x000000000000000000000000
> trusted.afr.PL2-client-2=0x000000000000000000000000
> trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

Here, we see that brick 1 shows a single pending operation for the other two, while they show 0xd46 (3398) pending operations for brick 1. Here's how this can happen.

(1) There is exactly one pending operation.
(2) Brick1 completes the write first, and says so.
(3) Client sends messages to all three, saying to decrement brick1's count.
(4) All three bricks receive and process that message.
(5) Brick1 fails.
(6) Brick2 and brick3 complete the write, and say so.
(7) Client tells all bricks to decrement remaining counts.
(8) Brick2 and brick3 receive and process that message.
(9) Brick1 is dead, so its counts for brick2/3 stay at one.
(10) Brick2 and brick3 have quorum, with all-zero pending counters.
(11) Client sends 0xd46 more writes to brick2 and brick3.

Note that at no point did we lose quorum. Note also the tight timing required. If brick1 had failed an instant earlier, it would not have decremented its own counter. If it had failed an instant later, it would have decremented brick2's and brick3's as well. If brick1 had not finished first, we'd be in yet another scenario. If delayed changelog had been operative, the messages at (3) and (7) would have been combined to leave us in yet another scenario. As far as I can tell, we would have been able to resolve the conflict in all those cases.

*** Key point: quorum enforcement does not totally eliminate split brain. It only makes the frequency a few orders of magnitude lower. ***
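To see how those steps produce exactly the pending matrix from the log, here is a toy walk-through in plain shell (not anything GlusterFS actually runs; each array is just one brick's pending counters for [brick1 brick2 brick3]):

    # After the pre-op for the one in-flight write, every brick records one
    # pending operation for every replica:
    b1=(1 1 1); b2=(1 1 1); b3=(1 1 1)

    # (2)-(4) brick1 finishes first; all three bricks decrement brick1's slot
    b1[0]=$((b1[0]-1)); b2[0]=$((b2[0]-1)); b3[0]=$((b3[0]-1))

    # (5) brick1 dies -- its counters are now frozen at (0 1 1)

    # (6)-(9) bricks 2 and 3 finish and decrement the remaining slots;
    # brick1 never sees that message
    b2[1]=$((b2[1]-1)); b2[2]=$((b2[2]-1))
    b3[1]=$((b3[1]-1)); b3[2]=$((b3[2]-1))

    # (11) 0xd46 = 3398 further writes land only on bricks 2 and 3, each one
    # leaving brick1's slot incremented because brick1 cannot acknowledge
    b2[0]=$((b2[0]+3398)); b3[0]=$((b3[0]+3398))

    echo "brick1: ${b1[*]}"   # 0 1 1
    echo "brick2: ${b2[*]}"   # 3398 0 0
    echo "brick3: ${b3[*]}"   # 3398 0 0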
So, is there any way to prevent this completely? Some AFR enhancements, such as the oft-promised "outcast" feature[1], might have helped. NSR[2] is immune to this particular problem. "Policy based split brain resolution"[3] might have resolved it automatically instead of merely flagging it. Unfortunately, those are all in the future. For now, I'd say the best approach is to resolve the conflict manually and try to move on. Unless there's more going on than meets the eye, recurrence should be very unlikely.

[1] http://www.gluster.org/community/documentation/index.php/Features/outcast
[2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
[3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr
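For reference, one common way to do that manual resolution, assuming brick 1 (172.31.38.189) holds the copy to discard since bricks 2 and 3 agree; back the file up first if in any doubt:

    # Run on brick 1 (172.31.38.189) only; paths and gfid taken from the report above.
    BRICK=/data/vol2/gluster-data
    FILE=apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
    GFID=ea950263-977e-46bf-89a0-ef631ca139c2        # from trusted.gfid

    cp "$BRICK/$FILE" /root/access_log.brick1.bad    # optional safety copy
    rm "$BRICK/$FILE"                                # remove the stale copy
    rm "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"   # and its gfid hard link

    # Then, from any node, let self-heal copy the good version back:
    gluster volume heal PL2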
Pranith Kumar Karampuri
2014-Sep-06 12:29 UTC
[Gluster-users] split-brain on glusterfs running with quorum on server and client
What is the glusterfs version where you ran into this issue?

Pranith

On 09/05/2014 11:52 PM, Ramesh Natarajan wrote:
> I have a replicate glusterfs setup on 3 Bricks ( replicate = 3 ). I have
> client and server quorum turned on. I rebooted one of the 3 bricks. When it
> came back up, the client started throwing error messages that one of the
> files went into split brain.
> [...]
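For reference, the installed version can be read on any node with:

    gluster --version       # management CLI / glusterd package version
    glusterfs --version     # client / brick daemon version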