Hi folks,

I'm running a simple gluster setup with a single volume replicated across two servers, as follows:

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
nfs.disable: on
transport.address-family: inet

This volume is used to store data in high-load production, and recently I faced two major problems that made the whole idea of using gluster quite questionable, so I would like to ask the gluster developers and/or call on community wisdom in the hope that I might be missing something. The problem is that when one of the replica servers hung, it caused the whole glusterfs volume to hang. Could you please drop me a hint: is this expected behaviour, or are there any tweaks or server/volume settings that could be changed to avoid this? Any help would be much appreciated.

--
Best Regards,

Seva Gluschenko
CTO @ http://webkontrol.ru
Hi,

With only two nodes, it's recommended to set cluster.server-quorum-type=server and cluster.server-quorum-ratio=51% (i.e. more than 50%). A sketch of the corresponding commands follows below the quoted message.

On Mon, Jul 31, 2017 at 4:12 AM, Seva Gluschenko <gvs at webkontrol.ru> wrote:

> Hi folks,
>
> I'm running a simple gluster setup with a single volume replicated across
> two servers, as follows:
>
> Volume Name: gv0
> Type: Replicate
> [...]
>
> The problem is that when one of the replica servers hung, it caused the
> whole glusterfs volume to hang. Could you please drop me a hint: is this
> expected behaviour, or are there any tweaks or server/volume settings
> that could be changed to avoid this? Any help would be much appreciated.
>
> --
> Best Regards,
>
> Seva Gluschenko
> CTO @ http://webkontrol.ru
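For reference, a rough sketch of the commands the above would involve, assuming the gv0 volume from the quoted post. Note that server-quorum-type is a per-volume option, while the quorum ratio is a cluster-wide setting applied to "all":

  # enforce server-side quorum for the volume: glusterd takes down the
  # bricks on a node that falls out of quorum rather than risk split-brain
  gluster volume set gv0 cluster.server-quorum-type server

  # require strictly more than half of the servers in the pool to be up
  gluster volume set all cluster.server-quorum-ratio 51%

Keep in mind that with only two servers this means the bricks go down whenever either node is unreachable, so it trades availability for consistency; the arbiter suggestion later in the thread avoids that trade-off.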
On 7/31/2017 1:12 AM, Seva Gluschenko wrote:

> Hi folks,
>
> I'm running a simple gluster setup with a single volume replicated across
> two servers, as follows:
>
> Volume Name: gv0
> Type: Replicate
> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
>
> The problem is that when one of the replica servers hung, it caused the
> whole glusterfs volume to hang.

Yes, you lost quorum and the system doesn't want you to get a split-brain.

> Could you please drop me a hint: is this expected behaviour, or are
> there any tweaks or server/volume settings that could be changed to
> avoid this? Any help would be much appreciated.

Add a third replica node (or just an arbiter node if you aren't that ambitious or want to save on the kit); a sketch of the commands is included at the end of this message.

That way, when you lose a node, the cluster will pause for 40 seconds or so while it figures things out and then continue on. When the missing node returns, the self-heal will kick in and you will be back to 100%.

Your other alternative is to turn off quorum, but that risks split-brain. Depending upon your data, that may or may not be a serious issue.

-wk
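For what it's worth, a rough sketch of how the arbiter route could look, assuming a hypothetical third host called sst1 with a brick path /var/glusterfs-arbiter (both are placeholders, not from the original post) and a gluster 3.x release recent enough to support adding an arbiter to an existing replica 2 volume:

  # from an existing node, add the new host to the trusted pool
  gluster peer probe sst1

  # convert the replica-2 volume to replica 3 with one arbiter brick
  gluster volume add-brick gv0 replica 3 arbiter 1 sst1:/var/glusterfs-arbiter

  # check heal status while the arbiter brick is populated with metadata
  gluster volume heal gv0 info

The quorum knob mentioned as the alternative would presumably be the client-side cluster.quorum-type option (e.g. gluster volume set gv0 cluster.quorum-type none), but as noted above, turning it off risks split-brain.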
Thank you very much indeed, I'll try and add an arbiter node.

--
Best Regards,

Seva Gluschenko
CTO @ http://webkontrol.ru
+7 916 172 6 170

August 1, 2017 12:29 AM, "WK" wrote:

> Add a third replica node (or just an arbiter node if you aren't that
> ambitious or want to save on the kit). That way, when you lose a node,
> the cluster will pause for 40 seconds or so while it figures things out
> and then continue on. When the missing node returns, the self-heal will
> kick in and you will be back to 100%.
> [...]