Dave Sherohman
2018-Feb-26 12:44 UTC
[Gluster-users] Quorum in distributed-replicate volume
On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > "In a replica 2 volume... If we set the client-quorum option to
> > auto, then the first brick must always be up, irrespective of the
> > status of the second brick. If only the second brick is up, the
> > subvolume becomes read-only."
> >
> By default client-quorum is "none" in replica 2 volume.

I'm not sure where I saw the directions saying to set it, but I do have
"cluster.quorum-type: auto" in my volume configuration.  (And I think
that's client quorum, but feel free to correct me if I've misunderstood
the docs.)

> It applies to all the replica 2 volumes even if it has just 2 brick or
> more.  Total brick count in the volume doesn't matter for the quorum,
> what matters is the number of bricks which are up in the particular
> replica subvol.

Thanks for confirming that.

> If I understood your configuration correctly it should look something
> like this:
> (Please correct me if I am wrong)
> replica-1: bricks 1 & 2
> replica-2: bricks 3 & 4
> replica-3: bricks 5 & 6

Yes, that's correct.

> Since quorum is per replica, if it is set to auto then it needs the
> first brick of the particular replica subvol to be up to perform the
> fop.
>
> In replica 2 volumes you can end up in split-brains.

How would that happen if bricks which are not in (cluster-wide) quorum
refuse to accept writes?  I'm not seeing the reason for using individual
subvolume quorums instead of full-volume quorum.

> It would be great if you can consider configuring an arbiter or
> replica 3 volume.

I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
bricks as arbiters with minimal effect on capacity.  What would be the
sequence of commands needed to:

1) Move all data off of bricks 1 & 2
2) Remove that replica from the cluster
3) Re-add those two bricks as arbiters

(And did I miss any additional steps?)

Unfortunately, I've been running a few months already with the current
configuration and there are several virtual machines running off the
existing volume, so I'll need to reconfigure it online if possible.

--
Dave Sherohman
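As an aside, the quorum options actually in effect can be read back
from the volume itself, which is an easy way to confirm whether client
quorum has been reconfigured.  A minimal sketch, assuming the gluster
CLI is available on one of the servers and using <volname> as a
placeholder for the real volume name:

# Show the client-quorum and server-quorum options currently in effect
$ gluster volume get <volname> cluster.quorum-type
$ gluster volume get <volname> cluster.server-quorum-type

# Or list every quorum-related option the volume currently carries
$ gluster volume get <volname> all | grep -i quorum

cluster.quorum-type is the client-quorum option being discussed here;
cluster.server-quorum-type controls the separate server-side (glusterd)
quorum mechanism.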
Karthik Subrahmanya
2018-Feb-27 06:30 UTC
[Gluster-users] Quorum in distributed-replicate volume
On Mon, Feb 26, 2018 at 6:14 PM, Dave Sherohman <dave at sherohman.org> wrote:
> On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > > "In a replica 2 volume... If we set the client-quorum option to
> > > auto, then the first brick must always be up, irrespective of the
> > > status of the second brick. If only the second brick is up, the
> > > subvolume becomes read-only."
> > >
> > By default client-quorum is "none" in replica 2 volume.
>
> I'm not sure where I saw the directions saying to set it, but I do have
> "cluster.quorum-type: auto" in my volume configuration.  (And I think
> that's client quorum, but feel free to correct me if I've misunderstood
> the docs.)

If it is "auto", then it has been reconfigured at some point; in a
replica 2 volume the default is "none".

> > It applies to all the replica 2 volumes even if it has just 2 brick or
> > more.  Total brick count in the volume doesn't matter for the quorum,
> > what matters is the number of bricks which are up in the particular
> > replica subvol.
>
> Thanks for confirming that.
>
> > If I understood your configuration correctly it should look something
> > like this:
> > (Please correct me if I am wrong)
> > replica-1: bricks 1 & 2
> > replica-2: bricks 3 & 4
> > replica-3: bricks 5 & 6
>
> Yes, that's correct.
>
> > Since quorum is per replica, if it is set to auto then it needs the
> > first brick of the particular replica subvol to be up to perform the
> > fop.
> >
> > In replica 2 volumes you can end up in split-brains.
>
> How would that happen if bricks which are not in (cluster-wide) quorum
> refuse to accept writes?  I'm not seeing the reason for using individual
> subvolume quorums instead of full-volume quorum.

Split-brains happen within a replica pair.  I will try to explain how
you can end up in split-brain even with cluster-wide quorum:

Let's say you have a 6-brick (replica 2) volume and at least the quorum
number of bricks is always up and running.
Bricks 1 & 2 are part of replica subvol-1
Bricks 3 & 4 are part of replica subvol-2
Bricks 5 & 6 are part of replica subvol-3

- Brick 1 goes down and a write comes in on a file which is part of
  replica subvol-1.
- Quorum is met, since 5 out of 6 bricks are running.
- Brick 2 records that brick 1 is bad.
- Brick 2 goes down and brick 1 comes up.  No heal has happened yet.
- A write comes in on the same file; quorum is still met, and now
  brick 1 records that brick 2 is bad.
- When bricks 1 & 2 are both up again, each blames the other:
  *split-brain*.

> > It would be great if you can consider configuring an arbiter or
> > replica 3 volume.
>
> I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
> bricks as arbiters with minimal effect on capacity.  What would be the
> sequence of commands needed to:
>
> 1) Move all data off of bricks 1 & 2
> 2) Remove that replica from the cluster
> 3) Re-add those two bricks as arbiters
>
> (And did I miss any additional steps?)
>
> Unfortunately, I've been running a few months already with the current
> configuration and there are several virtual machines running off the
> existing volume, so I'll need to reconfigure it online if possible.

Without knowing the volume configuration it is difficult to suggest the
configuration change, and since it is a live system you may end up with
data unavailability or data loss.  Can you give the output of
"gluster volume info <volname>" and tell us which brick is of what size?

Note: the arbiter bricks do not need to be as large as the data bricks.
[1] gives information about how to size the arbiter bricks.
[1] http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#arbiter-bricks-sizing

Regards,
Karthik

> --
> Dave Sherohman
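For completeness, whether any files have actually ended up in the state
described above can be checked with the heal commands.  A minimal
sketch, with <volname> as a placeholder and runnable from any node in
the trusted pool:

# List files with pending heals, per brick
$ gluster volume heal <volname> info

# List only the files that are currently in split-brain
$ gluster volume heal <volname> info split-brain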
Dave Sherohman
2018-Feb-27 08:10 UTC
[Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 12:00:29PM +0530, Karthik Subrahmanya wrote:
> I will try to explain how you can end up in split-brain even with
> cluster-wide quorum:

Yep, the explanation made sense.  I hadn't considered the possibility of
alternating outages.  Thanks!

> > > It would be great if you can consider configuring an arbiter or
> > > replica 3 volume.
> >
> > I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
> > bricks as arbiters with minimal effect on capacity.  What would be the
> > sequence of commands needed to:
> >
> > 1) Move all data off of bricks 1 & 2
> > 2) Remove that replica from the cluster
> > 3) Re-add those two bricks as arbiters
> >
> > (And did I miss any additional steps?)
> >
> > Unfortunately, I've been running a few months already with the current
> > configuration and there are several virtual machines running off the
> > existing volume, so I'll need to reconfigure it online if possible.
> >
> Without knowing the volume configuration it is difficult to suggest the
> configuration change, and since it is a live system you may end up with
> data unavailability or data loss.  Can you give the output of
> "gluster volume info <volname>" and tell us which brick is of what size?

Volume Name: palantir
Type: Distributed-Replicate
Volume ID: 48379a50-3210-41b4-9a77-ae143c8bcac0
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: saruman:/var/local/brick0/data
Brick2: gandalf:/var/local/brick0/data
Brick3: azathoth:/var/local/brick0/data
Brick4: yog-sothoth:/var/local/brick0/data
Brick5: cthulhu:/var/local/brick0/data
Brick6: mordiggian:/var/local/brick0/data
Options Reconfigured:
features.scrub: Inactive
features.bitrot: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
network.ping-timeout: 1013
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
cluster.data-self-heal-algorithm: full
storage.owner-uid: 64055
storage.owner-gid: 64055

For brick sizes, saruman/gandalf have

$ df -h /var/local/brick0
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/gandalf-gluster  885G   55G  786G   7% /var/local/brick0

and the other four have

$ df -h /var/local/brick0
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        11T  254G   11T   3% /var/local/brick0

--
Dave Sherohman
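To make the question concrete, here is a rough sketch of the kind of
command sequence being asked about, using the brick names from the
volume info above.  This is an illustration under assumptions, not a
recommendation from the thread: the arbiter brick paths
(/var/local/arbiter0/data) are invented for the example, the exact
syntax should be checked against the documentation for the installed
Gluster version, and a reconfiguration of a live volume like this
should be rehearsed on a test volume first.

# 1) Drain the saruman/gandalf replica pair; data is migrated to the
#    remaining subvolumes while the volume stays online.
$ gluster volume remove-brick palantir \
    saruman:/var/local/brick0/data gandalf:/var/local/brick0/data start

# Watch the migration until every node reports "completed".
$ gluster volume remove-brick palantir \
    saruman:/var/local/brick0/data gandalf:/var/local/brick0/data status

# 2) Once migration is complete, remove the pair from the volume.
$ gluster volume remove-brick palantir \
    saruman:/var/local/brick0/data gandalf:/var/local/brick0/data commit

# 3) Re-add the two freed hosts as arbiters, one brick per remaining
#    replica pair, converting the volume to replica 3 arbiter 1.
#    (Fresh brick paths are used here; reusing the old brick0 paths
#    would require clearing their gluster metadata first.)
$ gluster volume add-brick palantir replica 3 arbiter 1 \
    saruman:/var/local/arbiter0/data gandalf:/var/local/arbiter0/data

# The arbiter bricks are then populated by self-heal; progress can be
# watched with "gluster volume heal palantir info".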