Dave Sherohman
2018-Feb-26 12:44 UTC
[Gluster-users] Quorum in distributed-replicate volume
On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > "In a replica 2 volume... If we set the client-quorum option to
> > auto, then the first brick must always be up, irrespective of the
> > status of the second brick. If only the second brick is up, the
> > subvolume becomes read-only."
> >
> By default client-quorum is "none" in replica 2 volume.

I'm not sure where I saw the directions saying to set it, but I do have
"cluster.quorum-type: auto" in my volume configuration.  (And I think
that's client quorum, but feel free to correct me if I've misunderstood
the docs.)

> It applies to all the replica 2 volumes even if it has just 2 brick or
> more.  Total brick count in the volume doesn't matter for the quorum,
> what matters is the number of bricks which are up in the particular
> replica subvol.

Thanks for confirming that.

> If I understood your configuration correctly it should look something
> like this:
> (Please correct me if I am wrong)
> replica-1: bricks 1 & 2
> replica-2: bricks 3 & 4
> replica-3: bricks 5 & 6

Yes, that's correct.

> Since quorum is per replica, if it is set to auto then it needs the
> first brick of the particular replica subvol to be up to perform the
> fop.
>
> In replica 2 volumes you can end up in split-brains.

How would that happen if bricks which are not in (cluster-wide) quorum
refuse to accept writes?  I'm not seeing the reason for using individual
subvolume quorums instead of full-volume quorum.

> It would be great if you can consider configuring an arbiter or
> replica 3 volume.

I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
bricks as arbiters with minimal effect on capacity.  What would be the
sequence of commands needed to:

1) Move all data off of bricks 1 & 2
2) Remove that replica from the cluster
3) Re-add those two bricks as arbiters

(And did I miss any additional steps?)

Unfortunately, I've been running a few months already with the current
configuration and there are several virtual machines running off the
existing volume, so I'll need to reconfigure it online if possible.

--
Dave Sherohman
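As an aside, the quorum options actually in effect can be read back
from the volume itself, which is an easy way to confirm whether client
quorum has been reconfigured.  A minimal sketch, assuming the gluster
CLI is available on one of the servers and using <volname> as a
placeholder for the real volume name:

# Show the client-quorum and server-quorum options currently in effect
$ gluster volume get <volname> cluster.quorum-type
$ gluster volume get <volname> cluster.server-quorum-type

# Or list every quorum-related option the volume currently carries
$ gluster volume get <volname> all | grep -i quorum

cluster.quorum-type is the client-quorum option being discussed here;
cluster.server-quorum-type controls the separate server-side (glusterd)
quorum mechanism.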
Karthik Subrahmanya
2018-Feb-27 06:30 UTC
[Gluster-users] Quorum in distributed-replicate volume
On Mon, Feb 26, 2018 at 6:14 PM, Dave Sherohman <dave at sherohman.org> wrote:
> On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > > "In a replica 2 volume... If we set the client-quorum option to
> > > auto, then the first brick must always be up, irrespective of the
> > > status of the second brick. If only the second brick is up, the
> > > subvolume becomes read-only."
> > >
> > By default client-quorum is "none" in replica 2 volume.
>
> I'm not sure where I saw the directions saying to set it, but I do have
> "cluster.quorum-type: auto" in my volume configuration.  (And I think
> that's client quorum, but feel free to correct me if I've misunderstood
> the docs.)

If it is "auto", then it has been reconfigured at some point; in a
replica 2 volume the default is "none".

> > It applies to all the replica 2 volumes even if it has just 2 brick or
> > more.  Total brick count in the volume doesn't matter for the quorum,
> > what matters is the number of bricks which are up in the particular
> > replica subvol.
>
> Thanks for confirming that.
>
> > If I understood your configuration correctly it should look something
> > like this:
> > (Please correct me if I am wrong)
> > replica-1: bricks 1 & 2
> > replica-2: bricks 3 & 4
> > replica-3: bricks 5 & 6
>
> Yes, that's correct.
>
> > Since quorum is per replica, if it is set to auto then it needs the
> > first brick of the particular replica subvol to be up to perform the
> > fop.
> >
> > In replica 2 volumes you can end up in split-brains.
>
> How would that happen if bricks which are not in (cluster-wide) quorum
> refuse to accept writes?  I'm not seeing the reason for using individual
> subvolume quorums instead of full-volume quorum.

Split-brains happen within a replica pair.  I will try to explain how
you can end up in split-brain even with cluster-wide quorum:

Let's say you have a 6-brick (replica 2) volume and at least the quorum
number of bricks is always up and running.
Bricks 1 & 2 are part of replica subvol-1
Bricks 3 & 4 are part of replica subvol-2
Bricks 5 & 6 are part of replica subvol-3

- Brick 1 goes down and a write comes in on a file which is part of
  replica subvol-1.
- Quorum is met, since 5 out of 6 bricks are running.
- Brick 2 records that brick 1 is bad.
- Brick 2 goes down and brick 1 comes up.  No heal has happened yet.
- A write comes in on the same file; quorum is still met, and now
  brick 1 records that brick 2 is bad.
- When bricks 1 & 2 are both up again, each blames the other:
  *split-brain*.

> > It would be great if you can consider configuring an arbiter or
> > replica 3 volume.
>
> I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
> bricks as arbiters with minimal effect on capacity.  What would be the
> sequence of commands needed to:
>
> 1) Move all data off of bricks 1 & 2
> 2) Remove that replica from the cluster
> 3) Re-add those two bricks as arbiters
>
> (And did I miss any additional steps?)
>
> Unfortunately, I've been running a few months already with the current
> configuration and there are several virtual machines running off the
> existing volume, so I'll need to reconfigure it online if possible.

Without knowing the volume configuration it is difficult to suggest the
configuration change, and since it is a live system you may end up with
data unavailability or data loss.  Can you give the output of
"gluster volume info <volname>" and tell us which brick is of what size?

Note: the arbiter bricks do not need to be as large as the data bricks.
[1] gives information about how to size the arbiter bricks.
[1] http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#arbiter-bricks-sizing

Regards,
Karthik

> --
> Dave Sherohman
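For completeness, whether any files have actually ended up in the state
described above can be checked with the heal commands.  A minimal
sketch, with <volname> as a placeholder and runnable from any node in
the trusted pool:

# List files with pending heals, per brick
$ gluster volume heal <volname> info

# List only the files that are currently in split-brain
$ gluster volume heal <volname> info split-brain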
Dave Sherohman
2018-Feb-27 08:10 UTC
[Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 12:00:29PM +0530, Karthik Subrahmanya wrote:
> I will try to explain how you can end up in split-brain even with
> cluster-wide quorum:

Yep, the explanation made sense.  I hadn't considered the possibility of
alternating outages.  Thanks!

> > > It would be great if you can consider configuring an arbiter or
> > > replica 3 volume.
> >
> > I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
> > bricks as arbiters with minimal effect on capacity.  What would be the
> > sequence of commands needed to:
> >
> > 1) Move all data off of bricks 1 & 2
> > 2) Remove that replica from the cluster
> > 3) Re-add those two bricks as arbiters
> >
> > (And did I miss any additional steps?)
> >
> > Unfortunately, I've been running a few months already with the current
> > configuration and there are several virtual machines running off the
> > existing volume, so I'll need to reconfigure it online if possible.
> >
> Without knowing the volume configuration it is difficult to suggest the
> configuration change, and since it is a live system you may end up with
> data unavailability or data loss.  Can you give the output of
> "gluster volume info <volname>" and tell us which brick is of what size?

Volume Name: palantir
Type: Distributed-Replicate
Volume ID: 48379a50-3210-41b4-9a77-ae143c8bcac0
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: saruman:/var/local/brick0/data
Brick2: gandalf:/var/local/brick0/data
Brick3: azathoth:/var/local/brick0/data
Brick4: yog-sothoth:/var/local/brick0/data
Brick5: cthulhu:/var/local/brick0/data
Brick6: mordiggian:/var/local/brick0/data
Options Reconfigured:
features.scrub: Inactive
features.bitrot: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
network.ping-timeout: 1013
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
cluster.data-self-heal-algorithm: full
storage.owner-uid: 64055
storage.owner-gid: 64055

For brick sizes, saruman/gandalf have

$ df -h /var/local/brick0
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/gandalf-gluster  885G   55G  786G   7% /var/local/brick0

and the other four have

$ df -h /var/local/brick0
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        11T  254G   11T   3% /var/local/brick0

--
Dave Sherohman
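To make the question concrete, here is a rough sketch of the kind of
command sequence being asked about, using the brick names from the
volume info above.  This is an illustration under assumptions, not a
recommendation from the thread: the arbiter brick paths
(/var/local/arbiter0/data) are invented for the example, the exact
syntax should be checked against the documentation for the installed
Gluster version, and a reconfiguration of a live volume like this
should be rehearsed on a test volume first.

# 1) Drain the saruman/gandalf replica pair; data is migrated to the
#    remaining subvolumes while the volume stays online.
$ gluster volume remove-brick palantir \
    saruman:/var/local/brick0/data gandalf:/var/local/brick0/data start

# Watch the migration until every node reports "completed".
$ gluster volume remove-brick palantir \
    saruman:/var/local/brick0/data gandalf:/var/local/brick0/data status

# 2) Once migration is complete, remove the pair from the volume.
$ gluster volume remove-brick palantir \
    saruman:/var/local/brick0/data gandalf:/var/local/brick0/data commit

# 3) Re-add the two freed hosts as arbiters, one brick per remaining
#    replica pair, converting the volume to replica 3 arbiter 1.
#    (Fresh brick paths are used here; reusing the old brick0 paths
#    would require clearing their gluster metadata first.)
$ gluster volume add-brick palantir replica 3 arbiter 1 \
    saruman:/var/local/arbiter0/data gandalf:/var/local/arbiter0/data

# The arbiter bricks are then populated by self-heal; progress can be
# watched with "gluster volume heal palantir info".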