thr3ads.net - Gluster users - [Gluster-users] Replication logic [Dec 2020]

Strahil Nikolov

2020-Dec-27 11:31 UTC

[Gluster-users] Replication logic

>But if I do that, the metadata that are already on the brick will be
>lost. What I was asking, is whether there is a way to "upgrade" the
>arbiter to a full replica without losing the metadata in the meanwhile.
You have a 'replica 3 arbiter 1' volume. When you want to replace the
arbiter you will need to do it in several steps:
1) Use remove-brick to get rid of the arbiter, like this:
gluster volume remove-brick VOLUME replica 2 arbiter:/path/to/brick force

The command will reduce the volume from 'replica 3 arbiter 1' to 'replica 2'.
You still have the 2 data bricks left and running.

2) The easiest way to reuse the brick is to unmount it, wipe the filesystem and
recreate it:
umount /dev/VG/arbiter-brick
mkfs.xfs -f -i size=512 /dev/VG/arbiter-brick
mount /dev/VG/arbiter-brick
mkdir </path/to/lv/mountpoint>/brick

3) Add the recreated brick
gluster volume add-brick VOLUME replica 3 arbiter:/path/to/lv/mountpoint/brick

4) Force a full heal
gluster volume heal VOLUME full
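
Progress of the heal started in step 4 can be watched before relying on the new brick. A minimal sketch, assuming the usual per-brick "Number of entries: N" lines printed by `gluster volume heal VOLUME info`; the helper name `pending_heals` is made up for illustration:

```shell
# Hypothetical helper: sum the per-brick "Number of entries:" counters
# read from stdin. Assumes the usual `heal info` output layout; a brick
# that reports "-" instead of a number is counted as 0.
pending_heals() {
    awk -F': ' '/^Number of entries:/ { total += $2 + 0 } END { print total + 0 }'
}

# Usage (needs a running volume, so it is left commented out):
#   gluster volume heal VOLUME info | pending_heals
```

A result of 0 only means that no brick reported pending entries at that moment; a brick that is down reports no number at all, so the per-brick status lines are still worth reading.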


>You might ask, why does it matter? If the data needs to be replicated
>to the ex-arbiter brick anyway, also rebuilding the metadata is only
>a very slight overhead. Yes, but if the metadata on the ex-arbiter
>remains intact, any one other brick can go down while the ex-arbiter
>is building up its datastore and the volume will still have quorum.
The arbiter holds only metadata, but it's useful to have it running. Yet in
both cases (remove-brick + add-brick, or replace-brick) there is a moment where
some files/dirs won't have metadata on the arbiter. You have to take that
risk. And you always have the option to reduce the quorum statically to
"1", so that even in replica 2 the surviving node will keep serving requests
from the clients.
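
Reducing the quorum statically maps to two volume options, cluster.quorum-type and cluster.quorum-count. A rough sketch, not a tested procedure: the `run` and `set_static_quorum` helpers are made up for illustration, VOLUME is a placeholder, and the commands are only printed unless DRY_RUN=0 is set:

```shell
# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Hypothetical helper: pin client quorum to a fixed number of bricks.
set_static_quorum() {
    vol=$1
    count=$2
    run gluster volume set "$vol" cluster.quorum-type fixed
    run gluster volume set "$vol" cluster.quorum-count "$count"
}

# Usage: set_static_quorum VOLUME 1
```

It is worth recording the previous values with `gluster volume get` first, so that they can be restored once the failed brick is back and healed.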
>Aha, it's the client writing to the bricks and not the server?
>That's the part that I had not understood.
What you described is the NFS xlator (the old legacy gNFS, which is disabled by
default, though you can recompile with it enabled); yet the NFS xlator will try
to replicate to all nodes in the cluster simultaneously.

> What are you trying to achieve? What is your setup (clients, servers, etc.)?

>Now you might think georeplication, but that won't work for a mailstore
>(a) because georeplication is asynchronous, so if mailserver1 suddenly
>goes down and mailserver2 takes over, there will be mail on mailserver1
>that is still missing on mailserver2 and will remain missing until
>mailserver1 comes back up again, and (b) because georeplication (if
>I have understood the docs correctly) only works in one direction,
>so that any mail that arrives on a downstream replica will never be
>propagated to its upstream replicas.
Geo-replication is not that slow. In my experience, with the default settings
it syncs quite frequently. I understand that it will be an issue if a mail is
missing because Node1 died before the replication had time to distribute it.
Keep in mind that secondary volumes (a.k.a. slave volumes) are in read-only
mode by default ... just mentioning it.



>I want to get rid of the arbiter and have three full replicas.
You have 2 options: remove-brick + add-brick, or the old-school
"replace-brick". In both cases there is a moment where the new brick
still has some data replicating, and if an old "data" brick fails, you
have to change the quorum to "1" until you fix the issue.

>There are three machines running gluster 8.3 and only using gluster
>as the client (mount -t glusterfs) without nfs or anything else.
If the node is both Gluster and App, we call it a HyperConverged setup. Quite
typical usage.
>One machine is in Stockholm, one is in Athens and one in Frankfurt a/M,
>though the latter will eventually migrate to Buenos Aires. That's
>a lot of latency and then the Athens connection is also very slow.
That's a lot of latency and bandwidth restriction. With a regular replica the
performance will be quite limited. Reads happen locally (if you use the default
value for the "cluster.choose-local" option), but writes will go to all
bricks and will be confirmed only when all bricks confirm the FOP (file
operation) or time out. I'm not sure if there is an option that allows limiting
the FUSE operation timeout without touching "network.ping-timeout".
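
The current values of the two options named above can be read back with `gluster volume get`. A small sketch: VOLUME is a placeholder, the `run` and `show_replica_tuning` helpers are hypothetical, and the commands are only printed unless DRY_RUN=0 is set:

```shell
# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Hypothetical helper: show the options that govern local reads and
# how long clients wait for an unresponsive brick.
show_replica_tuning() {
    vol=$1
    for opt in cluster.choose-local network.ping-timeout; do
        run gluster volume get "$vol" "$opt"
    done
}

# Usage: show_replica_tuning VOLUME
```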
>Now, I've read the docs and I know very well that I am doing things
>way out of "the normal way", but I am willing to trade performance
>for resiliency on the mail server, so if I can get that distributed
>mailstore to work somewhat properly, I don't care at all if new mail
>takes 15 minutes to propagate to the slow Athens node. What's
>important is that all three nodes are perfectly synchronised and that
>mail continues to work seamlessly if any one of them goes down[1].
Erm... if you have a 'replica 3' volume and slow bandwidth to Athens, then
you might have to check your mail server's timeouts (and bump them), as it
might get stuck in "D" state (waiting for I/O) while writing the
e-mail.
Most probably every write will be slower than usual, but the reads should
not be affected.

You are definitely out of the "normal", but if performance is not
the highest priority, it should work.

Best Regards,
Strahil Nikolov

Zenon Panoussis

2020-Dec-28 21:14 UTC

> And you always have the option to reduce the quorum statically to "1"
This is a very interesting tidbit of information. I was
wondering if there was some way to preload data on a brick,
and I think you might have just given me one.

I have a volume of three peers, one brick each. Two peers
have a fast connection, the third one has a very slow
connection. In normal operation this doesn't matter,
because there will only be fairly small changes to the
filesystem over time. However, when loading the initial
data on the volume before it becomes operative, the one
slow connection becomes a bottleneck for two fast ones.
So I'm thinking now whether I could

1. join the three peers and build the empty volume,
2. take the slow peer off-line,
3. load the data on the crippled volume, so that it is
   written to the two fast peers that are still online,
4. take the two fast peers offline and put the slow peer
   online,
5. reduce quorum to 1,
6. load the exact same data locally to the slow peer, and
7. put the two fast peers back online and increase quorum
   to 2.

This would lead to all three bricks having the exact same
data without the delay of the slow transfer, but it will
only work if the exact same metadata are created for the
same files during the two separate loads. That is, if a
given file foo always produces the exact same metadata,
after loading foo to different bricks on different
occasions, the metadata of all bricks will be identical
and no healing would be needed.

Is that so, or am I imagining impossible acrobatics?

Z

Diego Zuccato

2021-Jan-13 07:27 UTC

On 28/12/20 22:14, Zenon Panoussis wrote:
> Is that so, or am I imagining impossible acrobatics?
Given the slow link, snail mail is probably faster.

Configure a new node near the fast ones, add it to the pool, replace
thin arbiters with full replicas on the new node, let it rebuild (fast,
since it's "local"), then put it offline and send it to the final
location. Once you turn it on again it will have to sync only the latest
changes.

Should take less than 3 weeks :)

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
