> Merry Christmas!

To you too :)

>> I have set up a replica 3 arbiter 1 volume. Is there a way to turn
>> the arbiter into a full replica without breaking the volume and
>> losing the metadata that is already on the arbiter?

> Yes, you have to use "remove-brick" with the option "replica" to reduce
> the replica count and then reformat the arbiter brick and add it back.

But if I do that, the metadata that are already on the brick will be
lost. What I was asking is whether there is a way to "upgrade" the
arbiter to a full replica without losing the metadata in the meanwhile.

You might ask, why does it matter? If the data needs to be replicated
to the ex-arbiter brick anyway, also rebuilding the metadata is only a
very slight overhead. Yes, but if the metadata on the ex-arbiter
remains intact, any one other brick can go down while the ex-arbiter is
building up its datastore and the volume will still have quorum.

>> where brick2<->brick3 is a high-speed connection, but brick1<->brick2
>> and brick1<->brick3 are low speed, and data is fed to brick1, is there
>> a way to tell the volume that brick1 should only feed brick2 and let
>> brick2 feed brick3 if (and only if) all three are online, rather than
>> brick1 feeding both brick2 and brick3?

> Erm... this is not how it works. The FUSE client (mount -t glusterfs)
> is writing to all bricks in the replica volume, not brick to brick.

Aha, it's the client writing to the bricks and not the server? That's
the part that I had not understood.

> What are you trying to achieve ? What is your setup (clients, servers, etc)?

The goal: a resilient and geographically distributed mailstore. A mail
server is a very dynamic thing, with files being written, moved and
deleted all the time. You can put the mailstore on a SAN and access it
from multiple SMTP and IMAP servers, but if the SAN goes down,
everything is down. What I am trying to do is to distribute the
mailstore over several locations and internet connections that function
completely independently of each other.

Now you might think georeplication, but that won't work for a mailstore,
(a) because georeplication is asynchronous, so if mailserver1 suddenly
goes down and mailserver2 takes over, there will be mail on mailserver1
that is still missing on mailserver2 and will remain missing until
mailserver1 comes back up again, and (b) because georeplication (if I
have understood the docs correctly) only works in one direction, so
that any mail that arrives on a downstream replica will never be
propagated to its upstream replicas.

That's why I'm using a normal synchronous replica, currently
experimenting and testing with replica 3 arbiter 1. If and when this
goes into production, I want to get rid of the arbiter and have three
full replicas.

There are three machines running gluster 8.3 and only using gluster as
the client (mount -t glusterfs) without NFS or anything else. One
machine is in Stockholm, one is in Athens and one in Frankfurt a/M,
though the latter will eventually migrate to Buenos Aires. That's a
lot of latency, and the Athens connection is also very slow. That's why
I asked whether I could configure brick1 (where the data is now coming
in) to only write to brick2 and let brick2 write to brick3.
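(For reference, a minimal sketch of how a 'replica 3 arbiter 1' volume
of this kind is typically created and FUSE-mounted. The volume name
"mailvol", the hostnames sto1/ath1/fra1 and the brick paths are
placeholders invented for illustration, not taken from this thread; the
third brick in the list becomes the arbiter.)

  gluster volume create mailvol replica 3 arbiter 1 \
      sto1:/gluster/mail/brick ath1:/gluster/mail/brick fra1:/gluster/mail/brick
  gluster volume start mailvol

  # FUSE mount on each node; no NFS involved
  mount -t glusterfs localhost:/mailvol /var/mail-store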
Now, I've read the docs and I know very well that I am doing things way
out of "the normal way", but I am willing to trade performance for
resiliency on the mail server, so if I can get that distributed
mailstore to work somewhat properly, I don't care at all if new mail
takes 15 minutes to propagate to the slow Athens node. What's important
is that all three nodes are perfectly synchronised and that mail
continues to work seamlessly if any one of them goes down [1].

Z

[1] Beyond the scope of gluster: with synchronous replication, if mail
is being delivered to one node, it won't be finally accepted by the
mail server until it has also been written to the other online nodes.
This means that if the receiving node goes down or the volume gets out
of quorum before the incoming mail is everywhere on the volume, the
sending mail server will never get an acknowledgement of receipt and
will therefore try to resend the mail later. Thus, if all nodes are
advertised as MX in DNS, the mail will be resent to another node five
minutes later.
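(To illustrate footnote [1]: a hypothetical DNS zone fragment in which
all three nodes are advertised as equal-preference MX records.
"example.com" and the host names are placeholders. A sending server
that never gets an acknowledgement from one MX will queue the mail and
retry delivery later, possibly via another MX.)

  example.com.    3600  IN  MX  10  mx-sto.example.com.
  example.com.    3600  IN  MX  10  mx-ath.example.com.
  example.com.    3600  IN  MX  10  mx-fra.example.com.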
> But if I do that, the metadata that are already on the brick will be
> lost. What I was asking, is whether there is a way to "upgrade" the
> arbiter to a full replica without losing the metadata in the meanwhile.

You have a 'replica 3 arbiter 1' volume. When you want to replace the
arbiter you will need to do it in several steps:

1) Use remove-brick to get rid of the arbiter, like this:

   gluster volume remove-brick VOLUME replica 2 arbiter:/path/to/brick force

   The command will reduce the volume from 'replica 3 arbiter 1' to
   'replica 2'. You still have the 2 data bricks left and running.

2) Reusing the brick is easiest if you just umount it, wipe the fs and
   recreate it. It's far simpler:

   umount /dev/VG/arbiter-brick
   mkfs.xfs -f -i size=512 /dev/VG/arbiter-brick
   mount /dev/VG/arbiter-brick </path/to/lv/mountpoint>
   mkdir </path/to/lv/mountpoint>/brick

3) Add the recreated brick:

   gluster volume add-brick VOLUME replica 3 arbiter:/path/to/lv/mountpoint/brick

4) Force a heal:

   gluster volume heal VOLUME full

> You might ask, why does it matter? If the data needs to be replicated
> to the ex-arbiter brick anyway, also rebuilding the metadata is only
> a very slight overhead. Yes, but if the metadata on the ex-arbiter
> remains intact, any one other brick can go down while the ex-arbiter
> is building up its datastore and the volume will still have quorum.

The arbiter holds only metadata, but it's useful to have it running.
Yet, in both cases (remove-brick + add-brick, or replace-brick) there
is a moment where some files/dirs won't have metadata on the arbiter.
You have to take the risk. And you always have the option to reduce
the quorum statically to "1", so even in replica 2 the surviving node
will keep serving requests from the clients.

> Aha, it's the client writing to the bricks and not the server? That's
> the part that I had not understood.

What you described is the NFS xlator (the old legacy gNFS, which is
disabled by default, but you can recompile it in), yet even the NFS
xlator will try to replicate to all nodes in the cluster
simultaneously.

> What are you trying to achieve ? What is your setup (clients, servers, etc)?

> Now you might think georeplication, but that won't work for a mailstore
> (a) because georeplication is asynchronous, so if mailserver1 suddenly
> goes down and mailserver2 takes over, there will be mail on mailserver1
> that is still missing on mailserver2 and will remain missing until
> mailserver1 comes back up again, and (b) because georeplication (if
> I have understood the docs correctly) only works in one direction,
> so that any mail that arrives on a downstream replica will never be
> propagated to its upstream replicas.

Geo-replication is not so slow. Based on my experience it syncs quite
often by default. I understand that it will be an issue if a mail is
missing because node1 died and the replication hasn't had the time to
distribute it. Keep in mind that secondary volumes (a.k.a. slave
volumes) are in read-only mode by default... just mentioning it.

> I want to get rid of the arbiter and have three full replicas.

You have 2 options -> remove-brick + add-brick, or the old school
"replace-brick". In both cases there is a moment where the new brick
still has some data replicating to it, and if an old "data" brick fails
then, you have to change the quorum to "1" until you fix the issue (see
the option sketch a bit further down).

> There are three machines running gluster 8.3 and only using gluster
> as the client (mount -t glusterfs) without nfs or anything else.

If the node is both Gluster and App, we call it a HyperConverged setup.
Quite typical usage.
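(A minimal sketch of the "reduce the quorum statically to 1" option
mentioned above, using the standard volume options; VOLUME is a
placeholder. This keeps a single surviving data brick writable at the
cost of a higher split-brain risk, so the defaults should be restored
once all bricks are healthy again.)

  gluster volume set VOLUME cluster.quorum-type fixed
  gluster volume set VOLUME cluster.quorum-count 1

  # once the volume is fully healed, go back to the defaults
  gluster volume set VOLUME cluster.quorum-type auto
  gluster volume reset VOLUME cluster.quorum-count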
> One machine is in Stockholm, one is in Athens and one in Frankfurt a/M,
> though the latter will eventually migrate to Buenos Aires. That's
> a lot of latency and then the Athens connection is also very slow.

That's a lot of latency and bandwidth restriction. With a regular
replica the performance will be quite limited. Reads happen locally (if
you use the default value for the "cluster.choose-local" option), but
writes will go to all bricks and will be confirmed only when all bricks
confirm the FOP (file operation) or time out. I'm not sure if there is
an option that allows limiting the FUSE operation timeout without
touching "network.ping-timeout" (see the short sketch at the end of
this message).

> Now, I've read the docs and I know very well that I am doing things
> way out of "the normal way", but I am willing to trade performance
> for resiliency on the mail server, so if I can get that distributed
> mailstore to work somewhat properly, I don't care at all if new mail
> takes 15 minutes to propagate to the slow Athens node. What's
> important is that all three nodes are perfectly synchronised and that
> mail continues to work seamlessly if any one of them goes down[1].

Erm... in a 'replica 3' volume where you have slow bandwidth to Athens,
you might have to check your mail server's timeouts (and bump them), as
it might get stuck in "D" state (waiting for I/O) while writing the
e-mail. Most probably every write will be slower than usual, but the
reads should not be affected.

You are definitely out of the "normal", but if performance is not the
highest priority - it should work.

Best Regards,
Strahil Nikolov
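(For reference, the two options mentioned above can be inspected and
tuned per volume with the standard gluster CLI; VOLUME is a placeholder
and the value 60 below is purely illustrative, not a recommendation.)

  gluster volume get VOLUME cluster.choose-local
  gluster volume get VOLUME network.ping-timeout

  # example of changing them, if ever needed
  gluster volume set VOLUME cluster.choose-local on
  gluster volume set VOLUME network.ping-timeout 60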
On 2020-12-27 03:00, Zenon Panoussis wrote:

> The goal: a resilient and geographically distributed mailstore. A
> mail server is a very dynamic thing, with files being written, moved
> and deleted all the time. You can put the mailstore on a SAN and
> access it from multiple SMTP and IMAP servers, but if the SAN goes
> down, everything is down. What I am trying to do is to distribute
> the mailstore over several locations and internet connections that
> function completely independently of each other.

Hi, I don't think Gluster is the correct tool for the job: synchronous
replication and file lookups will suffer tremendously due to the
high-latency WAN links, grinding everything to a halt. Geo-replication
should be out of the question, because it is a read-only copy from
source to destination (unless things changed recently).

For such a project, I would simply configure the SMTP server to do
protocol-specific replication and use a low-TTL DNS name to publish
the IMAP/Web frontends.

Regards.

--
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
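(To make the "protocol-specific replication" suggestion a bit more
concrete: a heavily hedged configuration sketch, assuming Dovecot were
the IMAP server, which the thread does not state. The hostname is a
placeholder, and a working setup also needs the replicator/doveadm
services configured; see the Dovecot documentation on dsync-based
replication.)

  # /etc/dovecot/conf.d/90-replication.conf (fragment, on each node)
  mail_plugins = $mail_plugins notify replication

  plugin {
    # push mailbox changes to the peer node as they happen
    mail_replica = tcp:mail-peer.example.com
  }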