> Merry Christmas!

To you too :)

>> I have set up a replica 3 arbiter 1 volume. Is there a way to turn
>> the arbiter into a full replica without breaking the volume and
>> losing the metadata that is already on the arbiter?

> Yes, you have to use "remove-brick" with the option "replica" to reduce
> the replica count and then reformat the arbiter brick and add it back.

But if I do that, the metadata that are already on the brick will be
lost. What I was asking is whether there is a way to "upgrade" the
arbiter to a full replica without losing the metadata in the meanwhile.

You might ask, why does it matter? If the data needs to be replicated
to the ex-arbiter brick anyway, also rebuilding the metadata is only a
very slight overhead. Yes, but if the metadata on the ex-arbiter
remains intact, any one other brick can go down while the ex-arbiter is
building up its datastore and the volume will still have quorum.

>> where brick2<->brick3 is a high-speed connection, but brick1<->brick2
>> and brick1<->brick3 are low speed, and data is fed to brick1, is there
>> a way to tell the volume that brick1 should only feed brick2 and let
>> brick2 feed brick3 if (and only if) all three are online, rather than
>> brick1 feeding both brick2 and brick3?

> Erm... this is not how it works. The FUSE client (mount -t glusterfs)
> is writing to all bricks in the replica volume, not brick to brick.

Aha, it's the client writing to the bricks and not the server? That's
the part that I had not understood.

> What are you trying to achieve ? What is your setup (clients, servers, etc)?

The goal: a resilient and geographically distributed mailstore. A mail
server is a very dynamic thing, with files being written, moved and
deleted all the time. You can put the mailstore on a SAN and access it
from multiple SMTP and IMAP servers, but if the SAN goes down,
everything is down. What I am trying to do is to distribute the
mailstore over several locations and internet connections that function
completely independently of each other.

Now you might think georeplication, but that won't work for a mailstore,
(a) because georeplication is asynchronous, so if mailserver1 suddenly
goes down and mailserver2 takes over, there will be mail on mailserver1
that is still missing on mailserver2 and will remain missing until
mailserver1 comes back up again, and (b) because georeplication (if I
have understood the docs correctly) only works in one direction, so
that any mail that arrives on a downstream replica will never be
propagated to its upstream replicas.

That's why I'm using a normal synchronous replica, currently
experimenting and testing with replica 3 arbiter 1. If and when this
goes into production, I want to get rid of the arbiter and have three
full replicas.

There are three machines running gluster 8.3 and only using gluster as
the client (mount -t glusterfs) without NFS or anything else. One
machine is in Stockholm, one is in Athens and one in Frankfurt a/M,
though the latter will eventually migrate to Buenos Aires. That's a
lot of latency, and the Athens connection is also very slow. That's why
I asked whether I could configure brick1 (where the data is now coming
in) to only write to brick2 and let brick2 write to brick3.
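(For reference, a minimal sketch of how a 'replica 3 arbiter 1' volume
of this kind is typically created and FUSE-mounted. The volume name
"mailvol", the hostnames sto1/ath1/fra1 and the brick paths are
placeholders invented for illustration, not taken from this thread; the
third brick in the list becomes the arbiter.)

  gluster volume create mailvol replica 3 arbiter 1 \
      sto1:/gluster/mail/brick ath1:/gluster/mail/brick fra1:/gluster/mail/brick
  gluster volume start mailvol

  # FUSE mount on each node; no NFS involved
  mount -t glusterfs localhost:/mailvol /var/mail-store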
Now, I've read the docs and I know very well that I am doing things way
out of "the normal way", but I am willing to trade performance for
resiliency on the mail server, so if I can get that distributed
mailstore to work somewhat properly, I don't care at all if new mail
takes 15 minutes to propagate to the slow Athens node. What's important
is that all three nodes are perfectly synchronised and that mail
continues to work seamlessly if any one of them goes down [1].

Z

[1] Beyond the scope of gluster: with synchronous replication, if mail
is being delivered to one node, it won't be finally accepted by the
mail server until it has also been written to the other online nodes.
This means that if the receiving node goes down or the volume gets out
of quorum before the incoming mail is everywhere on the volume, the
sending mail server will never get an acknowledgement of receipt and
will therefore try to resend the mail later. Thus, if all nodes are
advertised as MX in DNS, the mail will be resent to another node five
minutes later.
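(To illustrate footnote [1]: a hypothetical DNS zone fragment in which
all three nodes are advertised as equal-preference MX records.
"example.com" and the host names are placeholders. A sending server
that never gets an acknowledgement from one MX will queue the mail and
retry delivery later, possibly via another MX.)

  example.com.    3600  IN  MX  10  mx-sto.example.com.
  example.com.    3600  IN  MX  10  mx-ath.example.com.
  example.com.    3600  IN  MX  10  mx-fra.example.com.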
> But if I do that, the metadata that are already on the brick will be
> lost. What I was asking, is whether there is a way to "upgrade" the
> arbiter to a full replica without losing the metadata in the meanwhile.

You have a 'replica 3 arbiter 1' volume. When you want to replace the
arbiter you will need to do it in several steps:

1) Use remove-brick to get rid of the arbiter, like this:

   gluster volume remove-brick VOLUME replica 2 arbiter:/path/to/brick force

   The command will reduce the volume from 'replica 3 arbiter 1' to
   'replica 2'. You still have the 2 data bricks left and running.

2) Reusing the brick is easiest if you just umount it, wipe the fs and
   recreate it. It's far simpler:

   umount /dev/VG/arbiter-brick
   mkfs.xfs -f -i size=512 /dev/VG/arbiter-brick
   mount /dev/VG/arbiter-brick </path/to/lv/mountpoint>
   mkdir </path/to/lv/mountpoint>/brick

3) Add the recreated brick:

   gluster volume add-brick VOLUME replica 3 arbiter:/path/to/lv/mountpoint/brick

4) Force a heal:

   gluster volume heal VOLUME full

> You might ask, why does it matter? If the data needs to be replicated
> to the ex-arbiter brick anyway, also rebuilding the metadata is only
> a very slight overhead. Yes, but if the metadata on the ex-arbiter
> remains intact, any one other brick can go down while the ex-arbiter
> is building up its datastore and the volume will still have quorum.

The arbiter holds only metadata, but it's useful to have it running.
Yet, in both cases (remove-brick + add-brick, or replace-brick) there
is a moment where some files/dirs won't have metadata on the arbiter.
You have to take the risk. And you always have the option to reduce
the quorum statically to "1", so even in replica 2 the surviving node
will keep serving requests from the clients.

> Aha, it's the client writing to the bricks and not the server? That's
> the part that I had not understood.

What you described is the NFS xlator (the old legacy gNFS, which is
disabled by default, but you can recompile it in), yet even the NFS
xlator will try to replicate to all nodes in the cluster
simultaneously.

> What are you trying to achieve ? What is your setup (clients, servers, etc)?

> Now you might think georeplication, but that won't work for a mailstore
> (a) because georeplication is asynchronous, so if mailserver1 suddenly
> goes down and mailserver2 takes over, there will be mail on mailserver1
> that is still missing on mailserver2 and will remain missing until
> mailserver1 comes back up again, and (b) because georeplication (if
> I have understood the docs correctly) only works in one direction,
> so that any mail that arrives on a downstream replica will never be
> propagated to its upstream replicas.

Geo-replication is not so slow. Based on my experience it syncs quite
often by default. I understand that it will be an issue if a mail is
missing because node1 died and the replication hasn't had the time to
distribute it. Keep in mind that secondary volumes (a.k.a. slave
volumes) are in read-only mode by default... just mentioning it.

> I want to get rid of the arbiter and have three full replicas.

You have 2 options -> remove-brick + add-brick, or the old school
"replace-brick". In both cases there is a moment where the new brick
still has some data replicating to it, and if an old "data" brick fails
then, you have to change the quorum to "1" until you fix the issue (see
the option sketch a bit further down).

> There are three machines running gluster 8.3 and only using gluster
> as the client (mount -t glusterfs) without nfs or anything else.

If the node is both Gluster and App, we call it a HyperConverged setup.
Quite typical usage.
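(A minimal sketch of the "reduce the quorum statically to 1" option
mentioned above, using the standard volume options; VOLUME is a
placeholder. This keeps a single surviving data brick writable at the
cost of a higher split-brain risk, so the defaults should be restored
once all bricks are healthy again.)

  gluster volume set VOLUME cluster.quorum-type fixed
  gluster volume set VOLUME cluster.quorum-count 1

  # once the volume is fully healed, go back to the defaults
  gluster volume set VOLUME cluster.quorum-type auto
  gluster volume reset VOLUME cluster.quorum-count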
> One machine is in Stockholm, one is in Athens and one in Frankfurt a/M,
> though the latter will eventually migrate to Buenos Aires. That's
> a lot of latency and then the Athens connection is also very slow.

That's a lot of latency and bandwidth restriction. With a regular
replica the performance will be quite limited. Reads happen locally (if
you use the default value for the "cluster.choose-local" option), but
writes will go to all bricks and will be confirmed only when all bricks
confirm the FOP (file operation) or time out. I'm not sure if there is
an option that allows limiting the FUSE operation timeout without
touching "network.ping-timeout" (see the short sketch at the end of
this message).

> Now, I've read the docs and I know very well that I am doing things
> way out of "the normal way", but I am willing to trade performance
> for resiliency on the mail server, so if I can get that distributed
> mailstore to work somewhat properly, I don't care at all if new mail
> takes 15 minutes to propagate to the slow Athens node. What's
> important is that all three nodes are perfectly synchronised and that
> mail continues to work seamlessly if any one of them goes down[1].

Erm... in a 'replica 3' volume where you have slow bandwidth to Athens,
you might have to check your mail server's timeouts (and bump them), as
it might get stuck in "D" state (waiting for I/O) while writing the
e-mail. Most probably every write will be slower than usual, but the
reads should not be affected.

You are definitely out of the "normal", but if performance is not the
highest priority - it should work.

Best Regards,
Strahil Nikolov
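(For reference, the two options mentioned above can be inspected and
tuned per volume with the standard gluster CLI; VOLUME is a placeholder
and the value 60 below is purely illustrative, not a recommendation.)

  gluster volume get VOLUME cluster.choose-local
  gluster volume get VOLUME network.ping-timeout

  # example of changing them, if ever needed
  gluster volume set VOLUME cluster.choose-local on
  gluster volume set VOLUME network.ping-timeout 60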
On 2020-12-27 03:00, Zenon Panoussis wrote:

> The goal: a resilient and geographically distributed mailstore. A
> mail server is a very dynamic thing, with files being written, moved
> and deleted all the time. You can put the mailstore on a SAN and
> access it from multiple SMTP and IMAP servers, but if the SAN goes
> down, everything is down. What I am trying to do is to distribute
> the mailstore over several locations and internet connections that
> function completely independently of each other.

Hi, I don't think Gluster is the correct tool for the job: synchronous
replication and file lookups will suffer tremendously due to the
high-latency WAN links, grinding everything to a halt. Geo-replication
should be out of the question, because it is a read-only copy from
source to destination (unless things changed recently).

For such a project, I would simply configure the SMTP server to do
protocol-specific replication and use a low-TTL DNS name to publish
the IMAP/Web frontends.

Regards.

--
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
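(To make the "protocol-specific replication" suggestion a bit more
concrete: a heavily hedged configuration sketch, assuming Dovecot were
the IMAP server, which the thread does not state. The hostname is a
placeholder, and a working setup also needs the replicator/doveadm
services configured; see the Dovecot documentation on dsync-based
replication.)

  # /etc/dovecot/conf.d/90-replication.conf (fragment, on each node)
  mail_plugins = $mail_plugins notify replication

  plugin {
    # push mailbox changes to the peer node as they happen
    mail_replica = tcp:mail-peer.example.com
  }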