Strahil Nikolov
2020-Dec-01 13:15 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
Replica 3 with quorum 1? This is not good. I doubt anyone will help you with this. The idea of replica 3 volumes is to tolerate the loss of 1 node, as when a second one is dead only 1 brick is left to accept writes. You can imagine the situation when 2 bricks are down and data is written to brick 3. What happens when bricks 1 and 2 are up and running again -> how is gluster going to decide where to heal from? 2 is more than 1, so the third node should delete the file instead of the opposite. What are you trying to achieve with quorum 1?

Best Regards,
Strahil Nikolov

On Tuesday, 1 December 2020, 14:09:32 GMT+2, Dmitry Antipov <dmantipov at yandex.ru> wrote:

It seems that the consistency of a replica 3 volume with quorum forced to 1 becomes broken after a few forced volume restarts initiated after 2 brick failures. At least it breaks GFAPI clients, and even a volume restart doesn't help.

Volume setup is:

Volume Name: test0
Type: Replicate
Volume ID: 919352fb-15d8-49cb-b94c-c106ac68f072
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.112:/glusterfs/test0-000
Brick2: 192.168.1.112:/glusterfs/test0-001
Brick3: 192.168.1.112:/glusterfs/test0-002
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

The client is fio with the following options:

[global]
name=write
filename=testfile
ioengine=gfapi_async
volume=test0
brick=localhost
create_on_open=1
rw=randwrite
direct=1
numjobs=1
time_based=1
runtime=600

[test-4-kbytes]
bs=4k
size=1G
iodepth=128

How to reproduce:

0) start the volume;
1) run fio;
2) run 'gluster volume status', select 2 arbitrary brick processes
   and kill them;
3) make sure fio is OK;
4) wait a few seconds, then issue 'gluster volume start [VOL] force'
   to restart the bricks, and finally issue 'gluster volume status' again
   to check whether all bricks are running;
5) restart from 2).

This is likely to work a few times but, sooner or later, it breaks at 3) and fio detects an I/O error, most probably EIO or ENOTCONN. From this point on, killing and restarting fio yields an error in glfs_creat(), and even a manual volume restart doesn't help.
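A rough shell sketch of this loop (just a sketch: the job file name, the sleep intervals and the awk parsing of 'gluster volume status' are assumptions, and the PID column of the status output may differ between gluster releases):

#!/bin/sh
VOL=test0              # volume name from the setup above
JOB=gfapi-test.fio     # hypothetical file holding the fio job quoted above

gluster volume start "$VOL"            # step 0)
fio "$JOB" &                           # step 1)
FIO_PID=$!

while kill -0 "$FIO_PID" 2>/dev/null; do        # step 3): stop as soon as fio dies
    # step 2): pick two brick processes (here simply the first two) and kill them
    PIDS=$(gluster volume status "$VOL" | awk '/test0-00/ { print $NF }' | head -n 2)
    kill -9 $PIDS
    sleep 5
    gluster volume start "$VOL" force           # step 4): restart the bricks ...
    gluster volume status "$VOL"                # ... and check that they are back
    sleep 5                                     # step 5): repeat
done
echo "fio has exited - check its output for EIO/ENOTCONN"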
NOTE: as of 7914c6147adaf3ef32804519ced850168fff1711, fio's gfapi_async engine is still incomplete and _silently ignores I/O errors_. Currently I'm using the following tweak to detect and report them (YMMV, consider it experimental):

diff --git a/engines/glusterfs_async.c b/engines/glusterfs_async.c
index 0392ad6e..27ebb6f1 100644
--- a/engines/glusterfs_async.c
+++ b/engines/glusterfs_async.c
@@ -7,6 +7,7 @@
 #include "gfapi.h"
 #define NOT_YET 1
 struct fio_gf_iou {
+    struct thread_data *td;
     struct io_u *io_u;
     int io_complete;
 };
@@ -80,6 +81,7 @@ static int fio_gf_io_u_init(struct thread_data *td, struct io_u *io_u)
     }
     io->io_complete = 0;
     io->io_u = io_u;
+    io->td = td;
     io_u->engine_data = io;
     return 0;
 }
@@ -95,7 +97,20 @@ static void gf_async_cb(glfs_fd_t * fd, ssize_t ret, void *data)
     struct fio_gf_iou *iou = io_u->engine_data;

     dprint(FD_IO, "%s ret %zd\n", __FUNCTION__, ret);
-    iou->io_complete = 1;
+    if (ret != io_u->xfer_buflen) {
+        if (ret >= 0) {
+            io_u->resid = io_u->xfer_buflen - ret;
+            io_u->error = 0;
+            iou->io_complete = 1;
+        } else
+            io_u->error = errno;
+    }
+
+    if (io_u->error) {
+        log_err("IO failed (%s).\n", strerror(io_u->error));
+        td_verror(iou->td, io_u->error, "xfer");
+    } else
+        iou->io_complete = 1;
 }

 static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td,

--
Dmitry
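For anyone who wants to reproduce this, the quoted volume layout can be recreated with roughly the following commands (a sketch based on the 'volume info' output above, plus whatever else you want from the 'Options Reconfigured' list; 'force' is only needed because all three bricks sit on the same host):

gluster volume create test0 replica 3 \
    192.168.1.112:/glusterfs/test0-000 \
    192.168.1.112:/glusterfs/test0-001 \
    192.168.1.112:/glusterfs/test0-002 force
gluster volume set test0 cluster.quorum-type fixed
gluster volume set test0 cluster.quorum-count 1
gluster volume start test0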
Dmitry Antipov
2020-Dec-01 14:23 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
On 12/1/20 4:15 PM, Strahil Nikolov wrote:
> You can imagine the situation when 2 bricks are down and data is written to brick 3. What happens when bricks 1 and 2 are up and running again -> how is gluster going to decide where to heal from?

At least I can imagine a volume option to specify the behavior "assume that the only live brick contains the most recent (and so hopefully valid) data, so newly (re)started bricks should heal from it".

Dmitry
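As far as I know, today that decision has to be inspected and resolved by hand rather than by an automatic "trust the only surviving brick" rule like the one imagined above. A sketch, assuming the file name from the fio job earlier in the thread (testfile at the volume root) and that brick test0-002 is the one that stayed up:

# see what the bricks disagree about after the forced restarts
gluster volume heal test0 info
gluster volume heal test0 info split-brain

# manual resolution: newest copy wins, or an explicit source brick wins
gluster volume heal test0 split-brain latest-mtime /testfile
gluster volume heal test0 split-brain source-brick 192.168.1.112:/glusterfs/test0-002 /testfile

# closest automatic knob (if available in your release): resolve by
# size/ctime/mtime/majority rather than "last brick standing"
gluster volume set test0 cluster.favorite-child-policy mtime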