Dmitry Antipov
2020-Dec-01  12:09 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
It seems that the consistency of a replica 3 volume with quorum forced to 1 becomes
broken after a few forced volume restarts issued after two brick failures.
At the very least it breaks GFAPI clients, and even a volume restart doesn't help.
Volume setup is:
Volume Name: test0
Type: Replicate
Volume ID: 919352fb-15d8-49cb-b94c-c106ac68f072
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.112:/glusterfs/test0-000
Brick2: 192.168.1.112:/glusterfs/test0-001
Brick3: 192.168.1.112:/glusterfs/test0-002
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
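For reference, a throw-away volume with this layout and these options can be
created roughly like this (a sketch reconstructed from the output above; 'force'
is needed because all three bricks sit on the same host):

# all three bricks on one host, so 'force' is required (test setup only)
gluster volume create test0 replica 3 \
    192.168.1.112:/glusterfs/test0-000 \
    192.168.1.112:/glusterfs/test0-001 \
    192.168.1.112:/glusterfs/test0-002 force

# relax client-side quorum so a single surviving brick still accepts writes
gluster volume set test0 cluster.quorum-type fixed
gluster volume set test0 cluster.quorum-count 1

gluster volume start test0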
Client is fio with the following options:
[global]
name=write
filename=testfile
ioengine=gfapi_async
volume=test0
brick=localhost
create_on_open=1
rw=randwrite
direct=1
numjobs=1
time_based=1
runtime=600
[test-4-kbytes]
bs=4k
size=1G
iodepth=128
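Assuming the job file above is saved as test0.fio (the name is arbitrary) and
fio is built with gfapi support, it is started with:

# requires an fio build that includes the gfapi/gfapi_async ioengines
fio test0.fio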
How to reproduce:
0) start the volume;
1) run fio;
2) run 'gluster volume status', select 2 arbitrary brick processes
    and kill them;
3) check that fio is still running and reports no I/O errors;
4) wait a few seconds, then issue 'gluster volume start [VOL] force'
    to restart bricks, and finally issue 'gluster volume status' again
    to check whether all bricks are running;
5) restart from 2).
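Steps 2)-5) can be scripted roughly as follows (a sketch: the awk parsing of the
'gluster volume status' output and the sleep intervals are approximations, and
the volume name test0 is assumed):

VOL=test0
while true; do
    # the last column of each 'Brick ...' line in the status output is the PID;
    # take the first two brick PIDs and kill them
    pids=$(gluster volume status "$VOL" | awk '/^Brick/ { print $NF }' | head -n 2)
    kill -9 $pids

    sleep 5

    # restart the killed bricks and confirm that all of them are back online
    gluster volume start "$VOL" force
    gluster volume status "$VOL"

    sleep 5
done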
This is likely to work a few times but, sooner or later, it breaks at step 3):
fio detects an I/O error, most probably EIO or ENOTCONN. From this point on,
killing and restarting fio yields an error from glfs_creat(), and even a manual
volume restart doesn't help.
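Once the volume is in this state, it may also be worth checking whether the test
file ended up pending heal or in split-brain:

gluster volume heal test0 info
gluster volume heal test0 info split-brain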
NOTE: as of commit 7914c6147adaf3ef32804519ced850168fff1711, fio's gfapi_async
engine is still incomplete and _silently ignores I/O errors_. Currently
I'm using the following tweak to detect and report them (YMMV, consider it
experimental):
diff --git a/engines/glusterfs_async.c b/engines/glusterfs_async.c
index 0392ad6e..27ebb6f1 100644
--- a/engines/glusterfs_async.c
+++ b/engines/glusterfs_async.c
@@ -7,6 +7,7 @@
 #include "gfapi.h"
 #define NOT_YET 1
 struct fio_gf_iou {
+	struct thread_data *td;
 	struct io_u *io_u;
 	int io_complete;
 };
@@ -80,6 +81,7 @@ static int fio_gf_io_u_init(struct thread_data *td, struct io_u *io_u)
 	}
 	io->io_complete = 0;
 	io->io_u = io_u;
+	io->td = td;
 	io_u->engine_data = io;
 	return 0;
 }
@@ -95,7 +97,20 @@ static void gf_async_cb(glfs_fd_t * fd, ssize_t ret, void *data)
 	struct fio_gf_iou *iou = io_u->engine_data;
 	dprint(FD_IO, "%s ret %zd\n", __FUNCTION__, ret);
-	iou->io_complete = 1;
+	if (ret != io_u->xfer_buflen) {
+		if (ret >= 0) {
+			io_u->resid = io_u->xfer_buflen - ret;
+			io_u->error = 0;
+			iou->io_complete = 1;
+		} else
+			io_u->error = errno;
+	}
+
+	if (io_u->error) {
+		log_err("IO failed (%s).\n", strerror(io_u->error));
+		td_verror(iou->td, io_u->error, "xfer");
+	} else
+		iou->io_complete = 1;
 }
 static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td,
--
Dmitry
Strahil Nikolov
2020-Dec-01  13:15 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
Replica 3 with quorum 1?
This is not good, and I doubt anyone will help you with this. The idea of replica 3
volumes is to tolerate the loss of one node; once a second one is down, only one
brick would still be accepting writes.
Imagine the situation where 2 bricks are down and data is written to brick 3.
When bricks 1 and 2 come back up, how is Gluster going to decide which copy to
heal from? Two is more than one, so the third node should discard its copy rather
than the opposite.
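For comparison, the usual client-quorum setup for replica 3 is 'auto', which
allows writes only while at least 2 of the 3 bricks are up, something like:

# restore the default client-side quorum behaviour for replica 3
gluster volume set test0 cluster.quorum-type auto
gluster volume reset test0 cluster.quorum-count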
What are you trying to achieve with quorum 1?
Best Regards,
Strahil Nikolov