Dmitry Antipov
2020-Dec-01 12:09 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
It seems that the consistency of a replica 3 volume with quorum forced to 1 breaks after a few forced volume restarts issued after 2 brick failures. At least it breaks GFAPI clients, and even a volume restart doesn't help. The volume setup is:

Volume Name: test0
Type: Replicate
Volume ID: 919352fb-15d8-49cb-b94c-c106ac68f072
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.112:/glusterfs/test0-000
Brick2: 192.168.1.112:/glusterfs/test0-001
Brick3: 192.168.1.112:/glusterfs/test0-002
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

The client is fio with the following job file:

[global]
name=write
filename=testfile
ioengine=gfapi_async
volume=test0
brick=localhost
create_on_open=1
rw=randwrite
direct=1
numjobs=1
time_based=1
runtime=600

[test-4-kbytes]
bs=4k
size=1G
iodepth=128

How to reproduce:

0) start the volume;
1) run fio;
2) run 'gluster volume status', select 2 arbitrary brick processes and kill them;
3) make sure fio is OK;
4) wait a few seconds, then issue 'gluster volume start [VOL] force' to restart the bricks, and finally run 'gluster volume status' again to check whether all bricks are running;
5) restart from 2).

This is likely to work a few times, but sooner or later it breaks at step 3): fio detects an I/O error, most probably EIO or ENOTCONN. From this point on, killing and restarting fio results in an error from glfs_creat(), and even a manual volume restart doesn't help.

NOTE: as of 7914c6147adaf3ef32804519ced850168fff1711, fio's gfapi_async engine is still incomplete and _silently ignores I/O errors_. Currently I'm using the following tweak to detect and report them (YMMV, consider it experimental):

diff --git a/engines/glusterfs_async.c b/engines/glusterfs_async.c
index 0392ad6e..27ebb6f1 100644
--- a/engines/glusterfs_async.c
+++ b/engines/glusterfs_async.c
@@ -7,6 +7,7 @@
 #include "gfapi.h"
 #define NOT_YET 1
 struct fio_gf_iou {
+	struct thread_data *td;
 	struct io_u *io_u;
 	int io_complete;
 };
@@ -80,6 +81,7 @@ static int fio_gf_io_u_init(struct thread_data *td, struct io_u *io_u)
 	}
 	io->io_complete = 0;
 	io->io_u = io_u;
+	io->td = td;
 	io_u->engine_data = io;
 	return 0;
 }
@@ -95,7 +97,20 @@ static void gf_async_cb(glfs_fd_t * fd, ssize_t ret, void *data)
 	struct fio_gf_iou *iou = io_u->engine_data;
 	dprint(FD_IO, "%s ret %zd\n", __FUNCTION__, ret);
-	iou->io_complete = 1;
+	if (ret != io_u->xfer_buflen) {
+		if (ret >= 0) {
+			io_u->resid = io_u->xfer_buflen - ret;
+			io_u->error = 0;
+			iou->io_complete = 1;
+		} else
+			io_u->error = errno;
+	}
+
+	if (io_u->error) {
+		log_err("IO failed (%s).\n", strerror(io_u->error));
+		td_verror(iou->td, io_u->error, "xfer");
+	} else
+		iou->io_complete = 1;
 }
 
 static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td,

--
Dmitry
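[Editor's sketch] To check the glfs_creat() failure independently of fio, a minimal stand-alone GFAPI probe could look like the sketch below. It reuses the volume name test0 and the localhost server from the setup above; the port (24007, the default glusterd port), the probe file name and the build command are illustrative assumptions, not taken from the report.

/*
 * Minimal GFAPI probe: connect to volume "test0" on localhost and try
 * to create a file, printing the errno if glfs_creat() fails.
 *
 * Assumed build command:
 *   gcc probe.c -o probe $(pkg-config --cflags --libs glusterfs-api)
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
	int ret = 1;
	glfs_t *fs = glfs_new("test0");		/* volume name from the report */
	if (!fs)
		return 1;

	/* Same endpoint the fio job uses: brick=localhost; 24007 is the
	 * default glusterd port (an assumption of this sketch). */
	if (glfs_set_volfile_server(fs, "tcp", "localhost", 24007) ||
	    glfs_init(fs)) {
		fprintf(stderr, "init failed: %s\n", strerror(errno));
		goto out;
	}

	/* glfs_creat() is the call that starts failing after the breakage;
	 * "probe-file" is a made-up name for this check. */
	glfs_fd_t *fd = glfs_creat(fs, "probe-file", O_RDWR, 0644);
	if (!fd) {
		fprintf(stderr, "glfs_creat failed: %s\n", strerror(errno));
		goto out;
	}

	printf("glfs_creat succeeded\n");
	glfs_close(fd);
	ret = 0;
out:
	glfs_fini(fs);
	return ret;
}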
Strahil Nikolov
2020-Dec-01 13:15 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
Replica 3 with quorum 1? This is not good, and I doubt anyone will help you with this. The idea of replica 3 volumes is to tolerate the loss of 1 node, because once a second one is dead only 1 brick would be accepting writes. Imagine the situation where 2 bricks are down and data is written to brick 3. What happens when bricks 1 and 2 come back up: how is Gluster going to decide where to heal from? 2 is more than 1, so the third node would delete the file instead of the opposite. What are you trying to achieve with quorum 1?

Best Regards,
Strahil Nikolov
??? log_err("IO failed (%s).\n", strerror(io_u->error)); +??? ??? td_verror(iou->td, io_u->error, "xfer"); +??? } else +??? ??? iou->io_complete = 1; ? } ? static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td, -- Dmitry ________ Community Meeting Calendar: Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC Bridge: https://meet.google.com/cpu-eiue-hvk Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users