Strahil Nikolov
2020-Dec-01 13:15 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
Replica 3 with quorum 1? This is not good. I doubt anyone will help you with this. The idea of replica 3 volumes is to tolerate the loss of 1 node, as when a second one is dead only 1 brick is left to accept writes. You can imagine the situation when 2 bricks are down and data is written to brick 3. What happens when bricks 1 and 2 are up and running again -> how is gluster going to decide where to heal from? 2 is more than 1, so the third node should delete the file instead of the opposite. What are you trying to achieve with quorum 1?

Best Regards,
Strahil Nikolov

On Tuesday, 1 December 2020, 14:09:32 GMT+2, Dmitry Antipov <dmantipov at yandex.ru> wrote:

It seems that the consistency of a replica 3 volume with quorum forced to 1 becomes broken after a few forced volume restarts initiated after 2 brick failures. At least it breaks GFAPI clients, and even a volume restart doesn't help.

Volume setup is:

Volume Name: test0
Type: Replicate
Volume ID: 919352fb-15d8-49cb-b94c-c106ac68f072
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.112:/glusterfs/test0-000
Brick2: 192.168.1.112:/glusterfs/test0-001
Brick3: 192.168.1.112:/glusterfs/test0-002
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

The client is fio with the following options:

[global]
name=write
filename=testfile
ioengine=gfapi_async
volume=test0
brick=localhost
create_on_open=1
rw=randwrite
direct=1
numjobs=1
time_based=1
runtime=600

[test-4-kbytes]
bs=4k
size=1G
iodepth=128

How to reproduce:

0) start the volume;
1) run fio;
2) run 'gluster volume status', select 2 arbitrary brick processes
   and kill them;
3) make sure fio is OK;
4) wait a few seconds, then issue 'gluster volume start [VOL] force'
   to restart the bricks, and finally issue 'gluster volume status' again
   to check whether all bricks are running;
5) restart from 2).

This is likely to work a few times but, sooner or later, it breaks at 3) and fio detects an I/O error, most probably EIO or ENOTCONN. From this point on, killing and restarting fio yields an error in glfs_creat(), and even a manual volume restart doesn't help.
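A rough shell sketch of this loop (just a sketch: the job file name, the sleep intervals and the awk parsing of 'gluster volume status' are assumptions, and the PID column of the status output may differ between gluster releases):

#!/bin/sh
VOL=test0              # volume name from the setup above
JOB=gfapi-test.fio     # hypothetical file holding the fio job quoted above

gluster volume start "$VOL"            # step 0)
fio "$JOB" &                           # step 1)
FIO_PID=$!

while kill -0 "$FIO_PID" 2>/dev/null; do        # step 3): stop as soon as fio dies
    # step 2): pick two brick processes (here simply the first two) and kill them
    PIDS=$(gluster volume status "$VOL" | awk '/test0-00/ { print $NF }' | head -n 2)
    kill -9 $PIDS
    sleep 5
    gluster volume start "$VOL" force           # step 4): restart the bricks ...
    gluster volume status "$VOL"                # ... and check that they are back
    sleep 5                                     # step 5): repeat
done
echo "fio has exited - check its output for EIO/ENOTCONN"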
NOTE: as of 7914c6147adaf3ef32804519ced850168fff1711, fio's gfapi_async engine is still incomplete and _silently ignores I/O errors_. Currently I'm using the following tweak to detect and report them (YMMV, consider it experimental):

diff --git a/engines/glusterfs_async.c b/engines/glusterfs_async.c
index 0392ad6e..27ebb6f1 100644
--- a/engines/glusterfs_async.c
+++ b/engines/glusterfs_async.c
@@ -7,6 +7,7 @@
 #include "gfapi.h"
 #define NOT_YET 1
 struct fio_gf_iou {
+    struct thread_data *td;
     struct io_u *io_u;
     int io_complete;
 };
@@ -80,6 +81,7 @@ static int fio_gf_io_u_init(struct thread_data *td, struct io_u *io_u)
     }
     io->io_complete = 0;
     io->io_u = io_u;
+    io->td = td;
     io_u->engine_data = io;
     return 0;
 }
@@ -95,7 +97,20 @@ static void gf_async_cb(glfs_fd_t * fd, ssize_t ret, void *data)
     struct fio_gf_iou *iou = io_u->engine_data;

     dprint(FD_IO, "%s ret %zd\n", __FUNCTION__, ret);
-    iou->io_complete = 1;
+    if (ret != io_u->xfer_buflen) {
+        if (ret >= 0) {
+            io_u->resid = io_u->xfer_buflen - ret;
+            io_u->error = 0;
+            iou->io_complete = 1;
+        } else
+            io_u->error = errno;
+    }
+
+    if (io_u->error) {
+        log_err("IO failed (%s).\n", strerror(io_u->error));
+        td_verror(iou->td, io_u->error, "xfer");
+    } else
+        iou->io_complete = 1;
 }

 static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td,

--
Dmitry
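For anyone who wants to reproduce this, the quoted volume layout can be recreated with roughly the following commands (a sketch based on the 'volume info' output above, plus whatever else you want from the 'Options Reconfigured' list; 'force' is only needed because all three bricks sit on the same host):

gluster volume create test0 replica 3 \
    192.168.1.112:/glusterfs/test0-000 \
    192.168.1.112:/glusterfs/test0-001 \
    192.168.1.112:/glusterfs/test0-002 force
gluster volume set test0 cluster.quorum-type fixed
gluster volume set test0 cluster.quorum-count 1
gluster volume start test0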
Dmitry Antipov
2020-Dec-01 14:23 UTC
[Gluster-users] Replica 3 volume with forced quorum 1 fault tolerance and recovery
On 12/1/20 4:15 PM, Strahil Nikolov wrote:
> You can imagine the situation when 2 bricks are down and data is written to brick 3. What happens when bricks 1 and 2 are up and running again -> how is gluster going to decide where to heal from?

At least I can imagine a volume option to specify the behavior "assume that the only live brick contains the most recent (and so hopefully valid) data, so newly (re)started bricks should heal from it".

Dmitry
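As far as I know, today that decision has to be inspected and resolved by hand rather than by an automatic "trust the only surviving brick" rule like the one imagined above. A sketch, assuming the file name from the fio job earlier in the thread (testfile at the volume root) and that brick test0-002 is the one that stayed up:

# see what the bricks disagree about after the forced restarts
gluster volume heal test0 info
gluster volume heal test0 info split-brain

# manual resolution: newest copy wins, or an explicit source brick wins
gluster volume heal test0 split-brain latest-mtime /testfile
gluster volume heal test0 split-brain source-brick 192.168.1.112:/glusterfs/test0-002 /testfile

# closest automatic knob (if available in your release): resolve by
# size/ctime/mtime/majority rather than "last brick standing"
gluster volume set test0 cluster.favorite-child-policy mtime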