Ravishankar N
2020-Apr-16 07:03 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
The patch by itself is only making changes specific to AFR, so it should not affect other translators. But I wonder how readdir-ahead is enabled in your gnfs stack. All performance xlators are turned off in gnfs except write-behind, and AFAIK there is no way to enable them via the CLI. Did you custom-edit your gnfs volfile to add readdir-ahead? If yes, does the crash go away if you remove the xlator from the nfs volfile?

Regards,
Ravi

On 16/04/20 8:47 am, Erik Jacobson wrote:
> It is important to note that our testing has shown zero split-brain
> errors since the patch... And that it is significantly harder to
> hit the seg fault than it was to hit split-brain before. It's still
> sufficiently frequent that we can't let it out the door. In my intensive
> test case (found elsewhere in the thread), it would 100% hit the problem
> with 57 nodes every time at least once. With the patch, zero split
> brain, but maybe 1 in 4 runs would seg fault. We didn't have a seg
> fault problem previously. This is all within the context of 1 of the 3
> servers in the subvolume being down. I hit the seg fault once with just
> 57 nodes booting (using NFS for their root FS) and no other load.
>
>
> Scott was able to take an analysis pass. Any suggestions? His words
> follow:
>
>
> The segfault appears to occur in read-ahead functionality. We will keep
> the core in case it needs to be looked at again, being sure to copy off
> all necessary metadata to maintain adequate symbol lookup within gdb.
> It may also be possible to breakpoint immediately prior to the segfault,
> but setting the right conditions may prove to be difficult.
>
> A bit of analysis:
>
> Prior to the segfault, the op_errno field in a struct rda_fd_ctx packet
> shows an ENOENT error. The packet is from the call_frame_t parameter of
> rda_fill_fd_cbk() (Backtrace #2). The following shows the progression
> from the call_frame_t parameter to the op_errno field of the rda_fd_ctx
> structure.
>
> (gdb) print {call_frame_t}0x7fe5acf18eb8
> $26 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 0x7fe5ac6d6cf0,
>     prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78, this = 0x7fe63c0162b0,
>   ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {
>         __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
>         __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}}, cookie = 0x0,
>   complete = false, op = GF_FOP_NULL, begin = {tv_sec = 4234,
>     tv_nsec = 637078332}, end = {tv_sec = 4234, tv_nsec = 803882781},
>   wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
>
> (gdb) print {struct rda_local}0x7fe5ac1dbc78
> $27 = {ctx = 0x7fe5ace46590, fd = 0x7fe60433d8b8, xattrs = 0x0,
>   inode = 0x0, offset = 0, generation = 0, skip_dir = 0}
>
> (gdb) print {struct rda_fd_ctx}0x7fe5ace46590
> $28 = {cur_offset = 0, cur_size = 638, next_offset = 1538, state = 36,
>   lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0,
>         __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0,
>         __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}}, entries = {
>     {list = {next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}, {
>         next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}}, d_ino = 0,
>     d_off = 0, d_len = 0, d_type = 0, d_stat = {ia_flags = 0, ia_ino = 0,
>       ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0,
>       ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0,
>       ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0,
>       ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0,
>       ia_attributes = 0, ia_attributes_mask = 0,
>       ia_gfid = '\000' <repeats 15 times>, ia_type = IA_INVAL,
>       ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
>         owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
>         group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
>         other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}},
>     dict = 0x0, inode = 0x0, d_name = 0x7fe5ace466a8 ""},
>   fill_frame = 0x0, stub = 0x0, op_errno = 2, xattrs = 0x0,
>   writes_during_prefetch = 0x0, prefetching = {
>     lk = 0x7fe5ace466d0 "", value = 0}}
>
> The segfault occurs at the bottom of rda_fill_fd_cbk() where the rpc
> call stack frames are being destroyed. The following are what I believe
> to be the three frames that are intended to be destroyed, but it is
> unclear which packet is causing the problem. If I were to dig more into
> this, I would use ddd (graphical debugger). It's been a while since I've
> done low level debugging like this, so I'm a bit rusty.
>
> (gdb) print {call_frame_t}0x7fe5acf18eb8
> $34 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 0x7fe5ac6d6cf0,
>     prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78, this = 0x7fe63c0162b0,
>   ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {
>         __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
>         __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}}, cookie = 0x0,
>   complete = false, op = GF_FOP_NULL, begin = {tv_sec = 4234,
>     tv_nsec = 637078332}, end = {tv_sec = 4234, tv_nsec = 803882781},
>   wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
> (gdb) print {call_frame_t}0x7fe5ac6d6ce0
> $35 = {root = 0x0, parent = 0x563f5a955920, frames = {next = 0x7fe5ac096298,
>     prev = 0x7fe5acf18ec8}, local = 0x0, this = 0x108a, ret = 0x25f90b3c,
>   ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0,
>         __count = 0, __owner = 1586972324, __nusers = 0,
>         __kind = 210092664, __spins = 0, __elision = 0,
>         __list = {__prev = 0x0, __next = 0x0}},
>       __size = "\000\000\000\000\000\000\000\000\244F\227^\000\000\000\000x\302\205\f", '\000' <repeats 19 times>, __align = 0}},
>   cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec = 0,
>     tv_nsec = 0}, end = {tv_sec = 0, tv_nsec = 0}, wind_from = 0x0,
>   wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
> (gdb) print {call_frame_t}0x7fe5ac096288
> $36 = {root = 0x7fe5ac378860, parent = 0x7fe5acf18eb8, frames = {next = 0x7fe5acf18ec8,
>     prev = 0x7fe5ac6d6cf0}, local = 0x0, this = 0x7fe63c014000,
>   ret = 0x7fe63bb5d350 <rda_fill_fd_cbk>, ref_count = 0, lock = {
>     spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
>         __nusers = 0, __kind = 0, __spins = 0, __elision = 0,
>         __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}},
>   cookie = 0x7fe5ac096288, complete = true, op = GF_FOP_READDIRP,
>   begin = {tv_sec = 4234, tv_nsec = 637078816}, end = {tv_sec = 4234,
>     tv_nsec = 803882755},
>   wind_from = 0x7fe63bb5e8c0 <__FUNCTION__.22226> "rda_fill_fd",
>   wind_to = 0x7fe63bb5e3f0 "(this->children->xlator)->fops->readdirp",
>   unwind_from = 0x7fe63bdd8a80 <__FUNCTION__.20442> "afr_readdir_cbk",
>   unwind_to = 0x7fe63bb5dfbb "rda_fill_fd_cbk"}
>
>
> On 4/15/20 8:14 AM, Erik Jacobson wrote:
>> Scott - I was going to start with gluster74 since that is what he
>> started at, but it applies well to gluster72 so I'll start there.
>>
>> Getting ready to go. The patch detail is interesting. This is probably
>> why it took him a bit longer. It wasn't a trivial patch.
>
>
> On Wed, Apr 15, 2020 at 12:45:57PM -0500, Erik Jacobson wrote:
>>> The new split-brain issue is much harder to reproduce, but after several
>> (correcting to say new seg fault issue, the split brain is gone!!)
>>> intense runs, it usually hits once.
>>>
>>> We switched to pure gluster74 plus your patch so we're apples to apples
>>> now.
>>>
>>> I'm going to see if Scott can help debug it.
>>>
>>> Here is the back trace info from the core dump:
>>>
>>> -rw-r----- 1 root root 1.9G Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
>>> -rw-r----- 1 root root 221M Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000.lz4
>>> drwxrwxrwt 9 root root 20K Apr 15 12:40 .
>>> [root@leader3 tmp]#
>>> [root@leader3 tmp]#
>>> [root@leader3 tmp]# gdb core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
>>> GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
>>> Copyright (C) 2018 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.
>>> Type "show copying" and "show warranty" for details.
>>> This GDB was configured as "x86_64-redhat-linux-gnu".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> [New LWP 61102]
>>> [New LWP 61085]
>>> [New LWP 61087]
>>> [New LWP 61117]
>>> [New LWP 61086]
>>> [New LWP 61108]
>>> [New LWP 61089]
>>> [New LWP 61090]
>>> [New LWP 61121]
>>> [New LWP 61088]
>>> [New LWP 61091]
>>> [New LWP 61093]
>>> [New LWP 61095]
>>> [New LWP 61092]
>>> [New LWP 61094]
>>> [New LWP 61098]
>>> [New LWP 61096]
>>> [New LWP 61097]
>>> [New LWP 61084]
>>> [New LWP 61100]
>>> [New LWP 61103]
>>> [New LWP 61104]
>>> [New LWP 61099]
>>> [New LWP 61105]
>>> [New LWP 61101]
>>> [New LWP 61106]
>>> [New LWP 61109]
>>> [New LWP 61107]
>>> [New LWP 61112]
>>> [New LWP 61119]
>>> [New LWP 61110]
>>> [New LWP 61111]
>>> [New LWP 61118]
>>> [New LWP 61123]
>>> [New LWP 61122]
>>> [New LWP 61113]
>>> [New LWP 61114]
>>> [New LWP 61120]
>>> [New LWP 61116]
>>> [New LWP 61115]
>>>
>>> warning: core file may not match specified executable file.
>>> Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd-7.4-1.el8722.0800.200415T1052.a.rhel8hpeerikj.x86_64.debug...done.
>>> done.
>>>
>>> warning: Ignoring non-absolute filename: <linux-vdso.so.1>
>>> Missing separate debuginfo for linux-vdso.so.1
>>> Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/06/44254f9cbaa826db070a796046026adba58266
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/n'.
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
>>>     at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> 193             FRAME_DESTROY(frame);
>>> [Current thread is 1 (Thread 0x7fe617fff700 (LWP 61102))]
>>> Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 zlib-1.2.11-10.el8.x86_64
>>> (gdb) bt
>>> #0  0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
>>>     at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> #1  STACK_DESTROY (stack=0x7fe5ac6d65f8)
>>>     at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> #2  rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=<optimized out>,
>>>     this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=<optimized out>,
>>>     xdata=0x0) at readdir-ahead.c:623
>>> #3  0x00007fe63bd6c3aa in afr_readdir_cbk (frame=<optimized out>,
>>>     cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
>>>     op_errno=<optimized out>, subvol_entries=<optimized out>, xdata=0x0)
>>>     at afr-dir-read.c:234
>>> #4  0x00007fe6400a1e07 in client4_0_readdirp_cbk (req=<optimized out>,
>>>     iov=<optimized out>, count=<optimized out>, myframe=0x7fe5ace0eda8)
>>>     at client-rpc-fops_v2.c:2338
>>> #5  0x00007fe6479ca115 in rpc_clnt_handle_reply (
>>>     clnt=clnt@entry=0x7fe63c0663f0, pollin=pollin@entry=0x7fe60c1737a0)
>>>     at rpc-clnt.c:764
>>> #6  0x00007fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780,
>>>     mydata=0x7fe63c066420, event=<optimized out>, data=0x7fe60c1737a0)
>>>     at rpc-clnt.c:931
>>> #7  0x00007fe6479c707b in rpc_transport_notify (
>>>     this=this@entry=0x7fe63c066780,
>>>     event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
>>>     data=data@entry=0x7fe60c1737a0) at rpc-transport.c:545
>>> #8  0x00007fe640da893c in socket_event_poll_in_async (xl=<optimized out>,
>>>     async=0x7fe60c1738c8) at socket.c:2601
>>> #9  0x00007fe640db03dc in gf_async (
>>>     cbk=0x7fe640da8910 <socket_event_poll_in_async>, xl=<optimized out>,
>>>     async=0x7fe60c1738c8) at ../../../../libglusterfs/src/glusterfs/async.h:189
>>> #10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780)
>>>     at socket.c:2642
>>> #11 socket_event_handler (fd=fd@entry=19, idx=idx@entry=10, gen=gen@entry=1,
>>>     data=data@entry=0x7fe63c066780, poll_in=<optimized out>,
>>>     poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000')
>>>     at socket.c:3040
>>> #12 0x00007fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014,
>>>     event_pool=0x563f5a98c750) at event-epoll.c:650
>>> #13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763
>>> #14 0x00007fe6467a72de in start_thread () from /lib64/libpthread.so.0
>>> #15 0x00007fe645fffa63 in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote:
>>>> After several successful runs of the test case, we thought we were
>>>> solved. Indeed, split-brain is gone.
>>>>
>>>> But we're triggering a seg fault now, even in a less loaded case.
>>>>
>>>> We're going to switch to gluster74, which was your intention, and report
>>>> back.
>>>>
>>>> On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
>>>>>> Attached the wrong patch by mistake in my previous mail. Sending the correct
>>>>>> one now.
>>>>> Early results look GREAT!!
>>>>>
>>>>> We'll keep beating on it. We applied it to gluster72 as that is what we
>>>>> have to ship with. It applied fine with some line moves.
>>>>>
>>>>> If you would like us to also run a test with gluster74 so that you can
>>>>> say that's tested, we can run that test. I can do a special build.
>>>>>
>>>>> THANK YOU!!
>>>>>>
>>>>>> -Ravi
>>>>>>
>>>>>>
>>>>>> On 15/04/20 2:05 pm, Ravishankar N wrote:
>>>>>>
>>>>>>
>>>>>> On 10/04/20 2:06 am, Erik Jacobson wrote:
>>>>>>
>>>>>> Once again thanks for sticking with us. Here is a reply from Scott
>>>>>> Titus. If you have something for us to try, we'd love it. The code had
>>>>>> your patch applied when gdb was run:
>>>>>>
>>>>>>
>>>>>> Here is the addr2line output for those addresses. Very interesting
>>>>>> command, of which I was not aware.
>>>>>>
>>>>>> [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
>>>>>> afr_lookup_metadata_heal_check
>>>>>> afr-common.c:2803
>>>>>> [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
>>>>>> afr_lookup_done
>>>>>> afr-common.c:2455
>>>>>> [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
>>>>>> afr_inode_event_gen_reset
>>>>>> afr-common.c:755
>>>>>>
>>>>>>
>>>>>> Right, so afr_lookup_done() is resetting the event gen to zero. This looks
>>>>>> like a race between lookup and inode refresh code paths. We made some
>>>>>> changes to the event generation logic in AFR. Can you apply the attached
>>>>>> patch and see if it fixes the split-brain issue? It should apply cleanly on
>>>>>> glusterfs-7.4.
>>>>>>
>>>>>> Thanks,
>>>>>> Ravi
>>>>>>
>>>>>> From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
>>>>>> From: Ravishankar N <ravishankar@redhat.com>
>>>>>> Date: Wed, 15 Apr 2020 13:53:26 +0530
>>>>>> Subject: [PATCH] afr: event gen changes
>>>>>>
>>>>>> The general idea of the changes is to prevent resetting event generation
>>>>>> to zero in the inode ctx, since event gen is something that should
>>>>>> follow 'causal order'.
>>>>>>
>>>>>> Change #1:
>>>>>> For a read txn, in inode refresh cbk, if event_generation is
>>>>>> found zero, we are failing the read fop. This is not needed
>>>>>> because change in event gen is only a marker for the next inode refresh to
>>>>>> happen and should not be taken into account by the current read txn.
>>>>>>
>>>>>> Change #2:
>>>>>> The event gen being zero above can happen if there is a racing lookup,
>>>>>> which resets event gen (in afr_lookup_done) if there are non-zero afr
>>>>>> xattrs. The resetting is done only to trigger an inode refresh and a
>>>>>> possible client side heal on the next lookup. That can be achieved by
>>>>>> setting the need_refresh flag in the inode ctx. So replaced all
>>>>>> occurrences of resetting event gen to zero with a call to
>>>>>> afr_inode_need_refresh_set().
>>>>>>
>>>>>> Change #3:
>>>>>> In both lookup and discover path, we are doing an inode refresh which is
>>>>>> not required since all 3 essentially do the same thing: update the inode
>>>>>> ctx with the good/bad copies from the brick replies. Inode refresh also
>>>>>> triggers background heals, but I think it is okay to do it when we call
>>>>>> refresh during the read and write txns and not in the lookup path.
>>>>>>
>>>>>> Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
>>>>>> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
>>>>>> ---
>>>>>>  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
>>>>>>  xlators/cluster/afr/src/afr-common.c          | 92 ++++---------------
>>>>>>  xlators/cluster/afr/src/afr-dir-write.c       |  6 +-
>>>>>>  xlators/cluster/afr/src/afr.h                 |  5 +-
>>>>>>  4 files changed, 29 insertions(+), 82 deletions(-)
>>>>>>
>>>>>> diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> index f4aa351e4..12af0c854 100644
>>>>>> --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
>>>>>>  #We know that second brick has the bigger size file
>>>>>>  BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
>>>>>>
>>>>>> -TEST ls $M0/f3
>>>>>> -TEST cat $M0/f3
>>>>>> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
>>>>>> +TEST cat $M0/f3 #Trigger data heal via readv inode refresh
>>>>>>  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>>>>>>
>>>>>>  #gfid split-brain should be resolved
>>>>>> @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
>>>>>>  EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
>>>>>>  EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
>>>>>>
>>>>>> -TEST ls $M0/f4
>>>>>> -TEST cat $M0/f4
>>>>>> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
>>>>>> +TEST cat $M0/f4 #Trigger data heal via readv inode refresh
>>>>>>  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>>>>>>
>>>>>>  #gfid split-brain should be resolved
>>>>>> diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
>>>>>> index 61f21795e..319665a14 100644
>>>>>> --- a/xlators/cluster/afr/src/afr-common.c
>>>>>> +++ b/xlators/cluster/afr/src/afr-common.c
>>>>>> @@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>>>>>>              metadatamap |= (1 << index);
>>>>>>          }
>>>>>>          if (metadatamap_old != metadatamap) {
>>>>>> -            event = 0;
>>>>>> +            __afr_inode_need_refresh_set(inode, this);
>>>>>>          }
>>>>>>          break;
>>>>>>
>>>>>> @@ -295,7 +295,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>>>>>>              datamap |= (1 << index);
>>>>>>          }
>>>>>>          if (datamap_old != datamap)
>>>>>> -            event = 0;
>>>>>> +            __afr_inode_need_refresh_set(inode, this);
>>>>>>          break;
>>>>>>
>>>>>>      default:
>>>>>> @@ -458,34 +458,6 @@ out:
>>>>>>      return ret;
>>>>>>  }
>>>>>>
>>>>>> -int
>>>>>> -__afr_inode_event_gen_reset_small(inode_t *inode, xlator_t *this)
>>>>>> -{
>>>>>> -    int ret = -1;
>>>>>> -    uint16_t datamap = 0;
>>>>>> -    uint16_t metadatamap = 0;
>>>>>> -    uint32_t event = 0;
>>>>>> -    uint64_t val = 0;
>>>>>> -    afr_inode_ctx_t *ctx = NULL;
>>>>>> -
>>>>>> -    ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> -    if (ret)
>>>>>> -        return ret;
>>>>>> -
>>>>>> -    val = ctx->read_subvol;
>>>>>> -
>>>>>> -    metadatamap = (val & 0x000000000000ffff) >> 0;
>>>>>> -    datamap = (val & 0x00000000ffff0000) >> 16;
>>>>>> -    event = 0;
>>>>>> -
>>>>>> -    val = ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
>>>>>> -          (((uint64_t)event) << 32);
>>>>>> -
>>>>>> -    ctx->read_subvol = val;
>>>>>> -
>>>>>> -    return ret;
>>>>>> -}
>>>>>> -
>>>>>>  int
>>>>>>  __afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>>>>>>                              unsigned char *metadata, int *event_p)
>>>>>> @@ -556,22 +528,6 @@ out:
>>>>>>      return ret;
>>>>>>  }
>>>>>>
>>>>>> -int
>>>>>> -__afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>> -{
>>>>>> -    afr_private_t *priv = NULL;
>>>>>> -    int ret = -1;
>>>>>> -
>>>>>> -    priv = this->private;
>>>>>> -
>>>>>> -    if (priv->child_count <= 16)
>>>>>> -        ret = __afr_inode_event_gen_reset_small(inode, this);
>>>>>> -    else
>>>>>> -        ret = -1;
>>>>>> -
>>>>>> -    return ret;
>>>>>> -}
>>>>>> -
>>>>>>  int
>>>>>>  afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>>>>>>                            unsigned char *metadata, int *event_p)
>>>>>> @@ -721,30 +677,22 @@ out:
>>>>>>      return need_refresh;
>>>>>>  }
>>>>>>
>>>>>> -static int
>>>>>> -afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>> +int
>>>>>> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>>  {
>>>>>>      int ret = -1;
>>>>>>      afr_inode_ctx_t *ctx = NULL;
>>>>>>
>>>>>> -    GF_VALIDATE_OR_GOTO(this->name, inode, out);
>>>>>> -
>>>>>> -    LOCK(&inode->lock);
>>>>>> -    {
>>>>>> -        ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> -        if (ret)
>>>>>> -            goto unlock;
>>>>>> -
>>>>>> +    ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> +    if (ret == 0) {
>>>>>>          ctx->need_refresh = _gf_true;
>>>>>>      }
>>>>>> -unlock:
>>>>>> -    UNLOCK(&inode->lock);
>>>>>> -out:
>>>>>> +
>>>>>>      return ret;
>>>>>>  }
>>>>>>
>>>>>>  int
>>>>>> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>>  {
>>>>>>      int ret = -1;
>>>>>>
>>>>>> @@ -754,7 +702,7 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>>          "Resetting event gen for %s", uuid_utoa(inode->gfid));
>>>>>>      LOCK(&inode->lock);
>>>>>>      {
>>>>>> -        ret = __afr_inode_event_gen_reset(inode, this);
>>>>>> +        ret = __afr_inode_need_refresh_set(inode, this);
>>>>>>      }
>>>>>>      UNLOCK(&inode->lock);
>>>>>>  out:
>>>>>> @@ -1187,7 +1135,7 @@ afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
>>>>>>      ret = afr_inode_get_readable(frame, inode, this, local->readable,
>>>>>>                                   &event_generation, local->transaction.type);
>>>>>>
>>>>>> -    if (ret == -EIO || (local->is_read_txn && !event_generation)) {
>>>>>> +    if (ret == -EIO) {
>>>>>>          /* No readable subvolume even after refresh ==> splitbrain.*/
>>>>>>          if (!priv->fav_child_policy) {
>>>>>>              err = EIO;
>>>>>> @@ -2451,7 +2399,7 @@ afr_lookup_done(call_frame_t *frame, xlator_t *this)
>>>>>>          if (read_subvol == -1)
>>>>>>              goto cant_interpret;
>>>>>>          if (ret) {
>>>>>> -            afr_inode_event_gen_reset(local->inode, this);
>>>>>> +            afr_inode_need_refresh_set(local->inode, this);
>>>>>>              dict_del_sizen(local->replies[read_subvol].xdata, GF_CONTENT_KEY);
>>>>>>          }
>>>>>>      } else {
>>>>>> @@ -3007,6 +2955,7 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>>>>>>      afr_private_t *priv = NULL;
>>>>>>      afr_local_t *local = NULL;
>>>>>>      int read_subvol = -1;
>>>>>> +    int ret = 0;
>>>>>>      unsigned char *data_readable = NULL;
>>>>>>      unsigned char *success_replies = NULL;
>>>>>>
>>>>>> @@ -3028,7 +2977,10 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>>>>>>      if (!afr_has_quorum(success_replies, this, frame))
>>>>>>          goto unwind;
>>>>>>
>>>>>> -    afr_replies_interpret(frame, this, local->inode, NULL);
>>>>>> +    ret = afr_replies_interpret(frame, this, local->inode, NULL);
>>>>>> +    if (ret) {
>>>>>> +        afr_inode_need_refresh_set(local->inode, this);
>>>>>> +    }
>>>>>>
>>>>>>      read_subvol = afr_read_subvol_decide(local->inode, this, NULL,
>>>>>>                                           data_readable);
>>>>>> @@ -3284,11 +3236,7 @@ afr_discover(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>>>>>>      afr_read_subvol_get(loc->inode, this, NULL, NULL, &event,
>>>>>>                          AFR_DATA_TRANSACTION, NULL);
>>>>>>
>>>>>> -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
>>>>>> -                                  local->event_generation))
>>>>>> -        afr_inode_refresh(frame, this, loc->inode, NULL, afr_discover_do);
>>>>>> -    else
>>>>>> -        afr_discover_do(frame, this, 0);
>>>>>> +    afr_discover_do(frame, this, 0);
>>>>>>
>>>>>>      return 0;
>>>>>>  out:
>>>>>> @@ -3429,11 +3377,7 @@ afr_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>>>>>>      afr_read_subvol_get(loc->parent, this, NULL, NULL, &event,
>>>>>>                          AFR_DATA_TRANSACTION, NULL);
>>>>>>
>>>>>> -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
>>>>>> -                                  local->event_generation))
>>>>>> -        afr_inode_refresh(frame, this, loc->parent, NULL, afr_lookup_do);
>>>>>> -    else
>>>>>> -        afr_lookup_do(frame, this, 0);
>>>>>> +    afr_lookup_do(frame, this, 0);
>>>>>>
>>>>>>      return 0;
>>>>>>  out:
>>>>>> diff --git a/xlators/cluster/afr/src/afr-dir-write.c b/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> index 82a72fddd..333085b14 100644
>>>>>> --- a/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> +++ b/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> @@ -119,11 +119,11 @@ __afr_dir_write_finalize(call_frame_t *frame, xlator_t *this)
>>>>>>              continue;
>>>>>>          if (local->replies[i].op_ret < 0) {
>>>>>>              if (local->inode)
>>>>>> -                afr_inode_event_gen_reset(local->inode, this);
>>>>>> +                afr_inode_need_refresh_set(local->inode, this);
>>>>>>              if (local->parent)
>>>>>> -                afr_inode_event_gen_reset(local->parent, this);
>>>>>> +                afr_inode_need_refresh_set(local->parent, this);
>>>>>>              if (local->parent2)
>>>>>> -                afr_inode_event_gen_reset(local->parent2, this);
>>>>>> +                afr_inode_need_refresh_set(local->parent2, this);
>>>>>>              continue;
>>>>>>          }
>>>>>>
>>>>>> diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
>>>>>> index a3f2942b3..ed6d777c1 100644
>>>>>> --- a/xlators/cluster/afr/src/afr.h
>>>>>> +++ b/xlators/cluster/afr/src/afr.h
>>>>>> @@ -958,7 +958,10 @@ afr_inode_read_subvol_set(inode_t *inode, xlator_t *this,
>>>>>>                            int event_generation);
>>>>>>
>>>>>>  int
>>>>>> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this);
>>>>>> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
>>>>>> +
>>>>>> +int
>>>>>> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
>>>>>>
>>>>>>  int
>>>>>>  afr_read_subvol_select_by_policy(inode_t *inode, xlator_t *this,
>>>>>> --
>>>>>> 2.25.2
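
A side note on what the removed code was doing, for anyone reading the patch without the AFR sources handy: __afr_inode_event_gen_reset_small() manipulates AFR's packed per-inode read_subvol word, whose layout (per the masks and shifts visible in the diff) is the metadata readable map in bits 0-15, the data readable map in bits 16-31, and the event generation from bit 32 up. The following is a minimal standalone sketch of just that bit layout; the pack/unpack function names are illustrative, not Gluster API:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Bit layout taken from the masks/shifts in the removed
 * __afr_inode_event_gen_reset_small(): bits 0-15 hold the metadata
 * readable map, bits 16-31 the data readable map, bits 32+ the event
 * generation. Function names here are made up for illustration. */
static uint64_t
pack_read_subvol(uint16_t metadatamap, uint16_t datamap, uint32_t event)
{
    return ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
           (((uint64_t)event) << 32);
}

static void
unpack_read_subvol(uint64_t val, uint16_t *metadatamap, uint16_t *datamap,
                   uint32_t *event)
{
    *metadatamap = (val & 0x000000000000ffffULL) >> 0;
    *datamap = (val & 0x00000000ffff0000ULL) >> 16;
    *event = (uint32_t)(val >> 32);
}

int
main(void)
{
    uint16_t md, d;
    uint32_t ev;

    /* Pack a value, then zero only the event-generation field, which
     * is what the old reset helper did. */
    uint64_t val = pack_read_subvol(0x7, 0x7, 42);
    unpack_read_subvol(val, &md, &d, &ev);
    val = pack_read_subvol(md, d, 0); /* event gen forced to zero */
    printf("val after reset: 0x%016" PRIx64 "\n", val);
    return 0;
}

Zeroing the event-generation field this way discards the ordering ("causal order") information that in-flight read transactions depend on, which is why the patch replaces every such reset with setting the need_refresh flag instead.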
Erik Jacobson
2020-Apr-16 13:24 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> The patch by itself is only making changes specific to AFR, so it should not
> affect other translators. But I wonder how readdir-ahead is enabled in your
> gnfs stack. All performance xlators are turned off in gnfs except
> write-behind, and AFAIK there is no way to enable them via the CLI. Did you
> custom-edit your gnfs volfile to add readdir-ahead? If yes, does the crash
> go away if you remove the xlator from the nfs volfile?

Thank you. A quick reply: I will then go research how to do this; I've never hand-edited a volfile before. I've never even really looked at the gnfs volfile before. There are no custom code changes or hand edits. More soon.
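
For anyone landing on this thread with the same question, a quick way to check whether readdir-ahead ended up in the gnfs graph is to grep the generated volfile on a server. On a typical install glusterd writes it to /var/lib/glusterd/nfs/nfs-server.vol; the path and option name below are a best-guess sketch from standard setups, not something confirmed in this thread:

    # grep -n "readdir-ahead" /var/lib/glusterd/nfs/nfs-server.vol
    # gluster volume get <volname> performance.readdir-ahead

The running gnfs process also prints its full xlator graph into its log (typically /var/log/glusterfs/nfs.log) at startup, which is another place to confirm whether the translator is actually loaded.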