Erik Jacobson
2020-Apr-16 03:17 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
It is important to note that our testing has shown zero split-brain errors since the patch... and it is significantly harder to hit the seg fault than it was to hit split-brain before. It's still sufficiently frequent that we can't let it out the door. In my intensive test case (found elsewhere in the thread), it hit the problem at least once in every run with 57 nodes. With the patch: zero split-brain, but maybe 1 in 4 runs would seg fault. We didn't have a seg fault problem previously. This is all within the context of 1 of the 3 servers in the subvolume being down. I hit the seg fault once with just 57 nodes booting (using NFS for their root FS) and no other load.

Scott was able to take an analysis pass. Any suggestions? His words follow (a rough gdb sketch for the breakpoint idea he mentions is at the end of this message):


The segfault appears to occur in readdir-ahead functionality.  We will keep
the core in case it needs to be looked at again, being sure to copy off
all necessary metadata to maintain adequate symbol lookup within gdb.
It may also be possible to breakpoint immediately prior to the segfault,
but setting the right conditions may prove to be difficult.

A bit of analysis:

Prior to the segfault, the op_errno field in a struct rda_fd_ctx packet
shows an ENOENT error.  The packet is from the call_frame_t parameter of
rda_fill_fd_cbk() (Backtrace #2).  The following shows the progression
from the call_frame_t parameter to the op_errno field of the rda_fd_ctx
structure.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$26 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
  this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
        __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>,
      __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
    tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}

(gdb) print {struct rda_local}0x7fe5ac1dbc78
$27 = {ctx = 0x7fe5ace46590, fd = 0x7fe60433d8b8, xattrs = 0x0, inode = 0x0, offset = 0, generation = 0, skip_dir = 0}

(gdb) print {struct rda_fd_ctx}0x7fe5ace46590
$28 = {cur_offset = 0, cur_size = 638, next_offset = 1538, state = 36, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0,
        __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = '\000' <repeats 39 times>, __align = 0}}, entries = {{list = {next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}, {
        next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}}, d_ino = 0, d_off = 0, d_len = 0, d_type = 0, d_stat = {ia_flags = 0, ia_ino = 0,
      ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0,
      ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0, ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0,
      ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = '\000' <repeats 15 times>, ia_type = IA_INVAL, ia_prot = {suid = 0 '\000',
        sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000',
          write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}}, dict = 0x0, inode = 0x0,
    d_name = 0x7fe5ace466a8 ""}, fill_frame = 0x0, stub = 0x0, op_errno = 2, xattrs = 0x0, writes_during_prefetch = 0x0, prefetching = {
    lk = 0x7fe5ace466d0 "", value = 0}}

The segfault occurs at the bottom of rda_fill_fd_cbk() where the rpc
call stack frames are being destroyed.  The following are what I believe
to be the three frames that are intended to be destroyed, but it is
unclear which packet is causing the problem.  If I were to dig more into
this, I would use ddd (a graphical debugger).  It's been a while since I've
done low-level debugging like this, so I'm a bit rusty.

(gdb) print {call_frame_t}0x7fe5acf18eb8
$34 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 0x7fe5ac6d6cf0, prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78,
  this = 0x7fe63c0162b0, ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
        __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>,
      __align = 0}}, cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec = 4234, tv_nsec = 637078332}, end = {tv_sec = 4234,
    tv_nsec = 803882781}, wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
(gdb) print {call_frame_t}0x7fe5ac6d6ce0
$35 = {root = 0x0, parent = 0x563f5a955920, frames = {next = 0x7fe5ac096298, prev = 0x7fe5acf18ec8}, local = 0x0, this = 0x108a,
  ret = 0x25f90b3c, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 1586972324, __nusers = 0,
        __kind = 210092664, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = "\000\000\000\000\000\000\000\000\244F\227^\000\000\000\000x\302\205\f", '\000' <repeats 19 times>, __align = 0}},
  cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_nsec = 0}, end = {tv_sec = 0, tv_nsec = 0}, wind_from = 0x0,
  wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
(gdb) print {call_frame_t}0x7fe5ac096288
$36 = {root = 0x7fe5ac378860, parent = 0x7fe5acf18eb8, frames = {next = 0x7fe5acf18ec8, prev = 0x7fe5ac6d6cf0}, local = 0x0,
  this = 0x7fe63c014000, ret = 0x7fe63bb5d350 <rda_fill_fd_cbk>, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0,
        __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
      __size = '\000' <repeats 39 times>, __align = 0}}, cookie = 0x7fe5ac096288, complete = true, op = GF_FOP_READDIRP, begin = {
    tv_sec = 4234, tv_nsec = 637078816}, end = {tv_sec = 4234, tv_nsec = 803882755},
  wind_from = 0x7fe63bb5e8c0 <__FUNCTION__.22226> "rda_fill_fd", wind_to = 0x7fe63bb5e3f0 "(this->children->xlator)->fops->readdirp",
  unwind_from = 0x7fe63bdd8a80 <__FUNCTION__.20442> "afr_readdir_cbk", unwind_to = 0x7fe63bb5dfbb "rda_fill_fd_cbk"}


On 4/15/20 8:14 AM, Erik Jacobson wrote:
> Scott - I was going to start with gluster74 since that is what he
> started at but it applies well to gluster72 so I'll start there.
>
> Getting ready to go. The patch detail is interesting. This is probably
> why it took him a bit longer. It wasn't a trivial patch.

On Wed, Apr 15, 2020 at 12:45:57PM -0500, Erik Jacobson wrote:
> > The new split-brain issue is much harder to reproduce, but after several
> (correcting to say new seg fault issue, the split brain is gone!!)
>
> > intense runs, it usually hits once.
> >
> > We switched to pure gluster74 plus your patch so we're apples to apples
> > now.
> >
> > I'm going to see if Scott can help debug it.
> >
> > Here is the back trace info from the core dump:
> >
> > -rw-r----- 1 root root 1.9G Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
> > -rw-r----- 1 root root 221M Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000.lz4
> > drwxrwxrwt 9 root root 20K Apr 15 12:40 .
> > [root at leader3 tmp]#
> > [root at leader3 tmp]#
> > [root at leader3 tmp]# gdb core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
> > GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
> > Copyright (C) 2018 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.
> > Type "show copying" and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > Type "show configuration" for configuration details.
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>.
> > Find the GDB manual and other documentation resources online at:
> > <http://www.gnu.org/software/gdb/documentation/>.
> >
> > For help, type "help".
> > Type "apropos word" to search for commands related to "word"...
> > [New LWP 61102]
> > [New LWP 61085]
> > [New LWP 61087]
> > [New LWP 61117]
> > [New LWP 61086]
> > [New LWP 61108]
> > [New LWP 61089]
> > [New LWP 61090]
> > [New LWP 61121]
> > [New LWP 61088]
> > [New LWP 61091]
> > [New LWP 61093]
> > [New LWP 61095]
> > [New LWP 61092]
> > [New LWP 61094]
> > [New LWP 61098]
> > [New LWP 61096]
> > [New LWP 61097]
> > [New LWP 61084]
> > [New LWP 61100]
> > [New LWP 61103]
> > [New LWP 61104]
> > [New LWP 61099]
> > [New LWP 61105]
> > [New LWP 61101]
> > [New LWP 61106]
> > [New LWP 61109]
> > [New LWP 61107]
> > [New LWP 61112]
> > [New LWP 61119]
> > [New LWP 61110]
> > [New LWP 61111]
> > [New LWP 61118]
> > [New LWP 61123]
> > [New LWP 61122]
> > [New LWP 61113]
> > [New LWP 61114]
> > [New LWP 61120]
> > [New LWP 61116]
> > [New LWP 61115]
> >
> > warning: core file may not match specified executable file.
> > Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd-7.4-1.el8722.0800.200415T1052.a.rhel8hpeerikj.x86_64.debug...done.
> > done.
> >
> > warning: Ignoring non-absolute filename: <linux-vdso.so.1>
> > Missing separate debuginfo for linux-vdso.so.1
> > Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/06/44254f9cbaa826db070a796046026adba58266
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> >
> > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/n'.
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0 0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
> > at ../../../../libglusterfs/src/glusterfs/stack.h:193
> > 193 FRAME_DESTROY(frame);
> > [Current thread is 1 (Thread 0x7fe617fff700 (LWP 61102))]
> > Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 zlib-1.2.11-10.el8.x86_64
> > (gdb) bt
> > #0 0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
> > at ../../../../libglusterfs/src/glusterfs/stack.h:193
> > #1 STACK_DESTROY (stack=0x7fe5ac6d65f8)
> > at ../../../../libglusterfs/src/glusterfs/stack.h:193
> > #2 rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=<optimized out>,
> > this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=<optimized out>,
> > xdata=0x0) at readdir-ahead.c:623
> > #3 0x00007fe63bd6c3aa in afr_readdir_cbk (frame=<optimized out>,
> > cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
> > op_errno=<optimized out>, subvol_entries=<optimized out>, xdata=0x0)
> > at afr-dir-read.c:234
> > #4 0x00007fe6400a1e07 in client4_0_readdirp_cbk (req=<optimized out>,
> > iov=<optimized out>, count=<optimized out>, myframe=0x7fe5ace0eda8)
> > at client-rpc-fops_v2.c:2338
> > #5 0x00007fe6479ca115 in rpc_clnt_handle_reply (
> > clnt=clnt at entry=0x7fe63c0663f0, pollin=pollin at entry=0x7fe60c1737a0)
> > at rpc-clnt.c:764
> > #6 0x00007fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780,
> > mydata=0x7fe63c066420, event=<optimized out>, data=0x7fe60c1737a0)
> > at rpc-clnt.c:931
> > #7 0x00007fe6479c707b in rpc_transport_notify (
> > this=this at entry=0x7fe63c066780,
> > event=event at entry=RPC_TRANSPORT_MSG_RECEIVED,
> > data=data at entry=0x7fe60c1737a0) at rpc-transport.c:545
> > #8 0x00007fe640da893c in socket_event_poll_in_async (xl=<optimized out>,
> > async=0x7fe60c1738c8) at socket.c:2601
> > #9 0x00007fe640db03dc in gf_async (
> > cbk=0x7fe640da8910 <socket_event_poll_in_async>, xl=<optimized out>,
> > async=0x7fe60c1738c8) at ../../../../libglusterfs/src/glusterfs/async.h:189
> > #10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780)
> > at socket.c:2642
> > #11 socket_event_handler (fd=fd at entry=19, idx=idx at entry=10, gen=gen at entry=1,
> > data=data at entry=0x7fe63c066780, poll_in=<optimized out>,
> > poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000')
> > at socket.c:3040
> > #12 0x00007fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014,
> > event_pool=0x563f5a98c750) at event-epoll.c:650
> > #13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763
> > #14 0x00007fe6467a72de in start_thread () from /lib64/libpthread.so.0
> > #15 0x00007fe645fffa63 in clone () from /lib64/libc.so.6
> >
> >
> >
> > On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote:
> > > After several successful runs of the test case, we thought we were
> > > solved. Indeed, split-brain is gone.
> > >
> > > But we're triggering a seg fault now, even in a less loaded case.
> > >
> > > We're going to switch to gluster74, which was your intention, and report
> > > back.
> > >
> > > On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
> > > > > Attached the wrong patch by mistake in my previous mail. Sending the correct
> > > > > one now.
> > > >
> > > > Early results loook GREAT !!
> > > >
> > > > We'll keep beating on it. We applied it to glsuter72 as that is what we
> > > > have to ship with. It applied fine with some line moves.
> > > >
> > > > If you would like us to also run a test with gluster74 so that you can
> > > > say that's tested, we can run that test. I can do a special build.
> > > >
> > > > THANK YOU!!
> > > >
> > > > >
> > > > > -Ravi
> > > > >
> > > > >
> > > > > On 15/04/20 2:05 pm, Ravishankar N wrote:
> > > > >
> > > > >
> > > > > On 10/04/20 2:06 am, Erik Jacobson wrote:
> > > > >
> > > > > Once again thanks for sticking with us. Here is a reply from Scott
> > > > > Titus. If you have something for us to try, we'd love it. The code had
> > > > > your patch applied when gdb was run:
> > > > >
> > > > >
> > > > > Here is the addr2line output for those addresses. Very interesting
> > > > > command, of
> > > > > which I was not aware.
> > > > >
> > > > > [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > > > > cluster/
> > > > > afr.so 0x6f735
> > > > > afr_lookup_metadata_heal_check
> > > > > afr-common.c:2803
> > > > > [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > > > > cluster/
> > > > > afr.so 0x6f0b9
> > > > > afr_lookup_done
> > > > > afr-common.c:2455
> > > > > [root at leader3 ~]# addr2line -f -e/usr/lib64/glusterfs/7.2/xlator/
> > > > > cluster/
> > > > > afr.so 0x5c701
> > > > > afr_inode_event_gen_reset
> > > > > afr-common.c:755
> > > > >
> > > > >
> > > > > Right, so afr_lookup_done() is resetting the event gen to zero. This looks
> > > > > like a race between lookup and inode refresh code paths. We made some
> > > > > changes to the event generation logic in AFR. Can you apply the attached
> > > > > patch and see if it fixes the split-brain issue? It should apply cleanly on
> > > > > glusterfs-7.4.
> > > > > > > > > > Thanks, > > > > > Ravi > > > > > > > > > > > > > > > ________ > > > > > > > > > > > > > > > > > > > > Community Meeting Calendar: > > > > > > > > > > Schedule - > > > > > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > > > > > Bridge: https://bluejeans.com/441850968 > > > > > > > > > > Gluster-users mailing list > > > > > Gluster-users at gluster.org > > > > > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > > > > > > > > > > > >From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001 > > > > > From: Ravishankar N <ravishankar at redhat.com> > > > > > Date: Wed, 15 Apr 2020 13:53:26 +0530 > > > > > Subject: [PATCH] afr: event gen changes > > > > > > > > > > The general idea of the changes is to prevent resetting event generation > > > > > to zero in the inode ctx, since event gen is something that should > > > > > follow 'causal order'. > > > > > > > > > > Change #1: > > > > > For a read txn, in inode refresh cbk, if event_generation is > > > > > found zero, we are failing the read fop. This is not needed > > > > > because change in event gen is only a marker for the next inode refresh to > > > > > happen and should not be taken into account by the current read txn. > > > > > > > > > > Change #2: > > > > > The event gen being zero above can happen if there is a racing lookup, > > > > > which resets even get (in afr_lookup_done) if there are non zero afr > > > > > xattrs. The resetting is done only to trigger an inode refresh and a > > > > > possible client side heal on the next lookup. That can be acheived by > > > > > setting the need_refresh flag in the inode ctx. So replaced all > > > > > occurences of resetting even gen to zero with a call to > > > > > afr_inode_need_refresh_set(). > > > > > > > > > > Change #3: > > > > > In both lookup and discover path, we are doing an inode refresh which is > > > > > not required since all 3 essentially do the same thing- update the inode > > > > > ctx with the good/bad copies from the brick replies. Inode refresh also > > > > > triggers background heals, but I think it is okay to do it when we call > > > > > refresh during the read and write txns and not in the lookup path. 
> > > > > > > > > > Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1 > > > > > Signed-off-by: Ravishankar N <ravishankar at redhat.com> > > > > > --- > > > > > ...ismatch-resolution-with-fav-child-policy.t | 8 +- > > > > > xlators/cluster/afr/src/afr-common.c | 92 ++++--------------- > > > > > xlators/cluster/afr/src/afr-dir-write.c | 6 +- > > > > > xlators/cluster/afr/src/afr.h | 5 +- > > > > > 4 files changed, 29 insertions(+), 82 deletions(-) > > > > > > > > > > diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t > > > > > index f4aa351e4..12af0c854 100644 > > > > > --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t > > > > > +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t > > > > > @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ] > > > > > #We know that second brick has the bigger size file > > > > > BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\ -f1) > > > > > > > > > > -TEST ls $M0/f3 > > > > > -TEST cat $M0/f3 > > > > > +TEST ls $M0 #Trigger entry heal via readdir inode refresh > > > > > +TEST cat $M0/f3 #Trigger data heal via readv inode refresh > > > > > EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0 > > > > > > > > > > #gfid split-brain should be resolved > > > > > @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force > > > > > EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2 > > > > > EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2 > > > > > > > > > > -TEST ls $M0/f4 > > > > > -TEST cat $M0/f4 > > > > > +TEST ls $M0 #Trigger entry heal via readdir inode refresh > > > > > +TEST cat $M0/f4 #Trigger data heal via readv inode refresh > > > > > EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0 > > > > > > > > > > #gfid split-brain should be resolved > > > > > diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c > > > > > index 61f21795e..319665a14 100644 > > > > > --- a/xlators/cluster/afr/src/afr-common.c > > > > > +++ b/xlators/cluster/afr/src/afr-common.c > > > > > @@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local, > > > > > metadatamap |= (1 << index); > > > > > } > > > > > if (metadatamap_old != metadatamap) { > > > > > - event = 0; > > > > > + __afr_inode_need_refresh_set(inode, this); > > > > > } > > > > > break; > > > > > > > > > > @@ -295,7 +295,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local, > > > > > datamap |= (1 << index); > > > > > } > > > > > if (datamap_old != datamap) > > > > > - event = 0; > > > > > + __afr_inode_need_refresh_set(inode, this); > > > > > break; > > > > > > > > > > default: > > > > > @@ -458,34 +458,6 @@ out: > > > > > return ret; > > > > > } > > > > > > > > > > -int > > > > > -__afr_inode_event_gen_reset_small(inode_t *inode, xlator_t *this) > > > > > -{ > > > > > - int ret = -1; > > > > > - uint16_t datamap = 0; > > > > > - uint16_t metadatamap = 0; > > > > > - uint32_t event = 0; > > > > > - uint64_t val = 0; > > > > > - afr_inode_ctx_t *ctx = NULL; > > > > > - > > > > > - ret = __afr_inode_ctx_get(this, inode, &ctx); > > > > > - if (ret) > > > > > - return ret; > > > > > - > > > > > - val = ctx->read_subvol; > > > > > - > > > > > - metadatamap = (val & 0x000000000000ffff) >> 0; > > > > > - datamap = (val & 0x00000000ffff0000) >> 16; > > > > > - event = 0; > > > > > - > > > > > - val = ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) | > > > > > - 
(((uint64_t)event) << 32); > > > > > - > > > > > - ctx->read_subvol = val; > > > > > - > > > > > - return ret; > > > > > -} > > > > > - > > > > > int > > > > > __afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data, > > > > > unsigned char *metadata, int *event_p) > > > > > @@ -556,22 +528,6 @@ out: > > > > > return ret; > > > > > } > > > > > > > > > > -int > > > > > -__afr_inode_event_gen_reset(inode_t *inode, xlator_t *this) > > > > > -{ > > > > > - afr_private_t *priv = NULL; > > > > > - int ret = -1; > > > > > - > > > > > - priv = this->private; > > > > > - > > > > > - if (priv->child_count <= 16) > > > > > - ret = __afr_inode_event_gen_reset_small(inode, this); > > > > > - else > > > > > - ret = -1; > > > > > - > > > > > - return ret; > > > > > -} > > > > > - > > > > > int > > > > > afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data, > > > > > unsigned char *metadata, int *event_p) > > > > > @@ -721,30 +677,22 @@ out: > > > > > return need_refresh; > > > > > } > > > > > > > > > > -static int > > > > > -afr_inode_need_refresh_set(inode_t *inode, xlator_t *this) > > > > > +int > > > > > +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this) > > > > > { > > > > > int ret = -1; > > > > > afr_inode_ctx_t *ctx = NULL; > > > > > > > > > > - GF_VALIDATE_OR_GOTO(this->name, inode, out); > > > > > - > > > > > - LOCK(&inode->lock); > > > > > - { > > > > > - ret = __afr_inode_ctx_get(this, inode, &ctx); > > > > > - if (ret) > > > > > - goto unlock; > > > > > - > > > > > + ret = __afr_inode_ctx_get(this, inode, &ctx); > > > > > + if (ret == 0) { > > > > > ctx->need_refresh = _gf_true; > > > > > } > > > > > -unlock: > > > > > - UNLOCK(&inode->lock); > > > > > -out: > > > > > + > > > > > return ret; > > > > > } > > > > > > > > > > int > > > > > -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this) > > > > > +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this) > > > > > { > > > > > int ret = -1; > > > > > > > > > > @@ -754,7 +702,7 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this) > > > > > "Resetting event gen for %s", uuid_utoa(inode->gfid)); > > > > > LOCK(&inode->lock); > > > > > { > > > > > - ret = __afr_inode_event_gen_reset(inode, this); > > > > > + ret = __afr_inode_need_refresh_set(inode, this); > > > > > } > > > > > UNLOCK(&inode->lock); > > > > > out: > > > > > @@ -1187,7 +1135,7 @@ afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err) > > > > > ret = afr_inode_get_readable(frame, inode, this, local->readable, > > > > > &event_generation, local->transaction.type); > > > > > > > > > > - if (ret == -EIO || (local->is_read_txn && !event_generation)) { > > > > > + if (ret == -EIO) { > > > > > /* No readable subvolume even after refresh ==> splitbrain.*/ > > > > > if (!priv->fav_child_policy) { > > > > > err = EIO; > > > > > @@ -2451,7 +2399,7 @@ afr_lookup_done(call_frame_t *frame, xlator_t *this) > > > > > if (read_subvol == -1) > > > > > goto cant_interpret; > > > > > if (ret) { > > > > > - afr_inode_event_gen_reset(local->inode, this); > > > > > + afr_inode_need_refresh_set(local->inode, this); > > > > > dict_del_sizen(local->replies[read_subvol].xdata, GF_CONTENT_KEY); > > > > > } > > > > > } else { > > > > > @@ -3007,6 +2955,7 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this) > > > > > afr_private_t *priv = NULL; > > > > > afr_local_t *local = NULL; > > > > > int read_subvol = -1; > > > > > + int ret = 0; > > > > > unsigned char *data_readable = NULL; > > > > > unsigned char 
*success_replies = NULL; > > > > > > > > > > @@ -3028,7 +2977,10 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this) > > > > > if (!afr_has_quorum(success_replies, this, frame)) > > > > > goto unwind; > > > > > > > > > > - afr_replies_interpret(frame, this, local->inode, NULL); > > > > > + ret = afr_replies_interpret(frame, this, local->inode, NULL); > > > > > + if (ret) { > > > > > + afr_inode_need_refresh_set(local->inode, this); > > > > > + } > > > > > > > > > > read_subvol = afr_read_subvol_decide(local->inode, this, NULL, > > > > > data_readable); > > > > > @@ -3284,11 +3236,7 @@ afr_discover(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req) > > > > > afr_read_subvol_get(loc->inode, this, NULL, NULL, &event, > > > > > AFR_DATA_TRANSACTION, NULL); > > > > > > > > > > - if (afr_is_inode_refresh_reqd(loc->inode, this, event, > > > > > - local->event_generation)) > > > > > - afr_inode_refresh(frame, this, loc->inode, NULL, afr_discover_do); > > > > > - else > > > > > - afr_discover_do(frame, this, 0); > > > > > + afr_discover_do(frame, this, 0); > > > > > > > > > > return 0; > > > > > out: > > > > > @@ -3429,11 +3377,7 @@ afr_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req) > > > > > afr_read_subvol_get(loc->parent, this, NULL, NULL, &event, > > > > > AFR_DATA_TRANSACTION, NULL); > > > > > > > > > > - if (afr_is_inode_refresh_reqd(loc->inode, this, event, > > > > > - local->event_generation)) > > > > > - afr_inode_refresh(frame, this, loc->parent, NULL, afr_lookup_do); > > > > > - else > > > > > - afr_lookup_do(frame, this, 0); > > > > > + afr_lookup_do(frame, this, 0); > > > > > > > > > > return 0; > > > > > out: > > > > > diff --git a/xlators/cluster/afr/src/afr-dir-write.c b/xlators/cluster/afr/src/afr-dir-write.c > > > > > index 82a72fddd..333085b14 100644 > > > > > --- a/xlators/cluster/afr/src/afr-dir-write.c > > > > > +++ b/xlators/cluster/afr/src/afr-dir-write.c > > > > > @@ -119,11 +119,11 @@ __afr_dir_write_finalize(call_frame_t *frame, xlator_t *this) > > > > > continue; > > > > > if (local->replies[i].op_ret < 0) { > > > > > if (local->inode) > > > > > - afr_inode_event_gen_reset(local->inode, this); > > > > > + afr_inode_need_refresh_set(local->inode, this); > > > > > if (local->parent) > > > > > - afr_inode_event_gen_reset(local->parent, this); > > > > > + afr_inode_need_refresh_set(local->parent, this); > > > > > if (local->parent2) > > > > > - afr_inode_event_gen_reset(local->parent2, this); > > > > > + afr_inode_need_refresh_set(local->parent2, this); > > > > > continue; > > > > > } > > > > > > > > > > diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h > > > > > index a3f2942b3..ed6d777c1 100644 > > > > > --- a/xlators/cluster/afr/src/afr.h > > > > > +++ b/xlators/cluster/afr/src/afr.h > > > > > @@ -958,7 +958,10 @@ afr_inode_read_subvol_set(inode_t *inode, xlator_t *this, > > > > > int event_generation); > > > > > > > > > > int > > > > > -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this); > > > > > +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this); > > > > > + > > > > > +int > > > > > +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this); > > > > > > > > > > int > > > > > afr_read_subvol_select_by_policy(inode_t *inode, xlator_t *this, > > > > > -- > > > > > 2.25.2 > > > > > > > > > > > > >
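
Following up on Scott's note above about breakpointing just before the segfault, here is a rough sketch of what we might try on a live gnfs process. This is only a sketch: the line number comes from the backtrace above, the condition is a guess keyed off the ENOENT (op_errno = 2) we saw in the rda_fd_ctx, and it assumes the readdir-ahead debuginfo is installed so gdb can resolve the types:

    # attach to the gnfs daemon (the glusterfs process started with --volfile-id gluster/nfs)
    gdb -p $(pgrep -f 'volfile-id gluster/nfs')
    (gdb) break readdir-ahead.c:623 if ((struct rda_local *)frame->local)->ctx->op_errno == 2
    (gdb) continue

If the condition never fires before a crash, an unconditional breakpoint on rda_fill_fd_cbk and a manual look at frame->local would be the fallback.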
Ravishankar N
2020-Apr-16 07:03 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
The patch by itself only makes changes specific to AFR, so it should not affect other translators. But I wonder how readdir-ahead is enabled in your gnfs stack. All performance xlators are turned off in gnfs except write-behind, and AFAIK there is no way to enable them via the CLI.

Did you custom-edit your gnfs volfile to add readdir-ahead? If yes, does the crash go away if you remove the xlator from the nfs volfile?
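
If it did get added there, the gnfs volfile (typically /var/lib/glusterd/nfs/nfs-server.vol on the servers) would carry a stanza along these lines; the names below are only illustrative, a rough sketch rather than your actual graph:

    volume VOLNAME-readdir-ahead
        type performance/readdir-ahead
        subvolumes VOLNAME-replicate-0
    end-volume

The xlator above it would then name VOLNAME-readdir-ahead in its own 'subvolumes' line, so removing the stanza also means pointing that parent back at the replicate subvolume. A quick 'grep readdir-ahead /var/lib/glusterd/nfs/nfs-server.vol' should tell you whether it is in the graph at all.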

Regards,
Ravi

On 16/04/20 8:47 am, Erik Jacobson wrote:
> [...]