Ravishankar N
2020-Apr-16 07:03 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
The patch by itself is only making changes specific to AFR, so it should not affect other translators. But I wonder how readdir-ahead is enabled in your gnfs stack. All performance xlators are turned off in gnfs except write-behind, and AFAIK there is no way to enable them via the CLI. Did you custom-edit your gnfs volfile to add readdir-ahead? If yes, does the crash go away if you remove the xlator from the nfs volfile?

Regards,
Ravi

On 16/04/20 8:47 am, Erik Jacobson wrote:
> It is important to note that our testing has shown zero split-brain
> errors since the patch... And that it is significantly harder to
> hit the seg fault than it was to hit split-brain before. It's still
> sufficiently frequent that we can't let it out the door. In my intensive
> test case (found elsewhere in the thread), it would 100% hit the problem
> with 57 nodes every time at least once. With the patch, zero split
> brain, but maybe 1 in 4 runs would seg fault. We didn't have a seg
> fault problem previously. This is all within the context of 1 of the 3
> servers in the subvolume being down. I hit the seg fault once with just
> 57 nodes booting (using NFS for their root FS) and no other load.
>
>
> Scott was able to take an analysis pass. Any suggestions? His words
> follow:
>
>
> The segfault appears to occur in read-ahead functionality. We will keep
> the core in case it needs to be looked at again, being sure to copy off
> all necessary metadata to maintain adequate symbol lookup within gdb.
> It may also be possible to breakpoint immediately prior to the segfault,
> but setting the right conditions may prove to be difficult.
>
> A bit of analysis:
>
> Prior to the segfault, the op_errno field in a struct rda_fd_ctx packet
> shows an ENOENT error. The packet is from the call_frame_t parameter of
> rda_fill_fd_cbk() (Backtrace #2). The following shows the progression
> from the call_frame_t parameter to the op_errno field of the rda_fd_ctx
> structure.
>
> (gdb) print {call_frame_t}0x7fe5acf18eb8
> $26 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 0x7fe5ac6d6cf0,
>     prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78, this = 0x7fe63c0162b0,
>   ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {
>         __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
>         __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}}, cookie = 0x0,
>   complete = false, op = GF_FOP_NULL, begin = {tv_sec = 4234,
>     tv_nsec = 637078332}, end = {tv_sec = 4234, tv_nsec = 803882781},
>   wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
>
> (gdb) print {struct rda_local}0x7fe5ac1dbc78
> $27 = {ctx = 0x7fe5ace46590, fd = 0x7fe60433d8b8, xattrs = 0x0,
>   inode = 0x0, offset = 0, generation = 0, skip_dir = 0}
>
> (gdb) print {struct rda_fd_ctx}0x7fe5ace46590
> $28 = {cur_offset = 0, cur_size = 638, next_offset = 1538, state = 36,
>   lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0,
>         __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0,
>         __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}}, entries = {
>     {list = {next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}, {
>         next = 0x7fe60cda5f90, prev = 0x7fe60ca08190}}, d_ino = 0,
>     d_off = 0, d_len = 0, d_type = 0, d_stat = {ia_flags = 0, ia_ino = 0,
>       ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0,
>       ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0,
>       ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0,
>       ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0,
>       ia_attributes = 0, ia_attributes_mask = 0,
>       ia_gfid = '\000' <repeats 15 times>, ia_type = IA_INVAL,
>       ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
>         owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
>         group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
>         other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}},
>     dict = 0x0, inode = 0x0, d_name = 0x7fe5ace466a8 ""},
>   fill_frame = 0x0, stub = 0x0, op_errno = 2, xattrs = 0x0,
>   writes_during_prefetch = 0x0, prefetching = {
>     lk = 0x7fe5ace466d0 "", value = 0}}
>
> The segfault occurs at the bottom of rda_fill_fd_cbk() where the rpc
> call stack frames are being destroyed. The following are what I believe
> to be the three frames that are intended to be destroyed, but it is
> unclear which packet is causing the problem. If I were to dig more into
> this, I would use ddd (graphical debugger). It's been a while since I've
> done low level debugging like this, so I'm a bit rusty.
>
> (gdb) print {call_frame_t}0x7fe5acf18eb8
> $34 = {root = 0x7fe5ac6d65f8, parent = 0x0, frames = {next = 0x7fe5ac6d6cf0,
>     prev = 0x7fe5ac096298}, local = 0x7fe5ac1dbc78, this = 0x7fe63c0162b0,
>   ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {
>         __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
>         __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}}, cookie = 0x0,
>   complete = false, op = GF_FOP_NULL, begin = {tv_sec = 4234,
>     tv_nsec = 637078332}, end = {tv_sec = 4234, tv_nsec = 803882781},
>   wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
> (gdb) print {call_frame_t}0x7fe5ac6d6ce0
> $35 = {root = 0x0, parent = 0x563f5a955920, frames = {next = 0x7fe5ac096298,
>     prev = 0x7fe5acf18ec8}, local = 0x0, this = 0x108a, ret = 0x25f90b3c,
>   ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0,
>         __count = 0, __owner = 1586972324, __nusers = 0,
>         __kind = 210092664, __spins = 0, __elision = 0,
>         __list = {__prev = 0x0, __next = 0x0}},
>       __size = "\000\000\000\000\000\000\000\000\244F\227^\000\000\000\000x\302\205\f", '\000' <repeats 19 times>, __align = 0}},
>   cookie = 0x0, complete = false, op = GF_FOP_NULL, begin = {tv_sec = 0,
>     tv_nsec = 0}, end = {tv_sec = 0, tv_nsec = 0}, wind_from = 0x0,
>   wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
> (gdb) print {call_frame_t}0x7fe5ac096288
> $36 = {root = 0x7fe5ac378860, parent = 0x7fe5acf18eb8, frames = {next = 0x7fe5acf18ec8,
>     prev = 0x7fe5ac6d6cf0}, local = 0x0, this = 0x7fe63c014000,
>   ret = 0x7fe63bb5d350 <rda_fill_fd_cbk>, ref_count = 0, lock = {
>     spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
>         __nusers = 0, __kind = 0, __spins = 0, __elision = 0,
>         __list = {__prev = 0x0, __next = 0x0}},
>       __size = '\000' <repeats 39 times>, __align = 0}},
>   cookie = 0x7fe5ac096288, complete = true, op = GF_FOP_READDIRP,
>   begin = {tv_sec = 4234, tv_nsec = 637078816}, end = {tv_sec = 4234,
>     tv_nsec = 803882755},
>   wind_from = 0x7fe63bb5e8c0 <__FUNCTION__.22226> "rda_fill_fd",
>   wind_to = 0x7fe63bb5e3f0 "(this->children->xlator)->fops->readdirp",
>   unwind_from = 0x7fe63bdd8a80 <__FUNCTION__.20442> "afr_readdir_cbk",
>   unwind_to = 0x7fe63bb5dfbb "rda_fill_fd_cbk"}
>
>
> On 4/15/20 8:14 AM, Erik Jacobson wrote:
>> Scott - I was going to start with gluster74 since that is what he
>> started at, but it applies well to gluster72 so I'll start there.
>>
>> Getting ready to go. The patch detail is interesting. This is probably
>> why it took him a bit longer. It wasn't a trivial patch.
>
>
> On Wed, Apr 15, 2020 at 12:45:57PM -0500, Erik Jacobson wrote:
>>> The new split-brain issue is much harder to reproduce, but after several
>> (correcting to say new seg fault issue, the split brain is gone!!)
>>> intense runs, it usually hits once.
>>>
>>> We switched to pure gluster74 plus your patch so we're apples to apples
>>> now.
>>>
>>> I'm going to see if Scott can help debug it.
>>>
>>> Here is the back trace info from the core dump:
>>>
>>> -rw-r----- 1 root root 1.9G Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
>>> -rw-r----- 1 root root 221M Apr 15 12:40 core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000.lz4
>>> drwxrwxrwt 9 root root 20K Apr 15 12:40 .
>>> [root@leader3 tmp]#
>>> [root@leader3 tmp]#
>>> [root@leader3 tmp]# gdb core.glusterfs.0.52467a7e67964553aa9971eb2bb0148c.61084.1586972324000000
>>> GNU gdb (GDB) Red Hat Enterprise Linux 8.2-5.el8
>>> Copyright (C) 2018 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.
>>> Type "show copying" and "show warranty" for details.
>>> This GDB was configured as "x86_64-redhat-linux-gnu".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>.
>>> Find the GDB manual and other documentation resources online at:
>>> <http://www.gnu.org/software/gdb/documentation/>.
>>>
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> [New LWP 61102]
>>> [New LWP 61085]
>>> [New LWP 61087]
>>> [New LWP 61117]
>>> [New LWP 61086]
>>> [New LWP 61108]
>>> [New LWP 61089]
>>> [New LWP 61090]
>>> [New LWP 61121]
>>> [New LWP 61088]
>>> [New LWP 61091]
>>> [New LWP 61093]
>>> [New LWP 61095]
>>> [New LWP 61092]
>>> [New LWP 61094]
>>> [New LWP 61098]
>>> [New LWP 61096]
>>> [New LWP 61097]
>>> [New LWP 61084]
>>> [New LWP 61100]
>>> [New LWP 61103]
>>> [New LWP 61104]
>>> [New LWP 61099]
>>> [New LWP 61105]
>>> [New LWP 61101]
>>> [New LWP 61106]
>>> [New LWP 61109]
>>> [New LWP 61107]
>>> [New LWP 61112]
>>> [New LWP 61119]
>>> [New LWP 61110]
>>> [New LWP 61111]
>>> [New LWP 61118]
>>> [New LWP 61123]
>>> [New LWP 61122]
>>> [New LWP 61113]
>>> [New LWP 61114]
>>> [New LWP 61120]
>>> [New LWP 61116]
>>> [New LWP 61115]
>>>
>>> warning: core file may not match specified executable file.
>>> Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd-7.4-1.el8722.0800.200415T1052.a.rhel8hpeerikj.x86_64.debug...done.
>>> done.
>>>
>>> warning: Ignoring non-absolute filename: <linux-vdso.so.1>
>>> Missing separate debuginfo for linux-vdso.so.1
>>> Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/06/44254f9cbaa826db070a796046026adba58266
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>>
>>> warning: Loadable section ".note.gnu.property" outside of ELF segments
>>> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/n'.
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
>>>     at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> 193             FRAME_DESTROY(frame);
>>> [Current thread is 1 (Thread 0x7fe617fff700 (LWP 61102))]
>>> Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-42.el8.x86_64 keyutils-libs-1.5.10-6.el8.x86_64 krb5-libs-1.16.1-22.el8.x86_64 libacl-2.2.53-1.el8.x86_64 libattr-2.4.48-3.el8.x86_64 libcom_err-1.44.3-2.el8.x86_64 libgcc-8.2.1-3.5.el8.x86_64 libselinux-2.8-6.el8.x86_64 libtirpc-1.1.4-3.el8.x86_64 libuuid-2.32.1-8.el8.x86_64 openssl-libs-1.1.1-8.el8.x86_64 pcre2-10.32-1.el8.x86_64 zlib-1.2.11-10.el8.x86_64
>>> (gdb) bt
>>> #0  0x00007fe63bb5d7bb in FRAME_DESTROY (frame=0x7fe5ac096288)
>>>     at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> #1  STACK_DESTROY (stack=0x7fe5ac6d65f8)
>>>     at ../../../../libglusterfs/src/glusterfs/stack.h:193
>>> #2  rda_fill_fd_cbk (frame=0x7fe5acf18eb8, cookie=<optimized out>,
>>>     this=0x7fe63c0162b0, op_ret=3, op_errno=0, entries=<optimized out>,
>>>     xdata=0x0) at readdir-ahead.c:623
>>> #3  0x00007fe63bd6c3aa in afr_readdir_cbk (frame=<optimized out>,
>>>     cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
>>>     op_errno=<optimized out>, subvol_entries=<optimized out>, xdata=0x0)
>>>     at afr-dir-read.c:234
>>> #4  0x00007fe6400a1e07 in client4_0_readdirp_cbk (req=<optimized out>,
>>>     iov=<optimized out>, count=<optimized out>, myframe=0x7fe5ace0eda8)
>>>     at client-rpc-fops_v2.c:2338
>>> #5  0x00007fe6479ca115 in rpc_clnt_handle_reply (
>>>     clnt=clnt@entry=0x7fe63c0663f0, pollin=pollin@entry=0x7fe60c1737a0)
>>>     at rpc-clnt.c:764
>>> #6  0x00007fe6479ca4b3 in rpc_clnt_notify (trans=0x7fe63c066780,
>>>     mydata=0x7fe63c066420, event=<optimized out>, data=0x7fe60c1737a0)
>>>     at rpc-clnt.c:931
>>> #7  0x00007fe6479c707b in rpc_transport_notify (
>>>     this=this@entry=0x7fe63c066780,
>>>     event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
>>>     data=data@entry=0x7fe60c1737a0) at rpc-transport.c:545
>>> #8  0x00007fe640da893c in socket_event_poll_in_async (xl=<optimized out>,
>>>     async=0x7fe60c1738c8) at socket.c:2601
>>> #9  0x00007fe640db03dc in gf_async (
>>>     cbk=0x7fe640da8910 <socket_event_poll_in_async>, xl=<optimized out>,
>>>     async=0x7fe60c1738c8) at ../../../../libglusterfs/src/glusterfs/async.h:189
>>> #10 socket_event_poll_in (notify_handled=true, this=0x7fe63c066780)
>>>     at socket.c:2642
>>> #11 socket_event_handler (fd=fd@entry=19, idx=idx@entry=10, gen=gen@entry=1,
>>>     data=data@entry=0x7fe63c066780, poll_in=<optimized out>,
>>>     poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000')
>>>     at socket.c:3040
>>> #12 0x00007fe647c84a5b in event_dispatch_epoll_handler (event=0x7fe617ffe014,
>>>     event_pool=0x563f5a98c750) at event-epoll.c:650
>>> #13 event_dispatch_epoll_worker (data=0x7fe63c063b60) at event-epoll.c:763
>>> #14 0x00007fe6467a72de in start_thread () from /lib64/libpthread.so.0
>>> #15 0x00007fe645fffa63 in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> On Wed, Apr 15, 2020 at 10:39:34AM -0500, Erik Jacobson wrote:
>>>> After several successful runs of the test case, we thought we were
>>>> solved. Indeed, split-brain is gone.
>>>>
>>>> But we're triggering a seg fault now, even in a less loaded case.
>>>>
>>>> We're going to switch to gluster74, which was your intention, and report
>>>> back.
>>>>
>>>> On Wed, Apr 15, 2020 at 10:33:01AM -0500, Erik Jacobson wrote:
>>>>>> Attached the wrong patch by mistake in my previous mail. Sending the correct
>>>>>> one now.
>>>>> Early results look GREAT!!
>>>>>
>>>>> We'll keep beating on it. We applied it to gluster72 as that is what we
>>>>> have to ship with. It applied fine with some line moves.
>>>>>
>>>>> If you would like us to also run a test with gluster74 so that you can
>>>>> say that's tested, we can run that test. I can do a special build.
>>>>>
>>>>> THANK YOU!!
>>>>>>
>>>>>> -Ravi
>>>>>>
>>>>>>
>>>>>> On 15/04/20 2:05 pm, Ravishankar N wrote:
>>>>>>
>>>>>>
>>>>>> On 10/04/20 2:06 am, Erik Jacobson wrote:
>>>>>>
>>>>>> Once again thanks for sticking with us. Here is a reply from Scott
>>>>>> Titus. If you have something for us to try, we'd love it. The code had
>>>>>> your patch applied when gdb was run:
>>>>>>
>>>>>>
>>>>>> Here is the addr2line output for those addresses. Very interesting
>>>>>> command, of which I was not aware.
>>>>>>
>>>>>> [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
>>>>>> afr_lookup_metadata_heal_check
>>>>>> afr-common.c:2803
>>>>>> [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
>>>>>> afr_lookup_done
>>>>>> afr-common.c:2455
>>>>>> [root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
>>>>>> afr_inode_event_gen_reset
>>>>>> afr-common.c:755
>>>>>>
>>>>>>
>>>>>> Right, so afr_lookup_done() is resetting the event gen to zero. This looks
>>>>>> like a race between lookup and inode refresh code paths. We made some
>>>>>> changes to the event generation logic in AFR. Can you apply the attached
>>>>>> patch and see if it fixes the split-brain issue? It should apply cleanly on
>>>>>> glusterfs-7.4.
>>>>>>
>>>>>> Thanks,
>>>>>> Ravi
>>>>>>
>>>>>> From 11601e709a97ce7c40078866bf5d24b486f39454 Mon Sep 17 00:00:00 2001
>>>>>> From: Ravishankar N <ravishankar@redhat.com>
>>>>>> Date: Wed, 15 Apr 2020 13:53:26 +0530
>>>>>> Subject: [PATCH] afr: event gen changes
>>>>>>
>>>>>> The general idea of the changes is to prevent resetting event generation
>>>>>> to zero in the inode ctx, since event gen is something that should
>>>>>> follow 'causal order'.
>>>>>>
>>>>>> Change #1:
>>>>>> For a read txn, in inode refresh cbk, if event_generation is
>>>>>> found zero, we are failing the read fop. This is not needed
>>>>>> because change in event gen is only a marker for the next inode refresh to
>>>>>> happen and should not be taken into account by the current read txn.
>>>>>>
>>>>>> Change #2:
>>>>>> The event gen being zero above can happen if there is a racing lookup,
>>>>>> which resets event gen (in afr_lookup_done) if there are non-zero afr
>>>>>> xattrs. The resetting is done only to trigger an inode refresh and a
>>>>>> possible client side heal on the next lookup. That can be achieved by
>>>>>> setting the need_refresh flag in the inode ctx. So replaced all
>>>>>> occurrences of resetting event gen to zero with a call to
>>>>>> afr_inode_need_refresh_set().
>>>>>>
>>>>>> Change #3:
>>>>>> In both lookup and discover path, we are doing an inode refresh which is
>>>>>> not required since all 3 essentially do the same thing: update the inode
>>>>>> ctx with the good/bad copies from the brick replies. Inode refresh also
>>>>>> triggers background heals, but I think it is okay to do it when we call
>>>>>> refresh during the read and write txns and not in the lookup path.
>>>>>>
>>>>>> Change-Id: Id0600dd34b144b4ae7a3bf3c397551adf7e402f1
>>>>>> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
>>>>>> ---
>>>>>>  ...ismatch-resolution-with-fav-child-policy.t |  8 +-
>>>>>>  xlators/cluster/afr/src/afr-common.c          | 92 ++++---------------
>>>>>>  xlators/cluster/afr/src/afr-dir-write.c       |  6 +-
>>>>>>  xlators/cluster/afr/src/afr.h                 |  5 +-
>>>>>>  4 files changed, 29 insertions(+), 82 deletions(-)
>>>>>>
>>>>>> diff --git a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> index f4aa351e4..12af0c854 100644
>>>>>> --- a/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> +++ b/tests/basic/afr/gfid-mismatch-resolution-with-fav-child-policy.t
>>>>>> @@ -168,8 +168,8 @@ TEST [ "$gfid_1" != "$gfid_2" ]
>>>>>>  #We know that second brick has the bigger size file
>>>>>>  BIGGER_FILE_MD5=$(md5sum $B0/${V0}1/f3 | cut -d\  -f1)
>>>>>>
>>>>>> -TEST ls $M0/f3
>>>>>> -TEST cat $M0/f3
>>>>>> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
>>>>>> +TEST cat $M0/f3 #Trigger data heal via readv inode refresh
>>>>>>  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>>>>>>
>>>>>>  #gfid split-brain should be resolved
>>>>>> @@ -215,8 +215,8 @@ TEST $CLI volume start $V0 force
>>>>>>  EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}2
>>>>>>  EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 2
>>>>>>
>>>>>> -TEST ls $M0/f4
>>>>>> -TEST cat $M0/f4
>>>>>> +TEST ls $M0 #Trigger entry heal via readdir inode refresh
>>>>>> +TEST cat $M0/f4 #Trigger data heal via readv inode refresh
>>>>>>  EXPECT_WITHIN $HEAL_TIMEOUT "^0$" get_pending_heal_count $V0
>>>>>>
>>>>>>  #gfid split-brain should be resolved
>>>>>> diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
>>>>>> index 61f21795e..319665a14 100644
>>>>>> --- a/xlators/cluster/afr/src/afr-common.c
>>>>>> +++ b/xlators/cluster/afr/src/afr-common.c
>>>>>> @@ -282,7 +282,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>>>>>>              metadatamap |= (1 << index);
>>>>>>          }
>>>>>>          if (metadatamap_old != metadatamap) {
>>>>>> -            event = 0;
>>>>>> +            __afr_inode_need_refresh_set(inode, this);
>>>>>>          }
>>>>>>          break;
>>>>>>
>>>>>> @@ -295,7 +295,7 @@ __afr_set_in_flight_sb_status(xlator_t *this, afr_local_t *local,
>>>>>>              datamap |= (1 << index);
>>>>>>          }
>>>>>>          if (datamap_old != datamap)
>>>>>> -            event = 0;
>>>>>> +            __afr_inode_need_refresh_set(inode, this);
>>>>>>          break;
>>>>>>
>>>>>>      default:
>>>>>> @@ -458,34 +458,6 @@ out:
>>>>>>      return ret;
>>>>>>  }
>>>>>>
>>>>>> -int
>>>>>> -__afr_inode_event_gen_reset_small(inode_t *inode, xlator_t *this)
>>>>>> -{
>>>>>> -    int ret = -1;
>>>>>> -    uint16_t datamap = 0;
>>>>>> -    uint16_t metadatamap = 0;
>>>>>> -    uint32_t event = 0;
>>>>>> -    uint64_t val = 0;
>>>>>> -    afr_inode_ctx_t *ctx = NULL;
>>>>>> -
>>>>>> -    ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> -    if (ret)
>>>>>> -        return ret;
>>>>>> -
>>>>>> -    val = ctx->read_subvol;
>>>>>> -
>>>>>> -    metadatamap = (val & 0x000000000000ffff) >> 0;
>>>>>> -    datamap = (val & 0x00000000ffff0000) >> 16;
>>>>>> -    event = 0;
>>>>>> -
>>>>>> -    val = ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
>>>>>> -          (((uint64_t)event) << 32);
>>>>>> -
>>>>>> -    ctx->read_subvol = val;
>>>>>> -
>>>>>> -    return ret;
>>>>>> -}
>>>>>> -
>>>>>>  int
>>>>>>  __afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>>>>>>                              unsigned char *metadata, int *event_p)
>>>>>> @@ -556,22 +528,6 @@ out:
>>>>>>      return ret;
>>>>>>  }
>>>>>>
>>>>>> -int
>>>>>> -__afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>> -{
>>>>>> -    afr_private_t *priv = NULL;
>>>>>> -    int ret = -1;
>>>>>> -
>>>>>> -    priv = this->private;
>>>>>> -
>>>>>> -    if (priv->child_count <= 16)
>>>>>> -        ret = __afr_inode_event_gen_reset_small(inode, this);
>>>>>> -    else
>>>>>> -        ret = -1;
>>>>>> -
>>>>>> -    return ret;
>>>>>> -}
>>>>>> -
>>>>>>  int
>>>>>>  afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
>>>>>>                            unsigned char *metadata, int *event_p)
>>>>>> @@ -721,30 +677,22 @@ out:
>>>>>>      return need_refresh;
>>>>>>  }
>>>>>>
>>>>>> -static int
>>>>>> -afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>> +int
>>>>>> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>>  {
>>>>>>      int ret = -1;
>>>>>>      afr_inode_ctx_t *ctx = NULL;
>>>>>>
>>>>>> -    GF_VALIDATE_OR_GOTO(this->name, inode, out);
>>>>>> -
>>>>>> -    LOCK(&inode->lock);
>>>>>> -    {
>>>>>> -        ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> -        if (ret)
>>>>>> -            goto unlock;
>>>>>> -
>>>>>> +    ret = __afr_inode_ctx_get(this, inode, &ctx);
>>>>>> +    if (ret == 0) {
>>>>>>          ctx->need_refresh = _gf_true;
>>>>>>      }
>>>>>> -unlock:
>>>>>> -    UNLOCK(&inode->lock);
>>>>>> -out:
>>>>>> +
>>>>>>      return ret;
>>>>>>  }
>>>>>>
>>>>>>  int
>>>>>> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this)
>>>>>>  {
>>>>>>      int ret = -1;
>>>>>>
>>>>>> @@ -754,7 +702,7 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>>>>>>          "Resetting event gen for %s", uuid_utoa(inode->gfid));
>>>>>>      LOCK(&inode->lock);
>>>>>>      {
>>>>>> -        ret = __afr_inode_event_gen_reset(inode, this);
>>>>>> +        ret = __afr_inode_need_refresh_set(inode, this);
>>>>>>      }
>>>>>>      UNLOCK(&inode->lock);
>>>>>>  out:
>>>>>> @@ -1187,7 +1135,7 @@ afr_txn_refresh_done(call_frame_t *frame, xlator_t *this, int err)
>>>>>>      ret = afr_inode_get_readable(frame, inode, this, local->readable,
>>>>>>                                   &event_generation, local->transaction.type);
>>>>>>
>>>>>> -    if (ret == -EIO || (local->is_read_txn && !event_generation)) {
>>>>>> +    if (ret == -EIO) {
>>>>>>          /* No readable subvolume even after refresh ==> splitbrain.*/
>>>>>>          if (!priv->fav_child_policy) {
>>>>>>              err = EIO;
>>>>>> @@ -2451,7 +2399,7 @@ afr_lookup_done(call_frame_t *frame, xlator_t *this)
>>>>>>          if (read_subvol == -1)
>>>>>>              goto cant_interpret;
>>>>>>          if (ret) {
>>>>>> -            afr_inode_event_gen_reset(local->inode, this);
>>>>>> +            afr_inode_need_refresh_set(local->inode, this);
>>>>>>              dict_del_sizen(local->replies[read_subvol].xdata, GF_CONTENT_KEY);
>>>>>>          }
>>>>>>      } else {
>>>>>> @@ -3007,6 +2955,7 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>>>>>>      afr_private_t *priv = NULL;
>>>>>>      afr_local_t *local = NULL;
>>>>>>      int read_subvol = -1;
>>>>>> +    int ret = 0;
>>>>>>      unsigned char *data_readable = NULL;
>>>>>>      unsigned char *success_replies = NULL;
>>>>>>
>>>>>> @@ -3028,7 +2977,10 @@ afr_discover_unwind(call_frame_t *frame, xlator_t *this)
>>>>>>      if (!afr_has_quorum(success_replies, this, frame))
>>>>>>          goto unwind;
>>>>>>
>>>>>> -    afr_replies_interpret(frame, this, local->inode, NULL);
>>>>>> +    ret = afr_replies_interpret(frame, this, local->inode, NULL);
>>>>>> +    if (ret) {
>>>>>> +        afr_inode_need_refresh_set(local->inode, this);
>>>>>> +    }
>>>>>>
>>>>>>      read_subvol = afr_read_subvol_decide(local->inode, this, NULL,
>>>>>>                                           data_readable);
>>>>>> @@ -3284,11 +3236,7 @@ afr_discover(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>>>>>>      afr_read_subvol_get(loc->inode, this, NULL, NULL, &event,
>>>>>>                          AFR_DATA_TRANSACTION, NULL);
>>>>>>
>>>>>> -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
>>>>>> -                                  local->event_generation))
>>>>>> -        afr_inode_refresh(frame, this, loc->inode, NULL, afr_discover_do);
>>>>>> -    else
>>>>>> -        afr_discover_do(frame, this, 0);
>>>>>> +    afr_discover_do(frame, this, 0);
>>>>>>
>>>>>>      return 0;
>>>>>>  out:
>>>>>> @@ -3429,11 +3377,7 @@ afr_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xattr_req)
>>>>>>      afr_read_subvol_get(loc->parent, this, NULL, NULL, &event,
>>>>>>                          AFR_DATA_TRANSACTION, NULL);
>>>>>>
>>>>>> -    if (afr_is_inode_refresh_reqd(loc->inode, this, event,
>>>>>> -                                  local->event_generation))
>>>>>> -        afr_inode_refresh(frame, this, loc->parent, NULL, afr_lookup_do);
>>>>>> -    else
>>>>>> -        afr_lookup_do(frame, this, 0);
>>>>>> +    afr_lookup_do(frame, this, 0);
>>>>>>
>>>>>>      return 0;
>>>>>>  out:
>>>>>> diff --git a/xlators/cluster/afr/src/afr-dir-write.c b/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> index 82a72fddd..333085b14 100644
>>>>>> --- a/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> +++ b/xlators/cluster/afr/src/afr-dir-write.c
>>>>>> @@ -119,11 +119,11 @@ __afr_dir_write_finalize(call_frame_t *frame, xlator_t *this)
>>>>>>              continue;
>>>>>>          if (local->replies[i].op_ret < 0) {
>>>>>>              if (local->inode)
>>>>>> -                afr_inode_event_gen_reset(local->inode, this);
>>>>>> +                afr_inode_need_refresh_set(local->inode, this);
>>>>>>              if (local->parent)
>>>>>> -                afr_inode_event_gen_reset(local->parent, this);
>>>>>> +                afr_inode_need_refresh_set(local->parent, this);
>>>>>>              if (local->parent2)
>>>>>> -                afr_inode_event_gen_reset(local->parent2, this);
>>>>>> +                afr_inode_need_refresh_set(local->parent2, this);
>>>>>>              continue;
>>>>>>          }
>>>>>>
>>>>>> diff --git a/xlators/cluster/afr/src/afr.h b/xlators/cluster/afr/src/afr.h
>>>>>> index a3f2942b3..ed6d777c1 100644
>>>>>> --- a/xlators/cluster/afr/src/afr.h
>>>>>> +++ b/xlators/cluster/afr/src/afr.h
>>>>>> @@ -958,7 +958,10 @@ afr_inode_read_subvol_set(inode_t *inode, xlator_t *this,
>>>>>>                            int event_generation);
>>>>>>
>>>>>>  int
>>>>>> -afr_inode_event_gen_reset(inode_t *inode, xlator_t *this);
>>>>>> +__afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
>>>>>> +
>>>>>> +int
>>>>>> +afr_inode_need_refresh_set(inode_t *inode, xlator_t *this);
>>>>>>
>>>>>>  int
>>>>>>  afr_read_subvol_select_by_policy(inode_t *inode, xlator_t *this,
>>>>>> --
>>>>>> 2.25.2
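
A side note on what the removed code was doing, for anyone reading the patch without the AFR sources handy: __afr_inode_event_gen_reset_small() manipulates AFR's packed per-inode read_subvol word, whose layout (per the masks and shifts visible in the diff) is the metadata readable map in bits 0-15, the data readable map in bits 16-31, and the event generation from bit 32 up. The following is a minimal standalone sketch of just that bit layout; the pack/unpack function names are illustrative, not Gluster API:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Bit layout taken from the masks/shifts in the removed
 * __afr_inode_event_gen_reset_small(): bits 0-15 hold the metadata
 * readable map, bits 16-31 the data readable map, bits 32+ the event
 * generation. Function names here are made up for illustration. */
static uint64_t
pack_read_subvol(uint16_t metadatamap, uint16_t datamap, uint32_t event)
{
    return ((uint64_t)metadatamap) | (((uint64_t)datamap) << 16) |
           (((uint64_t)event) << 32);
}

static void
unpack_read_subvol(uint64_t val, uint16_t *metadatamap, uint16_t *datamap,
                   uint32_t *event)
{
    *metadatamap = (val & 0x000000000000ffffULL) >> 0;
    *datamap = (val & 0x00000000ffff0000ULL) >> 16;
    *event = (uint32_t)(val >> 32);
}

int
main(void)
{
    uint16_t md, d;
    uint32_t ev;

    /* Pack a value, then zero only the event-generation field, which
     * is what the old reset helper did. */
    uint64_t val = pack_read_subvol(0x7, 0x7, 42);
    unpack_read_subvol(val, &md, &d, &ev);
    val = pack_read_subvol(md, d, 0); /* event gen forced to zero */
    printf("val after reset: 0x%016" PRIx64 "\n", val);
    return 0;
}

Zeroing the event-generation field this way discards the ordering ("causal order") information that in-flight read transactions depend on, which is why the patch replaces every such reset with setting the need_refresh flag instead.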
Erik Jacobson
2020-Apr-16 13:24 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> The patch by itself is only making changes specific to AFR, so it should not
> affect other translators. But I wonder how readdir-ahead is enabled in your
> gnfs stack. All performance xlators are turned off in gnfs except
> write-behind, and AFAIK there is no way to enable them via the CLI. Did you
> custom-edit your gnfs volfile to add readdir-ahead? If yes, does the crash
> go away if you remove the xlator from the nfs volfile?

Thank you. A quick reply: I will then go research how to do this; I've never hand-edited a volfile before. I've never even really looked at the gnfs volfile before. There are no custom code changes or hand edits. More soon.
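
For anyone landing on this thread with the same question, a quick way to check whether readdir-ahead ended up in the gnfs graph is to grep the generated volfile on a server. On a typical install glusterd writes it to /var/lib/glusterd/nfs/nfs-server.vol; the path and option name below are a best-guess sketch from standard setups, not something confirmed in this thread:

    # grep -n "readdir-ahead" /var/lib/glusterd/nfs/nfs-server.vol
    # gluster volume get <volname> performance.readdir-ahead

The running gnfs process also prints its full xlator graph into its log (typically /var/log/glusterfs/nfs.log) at startup, which is another place to confirm whether the translator is actually loaded.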