Erik Jacobson
2020-Apr-08 16:25 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
Thank you again for the help so far. Here is what Scott Titus came up with. Let us know if you have suggestions for next steps. We never hit the "Event gen is zero" message, so it appears that afr_access() never has a zero event_gen to begin with. However, the "Resetting event gen" message was just a bit chatty, growing our nfs.log to >2.4GB. Many were against a gfid populated with zeros. Around each split-brain log entry, we did find "Resetting event gen" messages containing a matching gfid:

9439138:[2020-04-08 15:48:44.737590] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439139:[2020-04-08 15:48:44.737636] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439140:[2020-04-08 15:48:44.737663] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done]
0-cm_shared-replicate-0: Failing ACCESS on gfid f2d7abf0-5444-48d6-863d-4b128502daf9:
split-brain observed. [Input/output error]

9439143:[2020-04-08 15:48:44.737801] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439145:[2020-04-08 15:48:44.737861] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439148:[2020-04-08 15:48:44.738125] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

9439225:[2020-04-08 15:48:44.749920] E [afr-common.c:754:afr_inode_event_gen_reset]
(-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

Thanks,

-Scott

On 4/8/20 8:31 AM, Erik Jacobson wrote:
> Hi team -
>
> We got an update to try more stuff from the community.
>
> I feel like I've been "given an inch but am taking a mile" but if we
> do happen to have time on orbit41 again, we'll do the next round of
> debugging.
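The gfid correlation above (each split-brain error matched against "Resetting event gen" lines carrying the same gfid, with `grep -n`-style line-number prefixes) can be reproduced with a short pipeline. This is only a sketch, not the exact command Scott ran: the two-line sample log below is synthetic stand-in data, and on a real system the target would be the gluster nfs.log.

```shell
# Synthetic stand-in for nfs.log, reduced to the two message shapes
# seen in the excerpt above.
log=/tmp/nfs-sample.log
cat > "$log" <<'EOF'
[2020-04-08 15:48:44.737590] E [afr-common.c:754:afr_inode_event_gen_reset] 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9
[2020-04-08 15:48:44.737663] E [MSGID: 108008] [afr-read-txn.c:312:afr_read_txn_refresh_done] 0-cm_shared-replicate-0: Failing ACCESS on gfid f2d7abf0-5444-48d6-863d-4b128502daf9: split-brain observed. [Input/output error]
EOF

# Pull the gfid out of each "split-brain observed" line, then list
# the "Resetting event gen" hits for that same gfid with line numbers
# (grep -n produces the 9439138:-style prefixes).
grep 'split-brain observed' "$log" |
  sed 's/.*on gfid \([0-9a-f-]*\):.*/\1/' |
  sort -u |
  while read -r gfid; do
    grep -n "Resetting event gen for $gfid" "$log"
  done
```

On the sample data this prints the single reset line prefixed with its line number, `1:`.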
> > Erik

On Wed, Apr 08, 2020 at 01:53:00PM +0530, Ravishankar N wrote:
> On 08/04/20 4:59 am, Erik Jacobson wrote:
> > Apologies for misinterpreting the backtrace.
> >
> > #0  afr_read_txn_refresh_done (frame=0x7ffcf4146478,
> >     this=0x7fff64013720, err=5) at afr-read-txn.c:312
> > #1  0x00007fff68938d2b in afr_txn_refresh_done
> >     (frame=frame@entry=0x7ffcf4146478, this=this@entry=0x7fff64013720,
> >     err=5, err@entry=0)
> >     at afr-common.c:1222
> Sorry, I missed this too.
> > (gdb) print event_generation
> > $3 = 0
> >
> > (gdb) print priv->fav_child_policy
> > $4 = AFR_FAV_CHILD_NONE
> >
> > I am not sure what this signifies though. It appears to be a read
> > transaction with no event generation and no favorite child policy.
> >
> > Feel free to ask for clarification in case my thought process went awry
> > somewhere.
> Favorite child policy is only for automatically resolving split-brains and
> is 0 unless that volume option is set. The problem is indeed that
> event_generation is zero. Could you try to apply this logging patch and see
> if afr_inode_event_gen_reset() for that gfid is hit, or if afr_access() has a
> zero event_gen to begin with?
> Thanks,
> Ravi
>
> diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
> index 4bfaef9e8..61f21795e 100644
> --- a/xlators/cluster/afr/src/afr-common.c
> +++ b/xlators/cluster/afr/src/afr-common.c
> @@ -750,6 +750,8 @@ afr_inode_event_gen_reset(inode_t *inode, xlator_t *this)
>
>      GF_VALIDATE_OR_GOTO(this->name, inode, out);
>
> +    gf_msg_callingfn(this->name, GF_LOG_ERROR, 0, AFR_MSG_SPLIT_BRAIN,
> +                     "Resetting event gen for %s", uuid_utoa(inode->gfid));
>      LOCK(&inode->lock);
>      {
>          ret = __afr_inode_event_gen_reset(inode, this);
> diff --git a/xlators/cluster/afr/src/afr-inode-read.c b/xlators/cluster/afr/src/afr-inode-read.c
> index 9204add5b..5ac83d6c8 100644
> --- a/xlators/cluster/afr/src/afr-inode-read.c
> +++ b/xlators/cluster/afr/src/afr-inode-read.c
> @@ -172,6 +172,12 @@ afr_access(call_frame_t *frame, xlator_t *this, loc_t *loc, int mask,
>      if (xdata)
>          local->xdata_req = dict_ref(xdata);
>
> +    if (local->event_generation == 0)
> +        gf_msg(this->name, GF_LOG_ERROR, 0, AFR_MSG_SPLIT_BRAIN,
> +               "Event gen is zero for %s(%s)", local->loc.name,
> +               local->loc.inode->gfid ? uuid_utoa(local->loc.inode->gfid)
> +                                      : "NULL");
> +
>      afr_read_txn(frame, this, loc->inode, afr_access_wind,
>                   AFR_METADATA_TRANSACTION);
Ravishankar N
2020-Apr-09 06:08 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
On 08/04/20 9:55 pm, Erik Jacobson wrote:
> 9439138:[2020-04-08 15:48:44.737590] E [afr-common.c:754:afr_inode_event_gen_reset]
> (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) [0x7fa4fb1cb735]
> -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) [0x7fa4fb1cb0b9]
> -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) [0x7fa4fb1b8701] )
> 0-cm_shared-replicate-0: Resetting event gen for f2d7abf0-5444-48d6-863d-4b128502daf9

Could you print the function/line number of each of these 3 functions in the backtrace and see who calls afr_inode_event_gen_reset()? `addr2line` should give you that info:

addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x5c701

I think it is likely called from afr_lookup_done, which I don't think is necessary. I will send a patch for review. Once reviews are over, I will share it with you and, if it fixes the issue in your testing, we can merge it with confidence.

Thanks,
Ravi