Hoggins!
2018-Oct-24 09:08 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
Thanks, that's helping a lot, I will do that.

One more question: should the glustershd restart be performed on the
arbiter only, or on each node of the cluster?

Thanks!

    Hoggins!

On 24/10/2018 at 02:55, Ravishankar N wrote:
>
> On 10/23/2018 10:01 PM, Hoggins! wrote:
>> Hello there,
>>
>> I'm stumbling upon the *exact same issue*, and unfortunately setting the
>> server.tcp-user-timeout to 42 does not help.
>> Any other suggestion?
>>
>> I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
>> version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
>> node) suggested by Sam works, but it often doesn't.
>>
>> You may ask how I got into this; well, it's simple: I needed to replace
>> my brick 1 and brick 2 with two brand new machines, so here's what I did:
>>     - add brick 3 and brick 4 into the cluster (gluster peer probe,
>>       gluster volume add-brick, etc., with the issue regarding the arbiter
>>       node that has to first be removed from the cluster before being able
>>       to add bricks 3 and 4)
>>     - wait for all the files on my volumes to heal. It took a few days.
>>     - remove bricks 1 and 2
>>     - after having "reset" the arbiter, re-add the arbiter into the cluster
>>
>> And now it's intermittently hanging on writes *to existing files*.
>> There is *no problem writing new files* to the volumes.
> Hi,
>
> There was an arbiter volume hang issue that was fixed [1] recently.
> The fix has been back-ported to all release branches.
>
> One workaround to overcome the hangs is to (1) turn off
> cluster.data-self-heal on the volume (e.g. 'testvol') and remount the
> clients, *and* (2) restart glustershd (via volume start force). The hang
> is observed due to an unreleased lock from self-heal. There are other
> ways to release the stale lock, via the gluster clear-locks command or
> by tweaking features.locks-revocation-secs, but restarting shd whenever
> you see the issue is the easiest and safest way.
>
> -Ravi
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
>
>
>> I'm lost here, thanks for your inputs!
>>
>>     Hoggins!
>>
>> On 14/09/2018 at 04:16, Amar Tumballi wrote:
>>> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod <mailinglists at smcleod.net> wrote:
>>>
>>>     I apologise for this being posted twice - I'm not sure if that was
>>>     user error or a bug in the mailing list, but the list wasn't
>>>     showing my post after quite some time, so I sent a second email
>>>     which near-immediately showed up - that's mailing lists, I guess...
>>>
>>>     Anyway, if anyone has any input, advice or abuse, I welcome it!
>>>
>>>
>>> We got a little late getting back on this. But after running tests
>>> internally, we found that a possibly missing volume option is the
>>> reason for this:
>>>
>>> Try
>>>
>>>     gluster volume set <volname> server.tcp-user-timeout 42
>>>
>>> on your volume. Let us know if this helps.
>>> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
>>>
>>> --
>>> Sam McLeod
>>> https://smcleod.net
>>> https://twitter.com/s_mcleod
>>>
>>>> On 3 Sep 2018, at 1:20 pm, Sam McLeod <mailinglists at smcleod.net> wrote:
>>>>
>>>> We've got an odd problem where clients are blocked from writing
>>>> to Gluster volumes until the first node of the Gluster cluster is
>>>> rebooted.
>>>>
>>>> I suspect I've either configured something incorrectly with the
>>>> arbiter / replica configuration of the volumes, or there is some
>>>> sort of bug in the gluster client-server connection that we're
>>>> triggering.
>>>>
>>>> I was wondering if anyone has seen this or could point me in the
>>>> right direction?
>>>>
>>>>
>>>> *Environment:*
>>>>
>>>>   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
>>>>     metadata only).
>>>>   * Version: Client and Servers both running 4.1.3, both on
>>>>     CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
>>>>     networked SSD storage backing them, XFS.
>>>>   * Client: Native Gluster FUSE client mounting via the
>>>>     kubernetes provider
>>>>
>>>>
>>>> *Problem:*
>>>>
>>>>   * Seemingly randomly, some clients will be blocked / are unable
>>>>     to write to what should be a highly available gluster volume.
>>>>   * The client gluster logs show it failing to do new file
>>>>     operations across various volumes and all three nodes of the
>>>>     gluster.
>>>>   * The server gluster (or OS) logs do not show any warnings or
>>>>     errors.
>>>>   * The client recovers and is able to write to volumes again
>>>>     after the first node of the gluster cluster is rebooted.
>>>>   * Until the first node of the gluster cluster is rebooted, the
>>>>     client fails to write to the volume that is (or should be)
>>>>     available on the second node (a replica) and third node (an
>>>>     arbiter-only node).
>>>>
>>>>
>>>> *What 'fixes' the issue:*
>>>>
>>>>   * Although the clients (kubernetes hosts) connect to all 3
>>>>     nodes of the Gluster cluster, restarting the first gluster
>>>>     node always unblocks the IO and allows the client to continue
>>>>     writing.
>>>>   * Stopping and starting the glusterd service on the gluster
>>>>     server is not enough to fix the issue, nor is restarting its
>>>>     networking.
>>>>   * This suggests to me that the volume is unavailable for
>>>>     writing for some reason, and restarting the first node in the
>>>>     cluster clears some sort of TCP session, either between the
>>>>     client and server or between the servers' replication.
>>>>
>>>>
>>>> *Expected behaviour:*
>>>>
>>>>   * If the first gluster node / server had failed or was blocked
>>>>     from performing operations for some reason (which it doesn't
>>>>     seem it is), I'd expect the clients to access data from the
>>>>     second gluster node and write metadata to the third gluster
>>>>     node as well, as it's an arbiter / metadata-only node.
>>>>   * If for some reason a gluster node was not able to serve
>>>>     connections to clients, I'd expect to see errors in the
>>>>     volume, glusterd or brick log files (there are none on the
>>>>     first gluster node).
>>>>   * If the first gluster node was for some reason blocking IO on
>>>>     a volume, I'd expect that node either to show as unhealthy or
>>>>     unavailable in the gluster peer status or gluster volume status.
>>>>
>>>>
>>>> *Client gluster errors:*
>>>>
>>>>   * staging_static in this example is a volume name.
>>>>   * You can see the client trying to connect to the second and
>>>>     third nodes of the gluster cluster and failing (unsure as to
>>>>     why?)
>>>>   * The server-side logs on the first gluster node do not show
>>>>     any errors or problems, but the second / third nodes show
>>>>     errors in glusterd.log when trying to 'unlock' the
>>>>     0-management volume on the first node.
>>>> >>>> >>>> >>>> *On a gluster client*?(a kubernetes host using the kubernetes >>>> connector which uses the native fuse client) when its blocked >>>> from writing but the gluster appears healthy (other than the >>>> errors mentioned later): >>>> >>>> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 >>>> 15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 15:33:22.750989] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 >>>> 15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 16:03:23.097988] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 >>>> 16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 16:33:23.439282] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 >>>> 16:33:23.455171. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 17:03:23.786971] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02 >>>> 17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 17:33:24.160720] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02 >>>> 17:33:24.173153. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 18:03:24.505185] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02 >>>> 18:03:24.505328. 
timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 18:33:24.841311] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02 >>>> 18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 19:03:25.204784] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02 >>>> 19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 19:33:25.533611] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02 >>>> 19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 20:03:25.877098] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02 >>>> 20:03:25.877264. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 20:33:26.217973] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02 >>>> 20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 21:03:26.588316] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02 >>>> 21:03:26.588456. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 21:33:26.912449] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02 >>>> 21:33:32.091009. 
timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 22:03:37.259000] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02 >>>> 22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 22:33:37.615574] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02 >>>> 22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 23:03:37.941049] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02 >>>> 23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 23:33:38.271078] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02 >>>> 23:33:38.285899. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-03 00:03:38.607263] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03 >>>> 00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-03 00:33:38.934479] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03 >>>> 00:33:38.948570. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-03 01:03:39.256972] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03 >>>> 01:03:39.258166. 
timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-03 01:33:39.614483] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> >>>> >>>> *On the second gluster server:* >>>> >>>> >>>> We are seeing the following error in the glusterd.log file when >>>> the client is blocked from writing the volume, I think this is >>>> probably the most important information about the error and >>>> suggests a problem with the first node but doesn't explain the >>>> client behaviour: >>>> >>>> [2018-09-02 08:31:03.902272] E [MSGID: 106115] >>>> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: >>>> Unlocking failed on <FQDN of the first gluster node>. Please >>>> check log file for details. >>>> [2018-09-02 08:31:03.902477] E [MSGID: 106151] >>>> [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed >>>> to unlock on some peer(s) >>>> >>>> Note in the above error: >>>> >>>> 1. I'm not sure which log to check (there doesn't seem to be a >>>> management brick / brick log)? >>>> 2. If there's a problem with the first node, why isn't it >>>> rejected from the gluster / taken offline / the health of the >>>> peers or volume list degraded? >>>> 3. Why does the client fail to write to the volume rather than >>>> (I'm assuming) trying the second (or third I guess) node to write >>>> to the volume? >>>> >>>> >>>> We are also seeing the following errors repeated a lot in the >>>> logs, both when the volumes are working and when there's an issue >>>> in the brick log >>>> (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log): >>>> >>>> [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.128957] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>> [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.129016] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>> [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.129077] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport 
(tcp.staging_static-server) >>>> [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.129191] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>> [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> >>>> >>>> >>>> *Gluster volume information:* >>>> >>>> >>>> # gluster volume info staging_static >>>> >>>> Volume Name: staging_static >>>> Type: Replicate >>>> Volume ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3 >>>> Status: Started >>>> Snapshot Count: 0 >>>> Number of Bricks: 1 x (2 + 1) = 3 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static >>>> Brick2: <second gluster >>>> node.fqdn>:/mnt/gluster-storage/staging_static >>>> Brick3: <third gluster >>>> node.fqdn>:/mnt/gluster-storage/staging_static (arbiter) >>>> Options Reconfigured: >>>> storage.fips-mode-rchecksum: true >>>> cluster.self-heal-window-size: 16 >>>> cluster.shd-wait-qlength: 4096 >>>> cluster.shd-max-threads: 8 >>>> performance.cache-min-file-size: 2KB >>>> performance.rda-cache-limit: 1GB >>>> network.inode-lru-limit: 50000 >>>> server.outstanding-rpc-limit: 256 >>>> transport.listen-backlog: 2048 >>>> performance.write-behind-window-size: 512MB >>>> performance.stat-prefetch: true >>>> performance.io <http://performance.io/>-thread-count: 16 >>>> performance.client-io-threads: true >>>> performance.cache-size: 1GB >>>> performance.cache-refresh-timeout: 60 >>>> performance.cache-invalidation: true >>>> cluster.use-compound-fops: true >>>> cluster.readdir-optimize: true >>>> cluster.lookup-optimize: true >>>> cluster.favorite-child-policy: size >>>> cluster.eager-lock: true >>>> client.event-threads: 4 >>>> nfs.disable: on >>>> transport.address-family: inet >>>> diagnostics.brick-log-level: ERROR >>>> diagnostics.client-log-level: ERROR >>>> features.cache-invalidation-timeout: 300 >>>> features.cache-invalidation: true >>>> network.ping-timeout: 15 >>>> performance.cache-max-file-size: 3MB >>>> performance.md-cache-timeout: 300 >>>> server.event-threads: 4 >>>> >>>> Thanks in advance, >>>> >>>> >>>> -- >>>> Sam McLeod (protoporpoise on IRC) >>>> https://smcleod.net <https://smcleod.net/> >>>> https://twitter.com/s_mcleod >>>> >>>> Words are my own opinions and do not?necessarily represent those >>>> of my?employer or partners. 
>>>
>>> --
>>> Amar Tumballi (amarts)
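For reference, the tuning and workaround suggested in the message above map onto a handful of gluster CLI calls. The sketch below assumes the staging_static volume from this thread (substitute your own volume name); the clear-locks arguments and the locks-revocation-secs value are illustrative placeholders only, and clearing locks on a live volume should be done with care.

    VOL=staging_static

    # Amar's suggestion: detect dead TCP connections faster
    # (ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
    gluster volume set "$VOL" server.tcp-user-timeout 42

    # Ravi's workaround for the stale self-heal lock:
    # 1) disable data self-heal, then remount the clients
    gluster volume set "$VOL" cluster.data-self-heal off
    # 2) restart glustershd; "start ... force" restarts the self-heal
    #    daemon (and any brick process that is not already running)
    gluster volume start "$VOL" force

    # Alternatives mentioned for releasing a stale lock (illustrative):
    # gluster volume clear-locks "$VOL" / kind all inode
    # gluster volume set "$VOL" features.locks-revocation-secs 60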
Ravishankar N
2018-Oct-24 09:53 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
On 10/24/2018 02:38 PM, Hoggins! wrote:
> Thanks, that's helping a lot, I will do that.
>
> One more question: should the glustershd restart be performed on the
> arbiter only, or on each node of the cluster?

If you do a 'gluster volume start volname force', it will restart the
shd on all nodes.

-Ravi
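As a quick check after the 'start force' above, the following is a minimal sketch, again assuming the staging_static volume from this thread; output layout may differ slightly between Gluster releases.

    # Restart glustershd on all nodes of the volume
    gluster volume start staging_static force

    # The Self-heal Daemon rows should report Online "Y" on every node
    gluster volume status staging_static

    # Pending heals should drain back towards zero
    gluster volume heal staging_static info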