Hoggins!
2018-Oct-24 11:46 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
Thank you, it's working as expected. I guess it's only safe to put
cluster.data-self-heal back on when I get an updated version of GlusterFS?

    Hoggins!

On 24/10/2018 at 11:53, Ravishankar N wrote:
>
> On 10/24/2018 02:38 PM, Hoggins! wrote:
>> Thanks, that's helping a lot, I will do that.
>>
>> One more question: should the glustershd restart be performed on the
>> arbiter only, or on each node of the cluster?
> If you do a 'gluster volume start volname force' it will restart the
> shd on all nodes.
> -Ravi
>> Thanks!
>>
>>     Hoggins!
>>
>> On 24/10/2018 at 02:55, Ravishankar N wrote:
>>> On 10/23/2018 10:01 PM, Hoggins! wrote:
>>>> Hello there,
>>>>
>>>> I'm stumbling upon the *exact same issue*, and unfortunately setting
>>>> server.tcp-user-timeout to 42 does not help.
>>>> Any other suggestion?
>>>>
>>>> I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
>>>> version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
>>>> node) suggested by Sam works, but it often doesn't.
>>>>
>>>> You may ask how I got into this; well, it's simple: I needed to replace
>>>> my brick 1 and brick 2 with two brand new machines, so here's what I did:
>>>>     - add brick 3 and brick 4 into the cluster (gluster peer probe,
>>>>       gluster volume add-brick, etc., with the issue regarding the arbiter
>>>>       node that has to be removed from the cluster first before being able
>>>>       to add bricks 3 and 4)
>>>>     - wait for all the files on my volumes to heal. It took a few days.
>>>>     - remove bricks 1 and 2
>>>>     - after having "reset" the arbiter, re-add the arbiter into the cluster
>>>>
>>>> And now it's intermittently hanging on writes *to existing files*.
>>>> There is *no problem writing new files* on the volumes.
>>> Hi,
>>>
>>> There was an arbiter volume hang issue that was fixed [1] recently.
>>> The fix has been back-ported to all release branches.
>>>
>>> One workaround to overcome the hangs is to (1) turn off
>>> 'cluster.data-self-heal' on the volume ('testvol' in this example) and
>>> remount the clients, *and* (2) restart glustershd (via volume start
>>> force). The hang is observed due to an unreleased lock from self-heal.
>>> There are other ways to release the stale lock, via the gluster
>>> clear-locks command or by tweaking features.locks-revocation-secs, but
>>> restarting shd whenever you see the issue is the easiest and safest way.
>>>
>>> -Ravi
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
>>>
>>>
>>>> I'm lost here, thanks for your inputs!
>>>>
>>>>     Hoggins!
>>>>
>>>> On 14/09/2018 at 04:16, Amar Tumballi wrote:
>>>>> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod <mailinglists at smcleod.net
>>>>> <mailto:mailinglists at smcleod.net>> wrote:
>>>>>
>>>>>     I apologise for this being posted twice - I'm not sure if that was
>>>>>     user error or a bug in the mailing list, but the list wasn't
>>>>>     showing my post after quite some time so I sent a second email
>>>>>     which near immediately showed up - that's mailing lists I guess...
>>>>>
>>>>>     Anyway, if anyone has any input, advice or abuse, I welcome it!
>>>>>
>>>>>
>>>>> We got a little late getting back on this, but after running tests
>>>>> internally, we found that a possibly missing volume option is the
>>>>> reason for this. Try
>>>>>
>>>>>     gluster volume set <volname> server.tcp-user-timeout 42
>>>>>
>>>>> on your volume. Let us know if this helps.
>>>>> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
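Both suggestions quoted above - Amar's server.tcp-user-timeout option and Ravi's self-heal workaround - boil down to a handful of CLI calls. The following is only a sketch: 'testvol', 'server1' and '/mnt/testvol' are placeholders, and the mount options depend on your own setup.

    # Amar's suggestion: set a TCP user timeout on the volume
    gluster volume set testvol server.tcp-user-timeout 42

    # Ravi's workaround, step 1: disable client-side data self-heal
    gluster volume set testvol cluster.data-self-heal off

    # step 1 (cont.): remount the volume on each client so the change applies
    umount /mnt/testvol
    mount -t glusterfs server1:/testvol /mnt/testvol

    # step 2: restart the self-heal daemon (glustershd) on all nodes;
    # 'start ... force' leaves already-running brick processes untouched
    gluster volume start testvol force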
>>>>>
>>>>> --
>>>>> Sam McLeod
>>>>> https://smcleod.net
>>>>> https://twitter.com/s_mcleod
>>>>>
>>>>>> On 3 Sep 2018, at 1:20 pm, Sam McLeod <mailinglists at smcleod.net
>>>>>> <mailto:mailinglists at smcleod.net>> wrote:
>>>>>>
>>>>>> We've got an odd problem where clients are blocked from writing
>>>>>> to Gluster volumes until the first node of the Gluster cluster is
>>>>>> rebooted.
>>>>>>
>>>>>> I suspect I've either configured something incorrectly with the
>>>>>> arbiter / replica configuration of the volumes, or there is some
>>>>>> sort of bug in the gluster client-server connection that we're
>>>>>> triggering.
>>>>>>
>>>>>> I was wondering if anyone has seen this or could point me in the
>>>>>> right direction?
>>>>>>
>>>>>>
>>>>>> *Environment:*
>>>>>>
>>>>>>   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
>>>>>>     metadata only).
>>>>>>   * Version: client and servers both running 4.1.3, both on
>>>>>>     CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
>>>>>>     networked SSD storage backing them, XFS.
>>>>>>   * Client: native Gluster FUSE client mounting via the
>>>>>>     kubernetes provider.
>>>>>>
>>>>>>
>>>>>> *Problem:*
>>>>>>
>>>>>>   * Seemingly at random, some clients are blocked / unable to
>>>>>>     write to what should be a highly available gluster volume.
>>>>>>   * The client gluster logs show it failing to do new file
>>>>>>     operations across various volumes and all three nodes of the
>>>>>>     gluster.
>>>>>>   * The server gluster (or OS) logs do not show any warnings or
>>>>>>     errors.
>>>>>>   * The client recovers and is able to write to volumes again
>>>>>>     after the first node of the gluster cluster is rebooted.
>>>>>>   * Until the first node of the gluster cluster is rebooted, the
>>>>>>     client fails to write to the volume that is (or should be)
>>>>>>     available on the second node (a replica) and third node (an
>>>>>>     arbiter-only node).
>>>>>>
>>>>>>
>>>>>> *What 'fixes' the issue:*
>>>>>>
>>>>>>   * Although the clients (kubernetes hosts) connect to all 3
>>>>>>     nodes of the Gluster cluster, restarting the first gluster
>>>>>>     node always unblocks the IO and allows the client to continue
>>>>>>     writing.
>>>>>>   * Stopping and starting the glusterd service on the gluster
>>>>>>     server is not enough to fix the issue, nor is restarting its
>>>>>>     networking.
>>>>>>   * This suggests to me that the volume is unavailable for
>>>>>>     writing for some reason, and that restarting the first node in
>>>>>>     the cluster clears some sort of TCP session, either between
>>>>>>     client and server or in the server-to-server replication.
>>>>>>
>>>>>>
>>>>>> *Expected behaviour:*
>>>>>>
>>>>>>   * If the first gluster node / server had failed or was blocked
>>>>>>     from performing operations for some reason (which it doesn't
>>>>>>     seem it is), I'd expect the clients to access data from the
>>>>>>     second gluster node and write metadata to the third gluster
>>>>>>     node as well, since it's an arbiter / metadata-only node.
>>>>>>   * If for some reason a gluster node was not able to serve
>>>>>>     connections to clients, I'd expect to see errors in the
>>>>>>     volume, glusterd or brick log files (there are none on the
>>>>>>     first gluster node).
>>>>>>   * If the first gluster node was for some reason blocking IO on
>>>>>>     a volume, I'd expect that node either to show as unhealthy or
>>>>>>     unavailable in the gluster peer status or gluster volume status.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Client gluster errors:*
>>>>>>
>>>>>>   * staging_static in this example is a volume name.
>>>>>> * You can see the client trying to connect to the second and >>>>>> third nodes of the gluster cluster and failing (unsure as to >>>>>> why?) >>>>>> * The server side logs on the first gluster node do not show >>>>>> any errors or problems, but the second / third node show >>>>>> errors in the glusterd.log when trying to 'unlock' the >>>>>> 0-management volume on the first node. >>>>>> >>>>>> >>>>>> >>>>>> *On a gluster client*?(a kubernetes host using the kubernetes >>>>>> connector which uses the native fuse client) when its blocked >>>>>> from writing but the gluster appears healthy (other than the >>>>>> errors mentioned later): >>>>>> >>>>>> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 >>>>>> 15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 15:33:22.750989] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 >>>>>> 15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 16:03:23.097988] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 >>>>>> 16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 16:33:23.439282] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 >>>>>> 16:33:23.455171. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 17:03:23.786971] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02 >>>>>> 17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 17:33:24.160720] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02 >>>>>> 17:33:24.173153. 
timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 18:03:24.505185] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02 >>>>>> 18:03:24.505328. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 18:33:24.841311] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02 >>>>>> 18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 19:03:25.204784] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02 >>>>>> 19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 19:33:25.533611] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02 >>>>>> 19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 20:03:25.877098] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02 >>>>>> 20:03:25.877264. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 20:33:26.217973] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02 >>>>>> 20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 21:03:26.588316] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02 >>>>>> 21:03:26.588456. 
timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 21:33:26.912449] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02 >>>>>> 21:33:32.091009. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 22:03:37.259000] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02 >>>>>> 22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 22:33:37.615574] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02 >>>>>> 22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 23:03:37.941049] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02 >>>>>> 23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 23:33:38.271078] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02 >>>>>> 23:33:38.285899. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-03 00:03:38.607263] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03 >>>>>> 00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-03 00:33:38.934479] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03 >>>>>> 00:33:38.948570. 
timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-03 01:03:39.256972] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03 >>>>>> 01:03:39.258166. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-03 01:33:39.614483] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> >>>>>> >>>>>> *On the second gluster server:* >>>>>> >>>>>> >>>>>> We are seeing the following error in the glusterd.log file when >>>>>> the client is blocked from writing the volume, I think this is >>>>>> probably the most important information about the error and >>>>>> suggests a problem with the first node but doesn't explain the >>>>>> client behaviour: >>>>>> >>>>>> [2018-09-02 08:31:03.902272] E [MSGID: 106115] >>>>>> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: >>>>>> Unlocking failed on <FQDN of the first gluster node>. Please >>>>>> check log file for details. >>>>>> [2018-09-02 08:31:03.902477] E [MSGID: 106151] >>>>>> [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed >>>>>> to unlock on some peer(s) >>>>>> >>>>>> Note in the above error: >>>>>> >>>>>> 1. I'm not sure which log to check (there doesn't seem to be a >>>>>> management brick / brick log)? >>>>>> 2. If there's a problem with the first node, why isn't it >>>>>> rejected from the gluster / taken offline / the health of the >>>>>> peers or volume list degraded? >>>>>> 3. Why does the client fail to write to the volume rather than >>>>>> (I'm assuming) trying the second (or third I guess) node to write >>>>>> to the volume? 
>>>>>> >>>>>> >>>>>> We are also seeing the following errors repeated a lot in the >>>>>> logs, both when the volumes are working and when there's an issue >>>>>> in the brick log >>>>>> (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log): >>>>>> >>>>>> [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.128957] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.129016] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.129077] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.129191] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> >>>>>> >>>>>> >>>>>> *Gluster volume information:* >>>>>> >>>>>> >>>>>> # gluster volume info staging_static >>>>>> >>>>>> Volume Name: staging_static >>>>>> Type: Replicate >>>>>> Volume ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3 >>>>>> Status: Started >>>>>> Snapshot Count: 0 >>>>>> Number of Bricks: 1 x (2 + 1) = 3 >>>>>> 
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>> Brick2: <second gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>> Brick3: <third gluster node.fqdn>:/mnt/gluster-storage/staging_static (arbiter)
>>>>>> Options Reconfigured:
>>>>>> storage.fips-mode-rchecksum: true
>>>>>> cluster.self-heal-window-size: 16
>>>>>> cluster.shd-wait-qlength: 4096
>>>>>> cluster.shd-max-threads: 8
>>>>>> performance.cache-min-file-size: 2KB
>>>>>> performance.rda-cache-limit: 1GB
>>>>>> network.inode-lru-limit: 50000
>>>>>> server.outstanding-rpc-limit: 256
>>>>>> transport.listen-backlog: 2048
>>>>>> performance.write-behind-window-size: 512MB
>>>>>> performance.stat-prefetch: true
>>>>>> performance.io-thread-count: 16
>>>>>> performance.client-io-threads: true
>>>>>> performance.cache-size: 1GB
>>>>>> performance.cache-refresh-timeout: 60
>>>>>> performance.cache-invalidation: true
>>>>>> cluster.use-compound-fops: true
>>>>>> cluster.readdir-optimize: true
>>>>>> cluster.lookup-optimize: true
>>>>>> cluster.favorite-child-policy: size
>>>>>> cluster.eager-lock: true
>>>>>> client.event-threads: 4
>>>>>> nfs.disable: on
>>>>>> transport.address-family: inet
>>>>>> diagnostics.brick-log-level: ERROR
>>>>>> diagnostics.client-log-level: ERROR
>>>>>> features.cache-invalidation-timeout: 300
>>>>>> features.cache-invalidation: true
>>>>>> network.ping-timeout: 15
>>>>>> performance.cache-max-file-size: 3MB
>>>>>> performance.md-cache-timeout: 300
>>>>>> server.event-threads: 4
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sam McLeod (protoporpoise on IRC)
>>>>>> https://smcleod.net
>>>>>> https://twitter.com/s_mcleod
>>>>>>
>>>>>> Words are my own opinions and do not necessarily represent those
>>>>>> of my employer or partners.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
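As an aside to Sam's questions about why the misbehaving node is never flagged: the cluster and heal state he refers to would normally be checked with something like the commands below, shown against the staging_static volume from the output above - and by his report they all looked healthy here. A sketch only:

    # peer membership and connectivity as glusterd sees it
    gluster peer status

    # per-brick process and port status for the volume
    gluster volume status staging_static

    # files with pending heals (a long-stuck entry here can hint at a stale lock)
    gluster volume heal staging_static info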
Ravishankar N
2018-Oct-24 11:57 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
On 10/24/2018 05:16 PM, Hoggins! wrote:
> Thank you, it's working as expected.
>
> I guess it's only safe to put cluster.data-self-heal back on when I get
> an updated version of GlusterFS?

Yes, correct. Also, you would still need to restart shd whenever you hit
this issue until you upgrade.
-Ravi

>
>     Hoggins!
>
> On 24/10/2018 at 11:53, Ravishankar N wrote:
>> On 10/24/2018 02:38 PM, Hoggins! wrote:
>>> Thanks, that's helping a lot, I will do that.
>>>
>>> One more question: should the glustershd restart be performed on the
>>> arbiter only, or on each node of the cluster?
>> If you do a 'gluster volume start volname force' it will restart the
>> shd on all nodes.
>> -Ravi
>> [...]
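In other words, until the fixed packages are installed the interim routine is just the shd restart, and only after upgrading is it safe to turn the option back on. A sketch, again with 'testvol' as a placeholder volume name:

    # whenever the hang reappears before the upgrade: restart glustershd
    gluster volume start testvol force

    # after upgrading to a release carrying the fix for bug 1637802:
    gluster volume set testvol cluster.data-self-heal on

    # optionally confirm the current value of the option
    gluster volume get testvol cluster.data-self-heal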
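For completeness, the two alternatives Ravi mentioned earlier for releasing a stale lock look roughly like the following. Treat this as a sketch only: the clear-locks syntax is from memory (check 'gluster volume help' on your version), the path and range are placeholders, and Ravi's advice is that restarting glustershd is the safer option.

    # dump brick state, including currently held locks, for inspection
    gluster volume statedump staging_static

    # clear granted inode locks on a specific file (placeholder path/range)
    gluster volume clear-locks staging_static /path/to/stuck/file kind granted inode 0,0-0

    # or let bricks revoke locks held longer than N seconds (0 disables)
    gluster volume set staging_static features.locks-revocation-secs 60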