Amar Tumballi
2018-Sep-14 02:16 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod <mailinglists at smcleod.net> wrote:

> I apologise for this being posted twice - I'm not sure if that was user
> error or a bug in the mailing list, but the list wasn't showing my post
> after quite some time so I sent a second email which near immediately
> showed up - that's mailing lists I guess...
>
> Anyway, if anyone has any input, advice or abuse, I welcome it!

We got a little late getting back on this, but after running tests
internally we found that a possibly missing volume option is the reason for
this. Try

    gluster volume set <volname> server.tcp-user-timeout 42

on your volume. Let us know if this helps.
(Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)

> --
> Sam McLeod
> https://smcleod.net
> https://twitter.com/s_mcleod
>
> On 3 Sep 2018, at 1:20 pm, Sam McLeod <mailinglists at smcleod.net> wrote:
>
> We've got an odd problem where clients are blocked from writing to Gluster
> volumes until the first node of the Gluster cluster is rebooted.
>
> I suspect I've either configured something incorrectly with the arbiter /
> replica configuration of the volumes, or there is some sort of bug in the
> gluster client-server connection that we're triggering.
>
> I was wondering if anyone has seen this or could point me in the right
> direction?
>
>
> *Environment:*
>
> - Topology: 3 node cluster, replica 2, arbiter 1 (third node is metadata only).
> - Version: client and servers both running 4.1.3, both on CentOS 7,
>   kernel 4.18.x, (Xen) VMs with relatively fast networked SSD storage
>   backing them, XFS.
> - Client: native Gluster FUSE client mounting via the kubernetes provider.
>
>
> *Problem:*
>
> - Seemingly randomly, some clients will be blocked / are unable to write
>   to what should be a highly available gluster volume.
> - The client gluster logs show it failing to do new file operations
>   across various volumes and all three nodes of the gluster.
> - The server gluster (or OS) logs do not show any warnings or errors.
> - The client recovers and is able to write to volumes again after the
>   first node of the gluster cluster is rebooted.
> - Until the first node of the gluster cluster is rebooted, the client
>   fails to write to the volume that is (or should be) available on the
>   second node (a replica) and the third node (an arbiter-only node).
>
>
> *What 'fixes' the issue:*
>
> - Although the clients (kubernetes hosts) connect to all 3 nodes of the
>   Gluster cluster, restarting the first gluster node always unblocks the
>   IO and allows the client to continue writing.
> - Stopping and starting the glusterd service on the gluster server is not
>   enough to fix the issue, nor is restarting its networking.
> - This suggests to me that the volume is unavailable for writing for some
>   reason, and that restarting the first node in the cluster clears some
>   sort of TCP session either between client and server or between the
>   servers' replication.
>
>
> *Expected behaviour:*
>
> - If the first gluster node / server had failed or was blocked from
>   performing operations for some reason (which it doesn't seem to be),
>   I'd expect the clients to access data from the second gluster node and
>   write metadata to the third gluster node as well, as it's an arbiter /
>   metadata-only node.
> - If for some reason a gluster node was not able to serve connections to
>   clients, I'd expect to see errors in the volume, glusterd or brick log
>   files (there are none on the first gluster node).
> - If the first gluster node was for some reason blocking IO on a volume,
>   I'd expect that node either to show as unhealthy or unavailable in the
>   gluster peer status or gluster volume status.
>
>
> *Client gluster errors:*
>
> - staging_static in this example is a volume name.
> - You can see the client trying to connect to the second and third nodes
>   of the gluster cluster and failing (unsure as to why?).
> - The server-side logs on the first gluster node do not show any errors
>   or problems, but the second / third nodes show errors in glusterd.log
>   when trying to 'unlock' the 0-management volume on the first node.
>
>
> *On a gluster client* (a kubernetes host using the kubernetes connector,
> which uses the native fuse client) when it's blocked from writing but the
> gluster appears healthy (other than the errors mentioned later):
>
> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 15:33:22.750989] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 16:03:23.097988] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 16:33:23.439282] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 16:33:23.455171. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 17:03:23.786971] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02 17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 17:33:24.160720] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02 17:33:24.173153.
> timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 18:03:24.505185] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02 18:03:24.505328. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 18:33:24.841311] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02 18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 19:03:25.204784] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02 19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 19:33:25.533611] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02 19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 20:03:25.877098] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02 20:03:25.877264. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 20:33:26.217973] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02 20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 21:03:26.588316] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02 21:03:26.588456. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 21:33:26.912449] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02 21:33:32.091009.
> timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 22:03:37.259000] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02 22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 22:33:37.615574] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02 22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-02 23:03:37.941049] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02 23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-02 23:33:38.271078] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02 23:33:38.285899. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-03 00:03:38.607263] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03 00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-03 00:33:38.934479] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
> [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03 00:33:38.948570. timeout = 1800 for <ip of second gluster node>:49154
> [2018-09-03 01:03:39.256972] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
> [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03 01:03:39.258166.
> timeout = 1800 for <ip of third gluster node>:49154
> [2018-09-03 01:33:39.614483] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
>
>
> *On the second gluster server:*
>
> We are seeing the following error in the glusterd.log file when the
> client is blocked from writing to the volume. I think this is probably
> the most important information about the error; it suggests a problem
> with the first node but doesn't explain the client behaviour:
>
> [2018-09-02 08:31:03.902272] E [MSGID: 106115] [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: Unlocking failed on <FQDN of the first gluster node>. Please check log file for details.
> [2018-09-02 08:31:03.902477] E [MSGID: 106151] [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
>
> Note in the above error:
>
> 1. I'm not sure which log to check (there doesn't seem to be a management
>    brick / brick log)?
> 2. If there's a problem with the first node, why isn't it rejected from
>    the gluster / taken offline / the health of the peers or volume list
>    degraded?
> 3. Why does the client fail to write to the volume rather than (I'm
>    assuming) trying the second (or third I guess) node to write to the
>    volume?
>
> We are also seeing the following errors repeated a lot in the brick log
> (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log), both
> when the volumes are working and when there's an issue:
>
> [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed
> [2018-09-03 01:58:35.128957] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server)
> [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed
> [2018-09-03 01:58:35.129016] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server)
> [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed
> [2018-09-03 01:58:35.129077] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server)
> [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed
> [2018-09-03 01:58:35.129191] E [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 29) to rpc-transport (tcp.staging_static-server)
> [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply] (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) [0x7f8470319d14] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) [0x7f846bdde24a] -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) [0x7f846bd89fce] ) 0-: Reply submission failed
>
>
> *Gluster volume information:*
>
> # gluster volume info staging_static
>
> Volume Name: staging_static
> Type: Replicate
> Volume ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static
> Brick2: <second gluster node.fqdn>:/mnt/gluster-storage/staging_static
> Brick3: <third gluster node.fqdn>:/mnt/gluster-storage/staging_static (arbiter)
> Options Reconfigured:
> storage.fips-mode-rchecksum: true
> cluster.self-heal-window-size: 16
> cluster.shd-wait-qlength: 4096
> cluster.shd-max-threads: 8
> performance.cache-min-file-size: 2KB
> performance.rda-cache-limit: 1GB
> network.inode-lru-limit: 50000
> server.outstanding-rpc-limit: 256
> transport.listen-backlog: 2048
> performance.write-behind-window-size: 512MB
> performance.stat-prefetch: true
> performance.io-thread-count: 16
> performance.client-io-threads: true
> performance.cache-size: 1GB
> performance.cache-refresh-timeout: 60
> performance.cache-invalidation: true
> cluster.use-compound-fops: true
> cluster.readdir-optimize: true
> cluster.lookup-optimize: true
> cluster.favorite-child-policy: size
> cluster.eager-lock: true
> client.event-threads: 4
> nfs.disable: on
> transport.address-family: inet
> diagnostics.brick-log-level: ERROR
> diagnostics.client-log-level: ERROR
> features.cache-invalidation-timeout: 300
> features.cache-invalidation: true
> network.ping-timeout: 15
> performance.cache-max-file-size: 3MB
> performance.md-cache-timeout: 300
> server.event-threads: 4
>
> Thanks in advance,
>
> --
> Sam McLeod (protoporpoise on IRC)
> https://smcleod.net
> https://twitter.com/s_mcleod
>
> Words are my own opinions and do not necessarily represent those of
> my employer or partners.

--
Amar Tumballi (amarts)
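For illustration only, a minimal sketch of applying and checking the option
suggested above; it assumes the volume name staging_static from this thread,
so substitute your own volume name:

    # set the suggested 42-second TCP user timeout on the volume's server side
    gluster volume set staging_static server.tcp-user-timeout 42

    # confirm the value the volume is now using
    gluster volume get staging_static server.tcp-user-timeout

    # revert to the default later if it makes no difference
    gluster volume reset staging_static server.tcp-user-timeout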
Hoggins!
2018-Oct-23 16:31 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
Hello there,

I'm stumbling upon the *exact same issue*, and unfortunately setting
server.tcp-user-timeout to 42 does not help. Any other suggestion?

I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
version 4.1.5 (Fedora 28), and /sometimes/ the workaround suggested by Sam
(rebooting a node) works, but it often doesn't.

You may ask how I got into this, well it's simple: I needed to replace my
brick 1 and brick 2 with two brand new machines, so here's what I did:

    - add brick 3 and brick 4 into the cluster (gluster peer probe, gluster
      volume add-brick, etc., with the issue regarding the arbiter node
      that has to be removed from the cluster first before being able to
      add bricks 3 and 4)
    - wait for all the files on my volumes to heal. It took a few days.
    - remove bricks 1 and 2
    - after having "reset" the arbiter, re-add the arbiter into the cluster

And now it's intermittently hanging on writes *to existing files*. There is
*no problem writing new files* to the volumes.

I'm lost here, thanks for your inputs!

    Hoggins!

On 14/09/2018 at 04:16, Amar Tumballi wrote:

> We got a little late getting back on this, but after running tests
> internally we found that a possibly missing volume option is the reason
> for this. Try
>
>     gluster volume set <volname> server.tcp-user-timeout 42
>
> on your volume. Let us know if this helps.
> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
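For reference, the health checks mentioned in this thread (peer state, brick
status, pending heals) map to standard gluster CLI calls; a minimal sketch,
assuming a volume named staging_static, so substitute your own volume name:

    # every peer should report 'Peer in Cluster (Connected)'
    gluster peer status

    # every brick and self-heal daemon should be listed as online
    gluster volume status staging_static

    # entries listed here are still pending heal, e.g. after an
    # add-brick / remove-brick cycle like the one described above
    gluster volume heal staging_static info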