Hoggins!
2018-Oct-24 09:08 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
Thanks, that's helping a lot, I will do that.

One more question: should the glustershd restart be performed on the
arbiter only, or on each node of the cluster?

Thanks!

    Hoggins!

On 24/10/2018 at 02:55, Ravishankar N wrote:
>
> On 10/23/2018 10:01 PM, Hoggins! wrote:
>> Hello there,
>>
>> I'm stumbling upon the *exact same issue*, and unfortunately setting the
>> server.tcp-user-timeout to 42 does not help.
>> Any other suggestion?
>>
>> I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
>> version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
>> node) suggested by Sam works, but it often doesn't.
>>
>> You may ask how I got into this; well, it's simple: I needed to replace
>> my brick 1 and brick 2 with two brand new machines, so here's what I did:
>>     - add brick 3 and brick 4 into the cluster (gluster peer probe,
>>       gluster volume add-brick, etc., with the issue regarding the arbiter
>>       node that has to first be removed from the cluster before being able
>>       to add bricks 3 and 4)
>>     - wait for all the files on my volumes to heal. It took a few days.
>>     - remove bricks 1 and 2
>>     - after having "reset" the arbiter, re-add the arbiter into the cluster
>>
>> And now it's intermittently hanging on writes *to existing files*.
>> There is *no problem writing new files* to the volumes.
> Hi,
>
> There was an arbiter volume hang issue that was fixed [1] recently.
> The fix has been back-ported to all release branches.
>
> One workaround to overcome the hangs is to (1) turn off
> cluster.data-self-heal on the volume (e.g. 'testvol') and remount the
> clients, *and* (2) restart glustershd (via volume start force). The hang
> is observed due to an unreleased lock from self-heal. There are other
> ways to release the stale lock, via the gluster clear-locks command or
> by tweaking features.locks-revocation-secs, but restarting shd whenever
> you see the issue is the easiest and safest way.
>
> -Ravi
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
>
>
>> I'm lost here, thanks for your inputs!
>>
>>     Hoggins!
>>
>> On 14/09/2018 at 04:16, Amar Tumballi wrote:
>>> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod <mailinglists at smcleod.net> wrote:
>>>
>>>     I apologise for this being posted twice - I'm not sure if that was
>>>     user error or a bug in the mailing list, but the list wasn't
>>>     showing my post after quite some time, so I sent a second email
>>>     which near-immediately showed up - that's mailing lists, I guess...
>>>
>>>     Anyway, if anyone has any input, advice or abuse, I welcome it!
>>>
>>>
>>> We got a little late getting back on this. But after running tests
>>> internally, we found that a possibly missing volume option is the
>>> reason for this:
>>>
>>> Try
>>>
>>>     gluster volume set <volname> server.tcp-user-timeout 42
>>>
>>> on your volume. Let us know if this helps.
>>> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
>>>
>>> --
>>> Sam McLeod
>>> https://smcleod.net
>>> https://twitter.com/s_mcleod
>>>
>>>> On 3 Sep 2018, at 1:20 pm, Sam McLeod <mailinglists at smcleod.net> wrote:
>>>>
>>>> We've got an odd problem where clients are blocked from writing
>>>> to Gluster volumes until the first node of the Gluster cluster is
>>>> rebooted.
>>>>
>>>> I suspect I've either configured something incorrectly with the
>>>> arbiter / replica configuration of the volumes, or there is some
>>>> sort of bug in the gluster client-server connection that we're
>>>> triggering.
>>>>
>>>> I was wondering if anyone has seen this or could point me in the
>>>> right direction?
>>>>
>>>>
>>>> *Environment:*
>>>>
>>>>   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
>>>>     metadata only).
>>>>   * Version: Client and Servers both running 4.1.3, both on
>>>>     CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
>>>>     networked SSD storage backing them, XFS.
>>>>   * Client: Native Gluster FUSE client mounting via the
>>>>     kubernetes provider
>>>>
>>>>
>>>> *Problem:*
>>>>
>>>>   * Seemingly randomly, some clients will be blocked / are unable
>>>>     to write to what should be a highly available gluster volume.
>>>>   * The client gluster logs show it failing to do new file
>>>>     operations across various volumes and all three nodes of the
>>>>     gluster.
>>>>   * The server gluster (or OS) logs do not show any warnings or
>>>>     errors.
>>>>   * The client recovers and is able to write to volumes again
>>>>     after the first node of the gluster cluster is rebooted.
>>>>   * Until the first node of the gluster cluster is rebooted, the
>>>>     client fails to write to the volume that is (or should be)
>>>>     available on the second node (a replica) and third node (an
>>>>     arbiter-only node).
>>>>
>>>>
>>>> *What 'fixes' the issue:*
>>>>
>>>>   * Although the clients (kubernetes hosts) connect to all 3
>>>>     nodes of the Gluster cluster, restarting the first gluster
>>>>     node always unblocks the IO and allows the client to continue
>>>>     writing.
>>>>   * Stopping and starting the glusterd service on the gluster
>>>>     server is not enough to fix the issue, nor is restarting its
>>>>     networking.
>>>>   * This suggests to me that the volume is unavailable for
>>>>     writing for some reason, and restarting the first node in the
>>>>     cluster clears some sort of TCP session, either between the
>>>>     client and server or between the servers' replication.
>>>>
>>>>
>>>> *Expected behaviour:*
>>>>
>>>>   * If the first gluster node / server had failed or was blocked
>>>>     from performing operations for some reason (which it doesn't
>>>>     seem it is), I'd expect the clients to access data from the
>>>>     second gluster node and write metadata to the third gluster
>>>>     node as well, as it's an arbiter / metadata-only node.
>>>>   * If for some reason a gluster node was not able to serve
>>>>     connections to clients, I'd expect to see errors in the
>>>>     volume, glusterd or brick log files (there are none on the
>>>>     first gluster node).
>>>>   * If the first gluster node was for some reason blocking IO on
>>>>     a volume, I'd expect that node either to show as unhealthy or
>>>>     unavailable in the gluster peer status or gluster volume status.
>>>>
>>>>
>>>> *Client gluster errors:*
>>>>
>>>>   * staging_static in this example is a volume name.
>>>>   * You can see the client trying to connect to the second and
>>>>     third nodes of the gluster cluster and failing (unsure as to
>>>>     why?)
>>>>   * The server-side logs on the first gluster node do not show
>>>>     any errors or problems, but the second / third nodes show
>>>>     errors in glusterd.log when trying to 'unlock' the
>>>>     0-management volume on the first node.
>>>> >>>> >>>> >>>> *On a gluster client*?(a kubernetes host using the kubernetes >>>> connector which uses the native fuse client) when its blocked >>>> from writing but the gluster appears healthy (other than the >>>> errors mentioned later): >>>> >>>> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 >>>> 15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 15:33:22.750989] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 >>>> 15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 16:03:23.097988] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 >>>> 16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 16:33:23.439282] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 >>>> 16:33:23.455171. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 17:03:23.786971] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02 >>>> 17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 17:33:24.160720] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02 >>>> 17:33:24.173153. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 18:03:24.505185] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02 >>>> 18:03:24.505328. 
timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 18:33:24.841311] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02 >>>> 18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 19:03:25.204784] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02 >>>> 19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 19:33:25.533611] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02 >>>> 19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 20:03:25.877098] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02 >>>> 20:03:25.877264. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 20:33:26.217973] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02 >>>> 20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 21:03:26.588316] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02 >>>> 21:03:26.588456. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 21:33:26.912449] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02 >>>> 21:33:32.091009. 
timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 22:03:37.259000] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02 >>>> 22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 22:33:37.615574] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02 >>>> 22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-02 23:03:37.941049] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02 >>>> 23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-02 23:33:38.271078] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02 >>>> 23:33:38.285899. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-03 00:03:38.607263] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03 >>>> 00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-03 00:33:38.934479] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03 >>>> 00:33:38.948570. timeout = 1800 for <ip of second gluster node>:49154 >>>> [2018-09-03 01:03:39.256972] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-1: remote operation failed [Transport >>>> endpoint is not connected] >>>> [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail] >>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>> v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03 >>>> 01:03:39.258166. 
timeout = 1800 for <ip of third gluster node>:49154 >>>> [2018-09-03 01:33:39.614483] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>> 0-staging_static-client-2: remote operation failed [Transport >>>> endpoint is not connected] >>>> >>>> >>>> *On the second gluster server:* >>>> >>>> >>>> We are seeing the following error in the glusterd.log file when >>>> the client is blocked from writing the volume, I think this is >>>> probably the most important information about the error and >>>> suggests a problem with the first node but doesn't explain the >>>> client behaviour: >>>> >>>> [2018-09-02 08:31:03.902272] E [MSGID: 106115] >>>> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: >>>> Unlocking failed on <FQDN of the first gluster node>. Please >>>> check log file for details. >>>> [2018-09-02 08:31:03.902477] E [MSGID: 106151] >>>> [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed >>>> to unlock on some peer(s) >>>> >>>> Note in the above error: >>>> >>>> 1. I'm not sure which log to check (there doesn't seem to be a >>>> management brick / brick log)? >>>> 2. If there's a problem with the first node, why isn't it >>>> rejected from the gluster / taken offline / the health of the >>>> peers or volume list degraded? >>>> 3. Why does the client fail to write to the volume rather than >>>> (I'm assuming) trying the second (or third I guess) node to write >>>> to the volume? >>>> >>>> >>>> We are also seeing the following errors repeated a lot in the >>>> logs, both when the volumes are working and when there's an issue >>>> in the brick log >>>> (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log): >>>> >>>> [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.128957] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>> [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.129016] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>> [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.129077] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport 
(tcp.staging_static-server) >>>> [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> [2018-09-03 01:58:35.129191] E >>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>> submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers: >>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>> [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply] >>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>> [0x7f8470319d14] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>> [0x7f846bdde24a] >>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>> >>>> >>>> >>>> *Gluster volume information:* >>>> >>>> >>>> # gluster volume info staging_static >>>> >>>> Volume Name: staging_static >>>> Type: Replicate >>>> Volume ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3 >>>> Status: Started >>>> Snapshot Count: 0 >>>> Number of Bricks: 1 x (2 + 1) = 3 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static >>>> Brick2: <second gluster >>>> node.fqdn>:/mnt/gluster-storage/staging_static >>>> Brick3: <third gluster >>>> node.fqdn>:/mnt/gluster-storage/staging_static (arbiter) >>>> Options Reconfigured: >>>> storage.fips-mode-rchecksum: true >>>> cluster.self-heal-window-size: 16 >>>> cluster.shd-wait-qlength: 4096 >>>> cluster.shd-max-threads: 8 >>>> performance.cache-min-file-size: 2KB >>>> performance.rda-cache-limit: 1GB >>>> network.inode-lru-limit: 50000 >>>> server.outstanding-rpc-limit: 256 >>>> transport.listen-backlog: 2048 >>>> performance.write-behind-window-size: 512MB >>>> performance.stat-prefetch: true >>>> performance.io <http://performance.io/>-thread-count: 16 >>>> performance.client-io-threads: true >>>> performance.cache-size: 1GB >>>> performance.cache-refresh-timeout: 60 >>>> performance.cache-invalidation: true >>>> cluster.use-compound-fops: true >>>> cluster.readdir-optimize: true >>>> cluster.lookup-optimize: true >>>> cluster.favorite-child-policy: size >>>> cluster.eager-lock: true >>>> client.event-threads: 4 >>>> nfs.disable: on >>>> transport.address-family: inet >>>> diagnostics.brick-log-level: ERROR >>>> diagnostics.client-log-level: ERROR >>>> features.cache-invalidation-timeout: 300 >>>> features.cache-invalidation: true >>>> network.ping-timeout: 15 >>>> performance.cache-max-file-size: 3MB >>>> performance.md-cache-timeout: 300 >>>> server.event-threads: 4 >>>> >>>> Thanks in advance, >>>> >>>> >>>> -- >>>> Sam McLeod (protoporpoise on IRC) >>>> https://smcleod.net <https://smcleod.net/> >>>> https://twitter.com/s_mcleod >>>> >>>> Words are my own opinions and do not?necessarily represent those >>>> of my?employer or partners. 
>>>
>>> --
>>> Amar Tumballi (amarts)
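For reference, the tuning and workaround suggested in the message above map onto a handful of gluster CLI calls. The sketch below assumes the staging_static volume from this thread (substitute your own volume name); the clear-locks arguments and the locks-revocation-secs value are illustrative placeholders only, and clearing locks on a live volume should be done with care.

    VOL=staging_static

    # Amar's suggestion: detect dead TCP connections faster
    # (ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
    gluster volume set "$VOL" server.tcp-user-timeout 42

    # Ravi's workaround for the stale self-heal lock:
    # 1) disable data self-heal, then remount the clients
    gluster volume set "$VOL" cluster.data-self-heal off
    # 2) restart glustershd; "start ... force" restarts the self-heal
    #    daemon (and any brick process that is not already running)
    gluster volume start "$VOL" force

    # Alternatives mentioned for releasing a stale lock (illustrative):
    # gluster volume clear-locks "$VOL" / kind all inode
    # gluster volume set "$VOL" features.locks-revocation-secs 60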
Ravishankar N
2018-Oct-24 09:53 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
On 10/24/2018 02:38 PM, Hoggins! wrote:
> Thanks, that's helping a lot, I will do that.
>
> One more question: should the glustershd restart be performed on the
> arbiter only, or on each node of the cluster?

If you do a 'gluster volume start volname force', it will restart the
shd on all nodes.

-Ravi
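As a quick check after the 'start force' above, the following is a minimal sketch, again assuming the staging_static volume from this thread; output layout may differ slightly between Gluster releases.

    # Restart glustershd on all nodes of the volume
    gluster volume start staging_static force

    # The Self-heal Daemon rows should report Online "Y" on every node
    gluster volume status staging_static

    # Pending heals should drain back towards zero
    gluster volume heal staging_static info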