Hoggins!
2018-Oct-24 11:46 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
Thank you, it's working as expected. I guess it's only safe to put
cluster.data-self-heal back on when I get an updated version of GlusterFS?

    Hoggins!

On 24/10/2018 at 11:53, Ravishankar N wrote:
>
> On 10/24/2018 02:38 PM, Hoggins! wrote:
>> Thanks, that's helping a lot, I will do that.
>>
>> One more question: should the glustershd restart be performed on the
>> arbiter only, or on each node of the cluster?
> If you do a 'gluster volume start volname force' it will restart the
> shd on all nodes.
> -Ravi
>> Thanks!
>>
>>     Hoggins!
>>
>> On 24/10/2018 at 02:55, Ravishankar N wrote:
>>> On 10/23/2018 10:01 PM, Hoggins! wrote:
>>>> Hello there,
>>>>
>>>> I'm stumbling upon the *exact same issue*, and unfortunately setting
>>>> server.tcp-user-timeout to 42 does not help.
>>>> Any other suggestion?
>>>>
>>>> I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
>>>> version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
>>>> node) suggested by Sam works, but it often doesn't.
>>>>
>>>> You may ask how I got into this; well, it's simple: I needed to replace
>>>> my brick 1 and brick 2 with two brand new machines, so here's what I did:
>>>>     - add brick 3 and brick 4 into the cluster (gluster peer probe,
>>>>       gluster volume add-brick, etc., with the issue regarding the arbiter
>>>>       node that has to be removed from the cluster first before being able
>>>>       to add bricks 3 and 4)
>>>>     - wait for all the files on my volumes to heal. It took a few days.
>>>>     - remove bricks 1 and 2
>>>>     - after having "reset" the arbiter, re-add the arbiter into the cluster
>>>>
>>>> And now it's intermittently hanging on writes *to existing files*.
>>>> There is *no problem writing new files* on the volumes.
>>> Hi,
>>>
>>> There was an arbiter volume hang issue that was fixed [1] recently.
>>> The fix has been back-ported to all release branches.
>>>
>>> One workaround to overcome the hangs is to (1) turn off
>>> 'cluster.data-self-heal' on the volume ('testvol' in this example) and
>>> remount the clients, *and* (2) restart glustershd (via volume start
>>> force). The hang is observed due to an unreleased lock from self-heal.
>>> There are other ways to release the stale lock, via the gluster
>>> clear-locks command or by tweaking features.locks-revocation-secs, but
>>> restarting shd whenever you see the issue is the easiest and safest way.
>>>
>>> -Ravi
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
>>>
>>>
>>>> I'm lost here, thanks for your inputs!
>>>>
>>>>     Hoggins!
>>>>
>>>> On 14/09/2018 at 04:16, Amar Tumballi wrote:
>>>>> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod <mailinglists at smcleod.net
>>>>> <mailto:mailinglists at smcleod.net>> wrote:
>>>>>
>>>>>     I apologise for this being posted twice - I'm not sure if that was
>>>>>     user error or a bug in the mailing list, but the list wasn't
>>>>>     showing my post after quite some time so I sent a second email
>>>>>     which near immediately showed up - that's mailing lists I guess...
>>>>>
>>>>>     Anyway, if anyone has any input, advice or abuse, I welcome it!
>>>>>
>>>>>
>>>>> We got a little late getting back on this, but after running tests
>>>>> internally, we found that a possibly missing volume option is the
>>>>> reason for this. Try
>>>>>
>>>>>     gluster volume set <volname> server.tcp-user-timeout 42
>>>>>
>>>>> on your volume. Let us know if this helps.
>>>>> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
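Both suggestions quoted above - Amar's server.tcp-user-timeout option and Ravi's self-heal workaround - boil down to a handful of CLI calls. The following is only a sketch: 'testvol', 'server1' and '/mnt/testvol' are placeholders, and the mount options depend on your own setup.

    # Amar's suggestion: set a TCP user timeout on the volume
    gluster volume set testvol server.tcp-user-timeout 42

    # Ravi's workaround, step 1: disable client-side data self-heal
    gluster volume set testvol cluster.data-self-heal off

    # step 1 (cont.): remount the volume on each client so the change applies
    umount /mnt/testvol
    mount -t glusterfs server1:/testvol /mnt/testvol

    # step 2: restart the self-heal daemon (glustershd) on all nodes;
    # 'start ... force' leaves already-running brick processes untouched
    gluster volume start testvol force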
>>>>>
>>>>> --
>>>>> Sam McLeod
>>>>> https://smcleod.net
>>>>> https://twitter.com/s_mcleod
>>>>>
>>>>>> On 3 Sep 2018, at 1:20 pm, Sam McLeod <mailinglists at smcleod.net
>>>>>> <mailto:mailinglists at smcleod.net>> wrote:
>>>>>>
>>>>>> We've got an odd problem where clients are blocked from writing
>>>>>> to Gluster volumes until the first node of the Gluster cluster is
>>>>>> rebooted.
>>>>>>
>>>>>> I suspect I've either configured something incorrectly with the
>>>>>> arbiter / replica configuration of the volumes, or there is some
>>>>>> sort of bug in the gluster client-server connection that we're
>>>>>> triggering.
>>>>>>
>>>>>> I was wondering if anyone has seen this or could point me in the
>>>>>> right direction?
>>>>>>
>>>>>>
>>>>>> *Environment:*
>>>>>>
>>>>>>   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
>>>>>>     metadata only).
>>>>>>   * Version: client and servers both running 4.1.3, both on
>>>>>>     CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
>>>>>>     networked SSD storage backing them, XFS.
>>>>>>   * Client: native Gluster FUSE client mounting via the
>>>>>>     kubernetes provider.
>>>>>>
>>>>>>
>>>>>> *Problem:*
>>>>>>
>>>>>>   * Seemingly at random, some clients are blocked / unable to
>>>>>>     write to what should be a highly available gluster volume.
>>>>>>   * The client gluster logs show it failing to do new file
>>>>>>     operations across various volumes and all three nodes of the
>>>>>>     gluster.
>>>>>>   * The server gluster (or OS) logs do not show any warnings or
>>>>>>     errors.
>>>>>>   * The client recovers and is able to write to volumes again
>>>>>>     after the first node of the gluster cluster is rebooted.
>>>>>>   * Until the first node of the gluster cluster is rebooted, the
>>>>>>     client fails to write to the volume that is (or should be)
>>>>>>     available on the second node (a replica) and third node (an
>>>>>>     arbiter-only node).
>>>>>>
>>>>>>
>>>>>> *What 'fixes' the issue:*
>>>>>>
>>>>>>   * Although the clients (kubernetes hosts) connect to all 3
>>>>>>     nodes of the Gluster cluster, restarting the first gluster
>>>>>>     node always unblocks the IO and allows the client to continue
>>>>>>     writing.
>>>>>>   * Stopping and starting the glusterd service on the gluster
>>>>>>     server is not enough to fix the issue, nor is restarting its
>>>>>>     networking.
>>>>>>   * This suggests to me that the volume is unavailable for
>>>>>>     writing for some reason, and that restarting the first node in
>>>>>>     the cluster clears some sort of TCP session, either between
>>>>>>     client and server or in the server-to-server replication.
>>>>>>
>>>>>>
>>>>>> *Expected behaviour:*
>>>>>>
>>>>>>   * If the first gluster node / server had failed or was blocked
>>>>>>     from performing operations for some reason (which it doesn't
>>>>>>     seem it is), I'd expect the clients to access data from the
>>>>>>     second gluster node and write metadata to the third gluster
>>>>>>     node as well, since it's an arbiter / metadata-only node.
>>>>>>   * If for some reason a gluster node was not able to serve
>>>>>>     connections to clients, I'd expect to see errors in the
>>>>>>     volume, glusterd or brick log files (there are none on the
>>>>>>     first gluster node).
>>>>>>   * If the first gluster node was for some reason blocking IO on
>>>>>>     a volume, I'd expect that node either to show as unhealthy or
>>>>>>     unavailable in the gluster peer status or gluster volume status.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Client gluster errors:*
>>>>>>
>>>>>>   * staging_static in this example is a volume name.
>>>>>> * You can see the client trying to connect to the second and >>>>>> third nodes of the gluster cluster and failing (unsure as to >>>>>> why?) >>>>>> * The server side logs on the first gluster node do not show >>>>>> any errors or problems, but the second / third node show >>>>>> errors in the glusterd.log when trying to 'unlock' the >>>>>> 0-management volume on the first node. >>>>>> >>>>>> >>>>>> >>>>>> *On a gluster client*?(a kubernetes host using the kubernetes >>>>>> connector which uses the native fuse client) when its blocked >>>>>> from writing but the gluster appears healthy (other than the >>>>>> errors mentioned later): >>>>>> >>>>>> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 >>>>>> 15:03:22.417773. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 15:33:22.750989] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 >>>>>> 15:33:22.765751. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 16:03:23.097988] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 >>>>>> 16:03:23.098133. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 16:33:23.439282] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 >>>>>> 16:33:23.455171. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 17:03:23.786971] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 17:33:24.160607] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1dc8 sent = 2018-09-02 >>>>>> 17:03:23.787120. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 17:33:24.160720] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 18:03:24.505092] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2faf sent = 2018-09-02 >>>>>> 17:33:24.173153. 
timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 18:03:24.505185] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 18:33:24.841248] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1e45 sent = 2018-09-02 >>>>>> 18:03:24.505328. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 18:33:24.841311] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 19:03:25.204711] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3074 sent = 2018-09-02 >>>>>> 18:33:24.855372. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 19:03:25.204784] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 19:33:25.533545] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1ec2 sent = 2018-09-02 >>>>>> 19:03:25.204977. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 19:33:25.533611] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 20:03:25.877020] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3138 sent = 2018-09-02 >>>>>> 19:33:25.545921. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 20:03:25.877098] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 20:33:26.217858] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1f3e sent = 2018-09-02 >>>>>> 20:03:25.877264. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 20:33:26.217973] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 21:03:26.588237] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x31ff sent = 2018-09-02 >>>>>> 20:33:26.233010. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 21:03:26.588316] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 21:33:26.912334] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x1fbb sent = 2018-09-02 >>>>>> 21:03:26.588456. 
timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 21:33:26.912449] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 22:03:37.258915] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x32c5 sent = 2018-09-02 >>>>>> 21:33:32.091009. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 22:03:37.259000] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 22:33:37.615497] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2039 sent = 2018-09-02 >>>>>> 22:03:37.259147. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 22:33:37.615574] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 23:03:37.940969] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3386 sent = 2018-09-02 >>>>>> 22:33:37.629655. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-02 23:03:37.941049] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-02 23:33:38.270998] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x20b5 sent = 2018-09-02 >>>>>> 23:03:37.941199. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-02 23:33:38.271078] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 00:03:38.607186] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x3447 sent = 2018-09-02 >>>>>> 23:33:38.285899. timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-03 00:03:38.607263] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 00:33:38.934385] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x2131 sent = 2018-09-03 >>>>>> 00:03:38.607410. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-03 00:33:38.934479] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 01:03:39.256842] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x350c sent = 2018-09-03 >>>>>> 00:33:38.948570. 
timeout = 1800 for <ip of second gluster node>:49154 >>>>>> [2018-09-03 01:03:39.256972] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-1: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> [2018-09-03 01:33:39.614402] E [rpc-clnt.c:184:call_bail] >>>>>> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x >>>>>> v1) op(INODELK(29)) xid = 0x21ae sent = 2018-09-03 >>>>>> 01:03:39.258166. timeout = 1800 for <ip of third gluster node>:49154 >>>>>> [2018-09-03 01:33:39.614483] E [MSGID: 114031] >>>>>> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] >>>>>> 0-staging_static-client-2: remote operation failed [Transport >>>>>> endpoint is not connected] >>>>>> >>>>>> >>>>>> *On the second gluster server:* >>>>>> >>>>>> >>>>>> We are seeing the following error in the glusterd.log file when >>>>>> the client is blocked from writing the volume, I think this is >>>>>> probably the most important information about the error and >>>>>> suggests a problem with the first node but doesn't explain the >>>>>> client behaviour: >>>>>> >>>>>> [2018-09-02 08:31:03.902272] E [MSGID: 106115] >>>>>> [glusterd-mgmt.c:124:gd_mgmt_v3_collate_errors] 0-management: >>>>>> Unlocking failed on <FQDN of the first gluster node>. Please >>>>>> check log file for details. >>>>>> [2018-09-02 08:31:03.902477] E [MSGID: 106151] >>>>>> [glusterd-syncop.c:1640:gd_unlock_op_phase] 0-management: Failed >>>>>> to unlock on some peer(s) >>>>>> >>>>>> Note in the above error: >>>>>> >>>>>> 1. I'm not sure which log to check (there doesn't seem to be a >>>>>> management brick / brick log)? >>>>>> 2. If there's a problem with the first node, why isn't it >>>>>> rejected from the gluster / taken offline / the health of the >>>>>> peers or volume list degraded? >>>>>> 3. Why does the client fail to write to the volume rather than >>>>>> (I'm assuming) trying the second (or third I guess) node to write >>>>>> to the volume? 
>>>>>> >>>>>> >>>>>> We are also seeing the following errors repeated a lot in the >>>>>> logs, both when the volumes are working and when there's an issue >>>>>> in the brick log >>>>>> (/var/log/glusterfs/bricks/mnt-gluster-storage-staging_static.log): >>>>>> >>>>>> [2018-09-03 01:58:35.128923] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.128957] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3d60, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.128983] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.129016] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3e2a, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.129042] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.129077] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3ef6, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.129149] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> [2018-09-03 01:58:35.129191] E >>>>>> [rpcsvc.c:1378:rpcsvc_submit_generic] 0-rpc-service: failed to >>>>>> submit message (XID: 0x3fc6, Program: GlusterFS 4.x v1, ProgVers: >>>>>> 400, Proc: 29) to rpc-transport (tcp.staging_static-server) >>>>>> [2018-09-03 01:58:35.129219] E [server.c:137:server_submit_reply] >>>>>> (-->/usr/lib64/glusterfs/4.1.2/xlator/debug/io-stats.so(+0x1fd14) >>>>>> [0x7f8470319d14] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0x5f24a) >>>>>> [0x7f846bdde24a] >>>>>> -->/usr/lib64/glusterfs/4.1.2/xlator/protocol/server.so(+0xafce) >>>>>> [0x7f846bd89fce] ) 0-: Reply submission failed >>>>>> >>>>>> >>>>>> >>>>>> *Gluster volume information:* >>>>>> >>>>>> >>>>>> # gluster volume info staging_static >>>>>> >>>>>> Volume Name: staging_static >>>>>> Type: Replicate >>>>>> Volume ID: 7f3b8e91-afea-4fc6-be83-3399a089b6f3 >>>>>> Status: Started >>>>>> Snapshot Count: 0 >>>>>> Number of Bricks: 1 x (2 + 1) = 3 >>>>>> 
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: <first gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>> Brick2: <second gluster node.fqdn>:/mnt/gluster-storage/staging_static
>>>>>> Brick3: <third gluster node.fqdn>:/mnt/gluster-storage/staging_static (arbiter)
>>>>>> Options Reconfigured:
>>>>>> storage.fips-mode-rchecksum: true
>>>>>> cluster.self-heal-window-size: 16
>>>>>> cluster.shd-wait-qlength: 4096
>>>>>> cluster.shd-max-threads: 8
>>>>>> performance.cache-min-file-size: 2KB
>>>>>> performance.rda-cache-limit: 1GB
>>>>>> network.inode-lru-limit: 50000
>>>>>> server.outstanding-rpc-limit: 256
>>>>>> transport.listen-backlog: 2048
>>>>>> performance.write-behind-window-size: 512MB
>>>>>> performance.stat-prefetch: true
>>>>>> performance.io-thread-count: 16
>>>>>> performance.client-io-threads: true
>>>>>> performance.cache-size: 1GB
>>>>>> performance.cache-refresh-timeout: 60
>>>>>> performance.cache-invalidation: true
>>>>>> cluster.use-compound-fops: true
>>>>>> cluster.readdir-optimize: true
>>>>>> cluster.lookup-optimize: true
>>>>>> cluster.favorite-child-policy: size
>>>>>> cluster.eager-lock: true
>>>>>> client.event-threads: 4
>>>>>> nfs.disable: on
>>>>>> transport.address-family: inet
>>>>>> diagnostics.brick-log-level: ERROR
>>>>>> diagnostics.client-log-level: ERROR
>>>>>> features.cache-invalidation-timeout: 300
>>>>>> features.cache-invalidation: true
>>>>>> network.ping-timeout: 15
>>>>>> performance.cache-max-file-size: 3MB
>>>>>> performance.md-cache-timeout: 300
>>>>>> server.event-threads: 4
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sam McLeod (protoporpoise on IRC)
>>>>>> https://smcleod.net
>>>>>> https://twitter.com/s_mcleod
>>>>>>
>>>>>> Words are my own opinions and do not necessarily represent those
>>>>>> of my employer or partners.
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
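As an aside to Sam's questions about why the misbehaving node is never flagged: the cluster and heal state he refers to would normally be checked with something like the commands below, shown against the staging_static volume from the output above - and by his report they all looked healthy here. A sketch only:

    # peer membership and connectivity as glusterd sees it
    gluster peer status

    # per-brick process and port status for the volume
    gluster volume status staging_static

    # files with pending heals (a long-stuck entry here can hint at a stale lock)
    gluster volume heal staging_static info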
Ravishankar N
2018-Oct-24 11:57 UTC
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2131 sent = <datestamp>. timeout = 1800
On 10/24/2018 05:16 PM, Hoggins! wrote:
> Thank you, it's working as expected.
>
> I guess it's only safe to put cluster.data-self-heal back on when I get
> an updated version of GlusterFS?

Yes, correct. Also, you would still need to restart shd whenever you hit
this issue until you upgrade.
-Ravi

>
>     Hoggins!
>
> On 24/10/2018 at 11:53, Ravishankar N wrote:
>> On 10/24/2018 02:38 PM, Hoggins! wrote:
>>> Thanks, that's helping a lot, I will do that.
>>>
>>> One more question: should the glustershd restart be performed on the
>>> arbiter only, or on each node of the cluster?
>> If you do a 'gluster volume start volname force' it will restart the
>> shd on all nodes.
>> -Ravi
>> [...]
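In other words, until the fixed packages are installed the interim routine is just the shd restart, and only after upgrading is it safe to turn the option back on. A sketch, again with 'testvol' as a placeholder volume name:

    # whenever the hang reappears before the upgrade: restart glustershd
    gluster volume start testvol force

    # after upgrading to a release carrying the fix for bug 1637802:
    gluster volume set testvol cluster.data-self-heal on

    # optionally confirm the current value of the option
    gluster volume get testvol cluster.data-self-heal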
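For completeness, the two alternatives Ravi mentioned earlier for releasing a stale lock look roughly like the following. Treat this as a sketch only: the clear-locks syntax is from memory (check 'gluster volume help' on your version), the path and range are placeholders, and Ravi's advice is that restarting glustershd is the safer option.

    # dump brick state, including currently held locks, for inspection
    gluster volume statedump staging_static

    # clear granted inode locks on a specific file (placeholder path/range)
    gluster volume clear-locks staging_static /path/to/stuck/file kind granted inode 0,0-0

    # or let bricks revoke locks held longer than N seconds (0 disables)
    gluster volume set staging_static features.locks-revocation-secs 60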