Ryan Lee
2018-Feb-28 21:30 UTC
[Gluster-users] Intermittent mount disconnect due to socket poller error
We've been on the Gluster 3.7 series for several years with things pretty stable. Given that it's reached EOL, yesterday I upgraded to 3.13.2. Every Gluster mount and server was disabled then brought back up after the upgrade, changing the op-version to 31302 and then trying it all out.

It went poorly. Every sizable read and write (hundreds of MB) led to 'Transport endpoint not connected' errors on the command line and immediate unavailability of the mount. After searching unsuccessfully for similar problems with solutions, I ended up downgrading to 3.12.6 and changing the op-version to 31202. That brought us back to usability, with the majority of those operations succeeding enough to consider it online, but there are still occasional mount disconnects that we never saw with 3.7 - about 6 in the past 18 hours. These disconnects also never seem to recover on their own; manually remounting reconnects immediately. They only disconnect the affected client, though several clients have disconnected at the same time during simultaneous activity. The lower-level log info seems to indicate a socket problem, potentially broken on the client side based on timing (but the timing is narrow, and I won't claim the clocks are that well synchronized across all our servers). The client and one server claim a socket polling error with no data available, and the other server claims a writev error. This seems to lead the client to the 'all subvolumes are down' state, even though all other clients are still connected. Has anybody run into this? Did I miss anything moving so many versions ahead?

I've included the output of volume info and some excerpts from the logs. We have two servers running glusterd and two replica volumes with a brick on each server. Both experience disconnects; there are about 10 clients for each, with one using both. We use SSL over internal IPv4. Names in all caps were replaced, as were IP addresses.

Let me know if there's anything else I can provide.

% gluster v info VOL
Volume Name: VOL
Type: Replicate
Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: SERVER1:/glusterfs/VOL-brick1/data
Brick2: SERVER2:/glusterfs/VOL-brick2/data
Options Reconfigured:
config.transport: tcp
features.selinux: off
transport.address-family: inet
nfs.disable: on
client.ssl: on
performance.readdir-ahead: on
auth.ssl-allow: [NAMES, including CLIENT]
server.ssl: on
ssl.certificate-depth: 3

Log excerpts (there was nothing related in glusterd.log):

CLIENT:/var/log/glusterfs/mnt-VOL.log
[2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
[2018-02-28 19:35:58.477154] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-02-28 19:35:58.486146] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67 times>
<lots of saved_frames_unwind messages>
[2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
[2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)

<manual umount / mount>

SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
[2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller] 0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.380530] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply] 0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa4e, Program: GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.uploads-server)
[2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) [0x7f97bd37ee6a] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f97bcf1f4c8] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) [0x7f97bcf0abd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.381196] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
[2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update] 0-/glusterfs/uploads-brick2/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login: connecting user name: CLIENT

SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
[2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv] 0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
[2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller] 0-tcp.VOL-server: poll error on socket
[2018-02-28 19:35:58.509957] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply] 0-tcp.VOL-server: not connected (priv->connected = -1)
[2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x4b3f, Program: GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.VOL-server)
[2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) [0x7f85bb7a8e6a] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f85bb3494c8] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) [0x7f85bb334bd5] ) 0-: Reply submission failed
[2018-02-28 19:35:58.510409] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
[2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update] 0-/glusterfs/uploads-brick1/data: allowed = "*", received addr = "CLIENT"
[2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login: connecting user name: CLIENT
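For reference, the op-version changes and the manual remounts were done with the standard CLI, roughly as below. This is a sketch rather than the exact invocations: the /mnt/VOL mount point is only inferred from the mnt-VOL.log client log name, and SERVER1 stands in for whichever server the mount points at.

% gluster volume get all cluster.op-version        # check the current cluster op-version
% gluster volume set all cluster.op-version 31302  # after the upgrade to 3.13.2 (31202 after the downgrade to 3.12.6)
% umount /mnt/VOL                                  # recover a disconnected client mount (assumed mount point)
% mount -t glusterfs SERVER1:/VOL /mnt/VOL         # remounting reconnects immediately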
Raghavendra Gowdappa
2018-Mar-01 02:38 UTC
[Gluster-users] Intermittent mount disconnect due to socket poller error
Is it possible to attach logfiles of problematic client and bricks?

On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee <ryanlee at zepheira.com> wrote:
> We've been on the Gluster 3.7 series for several years with things pretty stable. Given that it's reached EOL, yesterday I upgraded to 3.13.2.
> [remainder of the original message trimmed; quoted in full above]
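By logfiles I mean the complete versions of the files excerpted in your message, collected from the client and each server for attachment. A sketch of how that might look (paths are taken from your excerpts; the archive names are arbitrary):

% tar czf client-VOL-log.tar.gz /var/log/glusterfs/mnt-VOL.log               # on CLIENT
% tar czf server1-brick-log.tar.gz /var/log/glusterfs/bricks/VOL-brick1.log  # on SERVER1
% tar czf server2-brick-log.tar.gz /var/log/glusterfs/bricks/VOL-brick2.log  # on SERVER2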