Ryan Lee
2018-Mar-07 02:45 UTC
[Gluster-users] Intermittent mount disconnect due to socket poller error
I happened to review the status of volume clients and realized they were reporting a mix of different op-versions: 3.13 clients were still connecting to the downgraded 3.12 server (likely a timing issue between downgrading clients and mounting volumes). Remounting the reported clients has resulted in the correct op-version all around and about a week free of these errors. (A sketch of the relevant commands follows the quoted thread below.)

On 2018-03-01 12:38, Ryan Lee wrote:
> Thanks for your response - is there more that would be useful in
> addition to what I already attached? We're logging at default level on
> the brick side and at error on clients. I could turn it up for a few
> days to try to catch this problem in action (it's happened several more
> times since I first wrote).
>
> On 2018-02-28 18:38, Raghavendra Gowdappa wrote:
>> Is it possible to attach logfiles of problematic client and bricks?
>>
>> On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee <ryanlee at zepheira.com> wrote:
>>
>>     We've been on the Gluster 3.7 series for several years with things
>>     pretty stable. Given that it's reached EOL, yesterday I upgraded to
>>     3.13.2. Every Gluster mount and server was disabled then brought
>>     back up after the upgrade, changing the op-version to 31302 and
>>     then trying it all out.
>>
>>     It went poorly. Every sizable read and write (hundreds of MB) led
>>     to 'Transport endpoint not connected' errors on the command line
>>     and immediate unavailability of the mount. After unsuccessfully
>>     trying to search for similar problems with solutions, I ended up
>>     downgrading to 3.12.6 and changing the op-version to 31202. That
>>     brought us back to usability, with the majority of those operations
>>     succeeding enough to consider it online, but there are still
>>     occasional mount disconnects that we never saw with 3.7 - about 6
>>     in the past 18 hours. The disconnected mounts never seem to come
>>     back on their own; manually remounting reconnects immediately. Each
>>     disconnect affects only the one client, though some clients have
>>     disconnected at the same time during simultaneous activity. The
>>     lower-level log info seems to indicate a socket problem, potentially
>>     broken on the client side based on timing (but the timing is narrow,
>>     and I won't claim the clocks are that well synchronized across all
>>     our servers). The client and one server claim a socket polling error
>>     with no data available, and the other server claims a writev error.
>>     This seems to lead the client to the 'all subvolumes are down'
>>     state, even though all other clients are still connected. Has
>>     anybody run into this? Did I miss anything moving so many versions
>>     ahead?
>>
>>     I've included the output of volume info and some excerpts from the
>>     logs. We have two servers running glusterd and two replica volumes
>>     with a brick on each server. Both experience disconnects; there are
>>     about 10 clients for each, with one using both. We use SSL over
>>     internal IPv4. Names in all caps were replaced, as were IP
>>     addresses.
>>
>>     Let me know if there's anything else I can provide.
>>
>>     % gluster v info VOL
>>     Volume Name: VOL
>>     Type: Replicate
>>     Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
>>     Status: Started
>>     Snapshot Count: 0
>>     Number of Bricks: 1 x 2 = 2
>>     Transport-type: tcp
>>     Bricks:
>>     Brick1: SERVER1:/glusterfs/VOL-brick1/data
>>     Brick2: SERVER2:/glusterfs/VOL-brick2/data
>>     Options Reconfigured:
>>     config.transport: tcp
>>     features.selinux: off
>>     transport.address-family: inet
>>     nfs.disable: on
>>     client.ssl: on
>>     performance.readdir-ahead: on
>>     auth.ssl-allow: [NAMES, including CLIENT]
>>     server.ssl: on
>>     ssl.certificate-depth: 3
>>
>>     Log excerpts (there was nothing related in glusterd.log):
>>
>>     CLIENT:/var/log/glusterfs/mnt-VOL.log
>>     [2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
>>     [2018-02-28 19:35:58.477154] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
>>     [2018-02-28 19:35:58.486146] E [MSGID: 101046] [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67 times>
>>     <lots of saved_frames_unwind messages>
>>     [2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
>>     [2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller] 0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
>>
>>     <manual umount / mount>
>>
>>     SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
>>     [2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller] 0-tcp.VOL-server: poll error on socket
>>     [2018-02-28 19:35:58.380530] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
>>     [2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply] 0-tcp.VOL-server: not connected (priv->connected = -1)
>>     [2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa4e, Program: GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.uploads-server)
>>     [2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) [0x7f97bd37ee6a] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f97bcf1f4c8] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) [0x7f97bcf0abd5] ) 0-: Reply submission failed
>>     [2018-02-28 19:35:58.381196] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
>>     [2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update] 0-/glusterfs/uploads-brick2/data: allowed = "*", received addr "CLIENT"
>>     [2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login: connecting user name: CLIENT
>>
>>     SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
>>     [2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv] 0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
>>     [2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller] 0-tcp.VOL-server: poll error on socket
>>     [2018-02-28 19:35:58.509957] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
>>     [2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply] 0-tcp.VOL-server: not connected (priv->connected = -1)
>>     [2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x4b3f, Program: GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport (tcp.VOL-server)
>>     [2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a) [0x7f85bb7a8e6a] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8) [0x7f85bb3494c8] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5) [0x7f85bb334bd5] ) 0-: Reply submission failed
>>     [2018-02-28 19:35:58.510409] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
>>     [2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update] 0-/glusterfs/uploads-brick1/data: allowed = "*", received addr "CLIENT"
>>     [2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login: connecting user name: CLIENT
>>     _______________________________________________
>>     Gluster-users mailing list
>>     Gluster-users at gluster.org
>>     http://lists.gluster.org/mailman/listinfo/gluster-users
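For anyone hitting the same mismatch, a minimal sketch of the op-version check and remount described above, assuming the client list comes from "gluster volume status VOL clients" (whose exact columns vary by release) and using an illustrative mount point /mnt/VOL and SERVER1 as the volfile server - neither detail is from the thread:

  % gluster volume get all cluster.op-version    # cluster-wide op-version (31202 after the downgrade)
  % gluster volume status VOL clients            # per-brick client list; shows each client's op-version
  # on a client still reporting the 3.13 op-version (mount point is illustrative):
  % umount /mnt/VOL
  % mount -t glusterfs SERVER1:/VOL /mnt/VOL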
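Similarly, a sketch of turning up logging for a few days to catch a disconnect in action, as discussed in the quoted reply; DEBUG is an illustrative choice of level, and the reset to ERROR assumes the client setting described above:

  % gluster volume set VOL diagnostics.client-log-level DEBUG   # clients currently log at ERROR
  % gluster volume set VOL diagnostics.brick-log-level DEBUG    # bricks currently log at the default
  # ...reproduce a disconnect, collect /var/log/glusterfs/mnt-VOL.log and the brick logs...
  % gluster volume set VOL diagnostics.client-log-level ERROR
  % gluster volume reset VOL diagnostics.brick-log-level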