songxin
2016-Nov-25 06:33 UTC
[Gluster-users] not to reconnect between client and server because of race condition
Hi Atin I found a problem, that is about client(glusterfs) will not trying to reconnect to server(glusterfsd) after disconnect. Actually, it seems caused by race condition. Precondition The glusterfs version is 3.7.6. I create a replicate volume using two node, A node and B node.One brick is on A node and another brick is on B node. A node ip:10.32.1.144 B node ip:10.32.0.48 The phenomenon is following. Firstly, the client(glusterfs) on A board disconnect with server(glusterfsd) on B board.The log is following. ... readv on 10.32.0.48:49309 failed (No data available) ... And then the client(glusterfs) on A board disconnect with server(glusterfsd) on A board.The log is following. ... readv on 10.32.1.144:49391 failed (Connection reset by peer) ... After that, all operation in mount point will show "Transport endpoint is not connected" until client reconnect with server(glusterfsd) on B board. The client log is following.And I have highlight the important line. ... [2016-10-31 04:06:03.626047] W [socket.c:588:__socket_rwv] 0-c_glusterfs-client-9: readv on 10.32.1.144:49391 failed (Connection reset by peer) [2016-10-31 04:06:03.627345] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] ( --> /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] ( --> /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] ( --> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] ( --> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-10-31 04:06:03.626033 (xid=0x7f5e) [2016-10-31 04:06:03.627395] E [MSGID: 114031] [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 0-c_glusterfs-client-9: remote operation failed [Transport endpoint is not connected] [2016-10-31 04:06:03.628381] I [socket.c:3308:socket_submit_request] 0-c_glusterfs-client-9: not connected (priv->connected = 0) [2016-10-31 04:06:03.628432] W [rpc-clnt.c:1586:rpc_clnt_submit] 0-c_glusterfs-client-9: failed to submit rpc-request (XID: 0x7f5f Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (c_glusterfs-client-9) [2016-10-31 04:06:03.628466] E [MSGID: 114031] [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 0-c_glusterfs-client-9: remote operation failed [Transport endpoint is not connected] [2016-10-31 04:06:03.628475] I [MSGID: 108019] [afr-lk-common.c:1086:afr_lock_blocking] 0-c_glusterfs-replicate-0: unable to lock on even one child [2016-10-31 04:06:03.628539] I [MSGID: 108019] [afr-transaction.c:1224:afr_post_blocking_inodelk_cbk] 0-c_glusterfs-replicate-0: Blocking inodelks failed. [2016-10-31 04:06:03.628630] W [fuse-bridge.c:1282:fuse_err_cbk] 0-glusterfs-fuse: 20790: FLUSH() ERR => -1 (Transport endpoint is not connected) [2016-10-31 04:06:03.629149] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-10-31 04:06:03.624346 (xid=0x7f5a) [2016-10-31 04:06:03.629183] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-c_glusterfs-client-9: changing port to 49391 (from 0) [2016-10-31 04:06:03.629210] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote operation failed. Path: /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews (b0e5a94e-a432-4dce-b86f-a551555780a2) [Transport endpoint is not connected] [2016-10-31 04:06:03.629266] I [socket.c:3308:socket_submit_request] 0-c_glusterfs-client-9: not connected (priv->connected = 255) [2016-10-31 04:06:03.629277] I [MSGID: 109063] [dht-layout.c:702:dht_layout_normalize] 0-c_glusterfs-dht: Found anomalies in /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews (gfid = b0e5a94e-a432-4dce-b86f-a551555780a2). Holes=1 overlaps=0 [2016-10-31 04:06:03.629293] W [rpc-clnt.c:1586:rpc_clnt_submit] 0-c_glusterfs-client-9: failed to submit rpc-request (XID: 0x7f62 Program: GlusterFS 3.3, ProgVers: 330, Proc: 41) to rpc-transport (c_glusterfs-client-9) [2016-10-31 04:06:03.629333] W [fuse-resolve.c:149:fuse_resolve_gfid_cbk] 0-fuse: b0e5a94e-a432-4dce-b86f-a551555780a2: failed to resolve (Transport endpoint is not connected) [2016-10-31 04:06:03.629363] W [fuse-bridge.c:3385:fuse_getxattr_resume] 0-glusterfs-fuse: 20784: GETXATTR b0e5a94e-a432-4dce-b86f-a551555780a2/70366685109140 (system.posix_acl_access) resolution failed [2016-10-31 04:06:03.629729] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-10-31 04:06:03.624568 (xid=0x7f5b) [2016-10-31 04:06:03.629779] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote operation failed. Path: /pmd/2 (fa866805-ab22-4afe-8acb-528d15ce4117) [Transport endpoint is not connected] [2016-10-31 04:06:03.629784] E [MSGID: 114031] [client-rpc-fops.c:2883:client3_3_opendir_cbk] 0-c_glusterfs-client-9: remote operation failed. Path: /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews (b0e5a94e-a432-4dce-b86f-a551555780a2) [Transport endpoint is not connected] [2016-10-31 04:06:03.629836] I [MSGID: 109063] [dht-layout.c:702:dht_layout_normalize] 0-c_glusterfs-dht: Found anomalies in /pmd/2 (gfid = fa866805-ab22-4afe-8acb-528d15ce4117). Holes=1 overlaps=0 [2016-10-31 04:06:03.629857] W [fuse-bridge.c:980:fuse_fd_cbk] 0-glusterfs-fuse: 20794: OPENDIR() /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews => -1 (Transport endpoint is not connected) [2016-10-31 04:06:03.629896] W [fuse-resolve.c:149:fuse_resolve_gfid_cbk] 0-fuse: fa866805-ab22-4afe-8acb-528d15ce4117: failed to resolve (Transport endpoint is not connected) [2016-10-31 04:06:03.629990] E [fuse-bridge.c:787:fuse_getattr_resume] 0-glusterfs-fuse: 20785: GETATTR 70366685017568 (fa866805-ab22-4afe-8acb-528d15ce4117) resolution failed [2016-10-31 04:06:03.630164] W [defaults.c:2212:default_releasedir] (-->/usr/lib64/glusterfs/3.7.6/xlator/protocol/client.so(client_local_wipe-0x42d64) [0x3fff86c9bc5c] -->/usr/lib64/libglusterfs.so.0(fd_unref-0x7ba98) [0x3fff8abb7ad0] -->/usr/lib64/libglusterfs.so.0(default_releasedir-0x9d6d8) [0x3fff8ab93c58] ) 0-fuse: xlator does not implement releasedir_cbk [2016-10-31 04:06:03.630367] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2016-10-31 04:06:03.625675 (xid=0x7f5c) [2016-10-31 04:06:03.630368] W [MSGID: 114031] [client-rpc-fops.c:1569:client3_3_fstat_cbk] 0-c_glusterfs-client-9: remote operation failed [Transport endpoint is not connected] [2016-10-31 04:06:03.630497] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote operation failed. Path: /configuration/oamrdncy.cfg (200b0bb2-d903-4924-b025-262d805c0f2f) [Transport endpoint is not connected] [2016-10-31 04:06:03.630871] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-c_glusterfs-client-9: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2016-10-31 04:06:03.631488] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-c_glusterfs-client-9: Connected to c_glusterfs-client-9, attached to remote volume '/opt/lvmdir/c2/brick'. [2016-10-31 04:06:03.631523] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-c_glusterfs-client-9: Server and Client lk-version numbers are not same, reopening the fds [2016-10-31 04:06:03.631555] I [MSGID: 114042] [client-handshake.c:1056:client_post_handshake] 0-c_glusterfs-client-9: 6 fds open - Delaying child_up until they are re-opened [2016-10-31 04:06:03.631475] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote operation failed. Path: /configuration/ethmac.cfg (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected] [2016-10-31 04:06:03.632256] I [MSGID: 114060] [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: reopendir on <gfid:00000000-0000-0000-0000-000000000001> succeeded (fd = 1) [2016-10-31 04:06:03.632728] I [MSGID: 114060] [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: reopendir on <gfid:257add5a-9b87-4014-8a3a-09dd0a699eec> succeeded (fd = 2) [2016-10-31 04:06:03.632790] I [MSGID: 114060] [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: reopendir on <gfid:6fd47fe9-a566-47c7-8df7-e7a5e379101f> succeeded (fd = 3) [2016-10-31 04:06:03.632847] I [MSGID: 114060] [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: reopendir on <gfid:fa866805-ab22-4afe-8acb-528d15ce4117> succeeded (fd = 4) [2016-10-31 04:06:03.632905] I [MSGID: 114060] [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: reopendir on <gfid:139f5034-5b42-4fb8-9372-6dd7cac1e674> succeeded (fd = 5) [2016-10-31 04:06:03.632935] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 0-c_glusterfs-client-9: last fd open'd/lock-self-heal'd - notifying CHILD-UP [2016-10-31 04:06:03.633028] I [MSGID: 108005] [afr-common.c:3841:afr_notify] 0-c_glusterfs-replicate-0: Subvolume 'c_glusterfs-client-9' came back up; going online. [2016-10-31 04:06:03.633386] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-c_glusterfs-client-9: Server lk version = 1 [2016-10-31 04:06:03.634579] W [fuse-bridge.c:758:fuse_attr_cbk] 0-glusterfs-fuse: 20795: FSTAT() /configuration/oamrdncy.cfg => -1 (Transport endpoint is not connected) [2016-10-31 04:06:03.634889] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) op(XATTROP(33)) called at 2016-10-31 04:06:03.625716 (xid=0x7f5d) [2016-10-31 04:06:03.634950] W [MSGID: 114031] [client-rpc-fops.c:1845:client3_3_xattrop_cbk] 0-c_glusterfs-client-9: remote operation failed. Path: /configuration/oamrdncy.cfg (200b0bb2-d903-4924-b025-262d805c0f2f) [2016-10-31 04:06:03.635134] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-c_glusterfs-replicate-0: Unreadable subvolume -1 found with event generation 3 for gfid 200b0bb2-d903-4924-b025-262d805c0f2f. (Possible split-brain) [2016-10-31 04:06:03.635364] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-c_glusterfs-client-9: disconnected from c_glusterfs-client-9. Client process will keep trying to connect to glusterd until brick's port is available [2016-10-31 04:06:03.635420] E [MSGID: 108006] [afr-common.c:3880:afr_notify] 0-c_glusterfs-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up. [2016-10-31 04:06:03.635762] I [MSGID: 108006] [afr-common.c:4008:afr_local_init] 0-c_glusterfs-replicate-0: no subvolumes up [2016-10-31 04:06:03.641160] W [fuse-bridge.c:462:fuse_entry_cbk] 0-glusterfs-fuse: 20808: LOOKUP() /license => -1 (Transport endpoint is not connected) The message "I [MSGID: 108006] [afr-common.c:4008:afr_local_init] 0-c_glusterfs-replicate-0: no subvolumes up" repeated 6 times between [2016-10-31 04:06:03.640564] and [2016-10-31 04:06:03.641444] [2016-10-31 04:06:03.641484] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-c_glusterfs-client-9: remote operation failed [Invalid argument] [2016-10-31 04:06:03.641562] I [MSGID: 108006] [afr-common.c:4008:afr_local_init] 0-c_glusterfs-replicate-0: no subvolumes up [2016-10-31 04:06:03.641573] E [MSGID: 108010] [afr-lk-common.c:665:afr_unlock_inodelk_cbk] 0-c_glusterfs-replicate-0: path=/configuration/oamrdncy.cfg gfid=200b0bb2-d903-4924-b025-262d805c0f2f: unlock failed on subvolume c_glusterfs-client-9 with lock owner 6cd0c189ff3f0000 [Invalid argument] [2016-10-31 04:06:03.641655] I [MSGID: 108006] [afr-common.c:4008:afr_local_init] 0-c_glusterfs-replicate-0: no subvolumes up [2016-10-31 04:06:03.641716] W [fuse-bridge.c:462:fuse_entry_cbk] 0-glusterfs-fuse: 20809: LOOKUP() /lost+found => -1 (Transport endpoint is not connected) [2016-10-31 04:06:03.642387] W [fuse-bridge.c:462:fuse_entry_cbk] 0-glusterfs-fuse: 20813: LOOKUP() /lost+found => -1 (Transport endpoint is not connected) ... Analysis I think the disconnection and reconnectiong between client(glusterfs on A board) and server(glusterfsd on A board) has a race condition as below. process 1 process 2 [2016-10-31 04:06:03.626047] readv err show that the disconnect happen between client and server. [2016-10-31 04:06:03.627345] It will call rpc_clnt_notify(). conn->connected = 0 in rpc_clnt_connection_cleanup. Creating a timer which will reconnect after 10 second. [2016-10-31 04:06:03.628381] calling rpc_clnt_submit(). it will call rpc_transport_connect because conn->connected is 0. [2016-10-31 04:06:03.631488] the new connection has been successful. [2016-10-31 04:06:03.633028] calling afr_notify(). In the case GF_EVENT_CHILD_UP the new new connection is set by "priv->child_up[idx] = 1" [2016-10-31 04:06:03.635364] continue to run client_rpc_notify() will notify RPC_CLNT_DISCONNECT in case RPC_CLNT_DISCONNECT. [2016-10-31 04:06:03.635420] In afr_notify() will set "priv->child_up[idx] = 0"? which is the new connection, in the case RPC_CLNT_DISCONNECT. And the all the subvolume is down. [2016-10-31 04:06:03.635762] All the access will be failed because no subvolumes up.And rpc_transport_connect() will not be called because the conn->connected has been set to 1 in process 2 Please help me to check my analysis. Thanks, Xin -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20161125/7e362756/attachment.html>
Kaushal M
2016-Nov-25 06:57 UTC
[Gluster-users] not to reconnect between client and server because of race condition
On Fri, Nov 25, 2016 at 12:03 PM, songxin <songxin_1980 at 126.com> wrote:> Hi Atin > I found a problem, that is about client(glusterfs) will not trying to > reconnect to server(glusterfsd) after disconnect. > Actually, it seems caused by race condition. > > > Precondition > > The glusterfs version is 3.7.6. > I create a replicate volume using two node, A node and B node.One brick is > on A node and another brick is on B node. > A node ip:10.32.1.144 > B node ip:10.32.0.48 > > > The phenomenon is following. > > Firstly, the client(glusterfs) on A board disconnect with server(glusterfsd) > on B board.The log is following. > ... > readv on 10.32.0.48:49309 failed (No data available) > ... > > And then the client(glusterfs) on A board disconnect with server(glusterfsd) > on A board.The log is following. > ... > readv on 10.32.1.144:49391 failed (Connection reset by peer) > ... > > After that, all operation in mount point will show "Transport endpoint is > not connected" until client reconnect with server(glusterfsd) on B board. > > > The client log is following.And I have highlight the important line. > ... > [2016-10-31 04:06:03.626047] W [socket.c:588:__socket_rwv] > 0-c_glusterfs-client-9: readv on 10.32.1.144:49391 failed (Connection reset > by peer) > [2016-10-31 04:06:03.627345] E [rpc-clnt.c:362:saved_frames_unwind] (--> > /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] ( > --> > /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] ( > --> > /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] ( > --> > /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] > ( > --> > /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) > > 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) > > op(FINODELK(30)) called at 2016-10-31 04:06:03.626033 (xid=0x7f5e) > > [2016-10-31 04:06:03.627395] E [MSGID: 114031] > [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 0-c_glusterfs-client-9: > remote operation failed [Transport endpoint is not connected] > > [2016-10-31 04:06:03.628381] I [socket.c:3308:socket_submit_request] > 0-c_glusterfs-client-9: not connected (priv->connected = 0) > > [2016-10-31 04:06:03.628432] W [rpc-clnt.c:1586:rpc_clnt_submit] > 0-c_glusterfs-client-9: failed to submit rpc-request (XID: 0x7f5f Program: > GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport > (c_glusterfs-client-9) > > [2016-10-31 04:06:03.628466] E [MSGID: 114031] > [client-rpc-fops.c:1673:client3_3_finodelk_cbk] 0-c_glusterfs-client-9: > remote operation failed [Transport endpoint is not connected] > [2016-10-31 04:06:03.628475] I [MSGID: 108019] > [afr-lk-common.c:1086:afr_lock_blocking] 0-c_glusterfs-replicate-0: unable > to lock on even one child > > [2016-10-31 04:06:03.628539] I [MSGID: 108019] > [afr-transaction.c:1224:afr_post_blocking_inodelk_cbk] > 0-c_glusterfs-replicate-0: Blocking inodelks failed. > > [2016-10-31 04:06:03.628630] W [fuse-bridge.c:1282:fuse_err_cbk] > 0-glusterfs-fuse: 20790: FLUSH() ERR => -1 (Transport endpoint is not > connected) > [2016-10-31 04:06:03.629149] E [rpc-clnt.c:362:saved_frames_unwind] (--> > /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> > /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] > (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) > 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) > op(LOOKUP(27)) called at 2016-10-31 04:06:03.624346 (xid=0x7f5a) > > [2016-10-31 04:06:03.629183] I [rpc-clnt.c:1847:rpc_clnt_reconfig] > 0-c_glusterfs-client-9: changing port to 49391 (from 0) > > [2016-10-31 04:06:03.629210] W [MSGID: 114031] > [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote > operation failed. Path: > /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews > (b0e5a94e-a432-4dce-b86f-a551555780a2) [Transport endpoint is not connected] > [2016-10-31 04:06:03.629266] I [socket.c:3308:socket_submit_request] > 0-c_glusterfs-client-9: not connected (priv->connected = 255) > [2016-10-31 04:06:03.629277] I [MSGID: 109063] > [dht-layout.c:702:dht_layout_normalize] 0-c_glusterfs-dht: Found anomalies > in /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews (gfid > b0e5a94e-a432-4dce-b86f-a551555780a2). Holes=1 overlaps=0 > [2016-10-31 04:06:03.629293] W [rpc-clnt.c:1586:rpc_clnt_submit] > 0-c_glusterfs-client-9: failed to submit rpc-request (XID: 0x7f62 Program: > GlusterFS 3.3, ProgVers: 330, Proc: 41) to rpc-transport > (c_glusterfs-client-9) > [2016-10-31 04:06:03.629333] W [fuse-resolve.c:149:fuse_resolve_gfid_cbk] > 0-fuse: b0e5a94e-a432-4dce-b86f-a551555780a2: failed to resolve (Transport > endpoint is not connected) > [2016-10-31 04:06:03.629363] W [fuse-bridge.c:3385:fuse_getxattr_resume] > 0-glusterfs-fuse: 20784: GETXATTR > b0e5a94e-a432-4dce-b86f-a551555780a2/70366685109140 > (system.posix_acl_access) resolution failed > [2016-10-31 04:06:03.629729] E [rpc-clnt.c:362:saved_frames_unwind] (--> > /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> > /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] > (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) > 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) > op(LOOKUP(27)) called at 2016-10-31 04:06:03.624568 (xid=0x7f5b) > [2016-10-31 04:06:03.629779] W [MSGID: 114031] > [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote > operation failed. Path: /pmd/2 (fa866805-ab22-4afe-8acb-528d15ce4117) > [Transport endpoint is not connected] > [2016-10-31 04:06:03.629784] E [MSGID: 114031] > [client-rpc-fops.c:2883:client3_3_opendir_cbk] 0-c_glusterfs-client-9: > remote operation failed. Path: > /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews > (b0e5a94e-a432-4dce-b86f-a551555780a2) [Transport endpoint is not connected] > [2016-10-31 04:06:03.629836] I [MSGID: 109063] > [dht-layout.c:702:dht_layout_normalize] 0-c_glusterfs-dht: Found anomalies > in /pmd/2 (gfid = fa866805-ab22-4afe-8acb-528d15ce4117). Holes=1 overlaps=0 > [2016-10-31 04:06:03.629857] W [fuse-bridge.c:980:fuse_fd_cbk] > 0-glusterfs-fuse: 20794: OPENDIR() > /loadmodules_norepl/CXC1725605_P93A001/cello/emasviews => -1 (Transport > endpoint is not connected) > [2016-10-31 04:06:03.629896] W [fuse-resolve.c:149:fuse_resolve_gfid_cbk] > 0-fuse: fa866805-ab22-4afe-8acb-528d15ce4117: failed to resolve (Transport > endpoint is not connected) > [2016-10-31 04:06:03.629990] E [fuse-bridge.c:787:fuse_getattr_resume] > 0-glusterfs-fuse: 20785: GETATTR 70366685017568 > (fa866805-ab22-4afe-8acb-528d15ce4117) resolution failed > [2016-10-31 04:06:03.630164] W [defaults.c:2212:default_releasedir] > (-->/usr/lib64/glusterfs/3.7.6/xlator/protocol/client.so(client_local_wipe-0x42d64) > [0x3fff86c9bc5c] -->/usr/lib64/libglusterfs.so.0(fd_unref-0x7ba98) > [0x3fff8abb7ad0] -->/usr/lib64/libglusterfs.so.0(default_releasedir-0x9d6d8) > [0x3fff8ab93c58] ) 0-fuse: xlator does not implement releasedir_cbk > [2016-10-31 04:06:03.630367] E [rpc-clnt.c:362:saved_frames_unwind] (--> > /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> > /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] > (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) > 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) > op(LOOKUP(27)) called at 2016-10-31 04:06:03.625675 (xid=0x7f5c) > [2016-10-31 04:06:03.630368] W [MSGID: 114031] > [client-rpc-fops.c:1569:client3_3_fstat_cbk] 0-c_glusterfs-client-9: remote > operation failed [Transport endpoint is not connected] > [2016-10-31 04:06:03.630497] W [MSGID: 114031] > [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote > operation failed. Path: /configuration/oamrdncy.cfg > (200b0bb2-d903-4924-b025-262d805c0f2f) [Transport endpoint is not connected] > [2016-10-31 04:06:03.630871] I [MSGID: 114057] > [client-handshake.c:1437:select_server_supported_programs] > 0-c_glusterfs-client-9: Using Program GlusterFS 3.3, Num (1298437), Version > (330) > > [2016-10-31 04:06:03.631488] I [MSGID: 114046] > [client-handshake.c:1213:client_setvolume_cbk] 0-c_glusterfs-client-9: > Connected to c_glusterfs-client-9, attached to remote volume > '/opt/lvmdir/c2/brick'. > > [2016-10-31 04:06:03.631523] I [MSGID: 114047] > [client-handshake.c:1224:client_setvolume_cbk] 0-c_glusterfs-client-9: > Server and Client lk-version numbers are not same, reopening the fds > [2016-10-31 04:06:03.631555] I [MSGID: 114042] > [client-handshake.c:1056:client_post_handshake] 0-c_glusterfs-client-9: 6 > fds open - Delaying child_up until they are re-opened > [2016-10-31 04:06:03.631475] W [MSGID: 114031] > [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-c_glusterfs-client-9: remote > operation failed. Path: /configuration/ethmac.cfg > (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected] > [2016-10-31 04:06:03.632256] I [MSGID: 114060] > [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: > reopendir on <gfid:00000000-0000-0000-0000-000000000001> succeeded (fd = 1) > [2016-10-31 04:06:03.632728] I [MSGID: 114060] > [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: > reopendir on <gfid:257add5a-9b87-4014-8a3a-09dd0a699eec> succeeded (fd = 2) > [2016-10-31 04:06:03.632790] I [MSGID: 114060] > [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: > reopendir on <gfid:6fd47fe9-a566-47c7-8df7-e7a5e379101f> succeeded (fd = 3) > [2016-10-31 04:06:03.632847] I [MSGID: 114060] > [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: > reopendir on <gfid:fa866805-ab22-4afe-8acb-528d15ce4117> succeeded (fd = 4) > [2016-10-31 04:06:03.632905] I [MSGID: 114060] > [client-handshake.c:820:client3_3_reopendir_cbk] 0-c_glusterfs-client-9: > reopendir on <gfid:139f5034-5b42-4fb8-9372-6dd7cac1e674> succeeded (fd = 5) > [2016-10-31 04:06:03.632935] I [MSGID: 114041] > [client-handshake.c:678:client_child_up_reopen_done] 0-c_glusterfs-client-9: > last fd open'd/lock-self-heal'd - notifying CHILD-UP > > [2016-10-31 04:06:03.633028] I [MSGID: 108005] > [afr-common.c:3841:afr_notify] 0-c_glusterfs-replicate-0: Subvolume > 'c_glusterfs-client-9' came back up; going online. > > [2016-10-31 04:06:03.633386] I [MSGID: 114035] > [client-handshake.c:193:client_set_lk_version_cbk] 0-c_glusterfs-client-9: > Server lk version = 1 > [2016-10-31 04:06:03.634579] W [fuse-bridge.c:758:fuse_attr_cbk] > 0-glusterfs-fuse: 20795: FSTAT() /configuration/oamrdncy.cfg => -1 > (Transport endpoint is not connected) > [2016-10-31 04:06:03.634889] E [rpc-clnt.c:362:saved_frames_unwind] (--> > /usr/lib64/libglusterfs.so.0(_gf_log_callingfn-0xb5c80)[0x3fff8ab79f58] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_unwind-0x1b7a0)[0x3fff8ab1dc90] (--> > /usr/lib64/libgfrpc.so.0(saved_frames_destroy-0x1b638)[0x3fff8ab1de10] (--> > /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup-0x19af8)[0x3fff8ab1fb18] > (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify-0x18e68)[0x3fff8ab20808] ))))) > 0-c_glusterfs-client-9: forced unwinding frame type(GlusterFS 3.3) > op(XATTROP(33)) called at 2016-10-31 04:06:03.625716 (xid=0x7f5d) > [2016-10-31 04:06:03.634950] W [MSGID: 114031] > [client-rpc-fops.c:1845:client3_3_xattrop_cbk] 0-c_glusterfs-client-9: > remote operation failed. Path: /configuration/oamrdncy.cfg > (200b0bb2-d903-4924-b025-262d805c0f2f) > [2016-10-31 04:06:03.635134] W [MSGID: 108008] > [afr-read-txn.c:250:afr_read_txn] 0-c_glusterfs-replicate-0: Unreadable > subvolume -1 found with event generation 3 for gfid > 200b0bb2-d903-4924-b025-262d805c0f2f. (Possible split-brain) > [2016-10-31 04:06:03.635364] I [MSGID: 114018] > [client.c:2042:client_rpc_notify] 0-c_glusterfs-client-9: disconnected from > c_glusterfs-client-9. Client process will keep trying to connect to glusterd > until brick's port is available > [2016-10-31 04:06:03.635420] E [MSGID: 108006] > [afr-common.c:3880:afr_notify] 0-c_glusterfs-replicate-0: All subvolumes are > down. Going offline until atleast one of them comes back up. > [2016-10-31 04:06:03.635762] I [MSGID: 108006] > [afr-common.c:4008:afr_local_init] 0-c_glusterfs-replicate-0: no subvolumes > up > [2016-10-31 04:06:03.641160] W [fuse-bridge.c:462:fuse_entry_cbk] > 0-glusterfs-fuse: 20808: LOOKUP() /license => -1 (Transport endpoint is not > connected) > The message "I [MSGID: 108006] [afr-common.c:4008:afr_local_init] > 0-c_glusterfs-replicate-0: no subvolumes up" repeated 6 times between > [2016-10-31 04:06:03.640564] and [2016-10-31 04:06:03.641444] > [2016-10-31 04:06:03.641484] E [MSGID: 114031] > [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-c_glusterfs-client-9: > remote operation failed [Invalid argument] > [2016-10-31 04:06:03.641562] I [MSGID: 108006] > [afr-common.c:4008:afr_local_init] 0-c_glusterfs-replicate-0: no subvolumes > up > [2016-10-31 04:06:03.641573] E [MSGID: 108010] > [afr-lk-common.c:665:afr_unlock_inodelk_cbk] 0-c_glusterfs-replicate-0: > path=/configuration/oamrdncy.cfg gfid=200b0bb2-d903-4924-b025-262d805c0f2f: > unlock failed on subvolume c_glusterfs-client-9 with lock owner > 6cd0c189ff3f0000 [Invalid argument] > [2016-10-31 04:06:03.641655] I [MSGID: 108006] > [afr-common.c:4008:afr_local_init] 0-c_glusterfs-replicate-0: no subvolumes > up > [2016-10-31 04:06:03.641716] W [fuse-bridge.c:462:fuse_entry_cbk] > 0-glusterfs-fuse: 20809: LOOKUP() /lost+found => -1 (Transport endpoint is > not connected) > [2016-10-31 04:06:03.642387] W [fuse-bridge.c:462:fuse_entry_cbk] > 0-glusterfs-fuse: 20813: LOOKUP() /lost+found => -1 (Transport endpoint is > not connected) > ... > > Analysis > > I think the disconnection and reconnectiong between client(glusterfs on A > board) and server(glusterfsd on A board) has a race condition as below. > > > process 1 > process 2 > > [2016-10-31 04:06:03.626047] readv > err show that > > the disconnect happen > > between client and server. > > [2016-10-31 04:06:03.627345] It will > call rpc_clnt_notify(). > > conn->connected = 0 in > > rpc_clnt_connection_cleanup. > > Creating a timer which will > > reconnect after 10 second. > > > > [2016-10-31 04:06:03.628381] > calling rpc_clnt_submit(). > > it will call rpc_transport_connect > > because conn->connected is 0. > > [2016-10-31 04:06:03.631488] > the new connection has been successful. > > > [2016-10-31 04:06:03.633028] > calling afr_notify(). In the case > > GF_EVENT_CHILD_UP the new > > new connection is set by "priv->child_up[idx] = 1" > > > [2016-10-31 04:06:03.635364] continue > to run client_rpc_notify() > > will notify RPC_CLNT_DISCONNECT > > in case RPC_CLNT_DISCONNECT. > > [2016-10-31 04:06:03.635420] In > afr_notify() will set "priv->child_up[idx] = 0"? > > which is the new connection, > > in the case RPC_CLNT_DISCONNECT. > > And the all the subvolume is down. > > [2016-10-31 04:06:03.635762] All the > access will be failed because > > no subvolumes up.And rpc_transport_connect() > > will not be called because the conn->connected > > has been set to 1 in process 2 > > > Please help me to check my analysis. > > Thanks, > Xin >Awesome! You root caused this on your own. This exact same issue was root caused recently and we have a pending patch for it [1]. Expect this to be fixed soon. [1]: https://review.gluster.org/15916> > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > http://www.gluster.org/mailman/listinfo/gluster-users