I have some connectivity errors with GlusterFS mount points that I can't get solved. We have a pretty basic setup with two Gluster bricks and a bunch of clients (all 3.3.2). Very occasionally we have a brief network outage and some Gluster mount points become unavailable. The other Gluster mounts on other servers to the same bricks have no problems.

The console on the client shows:

mountall: Plymouth command failed
mountall: Disconnected from Plymouth
mountall: Event failed
mountall: Skipping mounting /home since Plymouth is not available

A manual mount gives:

$ sudo mount /home
unknown option _netdev (ignored)
ERROR: Mount point does not exist.
Usage: mount.glusterfs <volumeserver>:<volumeid/volumeport> -o <options> <mount point>

On the client, I can see a few hung connections (lsof | grep TCP shows them stuck in SYN_SENT, source port 24010 on client). The connection tracker of iptables also seems to have issues:

Nov 22 09:28:36 app16 kernel: [3180197.360596] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 22 09:28:37 app16 kernel: [3180198.156075] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 22 09:28:44 app16 kernel: [3180205.377404] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 22 09:28:45 app16 kernel: [3180206.160003] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 22 09:29:00 app16 kernel: [3180221.410958] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0
Nov 22 09:29:00 app16 kernel: [3180222.154831] [INPUT] dropped IN=eth0 OUT= MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 WINDOW=14480 RES=0x00 ACK SYN URGP=0

The workaround is to manually umount and mount the failed shares. After that, lsof shows no more SYN_SENT connections and the share is accessible again.

But what is the cause of this? We need the shares to be available at any time, especially after the network recovers. That's the whole point of distributed file systems...

Some background info. /etc/fstab contains:

file1.cluster.peercode.nl:GLUSTER-HOME /home glusterfs nobootwait,backupvolfile-server=file2.cluster.peercode.nl 0 0

This is the log of brick 10.243.0.76 during a short network hiccup:

[2013-11-21 21:57:07.877100] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory
[2013-11-21 22:07:07.984100] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory
[2013-11-21 22:17:08.093102] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory
[2013-11-21 22:25:53.475072] W [socket.c:195:__socket_rwv] 0-GLUSTER-HOME-client-1: readv failed (Connection reset by peer)
[2013-11-21 22:25:53.475149] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-HOME-client-1: reading from socket failed. Error (Connection reset by peer), peer (10.243.0.24:24009)
[2013-11-21 22:25:53.492487] I [client.c:2090:client_rpc_notify] 0-GLUSTER-HOME-client-1: disconnected
[2013-11-21 22:25:54.536414] W [socket.c:195:__socket_rwv] 0-GLUSTER-SHARE-client-1: readv failed (Connection reset by peer)
[2013-11-21 22:25:54.536454] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-SHARE-client-1: reading from socket failed. Error (Connection reset by peer), peer (10.243.0.24:24010)
[2013-11-21 22:25:54.536503] I [client.c:2090:client_rpc_notify] 0-GLUSTER-SHARE-client-1: disconnected
[2013-11-21 22:26:03.539704] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-HOME-client-1: Using Program GlusterFS 3.3.1, Num (1298437), Version (330)
[2013-11-21 22:26:03.541640] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-HOME-client-1: Connected to 10.243.0.24:24009, attached to remote volume '/data/export-home-2'.
[2013-11-21 22:26:03.541668] I [client-handshake.c:1423:client_setvolume_cbk] 0-GLUSTER-HOME-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-11-21 22:26:03.548534] I [client-handshake.c:453:client_set_lk_version_cbk] 0-GLUSTER-HOME-client-1: Server lk version = 1
[2013-11-21 22:26:05.536563] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-SHARE-client-1: Using Program GlusterFS 3.3.2, Num (1298437), Version (330)
[2013-11-21 22:26:05.537510] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-SHARE-client-1: Connected to 10.243.0.24:24010, attached to remote volume '/data/export-share-2'.
[2013-11-21 22:26:05.537530] I [client-handshake.c:1423:client_setvolume_cbk] 0-GLUSTER-SHARE-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-11-21 22:26:05.541133] I [client-handshake.c:453:client_set_lk_version_cbk] 0-GLUSTER-SHARE-client-1: Server lk version = 1
[2013-11-21 22:27:08.549143] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory
[2013-11-21 22:37:08.655387] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory
[2013-11-21 22:47:05.551891] W [socket.c:195:__socket_rwv] 0-GLUSTER-SHARE-client-1: readv failed (Connection timed out)
[2013-11-21 22:47:05.551961] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-SHARE-client-1: reading from socket failed. Error (Connection timed out), peer (10.243.0.24:24010)
[2013-11-21 22:47:05.552011] I [client.c:2090:client_rpc_notify] 0-GLUSTER-SHARE-client-1: disconnected
[2013-11-21 22:47:07.599889] W [socket.c:195:__socket_rwv] 0-GLUSTER-HOME-client-1: readv failed (Connection timed out)
[2013-11-21 22:47:07.599956] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-HOME-client-1: reading from socket failed. Error (Connection timed out), peer (10.243.0.24:24009)
[2013-11-21 22:47:07.600008] I [client.c:2090:client_rpc_notify] 0-GLUSTER-HOME-client-1: disconnected
[2013-11-21 22:47:08.761366] E [afr-self-heald.c:418:_crawl_proceed] 0-GLUSTER-SHARE-replicate-0: Stopping crawl as < 2 children are up
[2013-11-21 22:47:08.764653] E [afr-self-heald.c:418:_crawl_proceed] 0-GLUSTER-HOME-replicate-0: Stopping crawl as < 2 children are up
[2013-11-21 22:47:18.759922] E [socket.c:1715:socket_connect_finish] 0-GLUSTER-HOME-client-1: connection to 10.243.0.24:24009 failed (No route to host)
[2013-11-21 22:48:18.907865] E [socket.c:1715:socket_connect_finish] 0-GLUSTER-SHARE-client-1: connection to 10.243.0.24:24010 failed (Connection timed out)
[2013-11-21 22:49:50.825110] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-HOME-client-1: Using Program GlusterFS 3.3.1, Num (1298437), Version (330)
[2013-11-21 22:49:50.825887] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-HOME-client-1: Connected to 10.243.0.24:24009, attached to remote volume '/data/export-home-2'.
[2013-11-21 22:49:50.825906] I [client-handshake.c:1423:client_setvolume_cbk] 0-GLUSTER-HOME-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-11-21 22:49:50.826525] I [client-handshake.c:453:client_set_lk_version_cbk] 0-GLUSTER-HOME-client-1: Server lk version = 1
[2013-11-21 22:49:52.863320] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-SHARE-client-1: Using Program GlusterFS 3.3.2, Num (1298437), Version (330)
[2013-11-21 22:49:52.864061] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-SHARE-client-1: Connected to 10.243.0.24:24010, attached to remote volume '/data/export-share-2'.
[2013-11-21 22:49:52.864089] I [client-handshake.c:1423:client_setvolume_cbk] 0-GLUSTER-SHARE-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2013-11-21 22:49:52.864841] I [client-handshake.c:453:client_set_lk_version_cbk] 0-GLUSTER-SHARE-client-1: Server lk version = 1
[2013-11-21 22:57:08.913844] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory
[2013-11-21 23:07:09.033899] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory
[2013-11-21 23:17:09.160547] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-GLUSTER-HOME-client-0: remote operation failed: No such file or directory

From this moment on, 10.243.0.24 lost the /home share.

And the other brick:

[2013-11-21 22:23:01.157049] W [socket.c:195:__socket_rwv] 0-GLUSTER-SHARE-client-0: readv failed (Connection timed out)
[2013-11-21 22:23:01.157145] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-SHARE-client-0: reading from socket failed. Error (Connection timed out), peer (10.243.0.23:24010)
[2013-11-21 22:23:01.157198] I [client.c:2090:client_rpc_notify] 0-GLUSTER-SHARE-client-0: disconnected
[2013-11-21 22:23:05.829005] W [socket.c:195:__socket_rwv] 0-GLUSTER-HOME-client-0: readv failed (Connection timed out)
[2013-11-21 22:23:05.829068] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-HOME-client-0: reading from socket failed. Error (Connection timed out), peer (10.243.0.23:24009)
[2013-11-21 22:23:05.829128] I [client.c:2090:client_rpc_notify] 0-GLUSTER-HOME-client-0: disconnected
[2013-11-21 22:23:13.437077] E [socket.c:1715:socket_connect_finish] 0-GLUSTER-SHARE-client-0: connection to 10.243.0.23:24010 failed (No route to host)
[2013-11-21 22:23:16.437010] E [socket.c:1715:socket_connect_finish] 0-GLUSTER-HOME-client-0: connection to 10.243.0.23:24009 failed (No route to host)
[2013-11-21 22:25:53.476614] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-SHARE-client-0: Using Program GlusterFS 3.3.2, Num (1298437), Version (330)
[2013-11-21 22:25:53.477421] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-SHARE-client-0: Connected to 10.243.0.23:24010, attached to remote volume '/data/export-share-1'.
[2013-11-21 22:25:53.477448] I [client-handshake.c:1432:client_setvolume_cbk] 0-GLUSTER-SHARE-client-0: Server and Client lk-version numbers are same, no need to reopen the fds
[2013-11-21 22:25:56.482419] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-HOME-client-0: Using Program GlusterFS 3.3.1, Num (1298437), Version (330)
[2013-11-21 22:25:56.484738] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-HOME-client-0: Connected to 10.243.0.23:24009, attached to remote volume '/data/export-home-1'.
[2013-11-21 22:25:56.484769] I [client-handshake.c:1432:client_setvolume_cbk] 0-GLUSTER-HOME-client-0: Server and Client lk-version numbers are same, no need to reopen the fds
[2013-11-21 22:26:52.486420] E [afr-self-heald.c:418:_crawl_proceed] 0-GLUSTER-HOME-replicate-0: Stopping crawl as < 2 children are up
[2013-11-21 22:30:02.519308] E [afr-self-heald.c:418:_crawl_proceed] 0-GLUSTER-SHARE-replicate-0: Stopping crawl as < 2 children are up
[2013-11-21 22:36:52.593154] E [afr-self-heald.c:418:_crawl_proceed] 0-GLUSTER-HOME-replicate-0: Stopping crawl as < 2 children are up
[2013-11-21 22:40:02.627325] E [afr-self-heald.c:418:_crawl_proceed] 0-GLUSTER-SHARE-replicate-0: Stopping crawl as < 2 children are up
[2013-11-21 22:49:48.238564] E [afr-self-heald.c:418:_crawl_proceed] 0-GLUSTER-HOME-replicate-0: Stopping crawl as < 2 children are up
[2013-11-21 22:49:50.245822] W [socket.c:195:__socket_rwv] 0-GLUSTER-HOME-client-0: readv failed (Connection reset by peer)
[2013-11-21 22:49:50.245873] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-HOME-client-0: reading from socket failed. Error (Connection reset by peer), peer (10.243.0.23:24009)
[2013-11-21 22:49:50.245913] I [client.c:2090:client_rpc_notify] 0-GLUSTER-HOME-client-0: disconnected
[2013-11-21 22:49:50.245931] W [socket.c:195:__socket_rwv] 0-GLUSTER-SHARE-client-0: readv failed (Connection reset by peer)
[2013-11-21 22:49:50.245943] W [socket.c:1512:__socket_proto_state_machine] 0-GLUSTER-SHARE-client-0: reading from socket failed. Error (Connection reset by peer), peer (10.243.0.23:24010)
[2013-11-21 22:49:50.245965] I [client.c:2090:client_rpc_notify] 0-GLUSTER-SHARE-client-0: disconnected
[2013-11-21 22:50:01.243099] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-HOME-client-0: Using Program GlusterFS 3.3.1, Num (1298437), Version (330)
[2013-11-21 22:50:01.243299] I [client-handshake.c:1614:select_server_supported_programs] 0-GLUSTER-SHARE-client-0: Using Program GlusterFS 3.3.2, Num (1298437), Version (330)
[2013-11-21 22:50:01.244103] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-HOME-client-0: Connected to 10.243.0.23:24009, attached to remote volume '/data/export-home-1'.
[2013-11-21 22:50:01.244154] I [client-handshake.c:1423:client_setvolume_cbk] 0-GLUSTER-HOME-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-11-21 22:50:01.244918] I [client-handshake.c:1411:client_setvolume_cbk] 0-GLUSTER-SHARE-client-0: Connected to 10.243.0.23:24010, attached to remote volume '/data/export-share-1'.
[2013-11-21 22:50:01.244945] I [client-handshake.c:1423:client_setvolume_cbk] 0-GLUSTER-SHARE-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-11-21 22:50:01.246500] I [client-handshake.c:453:client_set_lk_version_cbk] 0-GLUSTER-SHARE-client-0: Server lk version = 1
[2013-11-21 22:50:01.246551] I [client-handshake.c:453:client_set_lk_version_cbk] 0-GLUSTER-HOME-client-0: Server lk version = 1

Home volume configuration:

Volume Name: GLUSTER-HOME
Type: Replicate
Volume ID: 27ac2466-584e-491a-9717-2ed4869b1c28
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: file1.cluster.peercode.nl:/data/export-home-1
Brick2: file2.cluster.peercode.nl:/data/export-home-2
Options Reconfigured:
auth.allow: 10.243.0.*
features.quota: on
features.limit-usage: /:30gb

Installed packages:

$ dpkg -l gluster*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                   Version                Description
+++-======================-======================-===========================================================
un  glusterfs              <none>                 (no description available)
ii  glusterfs-client       3.3.2-ubuntu1~precise2 clustered file-system (client package)
ii  glusterfs-common       3.3.2-ubuntu1~precise2 GlusterFS common libraries and translator modules
ii  glusterfs-server       3.3.2-ubuntu1~precise2 clustered file-system (server package)

Ubuntu 12.04

Best,

Mark Ruys

---
dr M.P.J. Ruys (PhD) :: Peercode
Oudenhof 4c, 4191NW Geldermalsen, The Netherlands
Web site and travel directions: www.peercode.nl
Phone +31.88.0084124 :: Mobile +31.6.51298623
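
P.S. To make the firewall log quoted earlier in this mail easier to read, here is a minimal Python sketch that groups the dropped packets by connection. It is only an illustration: the function name summarize_drops and the two regular expressions are assumptions based on the kernel log lines quoted in this mail, not part of GlusterFS or iptables, and a different LOG prefix may need a different pattern.

```python
import re
from collections import Counter

# Assumption: lines look like the iptables LOG output quoted in this mail.
FIELD_RE = re.compile(r"\b(SRC|DST|SPT|DPT)=(\S+)")
# TCP flags are printed as bare words between RES=... and URGP=...
FLAGS_RE = re.compile(r"RES=\S+\s+(.*?)\s*URGP=")


def summarize_drops(lines):
    """Count dropped packets per (src, sport, dst, dport, flags) group."""
    counts = Counter()
    for line in lines:
        if "dropped" not in line:
            continue  # not a firewall drop entry
        fields = dict(FIELD_RE.findall(line))
        match = FLAGS_RE.search(line)
        flags = " ".join(sorted(match.group(1).split())) if match else ""
        key = (fields.get("SRC"), fields.get("SPT"),
               fields.get("DST"), fields.get("DPT"), flags)
        counts[key] += 1
    return counts


# Two of the entries from the kernel log in this mail, copied verbatim.
SAMPLE = [
    "Nov 22 09:28:36 app16 kernel: [3180197.360596] [INPUT] dropped IN=eth0 OUT= "
    "MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 "
    "LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 "
    "WINDOW=14480 RES=0x00 ACK SYN URGP=0",
    "Nov 22 09:28:37 app16 kernel: [3180198.156075] [INPUT] dropped IN=eth0 OUT= "
    "MAC=aa:01:60:00:90:4c:aa:01:60:00:87:cb:08:00 SRC=10.243.0.24 DST=10.243.0.76 "
    "LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24010 DPT=1021 "
    "WINDOW=14480 RES=0x00 ACK SYN URGP=0",
]

for (src, sport, dst, dport, flags), n in summarize_drops(SAMPLE).items():
    print(f"{n}x {src}:{sport} -> {dst}:{dport} [{flags}]")
```

All six entries quoted in this mail fall into one group: SYN ACK packets from 10.243.0.24 port 24010 to port 1021 on 10.243.0.76.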