We have three Gluster clusters, and all three exhibit the same symptom: FUSE clients report network ping timeouts from bricks, disconnect from the volume, and then very quickly reconnect to all bricks.

An example from the client logs:

[2014-11-20 01:19:09.079725] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-wp_uploads-client-2: server 192.168.135.37:49152 has not responded in the last 5 seconds, disconnecting.
[2014-11-20 01:19:09.278701] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-11-20 01:19:03.771414 (xid=0x7bd8a6)
[2014-11-20 01:19:09.278734] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2014-11-20 01:19:09.278893] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-11-20 01:19:03.771917 (xid=0x7bd8a7)
[2014-11-20 01:19:09.278901] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2014-11-20 01:19:09.279008] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2014-11-20 01:19:04.072860 (xid=0x7bd8a8)
[2014-11-20 01:19:09.279028] W [client-handshake.c:276:client_ping_cbk] 0-wp_uploads-client-2: timer must have expired
[2014-11-20 01:19:09.279090] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15d) [0x38ca00fced] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x38ca00f833] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x38ca00f74e]))) 0-wp_uploads-client-2: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-11-20 01:19:07.070544 (xid=0x7bd8a9)
[2014-11-20 01:19:09.279099] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2014-11-20 01:19:09.287885] I [client.c:2229:client_rpc_notify] 0-wp_uploads-client-2: disconnected from 192.168.135.37:49152. Client process will keep trying to connect to glusterd until brick's port is available
[2014-11-20 01:19:09.377628] I [socket.c:3060:socket_submit_request] 0-wp_uploads-client-2: not connected (priv->connected = 0)
[2014-11-20 01:19:09.377669] W [rpc-clnt.c:1542:rpc_clnt_submit] 0-wp_uploads-client-2: failed to submit rpc-request (XID: 0x7bd8aa Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (wp_uploads-client-2)
[2014-11-20 01:19:09.377692] W [client-rpc-fops.c:2774:client3_3_lookup_cbk] 0-wp_uploads-client-2: remote operation failed: Transport endpoint is not connected. Path: /2014 (10537923-c903-4a34-af42-f74b9eb6cf11)
[2014-11-20 01:19:09.498741] I [rpc-clnt.c:1729:rpc_clnt_reconfig] 0-wp_uploads-client-2: changing port to 49152 (from 0)
[2014-11-20 01:19:09.501832] I [client-handshake.c:1677:select_server_supported_programs] 0-wp_uploads-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-11-20 01:19:09.538700] I [client-handshake.c:1462:client_setvolume_cbk] 0-wp_uploads-client-2: Connected to 192.168.135.37:49152, attached to remote volume '/bricks/brick1/brick'.
[2014-11-20 01:19:09.538718] I [client-handshake.c:1474:client_setvolume_cbk] 0-wp_uploads-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2014-11-20 01:19:09.548683] I [client-handshake.c:450:client_set_lk_version_cbk] 0-wp_uploads-client-2: Server lk version = 1

As you can see, the error and the resolution occur within the same second.

The servers on cluster 1 are physical servers with 10G network devices. The clients for this cluster are VMs in a different subnet, so all traffic passes through a firewall. These servers and clients run Gluster 3.6.1.

The servers on cluster 2 are VMs. The clients for this cluster are VMs in a different subnet, so all traffic passes through a firewall. These servers and clients run Gluster 3.6.1.

The servers on cluster 3 are VMs. The clients for this cluster are the other servers in the cluster, and all are on the same subnet. These servers and clients run Gluster 3.5.2.

Clients on all three of these clusters are reporting timeouts. Recovery is usually sub-second, but I've seen it take as long as 10 seconds.

According to the folks in the #gluster IRC channel, these errors are always due to network problems. According to the network folks here, there are no indications of network problems: no firewall logs indicating blocked traffic, and no failures reported by the switches that we can find.
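
In case it helps anyone compare numbers, here is a minimal sketch of how the timeout and reconnect events could be paired up from a client log to measure recovery time. The regular expressions are assumptions based only on the excerpt above (other Gluster versions may word these messages differently), and the function name recovery_times is just an illustrative choice.

```python
import re
from datetime import datetime

# Patterns derived from the log excerpt above; treat them as assumptions,
# since other Gluster versions may format these messages differently.
STAMP = r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]"
TIMEOUT = re.compile(
    STAMP + r" C \[[^\]]*rpc_client_ping_timer_expired\] (\S+): "
    r"server (\S+) has not responded"
)
CONNECTED = re.compile(
    STAMP + r" I \[[^\]]*client_setvolume_cbk\] (\S+): Connected to (\S+),"
)


def parse_stamp(text):
    """Convert a log timestamp such as '2014-11-20 01:19:09.079725'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def recovery_times(lines):
    """Yield (subvolume, server, timeout_time, seconds_until_reconnect).

    Expects one log entry per line. A timeout with no later 'Connected to'
    entry for the same subvolume is silently left unreported.
    """
    pending = {}  # subvolume name -> (server, time of ping timeout)
    for line in lines:
        match = TIMEOUT.search(line)
        if match:
            pending[match.group(2)] = (match.group(3), parse_stamp(match.group(1)))
            continue
        match = CONNECTED.search(line)
        if match and match.group(2) in pending:
            server, started = pending.pop(match.group(2))
            elapsed = parse_stamp(match.group(1)) - started
            yield match.group(2), server, started, elapsed.total_seconds()
```

Feeding it an open client log file (for example, `for row in recovery_times(open(path)): print(row)`, where path is wherever the mount log lives on your client) gives one row per disconnect, which makes it easier to line up events from different clients by timestamp.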

We see these errors much more frequently on the first two clusters: usually daily, though not at consistent times. Different clients report failures at different times. If it were a network issue, I would expect the failures to be more consistent between clients. I'm not ruling out network issues, but the fact that we see any errors at all on the third cluster, where all traffic is local, seems to make a network cause less likely.

What additional troubleshooting steps are recommended? I know these kinds of transient errors are the worst bug reports to receive, but I simply don't know how to reproduce the situation on demand.

Thanks,
Scott