Hi Gluster Community,

We have a Proxmox VE (PVE) cluster with two nodes. Each node has 4 HDDs, on top of which we run a GlusterFS volume so that we can live-migrate VMs. A few days ago, some disk files on the GlusterFS volume went into a split-brain condition. We were able to save the corresponding log files and resolve the split-brain, but we don't know how it happened. The relevant GlusterFS log excerpts follow below. Maybe one of you can tell us what caused the problem.

Here is the network setup of the PVE cluster:

192.168.231.0/24 --> Server LAN (PVE GUI reachable on port 8006)
10.10.11.0/24 --> Cluster HA LAN
10.10.12.0/24 --> GlusterFS storage LAN

GlusterFS LAN:
.) PVEServer1 - 10.10.12.31
.) PVEServer2 - 10.10.12.32

What we've seen in the mnt-pve-GlusterVol01.log log file:

Server1:
[2019-05-13 04:25:01.509716] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2019-05-13 09:47:48.277650] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.10.12.31:24007 failed (No data available)
[2019-05-13 09:47:48.277696] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.10.12.31 (No data available)
[2019-05-13 09:47:48.277704] I [glusterfsd-mgmt.c:1926:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2019-05-13 09:47:50.926948] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7fe58a1eb494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x55a8728115e5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55a872811444] ) 0-: received signum (15), shutting down
[2019-05-13 09:47:50.926977] I [fuse-bridge.c:5794:fini] 0-fuse: Unmounting '/mnt/pve/GlusterVol01'.
[2019-05-13 09:47:50.950381] I [fuse-bridge.c:5086:fuse_thread_proc] 0-fuse: unmounting /mnt/pve/GlusterVol01
[2019-05-13 09:49:43.823117] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.8 (args: /usr/sbin/glusterfs --volfile-server=10.10.12.31 --volfile-id=vol0 /mnt/pve/GlusterVol01)
[2019-05-13 09:49:43.828117] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-05-13 09:49:43.869885] W [MSGID: 108003] [afr.c:102:fix_quorum_options] 0-vol0-replicate-0: quorum-type none overriding quorum-count 1
[2019-05-13 09:49:43.871644] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-05-13 09:49:43.880208] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-0: parent translators are ready, attempting connect on transport
[2019-05-13 09:49:43.880609] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-1: parent translators are ready, attempting connect on transport
[2019-05-13 09:49:43.880816] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0)
Final graph:
+------------------------------------------------------------------------------+
  1: volume vol0-client-0
  2:     type protocol/client
  3:     option ping-timeout 5
  4:     option remote-host pvetau01-storage
  5:     option remote-subvolume /var/lib/glusterfs/data01/brick1/vol0
  6:     option transport-type socket
  7:     option transport.address-family inet
  8:     option username 4ccc2234-fba7-40f9-b97b-26d3fa8ab401
  9:     option password cef1b5f5-b16c-4a3c-b49f-f814901a3252
 10:     option filter-O_DIRECT enable
 11:     option send-gids true
 12: end-volume
 13:
 14: volume vol0-client-1
 15:     type protocol/client
 16:     option ping-timeout 5
 17:     option remote-host pvetau02-storage
 18:     option remote-subvolume /var/lib/glusterfs/data01/brick1/vol0
 19:     option transport-type socket
 20:     option transport.address-family inet
 21:     option username 4ccc2234-fba7-40f9-b97b-26d3fa8ab401
 22:     option password cef1b5f5-b16c-4a3c-b49f-f814901a3252
 23:     option filter-O_DIRECT enable
 24:     option send-gids true
 25: end-volume
 26:
 27: volume vol0-replicate-0
 28:     type cluster/replicate
 29:     option eager-lock enable
 30:     option quorum-count 1
 31:     subvolumes vol0-client-0 vol0-client-1
 32: end-volume
 33:
 34: volume vol0-dht
 35:     type cluster/distribute
 36:     option lock-migration off
 37:     subvolumes vol0-replicate-0
 38: end-volume
 39:
 40: volume vol0-write-behind
 41:     type performance/write-behind
 42:     subvolumes vol0-dht
 43: end-volume
 44:
 45: volume vol0-readdir-ahead
 46:     type performance/readdir-ahead
 47:     subvolumes vol0-write-behind
 48: end-volume
 49:
 50: volume vol0-open-behind
 51:     type performance/open-behind
 52:     subvolumes vol0-readdir-ahead
 53: end-volume
 54:
 55: volume vol0
 56:     type debug/io-stats
 57:     option log-level INFO
 58:     option latency-measurement off
 59:     option count-fop-hits off
 60:     subvolumes vol0-open-behind
 61: end-volume
 62:
 63: volume meta-autoload
 64:     type meta
 65:     subvolumes vol0
 66: end-volume
 67:
+------------------------------------------------------------------------------+
[2019-05-13 09:49:43.881243] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:43.881434] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-1: changing port to 49154 (from 0)
[2019-05-13 09:49:43.881906] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-13 09:49:43.882213] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'.
[2019-05-13 09:49:43.882222] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds [2019-05-13 09:49:43.882249] I [MSGID: 108005] [afr-common.c:4382:afr_notify] 0-vol0-replicate-0: Subvolume 'vol0-client-1' came back up; going online. [2019-05-13 09:49:43.882360] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1 [2019-05-13 09:49:43.886625] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'. [2019-05-13 09:49:43.886633] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds [2019-05-13 09:49:43.890995] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1 [2019-05-13 09:49:43.891049] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.26 [2019-05-13 09:49:43.891067] I [fuse-bridge.c:4838:fuse_graph_sync] 0-fuse: switched to graph 0 [2019-05-13 09:49:43.891625] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0 [2019-05-13 10:20:38.998246] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-1: server 10.10.12.32:49154 has not responded in the last 5 seconds, disconnecting. 
[2019-05-13 10:20:38.998657] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f69df41fe83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f69df1e7b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69df1e7c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7f69df1e92e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f69df1e9bb4] ))))) 0-vol0-client-1: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2019-05-13 10:20:33.237111 (xid=0x492) [2019-05-13 10:20:38.998681] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-vol0-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected] [2019-05-13 10:20:38.998829] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7f69df41fe83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f69df1e7b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69df1e7c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7f69df1e92e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f69df1e9bb4] ))))) 0-vol0-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-05-13 10:20:33.237115 (xid=0x493) [2019-05-13 10:20:38.998843] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-vol0-client-1: socket disconnected [2019-05-13 10:20:38.998854] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-1: disconnected from vol0-client-1. 
Client process will keep trying to connect to glusterd until brick's port is available [2019-05-13 10:20:43.355917] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0 [2019-05-13 10:21:20.850030] E [socket.c:2309:socket_connect_finish] 0-vol0-client-1: connection to 10.10.12.32:24007 failed (No route to host) [2019-05-13 10:22:07.026615] E [MSGID: 114058] [client-handshake.c:1534:client_query_portmap_cbk] 0-vol0-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. [2019-05-13 10:22:07.026663] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-1: disconnected from vol0-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2019-05-13 10:22:10.010421] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-1: changing port to 49154 (from 0) [2019-05-13 10:22:10.011105] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2019-05-13 10:22:10.011558] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'. 
[2019-05-13 10:22:10.011609] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds [2019-05-13 10:22:10.011622] I [MSGID: 114042] [client-handshake.c:1054:client_post_handshake] 0-vol0-client-1: 2 fds open - Delaying child_up until they are re-opened [2019-05-13 10:22:10.032258] I [MSGID: 114041] [client-handshake.c:676:client_child_up_reopen_done] 0-vol0-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP [2019-05-13 10:22:10.032492] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1 [2019-05-13 10:22:13.790586] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-0 [2019-05-13 11:12:57.300347] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error] [2019-05-13 11:12:57.305284] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 4 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain) [2019-05-13 11:12:57.305712] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error] [2019-05-13 11:12:57.306277] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null) [2019-05-13 11:12:57.306938] I [MSGID: 114024] [client-helpers.c:99:this_fd_set_ctx] 0-vol0-client-0: /images/103/vm-103-disk-0.qcow2 (5f9490a8-ec56-410e-9c70-653e0da77174): trying duplicate remote fd set. 
[2019-05-13 11:12:57.306973] I [MSGID: 114024] [client-helpers.c:99:this_fd_set_ctx] 0-vol0-client-1: /images/103/vm-103-disk-0.qcow2 (5f9490a8-ec56-410e-9c70-653e0da77174): trying duplicate remote fd set. [2019-05-13 11:12:57.310052] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2698: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error) [2019-05-13 11:12:57.310137] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2697: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error) [2019-05-13 11:12:57.311543] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 2699: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f69d1cba184 (Input/output error) The message "E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]" repeated 2 times between [2019-05-13 11:12:57.305712] and [2019-05-13 11:12:57.310816] The message "W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)" repeated 2 times between [2019-05-13 11:12:57.306277] and [2019-05-13 11:12:57.311184] The message "W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 4 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)" repeated 6 times between [2019-05-13 11:12:57.305284] and [2019-05-13 11:12:57.311274] The message "E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]" repeated 5 times between [2019-05-13 11:12:57.300347] and [2019-05-13 11:12:57.311531] Server 2: [2019-05-13 04:25:01.338790] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server... 
[2019-05-13 09:47:59.443328] E [socket.c:2309:socket_connect_finish] 0-glusterfs: connection to 10.10.12.31:24007 failed (Connection refused) [2019-05-13 09:48:17.426580] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-vol0-client-0: server 10.10.12.31:49155 has not responded in the last 5 seconds, disconnecting. [2019-05-13 09:48:17.426872] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7efebd3f9e83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7efebd1c1b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efebd1c1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7efebd1c32e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7efebd1c3bb4] ))))) 0-vol0-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2019-05-13 09:48:12.180579 (xid=0x5663a4) [2019-05-13 09:48:17.426899] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-vol0-client-0: remote operation failed. 
Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected] [2019-05-13 09:48:17.427056] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x1a3)[0x7efebd3f9e83] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7efebd1c1b61] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efebd1c1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x89)[0x7efebd1c32e9] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7efebd1c3bb4] ))))) 0-vol0-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2019-05-13 09:48:12.180591 (xid=0x5663a5) [2019-05-13 09:48:17.427067] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-vol0-client-0: socket disconnected [2019-05-13 09:48:17.427077] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-0: disconnected from vol0-client-0. Client process will keep trying to connect to glusterd until brick's port is available [2019-05-13 09:48:21.479100] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1 [2019-05-13 09:48:59.219302] E [socket.c:2309:socket_connect_finish] 0-vol0-client-0: connection to 10.10.12.31:24007 failed (No route to host) [2019-05-13 09:49:41.468469] I [glusterfsd-mgmt.c:1600:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing [2019-05-13 09:49:42.505174] E [MSGID: 114058] [client-handshake.c:1534:client_query_portmap_cbk] 0-vol0-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. [2019-05-13 09:49:42.505225] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol0-client-0: disconnected from vol0-client-0. 
Client process will keep trying to connect to glusterd until brick's port is available [2019-05-13 09:49:45.442003] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0) [2019-05-13 09:49:45.442523] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2019-05-13 09:49:45.442802] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'. [2019-05-13 09:49:45.442812] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds [2019-05-13 09:49:45.442820] I [MSGID: 114042] [client-handshake.c:1054:client_post_handshake] 0-vol0-client-0: 2 fds open - Delaying child_up until they are re-opened [2019-05-13 09:49:45.443244] I [MSGID: 114041] [client-handshake.c:676:client_child_up_reopen_done] 0-vol0-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP [2019-05-13 09:49:45.443353] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1 [2019-05-13 09:49:49.622255] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1 [2019-05-13 10:20:06.060045] W [glusterfsd.c:1327:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7efebc254494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x55dba7a3b5e5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55dba7a3b444] ) 0-: received signum (15), shutting down [2019-05-13 10:20:06.068969] I [fuse-bridge.c:5794:fini] 0-fuse: Unmounting '/mnt/pve/GlusterVol01'. 
[2019-05-13 10:20:06.103235] I [fuse-bridge.c:5086:fuse_thread_proc] 0-fuse: unmounting /mnt/pve/GlusterVol01 [2019-05-13 10:22:08.842734] I [MSGID: 100030] [glusterfsd.c:2454:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.8 (args: /usr/sbin/glusterfs --volfile-server=10.10.12.31 --volfile-id=vol0 /mnt/pve/GlusterVol01) [2019-05-13 10:22:08.853935] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-05-13 10:22:08.944855] W [MSGID: 108003] [afr.c:102:fix_quorum_options] 0-vol0-replicate-0: quorum-type none overriding quorum-count 1 [2019-05-13 10:22:08.946502] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2 [2019-05-13 10:22:08.972020] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-0: parent translators are ready, attempting connect on transport [2019-05-13 10:22:08.972395] I [MSGID: 114020] [client.c:2356:notify] 0-vol0-client-1: parent translators are ready, attempting connect on transport [2019-05-13 10:22:08.972832] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-vol0-client-0: changing port to 49155 (from 0) [2019-05-13 10:22:08.973142] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2019-05-13 10:22:08.973231] I [MSGID: 114057] [client-handshake.c:1447:select_server_supported_programs] 0-vol0-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2019-05-13 10:22:08.973544] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-1: Connected to vol0-client-1, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'. [2019-05-13 10:22:08.973544] I [MSGID: 114046] [client-handshake.c:1223:client_setvolume_cbk] 0-vol0-client-0: Connected to vol0-client-0, attached to remote volume '/var/lib/glusterfs/data01/brick1/vol0'. 
[2019-05-13 10:22:08.973566] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-0: Server and Client lk-version numbers are not same, reopening the fds [2019-05-13 10:22:08.973567] I [MSGID: 114047] [client-handshake.c:1234:client_setvolume_cbk] 0-vol0-client-1: Server and Client lk-version numbers are not same, reopening the fds [2019-05-13 10:22:08.973616] I [MSGID: 108005] [afr-common.c:4382:afr_notify] 0-vol0-replicate-0: Subvolume 'vol0-client-1' came back up; going online. [2019-05-13 10:22:08.973639] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-1: Server lk version = 1 [2019-05-13 10:22:08.977940] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-vol0-client-0: Server lk version = 1 [2019-05-13 10:22:08.978055] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.26 [2019-05-13 10:22:08.978075] I [fuse-bridge.c:4838:fuse_graph_sync] 0-fuse: switched to graph 0 [2019-05-13 10:22:08.978603] I [MSGID: 108031] [afr-common.c:2152:afr_local_discovery_cbk] 0-vol0-replicate-0: selecting local read_child vol0-client-1 [2019-05-13 10:53:46.573894] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error] [2019-05-13 10:53:46.573992] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain) [2019-05-13 10:53:46.574253] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. 
[Input/output error] [2019-05-13 10:53:46.574949] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null) [2019-05-13 10:53:46.575526] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1380: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error) [2019-05-13 10:53:46.577820] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1381: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error) [2019-05-13 10:53:46.596838] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error] [2019-05-13 10:53:46.597759] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain) [2019-05-13 10:53:46.598916] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error] The message "W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null)" repeated 2 times between [2019-05-13 10:53:46.574949] and [2019-05-13 10:53:46.599257] [2019-05-13 10:53:46.599525] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain) [2019-05-13 10:53:46.599797] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. 
[Input/output error] [2019-05-13 10:53:46.599825] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1389: READ => -1 gfid=609bb8be-3ae8-470d-9f88-2b65095fbed4 fd=0x7f649c00e06c (Input/output error) [2019-05-13 10:53:46.599876] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain) [2019-05-13 10:53:46.600149] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error] [2019-05-13 10:53:46.600193] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain) [2019-05-13 10:53:46.600417] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. [Input/output error] [2019-05-13 10:53:46.600775] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null) [2019-05-13 10:53:46.601071] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4. (Possible split-brain) [2019-05-13 10:53:46.601537] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 609bb8be-3ae8-470d-9f88-2b65095fbed4: split-brain observed. 
[Input/output error] [2019-05-13 10:53:46.601577] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1390: READ => -1 gfid=609bb8be-3ae8-470d-9f88-2b65095fbed4 fd=0x7f649c00e06c (Input/output error) [2019-05-13 10:53:46.619830] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error] [2019-05-13 10:53:46.620701] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain) [2019-05-13 10:53:46.621098] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error] [2019-05-13 10:53:46.621455] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null) [2019-05-13 10:53:46.621732] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. (Possible split-brain) [2019-05-13 10:53:46.623509] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error] [2019-05-13 10:53:46.624891] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error] [2019-05-13 10:53:46.625212] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null) [2019-05-13 10:53:46.625314] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 79423c92-0338-4dc9-bafc-091172e8d845. 
(Possible split-brain) [2019-05-13 10:53:46.625721] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 79423c92-0338-4dc9-bafc-091172e8d845: split-brain observed. [Input/output error] [2019-05-13 10:53:46.625754] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1399: READ => -1 gfid=79423c92-0338-4dc9-bafc-091172e8d845 fd=0x7f649c00e06c (Input/output error) [2019-05-13 10:53:46.576286] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error] [2019-05-13 10:56:28.176786] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error] [2019-05-13 10:56:28.177684] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain) [2019-05-13 10:56:28.178782] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error] [2019-05-13 10:56:28.179128] W [MSGID: 108027] [afr-common.c:2491:afr_discover_done] 0-vol0-replicate-0: no read subvols for (null) [2019-05-13 10:56:28.180634] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1533: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error) [2019-05-13 10:56:28.179439] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. 
(Possible split-brain)
[2019-05-13 10:56:28.180620] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.278595] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing READ on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.279517] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)
[2019-05-13 10:59:25.280605] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-vol0-replicate-0: Failing FGETXATTR on gfid 5f9490a8-ec56-410e-9c70-653e0da77174: split-brain observed. [Input/output error]
[2019-05-13 10:59:25.281649] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 1685: READ => -1 gfid=5f9490a8-ec56-410e-9c70-653e0da77174 fd=0x7f649c00e06c (Input/output error)
[2019-05-13 10:59:25.281250] W [MSGID: 108008] [afr-read-txn.c:238:afr_read_txn] 0-vol0-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 5f9490a8-ec56-410e-9c70-653e0da77174. (Possible split-brain)

-------------------------------------------------

What we can't explain is why server 1 does the following:

[2019-05-13 09:47:48.277650] W [socket.c:590:__socket_rwv] 0-glusterfs: readv on 10.10.12.31:24007 failed (No data available)
[2019-05-13 09:47:48.277696] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.10.12.31 (No data available)
[2019-05-13 09:47:48.277704] I [glusterfsd-mgmt.c:1926:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers

After that, the volume is unmounted and then re-mounted with a different port.
Subsequently, server 2 behaves in exactly the same way, which results in a split-brain condition for the VMs' disk files.

We would be glad if someone could explain this behavior to us.

BR
René
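
To make the order of events on the two nodes easier to compare, the log excerpts above could be merged into a single timeline. The following is only a minimal sketch in Python: it assumes one log entry per line in the format quoted in this mail, and the function name `timeline`, the node labels, and the keyword list are hypothetical choices for illustration, not part of GlusterFS.

```python
import re
from datetime import datetime

# A log entry starts with "[YYYY-MM-DD HH:MM:SS.ffffff] <level letter> ".
ENTRY = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] ([A-Z]) ")

# Substrings copied from the messages quoted in this mail (illustrative selection).
KEYWORDS = (
    "has not responded",
    "received signum",
    "Started running",
    "Connected to",
    "split-brain observed",
)


def timeline(logs):
    """Merge the logs of several nodes into one list sorted by timestamp.

    logs maps a node label (hypothetical, e.g. "server1") to raw log text.
    Only lines that contain one of KEYWORDS are kept.
    """
    events = []
    for node, text in logs.items():
        for line in text.splitlines():
            m = ENTRY.match(line)
            if m and any(k in line for k in KEYWORDS):
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
                events.append((ts, node, m.group(2), line[m.end():]))
    return sorted(events)
```

Called with the contents of mnt-pve-GlusterVol01.log from each node, this returns tuples of (timestamp, node, level, message), so the disconnects, remounts, and first split-brain errors of both servers appear side by side in chronological order.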