Łukasz Michalski
2019-May-06 13:13 UTC
[Gluster-users] heal: Not able to fetch volfile from glusterd
Hi,

I have a problem resolving split-brain in one of my installations.

CentOS 7, glusterfs 3.10.12, replica on two nodes:

[root@ixmed1 iscsi]# gluster volume status cluster
Status of volume: cluster
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ixmed2:/glusterfs-bricks/cluster/clus
ter                                         49153     0          Y       3028
Brick ixmed1:/glusterfs-bricks/cluster/clus
ter                                         49153     0          Y       2917
Self-heal Daemon on localhost               N/A       N/A        Y       112929
Self-heal Daemon on ixmed2                  N/A       N/A        Y       57774

Task Status of Volume cluster
------------------------------------------------------------------------------
There are no active volume tasks

When I try to access one file the client reports split-brain:

[2019-05-06 12:36:43.785098] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-cluster-replicate-0: Failing READ on gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed. [Input/output error]
[2019-05-06 12:36:43.787952] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-cluster-replicate-0: Failing FGETXATTR on gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed. [Input/output error]
[2019-05-06 12:36:43.788778] W [MSGID: 108027] [afr-common.c:2722:afr_discover_done] 0-cluster-replicate-0: no read subvols for (null)
[2019-05-06 12:36:43.790123] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352501: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde0803f390 (Input/output error)
[2019-05-06 12:36:43.794979] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352506: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0 (Input/output error)
[2019-05-06 12:36:43.800468] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352508: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0 (Input/output error)

The problem is that "gluster volume heal info" hangs for 10 seconds and returns:

    Not able to fetch volfile from glusterd
    Volume heal failed

glfsheal.log contains:

[2019-05-06 12:40:25.589879] I [afr.c:94:fix_quorum_options] 0-cluster-replicate-0: reindeer: incoming qtype = none
[2019-05-06 12:40:25.589967] I [afr.c:116:fix_quorum_options] 0-cluster-replicate-0: reindeer: quorum_count = 0
[2019-05-06 12:40:25.593294] W [MSGID: 101174] [graph.c:361:_log_if_unknown_option] 0-cluster-readdir-ahead: option 'parallel-readdir' is not recognized
[2019-05-06 12:40:25.593895] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 69786d65-6431-2d32-3037-3739322d3230 (0) coming up
[2019-05-06 12:40:25.593972] I [MSGID: 114020] [client.c:2352:notify] 0-cluster-client-0: parent translators are ready, attempting connect on transport
[2019-05-06 12:40:25.607836] I [MSGID: 114020] [client.c:2352:notify] 0-cluster-client-1: parent translators are ready, attempting connect on transport
[2019-05-06 12:40:25.608556] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-cluster-client-0: changing port to 49153 (from 0)
[2019-05-06 12:40:25.618167] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-cluster-client-1: changing port to 49153 (from 0)
[2019-05-06 12:40:25.629595] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-cluster-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-06 12:40:25.632031] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-0: Connected to cluster-client-0, attached to remote volume '/glusterfs-bricks/cluster/cluster'.
[2019-05-06 12:40:25.632100] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-06 12:40:25.632263] I [MSGID: 108005] [afr-common.c:4817:afr_notify] 0-cluster-replicate-0: Subvolume 'cluster-client-0' came back up; going online.
[2019-05-06 12:40:25.637707] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-cluster-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-06 12:40:25.639285] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-1: Connected to cluster-client-1, attached to remote volume '/glusterfs-bricks/cluster/cluster'.
[2019-05-06 12:40:25.639341] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-06 12:40:31.564407] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-0: server 10.0.104.26:49153 has not responded in the last 5 seconds, disconnecting.
[2019-05-06 12:40:31.565764] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-1: server 10.0.7.26:49153 has not responded in the last 5 seconds, disconnecting.
[2019-05-06 12:40:35.645545] I [MSGID: 114018] [client.c:2276:client_rpc_notify] 0-cluster-client-0: disconnected from cluster-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-06 12:40:35.645683] I [socket.c:3534:socket_submit_request] 0-cluster-client-0: not connected (priv->connected = -1)
[2019-05-06 12:40:35.645755] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-cluster-client-0: failed to submit rpc-request (XID: 0x7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport (cluster-client-0)
[2019-05-06 12:40:35.645807] W [MSGID: 114031] [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-0: remote operation failed [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.645887] I [socket.c:3534:socket_submit_request] 0-cluster-client-1: not connected (priv->connected = -1)
[2019-05-06 12:40:35.645918] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-cluster-client-1: failed to submit rpc-request (XID: 0x7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport (cluster-client-1)
[2019-05-06 12:40:35.645955] W [MSGID: 114031] [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-1: remote operation failed [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.646008] W [MSGID: 109075] [dht-diskusage.c:44:dht_du_info_cbk] 0-cluster-dht: failed to get disk info from cluster-replicate-0 [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.647846] I [MSGID: 114018] [client.c:2276:client_rpc_notify] 0-cluster-client-1: disconnected from cluster-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-06 12:40:35.647895] E [MSGID: 108006] [afr-common.c:4842:afr_notify] 0-cluster-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-05-06 12:40:35.647989] I [MSGID: 108006] [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no subvolumes up
[2019-05-06 12:40:35.648051] I [MSGID: 108006] [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no subvolumes up
[2019-05-06 12:40:35.648122] I [MSGID: 104039] [glfs-resolve.c:902:__glfs_active_subvol] 0-cluster: first lookup on graph 69786d65-6431-2d32-3037-3739322d3230 (0) failed (Drugi koniec nie jest połączony) [Drugi koniec nie jest połączony]

"Drugi koniec nie jest połączony" -> Transport endpoint is not connected

On the brick process side there is a connection attempt:

[2019-05-06 12:40:25.638032] I [addr.c:182:gf_auth] 0-/glusterfs-bricks/cluster/cluster: allowed = "*", received addr = "10.0.7.26"
[2019-05-06 12:40:25.638080] I [login.c:111:gf_auth] 0-auth/login: allowed user names: e2f4c8f4-d040-4856-b6e3-62611fbab0ea
[2019-05-06 12:40:25.638109] I [MSGID: 115029] [server-handshake.c:695:server_setvolume] 0-cluster-server: accepted client from ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0 (version: 3.10.12)
[2019-05-06 12:40:31.565931] I [MSGID: 115036] [server.c:559:server_rpc_notify] 0-cluster-server: disconnecting connection from ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0
[2019-05-06 12:40:31.566420] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-cluster-server: Shutting down connection ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0

I am not able to use any heal command because of this problem.

I have three volumes configured on those nodes. Their configuration is identical and the "gluster volume heal" command fails for all of them.

Can anyone help?

Thanks,
Łukasz
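For reference, the split-brain state of the affected file can usually be checked directly on the bricks, without going through glfsheal. A minimal sketch, assuming the usual brick layout where each file is hard-linked under .glusterfs/<first-two>/<next-two>/<gfid> and the AFR changelog xattrs follow the trusted.afr.<volname>-client-N naming:

# run on each node (ixmed1 and ixmed2)
BRICK=/glusterfs-bricks/cluster/cluster
GFID=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635
getfattr -d -m . -e hex "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
# non-zero trusted.afr.cluster-client-* values on both bricks, each blaming
# the other copy, confirm a data/metadata split-brain on this gfid

This only confirms the state; actually healing the file still needs either the heal CLI or a manual choice of the good copy, so the glfsheal connection failure described above has to be addressed first.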
Ravishankar N
2019-May-07 06:25 UTC
[Gluster-users] heal: Not able to fetch volfile from glusterd
On 06/05/19 6:43 PM, Łukasz Michalski wrote:
> Hi,
>
> I have a problem resolving split-brain in one of my installations.
>
> CentOS 7, glusterfs 3.10.12, replica on two nodes:
>
> [...]
>
> [2019-05-06 12:40:31.564407] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-0: server 10.0.104.26:49153 has not responded in the last 5 seconds, disconnecting.
> [2019-05-06 12:40:31.565764] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-1: server 10.0.7.26:49153 has not responded in the last 5 seconds, disconnecting.

This seems to be a problem. Have you changed the value of ping-timeout? Could you share the output of `gluster volume info`?

Does the same issue occur if you try to resolve the split-brain on the gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 using the `gluster volume heal <VOLNAME> split-brain` CLI?

-Ravi
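A sketch of that split-brain CLI, using the volume and brick names from this setup; the resolution policy and the gfid argument below are only examples of the accepted forms:

# let AFR pick the copy with the newest mtime as the source
gluster volume heal cluster split-brain latest-mtime gfid:2584a0e2-c0fa-4fde-8537-5d5b6a5a4635
# or explicitly nominate one brick's copy as the source
gluster volume heal cluster split-brain source-brick ixmed1:/glusterfs-bricks/cluster/cluster gfid:2584a0e2-c0fa-4fde-8537-5d5b6a5a4635

Note that these subcommands are served by the same glfsheal helper as "heal info", so if glfsheal cannot keep its brick connections alive they are likely to fail in the same way.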
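On the ping-timeout question, the effective value can be checked (and restored) per volume. A sketch, assuming the standard option name; 42 seconds is the shipped default, and the 5-second disconnects in glfsheal.log would be consistent with a lowered value:

# show the effective setting for this volume
gluster volume get cluster network.ping-timeout
# if it has been lowered, raise it back towards the default
gluster volume set cluster network.ping-timeout 42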