Łukasz Michalski
2019-May-06 13:13 UTC
[Gluster-users] heal: Not able to fetch volfile from glusterd
Hi,

I have a problem resolving split-brain in one of my installations.

CentOS 7, glusterfs 3.10.12, replica on two nodes:

[root@ixmed1 iscsi]# gluster volume status cluster
Status of volume: cluster
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ixmed2:/glusterfs-bricks/cluster/clus
ter                                         49153     0          Y       3028
Brick ixmed1:/glusterfs-bricks/cluster/clus
ter                                         49153     0          Y       2917
Self-heal Daemon on localhost               N/A       N/A        Y       112929
Self-heal Daemon on ixmed2                  N/A       N/A        Y       57774

Task Status of Volume cluster
------------------------------------------------------------------------------
There are no active volume tasks

When I try to access one file the client reports split-brain:

[2019-05-06 12:36:43.785098] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-cluster-replicate-0: Failing READ on gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed. [Input/output error]
[2019-05-06 12:36:43.787952] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-cluster-replicate-0: Failing FGETXATTR on gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed. [Input/output error]
[2019-05-06 12:36:43.788778] W [MSGID: 108027] [afr-common.c:2722:afr_discover_done] 0-cluster-replicate-0: no read subvols for (null)
[2019-05-06 12:36:43.790123] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352501: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde0803f390 (Input/output error)
[2019-05-06 12:36:43.794979] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352506: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0 (Input/output error)
[2019-05-06 12:36:43.800468] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352508: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0 (Input/output error)

The problem is that "gluster volume heal info" hangs for 10 seconds and returns:

    Not able to fetch volfile from glusterd
    Volume heal failed

glfsheal.log contains:

[2019-05-06 12:40:25.589879] I [afr.c:94:fix_quorum_options] 0-cluster-replicate-0: reindeer: incoming qtype = none
[2019-05-06 12:40:25.589967] I [afr.c:116:fix_quorum_options] 0-cluster-replicate-0: reindeer: quorum_count = 0
[2019-05-06 12:40:25.593294] W [MSGID: 101174] [graph.c:361:_log_if_unknown_option] 0-cluster-readdir-ahead: option 'parallel-readdir' is not recognized
[2019-05-06 12:40:25.593895] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 69786d65-6431-2d32-3037-3739322d3230 (0) coming up
[2019-05-06 12:40:25.593972] I [MSGID: 114020] [client.c:2352:notify] 0-cluster-client-0: parent translators are ready, attempting connect on transport
[2019-05-06 12:40:25.607836] I [MSGID: 114020] [client.c:2352:notify] 0-cluster-client-1: parent translators are ready, attempting connect on transport
[2019-05-06 12:40:25.608556] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-cluster-client-0: changing port to 49153 (from 0)
[2019-05-06 12:40:25.618167] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-cluster-client-1: changing port to 49153 (from 0)
[2019-05-06 12:40:25.629595] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-cluster-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-06 12:40:25.632031] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-0: Connected to cluster-client-0, attached to remote volume '/glusterfs-bricks/cluster/cluster'.
[2019-05-06 12:40:25.632100] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-06 12:40:25.632263] I [MSGID: 108005] [afr-common.c:4817:afr_notify] 0-cluster-replicate-0: Subvolume 'cluster-client-0' came back up; going online.
[2019-05-06 12:40:25.637707] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-cluster-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-06 12:40:25.639285] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-1: Connected to cluster-client-1, attached to remote volume '/glusterfs-bricks/cluster/cluster'.
[2019-05-06 12:40:25.639341] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-06 12:40:31.564407] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-0: server 10.0.104.26:49153 has not responded in the last 5 seconds, disconnecting.
[2019-05-06 12:40:31.565764] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-1: server 10.0.7.26:49153 has not responded in the last 5 seconds, disconnecting.
[2019-05-06 12:40:35.645545] I [MSGID: 114018] [client.c:2276:client_rpc_notify] 0-cluster-client-0: disconnected from cluster-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-06 12:40:35.645683] I [socket.c:3534:socket_submit_request] 0-cluster-client-0: not connected (priv->connected = -1)
[2019-05-06 12:40:35.645755] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-cluster-client-0: failed to submit rpc-request (XID: 0x7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport (cluster-client-0)
[2019-05-06 12:40:35.645807] W [MSGID: 114031] [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-0: remote operation failed [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.645887] I [socket.c:3534:socket_submit_request] 0-cluster-client-1: not connected (priv->connected = -1)
[2019-05-06 12:40:35.645918] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-cluster-client-1: failed to submit rpc-request (XID: 0x7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport (cluster-client-1)
[2019-05-06 12:40:35.645955] W [MSGID: 114031] [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-1: remote operation failed [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.646008] W [MSGID: 109075] [dht-diskusage.c:44:dht_du_info_cbk] 0-cluster-dht: failed to get disk info from cluster-replicate-0 [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.647846] I [MSGID: 114018] [client.c:2276:client_rpc_notify] 0-cluster-client-1: disconnected from cluster-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-06 12:40:35.647895] E [MSGID: 108006] [afr-common.c:4842:afr_notify] 0-cluster-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-05-06 12:40:35.647989] I [MSGID: 108006] [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no subvolumes up
[2019-05-06 12:40:35.648051] I [MSGID: 108006] [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no subvolumes up
[2019-05-06 12:40:35.648122] I [MSGID: 104039] [glfs-resolve.c:902:__glfs_active_subvol] 0-cluster: first lookup on graph 69786d65-6431-2d32-3037-3739322d3230 (0) failed (Drugi koniec nie jest połączony) [Drugi koniec nie jest połączony]

"Drugi koniec nie jest połączony" -> Transport endpoint is not connected

On the brick process side there is a connection attempt:

[2019-05-06 12:40:25.638032] I [addr.c:182:gf_auth] 0-/glusterfs-bricks/cluster/cluster: allowed = "*", received addr = "10.0.7.26"
[2019-05-06 12:40:25.638080] I [login.c:111:gf_auth] 0-auth/login: allowed user names: e2f4c8f4-d040-4856-b6e3-62611fbab0ea
[2019-05-06 12:40:25.638109] I [MSGID: 115029] [server-handshake.c:695:server_setvolume] 0-cluster-server: accepted client from ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0 (version: 3.10.12)
[2019-05-06 12:40:31.565931] I [MSGID: 115036] [server.c:559:server_rpc_notify] 0-cluster-server: disconnecting connection from ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0
[2019-05-06 12:40:31.566420] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-cluster-server: Shutting down connection ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0

I am not able to use any heal command because of this problem.

I have three volumes configured on those nodes. Their configuration is identical and the "gluster volume heal" command fails for all of them.

Can anyone help?

Thanks,
Łukasz
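For reference, the split-brain state of the affected file can usually be checked directly on the bricks, without going through glfsheal. A minimal sketch, assuming the usual brick layout where each file is hard-linked under .glusterfs/<first-two>/<next-two>/<gfid> and the AFR changelog xattrs follow the trusted.afr.<volname>-client-N naming:

# run on each node (ixmed1 and ixmed2)
BRICK=/glusterfs-bricks/cluster/cluster
GFID=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635
getfattr -d -m . -e hex "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
# non-zero trusted.afr.cluster-client-* values on both bricks, each blaming
# the other copy, confirm a data/metadata split-brain on this gfid

This only confirms the state; actually healing the file still needs either the heal CLI or a manual choice of the good copy, so the glfsheal connection failure described above has to be addressed first.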
Ravishankar N
2019-May-07 06:25 UTC
[Gluster-users] heal: Not able to fetch volfile from glusterd
On 06/05/19 6:43 PM, Łukasz Michalski wrote:
> Hi,
>
> I have a problem resolving split-brain in one of my installations.
>
> CentOS 7, glusterfs 3.10.12, replica on two nodes:
>
> [...]
>
> [2019-05-06 12:40:31.564407] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-0: server 10.0.104.26:49153 has not responded in the last 5 seconds, disconnecting.
> [2019-05-06 12:40:31.565764] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-1: server 10.0.7.26:49153 has not responded in the last 5 seconds, disconnecting.

This seems to be a problem. Have you changed the value of ping-timeout? Could you share the output of `gluster volume info`?

Does the same issue occur if you try to resolve the split-brain on the gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 using the `gluster volume heal <VOLNAME> split-brain` CLI?

-Ravi
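A sketch of that split-brain CLI, using the volume and brick names from this setup; the resolution policy and the gfid argument below are only examples of the accepted forms:

# let AFR pick the copy with the newest mtime as the source
gluster volume heal cluster split-brain latest-mtime gfid:2584a0e2-c0fa-4fde-8537-5d5b6a5a4635
# or explicitly nominate one brick's copy as the source
gluster volume heal cluster split-brain source-brick ixmed1:/glusterfs-bricks/cluster/cluster gfid:2584a0e2-c0fa-4fde-8537-5d5b6a5a4635

Note that these subcommands are served by the same glfsheal helper as "heal info", so if glfsheal cannot keep its brick connections alive they are likely to fail in the same way.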
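On the ping-timeout question, the effective value can be checked (and restored) per volume. A sketch, assuming the standard option name; 42 seconds is the shipped default, and the 5-second disconnects in glfsheal.log would be consistent with a lowered value:

# show the effective setting for this volume
gluster volume get cluster network.ping-timeout
# if it has been lowered, raise it back towards the default
gluster volume set cluster network.ping-timeout 42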