thr3ads.net - Gluster users - [Gluster-users] Unable to make HA work; mounts hang on remote node reboot [Apr 2015]

Ravishankar N

2015-Apr-08 04:26 UTC

[Gluster-users] Unable to make HA work; mounts hang on remote node reboot

On 04/07/2015 10:11 PM, CJ Baar wrote:
> Then, I issue "init 0" on node2, and the mount on node1 becomes
> unresponsive. This is the log from node1
> [2015-04-07 16:36:04.250693] W
> [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx
> modification failed
> [2015-04-07 16:36:04.251102] I
> [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received
> status volume req for volume test1
> The message "I [MSGID: 106004]
> [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer
> 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected
> from glusterd." repeated 39 times between [2015-04-07 16:34:40.609878] and
> [2015-04-07 16:36:37.752489]
> [2015-04-07 16:36:40.755989] I [MSGID: 106004]
> [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer
> 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected
> from glusterd.

This is the glusterd log. Could you also share the mount log from the
healthy node, covering the interval from when the mount became
unresponsive until it responded again?
If this is indeed the ping timer issue, you should see something like:
"server xxx has not responded in the last 42 seconds, disconnecting."
Have you, for testing's sake, tried reducing the network.ping-timeout
value to something lower and checked whether the hang lasts only that
long?

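
One way to check a mount log for the ping-timer message mentioned above is to scan it with a short script. What follows is only a minimal sketch and not something from the thread itself: the regular expression is an assumption based on the message wording quoted here, the function name `find_ping_timeouts` is hypothetical, and the sample text is the mount log entry posted later in this thread.

```python
import re

# Pattern assumed from the message wording quoted in this thread; other
# GlusterFS versions may phrase the message differently.
PING_EXPIRED = re.compile(
    r"server (?P<server>\S+) has not responded in the last "
    r"(?P<seconds>\d+)\s+seconds, disconnecting"
)


def find_ping_timeouts(log_text):
    """Return (server, seconds) for each ping-timer expiry found in log_text."""
    # Archived logs are often hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [(m.group("server"), int(m.group("seconds")))
            for m in PING_EXPIRED.finditer(flat)]


# Sample taken from the mount log posted later in this thread.
sample = """[2015-04-16 22:59:21.281365] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired]
0-common-client-0: server 172.31.64.200:49152 has not responded in the last 30
seconds, disconnecting."""

print(find_ping_timeouts(sample))
```

If the reported number of seconds matches the configured network.ping-timeout, the disconnect was triggered by the ping timer rather than by some other failure.
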
> This does not seem like desired behaviour. I was trying to create this
> cluster because I was under the impression it would be more resilient than a
> single-point-of-failure NFS server. However, if the mount halts when one node in
> the cluster dies, then I'm no better off.
>
> I also can't seem to figure out how to bring a volume online if only one
> node in the cluster is running; again, not really functioning as HA. The gluster
> service runs and the volume "starts", but it is not "online" or mountable until
> both nodes are running. In a situation where a node fails and we need storage
> online before we can troubleshoot the cause of the node failure, how do I get a
> volume to go online?

This is expected behavior. In a two-node cluster, if only one node is powered
on, glusterd will not start the other gluster processes (brick, nfs, shd)
until the glusterd on the other node is also up (i.e. quorum is met). If
you want to override this behavior, run `gluster vol start <volname>
force` on the node that is up.

-Ravi

> Thanks.

CJ Baar

2015-Apr-16 23:05 UTC

I appreciate the info. I have tried adjusting the ping-timeout setting, and it
seems to have no effect. The whole system hangs for 45+ seconds, which is about
how long the second node takes to reboot, no matter what the value of
ping-timeout is. The output of the mount log is below. It shows the adjusted
value I am currently testing (30s), but the system still hangs for longer than
that.

Also, I have realized that the problem is deeper than I originally thought.
It's not just the mount that hangs when a node reboots; it appears to be
the entire system. I cannot use my SSH connection, no matter where I am in the
system, and services such as httpd become unresponsive. I can ping the
"surviving" system, but other than that it appears pretty unusable. This is a
major drawback to using gluster. I can't afford to lose two entire systems if
one dies.

[2015-04-16 22:59:21.281365] C [rpc-clnt-ping.c:109:rpc_clnt_ping_timer_expired]
0-common-client-0: server 172.31.64.200:49152 has not responded in the last 30
seconds, disconnecting.
[2015-04-16 22:59:21.281560] E [rpc-clnt.c:362:saved_frames_unwind] (-->
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (-->
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (-->
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (-->
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] )))))
0-common-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27))
called at 2015-04-16 22:58:45.830962 (xid=0x6d)
[2015-04-16 22:59:21.281588] W [client-rpc-fops.c:2766:client3_3_lookup_cbk]
0-common-client-0: remote operation failed: Transport endpoint is not connected.
Path: / (00000000-0000-0000-0000-000000000001)
[2015-04-16 22:59:21.281788] E [rpc-clnt.c:362:saved_frames_unwind] (-->
/usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x7fce96450550] (-->
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fce96225787] (-->
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fce9622589e] (-->
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fce96225951]
(--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x15f)[0x7fce96225f1f] )))))
0-common-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at
2015-04-16 22:58:51.277528 (xid=0x6e)
[2015-04-16 22:59:21.281806] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk]
0-common-client-0: socket disconnected
[2015-04-16 22:59:21.281816] I [client.c:2215:client_rpc_notify]
0-common-client-0: disconnected from common-client-0. Client process will keep
trying to connect to glusterd until brick's port is available
[2015-04-16 22:59:21.283637] I [socket.c:3292:socket_submit_request]
0-common-client-0: not connected (priv->connected = 0)
[2015-04-16 22:59:21.283663] W [rpc-clnt.c:1562:rpc_clnt_submit]
0-common-client-0: failed to submit rpc-request (XID: 0x6f Program: GlusterFS
3.3, ProgVers: 330, Proc: 27) to rpc-transport (common-client-0)
[2015-04-16 22:59:21.283674] W [client-rpc-fops.c:2766:client3_3_lookup_cbk]
0-common-client-0: remote operation failed: Transport endpoint is not connected.
Path: /src (63fc077b-869d-4928-8819-a79cc5c5ffa6)
[2015-04-16 22:59:21.284219] W [client-rpc-fops.c:2766:client3_3_lookup_cbk]
0-common-client-0: remote operation failed: Transport endpoint is not connected.
Path: (null) (00000000-0000-0000-0000-000000000000)
[2015-04-16 22:59:52.322952] E
[client-handshake.c:1496:client_query_portmap_cbk] 0-common-client-0: failed to
get the port number for
[root at cfm-c glusterfs]#
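
The timestamps in the log above give a rough measure of how long the mount was blocked: the LOOKUP that was forcibly unwound was "called at 2015-04-16 22:58:45.830962", and the ping timer expired at 22:59:21.281365. A minimal sketch of that arithmetic follows; the helper name `seconds_between` is hypothetical, and the timestamps are copied from the log entries above.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Return the elapsed seconds between two GlusterFS log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


lookup_called = "2015-04-16 22:58:45.830962"  # "called at" for LOOKUP (xid=0x6d)
ping_expired = "2015-04-16 22:59:21.281365"   # ping timer expiry message
portmap_error = "2015-04-16 22:59:52.322952"  # last entry in the excerpt

# How long the pending LOOKUP waited before it was failed.
print(round(seconds_between(lookup_called, ping_expired), 2))
# How long after the disconnect the portmap query error was logged.
print(round(seconds_between(ping_expired, portmap_error), 2))
```

By this estimate the pending LOOKUP waited about 35.45 seconds before it was failed, somewhat longer than the 30-second timeout in effect, and the portmap query error was logged about 31.04 seconds after the disconnect. Both figures are consistent with the report that the hang outlasts the configured ping-timeout.
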

-CJ
