Martin Schlegel
2016-Nov-14 15:43 UTC
[Gluster-users] Volume ping-timeout parameter and client side mount timeouts
Hello Gluster Community,

We have two brick nodes running with replication for a volume gv0, for which we set "gluster volume set gv0 ping-timeout 20".

In our tests there seemed to be an unknown delay with this ping-timeout: we see it timing out much later, after about 35 seconds and not at around 20 seconds (see test below).

Our distributed database cluster is using Gluster as a secondary file system for backups etc. Its Pacemaker cluster manager needs to know how long to wait for the glusterfs-mounted file system to become available again before giving up, or when to fail over to another node.

1. When do we know when to give up waiting on the glusterfs mount point to become accessible again following an outage on the brick server this client was connected to?
2. Is there a timeout / interval setting on the client side that we could reduce, so that it more quickly tries to switch the mount point to a different, available brick server?

Regards,
Martin Schlegel

__________

Here is how we tested this:

As a test we blocked the entire network on one of these brick nodes:

root@glusterfs-brick-node1 $ date;iptables -A INPUT -i bond0 -j DROP ; iptables -A OUTPUT -o bond0 -j DROP
Mon Nov 14 08:26:55 UTC 2016

From the syslog on the glusterfs-client-node:

Nov 14 08:27:30 glusterfs-client-node1 pgshared1[26783]: [2016-11-14 08:27:30.275694] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-gv0-client-0: server glusterfs-brick-node1:49152 has not responded in the last 20 seconds, disconnecting.

<--- This last message "has not responded in the last 20 seconds" is confusing to me, because the brick node had clearly been blocked for 35 seconds already! Is there some client-side check interval that can be reduced?
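For reference, the "about 35 seconds" figure follows directly from the two timestamps in the test. Below is a minimal Python sketch of that arithmetic. The helper name elapsed_seconds is made up for illustration, and the sketch assumes both hosts have synchronized clocks in UTC and that the timestamp inside the log line reflects when the timer actually fired.

```python
from datetime import datetime

PING_TIMEOUT = 20.0  # seconds, the value set on gv0 in the test above


def elapsed_seconds(block_time: str, log_time: str) -> float:
    """Seconds between the `date` output printed when the network was
    blocked and the timestamp embedded in the client log line.

    Assumption: both clocks are synchronized and both values are UTC.
    """
    # Format of `date` output, e.g. "Mon Nov 14 08:26:55 UTC 2016"
    start = datetime.strptime(block_time, "%a %b %d %H:%M:%S %Z %Y")
    # Format of the bracketed log timestamp, e.g. "2016-11-14 08:27:30.275694"
    end = datetime.strptime(log_time, "%Y-%m-%d %H:%M:%S.%f")
    return (end - start).total_seconds()


delay = elapsed_seconds("Mon Nov 14 08:26:55 UTC 2016",
                        "2016-11-14 08:27:30.275694")
print(f"disconnect logged after {delay:.3f} s")                   # 35.276 s
print(f"excess over ping-timeout: {delay - PING_TIMEOUT:.3f} s")  # 15.276 s
```

So the disconnect was logged roughly 15 seconds later than the configured 20 seconds would suggest.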
Mohammed Rafi K C
2016-Nov-15 09:52 UTC
[Gluster-users] Volume ping-timeout parameter and client side mount timeouts
If I understand the query correctly, the problem is that Gluster takes more than 20 seconds to time out, even though the brick had already been offline for 35 seconds. With that assumption, I have some questions.

How did you determine that the timer expired only after 35s? From the log file? If so, glusterfs waits some time before flushing the logs so that it can push them out as a batch; I am not sure how long. So the actual timing in the logs may not be accurate.

If you have already confirmed with Wireshark or similar tools that it takes more than 20 seconds to disconnect the socket, then there could be something else that we need to look into. If not already done, can you confirm it using Wireshark or a similar tool?

Rafi KC

On 11/14/2016 09:13 PM, Martin Schlegel wrote:
> Hello Gluster Community
>
> We have 2x brick nodes running with replication for a volume gv0 for which set a
> "gluster volume set gv0 ping-timeout 20".
>
> In our tests it seemed there is unknown delay with this ping-timeout - we see it
> timing out much later after about 35 seconds and not at around 20 seconds (see
> test below).
>
> Our distributed database cluster is using Gluster as a secondary file system for
> backups etc. - it's Pacemaker cluster manager needs to know how long to wait
> before giving up on the glusterfs mounted file system to become available again
> or when to failover to another node.
>
> 1. When do we know when to give up waiting on the glusterfs mount point to
> become accessible again following an outage on the brick server this client was
> connected to ?
> 2. Is there a timeout / interval setting on the client side that we could
> reduce, so that it more quickly tries to switch the mount point to a different,
> available brick server ?
>
>
> Regards,
> Martin Schlegel
>
> __________
>
> Here is how we tested this:
>
> As a test we blocked the entire network on one of these brick nodes:
> root@glusterfs-brick-node1 $ date;iptables -A INPUT -i bond0 -j DROP ; iptables
> -A OUTPUT -o bond0 -j DROP
> Mon Nov 14 08:26:55 UTC 2016
>
> From the syslog on the glusterfs-client-node
> Nov 14 08:27:30 glusterfs-client-node1 pgshared1[26783]: [2016-11-14
> 08:27:30.275694] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-gv0-client-0: server glusterfs-brick-node1:49152 has not responded in the last
> 20 seconds, disconnecting.
>
> <--- This last message "has not responded in the last 20 seconds" is confusing
> to me, because the brick node was clearly blocked for 35 seconds already ! Is
> there some client-side check interval that can be reduced ?
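
A packet capture is the direct way to confirm the socket-level timing. As a rough complement that does not depend on when log lines are flushed, the stall can also be timed from the client itself. The following Python sketch is only an illustration under assumptions: the function names are made up, the mount path in the usage note is hypothetical, and a plain stat of the mount point may be answered from a cache and therefore may not stall at all, in which case a read or write on a file inside the mount would be a better probe.

```python
import os
import time


def probe(path, interval=1.0, count=60,
          clock=time.monotonic, sleep=time.sleep):
    """Stat `path` `count` times and record how long each call blocks.

    Returns a list of (succeeded, seconds_blocked) tuples. Durations use a
    monotonic clock, so they do not depend on log timestamps or on
    wall-clock adjustments.
    """
    samples = []
    for _ in range(count):
        started = clock()
        try:
            os.stat(path)
            succeeded = True
        except OSError:
            # e.g. ENOTCONN ("Transport endpoint is not connected")
            succeeded = False
        samples.append((succeeded, clock() - started))
        sleep(interval)
    return samples


def longest_stall(samples):
    """Longest time, in seconds, that any single probe call blocked."""
    return max(duration for _, duration in samples)
```

Usage would be something like `longest_stall(probe("/mnt/gv0"))`, started on the client shortly before the iptables rules are added on the brick node, where `/mnt/gv0` stands in for the real mount point. A longest stall well above 20 seconds would suggest the delay is real and not a logging artifact.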