Liu, Dan
2019-Feb-25 02:43 UTC
[Samba] glusterfs + ctdb + nfs-ganesha , unplug the network cable of serving node, takes around ~20 mins for IO to resume
Hi all,

We did some failover/failback tests on 2 nodes (A and B) with the
architecture 'glusterfs + ctdb (public address) + nfs-ganesha'.

1st:
During a write, unplug the network cable of serving node A
-> The NFS client took a few seconds to recover and continue writing.

After some minutes, plug the network cable of serving node A back in
-> The NFS client also took a few seconds to recover and continue writing.

2nd:
During a write, unplug the network cable of serving node A
-> The NFS client took 20 minutes to recover and continue writing.
That recovery time is too slow for clients to accept.

According to the CTDB log, during failover and failback the failed node
failed to kill the connection with the client, while the recovering node
failed to send the 'tickle ack' to the client to re-establish the
connection. So for 1~3 s the takeover fails.

Why did the fast recovery fail, and why did it take 20 minutes to recover
successfully? Does anyone know the reason?

We are looking forward to your reply.
Thanks.

-------------------------------------------------------------
The following are some test logs and the configuration.

Node A:
cat /var/log/log.ctdb
2019/02/22 18:00:57.468629 ctdbd[18309]: Release of IP 10.10.11.51/24 on interface eth3 node:1
2019/02/22 18:01:02.132565 ctdbd[18309]: Monitoring event was cancelled
2019/02/22 18:01:02.547046 ctdb-eventd[18310]: 10.interface: Killing TCP connection ::ffff:10.10.11.18:951 ::ffff:10.10.11.51:2049
2019/02/22 18:01:02.547112 ctdb-eventd[18310]: 10.interface: Failed sendto (No route to host)
...
2019/02/22 18:01:02.547259 ctdb-eventd[18310]: 10.interface: Failed sendto (No route to host)
2019/02/22 18:01:02.548458 ctdb-eventd[18310]: 10.interface: Failed to kill TCP connections for IP 10.10.11.51 (1/1 remaining)
2019/02/22 18:01:02.680399 ctdb-eventd[18310]: 60.nfs: method return time=1550829662.675715 sender=:1.1803 -> destination=:1.1819 serial=445 reply_serial=2
2019/02/22 18:01:02.680479 ctdb-eventd[18310]: 60.nfs: boolean true
2019/02/22 18:01:02.680500 ctdb-eventd[18310]: 60.nfs: string "Started grace period"
2019/02/22 18:01:03.255313 ctdb-eventd[18310]: 60.nfs: Reconfiguring service "nfs"...
2019/02/22 18:01:03.353830 ctdb-recoverd[18402]: Takeover run completed successfully
2019/02/22 18:01:05.345783 ctdbd[18309]: Starting traverse on DB ctdb.tdb (id 9809)
2019/02/22 18:01:05.348204 ctdbd[18309]: Ending traverse on DB ctdb.tdb (id 9809), records 1

Node B:
cat /var/log/log.ctdb
2019/02/22 18:01:02.699755 ctdbd[29541]: Takeover of IP 10.10.11.51/24 on interface eth3
2019/02/22 18:01:02.701360 ctdbd[29541]: Monitoring event was cancelled
2019/02/22 18:01:03.010811 ctdb-eventd[29542]: 60.nfs: removed '/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/.noderefs/10.10.11.51'
2019/02/22 18:01:03.010896 ctdb-eventd[29542]: 60.nfs: '/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/.noderefs/10.10.11.51' -> '/mnt/mgt_vol/grp45/nfs_state/nfs-ganesha/node-4'
2019/02/22 18:01:03.010922 ctdb-eventd[29542]: 60.nfs: method return time=1550829663.005719 sender=:1.192 -> destination=:1.206 serial=438 reply_serial=2
2019/02/22 18:01:03.010937 ctdb-eventd[29542]: 60.nfs: boolean true
2019/02/22 18:01:03.010973 ctdb-eventd[29542]: 60.nfs: string "Started grace period"
2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:03.303342 ctdb-eventd[29542]: 60.nfs: Reconfiguring service "nfs"...
2019/02/22 18:01:03.347137 ctdb-recoverd[29647]: Reenabling takeover runs
2019/02/22 18:01:04.172108 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:04.172180 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.278093 ctdbd[29541]: Failed sendto (No route to host)
2019/02/22 18:01:05.278159 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.389656 ctdbd[29541]: Starting traverse on DB ctdb.tdb (id 6238)
2019/02/22 18:01:05.392182 ctdbd[29541]: Ending traverse on DB ctdb.tdb (id 6238), records 1

cat /etc/sysconfig/ctdb
CTDB_RECOVERY_LOCK=/mnt/mgt_vol/grp45/lockfile
CTDB_PUBLIC_INTERFACE=eth3
CTDB_NODES=/mnt/mgt_vol/grp45/nodes
CTDB_PUBLIC_ADDRESSES=/mnt/mgt_vol/grp45/public_addresses
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=no
CTDB_MANAGES_VSFTP=yes
CTDB_SAMBA_SKIP_SHARE_CHECK=yes
CTDB_MANAGES_NFS=yes
CTDB_NFS_CALLOUT=/etc/ctdb/nfs-ganesha-callout
CTDB_NFS_STATE_FS_TYPE=glusterfs
CTDB_NFS_CHECKS_DIR=/etc/ctdb/nfs-checks-ganesha.d/
CTDB_NFS_STATE_MNT=/mnt/mgt_vol/grp45/nfs_state
CTDB_NFS_SKIP_SHARE_CHECK=yes
CTDB_SET_KeepaliveLimit=1

cat /mnt/mgt_vol/grp45/nodes
192.168.100.15 #inner network
192.168.100.14 #inner network

cat /mnt/mgt_vol/grp45/public_addresses
10.10.11.50/24 eth3 #extranet network
10.10.11.51/24 eth3 #extranet network

--------------------------------------------------
Liu Dan
PF Dept
Nanjing Fujitsu Nanda Software Tech.Co.,Ltd.(FNST)
TEL: +86+25-86630566-8512
FUJITSU INTERNAL: 79955-8512
EMail: liud.fnst at cn.fujitsu.com
--------------------------------------------------
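To line up the events from the two logs in this message, a small script can help. The following is only a minimal Python sketch, not part of CTDB: the helper names `parse` and `first_time` are hypothetical, the sample lines are copied from the logs in this message, and the computed gap is only meaningful if the clocks of node A and node B are synchronized (an assumption the message does not confirm).

```python
import re
from datetime import datetime

# Lines copied from the node A and node B logs in this message.
NODE_A_LOG = """\
2019/02/22 18:00:57.468629 ctdbd[18309]: Release of IP 10.10.11.51/24 on interface eth3 node:1
"""
NODE_B_LOG = """\
2019/02/22 18:01:02.699755 ctdbd[29541]: Takeover of IP 10.10.11.51/24 on interface eth3
2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:04.172180 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
2019/02/22 18:01:05.278159 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18
"""

# "<date> <time> <process[pid]>: <message>" as seen in the log excerpts.
LINE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) (\S+): (.*)$")


def parse(text):
    """Return a list of (timestamp, message) tuples for lines that match LINE."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f")
            events.append((ts, m.group(3)))
    return events


def first_time(events, needle):
    """Timestamp of the first message containing `needle`."""
    return next(ts for ts, msg in events if needle in msg)


node_a = parse(NODE_A_LOG)
node_b = parse(NODE_B_LOG)

release = first_time(node_a, "Release of IP 10.10.11.51")
takeover = first_time(node_b, "Takeover of IP 10.10.11.51")
gap = (takeover - release).total_seconds()

tickle_failures = sum("Failed to send tcp tickle ack" in msg for _, msg in node_b)

print(f"release -> takeover gap: {gap:.6f} s")
print(f"failed tickle ACKs on node B: {tickle_failures}")
```

Run against the full log files instead of the embedded samples, the same two numbers could be compared between a fast run and a slow run.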
Martin Schwenke
2019-Mar-04 05:54 UTC
[Samba] glusterfs + ctdb + nfs-ganesha , unplug the network cable of serving node, takes around ~20 mins for IO to resume
Hi Dan,

On Mon, 25 Feb 2019 02:43:31 +0000, "Liu, Dan via samba"
<samba at lists.samba.org> wrote:

> We did some failover/failback tests on 2 nodes(A and B) with
> architecture 'glusterfs + ctdb(public address) + nfs-ganesha'.
>
> 1st:
> During write, unplug the network cable of serving node A
> ->NFS Client took a few seconds to recover to continue writing.
>
> After some minutes, plug the network cable of serving node A
> ->NFS Client also took a few seconds to recover to continue
> writing.
>
> 2nd:
> During write, unplug the network cable of serving node A
> ->NFS Client took 20 minutes to recover to continue writing.
> It is too slow for clients to accept the recovery time.

Definitely! What was different between "1st" and "2nd"? Were they
testing different scenarios?

> From CTDB log, during failover and failback, fail node failed to kill
> the connection with client while recovery node failed to send 'tickle
> ack' to client to re-established connection.

The first really isn't a problem. I'm not sure why CTDB attempts to do
a 2 way kill from the releasing node. We're going to stop doing that in
the future.

The 2nd is a mystery. Are you sure the network connection on node B was
up? This message seems to indicate the network is down:

  2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
  2019/02/22 18:01:03.065191 ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp tickle ack for ::ffff:10.10.11.18

peace & happiness,
martin
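As background to the "No route to host" message quoted above: that text is the standard system error string for the errno value EHOSTUNREACH, which is what a failed send reports when the kernel has no usable path to the peer. The Python sketch below only illustrates that mapping. It is not CTDB code, and `is_unreachable` is a hypothetical helper name.

```python
import errno
import os


def is_unreachable(exc: OSError) -> bool:
    """True if a send failed because the host or network was unreachable."""
    return exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH)


# Build the same kind of error a failed send would raise, without
# touching the network.
err = OSError(errno.EHOSTUNREACH, os.strerror(errno.EHOSTUNREACH))

# Prints the symbolic errno name and the system's message for it.
print(errno.errorcode[err.errno], "-", err.strerror)
print("unreachable:", is_unreachable(err))
```

A check like this separates routing or link problems on the sending node from other send failures such as a refused connection.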
Liu, Dan
2019-Mar-04 08:38 UTC
[Samba] glusterfs + ctdb + nfs-ganesha , unplug the network cable of serving node, takes around ~20 mins for IO to resume
Martin,

Thanks for replying.

> > We did some failover/failback tests on 2 nodes(A and B) with
> > architecture 'glusterfs + ctdb(public address) + nfs-ganesha'.
> >
> > 1st:
> > During write, unplug the network cable of serving node A
> > ->NFS Client took a few seconds to recover to continue writing.
> >
> > After some minutes, plug the network cable of serving node A
> > ->NFS Client also took a few seconds to recover to continue
> > writing.
> >
> > 2nd:
> > During write, unplug the network cable of serving node A
> > ->NFS Client took 20 minutes to recover to continue writing.
> > It is too slow for clients to accept the recovery time.
>
> Definitely! What was different between "1st" and "2nd"? Were they testing
> different scenarios?

"1st" and "2nd" were tested in the same scenario.
After we updated the client's (Mac mini) OS to Mojave v10.14.3 (the problem
happened on v10.10), every takeover finished within a few minutes, unlike
in "2nd".
So I think the ~20 minute takeover might be a problem on the client side.

> > From CTDB log, during failover and failback, fail node failed to kill
> > the connection with client while recovery node failed to send 'tickle
> > ack' to client to re-established connection.
>
> The first really isn't a problem. I'm not sure why CTDB attempts to do
> a 2 way kill from the releasing node. We're going to stop doing that in
> the future.
>
> The 2nd is a mystery. Are you sure the network connection on node B was
> up? This message seems to indicate the network is down:
>
> 2019/02/22 18:01:03.065121 ctdbd[29541]: Failed sendto (No route to host)
> 2019/02/22 18:01:03.065191
> ctdbd[29541]: ../ctdb/server/ctdb_takeover.c:388 Failed to send tcp
> tickle ack for ::ffff:10.10.11.18

Throughout the test procedure we only unplugged node A's network cable.
In the CTDB logs from all the tests (over 10 runs), this message was output
during every takeover.
Because the takeover succeeds after some minutes, I think node B's network
was up when the tickle ACK was sent, but I am not sure.

Best Regards