Hi Uli,
[Sorry for slow response, life is busy...]
On Mon, 13 Feb 2023 15:06:26 +0000, Ulrich Sibiller via samba
<samba at lists.samba.org> wrote:
> we are using ctdb 4.15.5 on RHEL8 (Kernel
> 4.18.0-372.32.1.el8_6.x86_64) to provide NFS v3 (via tcp) to RHEL7/8
> clients. Whenever an ip takeover happens most clients report
> something like this:
> [Mon Feb 13 12:21:22 2023] nfs: server x.x.253.252 not responding, still trying
> [Mon Feb 13 12:21:28 2023] nfs: server x.x.253.252 not responding, still trying
> [Mon Feb 13 12:22:31 2023] nfs: server x.x.253.252 OK
> [Mon Feb 13 12:22:31 2023] nfs: server x.x.253.252 OK
> 
> And/or
> 
> [Mon Feb 13 12:27:01 2023] lockd: server x.x.253.252 not responding, still trying
> [Mon Feb 13 12:27:37 2023] lockd: server x.x.253.252 not responding, still trying
> [Mon Feb 13 12:27:43 2023] lockd: server x.x.253.252 OK
> [Mon Feb 13 12:28:46 2023] lockd: server x.x.253.252 not responding, still trying
> [Mon Feb 13 12:28:50 2023] lockd: server x.x.253.252 OK
> [Mon Feb 13 12:28:50 2023] lockd: server x.x.253.252 OK
> 
> (x.x.253.252 is one of 8 IPs that ctdb handles).
OK, this part looks kind-of good.  It would be interesting to know how
long the entire failover process is taking.
> _Some_ of the clients fail to get the nfs mounts alive again after
> those messages. We then have to reboot those or use a lazy umount. We
> are seeing this at almost all takeovers.
> Because of that we currently cannot update/reboot nodes without
> affecting many users which to some degree defeats the purpose of ctdb
> which should provide seamless takeovers...
:-(
> Today I (again) tried to debug these hanging clients and came across this
> output in the ctdb log:
> ...
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.253.85:917 x.x.253.252:599
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.253.72:809 x.x.253.252:599
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.253.252:2049 53.55.144.116:861
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Sending a TCP RST to for connection x.x.250.216:983 x.x.253.252:2049
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Killed 230/394 TCP connections to released IP x.x.253.252
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface: Remaining connections:
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:2049 x.x.247.218:727
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:599 x.x.253.156:686
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:2049 x.x.249.213:810
> Feb 13 12:27:55 <nfsserver> ctdb-eventd[85619]: 10.interface:   x.x.253.252:2049 x.x.253.177:814
> ...
> The log (which I unfortunately do not have anymore) showed 405 of
> those "Sending a TCP RST" lines in a row which is more than the
> reported 394.
> This output is coming from the releaseip section in
> /etc/ctdb/events/legacy/10.interface, which calls
> kill_tcp_connections (in /etc/ctdb/functions) which calls the
> ctdb_killtcp utility to actually kill the connections. This happens
> inside a block_ip/unblock_ip guard that temporarily sets up a
> firewall rule to drop all incoming packages for the ip (x.x.253.252
> in this case).
> 
> Obviously the tool fails to be 100% successful.
The main aim of this code is to terminate the server end of TCP
connections.  This is meant to stop problems with failback, so when it
doesn't work properly it should not cause problems with simple failover.
It does a 2-way kill for non-SMB connections, probably because SMB
connections are tracked a little better in CTDB, because we have inside
knowledge. ;-)
So, for your NFS clients, it will attempt to terminate both the server
end of the connection and the client end.  It is unclear if you have any
SMB clients and I'm not sure what is on port 599.
This means the number of connections will not match the number of
"Sending a TCP RST ..." messages.
The one thing I would ask you to check for this part is whether the
remaining connections are among those for which a TCP RST has been
sent.
* If not, then they are new connections, which means the block_ip part
  isn't working (and you would expect to see errors from iptables).
* If there has been an attempt to RST the connections then I'm
  interested to know if the public address is on an ordinary ethernet
  interface.
  ctdb_killtcp has been very well tested.  In the function that calls
  ctdb_killtcp, you could add CTDB_DEBUGLEVEL=DEBUG to the ctdb_killtcp
  call, like this:
    CTDB_DEBUGLEVEL=DEBUG "${CTDB_HELPER_BINDIR}/ctdb_killtcp" "$_iface" || {
  For Samba 4.18, a new script variable CTDB_KILLTCP_DEBUGLEVEL has
  been added for this purpose.
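
A rough way to check whether the remaining connections had a RST sent is
sketched below. It assumes the log has one entry per line in the format
quoted above (unwrapped), and the function name check_remaining is made
up for illustration. Endpoint order is ignored because the RST messages
can list either end first.

```shell
#!/bin/sh
# Hypothetical helper, not part of CTDB: read a ctdb log on stdin and, for
# each line following "Remaining connections:", report whether a
# "Sending a TCP RST" message was logged for the same pair of endpoints.
check_remaining() {
    awk '
    # Order-independent key for a pair of endpoints (forced string compare)
    function key(a, b) { return (a "" < b "") ? a " " b : b " " a }

    /Sending a TCP RST/      { rst[key($(NF-1), $NF)] = 1; next }
    /Remaining connections:/ { in_rem = 1; next }

    # Lines after the header whose last two fields look like ADDR:PORT
    in_rem && /10\.interface:/ && $(NF-1) ~ /:[0-9]+$/ && $NF ~ /:[0-9]+$/ {
        status = (key($(NF-1), $NF) in rst) ? "RST-sent" : "no-RST"
        print status, $(NF-1), $NF
        next
    }

    # Anything else ends the "Remaining connections" list
    { in_rem = 0 }
    '
}
```

For example, `check_remaining < /var/log/log.ctdb` (the log path is an
assumption, adjust for your logging setup). Lines marked "no-RST" would
point at new connections, lines marked "RST-sent" at the second case.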
> I am wondering about possible reasons for ctdb not killing all
> affected connections. Are there tunables regarding this behaviour,
> maybe some timeout that is set too low for our number of connections?
> Any debugging suggestions?
I doubt it is a timeout issue.  The default timeouts have been used
successfully in very large scale deployments.
The main part of resetting client connections is done by ctdbd on the
takeover node(s).
The first possible point of failure is that CTDB only updates NFS
connections in 60.nfs monitor events, every ~15 seconds.  If there is a
lot of unmounting/mounting happening then CTDB could be missing some
connections on the takeover node.  However, this isn't typical of the
way NFS is used, with mounts often being established at boot time.  If
clients are using an automounter, with a fairly low timeout, then I
suppose it is possible that there are often new connections.
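
To get a feel for whether CTDB is missing connections, one option is to
compare what the kernel has with what CTDB has registered, shortly
before a planned failover. The sketch below only handles the kernel
side: it filters `ss -Htn` output for a given public IPv4 address. The
function name established_to is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of CTDB: read `ss -Htn` output on stdin and
# print "local peer" for established connections whose local address is the
# given public IPv4 address.
established_to() {
    awk -v ip="$1" '$1 == "ESTAB" && index($4, ip ":") == 1 { print $4, $5 }'
}

# Usage, on the node currently hosting the public address:
#   ss -Htn | established_to x.x.253.252
#   ctdb gettickles x.x.253.252    # connections CTDB has registered
```

Connections listed by ss but absent from the `ctdb gettickles` output
for longer than a monitor interval would be candidates for clients that
never get tickled.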
I have some old work-in-progress to make tracking of all connections
more real-time (sub-second), but it can't be integrated into CTDB
without substantial rewriting.
The next possible points of failure would be if packets sent by ctdbd
were not delivered/processed, because CTDB basically uses network
tricks to encourage clients to disconnect and reconnect quickly. There
are 2 types of packets sent by the takeover node:
* TCP "tickle" ACKs
  These are sent to clients to wake them up and cause their connection
  to be terminated, so they reconnect:
  1. A TCP ACK is sent with a sequence number of 0 (and an identifying
     window size of 1234, which is easy to see in tcpdump or similar).
  2. The client responds to this nonsense with an ACK containing the
     real sequence number.
  3. When the server (takeover node) receives this, the TCP stack
     decides it is nonsense because it has no such connection, and
     sends a RST.  The client is disconnected and reconnects.
  This is repeated 3 times.
  Note that this is effectively a denial of service attack.
  It appears that your clients may be on different subnets to the
  servers.  If the routers filter/drop the TCP tickle ACKs then the
  clients will not know to establish a new connection and will still
  attempt to send packets on the old connection.  In this case, they
  should get a RST from the takeover node's TCP stack... so, this is
  strange.
  Are there any patterns for the clients that do not reconnect?  Are
  they all on certain subnets?  Is it always the same clients?
  You can test this part by running tcpdump on problematic clients and
  seeing if the (3) tickle ACKs are seen (with window size 1234).  Do
  you see a RST and an attempt to reconnect?
* Gratuitous ARPs
  These are meant to tell machines on the local subnet to refresh their
  ARP cache with a different MAC address for a public IP address that
  has moved.
  If there are routers involved, then these may also filter out
  gratuitous ARPs.  If several problematic clients are behind a
  particular router, and you can check the ARP cache on that router
  during failover, then you might find that the router is ignoring
  gratuitous ARPs.  There may be a configuration setting to change this.
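
Both kinds of packet can be picked out with tcpdump capture filters. The
sketch below just builds the filter strings. The function names are made
up for illustration, and the window-size match relies on the tickle ACKs
carrying a raw window field of 1234, as described above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of CTDB: print tcpdump filter expressions
# for a given public IPv4 address.

# Tickle ACKs (raw TCP window field, header bytes 14-15, equal to 1234),
# plus any RST or SYN, so the reset and the reconnect attempt show up too.
tickle_filter() {
    printf 'host %s and (tcp[14:2] = 1234 or tcp[tcpflags] & (tcp-rst|tcp-syn) != 0)\n' "$1"
}

# ARP traffic mentioning the public address (includes gratuitous ARPs).
garp_filter() {
    printf 'arp host %s\n' "$1"
}

# Usage (as root):
#   tcpdump -n -i any "$(tickle_filter x.x.253.252)"      # on a problematic client
#   tcpdump -n -e -i any "$(garp_filter x.x.253.252)"     # on the servers' subnet
```

On a healthy client you would expect to see the tickle ACK, the client's
own ACK, a RST from the takeover node and then a SYN as it reconnects.
Gratuitous ARPs are only visible on the servers' local subnet, so that
capture needs to run on a machine there rather than on a routed client.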
Good luck.  Please let us know what you find...
At some stage in the next few weeks, I'll convert this to a wiki
page... unless someone else does it first...  :-)
peace & happiness,
martin