Martin Schwenke schrieb am 15.02.2023 23:23:

> Hi Uli,
>
> [Sorry for slow response, life is busy...]

Thanks for answering anyway!

> On Mon, 13 Feb 2023 15:06:26 +0000, Ulrich Sibiller via samba

> OK, this part looks kind-of good. It would be interesting to know how
> long the entire failover process is taking.

What exactly would you define as the begin and end of the failover?

>> The log (which I unfortunately do not have anymore) showed 405 of

While the journal has rotated out that log data, I found /var/log/messages still has it. So I can provide the log from all ctdb systems, if required.

> The main aim of this code is to terminate the server end of TCP
> connections. This is meant to stop problems with failback, so when it
> doesn't work properly it should not cause problems with simple failover.

Hmm, the node where the above output was seen had released and taken that very IP shortly before, and then released it again. The last point in time is the one where I saw the remaining connections:

Feb 13 12:08:11 serverpnn1 ctdbd[85617]: ../../ctdb/server/eventscript.c:655 Running event releaseip with arguments team1 x.x.253.252 24
Feb 13 12:16:26 serverpnn1 ctdbd[85617]: ../../ctdb/server/eventscript.c:655 Running event takeip with arguments team1 x.x.253.252 24
Feb 13 12:27:50 serverpnn1 ctdbd[85617]: ../../ctdb/server/eventscript.c:655 Running event releaseip with arguments team1 x.x.253.252 24

So if the first releaseip could not clear all connections they might pile up...

> So, for your NFS clients, it will attempt to terminate both the server
> end of the connection and the client end. It is unclear if you have any
> SMB clients and I'm not sure what is on port 599.

No, no SMB, only NFSv3.
Port 599 is lockd:

[root at smtcfc0247 ~]# rpcinfo -p
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100005    1   udp    597  mountd
    100005    1   tcp    597  mountd
    100005    2   udp    597  mountd
    100005    2   tcp    597  mountd
    100005    3   udp    597  mountd
    100005    3   tcp    597  mountd
    100003    3   tcp   2049  nfs
    100227    3   tcp   2049  nfs_acl
    100021    1   udp    599  nlockmgr
    100021    3   udp    599  nlockmgr
    100021    4   udp    599  nlockmgr
    100021    1   tcp    599  nlockmgr
    100021    3   tcp    599  nlockmgr
    100021    4   tcp    599  nlockmgr
    100011    1   udp    598  rquotad
    100011    2   udp    598  rquotad
    100011    1   tcp    598  rquotad
    100011    2   tcp    598  rquotad
    100024    1   udp    595  status
    100024    1   tcp    595  status

> This means the number of connections will not match the number of "Sending
> a TCP RST ..." messages.

Ok, but without SMB it should be doubled, right?

> The one thing I would ask you to check for this part is whether the
> remaining connections are among those for which a TCP RST has been sent.

I saw 405 "Sending RST" lines and then:

"Killed 230/394 TCP connections to released IP x.x.253.252
Remaining connections:"

followed by 164 remaining connections. Those numbers match (230+164=394). But the remaining list does indeed contain 93 IPs that are not in the sending list.

> * If not, then the block_ip part isn't working (and you would expect to
>   see errors from iptables) because they are new connections.

I have not seen any errors from iptables.

> * If there has been an attempt to RST the connections then I'm
>   interested to know if the public address is on an ordinary ethernet
>   interface.

We are using two teaming interfaces (one for each LAN). Each team interface has 2x10Gbit LACP.

> ctdb_killtcp has been very well tested.
> In the function that calls
> ctdb_killtcp, you could add CTDB_DEBUGLEVEL=DEBUG to the ctdb_killtcp
> call, like this:
>
>   CTDB_DEBUGLEVEL=DEBUG "${CTDB_HELPER_BINDIR}/ctdb_killtcp" "$_iface" || {

I have done that already. It is not really helpful as it floods the log with messages about packets it is ignoring. Within one minute of log I got >7000 of them, e.g.

Feb 13 17:36:08 <server> ctdb-eventd[29607]: 10.interface.debug: Ignoring packet: z.z.139.15:2049 z.z.155.1:720, seq=0, ack_seq=0, rst=0, window=1234
Feb 13 17:36:08 <server> ctdb-eventd[29607]: 10.interface.debug: reset_connections_capture_tcp_handler: Ignoring packet for unknown connection: z.z.139.34:4739 z.z.139.14:41437

> For Samba 4.18, a new script variable CTDB_KILLTCP_DEBUGLEVEL has
> been added for this purpose.

Yes, I have seen this.

> I doubt it is a timeout issue. The default timeouts have been used
> successfully in very large scale deployments.

Has there been tuning of the TCP networking stack in Linux, maybe?

> The main part of resetting client connections is done by ctdbd on the
> takeover node(s).
>
> The first possible point of failure is that CTDB only updates NFS
> connections in 60.nfs monitor events, every ~15 seconds. If there is a
> lot of unmounting/mounting happening then CTDB could be missing some
> connections on the takeover node. However, this isn't typical of the
> way NFS is used, with mounts often being established at boot time. If
> clients are using an automounter, with a fairly low timeout, then I
> suppose it is possible that there are often new connections.

We do not have any static mounts, everything is using autofs. We are using a dismount_interval of 300. There are between 1200 and 1500 clients mounting homes and projects from this cluster, so there are quite a few mounts and umounts happening.
In the last 7 hours we saw ~800-900 mounts (and about the same number of umounts) per server node.

> I have some old work-in-progress to make tracking of all connections
> more real-time (sub-second), but it needs rewriting... and can't be
> integrated into CTDB without substantial re-writing.

Looking forward to this.

> The next possible points of failure would be if packets sent by ctdbd
> were not delivered/processed, because CTDB basically uses network
> tricks to encourage clients to disconnect and reconnect quickly. There
> are 2 types of packets sent by the takeover node:
>
> * TCP "tickle" ACKs
>
>   These are sent to clients to wake them up and cause their connection
>   to be terminated, so they reconnect:
>   ...
>   This is repeated 3 times.

For the point in time noted above I see 308 "sending tcp tickle ack for 2049-><IP:Port>" messages. All messages are there 3 times. But all of them are only for port 2049, none for 599 (lockd). On the live system I currently see 120 connections to port 599, but ctdb gettickles does not show a single one with this port! Is this expected?

> It appears that your clients may be on different subnets to the
> servers. If the routers filter/drop the TCP tickle ACKs then the
> clients will not know to establish a new connection and will still
> attempt to send packets on the old connection. In this case, they
> should get a RST from the takeover node's TCP stack... so, this is
> strange.

Yes, we have a flat /16 network on team0 and many /24 subnets on team1. It _feels_ like the clients on the team1 network are affected more often, but OTOH there are _much_ more clients in that network...

> Are there any patterns for the clients that do not reconnect? Are
> they all on certain subnets? Is it always the same clients?

No, we could not find any pattern yet.

> You can test this part by running tcpdump on problematic clients and
> seeing if the (3) tickle ACKs are seen (with window size 1234).
> Do you see a RST and an attempt to reconnect?

As we cannot predict, and also cannot trigger, the problem at will, this is not so easy. Running tcpdump everywhere is not desirable.

> At some stage in the next few weeks, I'll convert this to a wiki
> page... unless someone else does it first... :-)

+1!

Speaking of the wiki: https://wiki.samba.org/index.php/Setting_up_CTDB_for_Clustered_NFS states:

"Never mount the same NFS share on a client from two different nodes
in the cluster at the same time. The client-side caching in NFS is
very fragile and assumes that an object can only be accessed through
one single path at a time."

It is unclear what "share" means in this context. We are exporting the whole fs (/filesys) but mounting only subdirs (e.g. /filesys/home/userA and /filesys/home/userB). So we are unsure whether this warning applies here or not.

Thanks a lot,

Uli

--
Dipl.-Inf. Ulrich Sibiller           science + computing ag
System Administration                    Hagellocher Weg 73
Hotline +49 7071 9457 681          72070 Tuebingen, Germany
                     https://atos.net/de/deutschland/sc
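The cross-check described above (230 killed + 164 remaining = 394, with 93 client IPs in the remaining list that never appear in the "Sending RST" lines) can be reproduced from the logs with a few lines of scripting. A minimal sketch with made-up log excerpts; the exact "Sending a TCP RST" and "Remaining connections" wording is assumed here, so adjust the regexes to the actual log text:

```python
import re

# Hypothetical excerpts; real lines come from /var/log/messages.
log = """\
Sending a TCP RST for connection 10.0.1.5:712 10.0.0.1:2049
Sending a TCP RST for connection 10.0.1.6:715 10.0.0.1:2049
Killed 2/3 TCP connections to released IP 10.0.0.1
Remaining connections:
  10.0.1.7:719 10.0.0.1:2049
"""

# Client IPs for which a RST was sent
rst_clients = set(re.findall(r"Sending a TCP RST for connection (\S+?):\d+", log))
# Client IPs in the "Remaining connections" list (indented ip:port pairs)
remaining = set(re.findall(r"^\s+(\S+?):\d+ \S+$", log, re.M))

# Remaining clients that were never sent a RST at all:
unexpected = remaining - rst_clients
print(sorted(unexpected))  # → ['10.0.1.7']
```

Running this over the real releaseip log window would show directly whether the 93 unexpected IPs are new connections (pointing at block_ip) or missed ones (pointing at ctdb_killtcp).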
On Thu, 16 Feb 2023 17:30:37 +0000, Ulrich Sibiller <ulrich.sibiller at atos.net> wrote:

> Martin Schwenke schrieb am 15.02.2023 23:23:
> > OK, this part looks kind-of good. It would be interesting to know how
> > long the entire failover process is taking.
>
> What exactly would you define as the begin and end of the failover?

From "Takeover run starting" to "Takeover run completed successfully" in the logs.

> >> The log (which I unfortunately do not have anymore) showed 405 of
>
> Hmm, the node where the above output was seen had released and taken
> that very IP shortly before, and then released it again. The last
> point in time is the one where I saw the remaining connections:
>
> Feb 13 12:08:11 serverpnn1 ctdbd[85617]: ../../ctdb/server/eventscript.c:655 Running event releaseip with arguments team1 x.x.253.252 24
> Feb 13 12:16:26 serverpnn1 ctdbd[85617]: ../../ctdb/server/eventscript.c:655 Running event takeip with arguments team1 x.x.253.252 24
> Feb 13 12:27:50 serverpnn1 ctdbd[85617]: ../../ctdb/server/eventscript.c:655 Running event releaseip with arguments team1 x.x.253.252 24

OK, it could be a failback problem.

> No, no SMB, only NFSv3. Port 599 is lockd:
> [...]

OK.

> > This means the number of connections will not match the number of "Sending
> > a TCP RST ..." messages.
>
> Ok, but without SMB it should be doubled, right?

You would think so.

> > The one thing I would ask you to check for this part is whether the
> > remaining connections are among those for which a TCP RST has been sent.
>
> I saw 405 "Sending RST" lines and then:
>
> "Killed 230/394 TCP connections to released IP x.x.253.252
> Remaining connections:"
>
> followed by 164 remaining connections. Those numbers match
> (230+164=394). But the remaining list does indeed contain 93 IPs that
> are not in the sending list.

This means the problem must be one of two things:

1. The left-over connections are new connections and iptables is not
   blocking incoming connections.
2. ctdb_killtcp is not terminating all connections.

> > * If not, then the block_ip part isn't working (and you would expect to
> >   see errors from iptables) because they are new connections.
>
> I have not seen any errors from iptables.

So, I think it is (2). :-)

The problem seems to be that ctdb_killtcp is not capturing a response
to its TCP tickle ACK.

ctdb_killtcp needs to do things specially because it is trying to
terminate TCP connections on the releasing side, when the IP address is
still hosted and the connections actually exist. This is different to
the takeover side, where only the TCP tickle ACK is sent, and the
client and server TCP stacks conspire to RST the connection.

ctdb_killtcp works like this:

1. Start capturing packets
2. Send a TCP tickle ACK and remember that it was sent
3. When a captured packet comes in, ignore it if it is a packet sent
   by ctdb_killtcp (window size 1234), and some other cases. Look it
   up and see if it is a response to a sent TCP tickle ACK. If so,
   extract the sequence number and send a RST.

So, if ctdb_killtcp is unable to capture the response to the TCP
tickle ACK then it can't send a RST. This could happen if there are
too many packets, so packets might be dropped.

By default, the capture is done using a raw socket that tries to
capture *everything*, which could be a lot of traffic. For Samba
<4.18, this was the only option on Linux. For Samba >=4.18, I have
improved the pcap-based packet capture and extended it to understand
the generic DLT_LINUX_SLL and DLT_LINUX_SLL2 frame formats, mostly to
support capture on InfiniBand networks.

It might be worth building a static (to avoid shared library issues)
ctdb_killtcp binary from 4.18.x source (after configuring with
--enable-pcap) and installing that in place of the existing
ctdb_killtcp binary.
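The tickle/RST sequence described above can be illustrated with a toy header builder. This is not CTDB's code: it skips the IP layer, checksums, and raw-socket capture entirely, and only shows the role of the window-size-1234 marker and of the sequence number extracted from the captured response:

```python
import struct

TICKLE_WINDOW = 1234  # marker used to recognise packets sent by the killer itself
ACK, RST = 0x10, 0x04  # TCP flag bits

def tcp_header(sport, dport, seq, ack_seq, flags, window):
    """Build a bare 20-byte TCP header (checksum left as 0 for brevity)."""
    offset_flags = (5 << 12) | flags  # data offset = 5 words, plus flag bits
    return struct.pack("!HHIIHHHH", sport, dport, seq, ack_seq,
                       offset_flags, window, 0, 0)

# Step 2: the tickle ACK -- seq/ack 0, window 1234, sent to the client.
tickle = tcp_header(2049, 51234, 0, 0, ACK, TICKLE_WINDOW)

# Step 3: the client answers with an ACK carrying real sequence numbers;
# the killer extracts them from the captured response and sends a RST
# that the peer's TCP stack will accept.
def rst_for(response_hdr):
    sport, dport, seq, ack_seq, *_ = struct.unpack("!HHIIHHHH", response_hdr)
    # Reply goes back the other way, using the sequence number the peer expects
    return tcp_header(dport, sport, ack_seq, 0, RST, TICKLE_WINDOW)
```

The window value 1234 is what makes the "Ignoring packet: ... window=1234" debug lines above identifiable as the killer's own traffic; if the response to the tickle is never captured, step 3 never happens and the connection survives.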
The version in 4.18 also has slightly better debugging for unknown packets.

> > * If there has been an attempt to RST the connections then I'm
> >   interested to know if the public address is on an ordinary ethernet
> >   interface.
>
> We are using two teaming interfaces (one for each LAN). Each
> team interface has 2x10Gbit LACP.

I'm not sure how this would affect the packet capture. I would guess
that the frame format would be one of the ones that is now supported.

> > ctdb_killtcp has been very well tested. In the function that calls
> > ctdb_killtcp, you could add CTDB_DEBUGLEVEL=DEBUG to the ctdb_killtcp
> > call, like this:
> >
> >   CTDB_DEBUGLEVEL=DEBUG "${CTDB_HELPER_BINDIR}/ctdb_killtcp" "$_iface" || {
>
> I have done that already. It is not really helpful as it floods the
> log with messages about packets it is ignoring. Within one minute of
> log I got >7000 of them, e.g.
>
> Feb 13 17:36:08 <server> ctdb-eventd[29607]: 10.interface.debug: Ignoring packet: z.z.139.15:2049 z.z.155.1:720, seq=0, ack_seq=0, rst=0, window=1234
> Feb 13 17:36:08 <server> ctdb-eventd[29607]: 10.interface.debug: reset_connections_capture_tcp_handler: Ignoring packet for unknown connection: z.z.139.34:4739 z.z.139.14:41437

Fair enough... but a whole minute is a long time to be running
ctdb_killtcp during failover...

> Has there been tuning of the TCP networking stack in Linux, maybe?

I can't think what would be relevant...

> We do not have any static mounts, everything is using autofs. We are
> using a dismount_interval of 300. There are between 1200 and 1500
> clients mounting homes and projects from this cluster, so there are
> quite a few mounts and umounts happening. In the last 7 hours we saw
> ~800-900 mounts (and about the same number of umounts) per server
> node.

OK, this is still small. Only a couple of connections per minute.
So, I could imagine being unable to reset 1 or 2 connections, but not
the larger number you're seeing.

> For the point in time noted above I see 308 "sending tcp tickle ack
> for 2049-><IP:Port>" messages. All messages are there 3 times.
>
> But all of them are only for port 2049, none for 599 (lockd). On the
> live system I currently see 120 connections to port 599, but ctdb
> gettickles does not show a single one with this port! Is this
> expected?

In preparation for takeover, for NFS we only remember connections to
port 2049. The other ports, such as lockd's, would be handled on
reboot (rather than crash) by the releasing side... but I doubt anyone
has thought about this! Perhaps we have seen that part handled by UDP
in the past? Please feel free to open a bug to remind me to look at
this.

> > It appears that your clients may be on different subnets to the
> > servers. If the routers filter/drop the TCP tickle ACKs then the
> > clients will not know to establish a new connection and will still
> > attempt to send packets on the old connection. In this case, they
> > should get a RST from the takeover node's TCP stack... so, this is
> > strange.
>
> Yes, we have a flat /16 network on team0 and many /24 subnets on
> team1. It _feels_ like the clients on the team1 network are affected
> more often, but OTOH there are _much_ more clients in that network...

OK, let's keep that in mind. :-)

> Speaking of the wiki: https://wiki.samba.org/index.php/Setting_up_CTDB_for_Clustered_NFS states:
>
> "Never mount the same NFS share on a client from two different nodes
> in the cluster at the same time. The client-side caching in NFS is
> very fragile and assumes that an object can only be accessed through
> one single path at a time."
>
> It is unclear what "share" means in this context. We are exporting
> the whole fs (/filesys) but mounting only subdirs (e.g.
> /filesys/home/userA and /filesys/home/userB).
> So we are unsure whether this warning applies here or not.

I suspect this could be improved to say "Never mount the same NFS
directory ...". I think you are OK with mounting the subdirectories
for each user.

peace & happiness,
martin
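Regarding the port 599 vs. 2049 observation earlier in the thread: a quick way to count live per-port connections, for comparison against what `ctdb gettickles` reports, is to parse `ss -tn` output. A small sketch over assumed sample output; the column layout is the usual iproute2 format but should be checked against the local version:

```python
from collections import Counter

# Hypothetical `ss -tn` output captured on a server node.
ss_output = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.1:2049       10.0.1.5:712
ESTAB  0      0      10.0.0.1:2049       10.0.1.6:715
ESTAB  0      0      10.0.0.1:599        10.0.1.7:719
"""

per_port = Counter()
for line in ss_output.splitlines()[1:]:  # skip the header row
    local = line.split()[3]              # "ip:port" of the local side
    per_port[local.rsplit(":", 1)[1]] += 1

print(per_port["2049"], per_port["599"])  # → 2 1
```

Run against real output, any nonzero count for 599 with no matching gettickles entries would confirm that lockd connections are not being remembered for takeover, as discussed above.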