Martin Schwenke wrote on 01.03.2023 23:53:
> Not perfect, but better...
Yes, I am quite happy with ctdb_killtcp.
> For ctdb_killtcp, when it was basically rewritten, we considered adding
> options for max_attempts, but decided to see if it was foolproof. We
> could now add those options. Patches welcome too...
I'll have a look.
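If we add it, I would expect it to be just an extra option on the helper,
something like (option name invented only to sketch the idea):

    ctdb_killtcp --max-attempts=100 <interface>

with the connection list supplied the same way as today.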
> MonitorTimeoutCount defaults to 20 but can also be changed.
>
> For this, you should check the output of:
>
> ctdb event status legacy monitor
> All the scripts should be taking less than about 1s.
They normally do. But I have missed the MonitorTimeoutCount completely.
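For the record (mostly a note to myself), checking and bumping that tunable at
runtime should be as simple as:

    ctdb getvar MonitorTimeoutCount
    ctdb setvar MonitorTimeoutCount 40    # example value, not a recommendation

and of course I will look at "ctdb event status legacy monitor" on the affected
nodes first.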
> If 50.samba is taking many seconds, then this may indicate a DNS issue.
> I have some work-in-progress changes to address this, and also to
> monitor DNS (and other hosts).
Does not apply; we are not serving Samba here, only NFSv3.
> True, but for a basic understanding of what happened, it could be good
> to just use INFO level. Right now, I think you're doing developer
> level debugging. ;-)
Yes, I agree. :-)
> We are in violent agreement! :-)
:-)
> The main problem is the code in 10.interface releaseip is only run when
> the "releasing" node does not crash hard. So, we need another
> method.
>
> The port number is tricky because we need to support both kernel NFS
> and NFS Ganesha. We could add it to the call-out... but I'm not sure
> it would be easy. Probably better via a configuration variable.
For a first approach this is the way to go, but the call-out should be extended
eventually.
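Just to make sure I understood you correctly, I am picturing something along
these lines in script.options (variable name invented, purely to illustrate the
idea):

    # /etc/ctdb/script.options -- hypothetical setting, name made up
    # TCP ports whose client connections should be reset when a public IP
    # leaves a node: 2049 for kernel NFS, other/additional ports for Ganesha
    CTDB_NFS_RESET_PORTS="2049"

with the call-out later growing a way to report the right ports itself.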
> I just rebased this old 2015 branch for the first time in ~6 years:
> I think it would help.
Is this tested in any way? I don't think I can run this on my production
systems. And on test systems I do not have the load to see the problems in the
first place.
> I can't remember exactly why I abandoned it. I think it might have
> been because I was going to work on the dedicated connection tracking
> code, but that seems to have started a couple of years later. Or
> perhaps Amitay pointed out a valid flaw that I no longer remember. I
> will have to take a more careful look.
Knowing the reason might save us from walking a dead-end path...
> The 3 identical tickle ACKs are standard and should not differ between
> algorithms.
Ok, thanks.
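For anyone reading along later: the tickle list registered for a public IP can
be inspected on the node currently hosting it, e.g. (IP address is just an
example):

    ctdb gettickles 192.0.2.10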
> I think we're getting somewhere... :-)
Yes! At least the system is quite stable now that I have disabled the automatic
service restarts in 60.nfs, so there are no more unexpected takeovers (which were
quite handy for checking the kill_tcp behaviour). Unfortunately, the stability
problems of the last weeks have created quite some "excitement" amongst the
users, so I cannot simply run arbitrary takeovers at the moment. I will have to
wait for some time and maybe schedule a maintenance window.
We are currently seeing the rpc checks fail when the load is high. In that case
most of the nfsd threads are in D state, too, so we think the underlying
filesystem (GPFS) might be causing problems here. Still investigating.
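For reference, the kind of quick check I mean while it is happening (nothing
CTDB-specific, just standard tools; command lines only as an illustration):

    # does NFSv3 still answer RPC locally while the monitor event times out?
    rpcinfo -T tcp localhost nfs 3

    # list nfsd threads in uninterruptible sleep and what they are waiting on
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/ && $4 == "nfsd"'

If that picture holds up, the problem is below nfsd and we will have to chase it
on the GPFS side.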
Thank you!
Ulrich Sibiller
--
Dipl.-Inf. Ulrich Sibiller science + computing ag
System Administration Hagellocher Weg 73
Hotline +49 7071 9457 681 72070 Tuebingen, Germany
https://atos.net/de/deutschland/sc