Martin Schwenke wrote on 01.03.2023 23:53:
> Not perfect, but better...
Yes, I am quite happy with ctdb_killtcp.
> For ctdb_killtcp, when it was basically rewritten, we considered adding
> options for max_attempts, but decided to see if it was foolproof. We
> could now add those options. Patches welcome too...
I'll have a look.
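If we add it, I would expect it to be just an extra option on the helper,
something like (option name invented only to sketch the idea):

    ctdb_killtcp --max-attempts=100 <interface>

with the connection list supplied the same way as today.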
> MonitorTimeoutCount defaults to 20 but can also be changed.
>
> For this, you should check the output of:
>
> ctdb event status legacy monitor
> All the scripts should be taking less than about 1s.
They normally do. But I have missed the MonitorTimeoutCount completely.
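For the record (mostly a note to myself), checking and bumping that tunable at
runtime should be as simple as:

    ctdb getvar MonitorTimeoutCount
    ctdb setvar MonitorTimeoutCount 40    # example value, not a recommendation

and of course I will look at "ctdb event status legacy monitor" on the affected
nodes first.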
> If 50.samba is taking many seconds, then this may indicate a DNS issue.
> I have some work-in-progress changes to address this, and also to
> monitor DNS (and other hosts).
Does not apply; we are not serving Samba here, only NFSv3.
> True, but for a basic understanding of what happened, it could be good
> to just use INFO level. Right now, I think you're doing developer
> level debugging. ;-)
Yes, I agree. :-)
> We are in violent agreement! :-)
:-)
> The main problem is the code in 10.interface releaseip is only run when
> the "releasing" node does not crash hard. So, we need another
> method.
>
> The port number is tricky because we need to support both kernel NFS
> and NFS Ganesha. We could add it to the call-out... but I'm not sure
> it would be easy. Probably better via a configuration variable.
For a first approach this is the way to go, but the call-out should be extended
eventually.
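Just to make sure I understood you correctly, I am picturing something along
these lines in script.options (variable name invented, purely to illustrate the
idea):

    # /etc/ctdb/script.options -- hypothetical setting, name made up
    # TCP ports whose client connections should be reset when a public IP
    # leaves a node: 2049 for kernel NFS, other/additional ports for Ganesha
    CTDB_NFS_RESET_PORTS="2049"

with the call-out later growing a way to report the right ports itself.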
> I just rebased this old 2015 branch for the first time in ~6 years:
> I think it would help.
Is this tested in any way? I don't think I can run this on my production
systems. And on test systems I do not have the load to see the problems in the
first place.
> I can't remember exactly why I abandoned it. I think it might have
> been because I was going to work on the dedicated connection tracking
> code, but that seems to have started a couple of years later. Or
> perhaps Amitay pointed out a valid flaw that I no longer remember. I
> will have to take a more careful look.
Knowing the reason might save us from walking a dead-end path...
> The 3 identical tickle ACKs are standard and should not differ between
> algorithms.
Ok, thanks.
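For anyone reading along later: the tickle list registered for a public IP can
be inspected on the node currently hosting it, e.g. (IP address is just an
example):

    ctdb gettickles 192.0.2.10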
> I think we're getting somewhere... :-)
Yes! At least the system is quite stable now that I have disabled the automatic
service restarts in 60.nfs, so there are no more unexpected takeovers (which were
quite handy for checking the kill_tcp behaviour). Unfortunately, the stability
problems of the last weeks have created quite some "excitement" amongst the
users, so I cannot simply run arbitrary takeovers at the moment. I will have to
wait for some time and maybe schedule a maintenance window.
We are currently seeing the rpc checks fail when the load is high. In that case
most of the nfsd threads are in D state, too, so we think the underlying
filesystem (GPFS) might be causing problems here. Still investigating.
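For reference, the kind of quick check I mean while it is happening (nothing
CTDB-specific, just standard tools; command lines only as an illustration):

    # does NFSv3 still answer RPC locally while the monitor event times out?
    rpcinfo -T tcp localhost nfs 3

    # list nfsd threads in uninterruptible sleep and what they are waiting on
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/ && $4 == "nfsd"'

If that picture holds up, the problem is below nfsd and we will have to chase it
on the GPFS side.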
Thank you!
Ulrich Sibiller
--
Dipl.-Inf. Ulrich Sibiller science + computing ag
System Administration Hagellocher Weg 73
Hotline +49 7071 9457 681 72070 Tuebingen, Germany
https://atos.net/de/deutschland/sc