thr3ads.net - samba - [Samba] gencache.tdb size and cache flush [Aug 2018]

Francesco Malvezzi

2018-Aug-29 08:28 UTC

[Samba] gencache.tdb size and cache flush

Hi all,

I have a midsize AD domain with some 50k users but only 100 workstations
joined.

Sometimes I find the server CPU pegged at 100%. To bring it back down
and get smooth performance again, I delete the cache:

systemctl stop samba
net cache flush
systemctl start samba

First of all, is it necessary to stop samba in order to flush the cache?

Even though the cache flush restores performance, I am clueless
about the root cause of the problem. Before flushing, gencache.tdb
had 15k entries. Is that large? Do you think it is worth the time
to investigate why it grows so much, or is this just normal?

thank you,

franz

L.P.H. van Belle

2018-Aug-29 08:34 UTC


Hai, 


It might be handy to tell us your OS and Samba version.
A copy of smb.conf would also be very handy.

Greetz, 

Louis
 


Peter Eriksson

2018-Aug-29 13:36 UTC


For what it’s worth, you are not alone in seeing problems like this with
Samba and gencache.

Our site has some 110K users (a university with staff and students, including
former ones), and currently around 2000 active SMB clients connecting to 5
different Samba servers (around 400-500 clients per server). When we previously
just let things “run”, gencache.tdb would grow forever and authentication
performance would start to deteriorate after a little while (logins would take
more than 10 seconds). So we now delete it (along with locks/locking.tdb, which
also tends to grow forever) and restart our Samba processes every morning at
7 am, which gives us much more stable performance.

- Servers with 256GB of RAM, 10Gbps ethernet interfaces and around 110TB of disk
per server.
- FreeBSD 11.2-p2
- Samba 4.7.6 with some local patches to allow (much) bigger socket listening
queues in order to handle the case of many clients connecting at the same time.

(We are trying to upgrade to a more recent Samba, but 4.7.8 and 4.7.9 gave us
horrible authentication performance every 10th hour, when the servers basically
refused client logins for about 2 hours, so we had to go back to 4.7.6
again.)

- Peter

Jeremy Allison

2018-Aug-29 17:06 UTC


On Wed, Aug 29, 2018 at 03:36:23PM +0200, Peter Eriksson via samba wrote:
> When we previously just let things “run”, gencache.tdb would grow forever
> and authentication performance would start to deteriorate after a little
> while (logins would take more than 10 seconds). So we now delete it (along
> with locks/locking.tdb, which also tends to grow forever) and restart our
> Samba processes every morning at 7 am, which gives us much more stable
> performance.

Hmmm. Can you save off one of the large
gencache.tdb files and work out whether this
is a fragmentation issue?

Sounds like it.

Andrew Bartlett

2018-Sep-04 02:15 UTC


[Samba] authentication performance with 4.7.6 -> 4.7.8 upgrade (was: Re: gencache.tdb size and cache flush)

On Wed, 2018-08-29 at 15:36 +0200, Peter Eriksson via samba wrote:
> (We are trying to upgrade to a more recent Samba, but 4.7.8 and 4.7.9 gave
> us horrible authentication performance every 10th hour, when the servers
> basically refused client logins for about 2 hours, so we had to go back to
> 4.7.6 again.)

I realise testing in production is difficult, but is there any chance
you can pin down where between 4.7.6 and 4.7.8 it broke? There are not
that many changes in between, and while some appear to be authentication
related, nothing stands out.

Also, do you run Samba as an AD DC, or are these file servers in a
Windows domain?

Thanks,

Andrew Bartlett

-- 
Andrew Bartlett
https://samba.org/~abartlet/
Authentication Developer, Samba Team         https://samba.org
Samba Development and Support, Catalyst IT   https://catalyst.net.nz/services/samba

Volker Lendecke

2018-Sep-04 04:00 UTC


Hi!

Technical description below, but the exec summary is: Yes, we have a
performance problem with gencache.

On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
> Hi all,
> 
> I have a midsize AD domain with some 50k users but only 100 workstations
> joined.
> 
> Sometimes I find server CPU throttling at 100%. In order to let it drop

Can you find out where *exactly* that 100% is spent? gstack on the
spinning process with debug symbols would be very helpful here.

> and have smooth performance I delete cache:
> 
> systemctl stop samba
> net cache flush
> systemctl start samba
> 
> First of all, is it needed a samba stop to flush the cache?

No.

> Even if cache flush does the job to restore performance, I am clueless
> about the root cause of the problem. Before flushing cache the
> gencache.tdb had 15k entries. Is it large? Do you think is it worth time
> to investigate why it grows so much or is it just normal?

15k entries is not really silly large. I've seen much larger ones.
What kind of OS do you have? The question is -- does it have the
ability to use robust mutexes? (FreeBSD 11 and recent Linux).

The other thing is -- we don't have code to do cache pruning at this
point. gencache used to be simple, holding just a few types of entries.
It is a very important performance improvement for many workloads, but
as it grew more and more types of entries (which IMHO is a good
thing), we need to trim the expired entries.

However, traversing the whole gencache periodically is expensive too.
We don't have good, low-cost, background-style traversal
routines, which are needed for tdb files.

This digresses into a technical discussion: I believe we need to
expose something like a "quickly traverse all records for one hash
chain", holding the hash chain lock just over that full traverse. This
would allow the cheap, background style gencache pruning.
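
To make the idea above concrete, here is a minimal sketch in Python of pruning expired entries one hash chain at a time, holding only that chain's lock for the duration of its traverse. It is a toy in-memory model and not the tdb or gencache API: `ChainedCache`, `prune_chain` and `prune_in_background` are hypothetical names invented for this illustration, and real tdb chains live in a shared file, not in Python dicts.

```python
import threading
import time


class ChainedCache:
    """Toy hash table with one lock per hash chain (not the real tdb API)."""

    def __init__(self, n_chains=8):
        self.chains = [dict() for _ in range(n_chains)]
        self.locks = [threading.Lock() for _ in range(n_chains)]

    def _index(self, key):
        return hash(key) % len(self.chains)

    def set(self, key, value, expires_at):
        i = self._index(key)
        with self.locks[i]:
            self.chains[i][key] = (value, expires_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        i = self._index(key)
        with self.locks[i]:
            entry = self.chains[i].get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]

    def prune_chain(self, i, now=None):
        """Drop expired records from chain i, holding only that chain's lock."""
        now = time.time() if now is None else now
        with self.locks[i]:
            chain = self.chains[i]
            expired = [k for k, (_, exp) in chain.items() if exp <= now]
            for k in expired:
                del chain[k]
        return len(expired)


def prune_in_background(cache, now=None, pause=0.0):
    """Visit one chain at a time so the other chains stay usable meanwhile."""
    removed = 0
    for i in range(len(cache.chains)):
        removed += cache.prune_chain(i, now)
        time.sleep(pause)  # yield between chains; the pause is an assumption
    return removed
```

The point of the design is that a reader or writer touching a different chain never waits for the pruner, and one touching the same chain waits only for a single short traverse.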

Also, gencache has another problem: It's the stabilize calls that
happen frequently. gencache holds a few entries that *need* to survive
a reboot of a box. Mainly the saf_join cache entries come to mind. We
need to separate those out into a persistent gencache.
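
As a rough illustration of that separation, the Python sketch below routes keys to one of two stores by prefix. Everything here is hypothetical: `SplitCache` is an invented name, the two dicts merely stand in for a persistent and a clearable tdb file, and the `SAF/JOIN/` prefix is assumed for the example only.

```python
class SplitCache:
    """Toy two-store cache: a few key prefixes persist, the rest is volatile."""

    def __init__(self, persistent_prefixes=("SAF/JOIN/",)):
        # The prefix list is an assumption made for this illustration.
        self.persistent_prefixes = tuple(persistent_prefixes)
        self.persistent = {}  # stands in for a store that survives reboots
        self.volatile = {}    # stands in for a store that may be cleared

    def _store_for(self, key):
        if key.startswith(self.persistent_prefixes):
            return self.persistent
        return self.volatile

    def set(self, key, value):
        self._store_for(key)[key] = value

    def get(self, key):
        return self._store_for(key).get(key)

    def clear_volatile(self):
        """Model a restart that wipes only the non-essential entries."""
        self.volatile.clear()
```

With such a split, only the small persistent store would need the expensive durability guarantees, while the bulk of the entries could be dropped without harm.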

For the rest -- it's not vital for the system to keep them around,
however performance would significantly suffer if we for example lost
all idmap cache entries. This means we need a middle ground between
CLEAR_IF_FIRST and persistent tdb files. What we cannot do is a full
tdb_check upon every daemon startup. We had that, and this sent
winbind into minutes of just reading a corrupt 4GB tdb file. We need
to make live tdb 100% robust against accidental (or even malicious)
corruption.

3 aspects here:

* The recent hardening patches in tdb make me confident we improved here
  a lot.

* We need bullet-proof circular chain detection. We need to put the same
  logic that tdb_check has into the normal flow of tdb_find().

* We need record crc checks. Easiest would be in gencache "user
  space", but tdb-level we could benefit too.
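
The last two points can be sketched together in Python: a lookup that refuses to loop on a circular chain, and records that carry a CRC which is verified on read. This is a simplified model and not how tdb_find() or tdb_check are implemented: the record layout, the `find`, `pack_record` and `unpack_record` helpers and the `CorruptChain` exception are all invented here, and a "seen" set is used for cycle detection purely for brevity.

```python
import zlib


class CorruptChain(Exception):
    """Raised when a chain or record fails a consistency check."""


def pack_record(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 as 4 big-endian bytes."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack_record(blob: bytes) -> bytes:
    """Verify the CRC prefix and return the payload."""
    if len(blob) < 4:
        raise CorruptChain("record too short")
    stored = int.from_bytes(blob[:4], "big")
    payload = blob[4:]
    if zlib.crc32(payload) != stored:
        raise CorruptChain("crc mismatch")
    return payload


def find(records, head, key):
    """Walk one chain, failing instead of spinning on a cycle.

    `records` maps offset -> (key, blob, next_offset); offset 0 ends a chain.
    """
    seen = set()
    off = head
    while off != 0:
        if off in seen:
            raise CorruptChain("circular chain")
        seen.add(off)
        if off not in records:
            raise CorruptChain("dangling offset")
        rec_key, blob, nxt = records[off]
        if rec_key == key:
            return unpack_record(blob)
        off = nxt
    return None
```

A corrupt chain then surfaces as an error the caller can handle, for example by discarding the cache file, instead of as a process spinning at 100% CPU.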

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de

Meet us at Storage Developer Conference (SDC)
Santa Clara, CA USA, September 24th-27th 2018

Francesco Malvezzi

2018-Sep-04 09:59 UTC


On 04/09/18 06:00, Volker Lendecke wrote:
> Hi!
> 
> Technical description below, but the exec summary is: Yes, we have a
> performance problem with gencache.
> 
> On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
>> Hi all,
>>
>> I have a midsize AD domain with some 50k users but only 100 workstations
>> joined.
>>
>> Sometimes I find server CPU throttling at 100%. In order to let it drop
> 
> Can you find out where *exactly* that 100% is spent? gstack on the
> spinning process with debug symbols would be very helpful here.

I am not sure how to do that. Would something like this be what you need?
https://gist.github.com/francescm/8e396f5470da8df8451be13777e18810

>> and have smooth performance I delete cache:
>>
>> systemctl stop samba
>> net cache flush
>> systemctl start samba
>>
>> First of all, is it needed a samba stop to flush the cache?
> 
> No.

Thank you.

>> Even if cache flush does the job to restore performance, I am clueless
>> about the root cause of the problem. Before flushing cache the
>> gencache.tdb had 15k entries. Is it large? Do you think is it worth time
>> to investigate why it grows so much or is it just normal?
> 
> 15k entries is not really silly large. I've seen much larger ones.
> What kind of OS do you have? The question is -- does it have the
> ability to use robust mutexes? (FreeBSD 11 and recent Linux).

Debian GNU/Linux 9 (stretch)
Linux addc 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u3 (2018-08-19) x86_64 GNU/Linux

I absolutely agree with the need to investigate further. The gencache
trail was just a suspicion. What I know for sure is that I see high load
spikes from a PID labelled "samba: task[dcesrv]".

The stop/delete cache/start procedure actually works, but I am more and
more inclined to believe that the "delete cache" part is simply useless.

thank you,

franz
