Hi all,

I have a midsize AD domain with some 50k users but only 100 workstations joined.

Sometimes I find the server CPU pegged at 100%. To bring it back down and restore smooth performance, I delete the cache:

    systemctl stop samba
    net cache flush
    systemctl start samba

First of all, is it necessary to stop Samba in order to flush the cache?

Even though the cache flush restores performance, I am clueless about the root cause of the problem. Before flushing, gencache.tdb had 15k entries. Is that large? Do you think it is worth the time to investigate why it grows so much, or is this just normal?

Thank you,

franz
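For anyone wanting to track the entry count over time rather than check it once, here is a minimal sketch. It assumes the tdbdump tool from the tdb utilities is installed and that its output has one line starting with "key(" per record; both the function names and the path handling are illustrative, and the location of gencache.tdb differs between installations, so it is left as a parameter.

```python
import subprocess


def count_tdbdump_records(dump_text):
    """Count records in tdbdump-style output.

    Assumption: each record contributes exactly one line that starts
    with "key(", as in:  key(5) = "hello"
    """
    return sum(1 for line in dump_text.splitlines() if line.startswith("key("))


def count_records(tdb_path):
    """Run tdbdump on the given file and count its records.

    tdb_path is supplied by the caller; no default location is assumed.
    """
    result = subprocess.run(
        ["tdbdump", tdb_path], capture_output=True, text=True, check=True
    )
    return count_tdbdump_records(result.stdout)
```

Logging the result of count_records() at intervals would show whether the count climbs steadily or jumps at particular times, which may help correlate growth with the CPU spikes.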
Hi,

It would be handy to know your OS and Samba version. A copy of smb.conf would also be very handy.

Greetz,

Louis
For what it’s worth, you are not alone in seeing problems like this with Samba and gencache.

Our site has some 110K users (a university with staff and students, including former ones) and currently around 2000 active SMB clients connecting to 5 different Samba servers (around 400-500 clients per server). When we previously just let things “run”, gencache.tdb would grow forever and authentication (login) performance would start to deteriorate after a little while; logins would take more than 10 seconds. So we now delete it (along with locks/locking.tdb, which also tends to grow forever) and restart our Samba processes every morning at 7 am, which gives us much more stable performance.

- Servers with 256GB of RAM, 10Gbps Ethernet interfaces and around 110TB of disk per server.
- FreeBSD 11.2-p2
- Samba 4.7.6 with some local patches to allow (much) bigger socket listen queues, in order to handle many clients connecting at the same time.

(We are trying to upgrade to a more recent Samba, but 4.7.8 and 4.7.9 gave us horrible authentication performance every 10th hour, when the servers basically refused client logins for about 2 hours, so we had to roll back to 4.7.6.)

- Peter
On Wed, Aug 29, 2018 at 03:36:23PM +0200, Peter Eriksson via samba wrote:
> When we previously just let things “run”, gencache.tdb would grow
> forever and authentication (login) performance would start to
> deteriorate after a little while; logins would take more than 10
> seconds. So we now delete it (along with locks/locking.tdb, which
> also tends to grow forever) and restart our Samba processes every
> morning at 7 am, which gives us much more stable performance.

Hmmm. Can you save off one of the large gencache.tdb files and work out whether this is a fragmentation issue? It sounds like it.
Andrew Bartlett
2018-Sep-04 02:15 UTC
[Samba] authentication performance with 4.7.6 -> 4.7.8 upgrade (was: Re: gencache.tdb size and cache flush)
On Wed, 2018-08-29 at 15:36 +0200, Peter Eriksson via samba wrote:
> (We are trying to upgrade to a more recent Samba, but 4.7.8 and
> 4.7.9 gave us horrible authentication performance every 10th hour,
> when the servers basically refused client logins for about 2 hours,
> so we had to roll back to 4.7.6.)

I realise testing in production is difficult, but is there any chance you can pin down where between 4.7.6 and 4.7.8 it broke? There are not that many changes between them, and while some appear to be authentication-related, nothing stands out.

Also, do you run Samba as an AD DC, or are these file servers in a Windows domain?

Thanks,

Andrew Bartlett

--
Andrew Bartlett                              https://samba.org/~abartlet/
Authentication Developer, Samba Team         https://samba.org
Samba Development and Support, Catalyst IT   https://catalyst.net.nz/services/samba
Hi!

Technical description below, but the exec summary is: Yes, we have a performance problem with gencache.

On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
> Sometimes I find the server CPU pegged at 100%.

Can you find out where *exactly* that 100% is spent? gstack on the spinning process with debug symbols would be very helpful here.

> First of all, is it necessary to stop Samba in order to flush the cache?

No.

> Before flushing, gencache.tdb had 15k entries. Is that large? Do you
> think it is worth the time to investigate why it grows so much, or
> is this just normal?

15k entries is not really silly large. I've seen much larger ones. What kind of OS do you have? The question is -- does it have the ability to use robust mutexes (as FreeBSD 11 and recent Linux do)?

The other thing is -- we don't have code to do cache pruning at this point. gencache used to be simple, holding just a few types of entries. It is a very important performance improvement for many workloads, but as it grew more and more types of entries (which IMHO is a good thing), we need to trim the expired entries. However, traversing the whole gencache periodically is expensive too. We don't have good, low-cost, background-style traversal routines, which are needed for tdb files.

This digresses into a technical discussion: I believe we need to expose something like a "quickly traverse all records for one hash chain" call, holding the hash chain lock just for the duration of that traverse. This would allow cheap, background-style gencache pruning.

Also, gencache has another problem: the stabilize calls that happen frequently.

gencache holds a few entries that *need* to survive a reboot of a box. Mainly the saf_join cache entries come to mind. We need to separate those out into a persistent gencache. For the rest -- it's not vital for the system to keep them around, however performance would suffer significantly if we, for example, lost all idmap cache entries. This means we need a middle ground between CLEAR_IF_FIRST and persistent tdb files.

What we cannot do is a full tdb_check upon every daemon startup. We had that, and it sent winbind into minutes of just reading a corrupt 4GB tdb file. We need to make live tdb 100% robust against accidental (or even malicious) corruption. 3 aspects here:

* The recent hardening patches in tdb make me confident we improved a lot here.
* We need bullet-proof circular chain detection. We need to put the same logic that tdb_check has into the normal flow of tdb_find().
* We need record crc checks. Easiest would be in gencache "user space", but we could benefit at the tdb level too.

Volker

--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de

Meet us at Storage Developer Conference (SDC)
Santa Clara, CA USA, September 24th-27th 2018
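The "traverse one hash chain at a time" pruning idea in Volker's message can be illustrated with a toy model. This is only a sketch in Python and not the tdb or gencache API: ToyChainedStore, prune_chain and background_prune are hypothetical names, the store lives in memory, keys are assumed to be bytes, and each record is assumed to carry an absolute expiry time. What it shows is the locking pattern: a pruning pass takes one chain lock, drops that chain's expired records, releases the lock, and only then moves on, so readers and writers of the other chains are never blocked by the pass.

```python
import threading
import time
import zlib


class ToyChainedStore:
    """Toy in-memory stand-in for a hash-chained key/value file (NOT tdb)."""

    def __init__(self, n_chains=8):
        # One dict and one lock per hash chain: key -> (expiry, value).
        self.chains = [dict() for _ in range(n_chains)]
        self.locks = [threading.Lock() for _ in range(n_chains)]

    def _chain_of(self, key):
        # Keys are bytes; crc32 is just a convenient stable hash here.
        return zlib.crc32(key) % len(self.chains)

    def set(self, key, value, expiry):
        i = self._chain_of(key)
        with self.locks[i]:
            self.chains[i][key] = (expiry, value)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        i = self._chain_of(key)
        with self.locks[i]:
            entry = self.chains[i].get(key)
        if entry is None or entry[0] <= now:
            return None  # missing or expired
        return entry[1]

    def prune_chain(self, i, now):
        """Drop expired records from chain i, holding only that chain's lock."""
        with self.locks[i]:
            expired = [k for k, (expiry, _) in self.chains[i].items()
                       if expiry <= now]
            for k in expired:
                del self.chains[i][k]
        return len(expired)


def background_prune(store, now=None, pause=0.0):
    """Visit the chains one by one; all other chains stay unlocked meanwhile."""
    now = time.time() if now is None else now
    removed = 0
    for i in range(len(store.chains)):
        removed += store.prune_chain(i, now)
        if pause:
            time.sleep(pause)  # spread the work out between chains
    return removed
```

In this model the cost of a pruning step is bounded by the length of a single chain, which is the property that would make it suitable for running in the background, in contrast to a full traverse of the whole file.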
On 04/09/18 06:00, Volker Lendecke wrote:
> Can you find out where *exactly* that 100% is spent? gstack on the
> spinning process with debug symbols would be very helpful here.

I'm not sure how to do that. Would something like this do?

https://gist.github.com/francescm/8e396f5470da8df8451be13777e18810

>> First of all, is it necessary to stop Samba in order to flush the cache?
>
> No.

Thank you.

> 15k entries is not really silly large. I've seen much larger ones.
> What kind of OS do you have? The question is -- does it have the
> ability to use robust mutexes (as FreeBSD 11 and recent Linux do)?

Debian GNU/Linux 9 (stretch)

Linux addc 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u3 (2018-08-19) x86_64 GNU/Linux

I absolutely agree on the need to investigate further. The gencache trail was just a suspicion. What I know for sure is that I get high load spikes from a PID labelled "samba: task[dcesrv]". The stop/delete cache/start procedure actually works, but I increasingly believe the "delete cache" part is simply useless.

Thank you,

franz
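One way to collect the stack traces Volker asked for is to take a few backtraces of the spinning PID with gdb in batch mode, which is roughly what gstack does on systems that ship it. The sketch below is an assumption about a workable procedure, not something from the thread: backtrace_command and sample_backtraces are hypothetical helper names, gdb must be installed, the caller needs permission to attach to the process (typically root), and the traces are only readable if Samba debug symbols are present.

```python
import shutil
import subprocess


def backtrace_command(pid):
    """Build a gdb invocation that attaches to pid, prints a backtrace
    of every thread, and exits (batch mode)."""
    pid = int(pid)
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def sample_backtraces(pid, samples=3, timeout=60):
    """Run gdb several times against the process and return the raw output
    of each run, so recurring frames can be compared across samples."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb not found in PATH")
    traces = []
    for _ in range(samples):
        result = subprocess.run(
            backtrace_command(pid),
            capture_output=True, text=True, timeout=timeout,
        )
        traces.append(result.stdout)
    return traces
```

Taking several samples while the "samba: task[dcesrv]" process is at 100% and looking for the frames that appear in all of them should indicate where the time is going, and whether gencache is involved at all.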