On 04/09/18 06:00, Volker Lendecke wrote:
> Hi!
>
> Technical description below, but the exec summary is: Yes, we have a
> performance problem with gencache.
>
> On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
>> Hi all,
>>
>> I have a midsize AD domain with some 50k users but only 100 workstations
>> joined.
>>
>> Sometimes I find server CPU throttling at 100%. In order to let it drop
>
> Can you find out where *exactly* that 100% is spent? gstack on the
> spinning process with debug symbols would be very helpful here.

I'm not sure how to do that. Would something like this do?

https://gist.github.com/francescm/8e396f5470da8df8451be13777e18810

>> and have smooth performance I delete cache:
>>
>> systemctl stop samba
>> net cache flush
>> systemctl start samba
>>
>> First of all, is it needed a samba stop to flush the cache?
>
> No.

Thank you.

>> Even if cache flush does the job to restore performance, I am clueless
>> about the root cause of the problem. Before flushing cache the
>> gencache.tdb had 15k entries. Is it large? Do you think is it worth time
>> to investigate why it grows so much or is it just normal?
>
> 15k entries is not really silly large. I've seen much larger ones.
> What kind of OS do you have? The question is -- does it have the
> ability to use robust mutexes? (FreeBSD 11 and recent Linux).

Debian GNU/Linux 9 (stretch)
Linux addc 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u3 (2018-08-19) x86_64 GNU/Linux

I absolutely agree that this needs further investigation. gencache was
only a suspect. What I know for sure is that I see high load spikes from
a process labelled "samba: task[dcesrv]".

The stop/delete cache/start procedure does work, but I am increasingly
inclined to believe that the "delete cache" part is useless.

Thank you,

franz
On Tue, Sep 04, 2018 at 11:59:04AM +0200, Francesco Malvezzi wrote:
> On 04/09/18 06:00, Volker Lendecke wrote:
> > Hi!
> >
> > Technical description below, but the exec summary is: Yes, we have a
> > performance problem with gencache.
> >
> > On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
> >> Hi all,
> >>
> >> I have a midsize AD domain with some 50k users but only 100 workstations
> >> joined.
> >>
> >> Sometimes I find server CPU throttling at 100%. In order to let it drop
> >
> > Can you find out where *exactly* that 100% is spent? gstack on the
> > spinning process with debug symbols would be very helpful here.
>
> not sure how to do it.
>
> can be like that
> https://gist.github.com/francescm/8e396f5470da8df8451be13777e18810
> ?

Yes, exactly. The relevant line is

#19 0x00007fe50c1c5a3c in dcesrv_samr_EnumDomainUsers

which means that some client is listing all users in your domain. With
50,000 users this takes a while. If the client times out and
reconnects, the requests can pile up pretty quickly.

Do you have Linux clients with winbind and "winbind enum users = yes"
in your network? That would probably do this to your DC.

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de

Meet us at Storage Developer Conference (SDC)
Santa Clara, CA USA, September 24th-27th 2018
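To follow up on Volker's question, one way to audit a client is to look for the "winbind enum users" parameter in its smb.conf. The following is only a minimal sketch in Python: the function name `winbind_enum_users` and the sample configuration are made up for illustration, and the parsing is deliberately simplified (it ignores includes and does not apply Samba's own defaults). Running `testparm -s` on the client remains the more reliable check.

```python
def winbind_enum_users(conf_text):
    """Return the value of 'winbind enum users' in smb.conf text, or None.

    Simplified parsing (an assumption of this sketch): comment lines
    starting with '#' or ';' and section headers are skipped, and the
    last uncommented occurrence wins.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;[":
            continue
        key, sep, val = line.partition("=")
        # Compare parameter names case-insensitively, collapsing extra spaces.
        if sep and " ".join(key.lower().split()) == "winbind enum users":
            value = val.strip().lower()
    return value


# Hypothetical client configuration, for illustration only.
sample = """
[global]
    workgroup = EXAMPLE
    ; winbind enum users = no
    winbind enum users = yes
"""

print(winbind_enum_users(sample))
```

A client that reports "yes" here is a candidate for the repeated EnumDomainUsers calls seen in the stack trace.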
On 04/09/18 12:42, Volker Lendecke wrote:
> On Tue, Sep 04, 2018 at 11:59:04AM +0200, Francesco Malvezzi wrote:
>> On 04/09/18 06:00, Volker Lendecke wrote:
>>> Hi!
>>>
>>> Technical description below, but the exec summary is: Yes, we have a
>>> performance problem with gencache.
>>>
>>> On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
>>>> Hi all,
>>>>
>>>> I have a midsize AD domain with some 50k users but only 100 workstations
>>>> joined.
>>>>
>>>> Sometimes I find server CPU throttling at 100%. In order to let it drop
>>>
>>> Can you find out where *exactly* that 100% is spent? gstack on the
>>> spinning process with debug symbols would be very helpful here.
>>
>> not sure how to do it.
>>
>> can be like that
>> https://gist.github.com/francescm/8e396f5470da8df8451be13777e18810
>> ?
>
> Yes, exactly. The relevant line is
>
> #19 0x00007fe50c1c5a3c in dcesrv_samr_EnumDomainUsers

Thank you for reading through all of that.

> which means that some client is listing all users in your domain. With
> 50,000 users this takes a while. If the client times out and
> reconnects, this can pretty quickly pile up.

If I simulate it by listing all users in the Active Directory Users and
Computers utility, the CPU load rises to 100%, but only briefly, because
the client disconnects at around 1000 users.

A call to ldbsearch matches your scenario better:

    time sudo ./bin/ldbsearch -H private/sam.ldb "(objectClass=user)" > /dev/null

    real 0m22,410s
    user 0m20,132s
    sys 0m2,072s

One CPU is fully loaded for about 20 seconds and then the load drops.

> Do you have Linux clients with winbind and "winbind enum users = yes"
> in your network? This would probably do that to your DC.

As far as I know, the winbindd clients in our environment do not
enumerate users unless they are misconfigured (though I can't speak for
the MacOSX clients).

Thank you,

franz
On Tue, 2018-09-04 at 12:42 +0200, Volker Lendecke via samba wrote:
> On Tue, Sep 04, 2018 at 11:59:04AM +0200, Francesco Malvezzi wrote:
> > On 04/09/18 06:00, Volker Lendecke wrote:
> > > Hi!
> > >
> > > Technical description below, but the exec summary is: Yes, we have a
> > > performance problem with gencache.
> > >
> > > On Wed, Aug 29, 2018 at 10:28:05AM +0200, Francesco Malvezzi via samba wrote:
> > > > Hi all,
> > > >
> > > > I have a midsize AD domain with some 50k users but only 100
> > > > workstations joined.
> > > >
> > > > Sometimes I find server CPU throttling at 100%. In order to let
> > > > it drop
> > >
> > > Can you find out where *exactly* that 100% is spent? gstack on the
> > > spinning process with debug symbols would be very helpful here.
> >
> > not sure how to do it.
> >
> > can be like that
> > https://gist.github.com/francescm/8e396f5470da8df8451be13777e18810
> > ?
>
> Yes, exactly. The relevant line is
>
> #19 0x00007fe50c1c5a3c in dcesrv_samr_EnumDomainUsers
>
> which means that some client is listing all users in your domain.
> With 50,000 users this takes a while. If the client times out and
> reconnects, this can pretty quickly pile up.
>
> Do you have Linux clients with winbind and "winbind enum users = yes"
> in your network? This would probably do that to your DC.

And if the client can't be fixed, the implementation in the Samba AD DC
SAMR server could certainly be made much, much more efficient.

As far as I can see, we do an objectclass=user search for every page of
54 users, which makes a lot of searches for 50,000 users!

Thanks,

Andrew Bartlett

-- 
Andrew Bartlett                       http://samba.org/~abartlet/
Authentication Developer, Samba Team  http://samba.org
Samba Developer, Catalyst IT          http://catalyst.net.nz/services/samba
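To put a number on Andrew's remark, here is a small back-of-the-envelope calculation in Python. It assumes exactly 50,000 users and one search per page of 54 users, as described in the thread; the variable names are illustrative, and the real page size may vary with the client's request.

```python
import math

total_users = 50_000  # domain size mentioned in the thread
page_size = 54        # users per page, per Andrew's reading of the SAMR server

# One objectclass=user search per page, so round up to cover the last partial page.
searches = math.ceil(total_users / page_size)

print(searches)  # 926
```

So a single full enumeration would trigger on the order of 900 searches under these assumptions, and every client retry after a timeout starts the count again.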