Peter Eriksson
2018-Dec-09 19:20 UTC
[Samba] "wbinfo -u" considered harmful towards Winbindd...
Our setup: Windows AD realm with ~115K users (and numerous groups etc) FreeBSD servers with Samba 4.7.6 and Samba 4.9.3 (both show the same growth) We just noticed that one of the ‘winbindd’ daemons on the servers seems to be growing and growing forever. A bit of detective work pointed us at the “wbinfo -u” command being that culprit. As part of a systems monitoring script we ran that once a minute (now disabled) in order to see if all AD users were detected, but somehow that seems to fail sometime and also cause the Winbindd daemon to grow around 455MB per hour… the memory used is not a huge problem on the production servers (they have 256GB RAM) so we didn’t notice this at first (since we restart smbd&winbindd every morning at 7am) - but an old test server with much less RAM ran out of memory around 4:30am… :-) This is what the process growth look like with a cron job running “wbinfo -u” once a minute:> 10:50/ps-auxwww.log:root 7658 0.0 0.7 1999680 1861712 - S 07:00 21:54.66 winbindd: domain child [AD] (winbindd) > 11:00/ps-auxwww.log:root 7658 0.0 0.7 2077504 1943248 - I 07:00 22:48.07 winbindd: domain child [AD] (winbindd) > 11:10/ps-auxwww.log:root 7658 0.0 0.8 2157376 2022720 - S 07:00 23:41.83 winbindd: domain child [AD] (winbindd) > 11:20/ps-auxwww.log:root 7658 95.3 0.8 2218816 2063896 - R 07:00 24:26.61 winbindd: domain child [AD] (winbindd) > 11:30/ps-auxwww.log:root 7658 0.0 0.8 2243392 2107712 - I 07:00 24:39.43 winbindd: domain child [AD] (winbindd) > 11:40/ps-auxwww.log:root 7658 0.0 0.8 2321216 2185384 - S 07:00 25:35.37 winbindd: domain child [AD] (winbindd) > 11:50/ps-auxwww.log:root 7658 0.0 0.9 2452288 2317168 - S 07:00 27:21.18 winbindd: domain child [AD] (winbindd) > 12:00/ps-auxwww.log:root 7658 0.0 0.9 2534208 2399384 - S 07:00 28:15.81 winbindd: domain child [AD] (winbindd) > 12:10/ps-auxwww.log:root 7658 96.4 0.9 2587456 2433792 - R 07:00 29:52.67 winbindd: domain child [AD] (winbindd) > 12:20/ps-auxwww.log:root 7658 0.0 0.9 2609984 2474460 - S 07:00 30:07.13 winbindd: domain child [AD] (winbindd) > 12:30/ps-auxwww.log:root 7658 0.0 1.0 2736960 2672352 - S 07:00 32:03.49 winbindd: domain child [AD] (winbindd) > 12:40/ps-auxwww.log:root 7658 0.0 1.0 2736960 2672744 - S 07:00 32:03.51 winbindd: domain child [AD] (winbindd)This is a listing of the user listings from “wbinfo -u” during that same time:> -rw-r--r-- 1 root wheel 0 Dec 9 10:48 10:48/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:49 10:49/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:50 10:50/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:51 10:51/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:52 10:52/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:53 10:53/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:54 10:54/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:55 10:55/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:56 10:56/wbinfo-u.log > -rw-r--r-- 1 root wheel 0 Dec 9 10:57 10:57/wbinfo-u.log > -rw-r--r-- 1 root wheel 0 Dec 9 10:58 10:58/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 10:59 10:59/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:00 11:00/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:01 11:01/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:02 11:02/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:03 11:03/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:04 11:04/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:05 11:05/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:06 11:06/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:07 11:07/wbinfo-u.log > -rw-r--r-- 1 root wheel 0 Dec 9 11:08 11:08/wbinfo-u.log > -rw-r--r-- 1 root wheel 0 Dec 9 11:09 11:09/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:10 11:10/wbinfo-u.log > -rw-r--r-- 1 root wheel 1062356 Dec 9 11:11 11:11/wbinfo-u.logAn attempt I just made at manually running the “wbinfo -u” command gave:> # ps uxwwww 7658 > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND > root 7658 0.0 1.0 2736960 2675044 - S 07:00 32:04.70 winbindd: domain child [AD] (winbindd)> # /usr/bin/time /liu/bin/wbinfo -u > /tmp/wbinfo-u.out > Error looking up domain users > 64.82 real 0.00 user 0.00 sys> # ls -l /tmp/wbinfo-u.out > -rw-r--r-- 1 root wheel 0 Dec 9 19:50 /tmp/wbinfo-u.out> # ps uxwwww 7658 > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND> root 7658 0.0 1.0 2775872 2715664 - I 07:00 33:13.21 winbindd: domain child [AD] (winbindd)I have saved a gcore-generated core dump from one of the big Winbindd:s in case it might be of use in order to pin-point the leak. Here’s a GDB stack trace (doubt that gives much info though):> Reading symbols from /liu/sbin/winbindd...done. > [New LWP 102857] > Core was generated by `winbindd: domain child [AD]'. > #0 0x0000000809b48948 in ?? () from /lib/libc.so.7 > (gdb) bt > #0 0x0000000809b48948 in ?? () from /lib/libc.so.7 > #1 0x0000000809bad3c5 in snprintf () from /lib/libc.so.7 > #2 0x0000000806852f56 in namemap_cache_set_sid2name (sid=sid at entry=0x7fffffffccf0, domain=<optimized out>, domain at entry=0x8be412930 "AD", name=name at entry=0x8becdef10 "annha585", > type=type at entry=SID_NAME_USER, timeout=1544354765) at ../source3/lib/namemap_cache.c:74 > #3 0x0000000001054ff6 in wcache_save_sid_to_name (domain=domain at entry=0x812669fe0, status=..., sid=sid at entry=0x7fffffffccf0, domain_name=0x8be412930 "AD", name=0x8becdef10 "annha585", > type=SID_NAME_USER) at ../source3/winbindd/winbindd_cache.c:969 > #4 0x0000000001059b0f in wb_cache_rids_to_names (domain=domain at entry=0x812669fe0, mem_ctx=0x81261d660, domain_sid=domain_sid at entry=0x8b3e881e0, rids=rids at entry=0x8b800dfa0, > num_rids=num_rids at entry=117864, domain_name=domain_name at entry=0x7fffffffcd90, names=0x7fffffffcd98, types=0x7fffffffcda0) at ../source3/winbindd/winbindd_cache.c:2202 > #5 0x000000000107b4d0 in _wbint_LookupRids (p=p at entry=0x7fffffffce30, r=r at entry=0x8a9444080) at ../source3/winbindd/winbindd_dual_srv.c:660 > #6 0x00000000010cf243 in api_wbint_LookupRids (p=0x7fffffffce30) at default/librpc/gen_ndr/srv_winbind.c:1353 > #7 0x000000000107a345 in winbindd_dual_ndrcmd (domain=<optimized out>, state=0x7fffffffd0c8) at ../source3/winbindd/winbindd_dual_ndr.c:369 > #8 0x0000000001076082 in child_process_request (state=0x7fffffffd0c8, child=<optimized out>) at ../source3/winbindd/winbindd_dual.c:745 > #9 child_handler (ev=<optimized out>, fde=<optimized out>, flags=<optimized out>, private_data=0x7fffffffd0c0) at ../source3/winbindd/winbindd_dual.c:1602 > #10 0x000000080984bafd in tevent_common_invoke_fd_handler () from /usr/local/lib/libtevent.so.0 > #11 0x000000080984e85b in ?? () from /usr/local/lib/libtevent.so.0 > #12 0x000000080984ae4e in _tevent_loop_once () from /usr/local/lib/libtevent.so.0 > #13 0x00000000010797de in fork_domain_child (child=0x812681760) at ../source3/winbindd/winbindd_dual.c:1817 > #14 0x0000000001079956 in wb_child_request_waited (subreq=0x0) at ../source3/winbindd/winbindd_dual.c:238 > #15 0x000000080984bf37 in tevent_common_invoke_immediate_handler () from /usr/local/lib/libtevent.so.0 > #16 0x000000080984bf94 in tevent_common_loop_immediate () from /usr/local/lib/libtevent.so.0 > #17 0x000000080984e17c in ?? () from /usr/local/lib/libtevent.so.0 > #18 0x000000080984ae4e in _tevent_loop_once () from /usr/local/lib/libtevent.so.0 > #19 0x000000000104e9c2 in main (argc=<optimized out>, argv=<optimized out>) at ../source3/winbindd/winbindd.c:1911smb.conf stuff related to Winbindd:> ; Security type > security = ADS > realm = AD.LIU.SE > workgroup = AD > > ;; ID Mappings > idmap config * : backend = tdb > idmap config * : range = 2000000001-2100000000 > idmap config AD : backend = ad > idmap config AD : range = 1-2000000000 > idmap config AD : schema_mode = rfc2307 > idmap config AD : unix_primary_group = yes> winbind nested groups = false > winbind enum users = false > winbind enum groups = false > winbind use default domain = yes > winbind normalize names = yes > winbind max clients = 1000 > winbind max domain connections = 10 > winbind nss info = templateLet me know if there is something else I can check to pinpoint this leak… - Peter
Rowland Penny
2018-Dec-09 20:06 UTC
[Samba] "wbinfo -u" considered harmful towards Winbindd...
On Sun, 9 Dec 2018 20:20:00 +0100 Peter Eriksson via samba <samba at lists.samba.org> wrote:> Our setup: > Windows AD realm with ~115K users (and numerous groups etc) > FreeBSD servers with Samba 4.7.6 and Samba 4.9.3 (both show the same > growth) > > We just noticed that one of the ‘winbindd’ daemons on the servers > seems to be growing and growing forever. A bit of detective work > pointed us at the “wbinfo -u” command being that culprit. As part of > a systems monitoring script we ran that once a minute (now disabled) > in order to see if all AD users were detected, but somehow that seems > to fail sometime and also cause the Winbindd daemon to grow around > 455MB per hour… the memory used is not a huge problem on the > production servers (they have 256GB RAM) so we didn’t notice this at > first (since we restart smbd&winbindd every morning at 7am) - but an > old test server with much less RAM ran out of memory around > 4:30am… :-) > > smb.conf stuff related to Winbindd: > > > ; Security type > > security = ADS > > realm = AD.LIU.SE > > workgroup = AD > > > > ;; ID Mappings > > idmap config * : backend = tdb > > idmap config * : range = 2000000001-2100000000Interesting range size, do you really need 99,999,999 users or groups for something where 999 is too large ?> > idmap config AD : backend = ad > > idmap config AD : range = 1-2000000000 > > idmap config AD : schema_mode = rfc2307 > > idmap config AD : unix_primary_group = yesDo your users and groups have uidNumber & gidNumber attributes ? Why have you started at 1 ?> > > > winbind nested groups = falseIt would be better if you turned the above on. Rowland
Peter Eriksson
2018-Dec-09 21:31 UTC
[Samba] "wbinfo -u" considered harmful towards Winbindd...
> On 9 Dec 2018, at 21:06, Rowland Penny via samba <samba at lists.samba.org> wrote: > > On Sun, 9 Dec 2018 20:20:00 +0100 > Peter Eriksson via samba <samba at lists.samba.org <mailto:samba at lists.samba.org>> wrote: > >> Our setup: >> Windows AD realm with ~115K users (and numerous groups etc) >> FreeBSD servers with Samba 4.7.6 and Samba 4.9.3 (both show the same >> growth) >> >> We just noticed that one of the ‘winbindd’ daemons on the servers >> seems to be growing and growing forever. A bit of detective work >> pointed us at the “wbinfo -u” command being that culprit. As part of >> a systems monitoring script we ran that once a minute (now disabled) >> in order to see if all AD users were detected, but somehow that seems >> to fail sometime and also cause the Winbindd daemon to grow around >> 455MB per hour… the memory used is not a huge problem on the >> production servers (they have 256GB RAM) so we didn’t notice this at >> first (since we restart smbd&winbindd every morning at 7am) - but an >> old test server with much less RAM ran out of memory around >> 4:30am… :-) >> >> smb.conf stuff related to Winbindd: >> >>> ; Security type >>> security = ADS >>> realm = AD.LIU.SE >>> workgroup = AD >>> >>> ;; ID Mappings >>> idmap config * : backend = tdb >>> idmap config * : range = 2000000001-2100000000 > > Interesting range size, do you really need 99,999,999 users or groups > for something where 999 is too large ?Probably not. It’s just something that got left there from the initial testing.>>> idmap config AD : backend = ad >>> idmap config AD : range = 1-2000000000 >>> idmap config AD : schema_mode = rfc2307 >>> idmap config AD : unix_primary_group = yes > > Do your users and groups have uidNumber & gidNumber attributes ?Yes. All (normal) users, around 110K of them have them set. And around 4000 of the 435404 groups have gidNumber also so.> Why have you started at 1 ?Because of old stuff in the AD. Probably could have started a little bit higher, especially nowadays when things have been cleaned up a bit.>> >> >>> winbind nested groups = false > > It would be better if you turned the above on.Perhaps. It takes “forever" with that enabled though. Same with enabling “enum” of users & groups. However - we don’t really use Winbindd for uid/gid resolving on the file servers though. Due to it being too slow (initial logins would take many seconds/minutes before Winbindd had cached it all the first time) we generate a local database (BerkeleyDB-based) based on AD data (generated on another server) that then is used via a locally development nsswitch module) - so sub-second login times instead.. Winbindd here is mostly used for SID lookups for file ACLs and all other stuff that Smbd wants to use it for. - Peter