We have a setup with a bunch of Dell servers acting as fileserver for the university here, running FreeBSD with Samba (and NFS & SFTP) to provide service for staff & students. This works in general really (very) well. We have around 4000-5000 SMB connections spread over 6 servers during “rush hours” (8am-5pm). However, due to some strange (I bet it’s due to the service ticket timeout) reason exactly 10 hours after we restart the Samba daemons (and every 10:th hour after that) we are seeing problems with clients connecting to the Samba servers (connections with valid passwords and/or valid Kerberos ticket are getting a “timeout” - users with invalid passwords/invalid Kerberos ticket gets (correctly) denied service though! At Samba 4.7.6 we just see a short “hickup” at the 10 hour mark but with all the later versions service grinds to a halt. I’ve tried to read the diffs to see if I can find the change that causes this but so far no luck. So we decided to try Samba 4.9.3 a little bit more and it has been running well on our test systems (with no or very few concurrent users). Today we tried to use it one one of the production servers and … it failed at 17:00 again (we restart the Samba Damons at 07:00 every day in order to make sure the problem doesn’t occur during prime time (we originally restarted them at 04:00 but that caused the trouble to occur at 2pm - not good…) This production server had around 400 concurrent users when at the time it failed this time. In order to get it to start servicing users again I tried to kill the “Winbindd” daemons using “pkill winbindd” and all but one of the daemons died:> root at filur05# ps auxww | egrep winbind > root 1550 0.0 0.0 271956 127236 - S 07:12 0:23.66 winbindd: domain child [AD] (winbindd) > root 29659 0.0 0.0 237240 122108 - I 10:04 0:00.02 winbindd: domain child [FILUR05] (winbindd) > root 99774 0.0 0.1 267384 143028 - Ss 07:00 5:23.92 /liu/sbin/winbindd --daemon --configfile=/liu/etc/samba/smb.conf > root 99775 0.0 1.8 4913320 4748952 - S 07:00 68:08.91 winbindd: domain child [AD] (winbindd) > root 99778 0.0 0.0 197228 89496 - S 07:00 0:01.07 winbindd: domain child [FILUR05] (winbindd) > root 99779 0.0 0.2 679988 533092 - S 07:00 14:16.53 winbindd: domain child [AD] (winbindd) > root 99780 0.0 0.0 43336 30064 - I 07:00 0:04.61 winbindd: idmap child (winbindd) > root 99781 0.0 0.0 155968 36980 - I 07:00 0:00.03 winbindd: domain child [BUILTIN] (winbindd) > root 678 0.0 0.0 6740 2716 1 S+ 17:01 0:00.00 egrep winbind > root at filur05# pkill winbindd > root at filur05# ps auxww | egrep winbind > root 99780 0.0 0.0 43336 30064 - S 07:00 0:04.61 winbindd: idmap child (winbindd) > root 730 0.0 0.0 6740 2716 1 S+ 17:02 0:00.00 egrep winbind > ... > root at filur05# !ps > ps auxww | egrep winbind > root 99780 0.0 0.0 43336 30064 - S 07:00 0:04.61 winbindd: idmap child (winbindd) > root 738 0.0 0.0 6740 2716 1 S+ 17:02 0:00.00 egrep winbind > root at filur05# kill -9 99780 > root at filur05# /liu/sbin/winbindd --daemon > root at filur05# ps auxww | egrep winbind > root 745 1.9 0.0 220752 113184 - Ss 17:02 0:05.14 /liu/sbin/winbindd --daemon > root 746 0.0 0.1 255160 175056 - S 17:02 0:02.41 winbindd: domain child [AD] (winbindd) > root 748 0.0 0.0 161024 61276 - S 17:02 0:00.17 winbindd: idmap child (winbindd) > root 750 0.0 0.0 198032 101088 - S 17:02 0:00.01 winbindd: domain child [FILUR05] (winbindd) > root 813 0.0 0.0 200912 75004 - S 17:03 0:00.18 winbindd: domain child [AD] (winbindd) > root 1530 0.0 0.0 6740 2716 1 S+ 17:07 0:00.00 egrep winbindAnd then it started servicing users again. On one of our test servers I tried (at the same time) to run 55 concurrent smbclient connections to see if that would cause the same failure, but it worked just fine. A “pkill winbindd” on that server killed all the Winbindd processes - no dangling “idmap winbindd" So right now my main suspect is that something goes wrong in the “winbindd: idmap child (winbindd)” subprocess… Unfortunately I didn’t get a detailed stack trace of what it was doing at the time but I’ll try to get one tomorrow. Any ideas? Files open at the time of non-responsiveness:> 17:01/fstat.log:root winbindd 99775 text /liu 38725 -rwxr-xr-x 4711000 r > 17:01/fstat.log:root winbindd 99775 wd / 4 drwxr-xr-x 29 r > 17:01/fstat.log:root winbindd 99775 root / 4 drwxr-xr-x 29 r > 17:01/fstat.log:root winbindd 99775 0 /dev 26 crw-rw-rw- null rw > 17:01/fstat.log:root winbindd 99775 1 /dev 26 crw-rw-rw- null rw > 17:01/fstat.log:root winbindd 99775 2 /liu/var 31071 -rw-r--r-- 3583449 w > 17:01/fstat.log:root winbindd 99775 3 /dev 8 crw-r--r-- random r > 17:01/fstat.log:root winbindd 99775 4 /liu/var 7929 -rw-r--r-- 21 rw > 17:01/fstat.log:root winbindd 99775 5* pipe fffff813a2cf4be0 <-> fffff813a2cf4d48 0 rw > 17:01/fstat.log:root winbindd 99775 6 /dev 8 crw-r--r-- random r > 17:01/fstat.log:root winbindd 99775 7* local dgram fffff801e9e884b0 > 17:01/fstat.log:root winbindd 99775 8 /liu/var 18 -rw-r----- 12288 rw > 17:01/fstat.log:root winbindd 99775 9 /liu 1117 -rw------- 561152 rw > 17:01/fstat.log:root winbindd 99775 10 /liu 1122 -rw------- 28672 rw > 17:01/fstat.log:root winbindd 99775 11 /dev 26 crw-rw-rw- null rw > 17:01/fstat.log:root winbindd 99775 12 /dev 26 crw-rw-rw- null rw > 17:01/fstat.log:root winbindd 99775 13 /liu/var 7924 -rw-r--r-- 6 w > 17:01/fstat.log:root winbindd 99775 14* pipe fffff801cc3d6000 <-> fffff801cc3d6168 0 rw > 17:01/fstat.log:root winbindd 99775 15 /liu/var 31071 -rw-r--r-- 3583449 w > 17:01/fstat.log:root winbindd 99775 16* pipe fffff813a2cf4d48 <-> fffff813a2cf4be0 0 rw > 17:01/fstat.log:root winbindd 99775 17 /liu/var 7930 -rw-r--r-- 101470208 rw > 17:01/fstat.log:root winbindd 99775 18 /liu/var 39 -rw------- 4227072 rw > 17:01/fstat.log:root winbindd 99775 19* local dgram fffff801e9e5da50 <-> fffff801e9e65b40 > 17:01/fstat.log:root winbindd 99775 20 /liu/var 7926 -rw------- 69033984 rw > 17:01/fstat.log:root winbindd 99775 21* local stream fffff801e9e4a0f0 > 17:01/fstat.log:root winbindd 99775 22* local stream fffff8020f01fb40 > 17:01/fstat.log:root winbindd 99775 23* local stream fffff811b72d5870 <-> fffff801e9e53870 > 17:01/fstat.log:root winbindd 99775 24 /liu/var 20 -rw-r--r-- 16228352 rw > 17:01/fstat.log:root winbindd 99775 25* internet stream tcp fffff812c8690820 > 17:01/fstat.log:root winbindd 99775 26* internet stream tcp fffff8020f525410 > 17:01/fstat.log:root winbindd 99775 27 /liu/var 34 -rw------- 20480 rw > 17:01/fstat.log:root winbindd 99775 28* internet stream tcp fffff81250627410 > 17:01/fstat.log:root winbindd 99775 29* internet stream tcp fffff814f4815000Network sockets:> 17:01/netstat.log:fffff801e9e884b0 dgram 0 0 fffff812fd6fe3b0 0 0 0 /liu/etc/samba/private/msg.sock/99775Process kernel stacks for the process:> 17:01/procstat-kk-a.log:99775 102090 winbindd - pmap_remove_ptes+0x101 pmap_remove+0x35b vm_map_delete+0x14e kern_munmap+0x8d amd64_syscall+0xa38 fast_syscall_common+0x101Ps auxwww output:> 17:01/ps-auxwww.log:root 99775 100.0 1.8 4888744 4702144 - R 07:00 67:53.99 winbindd: domain child [AD] (winbindd)(At 17:02 I “kill -9” killed it) Setup: FreeBSD 11.2-RELEASE-p5 Samba 4.9.3 / 4.7.6 (with a local patch to allow larger socket listening queues) (ZFS storage) AD-joined with Microsoft Windows AD servers. Some 100k users and groups... smb.conf:> [global] > private directory = /liu/etc/samba/private > lock directory = /liu/var/samba/locks > cache directory = /liu/var/samba/cache > state directory = /liu/var/samba/state > ncalrpc dir = /liu/var/samba/ncalrpc > > ;; Network interfaces > bind interfaces only = true > interfaces = lagg0 > > ;; Server names > server string = Filur05 File Server > netbios name = FILUR05 > > ;; Security type > security = ADS > realm = AD.LIU.SE > workgroup = AD > > ;; ID Mappings > idmap config * : backend = tdb > idmap config * : range = 2000000001-2100000000 > idmap config AD : backend = ad > idmap config AD : range = 1-2000000000 > idmap config AD : schema_mode = rfc2307 > idmap config AD : unix_primary_group = yes > > > ;; Enable shared robust mutexes > dbwrap_tdb_mutexes:* = yes > > ;; VFS objects to enable > vfs objects = shadow_copy2 zfsacl full_audit > full_audit:prefix = %u|%I > full_audit:success = none > full_audit:failure = none > full_audit:facility = LOCAL7 > full_audit:priority = notice > > ;; Snapshots/Previous Versions > shadow:snapdir = .zfs/snapshot > shadow:format = auto-%Y-%m-%d.%H:%M:%S > shadow:sort = desc > shadow:localtime = yes > shadow:snapdirseverywhere = yes > > ;; Handle ZFS ACLs > zfsacl:acesort = dontcare > > nfs4:mode = special > nfs4:acedup = merge > > kerberos method = system keytab > kernel change notify = false > socket options = TCP_NODELAY SO_RCVBUF=65536 SO_SNDBUF=65536 > client ldap sasl wrapping = seal > socket listen backlog = 1024 > invalid users = root > min protocol = SMB2 > smb encrypt = auto > time server = true > hostname lookups = false > follow symlinks = true > wide links = false > unix extensions = false > winbind nested groups = false > winbind enum users = false > winbind enum groups = false > winbind use default domain = yes > winbind normalize names = yes > winbind max clients = 1000 > winbind max domain connections = 10 > winbind nss info = template > template homedir = /home/%U > template shell = /bin/sh > nt acl support = yes > inherit acls = yes > inherit permissions = yes > store dos attributes = yes > ea support = false > map hidden = false > map system = false > map archive = false > map read only = permissions > case sensitive = auto > logging = syslog:1 file:2 > log level = 1 auth:2 winbind:2 acls:0 > load printers = false > printing = bsd > printcap name = /dev/null > disable spoolss = true > veto files = /.zfs/ > hide files = /~$*/ > delete veto files = yes> [homes] > browseable = false > printable = false > public = false > writeable = true> [students] > copy = homes > comment = Student Home Directories > path = /export/students > inherit owner = yes > > [staff] > copy = homes > comment = Staff Home Directories > path = /export/staff > inherit owner = yes- Peter
On Mon, Dec 03, 2018 at 05:53:34PM +0100, Peter Eriksson via samba wrote:>However, due to some strange (I bet it’s due to the service ticket timeout) >reason exactly 10 hours after we restart the Samba daemons (and every 10:th >hour after that) we are seeing problems with clients connecting to the Samba >servers (connections with valid passwords and/or valid Kerberos ticket are >getting a “timeout” - users with invalid passwords/invalid Kerberos ticket gets >(correctly) denied service though!if the connections uses SMB signing, this might be this bug https://bugzilla.samba.org/show_bug.cgi?id=13661 Fix will ship with next release. -slow -- Ralph Boehme, Samba Team https://samba.org/ Samba Developer, SerNet GmbH https://sernet.de/en/samba/ GPG-Fingerprint FAE2C6088A24252051C559E4AA1E9B7126399E46
On Mon, 3 Dec 2018 17:53:34 +0100 Peter Eriksson via samba <samba at lists.samba.org> wrote:> We have a setup with a bunch of Dell servers acting as fileserver for > the university here, running FreeBSD with Samba (and NFS & SFTP) to > provide service for staff & students. > > This works in general really (very) well. We have around 4000-5000 > SMB connections spread over 6 servers during “rush hours” (8am-5pm). > > However, due to some strange (I bet it’s due to the service ticket > timeout) reason exactly 10 hours after we restart the Samba daemons > (and every 10:th hour after that) we are seeing problems with clients > connecting to the Samba servers (connections with valid passwords > and/or valid Kerberos ticket are getting a “timeout” - users with > invalid passwords/invalid Kerberos ticket gets (correctly) denied > service though! > > smb.conf: > > > [global] > > private directory = /liu/etc/samba/private > > lock directory = /liu/var/samba/locks > > cache directory = /liu/var/samba/cache > > state directory = /liu/var/samba/state > > ncalrpc dir = /liu/var/samba/ncalrpc > > > > ;; Network interfaces > > bind interfaces only = true > > interfaces = lagg0 > > > > ;; Server names > > server string = Filur05 File Server > > netbios name = FILUR05 > > > > ;; Security type > > security = ADS > > realm = AD.LIU.SE > > workgroup = AD > > > > ;; ID Mappings > > idmap config * : backend = tdb > > idmap config * : range = 2000000001-2100000000 > > idmap config AD : backend = ad > > idmap config AD : range = 1-2000000000 > > idmap config AD : schema_mode = rfc2307 > > idmap config AD : unix_primary_group = yes> > winbind nested groups = false > > winbind enum users = false > > winbind enum groups = false > > winbind use default domain = yes > > winbind normalize names = yes > > winbind max clients = 1000 > > winbind max domain connections = 10 > > winbind nss info = templateYou have a lot of default 'winbind' lines that you don't need, but there is one missing: winbind refresh tickets = yes Rowland
Hmm… I see that there is a patch in the bugzilla page for that bug. I guess I could try that one… :-) I notice in the patch that there is a lot of talk about SMB2 - we use SMB3 mostly now. But perhaps there is some code sharing? Or perhaps SMB2 is used when talking to the AD servers? Or the few SMB2-talking clients causes the problem for all of the other users? When testing the “timeout” issue I typically try a simple: smbclient -m SMB3 -U $USER -k //$SERVER/$SHARE - Peter> On 3 Dec 2018, at 18:36, Ralph Böhme <slow at samba.org> wrote: > > On Mon, Dec 03, 2018 at 05:53:34PM +0100, Peter Eriksson via samba wrote: >> However, due to some strange (I bet it’s due to the service ticket timeout) reason exactly 10 hours after we restart the Samba daemons (and every 10:th hour after that) we are seeing problems with clients connecting to the Samba servers (connections with valid passwords and/or valid Kerberos ticket are getting a “timeout” - users with invalid passwords/invalid Kerberos ticket gets (correctly) denied service though! > > if the connections uses SMB signing, this might be this bug > > https://bugzilla.samba.org/show_bug.cgi?id=13661 > > Fix will ship with next release. > > -slow > > -- > Ralph Boehme, Samba Team https://samba.org/ > Samba Developer, SerNet GmbH https://sernet.de/en/samba/ > GPG-Fingerprint FAE2C6088A24252051C559E4AA1E9B7126399E46