Humrick, Matt
2008-Apr-19 01:31 UTC
[Samba] Annoying Winbind Pause While Looking Up Permissions
Our linux SLES10 fileserver is running Samba (3.0.28) and is a domain member in a Win2003 AD. Everything has been working fine for several months with the exception of an annoying pause that occurs while browsing shares. The pause will last 30 seconds and occurs roughly once every 5-10 minutes. The pause is universal, meaning, that all Windows clients trying to browse or access files will see this pause, as well as, any Linux processes (like doing 'ls -l'). Basically, any command that requires winbind to determine access privileges pauses. This only occurs for objects that have AD users or groups. Here's the system info: Samba 3.0.28 (3.0.25a and 3.0.25c also had this problem) Krb5 1.4.3 Linux 2.6.16 (x64) All of the usual tests work: wbinfo -t wbinfo -u wbinfo -g net ads join kinit etc... I finally had time to investigate this further. All I did was 'ls -l /share' where /share contained files with domain permissions. The command paused for 30 seconds before finally listing the files correctly. All Windows clients experience this same pause at the same time also. Here's the data I collected: # tcpdump -n -r tcpdump reading from file tcpdump, link-type EN10MB (Ethernet) 16:52:20.838780 IP 192.168.1.210.44072 > 192.168.1.207.389: P 117:143(26) ack 214 win 1460 <nop,nop,timestamp 1551335289 70870460> 16:52:20.838951 IP 192.168.1.207.389 > 192.168.1.210.44072: P 214:333(119) ack 143 win 65393 <nop,nop,timestamp 70870460 1551335289> 16:52:20.839894 IP 192.168.1.210.32891 > 192.168.1.207.88: v5 16:52:20.840419 IP 192.168.1.207.88 > 192.168.1.210.32891: 16:52:20.873183 IP 192.168.1.210.58864 > 192.168.1.207.445: . ack 1771 win 2800 <nop,nop,timestamp 1551335298 70870460> 16:52:20.877180 IP 192.168.1.210.44072 > 192.168.1.207.389: . ack 333 win 1460 <nop,nop,timestamp 1551335299 70870460> 16:52:30.837599 IP 192.168.1.210.32891 > 192.168.1.207.88: v5 16:52:30.838482 IP 192.168.1.207.88 > 192.168.1.210.32891: v5 16:52:40.837652 IP 192.168.1.210.32891 > 192.168.1.207.88: 16:52:40.838606 IP 192.168.1.207.88 > 192.168.1.210.32891: 16:52:50.837439 IP 192.168.1.210.44072 > 192.168.1.207.389: P 143:1410(1267) ack 333 win 1460 <nop,nop,timestamp 1551342789 70870460> 16:52:50.838112 IP 192.168.1.207.389 > 192.168.1.210.44072: P 333:383(50) ack 1410 win 64126 <nop,nop,timestamp 70870761 1551342789> 192.168.1.210 is the Linux server running Samba and 192.168.1.207 is the Windows domain controller. You'll notice that everything is humming along smoothly through time 16:52:20. Between 16:52:20 and 16:52:50 is the 30 second delay. My understanding is that port 88 is related to Kerberos. Running tcpdump while executing the same command again yields instant results with identical network traffic minus the port 88 requests. # strace -T -t ls -l /share 16:52:20 read(4, "/var/lib/samba/winbindd_privileg"..., 35) = 35 <0.000009> 16:52:20 lstat("/var/lib/samba/winbindd_privileged", {st_mode=S_IFDIR|0750, st_size=72, ...}) = 0 <0.000011> 16:52:20 lstat("/var/lib/samba/winbindd_privileged/pipe", {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 <0.000011> 16:52:20 socket(PF_FILE, SOCK_STREAM, 0) = 5 <0.000011> 16:52:20 fcntl(5, F_GETFL) = 0x2 (flags O_RDWR) <0.000006> 16:52:20 fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000007> 16:52:20 fcntl(5, F_GETFD) = 0 <0.000006> 16:52:20 fcntl(5, F_SETFD, FD_CLOEXEC) = 0 <0.000006> 16:52:20 connect(5, {sa_family=AF_FILE, path="/var/lib/samba/winbindd_privileged/pipe"}, 110) = 0 <0.000018> 16:52:20 close(4) = 0 <0.000011> 16:52:20 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout) <0.000007> 16:52:20 write(5, "(\10\0\0\4\0\0\0d\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2088) 2088 <0.000011> 16:52:20 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.997279> 16:52:25 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999895> 16:52:30 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999885> 16:52:35 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999914> 16:52:40 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999891> 16:52:45 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999916> 16:52:50 select(6, [5], NULL, NULL, {5, 0}) = 1 (in [5], left {4, 968000}) <0.033682> 16:52:50 read(5, "\354\f\0\0\2\0\0\0STARTREK-phx_api_release"..., 3240) = 3240 <0.000014> Notice the chain of select() calls between 16:52:20 and 16:52:50 that all timeout after 5 seconds for a total of 30 seconds! Finally, here's the relevant section of the winbind log: [2008/04/18 16:52:20, 4] libsmb/namequery.c:get_dc_list(1599) get_dc_list: returning 2 ip addresses in an ordered list [2008/04/18 16:52:20, 4] libsmb/namequery.c:get_dc_list(1600) get_dc_list: 192.168.1.207:389 192.0.2.224:389 [2008/04/18 16:52:20, 5] libads/ldap.c:ads_try_connect(180) ads_try_connect: sending CLDAP request to 192.168.1.207 (realm: STARTREK.LOCAL) [2008/04/18 16:52:20, 3] libads/ldap.c:ads_connect(394) Connected to LDAP server 192.168.1.207 [2008/04/18 16:52:20, 4] libads/ldap.c:ads_current_time(2414) time offset is 23 seconds [2008/04/18 16:52:20, 4] libads/sasl.c:ads_sasl_bind(521) Found SASL mechanism GSS-SPNEGO [2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213) ads_sasl_spnego_bind: got OID=1 2 840 48018 1 2 2 [2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213) ads_sasl_spnego_bind: got OID=1 2 840 113554 1 2 2 [2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213) ads_sasl_spnego_bind: got OID=1 2 840 113554 1 2 2 3 [2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213) ads_sasl_spnego_bind: got OID=1 3 6 1 4 1 311 2 2 10 [2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(222) ads_sasl_spnego_bind: got server principal name trieste$@STARTREK.LOCAL [2008/04/18 16:52:20, 3] libsmb/clikrb5.c:ads_krb5_mk_req(593) ads_krb5_mk_req: krb5_cc_get_principal failed (No credentials cache found) [2008/04/18 16:52:50, 4] libsmb/clikrb5.c:ads_krb5_mk_req(610) ads_krb5_mk_req: Advancing clock by 13 seconds to cope with clock skew [2008/04/18 16:52:50, 3] libsmb/clikrb5.c:ads_cleanup_expired_creds(528) ads_cleanup_expired_creds: Ticket in ccache[MEMORY:winbind_ccache] expiration Fri, 18 Apr 2008 17:13:03 MST [2008/04/18 16:52:50, 5] libads/ldap_utils.c:ads_do_search_retry_internal(64) Search for (objectSid=\01\05\00\00\00\00\00\05\15\00\00\00\5F\52\F4\2D\49\0D\1F\07\ 7F\6C\71\4D\C6\1F\00\00) in <dc=STARTREK,dc=LOCAL> gave 1 replies This shows that at the start of the pause I have the following error: krb5_cc_get_principal failed (No credentials cache found) This is what I think is happening: 1.) winbind is unable to find it's info in the Kerberos cache 2.) winbind queries the Windows domain controller (via port 88) 3.) the select on this socket times out for 30 seconds (even though the domain controller responds??) 4.) winbind defaults back to "search_retry_internal" and finds what it needs to continue The weird thing is it only does this once every 5-10 minutes. Every other time I don't get the "krb5_cc_get_principal failed (No credentials cache found)" error that leads to the 30 second timeout. If I run kinit for a domain user, that user shows up when I klist. However, that's the only ticket that shows up. If anyone could shed some light on this issue I would greatly appreciate it. Thanks, Matt