Michael Tokarev
2025-Jun-25 17:14 UTC
[Samba] winbindd: how it chooses which LDAP servers to query?
Hi!

We're having a huge issue with at least one of our Samba servers, which is joined to a Samba AD. We had 2 DCs in two offices, each within its own site. Everything worked correctly, and it looked like all queries were made to the nearby DC, local to the server. Until we had a network/power outage and lost connectivity for over a day.

Now at least one of the Samba servers has almost completely stopped working: usernames can't be looked up anymore, so only root can log in over ssh, and Samba shares do not work at all.

winbindd constantly tries to reach a DC in the remote office, even though the local DC responds instantly. A simple `id mjt` takes about a minute, despite the router immediately returning "No route to host" for all packets destined for the remote DC (it isn't timing out). What's more, the results aren't cached, so the next run of the same `id mjt` takes another minute. Sometimes it succeeds after a minute (querying a local DC), and sometimes it reports "user not found", but either way this behavior breaks the whole system almost completely.

log.winbindd has a lot of entries like

[2025/06/25 19:32:48.961557, 1, traceid=40] source3/winbindd/wb_xids2sids.c:407(wb_xids2sids_recv)
  wb_sids_to_xids failed: NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND
[2025/06/25 19:32:48.961609, 1, traceid=40] source3/winbindd/winbindd_xids_to_sids.c:111(winbindd_xids_to_sids_recv)
  Could not convert xids: NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND

We've added another DC to this office, but it didn't help at all: the Samba server still tries to contact the remote DC.

This does not depend on the DNS config in use: exactly the same happens with resolv.conf pointing at the Samba AD DCs as nameservers as with external nameservers carrying static contents for the AD functionality. I tried playing with DNS records, temporarily removing the remote DC from the _ldap._tcp and _ldap._tcp.dc._msdcs sets of records, but this changed nothing, or at least I haven't seen a change.

Another Samba member server I've set up locally for testing does not have this issue; it immediately finds both DCs local to the site and starts querying one of the two.

Below is a typical level-5 debug from winbindd (asked for my groups). It correctly determines the local site, a good server to query, and the DC list. It performs a *lot* of DNS queries. But it does not even try other LDAP servers besides the remote one, despite correctly finding the right ones.

How does winbindd choose which LDAP server to query?

Thanks,

/mjt

---
Here, svdcm and svdcm2 are the two local DCs in Moscow-Office, with IP addresses 192.168.177.8 and .9, and svdcp is the remote DC with IP 192.168.19.6, which is unreachable.

child daemon request 54
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
namecache_fetch: name svdcm.tls.msk.ru#20 found.
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
saf_fetch: Returning "svdcm.tls.msk.ru" for "TLS.MSK.RU" domain
get_dc_list: preferred server list: "svdcm.tls.msk.ru, *"
resolve_ads: Attempting to resolve KDCs for TLS.MSK.RU using DNS
dns_rr_srv_fill_done: async DNS A lookup for svdcm.tls.msk.ru [0] got svdcm.tls.msk.ru -> 192.168.177.8
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm.tls.msk.ru returned 0 addresses.
dns_rr_srv_fill_done: async DNS A lookup for svdcm2.tls.msk.ru [0] got svdcm2.tls.msk.ru -> 192.168.177.9
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm2.tls.msk.ru returned 0 addresses.
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
namecache_fetch: name svdcm.tls.msk.ru#20 found.
get_dc_list: returning 2 ip addresses in an ordered list
get_dc_list: 192.168.177.8 192.168.177.9
saf_fetch: Returning "svdcm.tls.msk.ru" for "TLS.MSK.RU" domain
get_dc_list: preferred server list: "svdcm.tls.msk.ru, *"
resolve_ads: Attempting to resolve KDCs for TLS.MSK.RU using DNS
dns_rr_srv_fill_done: async DNS A lookup for svdcm2.tls.msk.ru [0] got svdcm2.tls.msk.ru -> 192.168.177.9
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm2.tls.msk.ru returned 0 addresses.
dns_rr_srv_fill_done: async DNS A lookup for svdcm.tls.msk.ru [0] got svdcm.tls.msk.ru -> 192.168.177.8
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm.tls.msk.ru returned 0 addresses.
dns_rr_srv_fill_done: async DNS A lookup for svdcp.tls.msk.ru [0] got svdcp.tls.msk.ru -> 192.168.19.6
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcp.tls.msk.ru returned 0 addresses.
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
namecache_fetch: name svdcm.tls.msk.ru#20 found.
get_dc_list: returning 3 ip addresses in an ordered list
get_dc_list: 192.168.177.8 192.168.177.9 192.168.19.6

At this point, strace shows that it sends UDP queries to .177.9 (which immediately replies) and to .19.6 (which never replies), apparently dislikes the answer from .177.9, sends a few more queries to .19.6, and finally:

get_kdc_ip_string: Failed to get KDC ip address
Finished processing child request 54

Here's the part from strace (note the relative timestamps in the first column; waiting for an answer takes quite some time):

0.000034 write(1, "get_dc_list: 192.168.177.8 192.1"..., 55get_dc_list: 192.168.177.8 192.168.177.9 192.168.19.6
) = 55
0.000036 epoll_create1(EPOLL_CLOEXEC) = 19
0.000049 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 20
0.000034 fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
0.000029 fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
0.000029 fcntl(20, F_GETFD) = 0
0.000028 fcntl(20, F_SETFD, FD_CLOEXEC) = 0
0.000028 connect(20, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("192.168.177.9")}, 16) = 0
0.000052 sendto(20, "0\\\2\3\0\352GcU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5"..., 94, 0, NULL, 0) = 94
0.000059 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 21
0.000034 fcntl(21, F_GETFL) = 0x2 (flags O_RDWR)
0.000028 fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK) = 0
0.000029 fcntl(21, F_GETFD) = 0
0.000028 fcntl(21, F_SETFD, FD_CLOEXEC) = 0
0.000028 connect(21, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("192.168.19.6")}, 16) = 0
0.000044 sendto(21, "0[\2\2y\32cU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5N"..., 93, 0, NULL, 0) = 93
0.000051 epoll_ctl(19, EPOLL_CTL_ADD, 20, {events=EPOLLIN|EPOLLRDHUP, data={u32=91469680, u64=94218789042032}}) = 0
0.000037 epoll_ctl(19, EPOLL_CTL_ADD, 21, {events=EPOLLIN|EPOLLRDHUP, data={u32=91478128, u64=94218789050480}}) = 0
0.000034 epoll_wait(19, [{events=EPOLLIN, data={u32=91469680, u64=94218789042032}}], 1, 2000) = 1
0.000816 ioctl(20, FIONREAD, [131]) = 0
0.000033 recvfrom(20, "0q\2\3\0\352Gdj\4\0000f0d\4\10netlogon1X\4V\27\0\0"..., 131, 0, {sa_family=AF_INET, sin_port=htons>
0.000054 epoll_ctl(19, EPOLL_CTL_DEL, 20, 0x7ffd48b2a064) = 0
0.000030 close(20) = 0
0.000041 epoll_wait(19, <unfinished ...>
0.965967 <... epoll_wait resumed>[], 1, 986) = 0
0.000035 close(12) = 0
0.000058 epoll_wait(3, <unfinished ...>
1.036069 <... epoll_wait resumed>[], 1, 2000) = 0
0.000212 sendto(21, "0[\2\2y\32cU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5N"..., 93, 0, NULL, 0) = 93
0.000100 epoll_wait(19, <unfinished ...>
1.683967 <... epoll_wait resumed>[], 1, 3700) = 0
0.000042 epoll_wait(3, <unfinished ...>
0.318118 <... epoll_wait resumed>[], 1, 2000) = 0
0.000209 sendto(21, "0[\2\2y\32cU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5N"..., 93, 0, NULL, 0) = 93
0.000095 epoll_wait(19, [], 1, 1995) = 0
1.995811 epoll_ctl(19, EPOLL_CTL_DEL, 21, 0x7ffd48b29f44) = 0
0.000049 close(21) = 0
0.000054 close(19) = 0
0.000045 write(1, "get_kdc_ip_string: Failed to get"..., 48get_kdc_ip_string: Failed to get KDC ip address
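
To see where the time goes in a trace like the one above, the waits can be totalled mechanically. The following is a minimal Python sketch, assuming the trace was captured with relative timestamps (strace -r), where the number on each line is the time elapsed since the previous line began. Under that assumption, each delta is charged to the call on the preceding line. The function name time_per_syscall and the SAMPLE excerpt (five lines copied from the trace) are illustrative only.

```python
import re
from collections import defaultdict

# Matches "<delta> name(" and "<delta> <... name resumed>" lines.
LINE = re.compile(r"^\s*(\d+\.\d+)\s+(?:<\.\.\.\s+(\w+)\s+resumed>|(\w+)\()")


def time_per_syscall(trace_lines):
    """Sum relative timestamps per syscall name.

    Assumption: the delta printed on a line is the time since the
    previous line started, so it is charged to the previous line's call.
    """
    totals = defaultdict(float)
    prev_name = None
    for line in trace_lines:
        m = LINE.match(line)
        if not m:
            continue  # e.g. the ") = 55" continuation of a write()
        delta = float(m.group(1))
        name = m.group(2) or m.group(3)
        if prev_name is not None:
            totals[prev_name] += delta
        prev_name = name
    return dict(totals)


# Five lines taken verbatim from the trace above.
SAMPLE = """\
0.000041 epoll_wait(19, <unfinished ...>
0.965967 <... epoll_wait resumed>[], 1, 986) = 0
0.000035 close(12) = 0
0.000058 epoll_wait(3, <unfinished ...>
1.036069 <... epoll_wait resumed>[], 1, 2000) = 0
""".splitlines()

if __name__ == "__main__":
    for name, seconds in sorted(time_per_syscall(SAMPLE).items(),
                                key=lambda kv: -kv[1]):
        print(f"{name:12s} {seconds:.6f}s")
```

On this excerpt almost all of the elapsed time lands on epoll_wait, i.e. on waiting for a reply that never arrives, rather than on any real work.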
Michael Tokarev
2025-Jun-25 20:28 UTC
[Samba] winbindd: how it chooses which LDAP servers to query?
I looked at this even more closely, and here's what I observed.

There are 3 DCs: 2 local to the site (svdcm & svdcm2) and one remote, belonging to the remote site (svdcp). At startup winbindd correctly determines the site it is on, but sends queries to svdcm and svdcp (not to the two local DCs, but to one local and one remote).

The local DC, svdcm, replies instantly, but winbindd keeps querying the remote one, which always responds with ENETUNREACH (which is ignored). This pattern repeats ad infinitum: query one local DC (which responds instantly), query the remote DC 3 times, and repeat. Eventually it returns either a cached entry or a "not found" error (logging "unable to find DC").

Once the remote DC becomes available and responds to a single query out of the many sent, winbindd stops querying the remote one and from then on queries only the local DC (one of them). It does not query the remote DC any more.

It looks like there's inverted logic somewhere in the code.

But another member server, which I just joined to the domain, does not *always* show this behavior: sometimes it does the same, but more often it switches to the local DC before the first answer from the remote. So it's not exactly conclusive.

It's a fun bug.

Thanks,

/mjt
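
For comparison, the behaviour one would naively expect can be sketched as below. This is a hypothetical model in Python, not winbindd's actual code: the DC class, pick_dc, the probe callable and the site name "Remote-Office" are all invented for illustration (only "Moscow-Office" and the host names come from the logs). The idea is simply that same-site DCs are tried first, the first DC that answers wins, and DCs that failed are remembered so they are tried last next time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DC:
    name: str
    site: str


def pick_dc(dcs, local_site, probe, known_bad=frozenset()):
    """Return (chosen DC or None, set of DC names that failed the probe).

    Hypothetical ordering: DCs not known to be bad come first, and
    within each group same-site DCs come before remote ones.
    `probe` is any callable returning True if the DC answered.
    """
    ordered = sorted(
        dcs, key=lambda dc: (dc.name in known_bad, dc.site != local_site)
    )
    failed = set(known_bad)
    for dc in ordered:
        if probe(dc):
            failed.discard(dc.name)
            return dc, failed
        failed.add(dc.name)
    return None, failed


if __name__ == "__main__":
    dcs = [
        DC("svdcp", "Remote-Office"),  # site name is an assumption
        DC("svdcm", "Moscow-Office"),
        DC("svdcm2", "Moscow-Office"),
    ]
    # Stand-in probe: pretend only the remote DC is unreachable.
    chosen, failed = pick_dc(dcs, "Moscow-Office", lambda dc: dc.name != "svdcp")
    print(chosen, failed)
```

With this ordering the unreachable remote DC is never even probed while a local one answers, which is the opposite of the pattern described above.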