Hello,

I am using ntlm_auth, called from FreeRADIUS, to authenticate users on a network with their Active Directory credentials. The problem I seem to be having is that ntlm_auth sometimes takes longer than it should, and I can't get it to go faster reliably.

Some background information:

Users connect to a wireless network using 802.1X. That network sends requests to FreeRADIUS, which forks an ntlm_auth process to authenticate users against AD. ntlm_auth is called with the username and challenge contained in the RADIUS request, along with the NT response and the domain, as in:

    ntlm_auth --username=$USERNAME --challenge=$CHALLENGE --nt-response=$NT-RESPONSE --domain=$DOMAIN

An authentication is successful if ntlm_auth returns 0.

Since I had error messages in the logs pointing to requests timing out on ntlm_auth, I wrote a short C wrapper around ntlm_auth to log the time it takes to return (as well as the username and domain). That showed that while most (~90%) authentications succeed in less than 25ms, about 10% take longer than 100ms, with some taking as much as a few seconds (2-4s).

So I increased winbind max domain connections on the (Linux) server while also raising MaxConcurrentApi on the DC. I now see 39 connections open to the DC from winbind (that number fluctuates). And yet the problem remains.

What's more, it seems winbind is only or mostly using one of those 39 connections to the DC. When I trace the processes using strace, only the first child of winbind seems to be sending any requests. All the others are idle.

Can anyone shed some light on how winbind manages its connections to the DC? Has anyone else encountered this problem? Any recommendations for scaling ntlm_auth?

Here's my smb.conf file. The server is running RHEL 6.4 with winbind 3.6.9.

[global]
    workgroup = UUULOCAL
    server string = %h
    interfaces = 192.236.38.96/22
    security = ADS
    passdb backend = tdbsam
    realm = UUU.LOCAL
    encrypt passwords = yes
    winbind use default domain = yes
    client NTLMv2 auth = yes
    preferred master = no
    load printers = no
    cups options = raw
    winbind max clients = 750
    winbind offline logon = false
    winbind max domain connections = 50
    password server = uuu-dc04.usd.local, uuu-dc05.usd.local, uuu-dc02.usd.local, uuu-dc03.usd.local, uuu-dc06.usd.local, *
    log level = 1 winbind:5 auth:3

Best regards,

-- 
Louis Munro
lmunro at inverse.ca :: www.inverse.ca
+1.514.447.4918 x125 :: +1 (866) 353-6153 x125
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (www.packetfence.org)
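
The timing measurement described in the message above can be sketched as follows. This is not the author's C wrapper; it is a minimal Python illustration of the same idea. The names `timed_run` and `summarize` are hypothetical, and the 25ms and 100ms thresholds are taken from the figures quoted in the message.

```python
import subprocess
import time


def timed_run(argv):
    """Run a command and return (exit_code, elapsed_ms).

    stdout and stderr are inherited, so whatever the wrapped
    command prints still reaches the caller of the wrapper.
    """
    start = time.monotonic()
    exit_code = subprocess.call(argv)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return exit_code, elapsed_ms


def summarize(latencies_ms, fast_ms=25.0, slow_ms=100.0):
    """Return the fraction of samples below fast_ms and above slow_ms."""
    total = len(latencies_ms)
    if total == 0:
        return {"fast": 0.0, "slow": 0.0}
    fast = sum(1 for ms in latencies_ms if ms < fast_ms)
    slow = sum(1 for ms in latencies_ms if ms > slow_ms)
    return {"fast": fast / total, "slow": slow / total}
```

A wrapper script built on this would pass its own arguments through to ntlm_auth with `timed_run`, log the elapsed time, and exit with the returned code so that FreeRADIUS still sees 0 on success. Running `summarize` over the logged values gives the split between fast and slow authentications.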

On Mon, Sep 08, 2014 at 12:11:05PM -0400, Louis Munro wrote:
> What's more, It seems winbind is only or mostly using one out of those 39 connections to the DC.
> When I trace the processes using strace, only the first child of winbind seems to be sending any request.
> All the others are idle.
>
> Can anyone shed some light on how winbind manages its connections to the DC?
> Has anyone else encountered this problem? Any recommendations for scaling ntlm_auth?

winbind *should* balance between all children, at least we have code to do this.
If it does not work, there might be a bug, or else it might be that 38 out of 39 children sit in requests that block against the DC. Then only one child is used because it's the only responsive one.

In that situation, can you figure out the call stack of an unused child? Fedora has a "gstack" script that might be helpful here.

Moreover, the main scalability problem is probably that winbind only connects to one DC. It would be far better to connect to all available DCs, but DC location is pretty involved in winbind and needs some refactoring before this can be implemented.

Thanks,

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de
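
Collecting the call stack of every winbind child, as suggested above, could be scripted roughly like this. It is a sketch under assumptions: the process name `winbindd`, the helper names `find_pids` and `dump_stacks`, and the availability of `gstack` on the PATH are not confirmed by the thread, and attaching to another process normally requires root.

```python
import os
import subprocess


def find_pids(process_name, proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/stat names process_name."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                stat_line = handle.read()
        except OSError:
            continue  # process exited or is not readable
        # The command name is the text between the first "(" and last ")".
        name = stat_line[stat_line.find("(") + 1:stat_line.rfind(")")]
        if name == process_name:
            pids.append(int(entry))
    return sorted(pids)


def dump_stacks(pids, tool="gstack"):
    """Run the stack tool on each PID and return {pid: output_text}."""
    stacks = {}
    for pid in pids:
        result = subprocess.run(
            [tool, str(pid)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        stacks[pid] = result.stdout
    return stacks
```

Calling `dump_stacks(find_pids("winbindd"))` on the server would then give one trace per winbind process, which makes it easier to compare the idle children with the one that is busy.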

On 2014-09-09, at 5:52, Volker Lendecke <Volker.Lendecke at SerNet.DE> wrote:
> winbind *should* balance between all children, at least we
> have code to do this. If it does not work, there might be a
> bug or else it might be that 38 out of 39 children sit in
> requests that block against the DC. Then only one child is
> used because it's the only responsive one.
>
> In that situation, can you figure out the call stack of an
> unused child? Fedora has a "gstack" script that might be
> helpful here.

Hi Volker,

Here's a stack trace for a process that's idle. The process itself was forked off about 4 days ago, and yet it does not seem to be doing anything. It still has one connection open to the DC on port 445.

    # gstack 25501
    #0  0x00007f3e66c67218 in poll () from /lib64/libc.so.6
    #1  0x00007f3e69995990 in sys_poll ()
    #2  0x00007f3e699079aa in fork_domain_child ()
    #3  0x00007f3e699087e5 in wb_child_request_trigger ()
    #4  0x00007f3e699cf4ff in tevent_common_loop_immediate ()
    #5  0x00007f3e699cd78c in run_events_poll ()
    #6  0x00007f3e699cdef2 in s3_event_loop_once ()
    #7  0x00007f3e699ce300 in _tevent_loop_once ()
    #8  0x00007f3e698dff03 in main ()

Does that help?

Regards,

-- 
Louis Munro
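
When many such traces have to be compared, it may help to reduce each one to its function names. The sketch below parses gstack-style output and reports which function is doing the waiting. The names `parse_frames`, `wait_caller` and `WAIT_WRAPPERS` are hypothetical, and the set of wrapper functions is an assumption based only on the trace above.

```python
import re

# Matches lines such as "#0 0x00007f3e66c67218 in poll () from ..."
# and also frames printed without an address, such as "#0 poll () at ...".
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\w+)\s*\(")

# Functions treated as plain wait wrappers (assumed list, not exhaustive).
WAIT_WRAPPERS = {"poll", "sys_poll", "select", "sys_select", "epoll_wait"}


def parse_frames(trace_text):
    """Return the function names of a trace, innermost frame first."""
    return [match.group(2) for match in FRAME_RE.finditer(trace_text)]


def wait_caller(frames):
    """Return the first frame that is not a wait wrapper, or None."""
    for name in frames:
        if name not in WAIT_WRAPPERS:
            return name
    return None
```

A process blocked on a reply from the DC would probably also show poll() in its innermost frame, so the innermost frame alone does not distinguish the two cases. The function returned by `wait_caller` is the more useful value to compare across children.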