Hi all,

I'm looking for help with a problem I'm having with Samba. For about a year our Samba fileserver has been a member of our worldwide corporate Active Directory. Before that, it was a member of our local NT4 domain. Since the move to Active Directory, the Samba server has become slower and sometimes does not respond to share requests at all.

I need to reduce the time it takes to access the Samba server on first connect. Even more important to me is stopping the outages in which the Samba fileserver is unavailable for more than 3 hours at a time. At the moment this looks like a problem with the total number of user access requests reaching winbind at the same time.

I'm based in the Netherlands, working for a company with a worldwide Active Directory setup. I cannot change any Active Directory settings, so I have to find the problem(s) from the Samba side.

I have 3 Samba fileservers. All of them respond slowly on first connect; the server in the main office is the one that sometimes does not respond at all. All servers use an LDAP backend for the UID/GID mappings. The LDAP server is installed on the main office server, and the other servers hold replicas as backups.

When we migrated to Active Directory, I ran into a problem: the SID history that the migration team said would prevent problems while the servers were still in the old domain and the users were already in Active Directory did not work for Samba. To work around this, I set up the LDAP backend and mapped the Active Directory SIDs of all Netherlands users to exactly the same UID/GID as their NT4 SIDs. That solved the SID history problem at the time. After all users had been migrated to Active Directory, the server was migrated as well to avoid further problems. The old NT4 SID entries are still in the LDAP database.
If the old NT4 entries could be part of the problem, I can remove them.

Rough estimate of user access per site:
Main office: 250 users (including some from the other 2 sites)
Warehouse: 150 users
Small office: 15 users

All sites have a local AD server, and Samba is configured to use only the AD server at its own site. All servers run Red Hat ES 4 with Samba 3.0.23d. Every Samba upgrade so far has helped in certain areas, but the general problems remain: slow response on first connect, and long outages on the main office server.

A recent unexpected event revealed something about the problem. Last Thursday a minor power glitch in the main office caused a lot of workstations and workfloor switches to restart. The server room was not affected, being on a UPS. When clients started logging on again, the Samba server did not respond to share access requests. Some users managed to access a share, only to find that 4 minutes later they could not access it again.

The Samba server itself is an HP ProLiant DL380 with two 3 GHz HT CPUs and 8 GB of RAM. The server had no load problems. Very few smbd daemons claimed CPU time (few users could get on). The winbind daemon claimed the most CPU time, but did not put any significant load on the system. The server remained in this state for about 3 hours. After that it returned to its usual behaviour: share access slower than it was in our NT4 domain, but working "normally" by current standards.

Our best test for whether the server is working normally again is whether the Nagios monitoring plugin can access shares. It uses the Linux smbclient and is less tolerant of how long it waits for a share. Stopping and starting all services during the outage makes no real difference, and neither does rebooting the server or the workstations. During the outage I added a share with guest access, which smbclient can access anonymously very quickly. If I connect to that share as an AD user with a password, it times out as well.
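The difference between anonymous and authenticated access can be timed the way the Nagios check experiences it. This is a hypothetical helper (not the Nagios plugin itself) that runs `smbclient <share> -c ls`, either with `-N` for guest access or with `-U user%password`, and reports the elapsed time or a timeout. Share names, credentials and the 30-second default timeout are placeholders, and a password passed with `-U` is visible in the process list.

```python
import subprocess
import sys
import time

# Hypothetical helper: share names, credentials and timeouts are placeholders.

def smbclient_argv(share, user=None, password=None):
    """Build an smbclient command that lists the share root once and exits."""
    argv = ["smbclient", share, "-c", "ls"]
    if user is None:
        argv.append("-N")  # no password: anonymous/guest access
    else:
        # user%password on the command line is visible in `ps` output
        argv += ["-U", f"{user}%{password}"]
    return argv

def time_command(argv, timeout_s):
    """Run argv; return (elapsed seconds, exit code), with None as the code on timeout."""
    start = time.monotonic()
    try:
        done = subprocess.run(argv, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout_s)
        code = done.returncode
    except subprocess.TimeoutExpired:
        code = None  # subprocess.run kills the child before raising
    return time.monotonic() - start, code

def report(label, share, timeout_s=30.0, **credentials):
    """Time one smbclient connect and print a one-line summary to stderr."""
    elapsed, code = time_command(smbclient_argv(share, **credentials), timeout_s)
    status = "timed out" if code is None else f"exit code {code}"
    print(f"{label}: {status} after {elapsed:.1f} s", file=sys.stderr)
    return elapsed, code
```

Calling `report("guest", "//server/guest")` and then `report("AD user", "//server/share", user="<user>", password="<password>")` during an outage should show whether only the authenticated path stalls; repeating the authenticated call right away would show the first-connect versus second-connect difference.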
Windows XP systems behave the same way: during the outage most of them cannot access the guest share, while during normal (but slow) operation they can.

Last Friday I did some tests with a test server, monitoring the network traffic. My test pattern was as follows:
- start Samba
- start tcpdump on the Samba server
- connect from my Windows XP box using: net use \\server\share
- about 1 minute later the XP box reports success

When I then examine the tcpdump capture in Wireshark, I notice the following. All packets to and from the server are answered quickly. I can see the client connecting to \\server\IPC$ at normal speed. At a certain point the client sends a 'Session Setup AndX Request' packet. In this packet I see (what I assume is its purpose) a SPNEGO section containing a Kerberos AP-REQ ticket. Then the client sends no packets for a long time. Eventually the Samba server answers the 'Session Setup AndX Request'. The SMB header of this packet shows that it is the answer to the client's request, and that the request took 52.48116500 seconds to be answered! I noticed no packets indicating that the server was talking to the AD server during those 52 seconds. The response packet includes a SPNEGO section reporting negResult accept-completed, with supportedMech 'MS KRB5 - Microsoft Kerberos 5'. After this packet the rest of the communication is handled quickly and Windows XP reports success on the share.

Getting to the share took 1 minute in total, which is not acceptable for LAN access. Waiting a few minutes and trying again is fast; only the first connect is slow.

I managed to get some more test results this weekend. I started the winbind daemon in interactive mode with debug level 3. When I then connect from my Windows XP system, I see the winbind daemon doing a lot of SID-to-GID lookups.
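To count how many distinct SIDs winbindd resolves for a single connect, and which domains they belong to (AD SIDs versus NT4 SID history), the interactive debug output can be saved to a file and filtered. This sketch assumes the relevant lines can be selected by a substring you pass in (the exact wording of winbindd's debug messages is deliberately not assumed) and treats everything before the final RID as the domain SID. Dividing the Session Setup delay seen in Wireshark by the number of distinct SIDs gives a rough per-lookup cost, assuming the lookups run one after another.

```python
import re
from collections import Counter

# A SID looks like S-1-5-21-<a>-<b>-<c>-<RID>; match any S-1-<n>-<n>... sequence.
SID_RE = re.compile(r"\bS-1-\d+(?:-\d+)+\b")

def distinct_sids(lines, marker):
    """Return SIDs from lines containing `marker` (an assumed log substring), first-seen order."""
    seen = {}
    for line in lines:
        if marker in line:
            for sid in SID_RE.findall(line):
                seen.setdefault(sid, None)
    return list(seen)

def domain_of(sid):
    """Strip the final RID, leaving the domain (or BUILTIN) SID."""
    return sid.rsplit("-", 1)[0]

def sids_per_domain(sids):
    """Count distinct SIDs per domain, e.g. to separate AD SIDs from NT4 SID history."""
    return Counter(domain_of(sid) for sid in sids)

def seconds_per_lookup(total_seconds, lookups):
    """Average time per lookup, assuming the lookups are serialized."""
    return total_seconds / lookups
```

For example, `seconds_per_lookup(52.481165, 85)` comes to roughly 0.62 s per SID, which is very slow for what should be a local idmap lookup and points at the backend rather than the network.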
I counted the lookups: 85 different SIDs. If I count the groups my user is a member of, that adds up to 59 in Active Directory (I believe some of the SIDs are SID history entries from the old domain). It takes the winbind daemon a long time to work through all 85 SIDs, which seems to match the time the Session Setup response took. When winbind finishes the lookups, the net use command reports success. A second run succeeds right away; I think that is because the smbd process assigned to my session is still running. If I kill that smbd process and connect again from my Windows XP box, winbind again goes through all the SIDs and reports success after a minute or more.

Could this be a problem with the LDAP backend I'm using? When I wipe the database on the test server, clearing about 8000 to 9000 entries, winbind responds much faster to the SID lookups. I don't want to lose all my mappings, but I could try to clear some of the old entries from the old NT4 domain.

It is also worth mentioning that during a system performance consultancy a couple of months ago, the Red Hat ES 4 configuration was changed to increase the number of open file descriptors in limits.conf, from the default of 1024 to 32768. It turned out that the Samba processes (especially winbind) opened a lot of files and sockets, and reports showed that the system's iowait percentage was too high before the change. This solved a large part of the recurring outages we experienced, but now the problem appears to be back in a different form.

Any help and pointers in the right direction are appreciated.

I'm currently testing with Samba 3.0.25a, but there the winbind daemon crashes a lot. Doing an 'ls -la' in a directory containing directories with Active Directory groups assigned is enough to crash it. I filed a bug report on Bugzilla.
URL: https://bugzilla.samba.org/show_bug.cgi?id=4667

Kind regards,
Ton Hoogstraten

The main site server config (note: company-specific values have been replaced and are shown between <>):

[global]
    workgroup = <workgroup>
    realm = <company realm>
    server string = Samba Fileserver
    security = ADS
    client schannel = No
    password server = <AD server1> <AD server2>
    restrict anonymous = 2
    log file = /var/log/samba/samba.log
    max log size = 150
    large readwrite = No
    name resolve order = host wins bcast
    time server = Yes
    server signing = auto
    client use spnego = No
    socket options = TCP_NODELAY IPTOS_LOWDELAY SO_KEEPALIVE SO_RCVBUF=8192 SO_SNDBUF=8192
    printcap name = /etc/printcap
    preferred master = No
    local master = No
    domain master = No
    dns proxy = No
    wins server = <wins server1> <wins server 2>
    ldap admin dn = cn=<y>,dc=<x>,dc=<z>
    ldap idmap suffix = ou=Idmap
    ldap suffix = dc=<x>,dc=<z>
    ldap ssl = no
    idmap backend = ldap:ldap://127.0.0.1
    idmap uid = 10000-2000000
    idmap gid = 10000-2000000
    template homedir = /home/%U
    winbind use default domain = Yes