Hello-

We've got a variation of the "multiple smbd processes on an NFS-mounted filesystem" problem, and I'm wondering if someone can help. In this case it's not NFS that's the problem, but HSM (which can cause symptoms similar to a non-responding NFS filesystem).

We are currently running Samba 2.0.6 on our Sun fileservers, primarily for home directories and group data. All of the filesystems that are accessed via Samba are managed by Veritas Storage Migrator (HSM), using optical and tape (in that order) as secondary storage.

The main symptom we are experiencing is multiple smbd processes per user. The severity of the problem varies depending on the root cause. Here are the scenarios that come up:

1. Access to files migrated to tape (blocks smbd)

Files migrated to tape can take as long as 10 minutes to retrieve. While attempting to access such a file, the Windows NT redirector times out after about 45 seconds and opens a new connection (spawning a second smbd process). This repeats until retrieval is complete. The main symptom is that a user gets sharing violations trying to access their own files -- this is because the blocked smbd process holds locks on other files, and the "new" smbd process cannot work with these locks. One thing that may help here is to increase the redirector timeout so it waits longer (if I can ever get our NT admin folks to push out the REG file to all the clients!). Unfortunately, this is the least serious of our problems.

2. Full filesystems cause many smbd processes to appear

This is similar to 1., except that access to an entire filesystem is blocked until the migration system has reclaimed space (i.e. migrated files out in response to a full disk). In this case, the scenario above happens to every user accessing that filesystem until the space situation is resolved. How long that takes varies depending on migration criteria, responsiveness of the secondary media, and sysadmin response time.

3. General HSM failure

This is obviously the worst situation: the HSM system stops responding due to some failure, access to all filesystems is blocked, and the smbd process load doubles after 45 seconds, then continues to increase by that amount every 45 seconds until the underlying problem is fixed. 1500+ smbd processes using up all of the system memory and process space make it difficult to fix that underlying problem.

Does anyone have ANY suggestions for getting around this problem? The main problem here is that the client is allowed to time out and cause Samba to fork off another smbd process. One suggestion I've seen is to set keepalives in the smb.conf file (i.e. keepalive = 30), but whether this will work depends on which process is handling the keepalives. If the child smbd processes handle the keepalives, it probably won't help, since smbd won't be able to send/receive keepalives while it is blocked in a read() or write() system call (which is what happens when HSM is unable to immediately satisfy a request).

Oh, to make matters worse, this is a two-node SunCluster HA cluster, with separate Samba configs per logical host (binding to separate logical interfaces). This means it's not unusual for a user to have two smbd processes running when both logical hosts are failed over to the same physical host...

Any hints?? We're probably the only site insane enough to combine Samba + HSM + SunCluster.. :-) :-S

-Andrew Cherry
UNIX System Admin
Cummins Engine Company
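A rough back-of-the-envelope model of the growth described in scenario 3 may help size the problem. It assumes (as a simplification, not a measured fact) that every blocked client opens exactly one new connection each time the 45-second redirector timeout expires, and that none of the blocked smbd processes exit in the meantime. The function name and parameters are made up for illustration.

```python
def smbd_process_count(blocked_clients, blocked_seconds, redirector_timeout=45):
    """Estimate smbd processes after a stall of blocked_seconds.

    Assumption: each blocked client keeps its original smbd and adds one
    more every time the redirector timeout elapses (linear growth).
    """
    retries = blocked_seconds // redirector_timeout
    return blocked_clients * (1 + retries)


if __name__ == "__main__":
    # One user waiting on a 10-minute tape recall.
    print(smbd_process_count(1, 600))
    # 100 blocked users, sampled every 45 seconds for 3 minutes.
    for t in (0, 45, 90, 135, 180):
        print(t, smbd_process_count(100, t))
```

Under these assumptions the count grows linearly with the length of the stall, so the only levers are a longer client timeout or getting the old processes cleaned up.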
acherry@pobox.com wrote:
>
> Any hints?? We're probably the only site insane enough
> to combine Samba + HSM + SunCluster.. :-) :-S

Andrew,

I think you are the only site I have ever heard of trying to do this. :) Sorry. Wish I had more information for you.

Cheers, jerry
----------------------------------------------------------------------
/\ Gerald (Jerry) Carter Professional Services
\/ http://www.valinux.com/ VA Linux Systems gcarter@valinux.com
http://www.samba.org/ SAMBA Team jerry@samba.org
http://www.plainjoe.org/ jerry@plainjoe.org

"...a hundred billion castaways looking for a home."
- Sting "Message in a Bottle" ( 1979 )
Andrew Cherry wrote:
| Files migrated to tape can take as long as 10 minutes to
| retrieve. While attempting to access such a file, the Windows
| NT redirector times out after about 45 seconds, and opens up
| a new connection (spawning a second smbd process).

Unfortunately, NT doesn't realize that it should tell the server that it's disconnecting. It probably assumes that the server has crashed. Samba can't do much about this, but it can be told to clean up the old (now disconnected) smbd process.

Do try "keepalive = 60", and we'll see if the code path allows keepalive processing to run while there is a read outstanding.

If not, we'll probably have to raise this on samba-technical and see if there's a way to do so. Hmmm..., or perhaps a way to detect excessive HSM/NFS delay at read-time.

--dave
--
David Collier-Brown,           | Always do right. This will gratify
Performance & Engineering Team | some people and astonish the rest.
Americas Customer Engineering  |                      -- Mark Twain
(905) 415-2849                 | davecb@canada.sun.com
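To make the "detect excessive delay at read-time" idea concrete, here is a minimal sketch in Python (smbd itself is C, and this is not how Samba is implemented). It assumes the potentially blocking read can be handed to a helper thread so the caller can notice when a deadline passes; `read_with_deadline` and its arguments are hypothetical names.

```python
import threading


def read_with_deadline(read_fn, timeout_s):
    """Run read_fn() in a helper thread and wait at most timeout_s seconds.

    Returns (True, data) if the read finished in time, (False, None) if it
    is still blocked. The helper thread is NOT cancelled on timeout; it
    stays blocked until the underlying read returns.
    """
    result = {}
    done = threading.Event()

    def worker():
        try:
            result["data"] = read_fn()
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(timeout_s):
        return False, None
    if "error" in result:
        raise result["error"]
    return True, result["data"]
```

Note the limitation: the caller learns that the read is slow, but the blocked read is still outstanding, which is the same constraint that makes keepalive handling awkward while smbd sits in read() or write().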
David Collier-Brown <David.Collier-Brown@canada.sun.com> wrote:
> Andrew Cherry wrote:
> | Files migrated to tape can take as long as 10 minutes to
> | retrieve. While attempting to access such a file, the Windows
> | NT redirector times out after about 45 seconds, and opens up
> | a new connection (spawning a second smbd process).
>
> ...
>
> If not, we'll probably have to raise this on samba-technical
> and see if there's a way to do so. Hmmm..., or perhaps a
> way to detect excessive HSM/NFS delay at read-time.

Well, in theory that's possible on the NFS side at least. NFSv3 has an error code specifically for this case: NFSERR_JUKEBOX. From RFC1813:

    The server initiated the request, but was not able to complete it
    in a timely fashion. The client should wait and then try the
    request with a new RPC transaction ID. For example, this error
    should be returned from a server that supports hierarchical
    storage and receives a request to process a file that has been
    migrated. In this case, the server should start the immigration
    process and respond to client with this error.

The proposed NFSv4 has a nearly identical code called NFS4ERR_DELAY.

So as long as both the NFS client and server fully support NFSv3/4, the client can find out when the HSM server is trying to locate a file. Now, of course, getting that information to a userland program like Samba is another matter entirely...

---------------------------+---------------------------------------------------
Bryan Feir           VA3GBF|"A wrangle is the disinclination of two boarders to
Work:bryan@sgl.crestech.ca | each other that meet together but are not in the
Home:jenora@sympatico.ca   | same line." -- Stephen Leacock
---------------------------+---------------------------------------------------
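The client behaviour the RFC text describes (wait, then retry with a new RPC transaction ID) can be sketched as a small retry loop. This is an illustration only: `send_rpc` is a hypothetical callable standing in for a real NFS client, and the attempt limit and delay are arbitrary. The status value 10008 is the one RFC 1813 assigns to the jukebox error, which it spells NFS3ERR_JUKEBOX.

```python
import itertools
import time

NFS3_OK = 0
NFS3ERR_JUKEBOX = 10008  # value from RFC 1813


def call_with_jukebox_retry(send_rpc, max_attempts=5, delay_s=1.0, sleep=time.sleep):
    """Call send_rpc(xid) until it stops answering NFS3ERR_JUKEBOX.

    send_rpc is a hypothetical function returning (status, payload).
    Every attempt uses a fresh transaction ID, as the RFC asks.
    """
    xids = itertools.count(1)
    for attempt in range(max_attempts):
        status, payload = send_rpc(next(xids))
        if status != NFS3ERR_JUKEBOX:
            return status, payload
        if attempt + 1 < max_attempts:
            sleep(delay_s)  # give the HSM time to recall the file
    return NFS3ERR_JUKEBOX, None
```

The `sleep` parameter exists only so the delay can be replaced in testing; a real client would pick its own back-off policy.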
Andrew Cherry wrote:
| This of course goes under the assumption that each user/client
| combination should have only one smbd process. Can anyone think of
| any situations where a single user logged onto an NT workstation would
| have more than one SMB connection open to the same server?

Yes, but the SMB spec **specifically** says you're allowed to restrict to just one. The two-connection case is rare, and NT tries to keep you from making more than one connection. Two connections were once used here for secretaries connecting as both themselves and as their bosses...

Give me a call or send me your phone number: we should talk about this by voice...

--dave (wearing his Sun hat) c-b
--
David Collier-Brown,           | Always do right. This will gratify
Performance & Engineering Team | some people and astonish the rest.
Americas Customer Engineering  |                      -- Mark Twain
(905) 415-2849                 | davecb@canada.sun.com
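If one smbd per user/client combination is taken as the rule, picking out the leftover processes is straightforward. The sketch below assumes a list of (pid, user, client, start time) tuples has already been gathered somehow; that input format and the function name are hypothetical, and nothing here reads real Samba state. It would also wrongly flag the rare legitimate two-connection case mentioned above.

```python
def stale_smbd_pids(sessions):
    """Return pids of all but the newest smbd per (user, client) pair.

    sessions: iterable of (pid, user, client, started) tuples, where
    started is any comparable timestamp. Hypothetical input format.
    """
    sessions = list(sessions)
    newest = {}
    for pid, user, client, started in sessions:
        key = (user, client)
        if key not in newest or started > newest[key][1]:
            newest[key] = (pid, started)
    keep = {pid for pid, _ in newest.values()}
    return sorted(pid for pid, _, _, _ in sessions if pid not in keep)
```

Whether it is safe to terminate the processes this returns is a separate question, since a blocked smbd may still hold locks on files the user needs.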