I encountered a serious Samba problem and want to publish the details for public
benefit.
A SLES 10 server running Samba 3.0.28 as domain controller, file server and CUPS
print server, running uneventfully for 2 years, suddenly drops all users; the
load rapidly grows to about 250 and the server becomes unresponsive. smbstatus
reveals that every user has about 10 instances of smbd instead of one. CPU (dual
processor, dual core) utilization is very low (2%, mostly X and top). A reboot
clears the problem, but the issue returns every 30 minutes or so. The logs are
empty of any useful info: /var/log/messages and /var/log/samba/log.smbd. dmesg
shows no errors. The system is not using any swap space. The server passes all
diagnostics possible. The system is fully patched. The 2 TB RAID array attached
via 320 SCSI checks clean under fsck with zero errors, and so does each of the
local file system slices. The open file limit is not reached: the limit is
~202000, and lsof says only 8800 files are open during load spool-up.
50 irritated people idle.
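A load of 250 with an idle CPU fits the way Linux computes load average, which counts tasks in uninterruptible sleep (state D) as well as runnable ones. For anyone chasing the same symptom, here is a minimal shell sketch for checking this during a spool-up. The function name count_blocked is made up for illustration, and it assumes a procps-style ps that accepts -eo stat=,comm=.

```shell
# count_blocked: read "STAT COMMAND" lines on stdin and print, per command,
# the total number of processes and how many are in state D
# (uninterruptible sleep). The name is hypothetical, not a Samba tool.
count_blocked() {
    awk '{
             total[$2]++
             if ($1 ~ /^D/) blocked[$2]++
         }
         END {
             for (c in total)
                 printf "%s %d %d\n", c, total[c], blocked[c] + 0
         }' | LC_ALL=C sort
}
```

It would be fed with something like `ps -eo stat=,comm= | count_blocked`. A line where the second and third numbers for smbd are both large and nearly equal would point at blocked I/O rather than CPU starvation.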
Grasping at straws, we verify that all 50 Windows XP clients have the latest
virus signatures and do a deep scan of every machine. Two viruses are
discovered, but neither seems responsible.
A clue comes in from a user: "Every time I try to open a certain file, my
system freezes." Oh really...
I go to the subdirectory where the suspect file is located, via the Linux
console, and ls the directory. 9 files. ls -al gets Killed. After running
ls -al filename for each of the 9 files, I determine that 5 of them are badly
corrupt. I perform an experiment: tell everyone to leave these files alone and
reboot the server, and it runs happily for an hour. Load average is 0.05. I ask
one user to attempt to open one of the corrupt files, and instantly all 50 smbd
daemons go to uninterruptible sleep, every WinXP client instantly re-establishes
its smbd session with the server, and these (all 50) new smbd sessions also die
and go to heaven. This cycle repeats rapidly, sending the load sky high with no
CPU utilization to speak of.
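The per-file check (ls -al on each name in turn) can be scripted so that a probe which gets Killed does not end the whole run. This is only a sketch: probe_dir is a hypothetical name, it skips hidden files, and it relies on the shell expanding the glob from the directory listing alone, which matches the observation that a plain ls of the directory worked. If a probe hangs in uninterruptible sleep instead of dying, the loop will hang with it.

```shell
# probe_dir DIR: run `ls -ld` on each entry of DIR in its own child process
# and report how that child ended. Hypothetical helper for illustration.
probe_dir() {
    dir=$1
    for f in "$dir"/*; do
        # The parent shell never stats the file itself; only the child does.
        ls -ld -- "$f" >/dev/null 2>&1
        rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "ok $f"
        elif [ "$rc" -gt 128 ]; then
            # Shells report death by signal N as 128+N; 137 means SIGKILL.
            echo "killed(signal $((rc - 128))) $f"
        else
            echo "failed(rc=$rc) $f"
        fi
    done
}
```

An empty directory leaves the glob unexpanded, so it shows up as a single "failed" line rather than no output.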
The short-term fix is to move the offending directory to another place on the
volume, out of scope of any share. I am not sure how to delete these files, as
Linux tools seem unable to handle them.
Questions that remain:
1. Why do all client smbd daemons have to die if only one of them ran into
trouble?
2. How do files get into a state where they can't be viewed or managed?
Virus, lack of sunspots?
3. Why did fsck say that the filesystem was fine, when obviously it
isn't?
4. How to delete these poison files?
Karl