Server Gremlin
2007-Nov-02 15:51 UTC
[Samba] Samba Hung Process ("D state") and hung system
Hi,

I had an interesting problem with a Samba server recently, and I think I know what happened. But I want to post here to see if those wiser than I can confirm this or give me a better explanation of what went awry.

I initially noticed around 8:35 AM that Samba wasn't working at all, and that my server had huge load averages as reported by "uptime", though in "top" I could clearly see that there was no load on the CPU. A "ps -e u" showed me a large number of processes in the dreaded " D " state. I grepped the output of the ps command with "ps -e u | grep " D "" to get a list of all the processes in the D state. Nearly all of them were smbd, so I filtered out the smbd lines with "ps -e u | grep " D " | grep -v smbd". This gave me just 1 process. The full line reads as follows:

root 16909 0.0 0.0 2428 508 ? D 04:02 0:00 quotaoff /md2/lv00

I tried to shut down the system, but even THAT failed. Looking in ps revealed that the shutdown command was also in the " D " state. So I held the power button on the machine until it died and then brought it back up. Everything seemed fine at this point.

So what happened?

There is a cron job on the server scheduled for 4:02 AM every night that does nothing more than run quotaoff, quotacheck, and then quotaon. This time, for some reason, quotaoff failed miserably. It went into the "D" state permanently and locked up the whole hard disk, keeping anything else from using it. Of course, no one is trying to use the file server at 4 AM, so no problems were apparent yet.

But then at about 8:35 AM, when people started using the server, smbd processes started showing up. They all tried to access the hard disk, because that's where the files are, but they couldn't because of the hung quotaoff process. So they all started hanging (going into the D state), waiting permanently on the hard disk.

When I ran shutdown, it tried to unmount the filesystem, because that is part of the procedure. So it also entered the D state forever, because it was also waiting on the hard disk that quotaoff had somehow locked up. Thus the only fix was a nasty reboot using the power button on the box, because a process in the D state ignores all signals, including SIGKILL (9).

So how much of that did I get right...? Any remarks from helpful gurus?

Thanks,
- SG
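
P.S. Grepping for " D " works, but it is a bit fragile, since it matches that string anywhere on the line rather than only in the state column. Below is a minimal sketch of a tidier filter. It assumes the procps version of ps found on Linux, and the function names are made up for illustration; this is not what I actually ran that morning. It also asks ps for the wchan column, which names the kernel function a process is sleeping in and may hint at what a stuck process is waiting on.

```shell
# Keep only lines whose state column starts with "D" (uninterruptible sleep).
# Expects "PID STAT WCHAN COMMAND" lines on stdin.
# Optional $1: a command name to exclude from the result (e.g. smbd).
filter_d_state() {
    awk -v ex="${1:-}" '$2 ~ /^D/ && (ex == "" || $4 != ex)'
}

# Hypothetical wrapper: list D-state processes on the running system.
# "wchan:32" widens the wchan column; this width syntax is procps-specific.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state "$@"
}
```

Running "list_d_state" alone should list every process in the D state, and "list_d_state smbd" should list them with the smbd entries left out, which is roughly what my two grep commands were doing.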