On Jun 26, 2003 06:46 -0700, Dale wrote:
> I have a problem with an NFS server for my network. It has run kernels
> 2.4.18-ac4 - 2.4.21-ac1, all with problems. The -ac patches are used
> to provide the new style quota support. The system seems to have
> gotten even less stable with the new kernel versions.
>
> This morning around 5 am, I got a page that the system was not responding to
> NFS requests. I ssh'd in, and found the loadavg at ~50. Below are
> some snippets from ps at the time:
>
> root 3414 0.8 0.1 3904 3048 ? DN 04:02 1:45
> /usr/bin/updatedb -f NFS,SMBFS,NCPFS,PROC,DEVPTS -e /tmp,/var/tmp,/us
> root 3979 0.0 0.0 2588 1192 ? DN 04:14 0:00
> /usr/bin/rsync -aH --delete /home/puser1 /home/puser2 /home/puser3
>
> The rsync command is backing up across the network to a backup nfs
> server. updatedb starts at 4:02 am, and the rsync had been running
> since 3:30 and was half-way completed (estimated by the 'p' in the
> username).
>
> Also there were 32 nfsd's just like this:
> root 851 0.0 0.0 0 0 ? DW Jun19 4:35 [nfsd]
>
> and these (the other 4 kjournald's were in SW):
> root 7 0.1 0.0 0 0 ? DW Jun19 17:04 [kswapd]
> root 144 0.0 0.0 0 0 ? DW Jun19 6:53 [kjournald]
>
> I'm wondering what my options are. This has happened ~10 times in the
> last 6 months, although the system once went ~120 days without a
> hiccup. This last time, on 2.4.21-ac1, it lasted only 6 days.
> It wouldn't be so bad if a `shutdown -r now` would restart it, but it
> hangs while shutting down nfs and during killall, and needs a hard
> reboot.
This almost certainly is a lock deadlock of some sort. I've had pretty
good luck in debugging such problems just by running "sysrq-T" on the
console and/or using "crash" to examine the running kernel. This needs
a fair amount of knowledge of the various locks in ext3. The most
common problems are lock ordering problems: one process
starts a journal transaction and then blocks on a lock (e.g. directory
or inode semaphore, or superblock lock), while some other process holds
that lock and tries to start a new transaction when the journal is full.
The journal being full is a crucial issue, because if it isn't full you
can start a new transaction without problems, but when it is full you need
to flush the journal and wait for all existing users to free up their handles,
which will never happen if the first process has a transaction handle and is
blocked waiting for a lock the second process is holding.
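
To make the ordering problem concrete, here is a toy user-space model in
Python. It is only a sketch and not ext3 or JBD code: the Journal class,
the would_deadlock() function and the "slots" parameter are all made up
for illustration, a "full" journal is reduced to a semaphore with no free
slots, and a timeout stands in for what would be an indefinite wait in
the kernel.

```python
import threading


class Journal:
    """Toy journal: at most `slots` transaction handles may be open at once."""

    def __init__(self, slots):
        self._free = threading.Semaphore(slots)

    def start_handle(self, timeout=None):
        # Blocks while the journal is "full"; returns False if the wait times out.
        return self._free.acquire(timeout=timeout)

    def stop_handle(self):
        self._free.release()


def would_deadlock(slots, timeout=0.2):
    """Return True if process B can never start its transaction.

    The timeout is an assumption of this model: it replaces the
    indefinite wait so that the demonstration terminates.
    """
    journal = Journal(slots)
    inode_lock = threading.Lock()
    a_has_handle = threading.Event()
    b_has_lock = threading.Event()
    outcome = {}

    def proc_a():
        # A: start a transaction first, then block on the lock B holds.
        journal.start_handle()
        a_has_handle.set()
        b_has_lock.wait()
        with inode_lock:
            pass
        journal.stop_handle()

    def proc_b():
        # B: take the lock first, then try to start a transaction.
        with inode_lock:
            b_has_lock.set()
            a_has_handle.wait()
            started = journal.start_handle(timeout=timeout)
            outcome["b_started"] = started
            if started:
                journal.stop_handle()
        # Leaving the `with` block drops the lock, which lets A finish.

    threads = [threading.Thread(target=proc_a), threading.Thread(target=proc_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not outcome["b_started"]
```

With would_deadlock(1) the journal is full as soon as A has its handle, so
B's attempt to start a transaction can only time out and the function
returns True. With would_deadlock(2) there is a free slot, B starts
immediately, and the function returns False, which is the "not full, no
problem" case.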
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/