thr3ads.net - freebsd stable - nfs send errors 32 and 35 on RELENG

Ollie Cook

2004-Jan-13 07:49 UTC

nfs send errors 32 and 35 on RELENG_4

Hi,

For a while I have been seeing errors of this nature on a cluster of i386
FreeBSD RELENG_4 hosts which mount a volume from a NetApp F825 filer using
NFSv3 over a mixture of UDP and TCP, depending on whether the host is on the
same local LAN as the filer or not:

Jan 13 14:02:02 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: not responding
Jan 13 14:02:03 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: is alive again

The messages are logged with alarming regularity, but don't seem to actually
have any bearing on the performance or availability of the volume. My full
findings are in my initial post to freebsd-net, which has been archived here:

http://www.freebsd.org/cgi/getmsg.cgi?fetch=178585+184466+/usr/local/www/db/text/2004/freebsd-net/20040111.freebsd-net

However more recently, and especially today, I am seeing errors which *are*
affecting the availability of the mount point on one of the hosts in question:

Jan 13 14:09:37 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:42 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:47 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:52 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
Jan 13 14:09:53 mese /kernel: nfs send error 32 for server 192.168.1.1:/vol/vol1/claramail

We are running version 1.60.2.6 of nfs_socket.c, which is generating this
message. Looking at the CVS Web Repository, that seems to be the latest version
for RELENG_4.

A quick google suggests that error 32 is 'OK' in the sense that the TCP
connection should be reestablished and things can pick up where they left
off[1], but I can't find what causes error 35. 35 seems to be the more
abundant error, in any case.

The symptoms on the hosts when these errors occur are:

 - processes accessing files on the remote volume get stuck in disk wait,
   specifically their state is 'nfsrcv'.
 - even when all processes accessing volume are killed, and lsof shows no
   open files on the volume, "umount /vol" claims the device is busy.
 - a "umount -f" hangs and the umount process can't be killed.
 - however, after a "umount -f", /vol is not listed in "mount" or "df"
 - similarly, trying to then mount the volume, "mount" hangs and can't be
   killed, and the volume does not appear in "mount" or "df" (in fact, df
   hangs too, presumably as it's trying to work out available space etc.)
 - a tcpdump between client and server doesn't show any NFS traffic at all
   being emitted by the client, although IP connectivity to the server is
   maintained, and other hosts are able to still talk NFS to it happily.

I tried to reboot the host in question to restore service, but it stayed
multi-user. The host was in a remote data centre so in the end it had to be
power cycled. The host wasn't on console so I wasn't able to determine why it
stayed multi-user.

I'm at a loss as to how to further debug this. It occurs to me that determining
what error 35 is would be helpful. :) I've looked in a book that I have
available[2], but it lists neither error 32 nor 35. Is there an up-to-date list
of NFSv3 errors anywhere?

At this stage, any and all advice on where to look and what data I can usefully
retrieve that would help analyse this problem would be gratefully received.

Cheers,

Ollie

1: http://lists.freebsd.org/pipermail/freebsd-hackers/2003-July/001988.html
2: NFS Illustrated, Brent Callaghan, First Printing, ISBN 0-201-32570-5

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net               +44 20 7903 3065

Doug White

2004-Jan-13 11:51 UTC


nfs send errors 32 and 35 on RELENG_4

On Tue, 13 Jan 2004, Ollie Cook wrote:
> For a while I have been seeing errors of this nature on a cluster of i386
> FreeBSD RELENG_4 hosts which mount a volume from a NetApp F825 filer using
> NFSv3 over a mixture of UDP and TCP, depending on whether the host is on the
> same local LAN as the filer or not:
>
> Jan 13 14:02:02 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: not responding
> Jan 13 14:02:03 mese /kernel: nfs server 192.168.1.1:/vol/vol1/claramail: is alive again
There are some tuning options for this, which I don't immediately recall.
Under heavy load these messages are somewhat normal.
> Jan 13 14:09:37 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
> Jan 13 14:09:42 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
> Jan 13 14:09:47 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
> Jan 13 14:09:52 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail
> Jan 13 14:09:53 mese /kernel: nfs send error 32 for server 192.168.1.1:/vol/vol1/claramail
These errors tend to imply resource shortages. Monitor netstat -m output
and make sure you aren't running out of mbuf or mbuf clusters. Also check
for network errors and dropped packets (netstat -s, switch statistics).
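
For reference, the number in "nfs send error N" appears to be the errno value
returned by the kernel's socket send path, not an NFSv3 protocol status, which
would explain why an NFS book does not list it. On FreeBSD, 32 is EPIPE and 35
is EAGAIN ("Resource temporarily unavailable"), which fits the resource
shortage reading above. The sketch below tallies these messages from syslog
lines and names the codes. The lookup table is hard-coded from FreeBSD's
<sys/errno.h> as an assumption (other systems number errno values
differently), and the helper names are hypothetical, not part of any tool.

```python
import re
from collections import Counter

# Assumption: errno numbering as found in FreeBSD's <sys/errno.h>.
# Other operating systems assign different meanings to these numbers.
FREEBSD_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    35: ("EAGAIN", "Resource temporarily unavailable"),
    55: ("ENOBUFS", "No buffer space available"),
}

# Matches the kernel message format quoted in this thread.
SEND_ERROR = re.compile(r"nfs send error (\d+) for server")


def count_send_errors(lines):
    """Count 'nfs send error N' messages per error number."""
    counts = Counter()
    for line in lines:
        match = SEND_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def describe(code):
    """Return a readable label for an errno value from the table above."""
    name, text = FREEBSD_ERRNO.get(code, ("unknown", "not in this table"))
    return f"{code} {name}: {text}"


if __name__ == "__main__":
    # Sample lines copied from the log excerpt in this thread.
    sample = [
        "Jan 13 14:09:52 mese /kernel: nfs send error 35 for server 192.168.1.1:/vol/vol1/claramail",
        "Jan 13 14:09:53 mese /kernel: nfs send error 32 for server 192.168.1.1:/vol/vol1/claramail",
    ]
    for code, seen in sorted(count_send_errors(sample).items()):
        print(f"{seen} x {describe(code)}")
```

Feeding it the contents of /var/log/messages alongside periodic netstat -m
snapshots may help show whether bursts of error 35 line up with mbuf
exhaustion.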

Are you running rpc.lockd?

-- 
Doug White                    |  FreeBSD: The Power to Serve
dwhite@gumbysoft.com          |  www.FreeBSD.org
