thr3ads.net - freebsd stable - NFS trouble on 7.3-STABLE i386 [May 2010]

Mark Morley

2010-May-21 14:56 UTC

NFS trouble on 7.3-STABLE i386

Having an issue with a file server here (7.3-STABLE i386)

The nfsd processes are hanging.  Client access to the nfs shares stops working
and the nfsd processes on the server cannot be killed by any means.  There are
no errors showing up anywhere on the server.  The network connection to the
server seems fine (ie: anything other than nfs traffic seems ok).  Rebooting the
server fixes the problem for a while, but it doesn't reboot easily.  It
times out on terminating the nfsd processes.  When it finally does reboot the
file system isn't marked clean, resulting in a long wait for fsck (although
it doesn't find any problems, it's a multi-terabyte share and it takes
a while).

This morning it did it again.  This time I tried manually killing nfsd but
nothing I did would make them die.  No errors.

The server is a dual core intel cpu with 2 gigs of ram.
Adaptec 5805 raid controller, 8 x 750G drives, RAID 6
2 x em interfaces

It's been fine until some time last week.  I did recently upgrade from
7.1 to 7.3, which may be related, although this issue didn't start happening
right away.  No particular time of day and it doesn't seem to coincide with
any particular cron tasks or have anything to do with the level of activity.

Any thoughts?

Mark

Rick Macklem

2010-May-21 15:17 UTC

On Fri, 21 May 2010, Mark Morley wrote:
> Having an issue with a file server here (7.3-STABLE i386)
>
> The nfsd processes are hanging.  Client access to the nfs shares stops
> working and the nfsd processes on the server cannot be killed by any means.
> There are no errors showing up anywhere on the server.  The network
> connection to the server seems fine (ie: anything other than nfs traffic
> seems ok).  Rebooting the server fixes the problem for a while, but it
> doesn't reboot easily.  It times out on terminating the nfsd processes.
> When it finally does reboot the file system isn't marked clean, resulting
> in a long wait for fsck (although it doesn't find any problems, it's a
> multi-terabyte share and it takes a while).
>
> This morning it did it again.  This time I tried manually killing nfsd but
> nothing I did would make them die.  No errors.

Next time it happens, do a "ps axlH" to see what the nfsd threads are
waiting for. It might give you a hint as to what is happening.
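
With 32 nfsd threads the raw listing gets long, so it may help to collapse it
into one line per wait channel. The following is only a sketch:
nfsd_wchan_summary is a hypothetical helper, and it assumes the input has
three columns (PID, wait channel, command name), which is what
"ps -axH -o pid,mwchan,comm" is expected to print on FreeBSD. A thread that
is not sleeping shows "-" as its wait channel.

```shell
#!/bin/sh
# Summarize nfsd threads by wait channel.
# stdin:  rows of "PID MWCHAN COMMAND" (a header line is harmless).
# stdout: "<count> <wchan>" per distinct wait channel, largest count first.
nfsd_wchan_summary() {
    awk '$3 == "nfsd" { n[$2]++ }
         END { for (w in n) print n[w], w }' | sort -k1,1nr -k2,2
}

# Assumed invocation on the FreeBSD server (not run here):
#   ps -axH -o pid,mwchan,comm | nfsd_wchan_summary
```

If most of the threads end up on a single line of the summary, that wait
channel is the place to start looking.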

rick

Jeremy Chadwick

2010-May-21 15:21 UTC

On Fri, May 21, 2010 at 07:45:47AM -0700, Mark Morley wrote:
> Having an issue with a file server here (7.3-STABLE i386)
>
> The nfsd processes are hanging.  Client access to the nfs shares stops
> working and the nfsd processes on the server cannot be killed by any
> means.  There are no errors showing up anywhere on the server.  The
> network connection to the server seems fine (ie: anything other than
> nfs traffic seems ok).  Rebooting the server fixes the problem for a
> while, but it doesn't reboot easily.  It times out on terminating the
> nfsd processes.  When it finally does reboot the file system isn't
> marked clean, resulting in a long wait for fsck (although it doesn't
> find any problems, it's a multi-terabyte share and it takes a while).

I can't explain the dirty filesystem problem, especially if the server
does reboot/shut down properly.

> This morning it did it again.  This time I tried manually killing nfsd
> but nothing I did would make them die.  No errors.
>
> ...
>
> Any thoughts?

1) Are you forcing TCP or UDP NFS, or just using the default?

2) Is RPC still working?  Try running rpcinfo on both the client and
server.

3) Using rpcinfo and netstat, figure out what TCP or UDP port the client
is communicating with on the server, then use tcpdump to sniff traffic
on both the client and server specific to those port numbers and see if
there's any network I/O happening.

4) On the server, ktrace -t + -p {nfsd-pid} (I'm not sure which of the
two (master vs.  server) though) to see if anything is going on.
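
Steps 2 through 4 could be scripted roughly as follows. This is only a
sketch: nfs_ports is a hypothetical helper that picks the NFS entries (RPC
program 100003) out of "rpcinfo -p" output, and the host name "server", the
interface em0 and port 2049 in the commented commands are assumptions to be
replaced with whatever rpcinfo and netstat actually report.

```shell
#!/bin/sh
# Print the distinct "proto port" pairs registered for NFS (RPC program
# 100003), reading `rpcinfo -p` output on stdin.
nfs_ports() {
    awk '$1 == 100003 { print $3, $4 }' | sort -u
}

# Intended use (not run here; host, interface and port are assumptions):
#   rpcinfo -p server | nfs_ports        # step 2: is RPC still answering?
#   netstat -an | grep 2049              # step 3: which sockets are in use
#   tcpdump -n -i em0 host server and port 2049
#   ktrace -t + -p <nfsd-pid>            # step 4, then read the trace with kdump
```

If rpcinfo answers but tcpdump shows requests arriving with no replies going
back, the problem is more likely inside nfsd than on the network.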

Rick Macklem probably has some better ideas than these though.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Mark Morley

2010-May-25 21:52 UTC

On Fri, 21 May 2010 11:32:33 -0400 (EDT) Rick Macklem wrote:
> On Fri, 21 May 2010, Mark Morley wrote:
>> Having an issue with a file server here (7.3-STABLE i386)
>>
>> The nfsd processes are hanging.  Client access to the nfs shares stops
>> working and the nfsd processes on the server cannot be killed by any
>> means.  There are no errors showing up anywhere on the server.  The
>> network connection to the server seems fine (ie: anything other than nfs
>> traffic seems ok).  Rebooting the server fixes the problem for a while,
>> but it doesn't reboot easily.  It times out on terminating the nfsd
>> processes.  When it finally does reboot the file system isn't marked
>> clean, resulting in a long wait for fsck (although it doesn't find any
>> problems, it's a multi-terabyte share and it takes a while).
>>
>> This morning it did it again.  This time I tried manually killing nfsd
>> but nothing I did would make them die.  No errors.
>
> Next time it happens, do a "ps axlH" to see what the nfsd threads are
> waiting for. It might give you a hint as to what is happening.

Ok, it did it again.  ps axlH shows all the nfsd processes stuck in the _ufs_
state.  The server isn't doing anything else, no other processes seem to be
monopolizing resources or disks in any way.

rpcinfo doesn't show anything amiss as far as I can tell (ie: rpc is
running)

After a reboot, one of the 32 nfsd's almost immediately goes into the
"ufs" state and never leaves it (and never racks up any CPU time
either).  The others are fine.  Slowly over time more and more enter this state.
When I rebooted it today, all but one were in that state.  The clients were
bogging down, presumably because the one and only functioning nfsd was
overworked.
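
To see how quickly the threads pile up after a reboot, the count could be
logged at a fixed interval. A minimal sketch, assuming the same three-column
ps output (PID, wait channel, command name); count_stuck_nfsd and
log_stuck_nfsd are hypothetical helpers, and the 60 second default interval
is arbitrary.

```shell
#!/bin/sh
# Count nfsd threads blocked on one wait channel.
# stdin: rows of "PID MWCHAN COMMAND"; $1: wait channel name (default "ufs").
count_stuck_nfsd() {
    awk -v w="${1:-ufs}" '$3 == "nfsd" && $2 == w { c++ } END { print c + 0 }'
}

# Print one timestamped sample per interval ($1 seconds, default 60).
# Loops until interrupted; meant to be run on the FreeBSD server only.
log_stuck_nfsd() {
    while :; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
            "$(ps -axH -o pid,mwchan,comm | count_stuck_nfsd ufs)"
        sleep "${1:-60}"
    done
}
```

Redirecting the output of log_stuck_nfsd to a file gives a record that can be
lined up against client activity or cron jobs.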

One client is running 8.1-prerelease as a test, and only that client
starts getting lots of timeouts accessing the nfs share (even with less load
than the other clients).  In case it's tickling something on the server,
I've shut it down this time and I'm leaving it off for the time being.

Any further thoughts?

Mark

Rick Macklem

2010-May-26 00:43 UTC

On Tue, 25 May 2010, Mark Morley wrote:
> On Fri, 21 May 2010 11:32:33 -0400 (EDT) Rick Macklem wrote:
>> On Fri, 21 May 2010, Mark Morley wrote:
>>> Having an issue with a file server here (7.3-STABLE i386)
>>>
>>> The nfsd processes are hanging.  Client access to the nfs shares stops
>>> working and the nfsd processes on the server cannot be killed by any
>>> means.  There are no errors showing up anywhere on the server.  The
>>> network connection to the server seems fine (ie: anything other than
>>> nfs traffic seems ok).  Rebooting the server fixes the problem for a
>>> while, but it doesn't reboot easily.  It times out on terminating the
>>> nfsd processes.  When it finally does reboot the file system isn't
>>> marked clean, resulting in a long wait for fsck (although it doesn't
>>> find any problems, it's a multi-terabyte share and it takes a while).
>>>
>>> This morning it did it again.  This time I tried manually killing nfsd
>>> but nothing I did would make them die.  No errors.
>>
>> Next time it happens, do a "ps axlH" to see what the nfsd threads are
>> waiting for. It might give you a hint as to what is happening.
>
> Ok, it did it again.  ps axlH shows all the nfsd processes stuck in the
> _ufs_ state.  The server isn't doing anything else, no other processes
> seem to be monopolizing resources or disks in any way.
>
> rpcinfo doesn't show anything amiss as far as I can tell (ie: rpc is
> running)
>
> After a reboot, one of the 32 nfsd's almost immediately goes into the
> "ufs" state and never leaves it (and never racks up any CPU time
> either).  The others are fine.  Slowly over time more and more enter this
> state.  When I rebooted it today, all but one were in that state.  The
> clients were bogging down, presumably because the one and only functioning
> nfsd was overworked.

You could try this patch. (It reverts the only vnode locking change that I
can see was done to the nfs server between 7.1 and 7.3.):
--- nfs_serv.c.sav	2010-05-25 19:40:29.000000000 -0400
+++ nfs_serv.c	2010-05-25 19:41:38.000000000 -0400
@@ -3236,7 +3236,7 @@
  	io.uio_rw = UIO_READ;
  	io.uio_td = NULL;
  	eofflag = 0;
-	vn_lock(vp, LK_SHARED | LK_RETRY, td);
+	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
  	if (cookies) {
  		free((caddr_t)cookies, M_TEMP);
  		cookies = NULL;
@@ -3518,7 +3518,7 @@
  	io.uio_rw = UIO_READ;
  	io.uio_td = NULL;
  	eofflag = 0;
-	vn_lock(vp, LK_SHARED | LK_RETRY, td);
+	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
  	if (cookies) {
  		free((caddr_t)cookies, M_TEMP);
  		cookies = NULL;

If you get a chance to try it, please let us know if it helps.

rick

Mark Morley

2010-May-26 10:08 UTC

On Tue, 25 May 2010 20:59:08 -0400 (EDT) Rick Macklem wrote:
> You could try this patch. (It reverts the only vnode locking change that I
> can see was done to the nfs server between 7.1 and 7.3.):
> --- nfs_serv.c.sav 2010-05-25 19:40:29.000000000 -0400
> +++ nfs_serv.c 2010-05-25 19:41:38.000000000 -0400
> @@ -3236,7 +3236,7 @@
>    io.uio_rw = UIO_READ;
>    io.uio_td = NULL;
>    eofflag = 0;
> - vn_lock(vp, LK_SHARED | LK_RETRY, td);
> + vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
>    if (cookies) {
>     free((caddr_t)cookies, M_TEMP);
>     cookies = NULL;
> @@ -3518,7 +3518,7 @@
>    io.uio_rw = UIO_READ;
>    io.uio_td = NULL;
>    eofflag = 0;
> - vn_lock(vp, LK_SHARED | LK_RETRY, td);
> + vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
>    if (cookies) {
>     free((caddr_t)cookies, M_TEMP);
>     cookies = NULL;
>
> If you get a chance to try it, please let us know if it helps, rick

Thanks, but unfortunately it didn't work.  Rebooted it four hours ago with
the patch in place and at the moment I have seven nfsd processes stuck in that
state.

Could it indicate a problem with the underlying disk system?  It's an aac0
raid, but it has no errors and the controller indicates all is well, so I doubt
it.

Mark

Mark Morley

2010-May-27 19:08 UTC

On Tue, 25 May 2010 20:59:08 -0400 (EDT) Rick Macklem wrote:
> You could try this patch. (It reverts the only vnode locking change that I
> can see was done to the nfs server between 7.1 and 7.3.):
>
> .
> .
> .
>
> If you get a chance to try it, please let us know if it helps, rick

The patch didn't help I'm afraid.  I wound up reverting back to 7.1 and
after more than 24 hours I haven't seen a single stuck nfsd.

Mark

Rick Macklem

2010-May-27 23:57 UTC

On Wed, 26 May 2010, Mark Morley wrote:
> Thanks, but unfortunately it didn't work.  Rebooted it four hours ago
> with the patch in place and at the moment I have seven nfsd processes
> stuck in that state.
>
> Could it indicate a problem with the underlying disk system?  It's an
> aac0 raid, but it has no errors and the controller indicates all is well,
> so I doubt it.

Just about anything is possible. All we seem to know at this point is that
it is some change that went in between 7.1->7.3. It also doesn't appear to
be the only change that was done to the nfs server during this period.

Any change applied to the aac driver might be a factor, but??

Is anyone else seeing this problem (nfsd threads stuck in wchan "ufs")
on FreeBSD 7.3?

rick
