thr3ads.net - freebsd stable - NFS stalling on 8.1-STABLE [Aug 2010]

Mark Morley

2010-Aug-12 17:52 UTC

NFS stalling on 8.1-STABLE

Hi all,

I have five front end web servers that all mount their content from the same
server via NFS.  If I stress the link on any one of the machines (eg: copy a
large directory with a lot of files to/from the mounted file system) the client
will pause.  That is, all processes trying to access that mount will freeze.
The log files fill with hundreds or thousands of "nfs server not responding" /
"is alive again" messages. After 60 seconds it returns to normal, unless the load
is still there, in which case it continues to pause.

This has only started happening since I upgraded the client machines to
8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The server is
7.1-RELEASE-p11.  No other changes have taken place in terms of hardware or
software or mount options, etc.

All nics involved are gigabit em cards, and they are on a private network (web
access to the boxes is via an external interface).

If I truss a command such as "df", it gets to getfsstat() and
pauses there.

Mount options are currently
"rw,tcp,nolockd,noatime,nosuid,bg,intr,soft,rsize=32768,wsize=32768"
but I've tried all sorts of things and it doesn't seem to make a
difference.

Here's a sample output from nfsstat -c from one of the boxes (uptime 14
days):

Client Info:
Rpc Counts:
   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
  75552107   3008653 300569929    253365   2426554   4748471   2035545   3015497
    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
    864598     50887      7462     11895   1137933  16160386         0  31593291
     Mknod    Fsstat    Fsinfo  PathConf    Commit
         0  22510271         5         0   3569465
Rpc Info:
  TimedOut   Invalid X Replies   Retries  Requests
         0         0         0         0 467516377
Cache Info:
 Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
1461457650  75552057 963440449 300536041  37404178   2359677   9467719   4748471
 BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses
  14409992    253365  29508747  16119060  22292421     23233

Any thoughts?

Mark
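
The nfsstat counters in the message above can be turned into ratios, which are easier to compare between clients. The following is only a sketch: the numbers are copied by hand from Mark's output, and the dictionary keys and helper names are made up for illustration rather than produced by nfsstat itself.

```python
# Counters copied by hand from the "nfsstat -c" output above.
# The dictionary layout and key names are illustrative assumptions.
rpc_info = {"TimedOut": 0, "Invalid": 0, "X Replies": 0,
            "Retries": 0, "Requests": 467516377}

cache_info = {
    # name: (hits, misses)
    "attr": (1461457650, 75552057),
    "lookup": (963440449, 300536041),
    "bio_read": (37404178, 2359677),
    "bio_write": (9467719, 4748471),
}


def hit_ratio(hits, misses):
    """Fraction of cache lookups answered without going to the server."""
    total = hits + misses
    return hits / total if total else 0.0


def retry_fraction(info):
    """RPC retries as a fraction of all RPC requests."""
    return info["Retries"] / info["Requests"] if info["Requests"] else 0.0


ratios = {name: hit_ratio(h, m) for name, (h, m) in cache_info.items()}
retries = retry_fraction(rpc_info)

for name, value in ratios.items():
    print(f"{name:10s} hit ratio {value:.3f}")
print(f"retry fraction {retries:.6f}")
```

With these figures the RPC layer itself reports no timeouts and no retries over 14 days, so the counters alone do not show where the stalls come from.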

Rick Macklem

2010-Aug-15 21:11 UTC

> Hi all,
>
> I have five front end web servers that all mount their content from the
> same server via NFS.  If I stress the link on any one of the machines (eg: copy
> a large directory with a lot of files to/from the mounted file system) the
> client will pause.  That is, all processes trying to access that mount will
> freeze.  The log files with hundreds or thousands of nfs server not responding /
> is alive again messages. After 60 seconds it returns to normal, unless the load
> is still there in which case it continues to pause.
>
The 60sec delay suggests that the client is doing a TCP reconnect. I'd suggest
that you look at a packet trace in wireshark (it knows how to decode NFS
packets) and see if there are new TCP connections (SYN, SYN-ACK,...) being
made. If that is what is happening, I suspect it is NIC driver related, but it
is really hard to say.

If you can try a network interface of a different type (not em) that will check
to see if it is an em(4) issue.

Alternately, you could try turning off the TSO and checksum offload stuff for
the em(4) and see if that helps.

> This has only started happening since I upgraded the client machines to
> 8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The server is
> 7.1-RELEASE-p11.  No other changes have taken place in terms of hardware or
> software or mount options, etc.
>
There were some client side fixes between 8.0 and 8.1, but I don't think any
of those have caused a regression w.r.t. connections. There is a problem w.r.t.
the nfsd getting in a loop, but that wouldn't recover after 60sec. (If it
happens, the server has to be rebooted. There is a fix for this at:
   http://people.freebsd.org/~rmacklem/freebsd8.1-patches/replay.patch
but I don't think it is what you are seeing.)

rick
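
Rick's suggestion to look for new TCP connections can also be checked without a GUI. The sketch below assumes the trace was saved as text with something like "tcpdump -n" (numeric ports, so NFS shows up as port 2049) and that SYN packets are printed as "Flags [S]"; the sample lines, addresses and the function name are invented for illustration and are not taken from Mark's network.

```python
import re

# Matches a bare SYN (not SYN-ACK, which tcpdump prints as "[S.]").
# Assumes "tcpdump -n" text output; the exact format varies by version.
SYN_RE = re.compile(r"^(\S+) IP (\S+) > (\S+)\.(\d+): Flags \[S\]")


def new_connection_times(lines, port=2049):
    """Return timestamps of SYN packets sent to the given TCP port."""
    times = []
    for line in lines:
        m = SYN_RE.match(line)
        if m and int(m.group(4)) == port:
            times.append(m.group(1))
    return times


# Hypothetical capture: two connection attempts about 60 seconds apart.
sample = [
    "10:00:00.000100 IP 192.168.1.30.1013 > 192.168.1.1.2049: Flags [S], seq 1, win 65535, length 0",
    "10:00:00.000300 IP 192.168.1.1.2049 > 192.168.1.30.1013: Flags [S.], seq 9, ack 2, win 65535, length 0",
    "10:00:00.000400 IP 192.168.1.30.1013 > 192.168.1.1.2049: Flags [.], ack 10, win 8326, length 0",
    "10:01:00.120000 IP 192.168.1.30.1014 > 192.168.1.1.2049: Flags [S], seq 7, win 65535, length 0",
]

syn_times = new_connection_times(sample)
print(len(syn_times), "new NFS connection attempts at", syn_times)
```

If fresh SYNs toward the server line up with the stalls, that would support the reconnect theory; if there are none, the delay is coming from somewhere else.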

Jeremy Chadwick

2010-Aug-16 06:36 UTC

On Thu, Aug 12, 2010 at 10:35:49AM -0700, Mark Morley wrote:
> I have five front end web servers that all mount their content from
> the same server via NFS.  If I stress the link on any one of the
> machines (eg: copy a large directory with a lot of files to/from the
> mounted file system) the client will pause.  That is, all processes
> trying to access that mount will freeze.  The log files with hundreds
> or thousands of nfs server not responding / is alive again messages.
> After 60 seconds it returns to normal, unless the load is still there
> in which case it continues to pause.
> 
> This has only started happening since I upgraded the client machines
> to 8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The
> server is 7.1-RELEASE-p11.  No other changes have taken place in terms
> of hardware or software or mount options, etc.
> 
> All nics involved are gigabit em cards, and they are on a private
> network (web access to the boxes is via an external interface).
Are there any indications in dmesg that the NIC is responsible, e.g.
interface down/up, etc.?

Does switching to UDP-based NFS solve the problem for you?

What OS version (uname -a) and NIC are used on the NFS server?

Can you please provide the following output from one of the client
machines running 8.1-STABLE with gigE em(4)?  You can X-out machine
names, MAC addresses, and IP addresses/netblocks if need be.

* uname -a
* ifconfig emX  (where X is the interface number which would be
  used for NFS)
* netstat -idn -I emX
* pciconf -lvc  (provide only the data for emX please)
* vmstat -i
* sysctl hw.pci
* As root, run "sysctl dev.em.X.stats=1" then do "dmesg" and
  provide the output for NIC statistics (will start with "emX:")

Thanks.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
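
When the same set of diagnostics has to be gathered from several clients, the list Jeremy asks for can be generated per interface. This is a rough sketch: the commands are the ones named in his message, while the function names and the idea of splitting "em0" into a driver name and unit number are assumptions made here. The collect() helper is defined but not called, since it only makes sense on the FreeBSD box itself (and the stats sysctl needs root).

```python
import re
import subprocess


def diagnostic_commands(ifname):
    """Build the command list from the message above for one interface."""
    m = re.fullmatch(r"([a-z]+)(\d+)", ifname)
    if not m:
        raise ValueError(f"unexpected interface name: {ifname!r}")
    driver, unit = m.groups()
    return [
        ["uname", "-a"],
        ["ifconfig", ifname],
        ["netstat", "-idn", "-I", ifname],
        ["pciconf", "-lvc"],
        ["vmstat", "-i"],
        ["sysctl", "hw.pci"],
        # Dumps the NIC statistics into the kernel message buffer.
        ["sysctl", f"dev.{driver}.{unit}.stats=1"],
        ["dmesg"],
    ]


def collect(ifname):
    """Run each command and return {command line: output}. Not run here."""
    results = {}
    for cmd in diagnostic_commands(ifname):
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results[" ".join(cmd)] = proc.stdout + proc.stderr
    return results


commands = diagnostic_commands("em0")
for cmd in commands:
    print(" ".join(cmd))
```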

Mark Morley

2010-Aug-17 17:46 UTC

On Sun, 15 Aug 2010 17:11:01 -0400 (EDT) Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> Hi all,
>>
>> I have five front end web servers that all mount their content from the
>> same server via NFS.  If I stress the link on any one of the machines (eg: copy
>> a large directory with a lot of files to/from the mounted file system) the
>> client will pause.  That is, all processes trying to access that mount will
>> freeze.  The log files with hundreds or thousands of nfs server not responding /
>> is alive again messages. After 60 seconds it returns to normal, unless the load
>> is still there in which case it continues to pause.
>>
>
> The 60sec delay suggests that the client is doing a TCP reconnect. I'd
> suggest that you look at a packet trace in wireshark (it knows how to decode
> NFS packets) and see if there are new TCP connections (SYN, SYN-ACK,...)
> being made. If that is what is happening, I suspect it is NIC driver related,
> but it is really hard to say.

I'll try this if/when it happens again.

> If you can try a network interface of a different type (not em) that will
> check to see if it is an em(4) issue.

Unfortunately I don't have any non-em cards around.

> Alternately, you could try turning off the TSO and checksum offload stuff
> for the em(4) and see if that helps.

Hmm, interesting.  The four machines that seem to be working (so far) have these
enabled by default.  The fifth one has checksums enabled, but not TSO.
Doesn't appear to support it.

I also tried switching from TCP to UDP.  This seems to be working (so far) on
four of the clients (which happen to be identical load balanced machines), but
on the fifth one (which serves a different purpose) I'm getting something
really weird.  Instead of locking up periodically as before, it's actually
losing the mount.  For example, a 'df' doesn't include the mounted
system.  If I try to access the mounted system (with 'ls' for example) I
get an "Input / output error" message.  I can remount it, but only
after I force a dismount.

Mark
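
Which offload features are active on an interface can be read from the "options=" line of its ifconfig output. The sketch below parses the options line that Mark posts for em0 in his next message; the function name and the set of capability names treated as offloads (RXCSUM, TXCSUM, TSO4, TSO6, LRO) are choices made for this example.

```python
import re

# Capability names treated as "offload" for this check (an assumption
# for the example; ifconfig may list others).
OFFLOAD_FLAGS = {"RXCSUM", "TXCSUM", "TSO4", "TSO6", "LRO"}


def interface_options(ifconfig_text):
    """Return the set of capability names from an ifconfig options= line."""
    m = re.search(r"options=[0-9a-f]+<([^>]*)>", ifconfig_text)
    if not m:
        return set()
    return {name for name in m.group(1).split(",") if name}


# The options line posted for em0 later in this thread.
sample = ("options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,"
          "VLAN_HWCSUM,WOL_MAGIC>")

enabled_offloads = interface_options(sample) & OFFLOAD_FLAGS
print("offloads enabled:", sorted(enabled_offloads))
print("TSO enabled:", "TSO4" in enabled_offloads)
```

On that particular interface checksum offload is on and TSO is not listed, which matches the "checksums enabled, but not TSO" description above.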

Mark Morley

2010-Aug-17 17:53 UTC

On Sun, 15 Aug 2010 23:35:50 -0700 Jeremy Chadwick <freebsd@jdc.parodius.com> wrote:
>On Thu, Aug 12, 2010 at 10:35:49AM -0700, Mark Morley wrote:
>> I have five front end web servers that all mount their content from
>> the same server via NFS.  If I stress the link on any one of the
>> machines (eg: copy a large directory with a lot of files to/from the
>> mounted file system) the client will pause.  That is, all processes
>> trying to access that mount will freeze.  The log files with hundreds
>> or thousands of nfs server not responding / is alive again messages.
>> After 60 seconds it returns to normal, unless the load is still there
>> in which case it continues to pause.
>>
>> This has only started happening since I upgraded the client machines
>> to 8.1-STABLE (previously four of them were 8.0 and one was 7.3).  The
>> server is 7.1-RELEASE-p11.  No other changes have taken place in terms
>> of hardware or software or mount options, etc.
>>
>> All nics involved are gigabit em cards, and they are on a private
>> network (web access to the boxes is via an external interface).
>
>Are there any indications in dmesg that the NIC is responsible, e.g.
>interface down/up, etc.?
No, nothing like that.
>Does switching to UDP-based NFS solve the problem for you?
Trying that now for the past 24 hours or so.  Four of the machines seem ok so
far, but the fifth one has started dropping the mount entirely.  Access to it
gives an "Input / output error" message.  Forcing a dismount and
remounting brings it back.
>What OS version (uname -a) and NIC are used on the NFS server?
FreeBSD xxx 7.1-RELEASE-p11 FreeBSD 7.1-RELEASE-p11 #0: Wed May 26 03:20:59 PDT 2010 root@xxx:/usr/obj/usr/src/sys/CUSTOM  i386

NICs are em
>Can you please provide the following output from one of the client
>machines running 8.1-STABLE with gigE em(4)?  You can X-out machine
>names, MAC addresses, and IP addresses/netblocks if need be.
>
>* uname -a
FreeBSD xxx 8.1-STABLE FreeBSD 8.1-STABLE #0: Tue Jul 27 16:27:44 PDT 2010 root@xxx:/usr/obj/usr/src/sys/CUSTOM  amd64
>* ifconfig emX  (where X is the interface number which would be
>  used for NFS)
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
ether 00:0e:0c:85:d5:0d
inet 192.168.1.30 netmask 0xffffff00 broadcast 192.168.1.255
media: Ethernet 1000baseT <full-duplex>
status: active
>* netstat -idn -I emX
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll Drop
em0    1500 <Link#1>      00:0e:0c:85:d5:0d 39913814     2     0 39949943     0     0    0
em0    1500 192.168.1.0/2 192.168.1.30      39944016     -     - 39949664     -     -    -

>* pciconf -lvc  (provide only the data for emX please)
em0@pci0:1:6:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
vendor     = 'Intel Corporation'
device     = 'Gigabit Ethernet Controller (Copper) rev 5 (82541PI)'
class      = network
subclass   = ethernet
cap 01[dc] = powerspec 2  supports D0 D3  current D0
cap 07[e4] = PCI-X supports 2048 burst read, 1 split transaction

>* vmstat -i
interrupt                          total       rate
irq1: atkbd0                         239          0
irq16: em0                      36746591        883
irq18: em1                      12658607        304
irq21: ohci0                           2          0
irq22: ehci0                      528002         12
irq23: atapci1                   2334936         56
cpu0: timer                     83207296       2000
cpu1: timer                     83207289       2000
Total                          218682962       5256
>* sysctl hw.pci
hw.pci.usb_early_takeover: 1
hw.pci.honor_msi_blacklist: 1
hw.pci.enable_msix: 1
hw.pci.enable_msi: 1
hw.pci.do_power_resume: 1
hw.pci.do_power_nodriver: 0
hw.pci.enable_io_modes: 1
hw.pci.default_vgapci_unit: -1
hw.pci.host_mem_start: 2147483648
hw.pci.mcfg: 1
>* As root, run "sysctl dev.em.X.stats=1" then do "dmesg" and
>  provide the output for NIC statistics (will start with "emX:")
em0: Excessive collisions = 0
em0: Sequence errors = 0
em0: Defer count = 52
em0: Missed Packets = 0
em0: Receive No Buffers = 0
em0: Receive Length Errors = 0
em0: Receive errors = 1
em0: Crc errors = 1
em0: Alignment errors = 0
em0: Collision/Carrier extension errors = 0
em0: RX overruns = 0
em0: watchdog timeouts = 0
em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
em0: XON Rcvd = 54
em0: XON Xmtd = 0
em0: XOFF Rcvd = 54
em0: XOFF Xmtd = 0
em0: Good Packets Rcvd = 39915088
em0: Good Packets Xmtd = 39951839

Mark
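
The em(4) statistics dump above is easier to scan once the zero counters are filtered out. The following sketch parses a subset of the lines Mark posted and lists only the counters that are nonzero, leaving out the two "Good Packets" totals; the function names and the choice of which counters to ignore are assumptions made for this example.

```python
import re

# One counter per line, e.g. "em0: Crc errors = 1". Lines holding several
# "name = value" pairs (the MSIX IRQ line) do not match and are skipped.
STAT_RE = re.compile(r"^(em\d+): ([^=]+?) = (\d+)$")


def parse_em_stats(text):
    """Return {counter name: value} from the dmesg statistics lines."""
    stats = {}
    for line in text.splitlines():
        m = STAT_RE.match(line.strip())
        if m:
            stats[m.group(2)] = int(m.group(3))
    return stats


def nonzero(stats, ignore=("Good Packets Rcvd", "Good Packets Xmtd")):
    """Keep only counters that are nonzero and not plain traffic totals."""
    return {k: v for k, v in stats.items() if v and k not in ignore}


# Subset of the output posted above.
sample = """\
em0: Excessive collisions = 0
em0: Defer count = 52
em0: Missed Packets = 0
em0: Receive No Buffers = 0
em0: Receive errors = 1
em0: Crc errors = 1
em0: RX overruns = 0
em0: watchdog timeouts = 0
em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
em0: XON Rcvd = 54
em0: XOFF Rcvd = 54
em0: Good Packets Rcvd = 39915088
"""

interesting = nonzero(parse_em_stats(sample))
for name, value in interesting.items():
    print(f"{name}: {value}")
```

For this interface the only nonzero counters are a single receive/CRC error, 52 deferred transmits and 54 received XON/XOFF flow-control frames, against roughly 40 million good packets in each direction.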
