thr3ads.net - CentOS - [CentOS] Help needed with NFS issue [Apr 2012]

Steve Thompson

2012-Apr-17 21:40 UTC

[CentOS] Help needed with NFS issue

I have four NFS servers running on Dell hardware (PE2900) under CentOS 
5.7, x86_64. The number of NFS clients is about 170.

A few days ago, one of the four, with no apparent changes, stopped 
responding to NFS requests for two minutes every half an hour (approx). 
Let's call this "the hang". It has been doing this for four days now.
There are no log messages of any kind pertaining to this. The other three 
servers are fine, although they are less loaded. Between hangs, 
performance is excellent. Load is more or less constant, not peaky.

NFS clients do get the usual "not responding, still trying" message
during a hang.
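
One way to confirm the half-hour period is to pull the hang start times out of a client's syslog. The following is a minimal sketch, not a tested tool: it assumes the traditional syslog line layout (a 15-character "Mon DD HH:MM:SS" timestamp at the start of each line), and the function names, the year argument, and the 300-second grouping window are all illustrative choices.

```python
from datetime import datetime

# Substring of the client kernel message mentioned above.
MARKER = "not responding, still trying"


def hang_starts(log_lines, year=2012, min_gap_seconds=300):
    """Return the datetime of the first matching message of each hang.

    Messages closer than min_gap_seconds to the previous hang start are
    treated as part of the same hang. Syslog timestamps carry no year,
    so one is supplied by the caller (an assumption of this sketch).
    """
    starts = []
    for line in log_lines:
        if MARKER not in line:
            continue
        stamp = line[:15]  # e.g. "Apr 17 21:00:05"
        when = datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")
        if not starts or (when - starts[-1]).total_seconds() >= min_gap_seconds:
            starts.append(when)
    return starts


def intervals_minutes(starts):
    """Minutes between consecutive hang starts."""
    return [(b - a).total_seconds() / 60.0 for a, b in zip(starts, starts[1:])]
```

Feeding it the lines of a client's /var/log/messages and printing intervals_minutes(hang_starts(lines)) should show whether the gaps really cluster around 30 minutes or drift over time.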

There are no cron or other jobs that launch every half an hour.

All hardware on the affected server seems to be good. Disk volumes being 
served are RAID-5 sets with write-back cache enabled (BBU is good). RAID 
controller logs are free of errors.

The NFS servers use dual bonded gigabit links in balance-alb mode. Turning 
off one interface in the bond made no difference.

Relevant /etc/sysctl.conf parameters:

vm.dirty_ratio = 50
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
vm.min_free_kbytes = 65536
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
net.core.netdev_max_backlog = 25000
net.ipv4.tcp_reordering = 127
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_no_metrics_save = 1

The {r,w}mem_{max,default} values are twice what they were previously; 
changing these had no effect.

The number of dirty pages is nowhere near the dirty_ratio when the hangs 
occur; there may be only 50MB of dirty memory.
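
For anyone wanting to repeat that check, here is a rough sketch of how the dirty figure can be sampled during a hang. It reads the Dirty and MemTotal fields of /proc/meminfo and compares Dirty with dirty_ratio percent of MemTotal. That comparison is only an approximation (the kernel computes the threshold from a somewhat different memory total), and the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to integer values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_summary(meminfo, dirty_ratio=50):
    """Return (dirty_kb, approx_limit_kb, dirty as a percentage of the limit).

    The limit is approximated as dirty_ratio percent of MemTotal; the
    kernel's real threshold is based on a different total, so treat the
    result as a rough indicator only.
    """
    dirty_kb = meminfo["Dirty"]
    limit_kb = meminfo["MemTotal"] * dirty_ratio // 100
    return dirty_kb, limit_kb, 100.0 * dirty_kb / limit_kb


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling dirty_summary(read_meminfo()) once a second from a loop during a hang would show whether the dirty total ever gets close to the limit.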

A local process on the NFS server is reading from disk at around 40-50 
MB/sec on average; this continues unaffected during the hang, as do all 
other network services on the host (e.g. an LDAP server). During the hang 
the server seems to be quite snappy in all respects apart from NFS. The 
network itself is fine as far as I can tell, and all NFS-related processes 
on the server are intact.

NFS mounts on clients are made with UDP or TCP with no difference in 
results. A client mount cannot be completed ("timed out") and access
to an already NFS mounted volume stalls during the hang (both automounted
and manual mounts).

NFS block size is 32768 for both rsize and wsize; using 16384 makes no
difference.

Tcpdump shows no NFS packets exchanged between client and server during a 
hang.

I have not rebooted the affected server yet, but I have restarted NFS
with no change.

Help! I cannot figure out what is wrong, and I cannot find anything amiss. 
I'm running out of something but I don't know what it is (except perhaps
brains). Hints, please!

Steve

Ross Walker

2012-Apr-17 22:49 UTC

[CentOS] Help needed with NFS issue

On Apr 17, 2012, at 5:40 PM, Steve Thompson <smt at vgersoft.com> wrote:
> I have four NFS servers running on Dell hardware (PE2900) under CentOS 
> 5.7, x86_64. The number of NFS clients is about 170.
> 
> A few days ago, one of the four, with no apparent changes, stopped 
> responding to NFS requests for two minutes every half an hour (approx). 
> Let's call this "the hang". It has been doing this for four days
> now.
> There are no log messages of any kind pertaining to this. The other three 
> servers are fine, although they are less loaded. Between hangs, 
> performance is excellent. Load is more or less constant, not peaky.
> 
> NFS clients do get the usual "not responding, still trying" message
> during a hang.
> 
> There are no cron or other jobs that launch every half an hour.
> 
> All hardware on the affected server seems to be good. Disk volumes being 
> served are RAID-5 sets with write-back cache enabled (BBU is good). RAID 
> controller logs are free of errors.
> 
> NFS servers used dual bonded gigabit links in balance-alb mode. Turning 
> off one interface in the bond made no difference.
> 
> Relevant /etc/sysctl.conf parameters:
> 
> vm.dirty_ratio = 50
> vm.dirty_background_ratio = 1
> vm.dirty_expire_centisecs = 1000
> vm.dirty_writeback_centisecs = 100
> vm.min_free_kbytes = 65536
> net.core.rmem_default = 262144
> net.core.rmem_max = 262144
> net.core.wmem_default = 262144
> net.core.wmem_max = 262144
> net.core.netdev_max_backlog = 25000
> net.ipv4.tcp_reordering = 127
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216
> net.ipv4.tcp_max_syn_backlog = 8192
> net.ipv4.tcp_no_metrics_save = 1
> 
> The {r,w}mem_{max,default} values are twice what they were previously; 
> changing these had no effect.
> 
> The number of dirty pages is nowhere near the dirty_ratio when the hangs 
> occur; there may be only 50MB of dirty memory.
> 
> A local process on the NFS server is reading from disk at around 40-50 
> MB/sec on average; this continues unaffected during the hang, as do all 
> other network services on the host (eg an LDAP server). During the hang 
> the server seems to be quite snappy in all respects apart from NFS. The 
> network itself is fine as far as I can tell, and all NFS-related processes 
> on the server are intact.
> 
> NFS mounts on clients are made with UDP or TCP with no difference in 
> results. A client mount cannot be completed ("timed out") and access
> to an
> already NFS mounted volume stalls during the hang (both automounted and 
> manual mounts).
> 
> NFS block size is 32768 r and w; using 16384 makes no difference.
> 
> Tcpdump shows no NFS packets exchanged between client and server during a 
> hang.
> 
> I have not rebooted the affected server yet, but I have restarted NFS
> with no change.
> 
> Help! I cannot figure out what is wrong, and I cannot find anything amiss. 
> I'm running out of something but I don't know what it is (except perhaps
> brains). Hints, please!

Just a shot in the dark here.

Take a look at the NIC and switch port flow control status during an outage;
the ports may be paused due to switch load.

Is there anything else on the network switches that might flood them every half
hour for a two minute duration?
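
A rough sketch of how the NIC side could be checked: snapshot the output of `ethtool -S <iface>` before and during a hang and compare any flow-control counters. The counter names are driver-specific, so the keyword list below is a guess rather than a definitive set, and the function names are just for illustration. `ethtool -a <iface>` shows whether pause is negotiated at all.

```python
import subprocess

# Substrings that commonly appear in flow-control counter names.
# This list is an assumption; check your driver's actual counter names.
KEYWORDS = ("pause", "xon", "xoff", "flow")


def pause_counters(stats_text):
    """Extract flow-control-looking counters from `ethtool -S` output text."""
    counters = {}
    for line in stats_text.splitlines():
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if sep and value.isdigit() and any(k in name.lower() for k in KEYWORDS):
            counters[name] = int(value)
    return counters


def changed(before, after):
    """Return the counters whose value differs, mapped to the increase."""
    return {k: after[k] - before[k]
            for k in after if k in before and after[k] != before[k]}


def read_stats(iface):
    """Run `ethtool -S` on one interface and return its output as text."""
    return subprocess.check_output(["ethtool", "-S", iface]).decode()
```

Taking pause_counters(read_stats("eth0")) once between hangs and once during a hang (with "eth0" replaced by each slave interface of the bond) and passing both to changed() would show whether pause frames are arriving while NFS is stalled.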

-Ross
