Alfred von Campe
2009-Mar-11 21:23 UTC
[CentOS] Intermittent NFS problems with NetApp server
I've been experiencing some intermittent problems accessing a NetApp server via NFS and automount. I'm running CentOS 5.2 (fully updated) on all my servers and workstations. Usually everything works just fine, and then suddenly we get the following error:

    /bin/sh: /home/epd/srcref/swtools/Crontabs/run_release_requests.sh: Permission denied

This is actually an email from cron, because we try to run that shell script every minute (yes, the crontab entry is * * * * * /home/epd/srcref/swtools/Crontabs/run_release_requests.sh), and /home/epd is an automounted directory. Here is its map entry:

    epd -rw,nointr,rsize=32768,wsize=32768 XXXXXX:/epd

When this is happening, other users can successfully access that directory on the server. The directory is actually mounted correctly, and unmounting doesn't fix the issue. Furthermore, the same user that is being denied access can successfully access that directory on a different server. The problem usually lasts about 20 minutes and then resolves itself. We have been pulling our hair out trying to debug this problem, because it's intermittent and the debug window is fairly short.

Recently we have been getting help from one of the NetApp admins, and he ran a command on the NetApp that produced the following warning:

    The TCP receive window advertised by NFS client XXXXXXX is 5888.
    This is less than the recommended value of 32768 bytes.
    You should increase the TCP receive buffer size for NFS on the client.

Some googling around got me to check these values for TCP:

    # sysctl net.ipv4.tcp_mem
    net.ipv4.tcp_mem = 98304 131072 196608
    # sysctl net.ipv4.tcp_rmem
    net.ipv4.tcp_rmem = 4096 87380 4194304
    # sysctl net.ipv4.tcp_wmem
    net.ipv4.tcp_wmem = 4096 16384 4194304

So these seem fine to me (i.e., the max is greater than 32768). Is there an NFS (as opposed to TCP) setting I should be tweaking? Any ideas why the NetApp is issuing those warnings? Any other suggestions on how to debug this problem?

Thanks,
Alfred
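The comparison against 32768 in the message above looks only at the maximum. As a rough sketch (Python 3, not something the poster ran), each of the three fields in a sysctl line can be checked against the recommended value. The function names here are hypothetical, and the check says nothing about why the client advertised a window of 5888.

```python
RECOMMENDED_WINDOW = 32768  # bytes, taken from the NetApp warning quoted above


def parse_tcp_triple(line):
    """Parse a line such as 'net.ipv4.tcp_rmem = 4096 87380 4194304'.

    Returns the (min, default, max) values as integers.
    """
    _, _, values = line.partition("=")
    low, default, high = (int(v) for v in values.split())
    return low, default, high


def below_recommended(line, threshold=RECOMMENDED_WINDOW):
    """Return the names of the fields that are smaller than the threshold."""
    names = ("min", "default", "max")
    return [n for n, v in zip(names, parse_tcp_triple(line)) if v < threshold]


if __name__ == "__main__":
    # The two lines below are copied from the sysctl output in the post.
    for line in ("net.ipv4.tcp_rmem = 4096 87380 4194304",
                 "net.ipv4.tcp_wmem = 4096 16384 4194304"):
        print(line.split()[0], below_recommended(line))
```

With the posted values, only the minimum of tcp_rmem is below 32768, while both the minimum and the default of tcp_wmem are.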
Louis Lagendijk
2009-Mar-11 22:13 UTC
[CentOS] Intermittent NFS problems with NetApp server
On Wed, 2009-03-11 at 17:23 -0400, Alfred von Campe wrote:
> # sysctl net.ipv4.tcp_mem
> net.ipv4.tcp_mem = 98304 131072 196608
> # sysctl net.ipv4.tcp_rmem
> net.ipv4.tcp_rmem = 4096 87380 4194304
> # sysctl net.ipv4.tcp_wmem
> net.ipv4.tcp_wmem = 4096 16384 4194304
>
> So these seem fine to me (i.e., the max is greater than 32768). Is
> there an NFS (as opposed to TCP) setting I should be tweaking? Any
> ideas why the NetApp is issuing those warnings? Any other
> suggestions on how to debug this problem?

man nfs
man mount.nfs
cat /proc/mounts

Louis
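The point of looking at /proc/mounts is that it shows the options the kernel actually applied, which can differ from the automount map entry. A minimal Python 3 sketch of pulling out the NFS options per mount point follows. The sample line is hypothetical (it is not output from the poster's machine), and the function name is an illustration only.

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to a dict of its effective mount options.

    mounts_text is the content of /proc/mounts: one mount per line with the
    fields device, mount point, filesystem type, options, dump, pass.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            # Flags without a value (rw, hard, nointr) are stored as True.
            opts[key] = value if sep else True
        result[fields[1]] = opts
    return result


if __name__ == "__main__":
    # Hypothetical sample line; on a real client, read /proc/mounts instead.
    sample = ("XXXXXX:/epd /home/epd nfs "
              "rw,vers=3,rsize=32768,wsize=32768,hard,nointr,proto=tcp 0 0\n")
    for mount_point, opts in nfs_mount_options(sample).items():
        print(mount_point, opts.get("proto"), opts.get("rsize"), opts.get("wsize"))
```

On a live system the same function could be fed `open("/proc/mounts").read()`, which would also show whether the mount is running over TCP or UDP.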
Ray Van Dolson
2009-Mar-11 22:15 UTC
[CentOS] Intermittent NFS problems with NetApp server
<snip>
> So these seem fine to me (i.e., the max is greater than 32768). Is
> there an NFS (as opposed to TCP) setting I should be tweaking? Any
> ideas why the NetApp is issuing those warnings? Any other
> suggestions on how to debug this problem?

Sounds like a very interesting problem. The only times I've gotten such errors have been with NFSv4 issues between Linux and Solaris hosts, never with a NetApp.

You might try asking on the linux-nfs[1] list as well as the toasters[2] list. I'd be interested to hear what you come up with. Very strange symptoms though.

Are you using NFS over TCP or UDP? It seems like one side is attempting to use a stale session...

I've always found NFS stuff like this very difficult to troubleshoot. If you can reproduce the problem on demand, maybe you could get a packet dump right when the issue begins...

Ray

[1]: http://vger.kernel.org/majordomo-info.html
[2]: http://toasters.mathworks.com/toasters.html
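Since the failure cannot be triggered on demand, one way to narrow down "right when the issue begins" is to poll the script's permissions and log each change of state. The following is only a sketch in Python 3: the function names are hypothetical, and it assumes that an access check from the affected user sees the same denial that cron reports, which may not hold if the client answers from cached attributes.

```python
import os
import time


def can_execute(path):
    """True if the current user can read and execute path right now."""
    return os.access(path, os.R_OK | os.X_OK)


def watch(path, interval=5.0, checks=None, check=can_execute,
          on_change=print, sleep=time.sleep):
    """Poll path and report each transition between accessible and denied.

    checks limits the number of polls (None means run until interrupted).
    """
    last = None
    count = 0
    while checks is None or count < checks:
        state = check(path)
        if state != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            on_change("%s %s accessible=%s" % (stamp, path, state))
            last = state
        count += 1
        if checks is None or count < checks:
            sleep(interval)
```

Run as the affected user against /home/epd/srcref/swtools/Crontabs/run_release_requests.sh, the timestamps of the transitions could then be matched against a packet capture or the NetApp's logs.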
On Mar 11, 2009, at 5:23 PM, Alfred von Campe <alfred at von-campe.com> wrote:
<snip>
> Some googling around got me to check these values for TCP:
>
> # sysctl net.ipv4.tcp_mem
> net.ipv4.tcp_mem = 98304 131072 196608
> # sysctl net.ipv4.tcp_rmem
> net.ipv4.tcp_rmem = 4096 87380 4194304
> # sysctl net.ipv4.tcp_wmem
> net.ipv4.tcp_wmem = 4096 16384 4194304
>
> So these seem fine to me (i.e., the max is greater than 32768). Is
> there an NFS (as opposed to TCP) setting I should be tweaking? Any
> ideas why the NetApp is issuing those warnings? Any other
> suggestions on how to debug this problem?

Run a headers-only tcpdump of the NFS mount, from the time of the mount until the problem occurs, then use Wireshark to analyze it.

Maybe page cache is putting too much pressure on TCP buffering, so you need to increase the minimum buffer size?

-Ross
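A headers-only capture means limiting the snapshot length so that only the start of each packet is stored. The sketch below (Python 3) merely assembles such a tcpdump command line. The interface name, the snapshot length of 256 bytes, and the output file name are assumptions chosen for illustration, and the optional -C/-W rotation flags assume the installed tcpdump supports them. Filtering on the server's host name rather than a port keeps the mount and portmapper traffic in the capture.

```python
import shlex


def tcpdump_command(server, interface="eth0", snaplen=256,
                    outfile="nfs-headers.pcap", rotate_mb=None, max_files=None):
    """Build an argument list for a truncated ("headers-only") capture.

    -i selects the interface, -s sets the snapshot length in bytes and
    -w writes raw packets to a file that Wireshark can open.
    """
    cmd = ["tcpdump", "-i", interface, "-s", str(snaplen), "-w", outfile]
    if rotate_mb is not None:
        cmd += ["-C", str(rotate_mb)]    # start a new file every rotate_mb MB
        if max_files is not None:
            cmd += ["-W", str(max_files)]  # keep at most max_files files
    cmd += ["host", server]
    return cmd


if __name__ == "__main__":
    # XXXXXX stands for the redacted NetApp host name from the original post.
    print(" ".join(shlex.quote(a) for a in tcpdump_command("XXXXXX")))
```

Because the problem can take a long time to appear, bounding the capture with rotate_mb and max_files is one way to keep disk usage under control. The right snapshot length depends on how much of the RPC header is needed.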