thr3ads.net - freebsd stable - 9-STABLE -> NFS -> NetAPP: [Dec 2012]

Hub- Marketing

2012-Dec-19 04:58 UTC

9-STABLE -> NFS -> NetAPP:

I'm running a few servers sitting on top of a NetApp file server ...
everything runs great, but periodically I'm getting:

nfs_getpages: error 13
vm_fault: pager read error, pid 11355 (https)

errors on my screen ... not always the same pid ... the annoying part is that it
seems to always affect the same jail ... if I shut down all jails on
that physical server, everything shuts down except for that *one* jail, with a
ps listing looking like:

USER   PID %CPU %MEM    VSZ   RSS TT  STAT STARTED    TIME COMMAND
root  6670  0.0  0.0   9936  1372 ??  DsJ   3:00AM 0:00.01 newsyslog
root  6815  0.0  0.0   9936  1288 ??  DsJ   3:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root  8361  0.0  0.1 220740 11400 ??  DsJ   7:33PM 0:01.25 /usr/local/sbin/httpd -DNOHTTPACCEPT
www   8364  0.0  0.0      0     0 ??  ZJ    7:33PM 0:00.00 <defunct>
www  11866  0.0  0.1 318444 16792 ??  TJ    7:36PM 0:00.03 /usr/local/sbin/httpd -DNOHTTPACCEPT
www  11872  0.0  0.1 297964 14008 ??  TJ    7:36PM 0:00.01 /usr/local/sbin/httpd -DNOHTTPACCEPT
www  11873  0.0  0.1 306156 15028 ??  DEJ   7:36PM 0:00.02 /usr/local/sbin/httpd -DNOHTTPACCEPT
root 17190  0.0  0.0   9936  1240 ??  DsJ   8:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 24864  0.0  0.0   9936  1392 ??  DsJ   4:00AM 0:00.01 newsyslog
root 24910  0.0  0.0   9936  1336 ??  DsJ   4:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 29972  0.0  0.0   9936  1240 ??  DsJ   9:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 34221  0.0  0.0  51480  4332 ??  DsJ   4:47AM 0:00.02 sshd: root@pts/1 (sshd)
root 42452  0.0  0.0   9936  1296 ??  DsJ  10:00PM 0:00.01 newsyslog
root 42522  0.0  0.0   9936  1240 ??  DsJ  10:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 55179  0.0  0.0   9936  1296 ??  DsJ  11:00PM 0:00.01 newsyslog
root 55244  0.0  0.0   9936  1240 ??  DsJ  11:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 67592  0.0  0.0   9936  1336 ??  DsJ  12:00AM 0:00.01 newsyslog
root 67762  0.0  0.0   9936  1288 ??  DsJ  12:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 81603  0.0  0.0   9936  1340 ??  DsJ   1:00AM 0:00.01 newsyslog
root 81640  0.0  0.0   9936  1284 ??  DsJ   1:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 93792  0.0  0.0   9936  1344 ??  DsJ   2:00AM 0:00.01 newsyslog
root 93815  0.0  0.0   9936  1288 ??  DsJ   2:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 34228  0.0  0.0  67960  4464  1  Ds+J  4:47AM 0:00.00 sshd: root@pts/1 (sshd)
root 38473  0.0  0.0  17556  3272  3  SJ    4:53AM 0:00.02 /bin/tcsh
root 38475  0.0  0.0  14212  1512  3  R+J   4:53AM 0:00.00 ps aux

I can do a 'jexec <JID> /bin/tcsh' to get into the jail, and I can
run ps commands, etc ... I just can't get those processes to shut down ...
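
The STAT column in the listing above already says a fair amount about why those processes will not go away. The sketch below decodes the flags as documented in FreeBSD's ps(1); the helper name decode_stat is made up for illustration and is not part of any tool.

```sh
#!/bin/sh
# Hypothetical helper: print one line per flag in a FreeBSD ps STAT value.
# Flag meanings follow ps(1); flags not listed here are reported as such.
decode_stat() {
    s="$1"
    while [ -n "$s" ]; do
        c="${s%"${s#?}"}"   # first character of what is left
        s="${s#?}"          # drop it for the next round
        case "$c" in
            D) echo "D: uninterruptible (disk or other short-term) wait" ;;
            E) echo "E: trying to exit" ;;
            I) echo "I: idle (sleeping longer than about 20 seconds)" ;;
            J) echo "J: running in a jail" ;;
            R) echo "R: runnable" ;;
            S) echo "S: sleeping for less than about 20 seconds" ;;
            T) echo "T: stopped" ;;
            Z) echo "Z: zombie" ;;
            s) echo "s: session leader" ;;
            +) echo "+: in the foreground process group of its terminal" ;;
            *) echo "$c: not covered by this sketch, see ps(1)" ;;
        esac
    done
}

decode_stat DsJ
```

Most of the leftovers are in state D, which signals do not interrupt, so a kill (including the one issued at jail shutdown) has no effect until whatever they are waiting on completes.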

everything within the jail is 'up to date' ... both the userland and
ports ... I've checked over the NetApp, but everything appears fine, and it
only seems to repeatedly affect that one jail, on that same physical server ...

I have no idea what to look at or how to debug this ... thoughts?  help?

thx

Rick Macklem

2012-Dec-19 22:56 UTC

Hub-Marketing wrote:
> I'm running a few servers sitting on top of a NetAPP file server ...
> everything runs great, but periodically I'm getting:
> 
> nfs_getpages: error 13
> vm_fault: pager read error, pid 11355 (https)

13 is EACCES. This message means that the NetApp server is
replying EACCES to a read for a pagein. I notice that both
root and www are running the executable. (Also, root is often
mapped to something like "nobody" in the NFS server.)

You could try making sure the httpd executable file has r-x
permissions for all users (chmod 555 httpd).
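
A minimal sketch of that check, assuming it is run inside the affected jail by the file's owner or by a root that the filer does not squash. The function name make_world_rx is hypothetical; the httpd path in the usage comment is the one shown in the ps listing.

```sh
#!/bin/sh
# Hypothetical helper: show a file's mode, set it to r-xr-xr-x, show it again.
make_world_rx() {
    ls -l "$1"        # mode, owner and group before the change
    chmod 555 "$1"    # read and execute for user, group and other
    ls -l "$1"        # confirm the result
}

# Usage (path taken from the ps listing in this thread):
#   make_world_rx /usr/local/sbin/httpd
```

Printing the listing before and after also records who owns the file, which matters if root is mapped to another user on the NFS server.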

If it still keeps happening once you've done that, you'd need
to capture packets when this happens and take a look at the
NFS RPCs via wireshark to see when the EACCES is returned and
what <uid, gids> are sent in the credentials for that Read.
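
One way to set up such a capture is sketched below. The interface name em0 and the output path are assumptions, not details from this thread; 192.168.1.254 is the filer address from the fstab line posted later in the thread, and 2049 is the standard NFS port. The hypothetical helper nfs_capture_cmd only prints the tcpdump command, so it can be reviewed before being run as root on the client.

```sh
#!/bin/sh
# Hypothetical helper: print a tcpdump command line for capturing NFS traffic.
#   $1 = network interface, $2 = NFS server address, $3 = capture file
nfs_capture_cmd() {
    # -s 0 keeps whole packets so the RPC credentials are not truncated;
    # -w writes raw packets to a file that wireshark can open later.
    printf 'tcpdump -i %s -s 0 -w %s host %s and port 2049\n' "$1" "$3" "$2"
}

# em0 and the pcap path are placeholders; adjust them for the real client.
nfs_capture_cmd em0 192.168.1.254 /tmp/nfs-eacces.pcap
```

Once the error has shown up on the console, stop the capture and open the file in wireshark with the display filter "nfs" to find the Read reply carrying the error and the credentials on the matching call.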

rick
> errors on my screen ... not always same pid ... the annoying part is that
> it seems to always affect the same jail that is running .. if I
> shutdown all jails on that physical server, everything shuts down
> except for that *one* jail, with a ps listing looking like:
> 
> USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
> root 6670 0.0 0.0 9936 1372 ?? DsJ 3:00AM 0:00.01 newsyslog
> root 6815 0.0 0.0 9936 1288 ?? DsJ 3:00AM 0:00.01 /usr/sbin/newsyslog
> -f /usr/local/etc/rotate_logs.cfg
> root 8361 0.0 0.1 220740 11400 ?? DsJ 7:33PM 0:01.25
> /usr/local/sbin/httpd -DNOHTTPACCEPT
> www 8364 0.0 0.0 0 0 ?? ZJ 7:33PM 0:00.00 <defunct>
> www 11866 0.0 0.1 318444 16792 ?? TJ 7:36PM 0:00.03
> /usr/local/sbin/httpd -DNOHTTPACCEPT
> www 11872 0.0 0.1 297964 14008 ?? TJ 7:36PM 0:00.01
> /usr/local/sbin/httpd -DNOHTTPACCEPT
> www 11873 0.0 0.1 306156 15028 ?? DEJ 7:36PM 0:00.02
> /usr/local/sbin/httpd -DNOHTTPACCEPT
> root 17190 0.0 0.0 9936 1240 ?? DsJ 8:00PM 0:00.01 /usr/sbin/newsyslog
> -f /usr/local/etc/rotate_logs.cfg
> root 24864 0.0 0.0 9936 1392 ?? DsJ 4:00AM 0:00.01 newsyslog
> root 24910 0.0 0.0 9936 1336 ?? DsJ 4:00AM 0:00.01 /usr/sbin/newsyslog
> -f /usr/local/etc/rotate_logs.cfg
> root 29972 0.0 0.0 9936 1240 ?? DsJ 9:00PM 0:00.01 /usr/sbin/newsyslog
> -f /usr/local/etc/rotate_logs.cfg
> root 34221 0.0 0.0 51480 4332 ?? DsJ 4:47AM 0:00.02 sshd: root@pts/1
> (sshd)
> root 42452 0.0 0.0 9936 1296 ?? DsJ 10:00PM 0:00.01 newsyslog
> root 42522 0.0 0.0 9936 1240 ?? DsJ 10:00PM 0:00.01
> /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 55179 0.0 0.0 9936 1296 ?? DsJ 11:00PM 0:00.01 newsyslog
> root 55244 0.0 0.0 9936 1240 ?? DsJ 11:00PM 0:00.01
> /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 67592 0.0 0.0 9936 1336 ?? DsJ 12:00AM 0:00.01 newsyslog
> root 67762 0.0 0.0 9936 1288 ?? DsJ 12:00AM 0:00.01
> /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 81603 0.0 0.0 9936 1340 ?? DsJ 1:00AM 0:00.01 newsyslog
> root 81640 0.0 0.0 9936 1284 ?? DsJ 1:00AM 0:00.01 /usr/sbin/newsyslog
> -f /usr/local/etc/rotate_logs.cfg
> root 93792 0.0 0.0 9936 1344 ?? DsJ 2:00AM 0:00.01 newsyslog
> root 93815 0.0 0.0 9936 1288 ?? DsJ 2:00AM 0:00.01 /usr/sbin/newsyslog
> -f /usr/local/etc/rotate_logs.cfg
> root 34228 0.0 0.0 67960 4464 1 Ds+J 4:47AM 0:00.00 sshd: root@pts/1
> (sshd)
> root 38473 0.0 0.0 17556 3272 3 SJ 4:53AM 0:00.02 /bin/tcsh
> root 38475 0.0 0.0 14212 1512 3 R+J 4:53AM 0:00.00 ps aux
> 
> I can do a 'jexec <JID> /bin/tcsh' to get into the jail, I can perform
> ps commands, etc ... I just can't get those processes to shutdown ...
> 
> everything within the jail is 'up to date' ... updates the userland and
> ports ... I've checked over the NetApp, but everything appears fine, and
> it only seems to repeatedly affect that one jail, on that same
> physical server ...
> 
> I have no ideas on what / how to debug this ... thoughts? help?
> 
> thx
> 
> 
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to
> "freebsd-stable-unsubscribe at freebsd.org"

John Baldwin

2013-Jan-19 12:57 UTC

On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
> I'm running a few servers sitting on top of a NetAPP file server ...
> everything runs great, but periodically I'm getting:
> 
> nfs_getpages: error 13
> vm_fault: pager read error, pid 11355 (https)

Are you using interruptible mounts ("intr" mount option)?

Also, can you get ps output that includes the 'l' flag to show what
the processes are stuck on?

-- 
John Baldwin

Marc Fournier

2013-Feb-10 06:06 UTC

Thanks ...


# procstat -kk 64529
  PID    TID COMM             TDNAME           KSTACK
64529 100963 du               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5cb nfs_lock1+0x4a VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x54f nfs_lookup+0x17e lookup+0x42f namei+0x4ac vn_open_cred+0x3bd kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7
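
The KSTACK column is easier to read with one frame per line and the offsets removed. A small sketch using only tr and sed follows; the function name kstack_frames is made up for illustration, and the stack string is copied from the output above.

```sh
#!/bin/sh
# Hypothetical helper: read a procstat -kk KSTACK string on stdin and print
# one function name per line, innermost frame first, without the +0x offsets.
kstack_frames() {
    tr ' ' '\n' | sed -e '/^$/d' -e 's/+0x[0-9a-f]*$//'
}

# KSTACK for pid 64529 (du), as posted in this message.
kstack='mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5cb nfs_lock1+0x4a VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x54f nfs_lookup+0x17e lookup+0x42f namei+0x4ac vn_open_cred+0x3bd kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7'

printf '%s\n' "$kstack" | kstack_frames
```

Read from the bottom up, du is in an open call (kern_openat, namei, nfs_lookup) and has gone to sleep in the lock manager via nfs_lock1, so it is waiting for a vnode lock on the NFS mount that some other thread is holding.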

On 2013-02-09, at 9:58 PM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> Off-list:
> 
> Marc,
> 
> You may want to also provide output from "procstat -kk 64529", as this
> will give a full thread calling stack.
> 
> The -kk (double-kay) is not a typo.  :-)
> 
> -- 
> | Jeremy Chadwick                                   jdc at koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Mountain View, CA, US                                            |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |
> 
> On Sat, Feb 09, 2013 at 09:29:30PM -0800, Marc Fournier wrote:
>> 
>> Hi John ...
>> 
>>   Does this help?
>> 
>> root@io:~ # ps auxl | grep du
>> root     1054   0.0  0.1  16176  6600 ??  D     3:15AM     0:05.38 du -skx /vm/2799     0 81426   0  20  0 newnfs
>> root    12353   0.0  0.1  16176  5104 ??  D    Sat03AM     0:05.41 du -skx /vm/2799     0 91597   0  20  0 newnfs
>> root    64529   0.0  0.1  16176  5164 ??  D    Fri03AM     0:05.40 du -skx /vm/2799     0 43227   0  20  0 newnfs
>> root    12855   0.0  0.0  16308  1988  0  S+    5:26AM     0:00.00 grep du              0 12847   0  20  0 piperd
>> root@io:~ # grep vm /etc/fstab
>> 192.168.1.254:/vol/basic /vm            nfs     rw,nolockd,intr 0       0
>> 
>> Haven't rebooted yet ... if there is anything I can do / try before ... ?
>> 
>> The kernel is from Jan 21st ...
>> 
>> 
>> On 2013-01-19, at 4:57 AM, John Baldwin <jhb at freebsd.org> wrote:
>> 
>>> On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
>>>> I'm running a few servers sitting on top of a NetAPP file server ...
>>>> everything runs great, but periodically I'm getting:
>>>> 
>>>> nfs_getpages: error 13
>>>> vm_fault: pager read error, pid 11355 (https)
>>> 
>>> Are you using interruptible mounts ("intr" mount option)?
>>> 
>>> Also, can you get ps output that includes the 'l' flag to show what
>>> the processes are stuck on?
>>> 
>>> -- 
>>> John Baldwin
