I'm running a few servers sitting on top of a NetApp file server ... everything runs great, but periodically I'm getting:

nfs_getpages: error 13
vm_fault: pager read error, pid 11355 (https)

errors on my screen ... not always the same pid ... the annoying part is that it seems to always affect the same jail that is running .. if I shut down all jails on that physical server, everything shuts down except for that *one* jail, with a ps listing looking like:

USER   PID %CPU %MEM    VSZ   RSS TT STAT STARTED    TIME COMMAND
root  6670  0.0  0.0   9936  1372 ?? DsJ   3:00AM 0:00.01 newsyslog
root  6815  0.0  0.0   9936  1288 ?? DsJ   3:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root  8361  0.0  0.1 220740 11400 ?? DsJ   7:33PM 0:01.25 /usr/local/sbin/httpd -DNOHTTPACCEPT
www   8364  0.0  0.0      0     0 ?? ZJ    7:33PM 0:00.00 <defunct>
www  11866  0.0  0.1 318444 16792 ?? TJ    7:36PM 0:00.03 /usr/local/sbin/httpd -DNOHTTPACCEPT
www  11872  0.0  0.1 297964 14008 ?? TJ    7:36PM 0:00.01 /usr/local/sbin/httpd -DNOHTTPACCEPT
www  11873  0.0  0.1 306156 15028 ?? DEJ   7:36PM 0:00.02 /usr/local/sbin/httpd -DNOHTTPACCEPT
root 17190  0.0  0.0   9936  1240 ?? DsJ   8:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 24864  0.0  0.0   9936  1392 ?? DsJ   4:00AM 0:00.01 newsyslog
root 24910  0.0  0.0   9936  1336 ?? DsJ   4:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 29972  0.0  0.0   9936  1240 ?? DsJ   9:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 34221  0.0  0.0  51480  4332 ?? DsJ   4:47AM 0:00.02 sshd: root@pts/1 (sshd)
root 42452  0.0  0.0   9936  1296 ?? DsJ  10:00PM 0:00.01 newsyslog
root 42522  0.0  0.0   9936  1240 ?? DsJ  10:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 55179  0.0  0.0   9936  1296 ?? DsJ  11:00PM 0:00.01 newsyslog
root 55244  0.0  0.0   9936  1240 ?? DsJ  11:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 67592  0.0  0.0   9936  1336 ?? DsJ  12:00AM 0:00.01 newsyslog
root 67762  0.0  0.0   9936  1288 ?? DsJ  12:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 81603  0.0  0.0   9936  1340 ?? DsJ   1:00AM 0:00.01 newsyslog
root 81640  0.0  0.0   9936  1284 ?? DsJ   1:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 93792  0.0  0.0   9936  1344 ?? DsJ   2:00AM 0:00.01 newsyslog
root 93815  0.0  0.0   9936  1288 ?? DsJ   2:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
root 34228  0.0  0.0  67960  4464 1  Ds+J  4:47AM 0:00.00 sshd: root@pts/1 (sshd)
root 38473  0.0  0.0  17556  3272 3  SJ    4:53AM 0:00.02 /bin/tcsh
root 38475  0.0  0.0  14212  1512 3  R+J   4:53AM 0:00.00 ps aux

I can do a 'jexec <JID> /bin/tcsh' to get into the jail, I can perform ps commands, etc ... I just can't get those processes to shut down ...

everything within the jail is 'up to date' ... updates the userland and ports ... I've checked over the NetApp, but everything appears fine, and it only seems to repeatedly affect that one jail, on that same physical server ...

I have no ideas on what / how to debug this ... thoughts? help?

thx
Hub-Marketing wrote:
> I'm running a few servers sitting on top of a NetApp file server ...
> everything runs great, but periodically I'm getting:
>
> nfs_getpages: error 13
> vm_fault: pager read error, pid 11355 (https)
>
13 is EACCES. This message means that the NetApp server is replying EACCES to a read for a pagein. I notice that both root and www are running the executable. (Also, root is often mapped to something like "nobody" in the NFS server.)

You could try making sure the httpd executable file has r-x permissions for all users (chmod 555 httpd).

If it still keeps happening once you've done that, you'd need to capture packets when this happens and take a look at the NFS RPCs via wireshark to see when the EACCES is returned and what <uid, gids> are sent in the credentials for that Read.

rick

> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
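Rick's two suggestions (check the permissions, then capture the NFS traffic) could be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not something posted in the thread: the helper name world_rx is made up for illustration, the binary path is the one shown in the ps listing, and the capture interface (em0) and output file name are guesses. The server address is the one from the fstab line quoted later in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeed if "other" users have
# both read and execute permission on the file given as $1.
world_rx() {
    # Characters 8-10 of the ls -l mode string are the "other" permissions.
    perms=$(ls -ld -- "$1" | cut -c8-10)
    case "$perms" in
        r?[xt]) return 0 ;;   # r-x, rwx, or a "t" (sticky plus execute)
        *)      return 1 ;;
    esac
}

# Usage sketch, run inside the affected jail (path taken from the ps listing):
#   world_rx /usr/local/sbin/httpd || chmod 555 /usr/local/sbin/httpd
#
# If the errors continue, capture the NFS traffic for inspection in wireshark.
# The interface name em0 and the output path are assumptions:
#   tcpdump -i em0 -s 0 -w /tmp/nfs-eacces.pcap host 192.168.1.254 and port 2049
```

Parsing the ls mode string instead of calling stat keeps the check portable, since the stat command line differs between FreeBSD and other systems.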
On Tuesday, December 18, 2012 11:58:36 PM Hub-Marketing wrote:
> I'm running a few servers sitting on top of a NetApp file server ...
> everything runs great, but periodically I'm getting:
>
> nfs_getpages: error 13
> vm_fault: pager read error, pid 11355 (https)

Are you using interruptible mounts ("intr" mount option)?

Also, can you get ps output that includes the 'l' flag to show what the processes are stuck on?

-- 
John Baldwin
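Both of John's questions can be answered with a couple of small filters. The sketch below is illustrative only: the function names are hypothetical, and it uses "ps -o pid,stat,wchan,command" rather than the 'l' flag John asks for, purely so the column positions are fixed. That substitution is an assumption of this sketch, not something from the thread.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not from the thread.

# Read fstab-format lines on stdin and print the mount point of every NFS
# entry whose option list contains "intr" (but not "nointr").
nfs_intr_mounts() {
    awk '$1 !~ /^#/ && $3 == "nfs" && $4 ~ /(^|,)intr(,|$)/ { print $2 }'
}

# Read "ps -ax -o pid,stat,wchan,command" output on stdin and keep only the
# processes whose state starts with D (uninterruptible wait). The third
# column is then the wait channel the process is sleeping on.
stuck_procs() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Usage sketch:
#   nfs_intr_mounts < /etc/fstab
#   ps -ax -o pid,stat,wchan,command | stuck_procs
```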
Thanks ...

# procstat -kk 64529
  PID    TID COMM             TDNAME           KSTACK
64529 100963 du               -                mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5cb nfs_lock1+0x4a VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x54f nfs_lookup+0x17e lookup+0x42f namei+0x4ac vn_open_cred+0x3bd kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7

On 2013-02-09, at 9:58 PM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> Off-list:
>
> Marc,
>
> You may want to also provide output from "procstat -kk 64529", as this
> will give a full thread calling stack.
>
> The -kk (double-kay) is not a typo. :-)
>
> -- 
> | Jeremy Chadwick                                   jdc at koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Mountain View, CA, US                                            |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>
> On Sat, Feb 09, 2013 at 09:29:30PM -0800, Marc Fournier wrote:
>>
>> Hi John ...
>>
>> Does this help?
>>
>> root@io:~ # ps auxl | grep du
>> root  1054 0.0 0.1 16176 6600 ?? D    3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
>> root 12353 0.0 0.1 16176 5104 ?? D   Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
>> root 64529 0.0 0.1 16176 5164 ?? D   Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
>> root 12855 0.0 0.0 16308 1988 0  S+   5:26AM 0:00.00 grep du          0 12847 0 20 0 piperd
>> root@io:~ # grep vm /etc/fstab
>> 192.168.1.254:/vol/basic /vm nfs rw,nolockd,intr 0 0
>>
>> Haven't rebooted yet ... if there is anything I can do / try before ... ?
>>
>> The kernel is from Jan 21st ...
>>
>>
>> On 2013-01-19, at 4:57 AM, John Baldwin <jhb at freebsd.org> wrote:
>>
>>> On Tuesday, December 18, 2012 11:58:36 PM Hub-Marketing wrote:
>>>> I'm running a few servers sitting on top of a NetApp file server ...
>>>> everything runs great, but periodically I'm getting:
>>>>
>>>> nfs_getpages: error 13
>>>> vm_fault: pager read error, pid 11355 (https)
>>>
>>> Are you using interruptible mounts ("intr" mount option)?
>>>
>>> Also, can you get ps output that includes the 'l' flag to show what
>>> the processes are stuck on?
>>>
>>> -- 
>>> John Baldwin
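The procstat -kk stack at the top of this message is easier to read one frame per line. Read that way (innermost frame first), the du thread appears to be asleep in __lockmgr_args via nfs_lock1, _vn_lock and vget, called from nfs_lookup during an open, which suggests it is waiting for a vnode lock rather than for an RPC reply. The helper below is a hypothetical sketch for splitting the stack; the function name is made up, and it assumes the COMM and TDNAME columns contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read "procstat -kk <pid>" output on stdin and print one
# kernel stack frame per line, with the +0x offsets removed.
# Assumes COMM and TDNAME contain no spaces, so frames start at field 5.
kstack_frames() {
    awk 'NR > 1 {
        for (i = 5; i <= NF; i++) {
            frame = $i
            sub(/\+0x[0-9a-f]+$/, "", frame)   # strip the offset suffix
            print frame
        }
    }'
}

# Usage sketch (pid taken from the thread):
#   procstat -kk 64529 | kstack_frames
```

Comparing the frame lists of all three stuck du processes would show whether they are blocked at the same point.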