TPCzfs at mklab.ph.rhul.ac.uk
2012-Jun-13 10:47 UTC
[zfs-discuss] (fwd) Re: ZFS NFS service hanging on Sunday morning problem
Dear All,

I have been advised to enquire here on zfs-discuss about the ZFS problem
described below, following discussion on the Usenet newsgroup
comp.unix.solaris. The full thread should be available here:

https://groups.google.com/forum/#!topic/comp.unix.solaris/uEQzz1t-G1s

Many thanks,
Tom Crane

-- forwarded message --

cindy.swearingen at oracle.com wrote:
: On Tuesday, May 29, 2012 5:39:11 AM UTC-6, (unknown) wrote:
: > Dear All,
: > Can anyone give any tips on diagnosing the following recurring problem?
: >
: > I have a Solaris box (server5, SunOS server5 5.10 Generic_147441-15
: > i86pc i386 i86pc) whose NFS service for ZFS-exported filesystems fails
: > every so often, always in the early hours of Sunday morning. I am
: > barely familiar with Solaris, but here is what I have managed to
: > discern when the problem occurs:
: >
: > Jobs on other machines which access server5's shares (via automounter)
: > hang, and attempts to manually remote-mount the shares just time out.
: >
: > Remotely, 'showmount -e server5' shows that all the exported
: > filesystems are available.
: >
: > On server5, the following services are running:
: >
: > root at server5:/var/adm# svcs | grep nfs
: > online         May_25   svc:/network/nfs/status:default
: > online         May_25   svc:/network/nfs/nlockmgr:default
: > online         May_25   svc:/network/nfs/cbd:default
: > online         May_25   svc:/network/nfs/mapid:default
: > online         May_25   svc:/network/nfs/rquota:default
: > online         May_25   svc:/network/nfs/client:default
: > online         May_25   svc:/network/nfs/server:default
: >
: > On server5, I can list and read files on the affected filesystems
: > without problem, but any attempt to write to them (e.g. copying a
: > file to, or removing a file from, the filesystem) just hangs the
: > cp/rm process.
: >
: > On server5, 'zfs get sharenfs pptank/local_linux' displays the
: > expected list of hosts/IPs with remote ro & rw access.
: >
: > Here is the O/P from some other, hopefully relevant, commands:
: >
: > root at server5:/# zpool status
: >   pool: pptank
: >  state: ONLINE
: > status: The pool is formatted using an older on-disk format. The pool can
: >         still be used, but some features are unavailable.
: > action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
: >         pool will no longer be accessible on older software versions.
: >   scan: none requested
: > config:
: >
: >         NAME        STATE     READ WRITE CKSUM
: >         pptank      ONLINE       0     0     0
: >           raidz1-0  ONLINE       0     0     0
: >             c3t0d0  ONLINE       0     0     0
: >             c3t1d0  ONLINE       0     0     0
: >             c3t2d0  ONLINE       0     0     0
: >             c3t3d0  ONLINE       0     0     0
: >             c3t4d0  ONLINE       0     0     0
: >             c3t5d0  ONLINE       0     0     0
: >             c3t6d0  ONLINE       0     0     0
: >
: > errors: No known data errors
: >
: > root at server5:/# zpool list
: > NAME     SIZE  ALLOC   FREE   CAP  HEALTH  ALTROOT
: > pptank  12.6T   384G  12.3T    2%  ONLINE  -
: >
: > root at server5:/# zpool history
: > History for 'pptank':
: > <just hangs here>
: >
: > root at server5:/# zpool iostat 5
: >                capacity     operations    bandwidth
: > pool        alloc   free   read  write   read  write
: > ----------  -----  -----  -----  -----  -----  -----
: > pptank       384G  12.3T     92    115  3.08M  1.22M
: > pptank       384G  12.3T  1.11K    629  35.5M  3.03M
: > pptank       384G  12.3T    886    889  27.1M  3.68M
: > pptank       384G  12.3T    837    677  24.9M  2.82M
: > pptank       384G  12.3T  1.19K    757  37.4M  3.69M
: > pptank       384G  12.3T  1.02K    759  29.6M  3.90M
: > pptank       384G  12.3T    952    707  32.5M  3.09M
: > pptank       384G  12.3T  1.02K    831  34.5M  3.72M
: > pptank       384G  12.3T    707    503  23.5M  1.98M
: > pptank       384G  12.3T    626    707  20.8M  3.58M
: > pptank       384G  12.3T    816    838  26.1M  4.26M
: > pptank       384G  12.3T    942    800  30.1M  3.48M
: > pptank       384G  12.3T    677    675  21.7M  2.91M
: > pptank       384G  12.3T    590    725  19.2M  3.06M
: >
: > top shows the following runnable processes. Nothing excessive here,
: > AFAICT?
: >
: > last pid: 25282;  load avg: 1.98, 1.95, 1.86;  up 1+09:02:05  07:46:29
: > 72 processes: 67 sleeping, 1 running, 1 stopped, 3 on cpu
: > CPU states: 81.5% idle, 0.1% user, 18.3% kernel, 0.0% iowait, 0.0% swap
: > Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
: >
: >   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
: >   748 root      18  60  -20  103M 9752K cpu/1   78:44  6.62% nfsd
: > 24854 root       1  54    0 1480K  792K cpu/1    0:42  0.69% cp
: > 25281 root       1  59    0 3584K 2152K cpu/0    0:00  0.02% top
: >
: > The cp job above is the one mentioned earlier, attempting to copy a
: > file to an affected filesystem; I've noticed it is apparently not
: > completely hung.
: >
: > The only thing that appears specific to Sunday morning is a cron job
: > to remove old .nfs* files:
: >
: > root at server5:/# crontab -l | grep nfsfind
: > 15 3 * * 0 /usr/lib/fs/nfs/nfsfind
: >
: > Any suggestions on how to proceed?
: >
: > Many thanks
: > Tom Crane
: >
: > Ps. The email address in the header is just a spam-trap.
: > --
: > Tom Crane, IT support, RHUL Particle Physics.,
: > Dept. Physics, Royal Holloway, University of London, Egham Hill,
: > Egham, Surrey, TW20 0EX, England.
: > Email: T.Crane at rhul dot ac dot uk

: Hi Tom,

Hi Cindy,
Thanks for the followup.

: I think SunOS server5 5.10 Generic_147441-15 is the Solaris 10 8/11
: release. Is this correct?

I think so,...

root at server5:/# cat /etc/release
                      Solaris 10 10/08 s10x_u6wos_07b X86
          Copyright 2008 Sun Microsystems, Inc.  All Rights Reserved.
                       Use is subject to license terms.
                           Assembled 27 October 2008

: We looked at your truss output briefly and it looks like it is hanging
: trying to allocate memory. At least, that's what the "br ...." statements
: at the end suggest.
: I will see if I can find out what diagnostic info would help in
: this case.

Thanks. That would be much appreciated.

: You might get a faster response on zfs-discuss as John suggested.

I will CC to zfs-discuss.

Best regards,
Tom.

: Thanks,
: Cindy

Ps. The email address in the header is just a spam-trap.

--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England.
Email: T.Crane at rhul dot ac dot uk

-- end of forwarded message --
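[Editor's note] Since the truss output was reportedly hanging in memory
allocation ("br ...", i.e. brk calls) on a box showing only 32M free of
2048M physical memory, it may help to capture kernel memory, ARC, and
process-stack state on server5 while a hang is actually in progress. The
sketch below is hypothetical (the script name, log path, and structure are
illustrative, not from the thread); it assumes the standard Solaris 10
tools mdb, kstat, and pstack, and skips any section whose tool is missing,
so it is harmless to run elsewhere:

```shell
#!/bin/sh
# Hypothetical diagnostic collector for the Sunday-morning NFS/ZFS hang.
# Assumes Solaris 10 tools (mdb, kstat, pstack, pgrep); every section is
# skipped when its tool is not installed.

collect_diag() {
  OUT=${1:-./zfs-hang.log}       # illustrative default log path
  {
    echo "=== kernel memory summary ==="
    if command -v mdb >/dev/null 2>&1; then
      echo "::memstat" | mdb -k           # where is physical memory going?
    fi

    echo "=== ZFS ARC statistics ==="
    if command -v kstat >/dev/null 2>&1; then
      kstat -n arcstats                   # ARC size vs. the 2 GB of RAM
    fi

    echo "=== stacks of hung cp/rm processes ==="
    if command -v pstack >/dev/null 2>&1; then
      for pid in `pgrep -x cp` `pgrep -x rm`; do
        pstack "$pid"                     # where is the writer stuck?
      done
    fi
  } > "$OUT" 2>&1
  echo "diagnostics written to $OUT"
}

collect_diag ./zfs-hang-demo.log
```

Comparing the ::memstat and arcstats snapshots taken during a hang against
the nfsfind cron window (15 3 * * 0, i.e. 03:15 every Sunday) would show
whether ARC or other kernel memory pressure coincides with the point where
writes start to hang.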