TPCzfs at mklab.ph.rhul.ac.uk
2012-Jun-13 17:43 UTC
[zfs-discuss] ZFS NFS service hanging on Sunday morning problem
> Shot in the dark here:
> What are you using for the sharenfs value on the ZFS filesystem?
> Something like rw=.mydomain.lan ?

They are IP blocks or hosts specified as FQDNs, e.g.,

pptank/home/tcrane  sharenfs  rw=@192.168.101/24,rw=serverX.xx.rhul.ac.uk:serverY.xx.rhul.ac.uk

> I've had issues where a ZFS server loses connectivity to the primary
> DNS server and as a result the reverse lookups used to validate the identity

It was using our slave DNS but there have been no recent problems with it.
I've switched it to the primary DNS.

> of client systems fails and the connections hang. Any chance there's a
> planned reboot of the DNS server Sunday morning? That sounds like the kind of

No. The only things tied to Sunday morning are these two (Solaris
factory-installed?) cron jobs;

root at server5:/# grep nfsfind /var/spool/cron/crontabs/root
15 3 * * 0 /usr/lib/fs/nfs/nfsfind

root at server5:/# grep 13 /var/spool/cron/crontabs/lp
# At 03:13am on Sundays:
13 3 * * 0 cd /var/lp/logs; if [ -f requests ]; then if [ -f requests.1 ]; then /bin/mv requests.1 requests.2; fi; /usr/bin/cp requests requests.1; >requests; fi

The lp job does not access the main ZFS pool but nfsfind does. However,
AFAICT it has usually finished before the problem manifests itself.

> preventative maintenance that might be happening in that time window.

Cheers
Tom.

> Cheers,
> Erik
>
> On 13 June 2012, at 12:47, TPCzfs at mklab.ph.rhul.ac.uk wrote:
>
> > Dear All,
> > I have been advised to enquire here on zfs-discuss about the
> > ZFS problem described below, following discussion on the Usenet
> > newsgroup comp.unix.solaris. The full thread should be available here:
> > https://groups.google.com/forum/#!topic/comp.unix.solaris/uEQzz1t-G1s
> >
> > Many thanks
> > Tom Crane
> >
> > -- forwarded message --
> >
> > cindy.swearingen at oracle.com wrote:
> > : On Tuesday, May 29, 2012 5:39:11 AM UTC-6, (unknown) wrote:
> > : > Dear All,
> > : > Can anyone give any tips on diagnosing the following recurring problem?
> > : >
> > : > I have a Solaris box (server5, SunOS server5 5.10 Generic_147441-15
> > : > i86pc i386 i86pc) whose NFS service for its ZFS filesystems fails
> > : > every so often, always in the early hours of Sunday morning. I am
> > : > barely familiar with Solaris, but here is what I have managed to
> > : > discern when the problem occurs;
> > : >
> > : > Jobs on other machines which access server5's shares (via automounter)
> > : > hang, and attempts to manually remote-mount shares just time out.
> > : >
> > : > Remotely, showmount -e server5 shows all the exported FSs are available.
> > : >
> > : > On server5, the following services are running;
> > : >
> > : > root at server5:/var/adm# svcs | grep nfs
> > : > online         May_25   svc:/network/nfs/status:default
> > : > online         May_25   svc:/network/nfs/nlockmgr:default
> > : > online         May_25   svc:/network/nfs/cbd:default
> > : > online         May_25   svc:/network/nfs/mapid:default
> > : > online         May_25   svc:/network/nfs/rquota:default
> > : > online         May_25   svc:/network/nfs/client:default
> > : > online         May_25   svc:/network/nfs/server:default
> > : >
> > : > On server5, I can list and read files on the affected FSs w/o problem,
> > : > but any attempt to write to the FS (e.g. copy a file to, or rm a file
> > : > on, the FS) just hangs the cp/rm process.
> > : >
> > : > On server5, 'zfs get sharenfs pptank/local_linux' displays the
> > : > expected list of hosts/IPs with remote ro & rw access.
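A quick diagnostic aside: when a write hangs like this while reads still work, the kernel stack of the stuck process usually shows where it is blocked, e.g. in the ZFS write path or waiting on memory. A minimal sketch using stock Solaris tools; 24854 is the hung cp from the top listing further down, so substitute whichever process is stuck:

root at server5:/# pstack 24854      # userland stack of the hung cp
root at server5:/# echo "0t24854::pid2proc | ::walk thread | ::findstack -v" | mdb -k

mdb -k inspects the live kernel and needs root; ::findstack -v prints each of the process's kernel thread stacks, which should name the function the write is sleeping in.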
> > : >
> > : > Here is the output from some other, hopefully relevant, commands;
> > : >
> > : > root at server5:/# zpool status
> > : >   pool: pptank
> > : >  state: ONLINE
> > : > status: The pool is formatted using an older on-disk format. The pool can
> > : >         still be used, but some features are unavailable.
> > : > action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
> > : >         pool will no longer be accessible on older software versions.
> > : >   scan: none requested
> > : > config:
> > : >
> > : >         NAME        STATE     READ WRITE CKSUM
> > : >         pptank      ONLINE       0     0     0
> > : >           raidz1-0  ONLINE       0     0     0
> > : >             c3t0d0  ONLINE       0     0     0
> > : >             c3t1d0  ONLINE       0     0     0
> > : >             c3t2d0  ONLINE       0     0     0
> > : >             c3t3d0  ONLINE       0     0     0
> > : >             c3t4d0  ONLINE       0     0     0
> > : >             c3t5d0  ONLINE       0     0     0
> > : >             c3t6d0  ONLINE       0     0     0
> > : >
> > : > errors: No known data errors
> > : >
> > : > root at server5:/# zpool list
> > : > NAME     SIZE  ALLOC   FREE   CAP  HEALTH  ALTROOT
> > : > pptank  12.6T   384G  12.3T    2%  ONLINE  -
> > : >
> > : > root at server5:/# zpool history
> > : > History for 'pptank':
> > : > <just hangs here>
> > : >
> > : > root at server5:/# zpool iostat 5
> > : >                capacity     operations    bandwidth
> > : > pool        alloc   free   read  write   read  write
> > : > ----------  -----  -----  -----  -----  -----  -----
> > : > pptank       384G  12.3T     92    115  3.08M  1.22M
> > : > pptank       384G  12.3T  1.11K    629  35.5M  3.03M
> > : > pptank       384G  12.3T    886    889  27.1M  3.68M
> > : > pptank       384G  12.3T    837    677  24.9M  2.82M
> > : > pptank       384G  12.3T  1.19K    757  37.4M  3.69M
> > : > pptank       384G  12.3T  1.02K    759  29.6M  3.90M
> > : > pptank       384G  12.3T    952    707  32.5M  3.09M
> > : > pptank       384G  12.3T  1.02K    831  34.5M  3.72M
> > : > pptank       384G  12.3T    707    503  23.5M  1.98M
> > : > pptank       384G  12.3T    626    707  20.8M  3.58M
> > : > pptank       384G  12.3T    816    838  26.1M  4.26M
> > : > pptank       384G  12.3T    942    800  30.1M  3.48M
> > : > pptank       384G  12.3T    677    675  21.7M  2.91M
> > : > pptank       384G  12.3T    590    725  19.2M  3.06M
> > : >
> > : > top shows the following runnable processes. Nothing excessive here, AFAICT?
> > : >
> > : > last pid: 25282;  load avg: 1.98, 1.95, 1.86;  up 1+09:02:05  07:46:29
> > : > 72 processes: 67 sleeping, 1 running, 1 stopped, 3 on cpu
> > : > CPU states: 81.5% idle, 0.1% user, 18.3% kernel, 0.0% iowait, 0.0% swap
> > : > Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
> > : >
> > : >   PID USERNAME LWP PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
> > : >   748 root      18  60  -20  103M 9752K cpu/1  78:44  6.62% nfsd
> > : > 24854 root       1  54    0 1480K  792K cpu/1   0:42  0.69% cp
> > : > 25281 root       1  59    0 3584K 2152K cpu/0   0:00  0.02% top
> > : >
> > : > The above cp job (which, as mentioned, is attempting to copy a file to
> > : > an affected FS) is, I've noticed, apparently not completely hung.
> > : >
> > : > The only thing that appears specific to Sunday morning is a cron job to
> > : > remove old .nfs* files,
> > : >
> > : > root at server5:/# crontab -l | grep nfsfind
> > : > 15 3 * * 0 /usr/lib/fs/nfs/nfsfind
> > : >
> > : > Any suggestions on how to proceed?
> > : >
> > : > Many thanks
> > : > Tom Crane
> > : >
> > : > Ps. The email address in the header is just a spam-trap.
> > : > --
> > : > Tom Crane, IT support, RHUL Particle Physics,
> > : > Dept. Physics, Royal Holloway, University of London, Egham Hill,
> > : > Egham, Surrey, TW20 0EX, England.
> > : > Email: T.Crane at rhul dot ac dot uk
> >
> > : Hi Tom,
> >
> > Hi Cindy,
> > Thanks for the follow-up.
> >
> > : I think SunOS server5 5.10 Generic_147441-15 is the Solaris 10 8/11
> > : release. Is this correct?
> >
> > I think so,...
> >
> > root at server5:/# cat /etc/release
> >                     Solaris 10 10/08 s10x_u6wos_07b X86
> >         Copyright 2008 Sun Microsystems, Inc.  All Rights Reserved.
> >                      Use is subject to license terms.
> >                         Assembled 27 October 2008
> >
> > : We looked at your truss output briefly and it looks like it is hanging
> > : trying to allocate memory. At least, that's what the "br ...." statements
> > : are at the end.
> >
> > : I will see if I can find out what diagnostic info would be helpful in
> > : this case.
> >
> > Thanks. That would be much appreciated.
> >
> > : You might get a faster response on zfs-discuss as John suggested.
> >
> > I will CC to zfs-discuss.
> >
> > Best regards
> > Tom.
> >
> > : Thanks,
> >
> > : Cindy
> >
> > Ps. The email address in the header is just a spam-trap.
> > --
> > Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
> > Egham, Surrey, TW20 0EX, England.
> > Email: T.Crane at rhul dot ac dot uk
> >
> > -- end of forwarded message --
> > _______________________________________________
> > zfs-discuss mailing list
> > zfs-discuss at opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England.
Email: T.Crane at rhul.ac.uk
Fax:   +44 (0) 1784 472794
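A short follow-up sketch on Cindy's point about the truss output stalling in memory allocation: with 2048M of physical memory and only 32M free, it is worth checking how much of that the ZFS ARC is holding. These are stock Solaris 10 commands; the cap shown afterwards is an illustrative value, not a tested recommendation for this machine:

root at server5:/# kstat -m zfs -n arcstats -s size   # current ARC size, bytes
root at server5:/# kstat -m zfs -n arcstats -s c      # ARC target size, bytes
root at server5:/# echo "::memstat" | mdb -k          # kernel/anon/free page breakdown

If the ARC is consuming most of the 2 GB, it can be capped in /etc/system (takes effect after a reboot):

* Illustrative value only: cap the ZFS ARC at 512 MB
set zfs:zfs_arc_max=0x20000000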