olivier olivier
2012-Dec-03 18:41 UTC
NFS/ZFS hangs after upgrading from 9.0-RELEASE to -STABLE
Hi all

After upgrading from 9.0-RELEASE to 9.1-PRERELEASE #0 r243679 I'm having severe problems with NFS sharing of a ZFS volume. nfsd appears to hang at random times (anywhere from once every couple of hours to once every two days) while accessing a ZFS volume, and the only way I have found of resolving the problem is to reboot. The server console is sometimes still responsive during the nfsd hang, and I can read and write files to the same ZFS volume while nfsd is hung. I am pasting below the output of procstat -kk on nfsd, and details of my pool (nfsstat on the server also hangs once the problem has started and does not produce any output). The pool is v28 and was created from a bunch of volumes attached over Fibre Channel using the mpt driver. My system has a Supermicro board and 4 AMD Opteron 6274 CPUs.

I did not experience any nfsd hangs with 9.0-RELEASE (same machine, essentially same configuration, same usage pattern).

I would greatly appreciate any help to resolve this problem!
Thank you
Olivier

  PID    TID COMM     TDNAME         KSTACK
 1511 102751 nfsd     nfsd: master
      mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5ae vop_stdlock+0x39
      VOP_LOCK1_APV+0x46 _vn_lock+0x47 zfs_fhtovp+0x338 nfsvno_fhtovp+0x87
      nfsd_fhtovp+0x7a nfsrvd_dorpc+0x9cf nfssvc_program+0x447
      svc_run_internal+0x687 svc_run+0x8f nfsrvd_nfsd+0x193 nfssvc_nfsd+0x9b
      sys_nfssvc+0x90 amd64_syscall+0x540 Xfast_syscall+0xf7
 1511 102752 nfsd     nfsd: service
      mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5ae vop_stdlock+0x39
      VOP_LOCK1_APV+0x46 _vn_lock+0x47 zfs_fhtovp+0x338 nfsvno_fhtovp+0x87
      nfsd_fhtovp+0x7a nfsrvd_dorpc+0x9cf nfssvc_program+0x447
      svc_run_internal+0x687 svc_thread_start+0xb fork_exit+0x11f
      fork_trampoline+0xe
 1511 102753 nfsd     nfsd: service
      mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x112 zio_wait+0x61
      zil_commit+0x764 zfs_freebsd_write+0xba0 VOP_WRITE_APV+0xb2
      nfsvno_write+0x14d nfsrvd_write+0x362 nfsrvd_dorpc+0x3c0
      nfssvc_program+0x447 svc_run_internal+0x687 svc_thread_start+0xb
      fork_exit+0x11f fork_trampoline+0xe
 1511 102754 nfsd     nfsd: service
      mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x112 zio_wait+0x61
      zil_commit+0x3cf zfs_freebsd_fsync+0xdc nfsvno_fsync+0x2f2
      nfsrvd_commit+0xe7 nfsrvd_dorpc+0x3c0 nfssvc_program+0x447
      svc_run_internal+0x687 svc_thread_start+0xb fork_exit+0x11f
      fork_trampoline+0xe
 1511 102755 nfsd     nfsd: service
      mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5ae vop_stdlock+0x39
      VOP_LOCK1_APV+0x46 _vn_lock+0x47 zfs_fhtovp+0x338 nfsvno_fhtovp+0x87
      nfsd_fhtovp+0x7a nfsrvd_dorpc+0x9cf nfssvc_program+0x447
      svc_run_internal+0x687 svc_thread_start+0xb fork_exit+0x11f
      fork_trampoline+0xe
 1511 102756 nfsd     nfsd: service
      mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x112 zil_commit+0x6d
      zfs_freebsd_write+0xba0 VOP_WRITE_APV+0xb2 nfsvno_write+0x14d
      nfsrvd_write+0x362 nfsrvd_dorpc+0x3c0 nfssvc_program+0x447
      svc_run_internal+0x687 svc_thread_start+0xb fork_exit+0x11f
      fork_trampoline+0xe

  PID    TID COMM     TDNAME         KSTACK
 1507 102750 nfsd     -
      mi_switch+0x186 sleepq_catch_signals+0x2e1 sleepq_wait_sig+0x16
      _cv_wait_sig+0x12a seltdwait+0xf6 kern_select+0x6ef sys_select+0x5d
      amd64_syscall+0x540 Xfast_syscall+0xf7

  pool: tank
 state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool
        will no longer be accessible on software that does not support
        feature flags.
  scan: scrub repaired 0 in 45h37m with 0 errors on Mon Dec  3 03:07:11 2012
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            da19      ONLINE       0     0     0
            da31      ONLINE       0     0     0
            da32      ONLINE       0     0     0
            da33      ONLINE       0     0     0
            da34      ONLINE       0     0     0
          raidz1-1    ONLINE       0     0     0
            da20      ONLINE       0     0     0
            da36      ONLINE       0     0     0
            da37      ONLINE       0     0     0
            da38      ONLINE       0     0     0
            da39      ONLINE       0     0     0
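(For reference, the diagnostics above could be reproduced with commands along these lines; the exact invocations are not given in the message, so treat this as an assumed reconstruction rather than what was literally typed:)

  # kernel stack traces for every nfsd process/thread (the KSTACK listing above)
  procstat -kk $(pgrep nfsd)

  # pool health, scrub history and vdev layout (the zpool listing above)
  zpool status -v tank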
Rick Macklem
2012-Dec-04 14:26 UTC
NFS/ZFS hangs after upgrading from 9.0-RELEASE to -STABLE
Olivier wrote:
> Hi all
> After upgrading from 9.0-RELEASE to 9.1-PRERELEASE #0 r243679 I'm having
> severe problems with NFS sharing of a ZFS volume. nfsd appears to hang at
> random times (anywhere from once every couple of hours to once every two
> days) while accessing a ZFS volume, and the only way I have found of
> resolving the problem is to reboot.
> [...]

These threads are either waiting for a vnode lock or waiting inside zil_commit() (at three different locations in zil_commit()). A guess would be that the ZIL hasn't completed a write for some reason, so three threads are waiting on it while one of them holds the lock on the vnode being written, and the remaining threads are waiting for that vnode lock.
I am not a ZFS guy, so I cannot help further, except to suggest that you try to determine what might cause a write to the ZIL to stall (different device, different device driver, ...).

Good luck with it, rick
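(A hedged way to act on that suggestion, using standard FreeBSD tools rather than anything Rick specifically named: while a hang is in progress, watch per-device I/O and look for a device whose queue never drains.)

  # per-vdev I/O statistics for the pool, refreshed every second; a device
  # that keeps showing queued operations but no throughput is a candidate
  # for a lost or stalled request
  zpool iostat -v tank 1

  # GEOM-level view: L(q) is the number of outstanding requests per provider;
  # a da* disk whose L(q) stays nonzero while ops/s remain at 0 looks stuck
  gstat -I 1s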
Andriy Gapon
2012-Dec-13 10:36 UTC
NFS/ZFS hangs after upgrading from 9.0-RELEASE to -STABLE
I decided to share here the comment that I made in private, so that more people could potentially benefit from it.

on 03/12/2012 20:41 olivier olivier said the following:
> Hi all
> After upgrading from 9.0-RELEASE to 9.1-PRERELEASE #0 r243679 I'm having
> severe problems with NFS sharing of a ZFS volume. nfsd appears to hang at
> random times (anywhere from once every couple of hours to once every two
> days) while accessing a ZFS volume, and the only way I have found of
> resolving the problem is to reboot.
> [...]
> I would greatly appreciate any help to resolve this problem!

I've looked at the provided data and I do not see anything that implicates ZFS.

My rules of thumb for ZFS hangs:
- if there are threads in zio_wait
- if you can confirm that they are indeed stuck there [*]
- if there are no threads in zio_interrupt

then it is most likely that the problem is at the storage level. Most likely it is a bug in the storage controller driver which allowed an I/O request to get lost (instead of being errored out or timed out). `camcontrol tags <disk> -v` can be used to query the depth of the queue for each disk and determine the bad one.

[*] You have to be sure that a thread just sits in zio_wait and doesn't make any forward progress, as opposed to the thread doing a lot of I/O and thus having a high probability of being seen in zio_wait.

--
Andriy Gapon
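(A sketch of how that check could be scripted: `camcontrol tags` is the tool Andriy names, but the loop and the explicit disk list are assumptions taken from the pool layout shown earlier.)

  # query the outstanding-command counters for every disk in the pool;
  # in the -v output, a disk whose dev_active stays above zero while no
  # I/O ever completes is the likely holder of a lost request
  for d in da19 da31 da32 da33 da34 da20 da36 da37 da38 da39; do
      echo "=== ${d} ==="
      camcontrol tags ${d} -v
  done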