olivier olivier
2012-Dec-03 18:41 UTC
NFS/ZFS hangs after upgrading from 9.0-RELEASE to -STABLE
Hi all
After upgrading from 9.0-RELEASE to 9.1-PRERELEASE #0 r243679 I'm having
severe problems with NFS sharing of a ZFS volume. nfsd appears to hang at
random times (anywhere from once every couple of hours to once every two
days) while accessing a ZFS volume, and the only way I have found to resolve
the problem is to reboot. The server console is sometimes still responsive
during the nfsd hang, and I can read and write files on the same ZFS volume
while nfsd is hung. I am pasting below the output of procstat -kk on nfsd,
and details of my pool (nfsstat on the server hangs once the problem has
started occurring and does not produce any output). The pool is v28 and was
created from a bunch of volumes attached over Fibre Channel using the mpt
driver. My system has a Supermicro board and 4 AMD Opteron 6274 CPUs.
I did not experience any nfsd hangs with 9.0-RELEASE (same machine,
essentially same configuration, same usage pattern).
I would greatly appreciate any help to resolve this problem!
Thank you
Olivier
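A minimal sketch of the diagnostic commands referenced above, assuming the
nfsd PIDs are found with pgrep (any equivalent method of locating the
processes works the same):

  # Kernel stack traces of all nfsd threads (the output pasted below)
  procstat -kk $(pgrep nfsd)
  # Server-side NFS statistics (hangs once the problem has started)
  nfsstat -s
  # Pool layout and health (the output pasted at the end)
  zpool status tank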
PID TID COMM TDNAME KSTACK
1511 102751 nfsd nfsd: master
mi_switch+0x186
sleepq_wait+0x42
__lockmgr_args+0x5ae
vop_stdlock+0x39
VOP_LOCK1_APV+0x46
_vn_lock+0x47
zfs_fhtovp+0x338
nfsvno_fhtovp+0x87
nfsd_fhtovp+0x7a
nfsrvd_dorpc+0x9cf
nfssvc_program+0x447
svc_run_internal+0x687
svc_run+0x8f
nfsrvd_nfsd+0x193
nfssvc_nfsd+0x9b
sys_nfssvc+0x90
amd64_syscall+0x540
Xfast_syscall+0xf7
1511 102752 nfsd nfsd: service
mi_switch+0x186
sleepq_wait+0x42
__lockmgr_args+0x5ae
vop_stdlock+0x39
VOP_LOCK1_APV+0x46
_vn_lock+0x47
zfs_fhtovp+0x338
nfsvno_fhtovp+0x87
nfsd_fhtovp+0x7a
nfsrvd_dorpc+0x9cf
nfssvc_program+0x447
svc_run_internal+0x687
svc_thread_start+0xb
fork_exit+0x11f
fork_trampoline+0xe
1511 102753 nfsd nfsd: service
mi_switch+0x186
sleepq_wait+0x42
_cv_wait+0x112
zio_wait+0x61
zil_commit+0x764
zfs_freebsd_write+0xba0
VOP_WRITE_APV+0xb2
nfsvno_write+0x14d
nfsrvd_write+0x362
nfsrvd_dorpc+0x3c0
nfssvc_program+0x447
svc_run_internal+0x687
svc_thread_start+0xb
fork_exit+0x11f
fork_trampoline+0xe
1511 102754 nfsd nfsd: service
mi_switch+0x186
sleepq_wait+0x42
_cv_wait+0x112
zio_wait+0x61
zil_commit+0x3cf
zfs_freebsd_fsync+0xdc
nfsvno_fsync+0x2f2
nfsrvd_commit+0xe7
nfsrvd_dorpc+0x3c0
nfssvc_program+0x447
svc_run_internal+0x687
svc_thread_start+0xb
fork_exit+0x11f
fork_trampoline+0xe
1511 102755 nfsd nfsd: service
mi_switch+0x186
sleepq_wait+0x42
__lockmgr_args+0x5ae
vop_stdlock+0x39
VOP_LOCK1_APV+0x46
_vn_lock+0x47
zfs_fhtovp+0x338
nfsvno_fhtovp+0x87
nfsd_fhtovp+0x7a
nfsrvd_dorpc+0x9cf
nfssvc_program+0x447
svc_run_internal+0x687
svc_thread_start+0xb
fork_exit+0x11f
fork_trampoline+0xe
1511 102756 nfsd nfsd: service
mi_switch+0x186
sleepq_wait+0x42
_cv_wait+0x112
zil_commit+0x6d
zfs_freebsd_write+0xba0
VOP_WRITE_APV+0xb2
nfsvno_write+0x14d
nfsrvd_write+0x362
nfsrvd_dorpc+0x3c0
nfssvc_program+0x447
svc_run_internal+0x687
svc_thread_start+0xb
fork_exit+0x11f
fork_trampoline+0xe
PID TID COMM TDNAME KSTACK
1507 102750 nfsd -
mi_switch+0x186
sleepq_catch_signals+0x2e1
sleepq_wait_sig+0x16
_cv_wait_sig+0x12a
seltdwait+0xf6
kern_select+0x6ef
sys_select+0x5d
amd64_syscall+0x540
Xfast_syscall+0xf7
pool: tank
state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on software that does not support feature
flags.
scan: scrub repaired 0 in 45h37m with 0 errors on Mon Dec 3 03:07:11 2012
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
da19 ONLINE 0 0 0
da31 ONLINE 0 0 0
da32 ONLINE 0 0 0
da33 ONLINE 0 0 0
da34 ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
da20 ONLINE 0 0 0
da36 ONLINE 0 0 0
da37 ONLINE 0 0 0
da38 ONLINE 0 0 0
da39 ONLINE 0 0 0
Rick Macklem
2012-Dec-04 14:26 UTC
NFS/ZFS hangs after upgrading from 9.0-RELEASE to -STABLE
Olivier wrote:
> nfsd appears to hang at random times while accessing a ZFS volume, and
> the only way I have found to resolve the problem is to reboot.
> [...]
These threads are either waiting for a vnode lock or waiting inside
zil_commit() (at 3 different locations in zil_commit()). A guess would be
that the ZIL hasn't completed a write for some reason, so 3 threads are
waiting for it, while one of them holds a lock on the vnode being written
and the remaining threads are waiting for that vnode lock.
I am not a ZFS guy, so I cannot help further, except to suggest that you try
to determine what might cause a write to the ZIL to stall. (Different
device, different device driver...) Good luck with it, rick
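A rough sketch of one way to check whether the pool's devices are still
completing I/O while nfsd is hung; the pool name tank is taken from the
status output above, and the gstat column names are those of stock FreeBSD:

  # Per-vdev read/write ops once per second; all zeros across the board
  # while writes are outstanding suggests stalled I/O.
  zpool iostat -v tank 1
  # GEOM-level view; look for devices with a nonzero L(q) (queued
  # requests) that never drains.
  gstat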
Andriy Gapon
2012-Dec-13 10:36 UTC
NFS/ZFS hangs after upgrading from 9.0-RELEASE to -STABLE
I decided to share here the comment that I made in private, so that more
people could potentially benefit from it.

on 03/12/2012 20:41 olivier olivier said the following:
> nfsd appears to hang at random times while accessing a ZFS volume, and
> the only way I have found to resolve the problem is to reboot.
> [...]

I've looked at the provided data and I do not see anything that implicates
ZFS. My rules of thumb for ZFS hangs:
- if there are threads in zio_wait,
- if you can confirm that they are indeed stuck there [*], and
- if there are no threads in zio_interrupt,
then it is most likely that the problem is at the storage level.

[*] you have to be sure that a thread just sits in zio_wait and doesn't make
any forward progress, as opposed to the thread doing a lot of I/O and thus
having a high probability of being seen in zio_wait.

Most likely it is a bug in the storage controller driver that allowed an I/O
request to get lost (instead of being errored out or timed out).
`camcontrol tags <disk> -v` can be used to query the depth of the queue for
each disk and determine the bad one.
-- Andriy Gapon
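A minimal sketch of how the checks described above might be run, assuming
the da* CAM disks listed in the pool config; enumerating devices via
kern.disks is just one way to loop over them:

  # Look for ZFS I/O threads stuck waiting with no completions arriving.
  procstat -kk -a | grep zio_wait       # threads blocked in zio_wait
  procstat -kk -a | grep zio_interrupt  # completion threads; none while
                                        # zio_wait persists is suspicious
  # Query the CAM tag/queue state of every disk to spot one holding
  # requests that never complete.
  for d in $(sysctl -n kern.disks); do
          echo "=== $d ==="
          camcontrol tags $d -v
  done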