On 11/13/2016 11:06, Andriy Gapon wrote:
> On 12/11/2016 14:40, Henri Hennebert wrote:
>> I attach it
>
> Thank you!
> So, these two threads are trying to get the lock in the exclusive mode:
>
> Thread 687 (Thread 101243):
> #0  sched_switch (td=0xfffff800b642b500, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
> #1  0xffffffff80561ae2 in mi_switch (flags=<value optimized out>, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:455
> #2  0xffffffff805ae8da in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:646
> #3  0xffffffff8052f854 in sleeplk (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=0xffffffff813be535 "zfs", pri=<value optimized out>, timo=51) at /usr/src/sys/kern/kern_lock.c:222
> #4  0xffffffff8052f39d in __lockmgr_args (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=<value optimized out>, pri=<value optimized out>, timo=<value optimized out>, file=<value optimized out>, line=<value optimized out>) at /usr/src/sys/kern/kern_lock.c:958
> #5  0xffffffff80616a8c in vop_stdlock (ap=<value optimized out>) at lockmgr.h:98
> #6  0xffffffff8093784d in VOP_LOCK1_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2087
> #7  0xffffffff8063c5b3 in _vn_lock (vp=<value optimized out>, flags=548864, file=<value optimized out>, line=<value optimized out>) at vnode_if.h:859
> #8  0xffffffff8062a5f7 in vget (vp=0xfffff80049c2c000, flags=548864, td=0xfffff800b642b500) at /usr/src/sys/kern/vfs_subr.c:2523
> #9  0xffffffff806118b9 in cache_lookup (dvp=<value optimized out>, vpp=<value optimized out>, cnp=<value optimized out>, tsp=<value optimized out>, ticksp=<value optimized out>) at /usr/src/sys/kern/vfs_cache.c:686
> #10 0xffffffff806133dc in vfs_cache_lookup (ap=<value optimized out>) at /usr/src/sys/kern/vfs_cache.c:1081
> #11 0xffffffff80935777 in VOP_LOOKUP_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:127
> #12 0xffffffff8061cdf1 in lookup (ndp=<value optimized out>) at vnode_if.h:54
> #13 0xffffffff8061c492 in namei (ndp=<value optimized out>) at /usr/src/sys/kern/vfs_lookup.c:306
> #14 0xffffffff80509395 in kern_execve (td=<value optimized out>, args=<value optimized out>, mac_p=0x0) at /usr/src/sys/kern/kern_exec.c:443
> #15 0xffffffff80508ccc in sys_execve (td=0xfffff800b642b500, uap=0xfffffe010182cb80) at /usr/src/sys/kern/kern_exec.c:218
> #16 0xffffffff808d449e in amd64_syscall (td=<value optimized out>, traced=0) at subr_syscall.c:135
> #17 0xffffffff808b7ddb in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
>
> Thread 681 (Thread 101147):
> #0  sched_switch (td=0xfffff80065f4e500, newtd=0xfffff8000285f000, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
> #1  0xffffffff80561ae2 in mi_switch (flags=<value optimized out>, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:455
> #2  0xffffffff805ae8da in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:646
> #3  0xffffffff8052f854 in sleeplk (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=0xffffffff813be535 "zfs", pri=<value optimized out>, timo=51) at /usr/src/sys/kern/kern_lock.c:222
> #4  0xffffffff8052f39d in __lockmgr_args (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=<value optimized out>, pri=<value optimized out>, timo=<value optimized out>, file=<value optimized out>, line=<value optimized out>) at /usr/src/sys/kern/kern_lock.c:958
> #5  0xffffffff80616a8c in vop_stdlock (ap=<value optimized out>) at lockmgr.h:98
> #6  0xffffffff8093784d in VOP_LOCK1_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2087
> #7  0xffffffff8063c5b3 in _vn_lock (vp=<value optimized out>, flags=548864, file=<value optimized out>, line=<value optimized out>) at vnode_if.h:859
> #8  0xffffffff8062a5f7 in vget (vp=0xfffff80049c2c000, flags=548864, td=0xfffff80065f4e500) at /usr/src/sys/kern/vfs_subr.c:2523
> #9  0xffffffff806118b9 in cache_lookup (dvp=<value optimized out>, vpp=<value optimized out>, cnp=<value optimized out>, tsp=<value optimized out>, ticksp=<value optimized out>) at /usr/src/sys/kern/vfs_cache.c:686
> #10 0xffffffff806133dc in vfs_cache_lookup (ap=<value optimized out>) at /usr/src/sys/kern/vfs_cache.c:1081
> #11 0xffffffff80935777 in VOP_LOOKUP_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:127
> #12 0xffffffff8061cdf1 in lookup (ndp=<value optimized out>) at vnode_if.h:54
> #13 0xffffffff8061c492 in namei (ndp=<value optimized out>) at /usr/src/sys/kern/vfs_lookup.c:306
> #14 0xffffffff80509395 in kern_execve (td=<value optimized out>, args=<value optimized out>, mac_p=0x0) at /usr/src/sys/kern/kern_exec.c:443
> #15 0xffffffff80508ccc in sys_execve (td=0xfffff80065f4e500, uap=0xfffffe01016b8b80) at /usr/src/sys/kern/kern_exec.c:218
> #16 0xffffffff808d449e in amd64_syscall (td=<value optimized out>, traced=0) at subr_syscall.c:135
> #17 0xffffffff808b7ddb in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396

These 2 threads are innd processes.
In core.txt.4:

 8 14789 29165   0  24  4  40040  6612 zfs      DN    -    0:00.00 [innd]
 8 29165     1   0  20  0  42496  6888 select   Ds    -    0:01.33 [innd]
 8 49778 29165   0  24  4  40040  6900 zfs      DN    -    0:00.00 [innd]
 8 82034 29165   0  24  4    132     0 zfs      DN    -    0:00.00 [innd]

the corresponding info threads are:

  687 Thread 101243 (PID=49778: innd)  sched_switch (td=0xfffff800b642b500, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
  681 Thread 101147 (PID=14789: innd)  sched_switch (td=0xfffff80065f4e500, newtd=0xfffff8000285f000, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
  669 Thread 101250 (PID=82034: innd)  sched_switch (td=0xfffff800b6429000, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
  665 Thread 101262 (PID=29165: innd)  sched_switch (td=0xfffff800b6b54a00, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973

So your missing thread must be 101250:

(kgdb) tid 101250
[Switching to thread 669 (Thread 101250)]#0  sched_switch (td=0xfffff800b6429000, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
1973                    cpuid = PCPU_GET(cpuid);
Current language:  auto; currently minimal
(kgdb) bt
#0  sched_switch (td=0xfffff800b6429000, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
#1  0xffffffff80561ae2 in mi_switch (flags=<value optimized out>, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:455
#2  0xffffffff805ae8da in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:646
#3  0xffffffff8052f854 in sleeplk (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=0xffffffff813be535 "zfs", pri=<value optimized out>, timo=51) at /usr/src/sys/kern/kern_lock.c:222
#4  0xffffffff8052f39d in __lockmgr_args (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=<value optimized out>, pri=<value optimized out>, timo=<value optimized out>, file=<value optimized out>, line=<value optimized out>) at /usr/src/sys/kern/kern_lock.c:958
#5  0xffffffff80616a8c in vop_stdlock (ap=<value optimized out>) at lockmgr.h:98
#6  0xffffffff8093784d in VOP_LOCK1_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2087
#7  0xffffffff8063c5b3 in _vn_lock (vp=<value optimized out>, flags=525312, file=<value optimized out>, line=<value optimized out>) at vnode_if.h:859
#8  0xffffffff804e16cf in exec_elf64_imgact (imgp=<value optimized out>) at /usr/src/sys/kern/imgact_elf.c:899
#9  0xffffffff8050983d in kern_execve (td=<value optimized out>, args=<value optimized out>, mac_p=0x0) at /usr/src/sys/kern/kern_exec.c:602
#10 0xffffffff80508ccc in sys_execve (td=0xfffff800b6429000, uap=0xfffffe010184fb80) at /usr/src/sys/kern/kern_exec.c:218
#11 0xffffffff808d449e in amd64_syscall (td=<value optimized out>, traced=0) at subr_syscall.c:135
#12 0xffffffff808b7ddb in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
#13 0x000000080217edaa in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb)

> And the original stuck thread wants to get the lock in the shared mode.
> And there should be another thread that already holds the lock in the shared
> mode. But I am not able to identify it. I wonder if the original thread could
> be trying to get the lock recursively...
>
> It would be interesting to get more details from thread 101112.
> You can switch to it using the tid command, you can use 'fr' to select frames,
> 'info local' and 'info args' to see what variables are available (not optimized
> out) and then you can print any that look interesting. It would be nice to get
> a file path and a directory vnode where the lookup is called.
>
> Thank you.
I find nothing interesting:

(kgdb) fr 15
#15 0xffffffff806369cc in sys_fstatat (td=0x0, uap=0xfffffe010161db80) at /usr/src/sys/kern/vfs_syscalls.c:2136
2136            error = kern_statat(td, uap->flag, uap->fd, uap->path,
(kgdb) print *uap
$1 = {fd_l_ = 0xfffffe010161db80 "\234???", fd = -100, fd_r_ = 0xfffffe010161db84 "", path_l_ = 0xfffffe010161db88 "\210??",
  path = 0x800e2a388 <Address 0x800e2a388 out of bounds>, path_r_ = 0xfffffe010161db90 "\020??",
  buf_l_ = 0xfffffe010161db90 "\020??", buf = 0x800e2a310, buf_r_ = 0xfffffe010161db98 "",
  flag_l_ = 0xfffffe010161db98 "", flag = 512, flag_r_ = 0xfffffe010161db9c ""}
(kgdb) print $1->path
$2 = 0x800e2a388 <Address 0x800e2a388 out of bounds>
(kgdb) fr 8
#8  0xffffffff8062a5f7 in vget (vp=0xfffff80049c2c000, flags=2121728, td=0xfffff80009ba0500) at /usr/src/sys/kern/vfs_subr.c:2523
2523            if ((error = vn_lock(vp, flags)) != 0) {
(kgdb) print *vp
$3 = {v_tag = 0xffffffff813be535 "zfs", v_op = 0xffffffff813d0f70, v_data = 0xfffff80049c1f420,
  v_mount = 0xfffff800093aa660, v_nmntvnodes = {tqe_next = 0xfffff80049c2c938, tqe_prev = 0xfffff80049c2bb30},
  v_un = {vu_mount = 0x0, vu_socket = 0x0, vu_cdev = 0x0, vu_fifoinfo = 0x0},
  v_hashlist = {le_next = 0x0, le_prev = 0x0}, v_cache_src = {lh_first = 0x0},
  v_cache_dst = {tqh_first = 0xfffff800bfc8e3f0, tqh_last = 0xfffff800bfc8e410}, v_cache_dd = 0x0,
  v_lock = {lock_object = {lo_name = 0xffffffff813be535 "zfs", lo_flags = 117112832, lo_data = 0, lo_witness = 0x0},
    lk_lock = 23, lk_exslpfail = 0, lk_timo = 51, lk_pri = 96},
  v_interlock = {lock_object = {lo_name = 0xffffffff8099e9e0 "vnode interlock", lo_flags = 16973824, lo_data = 0,
      lo_witness = 0x0}, mtx_lock = 4},
  v_vnlock = 0xfffff80049c2c068, v_actfreelist = {tqe_next = 0xfffff80049c2c938, tqe_prev = 0xfffff80049ae9bd0},
  v_bufobj = {bo_lock = {lock_object = {lo_name = 0xffffffff8099e9f0 "bufobj interlock", lo_flags = 86179840,
        lo_data = 0, lo_witness = 0x0}, rw_lock = 1}, bo_ops = 0xffffffff80c4bf70, bo_object = 0xfffff800b62e9c60,
    bo_synclist = {le_next = 0x0, le_prev = 0x0}, bo_private = 0xfffff80049c2c000, __bo_vnode = 0xfffff80049c2c000,
    bo_clean = {bv_hd = {tqh_first = 0x0, tqh_last = 0xfffff80049c2c120}, bv_root = {pt_root = 0}, bv_cnt = 0},
    bo_dirty = {bv_hd = {tqh_first = 0x0, tqh_last = 0xfffff80049c2c140}, bv_root = {pt_root = 0}, bv_cnt = 0},
    bo_numoutput = 0, bo_flag = 0, bo_bsize = 131072},
  v_pollinfo = 0x0, v_label = 0x0, v_lockf = 0x0,
  v_rl = {rl_waiters = {tqh_first = 0x0, tqh_last = 0xfffff80049c2c188}, rl_currdep = 0x0},
  v_cstart = 0, v_lasta = 0, v_lastw = 0, v_clen = 0, v_holdcnt = 10, v_usecount = 6, v_iflag = 512,
  v_vflag = 32, v_writecount = 0, v_hash = 4833984, v_type = VREG}
(kgdb)

I also tried to get information from the execve of the other threads:

for tid 101250:

(kgdb) fr 10
#10 0xffffffff80508ccc in sys_execve (td=0xfffff800b6429000, uap=0xfffffe010184fb80) at /usr/src/sys/kern/kern_exec.c:218
218             error = kern_execve(td, &args, NULL);
(kgdb) print *uap
$4 = {fname_l_ = 0xfffffe010184fb80 "`\220\217\002\b", fname = 0x8028f9060 <Address 0x8028f9060 out of bounds>,
  fname_r_ = 0xfffffe010184fb88 "`????\177", argv_l_ = 0xfffffe010184fb88 "`????\177", argv = 0x7fffffffb660,
  argv_r_ = 0xfffffe010184fb90 "\b????\177", envv_l_ = 0xfffffe010184fb90 "\b????\177", envv = 0x7fffffffdc08,
  envv_r_ = 0xfffffe010184fb98 ""}
(kgdb)

for tid 101243:

(kgdb) f 15
#15 0xffffffff80508ccc in sys_execve (td=0xfffff800b642b500, uap=0xfffffe010182cb80) at /usr/src/sys/kern/kern_exec.c:218
218             error = kern_execve(td, &args, NULL);
(kgdb) print *uap
$5 = {fname_l_ = 0xfffffe010182cb80 "??\205\002\b", fname = 0x80285cfc0 <Address 0x80285cfc0 out of bounds>,
  fname_r_ = 0xfffffe010182cb88 "`????\177", argv_l_ = 0xfffffe010182cb88 "`????\177", argv = 0x7fffffffb660,
  argv_r_ = 0xfffffe010182cb90 "\b????\177", envv_l_ = 0xfffffe010182cb90 "\b????\177", envv = 0x7fffffffdc08,
  envv_r_ = 0xfffffe010182cb98 ""}
(kgdb)

Henri
On 11/13/2016 14:28, Henri Hennebert wrote:
> These 2 threads are innd processes. In core.txt.4:
>
>  8 14789 29165   0  24  4  40040  6612 zfs      DN    -    0:00.00 [innd]
>  8 29165     1   0  20  0  42496  6888 select   Ds    -    0:01.33 [innd]
>  8 49778 29165   0  24  4  40040  6900 zfs      DN    -    0:00.00 [innd]
>  8 82034 29165   0  24  4    132     0 zfs      DN    -    0:00.00 [innd]
>
> the corresponding info threads are:
>
>   687 Thread 101243 (PID=49778: innd)  sched_switch (td=0xfffff800b642b500, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
>   681 Thread 101147 (PID=14789: innd)  sched_switch (td=0xfffff80065f4e500, newtd=0xfffff8000285f000, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
>   669 Thread 101250 (PID=82034: innd)  sched_switch (td=0xfffff800b6429000, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
>   665 Thread 101262 (PID=29165: innd)  sched_switch (td=0xfffff800b6b54a00, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973

In case it may help, I had a look at innd. These processes use 2 execv calls: one to execute /bin/sh and the other to execute itself:

/*
**  Re-exec ourselves.
*/
static const char *
CCxexec(char *av[])
{
    char        *innd;
    char        *p;
    int         i;

    if (CCargv == NULL)
        return "1 no argv!";

    innd = concatpath(innconf->pathbin, "innd");

    /* Get the pathname. */
    p = av[0];
    if (*p == '\0' || strcmp(p, "innd") == 0)
        CCargv[0] = innd;
    else
        return "1 Bad value";

#ifdef DO_PERL
    PLmode(Mode, OMshutdown, av[0]);
#endif
#ifdef DO_PYTHON
    PYmode(Mode, OMshutdown, av[0]);
#endif

    JustCleanup();
    syslog(L_NOTICE, "%s execv %s", LogName, CCargv[0]);

    /* Close all fds to protect possible fd leaking accross successive innds. */
    for (i = 3; i < 30; i++)
        close(i);

    execv(CCargv[0], CCargv);
    syslog(L_FATAL, "%s cant execv %s %m", LogName, CCargv[0]);
    _exit(1);
    /* NOTREACHED */
    return "1 Exit failed";
}

The culprit may be /usr/local/news/bin/innd; remember that find is locked in /usr/local/news/bin

Henri
On 13/11/2016 15:28, Henri Hennebert wrote:
> On 11/13/2016 11:06, Andriy Gapon wrote:
>> On 12/11/2016 14:40, Henri Hennebert wrote:
>>> I attach it
>>
>> Thank you!
>> So, these two threads are trying to get the lock in the exclusive mode:
>>
>> Thread 687 (Thread 101243):
>> [backtrace snipped; quoted in full earlier in the thread]
>>
>> Thread 681 (Thread 101147):
>> [backtrace snipped; quoted in full earlier in the thread]
>
> These 2 threads are innd processes. In core.txt.4:
>
>  8 14789 29165   0  24  4  40040  6612 zfs      DN    -    0:00.00 [innd]
>  8 29165     1   0  20  0  42496  6888 select   Ds    -    0:01.33 [innd]
>  8 49778 29165   0  24  4  40040  6900 zfs      DN    -    0:00.00 [innd]
>  8 82034 29165   0  24  4    132     0 zfs      DN    -    0:00.00 [innd]
>
> the corresponding info threads are:
>
>   687 Thread 101243 (PID=49778: innd)  sched_switch (td=0xfffff800b642b500, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
>   681 Thread 101147 (PID=14789: innd)  sched_switch (td=0xfffff80065f4e500, newtd=0xfffff8000285f000, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
>   669 Thread 101250 (PID=82034: innd)  sched_switch (td=0xfffff800b6429000, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
>   665 Thread 101262 (PID=29165: innd)  sched_switch (td=0xfffff800b6b54a00, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
>
> So your missing thread must be 101250:

Oh, I missed this thread...
But it looks like another LK_EXCLUSIVE contender rather than the lock owner.

> (kgdb) tid 101250
> [Switching to thread 669 (Thread 101250)]#0  sched_switch (td=0xfffff800b6429000, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
> 1973                    cpuid = PCPU_GET(cpuid);
> Current language:  auto; currently minimal
> (kgdb) bt
> #0  sched_switch (td=0xfffff800b6429000, newtd=0xfffff8000285ea00, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
> #1  0xffffffff80561ae2 in mi_switch (flags=<value optimized out>, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:455
> #2  0xffffffff805ae8da in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:646
> #3  0xffffffff8052f854 in sleeplk (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=0xffffffff813be535 "zfs", pri=<value optimized out>, timo=51) at /usr/src/sys/kern/kern_lock.c:222
> #4  0xffffffff8052f39d in __lockmgr_args (lk=<value optimized out>, flags=<value optimized out>, ilk=<value optimized out>, wmesg=<value optimized out>, pri=<value optimized out>, timo=<value optimized out>, file=<value optimized out>, line=<value optimized out>) at /usr/src/sys/kern/kern_lock.c:958
> #5  0xffffffff80616a8c in vop_stdlock (ap=<value optimized out>) at lockmgr.h:98
> #6  0xffffffff8093784d in VOP_LOCK1_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2087
> #7  0xffffffff8063c5b3 in _vn_lock (vp=<value optimized out>, flags=525312, file=<value optimized out>, line=<value optimized out>) at vnode_if.h:859
> #8  0xffffffff804e16cf in exec_elf64_imgact (imgp=<value optimized out>) at /usr/src/sys/kern/imgact_elf.c:899
> #9  0xffffffff8050983d in kern_execve (td=<value optimized out>, args=<value optimized out>, mac_p=0x0) at /usr/src/sys/kern/kern_exec.c:602
> #10 0xffffffff80508ccc in sys_execve (td=0xfffff800b6429000, uap=0xfffffe010184fb80) at /usr/src/sys/kern/kern_exec.c:218
> #11 0xffffffff808d449e in amd64_syscall (td=<value optimized out>, traced=0) at subr_syscall.c:135
> #12 0xffffffff808b7ddb in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
> #13 0x000000080217edaa in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb)
>
>> And the original stuck thread wants to get the lock in the shared mode.
>> And there should be another thread that already holds the lock in the shared
>> mode. But I am not able to identify it. I wonder if the original thread could
>> be trying to get the lock recursively...
>>
>> It would be interesting to get more details from thread 101112.
>> You can switch to it using the tid command, you can use 'fr' to select frames,
>> 'info local' and 'info args' to see what variables are available (not optimized
>> out) and then you can print any that look interesting. It would be nice to get
>> a file path and a directory vnode where the lookup is called.
>>
>> Thank you.
>
> I find nothing interesting:
>
> (kgdb) fr 15
[snip]

Could you please show 'info local' in frame 14?
I expected that the 'nd' variable would be defined there and that it may contain some useful information.

> I also tried to get information from the execve of the other threads:
>
> for tid 101250:
>
> (kgdb) fr 10
> #10 0xffffffff80508ccc in sys_execve (td=0xfffff800b6429000, uap=0xfffffe010184fb80) at /usr/src/sys/kern/kern_exec.c:218
> 218             error = kern_execve(td, &args, NULL);
> (kgdb) print *uap
> $4 = {fname_l_ = 0xfffffe010184fb80 "`\220\217\002\b", fname = 0x8028f9060 <Address 0x8028f9060 out of bounds>,
>   fname_r_ = 0xfffffe010184fb88 "`????\177", argv_l_ = 0xfffffe010184fb88 "`????\177", argv = 0x7fffffffb660,
>   argv_r_ = 0xfffffe010184fb90 "\b????\177", envv_l_ = 0xfffffe010184fb90 "\b????\177", envv = 0x7fffffffdc08,
>   envv_r_ = 0xfffffe010184fb98 ""}
> (kgdb)
>
> for tid 101243:
>
> (kgdb) f 15
> #15 0xffffffff80508ccc in sys_execve (td=0xfffff800b642b500, uap=0xfffffe010182cb80) at /usr/src/sys/kern/kern_exec.c:218
> 218             error = kern_execve(td, &args, NULL);
> (kgdb) print *uap
> $5 = {fname_l_ = 0xfffffe010182cb80 "??\205\002\b", fname = 0x80285cfc0 <Address 0x80285cfc0 out of bounds>,
>   fname_r_ = 0xfffffe010182cb88 "`????\177", argv_l_ = 0xfffffe010182cb88 "`????\177", argv = 0x7fffffffb660,
>   argv_r_ = 0xfffffe010182cb90 "\b????\177", envv_l_ = 0xfffffe010182cb90 "\b????\177", envv = 0x7fffffffdc08,
>   envv_r_ = 0xfffffe010182cb98 ""}
> (kgdb)

I think that you see garbage in those structures because they contain pointers to userland data.
Hmm, I've just noticed another interesting thread:

Thread 668 (Thread 101245):
#0  sched_switch (td=0xfffff800b642aa00, newtd=0xfffff8000285f000, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
#1  0xffffffff80561ae2 in mi_switch (flags=<value optimized out>, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:455
#2  0xffffffff805ae8da in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:646
#3  0xffffffff805614b1 in _sleep (ident=<value optimized out>, lock=<value optimized out>, priority=<value optimized out>, wmesg=0xffffffff809c51bc "vmpfw", sbt=0, pr=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/kern_synch.c:229
#4  0xffffffff8089d1c1 in vm_page_busy_sleep (m=0xfffff800df68cd40, wmesg=<value optimized out>) at /usr/src/sys/vm/vm_page.c:753
#5  0xffffffff8089dd4d in vm_page_sleep_if_busy (m=0xfffff800df68cd40, msg=0xffffffff809c51bc "vmpfw") at /usr/src/sys/vm/vm_page.c:1086
#6  0xffffffff80886be9 in vm_fault_hold (map=<value optimized out>, vaddr=<value optimized out>, fault_type=4 '\004', fault_flags=0, m_hold=0x0) at /usr/src/sys/vm/vm_fault.c:495
#7  0xffffffff80885448 in vm_fault (map=0xfffff80011d66000, vaddr=<value optimized out>, fault_type=4 '\004', fault_flags=<value optimized out>) at /usr/src/sys/vm/vm_fault.c:273
#8  0xffffffff808d3c49 in trap_pfault (frame=0xfffffe0101836c00, usermode=1) at /usr/src/sys/amd64/amd64/trap.c:741
#9  0xffffffff808d3386 in trap (frame=0xfffffe0101836c00) at /usr/src/sys/amd64/amd64/trap.c:333
#10 0xffffffff808b7af1 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236

I strongly suspect that this is the thread that we were looking for.
I think that it has the vnode lock in the shared mode while trying to fault in a page.
Could you please check that by going to frame 6 and printing 'fs' and '*fs.vp'?

It'd be interesting to understand why this thread is waiting here.
So, please also print '*fs.m' and '*fs.object'.

Thanks!

-- 
Andriy Gapon