I tried to import a zpool and the process just hung there, doing nothing.
It has been ten minutes now so I tried to hit CTRL-C. That did nothing. So
then I tried :

Sun Microsystems Inc.   SunOS 5.11      snv_110 November 2008
root at opensolaris:~# ps -efl

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
 1 T     root     0     0   0   0 SY fec1f318      0          10:02:47 ?           0:01 sched
 0 S     root     1     0   0  40 20 d3a62448    683 d3291d32 10:02:50 ?           0:00 /sbin/init
 1 S     root     2     0   0   0 SY d3a61bc0      0 fec776b0 10:02:50 ?           0:00 pageout
 .
 .
 .
 0 S     root  1185  1014   0  40 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708

root at opensolaris:~# kill -9 1185
root at opensolaris:~# ps -efl | grep root
 .
 .
 .
 0 S     root  1014  1008   0  50 20 d74ff260   1470 d74ff2cc 10:16:23 pts/2       0:00 -bash
 0 S     root  1185  1014   0  40 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
 .
 .
 .

Okay, I'll kill the shell.

root at opensolaris:~# kill -9 1014
root at opensolaris:~# ps -efl | grep root
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708

root at opensolaris:~# kill -9 1185
root at opensolaris:~# ps -efl | grep root | grep import
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
root at opensolaris:~# kill -9 1185
root at opensolaris:~# ps -efl | grep root | grep import
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
root at opensolaris:~#

root at opensolaris:~# date
Sat May  9 11:29:37 PDT 2009
root at opensolaris:~# ps -efl | grep root | grep import
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
root at opensolaris:~#

Seems to be permanently wedged in there.

root at opensolaris:~# truss -faeild -p 1185
truss: unanticipated system error: 1185

So what is the trick to killing this ?

-- 
Dennis
Dennis Clarke wrote:
> I tried to import a zpool and the process just hung there, doing nothing.
> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.

This symptom is consistent with a process blocked waiting on disk I/O.
Are the disks functional?
 -- richard

> So then I tried :
>
> Sun Microsystems Inc.   SunOS 5.11      snv_110 November 2008
> root at opensolaris:~# ps -efl
> [...]
> Seems to be permanently wedged in there.
>
> root at opensolaris:~# truss -faeild -p 1185
> truss: unanticipated system error: 1185
>
> So what is the trick to killing this ?
> Dennis Clarke wrote:
>> I tried to import a zpool and the process just hung there, doing
>> nothing.
>> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.
>
> This symptom is consistent with a process blocked waiting on disk I/O.
> Are the disks functional?

Totally. I'm running with the machine right now.

Dennis
> Dennis Clarke wrote:
>> I tried to import a zpool and the process just hung there, doing
>> nothing.
>> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.
>
> This symptom is consistent with a process blocked waiting on disk I/O.
> Are the disks functional?

dclarke at neptune:~$ zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c0d0s0    ONLINE       0     0     0

errors: No known data errors

dclarke at neptune:~$ zpool get all rpool
NAME   PROPERTY       VALUE                SOURCE
rpool  size           74G                  -
rpool  used           11.3G                -
rpool  available      62.7G                -
rpool  capacity       15%                  -
rpool  altroot        -                    default
rpool  health         ONLINE               -
rpool  guid           3386894308818650832  default
rpool  version        14                   default
rpool  bootfs         rpool/ROOT/snv_111   local
rpool  delegation     on                   default
rpool  autoreplace    off                  default
rpool  cachefile      -                    default
rpool  failmode       continue             local
rpool  listsnapshots  off                  default

dclarke at neptune:~$ su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111 November 2008
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo         ONLINE
          c0d0p0    ONLINE

If I try this again .. it may just hang again. But here goes.

# mkdir /mnt/foo
# zpool import -f -R /mnt/foo 15989070886807735056

and then ... nothing happens. Not too sure what is going on here.

In another window I do this and see the same thing as before :

dclarke at neptune:~$ date ; ps -efl | grep root | grep import
Sat May  9 20:42:11 GMT 2009
 0 S     root  1096  1088   0  50 20 df81e378   1327 d8274526 20:40:38 pts/5       0:00 zpool import -f -R /mnt/foo 1598907

I have to look into this a bit and try to figure out why I am seeing this
thing "foo" and why I cannot import it.

Dennis
Dennis Clarke wrote:
> I tried to import a zpool and the process just hung there, doing nothing.
> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.

It may be because it is blocked in the kernel.

Can you do something like this:

echo "0t<pid of zpool import>::pid2proc|::walk thread|::findstack -v" | mdb -k

and

echo "::threadlist" | mdb -k

Output of the second command may be rather big, so it would be better to
post it somewhere.

Wbr,
Victor

> So then I tried :
> [...]
> So what is the trick to killing this ?
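An aside for readers following along: Victor's mdb one-liner is parameterized only by the pid of the stuck process. A minimal sketch of how it can be assembled from a shell — pid 1185 is the hung `zpool import` from the first message, and since `mdb -k` needs root on a live Solaris kernel, only the command construction is shown:

```shell
# Build the mdb dcmd string for a given pid. "0t" tells mdb the number is
# decimal; ::pid2proc turns it into a proc_t, ::walk thread walks its
# threads, and ::findstack -v prints each thread's kernel stack.
pid=1185                        # stand-in for the wedged zpool import
dcmd="0t${pid}::pid2proc|::walk thread|::findstack -v"
echo "$dcmd"
# As root on the live system you would then run:
#   echo "$dcmd" | mdb -k
```

This is why a plain `truss -p` fails here: the process never returns to user land, so there is nothing for truss to trace, while mdb can look at the kernel side directly.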
> Dennis Clarke wrote:
>> I tried to import a zpool and the process just hung there, doing nothing.
>> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.
>
> It may be because it is blocked in kernel.
>
> Can you do something like this:
>
> echo "0t<pid of zpool import>::pid2proc|::walk thread|::findstack -v" | mdb -k

dclarke at neptune:~$ date ; ps -efl | grep root | grep import
Sat May  9 22:54:45 GMT 2009
 0 S     root  1096  1088   0  50 20 df81e378   1327 d8274526 20:40:38 pts/5       0:00 zpool import -f -R /mnt/foo 1598907

dclarke at neptune:~$ su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111 November 2008
# /bin/echo "0t1096::pid2proc|::walk thread|::findstack -v" | mdb -k
stack pointer for thread e0156100: d575fc54
  d575fc94 swtch+0x188()
  d575fca4 cv_wait+0x53(d8274526, d82744e8, ffffffff, 0)
  d575fce4 txg_wait_synced+0x90(d8274380, 65a, 0, 0)
  d575fd34 spa_config_update_common+0x88(e600fd40, 0, 0, d575fd68)
  d575fd84 spa_import_common+0x3cf()
  d575fdb4 spa_import+0x18(ecee4000, dfa040b0, d75b04e0, febd9444)
  d575fde4 zfs_ioc_pool_import+0xcd(ecee4000, 0, 0)
  d575fe14 zfsdev_ioctl+0xe0()
  d575fe44 cdev_ioctl+0x31(2d80000, 5a02, 80424a0, 100003, daf0b0b0, d575ff00)
  d575fe74 spec_ioctl+0x6b(d83b1d80, 5a02, 80424a0, 100003, daf0b0b0, d575ff00)
  d575fec4 fop_ioctl+0x49(d83b1d80, 5a02, 80424a0, 100003, daf0b0b0, d575ff00)
  d575ff84 ioctl+0x171()
  d575ffac sys_sysenter+0x106()
#

> echo "::threadlist" | mdb -k

# /bin/echo "::threadlist" | mdb -k | wc -l
     542
# /bin/echo "::threadlist" | mdb -k > kern.thread.list
# wc -l kern.thread.list
     541 kern.thread.list

> Output of the second command may be rather big, so it would be better to
> post it somewhere.
see http://www.blastwave.org/dclarke/blog/files/kern.thread.list

I see this line :

e0156100 df81e378        ed00fe70   zpool/1

which seems consistent with what ps says :

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
 0 S     root  1096  1088   0  50 20 df81e378   1327 d8274526 20:40:38 pts/5       0:00 zpool import -f -R /mnt/foo 1598907

which is not telling me much at the moment.

I'm game to play, what is next here ? By the way, brace yourself, this is
a 32-bit system and even worse than that, take a look at this isalist :

dclarke at neptune:~$ isainfo -v
32-bit i386 applications
        ahf sse2 sse fxsr mmx cmov sep cx8 tsc fpu
dclarke at neptune:~$ isalist
i486 i386 i86

Dennis Clarke
Dennis Clarke wrote:
> # /bin/echo "0t1096::pid2proc|::walk thread|::findstack -v" | mdb -k
> stack pointer for thread e0156100: d575fc54
>   d575fc94 swtch+0x188()
>   d575fca4 cv_wait+0x53(d8274526, d82744e8, ffffffff, 0)
>   d575fce4 txg_wait_synced+0x90(d8274380, 65a, 0, 0)
>   d575fd34 spa_config_update_common+0x88(e600fd40, 0, 0, d575fd68)
>   d575fd84 spa_import_common+0x3cf()
> [...]

So we see that it cannot complete the import here and is waiting for a
transaction group to sync. So probably the spa_sync thread is stuck, and
we need to find out why.

> see http://www.blastwave.org/dclarke/blog/files/kern.thread.list

Sorry, I forgot '-v'; could you do it once again:

echo "::threadlist -v" | mdb -k

Output of this one is going to be a lot bigger.

Or alternatively we can try to get stacks for the two txg_sync_threads:

echo "0xd4abadc0::findstack -v" | mdb -k
echo "0xd381ddc0::findstack -v" | mdb -k

Regards,
Victor
>>> It may be because it is blocked in kernel.
>>> Can you do something like this:
>>> echo "0t<pid of zpool import>::pid2proc|::walk thread|::findstack -v"
>
> So we see that it cannot complete import here and is waiting for
> transaction group to sync. So probably spa_sync thread is stuck, and we
> need to find out why.

Well, the details are going to change, I had to reboot. :-(

I'll start up the stuck thread bug again here by simply starting over.
I'll bet you would be able to learn a few things if you were to ssh into
this machine ?

Regardless, let's start over.

dclarke at neptune:~$ uname -a
SunOS neptune 5.11 snv_111 i86pc i386 i86pc
dclarke at neptune:~$ uptime
  2:04pm  up 10:13,  1 user,  load average: 0.17, 0.16, 0.15
dclarke at neptune:~$ su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111 November 2008
#
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo         ONLINE
          c0d0p0    ONLINE
#

Please see ALL the details at :

http://www.blastwave.org/dclarke/blog/files/kernel_thread_stuck.README

Also see output from fmdump -eV :

http://www.blastwave.org/dclarke/blog/files/fmdump_e.log

Please let me know what else you may need.

-- 
Dennis Clarke
> Dennis Clarke wrote:
>> Well, the details are going to change, I had to reboot. :-(
>>
>> I'll start up the stuck thread bug again here by simply starting over.
>>
>> please see ALL the details at :
>>
>> http://www.blastwave.org/dclarke/blog/files/kernel_thread_stuck.README
>
> There's a corrupted space map which is being updated as part of the txg
> sync; in order to update it (add a few free ops to the last block), we
> need to read in the current content of the last block from disk first,
> and that fails because it is corrupted (as indicated by checksum errors
> in the fmdump output):
>
> eb5c9dc0 fec1f398        0   0  60 d38e1828
>   PC: _resume_from_idle+0xb1    THREAD: txg_sync_thread()
>   stack pointer for thread eb5c9dc0: eb5c9a28
>     swtch+0x188()
>     cv_wait+0x53()
>     zio_wait+0x55()
>     dbuf_read+0x201()
>     dbuf_will_dirty+0x30()
>     dmu_write+0xd7()
>     space_map_sync+0x304()
>     metaslab_sync+0x284()
>     vdev_sync+0xc6()
>     spa_sync+0x3d0()
>     txg_sync_thread+0x308()
>     thread_start+8()
>
> Victor

I had to cc that back onto the ZFS list, it may be of value here.

I agree that there is something wrong, no doubt, however we should not see
zpool import simply hang and become unresponsive, nor should that pid be
unresponsive to a SIGKILL. Good behaviour should be the norm and that is
not what we see with a stuck kernel thread. Really, we should get some
response to the effect that "a device is corrupt" or similar.

Right now, what the user gets is very little information other than a
non-responsive command.

CTRL+C does nothing and kill -9 pid does nothing to this command.

Feels like a bug to me.

Dennis
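On why `kill -9` appears to do nothing: signals are only acted upon when a thread heads back toward user land, and a thread parked in an uninterruptible kernel `cv_wait()` never gets there. One quick way to recognize the situation is the sleeping state plus a non-empty kernel wait channel in `ps` output. A sketch — `$$` (the current shell) stands in for the wedged pid 1185, and the `s`/`wchan` output specifiers are the Solaris/procps extensions rather than strict POSIX:

```shell
# Print pid, single-letter state, kernel wait channel, and command line
# for one process. For the hung import you would pass -p 1185; we use
# the current shell here only so there is a live pid to report on.
ps -o pid= -o s= -o wchan= -o args= -p $$
```

For the stuck import this shows state `S` with a wait channel matching the WCHAN column already visible in the `ps -efl` listings above.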
> Dennis Clarke wrote:
>> I had to cc that back onto the ZFS list, it may be of value here.
>
> Sorry for that, I've just hit the wrong button ;-)
>
>> CTRL+C does nothing and kill -9 pid does nothing to this command.
>>
>> feels like a bug to me
>
> Yes, it is:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

Oh drat, I thought I hit something new :-\

Not very likely with ZFS, it is pretty well flushed out all the way into
the dark corners I guess.

Dennis
Dennis Clarke wrote:
> I had to cc that back onto the ZFS list, it may be of value here.

Sorry for that, I've just hit the wrong button ;-)

> I agree that there is something wrong, no doubt, however we should not see
> zpool import simply hang and become unresponsive nor should that pid be
> unresponsive to a SIGKILL. Good behaviour should be the norm and that is
> not what we see with a stuck kernel thread. Really, we should get some
> response to the effect that "a device is corrupt" or similar.
>
> Right now, what the user gets, is very little information other than a
> non-responsive command.
>
> CTRL+C does nothing and kill -9 pid does nothing to this command.
>
> feels like a bug to me

Yes, it is:

http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

Regards,
Victor
>> CTRL+C does nothing and kill -9 pid does nothing to this command.
>>
>> feels like a bug to me
>
> Yes, it is:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

Now I recall why I had to reboot. Seems as if a lot of commands hang now.
Things like :

df -ak
zfs list
zpool list

they all just hang.

Dennis

ps: this "machine" is really just an embedded device based on the VIA
chipset. Not too sure if that matters.
Dennis Clarke wrote:
>>> CTRL+C does nothing and kill -9 pid does nothing to this command.
>>>
>>> feels like a bug to me
>>
>> Yes, it is:
>>
>> http://bugs.opensolaris.org/view_bug.do?bug_id=6758902
>
> Now I recall why I had to reboot. Seems as if a lot of commands hang now.
> Things like :
>
> df -ak
>
> zfs list

pid of 'zfs list' is 1754

> zpool list

pid of 'zpool list' is 1873

> they all just hang.

Ok, here's a little bit more detail. pid of 'zpool import' is 1361.

> 0t1361::pid2proc|::walk thread|::findstack -v
stack pointer for thread ec1b8ce0: ec223c54
  ec223c94 swtch+0x188()
  ec223ca4 cv_wait+0x53(eb414966, eb414928, ffffffff, 0)
  ec223ce4 txg_wait_synced+0x90(eb4147c0, 65a, 0, 0)
  ec223d34 spa_config_update_common+0x88(ddbda380, 0, 0, ec223d68)
  ec223d84 spa_import_common+0x3cf()
  ec223db4 spa_import+0x18(ea841000, eb04f5e0, eb04f658, febd9444)
  ec223de4 zfs_ioc_pool_import+0xcd(ea841000, 0, 0)
  ec223e14 zfsdev_ioctl+0xe0()
  ec223e44 cdev_ioctl+0x31(2d80000, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223e74 spec_ioctl+0x6b(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223ec4 fop_ioctl+0x49(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223f84 ioctl+0x171()
  ec223fac sys_call+0x10c()
>

So we see it is waiting for a transaction group to sync.
Let's find the sync thread and see what it is doing; for this we need to
take the first argument to txg_wait_synced() and go from it:

> eb4147c0::print dsl_pool_t dp_tx.tx_sync_thread
dp_tx.tx_sync_thread = 0xeb5c9dc0

Let's see what it is doing:

> 0xeb5c9dc0::findstack -v
stack pointer for thread eb5c9dc0: eb5c9a28
  eb5c9a68 swtch+0x188()
  eb5c9a78 cv_wait+0x53(d38e1828, d38e1820, 0, 0)
  eb5c9aa8 zio_wait+0x55(d38e15e0, d38e15e0, eb5c9af8, 1)
  eb5c9ae8 dbuf_read+0x201(d8795028, 0)
  eb5c9b08 dbuf_will_dirty+0x30(d8795028, ea7fa880, eb5c9b4c, 0)
  eb5c9b68 dmu_write+0xd7(ec5b2e18, 22, 0, 1f0, 0, 10)
  eb5c9c08 space_map_sync+0x304(e0815db8, 1, e0815c20, ec5b2e18, ea7fa880)
  eb5c9c78 metaslab_sync+0x284(e0815c00, 65a, 0, 0)
  eb5c9cb8 vdev_sync+0xc6(e6f8b4c0, 65a, 0)
  eb5c9d28 spa_sync+0x3d0(ddbda380, 65a, 0, 772f6e75)
  eb5c9da8 txg_sync_thread+0x308(eb4147c0, 0)
  eb5c9db8 thread_start+8()
>

It is trying to write an update to a space map, so it needs to read in the
existing content first, and that read fails. Let's check the block pointer:

> d38e15e0::zio -c
ADDRESS      TYPE  STAGE    WAITER
d397baa0     READ  DONE     -
> d397baa0::print zio_t io_bp|::blkptr
DVA[0]: vdev_id 0 / 6091e400
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 20000000000
DVA[0]: :0:6091e400:200:d
DVA[1]: vdev_id 0 / 3e091e400
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 20000000000
DVA[1]: :0:3e091e400:200:d
DVA[2]: vdev_id 0 / 78091e400
DVA[2]:       GANG: FALSE  GRID:  0000  ASIZE: 20000000000
DVA[2]: :0:78091e400:200:d
LSIZE:  1000            PSIZE:  200
ENDIAN: LITTLE          TYPE:   SPA space map
BIRTH:  5b5             LEVEL:  0       FILL:  100000000
CKFUNC: fletcher4       COMP:   lzjb
CKSUM:  1f7bc0ee12:6fcfd90640d:10787c83addaf:1f3ef97a921b6f
>

We clearly see here that the block pointer is the same as the one declared
corrupted by zdb (see the other thread about zdb -e -bbcsL). Let's check
that this is indeed the same pool we are trying to import:

> eb4147c0::print dsl_pool_t dp_spa|::print -d struct spa spa_load_guid
spa_load_guid = 0t15989070886807735056
> ::ps -f ! grep 1361
R   1361   1360   1353   1220      0 0x4a004000 e0596058 zpool import -f -R /mnt/foo 15989070886807735056

Indeed, it is the same. Let's check why the other processes are stuck:

> 0t1783::pid2proc|::walk thread|::findstack -v
stack pointer for thread d362a680: eab3ac94
  eab3acd4 swtch+0x188()
  eab3ad24 turnstile_block+0x70b(d7b09f78, 0, fecb5418, fec04c80, 0, 0)
  eab3ad94 mutex_vector_enter+0x28f(fecb5418)
  eab3adc4 spa_all_configs+0x50(e6f9dd38, e6f9d000, 1020, 1)
  eab3ade4 zfs_ioc_pool_configs+0x16(e6f9d000, e8e3c010, 1020)
  eab3ae14 zfsdev_ioctl+0xe0()
  eab3ae44 cdev_ioctl+0x31(2d80000, 5a04, 8041f00, 100003, e8e3c010, eab3af00)
  eab3ae74 spec_ioctl+0x6b(da7ce340, 5a04, 8041f00, 100003, e8e3c010, eab3af00)
  eab3aec4 fop_ioctl+0x49(da7ce340, 5a04, 8041f00, 100003, e8e3c010, eab3af00)
  eab3af84 ioctl+0x171()
  eab3afac sys_sysenter+0x106()
> fecb5418::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
        fecb5418  adapt        ec1b8ce0      -      -     yes
> ec1b8ce0::findstack -v
stack pointer for thread ec1b8ce0: ec223c54
  ec223c94 swtch+0x188()
  ec223ca4 cv_wait+0x53(eb414966, eb414928, ffffffff, 0)
  ec223ce4 txg_wait_synced+0x90(eb4147c0, 65a, 0, 0)
  ec223d34 spa_config_update_common+0x88(ddbda380, 0, 0, ec223d68)
  ec223d84 spa_import_common+0x3cf()
  ec223db4 spa_import+0x18(ea841000, eb04f5e0, eb04f658, febd9444)
  ec223de4 zfs_ioc_pool_import+0xcd(ea841000, 0, 0)
  ec223e14 zfsdev_ioctl+0xe0()
  ec223e44 cdev_ioctl+0x31(2d80000, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223e74 spec_ioctl+0x6b(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223ec4 fop_ioctl+0x49(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223f84 ioctl+0x171()
  ec223fac sys_call+0x10c()
>

Ok, we see that 'zpool list' is waiting for our 'zpool import ...' to
complete. Let's see what 'zfs list' is waiting for:

> 0t1754::pid2proc|::walk thread|::findstack -v
stack pointer for thread e6f8d880: eaf7cc94
  eaf7ccd4 swtch+0x188()
  eaf7cd24 turnstile_block+0x70b(d7b09f78, 0, fecb5418, fec04c80, 0, 0)
  eaf7cd94 mutex_vector_enter+0x28f(fecb5418)
  eaf7cdc4 spa_all_configs+0x50(e7c05d38, e7c05000, 1020, 1)
  eaf7cde4 zfs_ioc_pool_configs+0x16(e7c05000, e8e3c010, 1020)
  eaf7ce14 zfsdev_ioctl+0xe0()
  eaf7ce44 cdev_ioctl+0x31(2d80000, 5a04, 8045ed0, 100003, e8e3c010, eaf7cf00)
  eaf7ce74 spec_ioctl+0x6b(da7ce340, 5a04, 8045ed0, 100003, e8e3c010, eaf7cf00)
  eaf7cec4 fop_ioctl+0x49(da7ce340, 5a04, 8045ed0, 100003, e8e3c010, eaf7cf00)
  eaf7cf84 ioctl+0x171()
  eaf7cfac sys_sysenter+0x106()
> fecb5418::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
        fecb5418  adapt        ec1b8ce0      -      -     yes
> ec1b8ce0::findstack -v
[ same stack as shown above: blocked in txg_wait_synced() under spa_import ]
>

The same story. What is the mutex they are waiting for?

> fecb5418::whatis
fecb5418 is spa_namespace_lock+0 in zfs's bss
>

It is spa_namespace_lock.

Hth,
Victor