I tried to import a zpool and the process just hung there, doing nothing.
It has been ten minutes now so I tried to hit CTRL-C. That did nothing. So
then I tried :

Sun Microsystems Inc.   SunOS 5.11      snv_110 November 2008
root at opensolaris:~# ps -efl

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
 1 T     root     0     0   0   0 SY fec1f318      0          10:02:47 ?           0:01 sched
 0 S     root     1     0   0  40 20 d3a62448    683 d3291d32 10:02:50 ?           0:00 /sbin/init
 1 S     root     2     0   0   0 SY d3a61bc0      0 fec776b0 10:02:50 ?           0:00 pageout
 .
 .
 .
 0 S     root  1185  1014   0  40 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708

root at opensolaris:~# kill -9 1185
root at opensolaris:~# ps -efl | grep root
 .
 .
 .
 0 S     root  1014  1008   0  50 20 d74ff260   1470 d74ff2cc 10:16:23 pts/2       0:00 -bash
 0 S     root  1185  1014   0  40 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
 .
 .
 .

Okay, I'll kill the shell.

root at opensolaris:~# kill -9 1014
root at opensolaris:~# ps -efl | grep root
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708

root at opensolaris:~# kill -9 1185
root at opensolaris:~# ps -efl | grep root | grep import
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
root at opensolaris:~# kill -9 1185
root at opensolaris:~# ps -efl | grep root | grep import
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
root at opensolaris:~#

root at opensolaris:~# date
Sat May  9 11:29:37 PDT 2009
root at opensolaris:~# ps -efl | grep root | grep import
 0 S     root  1185     1   0  50 20 d74fd040   1943 d7f15c66 11:19:04 pts/2       0:00 zpool import -f -R /a/foo 159890708
root at opensolaris:~#

Seems to be permanently wedged in there.

root at opensolaris:~# truss -faeild -p 1185
truss: unanticipated system error: 1185

So what is the trick to killing this ?

-- 
Dennis
Dennis Clarke wrote:
> I tried to import a zpool and the process just hung there, doing nothing.
> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.

This symptom is consistent with a process blocked waiting on disk I/O.
Are the disks functional?
 -- richard

> So then I tried :
>
> Sun Microsystems Inc.   SunOS 5.11      snv_110 November 2008
> root at opensolaris:~# ps -efl
> [...]
> Seems to be permanently wedged in there.
>
> root at opensolaris:~# truss -faeild -p 1185
> truss: unanticipated system error: 1185
>
> So what is the trick to killing this ?
> Dennis Clarke wrote:
>> I tried to import a zpool and the process just hung there, doing
>> nothing.
>> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.
>
> This symptom is consistent with a process blocked waiting on disk I/O.
> Are the disks functional?

Totally. I'm running with the machine right now.

Dennis
> Dennis Clarke wrote:
>> I tried to import a zpool and the process just hung there, doing
>> nothing.
>> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.
>
> This symptom is consistent with a process blocked waiting on disk I/O.
> Are the disks functional?

dclarke at neptune:~$ zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c0d0s0    ONLINE       0     0     0

errors: No known data errors

dclarke at neptune:~$ zpool get all rpool
NAME   PROPERTY       VALUE                SOURCE
rpool  size           74G                  -
rpool  used           11.3G                -
rpool  available      62.7G                -
rpool  capacity       15%                  -
rpool  altroot        -                    default
rpool  health         ONLINE               -
rpool  guid           3386894308818650832  default
rpool  version        14                   default
rpool  bootfs         rpool/ROOT/snv_111   local
rpool  delegation     on                   default
rpool  autoreplace    off                  default
rpool  cachefile      -                    default
rpool  failmode       continue             local
rpool  listsnapshots  off                  default

dclarke at neptune:~$ su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111 November 2008
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo         ONLINE
          c0d0p0    ONLINE

If I try this again .. it may just hang again. But here goes.

# mkdir /mnt/foo
# zpool import -f -R /mnt/foo 15989070886807735056

and then ... nothing happens. Not too sure what is going on here.

In another window I do this and see the same thing as before :

dclarke at neptune:~$ date ; ps -efl | grep root | grep import
Sat May  9 20:42:11 GMT 2009
 0 S     root  1096  1088   0  50 20 df81e378   1327 d8274526 20:40:38 pts/5       0:00 zpool import -f -R /mnt/foo 1598907

I have to look into this a bit and try to figure out why I am seeing this
thing "foo" and why I cannot import it.

Dennis
Dennis Clarke wrote:
> I tried to import a zpool and the process just hung there, doing nothing.
> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.

It may be because it is blocked in the kernel.

Can you do something like this:

echo "0t<pid of zpool import>::pid2proc|::walk thread|::findstack -v" | mdb -k

and

echo "::threadlist" | mdb -k

Output of the second command may be rather big, so it would be better to
post it somewhere.

Wbr,
Victor

> So then I tried :
> [...]
> So what is the trick to killing this ?
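An aside for readers following along: Victor's mdb one-liner is parameterized only by the pid of the stuck process. A minimal sketch of how it can be assembled from a shell — pid 1185 is the hung `zpool import` from the first message, and since `mdb -k` needs root on a live Solaris kernel, only the command construction is shown:

```shell
# Build the mdb dcmd string for a given pid. "0t" tells mdb the number is
# decimal; ::pid2proc turns it into a proc_t, ::walk thread walks its
# threads, and ::findstack -v prints each thread's kernel stack.
pid=1185                        # stand-in for the wedged zpool import
dcmd="0t${pid}::pid2proc|::walk thread|::findstack -v"
echo "$dcmd"
# As root on the live system you would then run:
#   echo "$dcmd" | mdb -k
```

This is why a plain `truss -p` fails here: the process never returns to user land, so there is nothing for truss to trace, while mdb can look at the kernel side directly.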
> Dennis Clarke wrote:
>> I tried to import a zpool and the process just hung there, doing nothing.
>> It has been ten minutes now so I tried to hit CTRL-C. That did nothing.
>
> It may be because it is blocked in kernel.
>
> Can you do something like this:
>
> echo "0t<pid of zpool import>::pid2proc|::walk thread|::findstack -v" | mdb -k

dclarke at neptune:~$ date ; ps -efl | grep root | grep import
Sat May  9 22:54:45 GMT 2009
 0 S     root  1096  1088   0  50 20 df81e378   1327 d8274526 20:40:38 pts/5       0:00 zpool import -f -R /mnt/foo 1598907

dclarke at neptune:~$ su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111 November 2008
# /bin/echo "0t1096::pid2proc|::walk thread|::findstack -v" | mdb -k
stack pointer for thread e0156100: d575fc54
  d575fc94 swtch+0x188()
  d575fca4 cv_wait+0x53(d8274526, d82744e8, ffffffff, 0)
  d575fce4 txg_wait_synced+0x90(d8274380, 65a, 0, 0)
  d575fd34 spa_config_update_common+0x88(e600fd40, 0, 0, d575fd68)
  d575fd84 spa_import_common+0x3cf()
  d575fdb4 spa_import+0x18(ecee4000, dfa040b0, d75b04e0, febd9444)
  d575fde4 zfs_ioc_pool_import+0xcd(ecee4000, 0, 0)
  d575fe14 zfsdev_ioctl+0xe0()
  d575fe44 cdev_ioctl+0x31(2d80000, 5a02, 80424a0, 100003, daf0b0b0, d575ff00)
  d575fe74 spec_ioctl+0x6b(d83b1d80, 5a02, 80424a0, 100003, daf0b0b0, d575ff00)
  d575fec4 fop_ioctl+0x49(d83b1d80, 5a02, 80424a0, 100003, daf0b0b0, d575ff00)
  d575ff84 ioctl+0x171()
  d575ffac sys_sysenter+0x106()
#

> echo "::threadlist" | mdb -k

# /bin/echo "::threadlist" | mdb -k | wc -l
     542
# /bin/echo "::threadlist" | mdb -k > kern.thread.list
# wc -l kern.thread.list
     541 kern.thread.list

> Output of the second command may be rather big, so it would be better to
> post it somewhere.
see http://www.blastwave.org/dclarke/blog/files/kern.thread.list

I see this line :

e0156100 df81e378        ed00fe70   zpool/1

which seems consistent with what ps says :

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD
 0 S     root  1096  1088   0  50 20 df81e378   1327 d8274526 20:40:38 pts/5       0:00 zpool import -f -R /mnt/foo 1598907

which is not telling me much at the moment.

I'm game to play, what is next here ? By the way, brace yourself, this is
a 32-bit system and even worse than that, take a look at this isalist :

dclarke at neptune:~$ isainfo -v
32-bit i386 applications
        ahf sse2 sse fxsr mmx cmov sep cx8 tsc fpu
dclarke at neptune:~$ isalist
i486 i386 i86

Dennis Clarke
Dennis Clarke wrote:
> # /bin/echo "0t1096::pid2proc|::walk thread|::findstack -v" | mdb -k
> stack pointer for thread e0156100: d575fc54
>   d575fc94 swtch+0x188()
>   d575fca4 cv_wait+0x53(d8274526, d82744e8, ffffffff, 0)
>   d575fce4 txg_wait_synced+0x90(d8274380, 65a, 0, 0)
>   d575fd34 spa_config_update_common+0x88(e600fd40, 0, 0, d575fd68)
>   d575fd84 spa_import_common+0x3cf()
> [...]

So we see that it cannot complete the import here and is waiting for a
transaction group to sync. So probably the spa_sync thread is stuck, and
we need to find out why.

> see http://www.blastwave.org/dclarke/blog/files/kern.thread.list

Sorry, I forgot '-v'; could you do it once again:

echo "::threadlist -v" | mdb -k

Output of this one is going to be a lot bigger.

Or alternatively we can try to get stacks for the two txg_sync_threads:

echo "0xd4abadc0::findstack -v" | mdb -k
echo "0xd381ddc0::findstack -v" | mdb -k

Regards,
Victor
>>> It may be because it is blocked in kernel.
>>> Can you do something like this:
>>> echo "0t<pid of zpool import>::pid2proc|::walk thread|::findstack -v"
>
> So we see that it cannot complete import here and is waiting for
> transaction group to sync. So probably spa_sync thread is stuck, and we
> need to find out why.

Well, the details are going to change, I had to reboot. :-(

I'll start up the stuck thread bug again here by simply starting over.
I'll bet you would be able to learn a few things if you were to ssh into
this machine ?

Regardless, let's start over.

dclarke at neptune:~$ uname -a
SunOS neptune 5.11 snv_111 i86pc i386 i86pc
dclarke at neptune:~$ uptime
  2:04pm  up 10:13,  1 user,  load average: 0.17, 0.16, 0.15
dclarke at neptune:~$ su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_111 November 2008
#
# zpool import
  pool: foo
    id: 15989070886807735056
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        foo         ONLINE
          c0d0p0    ONLINE
#

Please see ALL the details at :

http://www.blastwave.org/dclarke/blog/files/kernel_thread_stuck.README

Also see output from fmdump -eV :

http://www.blastwave.org/dclarke/blog/files/fmdump_e.log

Please let me know what else you may need.

-- 
Dennis Clarke
> Dennis Clarke wrote:
>> Well, the details are going to change, I had to reboot. :-(
>>
>> I'll start up the stuck thread bug again here by simply starting over.
>>
>> please see ALL the details at :
>>
>> http://www.blastwave.org/dclarke/blog/files/kernel_thread_stuck.README
>
> There's a corrupted space map which is being updated as part of the txg
> sync; in order to update it (add a few free ops to the last block), we
> need to read in the current content of the last block from disk first,
> and that fails because it is corrupted (as indicated by checksum errors
> in the fmdump output):
>
> eb5c9dc0 fec1f398        0   0  60 d38e1828
>   PC: _resume_from_idle+0xb1    THREAD: txg_sync_thread()
>   stack pointer for thread eb5c9dc0: eb5c9a28
>     swtch+0x188()
>     cv_wait+0x53()
>     zio_wait+0x55()
>     dbuf_read+0x201()
>     dbuf_will_dirty+0x30()
>     dmu_write+0xd7()
>     space_map_sync+0x304()
>     metaslab_sync+0x284()
>     vdev_sync+0xc6()
>     spa_sync+0x3d0()
>     txg_sync_thread+0x308()
>     thread_start+8()
>
> Victor

I had to cc that back onto the ZFS list, it may be of value here.

I agree that there is something wrong, no doubt, however we should not see
zpool import simply hang and become unresponsive, nor should that pid be
unresponsive to a SIGKILL. Good behaviour should be the norm and that is
not what we see with a stuck kernel thread. Really, we should get some
response to the effect that "a device is corrupt" or similar.

Right now, what the user gets is very little information other than a
non-responsive command.

CTRL+C does nothing and kill -9 pid does nothing to this command.

Feels like a bug to me.

Dennis
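On why `kill -9` appears to do nothing: signals are only acted upon when a thread heads back toward user land, and a thread parked in an uninterruptible kernel `cv_wait()` never gets there. One quick way to recognize the situation is the sleeping state plus a non-empty kernel wait channel in `ps` output. A sketch — `$$` (the current shell) stands in for the wedged pid 1185, and the `s`/`wchan` output specifiers are the Solaris/procps extensions rather than strict POSIX:

```shell
# Print pid, single-letter state, kernel wait channel, and command line
# for one process. For the hung import you would pass -p 1185; we use
# the current shell here only so there is a live pid to report on.
ps -o pid= -o s= -o wchan= -o args= -p $$
```

For the stuck import this shows state `S` with a wait channel matching the WCHAN column already visible in the `ps -efl` listings above.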
> Dennis Clarke wrote:
>> I had to cc that back onto the ZFS list, it may be of value here.
>
> Sorry for that, I've just hit the wrong button ;-)
>
>> CTRL+C does nothing and kill -9 pid does nothing to this command.
>>
>> feels like a bug to me
>
> Yes, it is:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

Oh drat, I thought I hit something new :-\

Not very likely with ZFS, it is pretty well flushed out all the way into
the dark corners I guess.

Dennis
Dennis Clarke wrote:
> I had to cc that back onto the ZFS list, it may be of value here.

Sorry for that, I've just hit the wrong button ;-)

> I agree that there is something wrong, no doubt, however we should not see
> zpool import simply hang and become unresponsive nor should that pid be
> unresponsive to a SIGKILL. Good behaviour should be the norm and that is
> not what we see with a stuck kernel thread. Really, we should get some
> response to the effect that "a device is corrupt" or similar.
>
> Right now, what the user gets, is very little information other than a
> non-responsive command.
>
> CTRL+C does nothing and kill -9 pid does nothing to this command.
>
> feels like a bug to me

Yes, it is:

http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

Regards,
Victor
>> CTRL+C does nothing and kill -9 pid does nothing to this command.
>>
>> feels like a bug to me
>
> Yes, it is:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6758902

Now I recall why I had to reboot. Seems as if a lot of commands hang now.
Things like :

df -ak
zfs list
zpool list

they all just hang.

Dennis

ps: this "machine" is really just an embedded device based on the VIA
chipset. Not too sure if that matters.
Dennis Clarke wrote:
>>> CTRL+C does nothing and kill -9 pid does nothing to this command.
>>>
>>> feels like a bug to me
>>
>> Yes, it is:
>>
>> http://bugs.opensolaris.org/view_bug.do?bug_id=6758902
>
> Now I recall why I had to reboot. Seems as if a lot of commands hang now.
> Things like :
>
> df -ak
>
> zfs list

pid of 'zfs list' is 1754

> zpool list

pid of 'zpool list' is 1873

> they all just hang.

Ok, here's a little bit more detail. pid of 'zpool import' is 1361.

> 0t1361::pid2proc|::walk thread|::findstack -v
stack pointer for thread ec1b8ce0: ec223c54
  ec223c94 swtch+0x188()
  ec223ca4 cv_wait+0x53(eb414966, eb414928, ffffffff, 0)
  ec223ce4 txg_wait_synced+0x90(eb4147c0, 65a, 0, 0)
  ec223d34 spa_config_update_common+0x88(ddbda380, 0, 0, ec223d68)
  ec223d84 spa_import_common+0x3cf()
  ec223db4 spa_import+0x18(ea841000, eb04f5e0, eb04f658, febd9444)
  ec223de4 zfs_ioc_pool_import+0xcd(ea841000, 0, 0)
  ec223e14 zfsdev_ioctl+0xe0()
  ec223e44 cdev_ioctl+0x31(2d80000, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223e74 spec_ioctl+0x6b(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223ec4 fop_ioctl+0x49(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223f84 ioctl+0x171()
  ec223fac sys_call+0x10c()
>

So we see it is waiting for a transaction group to sync.
Let's find the sync thread and see what it is doing; for this we need to
take the first argument to txg_wait_synced() and go from it:

> eb4147c0::print dsl_pool_t dp_tx.tx_sync_thread
dp_tx.tx_sync_thread = 0xeb5c9dc0

Let's see what it is doing:

> 0xeb5c9dc0::findstack -v
stack pointer for thread eb5c9dc0: eb5c9a28
  eb5c9a68 swtch+0x188()
  eb5c9a78 cv_wait+0x53(d38e1828, d38e1820, 0, 0)
  eb5c9aa8 zio_wait+0x55(d38e15e0, d38e15e0, eb5c9af8, 1)
  eb5c9ae8 dbuf_read+0x201(d8795028, 0)
  eb5c9b08 dbuf_will_dirty+0x30(d8795028, ea7fa880, eb5c9b4c, 0)
  eb5c9b68 dmu_write+0xd7(ec5b2e18, 22, 0, 1f0, 0, 10)
  eb5c9c08 space_map_sync+0x304(e0815db8, 1, e0815c20, ec5b2e18, ea7fa880)
  eb5c9c78 metaslab_sync+0x284(e0815c00, 65a, 0, 0)
  eb5c9cb8 vdev_sync+0xc6(e6f8b4c0, 65a, 0)
  eb5c9d28 spa_sync+0x3d0(ddbda380, 65a, 0, 772f6e75)
  eb5c9da8 txg_sync_thread+0x308(eb4147c0, 0)
  eb5c9db8 thread_start+8()
>

It is trying to write an update to a space map, so it needs to read in the
existing content first, and that read fails. Let's check the block pointer:

> d38e15e0::zio -c
ADDRESS      TYPE  STAGE    WAITER
d397baa0     READ  DONE     -
> d397baa0::print zio_t io_bp|::blkptr
DVA[0]: vdev_id 0 / 6091e400
DVA[0]:       GANG: FALSE  GRID:  0000  ASIZE: 20000000000
DVA[0]: :0:6091e400:200:d
DVA[1]: vdev_id 0 / 3e091e400
DVA[1]:       GANG: FALSE  GRID:  0000  ASIZE: 20000000000
DVA[1]: :0:3e091e400:200:d
DVA[2]: vdev_id 0 / 78091e400
DVA[2]:       GANG: FALSE  GRID:  0000  ASIZE: 20000000000
DVA[2]: :0:78091e400:200:d
LSIZE:  1000            PSIZE:  200
ENDIAN: LITTLE          TYPE:   SPA space map
BIRTH:  5b5             LEVEL:  0       FILL:  100000000
CKFUNC: fletcher4       COMP:   lzjb
CKSUM:  1f7bc0ee12:6fcfd90640d:10787c83addaf:1f3ef97a921b6f
>

We clearly see here that the block pointer is the same as the one declared
corrupted by zdb (see the other thread about zdb -e -bbcsL). Let's check
that this is indeed the same pool we are trying to import:

> eb4147c0::print dsl_pool_t dp_spa|::print -d struct spa spa_load_guid
spa_load_guid = 0t15989070886807735056
> ::ps -f ! grep 1361
R   1361   1360   1353   1220      0 0x4a004000 e0596058 zpool import -f -R /mnt/foo 15989070886807735056

Indeed, it is the same. Let's check why the other processes are stuck:

> 0t1783::pid2proc|::walk thread|::findstack -v
stack pointer for thread d362a680: eab3ac94
  eab3acd4 swtch+0x188()
  eab3ad24 turnstile_block+0x70b(d7b09f78, 0, fecb5418, fec04c80, 0, 0)
  eab3ad94 mutex_vector_enter+0x28f(fecb5418)
  eab3adc4 spa_all_configs+0x50(e6f9dd38, e6f9d000, 1020, 1)
  eab3ade4 zfs_ioc_pool_configs+0x16(e6f9d000, e8e3c010, 1020)
  eab3ae14 zfsdev_ioctl+0xe0()
  eab3ae44 cdev_ioctl+0x31(2d80000, 5a04, 8041f00, 100003, e8e3c010, eab3af00)
  eab3ae74 spec_ioctl+0x6b(da7ce340, 5a04, 8041f00, 100003, e8e3c010, eab3af00)
  eab3aec4 fop_ioctl+0x49(da7ce340, 5a04, 8041f00, 100003, e8e3c010, eab3af00)
  eab3af84 ioctl+0x171()
  eab3afac sys_sysenter+0x106()
> fecb5418::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
        fecb5418  adapt        ec1b8ce0      -      -     yes
> ec1b8ce0::findstack -v
stack pointer for thread ec1b8ce0: ec223c54
  ec223c94 swtch+0x188()
  ec223ca4 cv_wait+0x53(eb414966, eb414928, ffffffff, 0)
  ec223ce4 txg_wait_synced+0x90(eb4147c0, 65a, 0, 0)
  ec223d34 spa_config_update_common+0x88(ddbda380, 0, 0, ec223d68)
  ec223d84 spa_import_common+0x3cf()
  ec223db4 spa_import+0x18(ea841000, eb04f5e0, eb04f658, febd9444)
  ec223de4 zfs_ioc_pool_import+0xcd(ea841000, 0, 0)
  ec223e14 zfsdev_ioctl+0xe0()
  ec223e44 cdev_ioctl+0x31(2d80000, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223e74 spec_ioctl+0x6b(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223ec4 fop_ioctl+0x49(da7ce340, 5a02, 80424a0, 100003, e8e3c208, ec223f00)
  ec223f84 ioctl+0x171()
  ec223fac sys_call+0x10c()
>

Ok, we see that 'zpool list' is waiting for our 'zpool import ...' to
complete. Let's see what 'zfs list' is waiting for:

> 0t1754::pid2proc|::walk thread|::findstack -v
stack pointer for thread e6f8d880: eaf7cc94
  eaf7ccd4 swtch+0x188()
  eaf7cd24 turnstile_block+0x70b(d7b09f78, 0, fecb5418, fec04c80, 0, 0)
  eaf7cd94 mutex_vector_enter+0x28f(fecb5418)
  eaf7cdc4 spa_all_configs+0x50(e7c05d38, e7c05000, 1020, 1)
  eaf7cde4 zfs_ioc_pool_configs+0x16(e7c05000, e8e3c010, 1020)
  eaf7ce14 zfsdev_ioctl+0xe0()
  eaf7ce44 cdev_ioctl+0x31(2d80000, 5a04, 8045ed0, 100003, e8e3c010, eaf7cf00)
  eaf7ce74 spec_ioctl+0x6b(da7ce340, 5a04, 8045ed0, 100003, e8e3c010, eaf7cf00)
  eaf7cec4 fop_ioctl+0x49(da7ce340, 5a04, 8045ed0, 100003, e8e3c010, eaf7cf00)
  eaf7cf84 ioctl+0x171()
  eaf7cfac sys_sysenter+0x106()
> fecb5418::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
        fecb5418  adapt        ec1b8ce0      -      -     yes
> ec1b8ce0::findstack -v
[ same stack as shown above: blocked in txg_wait_synced() under spa_import ]
>

The same story. What is the mutex they are waiting for?

> fecb5418::whatis
fecb5418 is spa_namespace_lock+0 in zfs's bss
>

It is spa_namespace_lock.

Hth,
Victor