thr3ads.net - zfs discuss - [zfs-discuss] Recovering from an apparent ZFS Hang [Jul 2010]

Brian Leonard

2010-Jul-12 16:45 UTC

[zfs-discuss] Recovering from an apparent ZFS Hang

Hi,

I'm currently trying to work with a quad-bay USB drive enclosure.
I've created a raidz pool as follows:

bleonard@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: ONLINE
 scrub: none requested
config:

	NAME          STATE     READ WRITE CKSUM
	r5pool        ONLINE       0     0     0
	  raidz1      ONLINE       0     0     0
	    c1t0d0p0  ONLINE       0     0     0
	    c1t0d1p0  ONLINE       0     0     0
	    c1t0d2p0  ONLINE       0     0     0
	    c1t0d3p0  ONLINE       0     0     0

errors: No known data errors

If I pop a disk and run a zpool scrub, the fault is noted:

bleonard@opensolaris:~# zpool scrub r5pool
bleonard@opensolaris:~# zpool status r5pool
  pool: r5pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Mon Jul 12 12:35:46 2010
config:

	NAME          STATE     READ WRITE CKSUM
	r5pool        DEGRADED     0     0     0
	  raidz1      DEGRADED     0     0     0
	    c1t0d0p0  ONLINE       0     0     0
	    c1t0d1p0  ONLINE       0     0     0
	    c1t0d2p0  FAULTED      0     0     0  corrupted data
	    c1t0d3p0  ONLINE       0     0     0

errors: No known data errors

However, it's when I pop the disk back in that everything goes south.
If I run a zpool scrub at this point, the command appears to just hang.

Running zpool status again shows the scrub will finish in 2 minutes, but it
never does. You can see it's been running for 33 minutes already, and
there's no data in the pool.

bleonard@opensolaris:/r5pool# zpool status r5pool
  pool: r5pool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 0h33m, 92.41% done, 0h2m to go
config:

	NAME          STATE     READ WRITE CKSUM
	r5pool        ONLINE       0     0     0
	  raidz1      ONLINE       0     0     0
	    c1t0d0p0  ONLINE       0     0     0
	    c1t0d1p0  ONLINE       0     0     0
	    c1t0d2p0  ONLINE       0     0     0
	    c1t0d3p0  ONLINE       0     0     0

errors: 24 data errors, use '-v' for a list

zpool scrub -s r5pool doesn't have any effect.

I can't even kill the scrub process. Even a reboot command at this
point will hang the machine, so I have to hard power-cycle the machine to get
everything back to normal. There must be a more elegant solution, right?
-- 
This message posted from opensolaris.org

Cindy Swearingen

2010-Jul-12 17:20 UTC

Hi Brian,

What are you trying to determine? How the pool behaves when a drive is
yanked out?

It's hard to tell how a pool will react with external USB drives. I think
it will also depend on how the system handles a device removal.

I created a similar raidz pool with non-USB devices, offlined a disk,
and ran a scrub. It works as expected. See the output below. Could
you retry your test with an offline rather than a yank and see if
the system hangs?

In addition, we don't support pools that are created on p* devices.
Use the c1t0d* names instead.

Thanks,

Cindy

# zpool create rzpool raidz1 c2t6d0 c2t7d0 c2t8d0
# zpool offline rzpool c2t8d0
# zpool status rzpool
   pool: rzpool
  state: DEGRADED
status: One or more devices has been taken offline by the administrator.
         Sufficient replicas exist for the pool to continue functioning in a
         degraded state.
action: Online the device using 'zpool online' or replace the device with
         'zpool replace'.
  scan: none requested
config:

         NAME        STATE     READ WRITE CKSUM
         rzpool      DEGRADED     0     0     0
           raidz1-0  DEGRADED     0     0     0
             c2t6d0  ONLINE       0     0     0
             c2t7d0  ONLINE       0     0     0
             c2t8d0  OFFLINE      0     0     0

errors: No known data errors
# zpool scrub rzpool
# zpool status rzpool
   pool: rzpool
  state: DEGRADED
status: One or more devices has been taken offline by the administrator.
         Sufficient replicas exist for the pool to continue functioning in a
         degraded state.
action: Online the device using 'zpool online' or replace the device with
         'zpool replace'.
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Jul 12 09:56:36 2010
config:

         NAME        STATE     READ WRITE CKSUM
         rzpool      DEGRADED     0     0     0
           raidz1-0  DEGRADED     0     0     0
             c2t6d0  ONLINE       0     0     0
             c2t7d0  ONLINE       0     0     0
             c2t8d0  OFFLINE      0     0     0

errors: No known data errors
# zpool status rzpool
   pool: rzpool
  state: ONLINE
  scan: resilvered 14K in 0h0m with 0 errors on Mon Jul 12 10:12:55 2010
config:

         NAME        STATE     READ WRITE CKSUM
         rzpool      ONLINE       0     0     0
           raidz1-0  ONLINE       0     0     0
             c2t6d0  ONLINE       0     0     0
             c2t7d0  ONLINE       0     0     0
             c2t8d0  ONLINE       0     0     0

errors: No known data errors
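
For anyone repeating this test on the USB pool, the offline-based sequence above can be sketched as a small shell script. This is only a sketch under assumptions: the pool name r5pool and the device c1t0d2p0 are taken from Brian's output and may not match your system, and DRY_RUN, run and offline_test are hypothetical helper names, not part of ZFS. By default it only prints the commands it would run.

```shell
# Sketch only: pool and device names come from the thread; adjust for your system.
POOL="${POOL:-r5pool}"
DEVICE="${DEVICE:-c1t0d2p0}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 prints commands, 0 executes them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

offline_test() {
    run zpool offline "$POOL" "$DEVICE"   # take the disk out of service cleanly
    run zpool scrub "$POOL"               # start a scrub on the degraded pool
    run zpool status "$POOL"              # the disk should be listed as OFFLINE
    run zpool online "$POOL" "$DEVICE"    # bring the disk back into the pool
    run zpool status "$POOL"              # check the pool state afterwards
}

offline_test
```

Setting DRY_RUN=0 executes the commands for real. The scrub runs in the background, so the first status may still report it as in progress.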


Brian Leonard

2010-Jul-13 17:04 UTC

Hi Cindy,

I'm trying to demonstrate how ZFS behaves when a disk fails. The drive
enclosure I'm using (http://www.icydock.com/product/mb561us-4s-1.html)
says it supports hot swap, but that's not what I'm
experiencing. When I plug the disk back in, all 4 disks are no longer
recognizable until I restart the enclosure.

This same demo works fine when using USB sticks, and maybe that's
because each USB stick has its own controller.

Thanks for your help,
Brian
-- 
This message posted from opensolaris.org

Brian Leonard

2010-Jul-13 22:28 UTC

Actually, there's still the primary issue of this post - the apparent
hang. At the moment, I have 3 zpool commands running, all apparently hung and
doing nothing:

bleonard@opensolaris:~$ ps -ef | grep zpool
    root 20465 20411   0 18:10:44 pts/4       0:00 zpool clear r5pool
    root 20408 20403   0 18:08:19 pts/3       0:00 zpool status r5pool
    root 20396 17612   0 18:08:04 pts/2       0:00 zpool scrub r5pool

You can see all of them are not very busy, and seem to be waiting on something:

bleonard@opensolaris:~# ptime -p 20465
real    12:25.188031517
user        0.004037420
sys         0.008682963

bleonard@opensolaris:~# ptime -p 20408
real    15:03.977246851
user        0.002700817
sys         0.005662413

bleonard@opensolaris:~# ptime -p 20396
real    15:24.793176743
user        0.002954137
sys         0.014851215

And as I said earlier, I can't control+break or kill any of these
processes. Time for hard-reboot.
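
To put a number on "not very busy", the ptime figures above can be turned into CPU time as a percentage of wall-clock time. The following is a minimal sketch; cpu_percent is a hypothetical helper name, and it assumes the real/user/sys layout shown in the output above, with times written as [[h:]m:]s.fraction.

```shell
# Read ptime-style output on stdin and print (user + sys) as a
# percentage of real (wall-clock) time, to four decimal places.
cpu_percent() {
    awk '
        # Convert [[h:]m:]s.fraction into seconds.
        function secs(t,    n, parts, i, s) {
            n = split(t, parts, ":")
            s = 0
            for (i = 1; i <= n; i++) s = s * 60 + parts[i]
            return s
        }
        $1 == "real" { r = secs($2) }
        $1 == "user" { u = secs($2) }
        $1 == "sys"  { k = secs($2) }
        END { if (r > 0) printf "%.4f\n", 100 * (u + k) / r }
    '
}

# Usage with the figures reported for PID 20465 (zpool clear):
cpu_percent <<'EOF'
real    12:25.188031517
user        0.004037420
sys         0.008682963
EOF
```

For all three processes the result is well under a hundredth of a percent, which is consistent with them being blocked and waiting rather than doing work.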

/Brian
-- 
This message posted from opensolaris.org
