thr3ads.net - freebsd stable - ZFS raidz recovery [Nov 2010]

Gareth de Vaux

2010-Nov-27 14:13 UTC

ZFS raidz recovery

Hi all, I'm trying to simulate a disk fail and replacement in
a raidz array and failing myself. What'm I doing wrong? Here's
a transcript with interspersed commentary:

root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:20:06 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors
root@file:~# zpool offline raid ad12

reboot
dd if=/dev/zero of=/dev/ad12 ..

root@file:~# zpool replace raid ad12
cannot replace ad12 with ad12: ad12 is busy
root@file:~# zpool replace -f raid ad12
cannot replace ad12 with ad12: ad12 is busy

	The handbook suggests 'replace' but I guess this is only
	if the disk is physically replaced and gets a new identifier?
	Trying with 'online':

root@file:~# zpool online raid ad12
root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Nov 27 13:29:14 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0  15.5K resilvered
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors

	Output remains as such, is this normal?

root@file:~# zpool scrub raid
root@file:~# zpool status
  pool: raid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:37 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0 2.11K  87.7M repaired
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors
root@file:~# zpool scrub raid
root@file:~# zpool status
  pool: raid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:55 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0 2.11K
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors

	These are checksum errors? So the disk hasn't been integrated
	properly?

root@file:~# zpool clear raid ad12
root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:39:09 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors
root@file:~# zpool status -x
all pools are healthy

	To make sure this's the case I fail a different disk:

root@file:~# zpool offline raid ad6
root@file:~# zpool status
  pool: raid
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:40:52 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        DEGRADED     0     0     0
	  raidz1    DEGRADED     0     0     0
	    ad12    ONLINE       0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     OFFLINE      0     0     0

errors: No known data errors

	on reboot the status changes:

root@file:~# zpool status
  pool: raid
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	raid        FAULTED      0     0     1  corrupted data
	  raidz1    DEGRADED     0     0     6
	    ad12    OFFLINE      0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     1


The same happens if I recreate the array and try again.
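
Editor's aside: the transcript above relies on reading the READ/WRITE/CKSUM columns by eye. As a minimal sketch, assuming the `zpool status` layout shown in this post, the shell function below lists any row whose counters are non-zero. The function name `zpool_error_devices` is made up for illustration and is not part of ZFS.

```shell
#!/bin/sh
# Hypothetical helper (not a ZFS tool): list vdevs whose READ, WRITE or
# CKSUM counter is non-zero. Reads `zpool status` text on stdin and prints
# "name read write cksum" for each matching row.
zpool_error_devices() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line or the "errors:" line ends the table.
        in_table && (NF == 0 || $1 == "errors:") { in_table = 0 }
        # Counters may be abbreviated (e.g. 2.11K), so compare as strings.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $3, $4, $5
        }
    '
}

# Usage on a live system (assumes a pool named "raid"):
#   zpool status raid | zpool_error_devices
```

Comparing the counters as strings rather than numbers is deliberate, since `zpool status` abbreviates large counts with a suffix such as `K`.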

Jeremy Chadwick

2010-Nov-27 15:30 UTC

On Sat, Nov 27, 2010 at 03:22:49PM +0200, Gareth de Vaux wrote:
> Hi all, I'm trying to simulate a disk fail and replacement in
> a raidz array and failing myself. What'm I doing wrong? Here's
> a transcript with interspersed commentary:
uname -a please -- it matters greatly.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Gareth de Vaux

2010-Nov-27 16:02 UTC

On Sat 2010-11-27 (07:30), Jeremy Chadwick wrote:
> uname -a please -- it matters greatly.

$ uname -a
FreeBSD file 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #0: Wed Nov 24 07:56:04 SAST 2010     root@file:/usr/obj/usr/src/sys/COWNEL  amd64

Gareth de Vaux

2010-Dec-06 11:08 UTC

On Sat 2010-11-27 (15:22), Gareth de Vaux wrote:
> Hi all, I'm trying to simulate a disk fail and replacement in
> a raidz array and failing myself. What'm I doing wrong? Here's

OK, I did some science. It looks like the array doesn't like me
throwing zeros at the disk while it's 'offline'. If I take the
disk offline, just fiddle with the array's data, and then set the
disk online again, it resilvers fine.

'zpool replace' also only works if you physically swap out a disk
at the same port, or replace disk1 with disk2 online. 'zpool remove'
and 'zpool detach' don't remove devices from a raidz.

So I can recover an array if I have an extra disk to play with,
either to use temporarily or to swap in. If I don't, and a disk is
giving trouble, I can't drop it from the array, try to do something
with it, and reinsert it.
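
Editor's aside: the two sequences reported to work above can be sketched as a dry-run shell script. This is an illustration under assumptions, not a tested recovery procedure. `ad14` is a hypothetical spare disk that does not appear in the thread, and `run`, `DRY_RUN`, `offline_online_cycle` and `replace_with_spare` are made-up names. One possible explanation for the original failure, which the thread does not confirm, is that `zpool online` resilvers only what changed while the disk was offline, so a disk zeroed in the meantime is not fully rebuilt.

```shell
#!/bin/sh
# Dry-run sketch: commands are printed, not executed, while DRY_RUN=1.
# Change to 0 only on a test system where running them is acceptable.
DRY_RUN=1

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Case 1: take a disk offline and bring the SAME, untouched disk back.
# Arguments: pool, device.
offline_online_cycle() {
    run zpool offline "$1" "$2"
    # (the disk itself is not overwritten while offline)
    run zpool online "$1" "$2"
    run zpool status "$1"
}

# Case 2: replace a pool member with a DIFFERENT disk.
# Arguments: pool, old device, new device.
replace_with_spare() {
    run zpool replace "$1" "$2" "$3"
    run zpool status "$1"
}

# Example: ad14 is a hypothetical spare, not part of the original pool.
replace_with_spare raid ad12 ad14
```

With `DRY_RUN=1` the script only echoes each `zpool` command prefixed by `+`, which makes it safe to review the sequence before trying it on a scratch pool.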
