thr3ads.net - zfs discuss - [zfs-discuss] RAID-Z1 pool became faulted when a disk was removed. [Nov 2006]

Rince

2006-Nov-01 10:04 UTC

[zfs-discuss] RAID-Z1 pool became faulted when a disk was removed.

So I have attached to my system two 7-disk SCSI arrays, each of 18.2 GB
disks.

Each of them is a RAID-Z1 zpool.

I had a disk I thought was a dud, so I pulled the fifth disk in my array and
put the dud in. Sure enough, Solaris started spitting errors like there was
no tomorrow in dmesg, and wouldn't use the disk. Ah well. Remove it, put the
original back in - hey, Solaris still thinks the disk is offline, and cfgadm
-c unconfigure [disk]; cfgadm -c configure [disk] didn't help - okay, sane
poweroff. Hey, this is going to take a while to rescrub, why not switch to
the wide SCSI module for this disk array rather than the narrow one? Okay,
fine, put the module in (this module is known working and was, in fact,
pulled from the other array).

I notice it takes nigh-forever to come back up, and I'm wondering why - it
literally took over 5 minutes to give me a console login. Commands took at
least 5s between being typed and appearing in console - it was obvious
something insane was going on. Load average was 6.33, and fmd was taking
most of the CPU. zpool status took about 10 minutes to tell me that it
thought c2t2d0 was missing and that c2t4d0 was corrupt, thereby screwing me.

Wait, what. I didn't touch that disk, what's going on here?

I try to convince ZFS that the disk is there and usable via zpool online
moonside c2t2d0, but it just claims the pool is inaccessible (great, thanks
ZFS). I figure it has to be the module swap that's confusing it, so I
poweroff and switch back. Power back on...nope, still screwed the same way.

I try destroying the pool and importing it, but it "just" tells me the pool
is corrupted because c2t4d0 has corrupt metadata.

  pool: moonside
    id: 8290331144559232496
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: sun.com/msg/ZFS-8000-5E
config:

        moonside    FAULTED   corrupted data
          raidz1    FAULTED   corrupted data
            c2t0d0  ONLINE
            c2t1d0  ONLINE
            c2t2d0  ONLINE
            c2t3d0  ONLINE
            c2t4d0  FAULTED   corrupted data
            c2t5d0  ONLINE
            c2t6d0  ONLINE

Thanks, ZFS. One disk (at most, one disk and attempting to use a different
SCSI connector) blew up my RAID-Z1. That's...wonderful.

I try rebooting to see if it becomes less confused...

  pool: moonside
    id: 8290331144559232496
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: sun.com/msg/ZFS-8000-3C
config:

        moonside    FAULTED   corrupted data
          raidz1    DEGRADED
            c2t0d0  ONLINE
            c2t1d0  ONLINE
            c2t2d0  ONLINE
            c2t3d0  ONLINE
            c2t4d0  UNAVAIL   cannot open
            c2t5d0  ONLINE
            c2t6d0  ONLINE

Uh, what. So the raidz1 is "degraded", but the pool state is "faulted"
because it has corrupted data somewhere that it can't tell me about? Screw
this, force import.

# zpool import -f moonside
cannot import 'moonside': I/O error

...what!? I don't even know what that error means in this context, maybe my
buddy dmesg does.

# dmesg | tail
Nov 1 03:28:02 maou scsi: [ID 193665 kern.info] sd2 at adp0: target 2 lun 0
Nov 1 03:28:02 maou genunix: [ID 936769 kern.info] sd2 is /pci@0,0/pci9004,7178@b/sd@2,0
Nov 1 03:28:06 maou genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
Nov 1 03:28:31 maou last message repeated 7 times
Nov 1 03:28:52 maou genunix: [ID 408114 kern.info] /pci@0,0/pci9004,7178@b/sd@4,0 (sd4) offline
Nov 1 03:28:55 maou genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
Nov 1 03:28:55 maou last message repeated 3 times
Nov 1 03:29:06 maou scsi: [ID 193665 kern.info] sd4 at adp0: target 4 lun 0
Nov 1 03:29:06 maou genunix: [ID 936769 kern.info] sd4 is /pci@0,0/pci9004,7178@b/sd@4,0
Nov 1 03:29:06 maou genunix: [ID 408114 kern.info] /pci@0,0/pci9004,7178@b/sd@4,0 (sd4) online

Nope, dmesg doesn't know either. Uh, what?
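
For anyone who wants to double-check a log like this, one option is to pull
out only the offline/online transitions per device. Below is a minimal Python
sketch, assuming the "(sdN) offline" / "(sdN) online" message format shown in
the dmesg output; the function name and the sample text are illustrative and
not part of any Solaris tool.

```python
import re

# Matches the "(sdN) offline" / "(sdN) online" transitions seen in the
# dmesg output above. Assumption: the device name and the state sit
# next to each other, as they do in that output.
TRANSITION = re.compile(r"\((sd\d+)\)\s+(offline|online)\b")


def device_transitions(log_text):
    """Return {device: [states in order of appearance]}."""
    history = {}
    for device, state in TRANSITION.findall(log_text):
        history.setdefault(device, []).append(state)
    return history


# Three lines taken from the dmesg output in this post.
SAMPLE = """\
Nov 1 03:28:52 maou genunix: [ID 408114 kern.info] /pci@0,0/pci9004,7178@b/sd@4,0 (sd4) offline
Nov 1 03:29:06 maou scsi: [ID 193665 kern.info] sd4 at adp0: target 4 lun 0
Nov 1 03:29:06 maou genunix: [ID 408114 kern.info] /pci@0,0/pci9004,7178@b/sd@4,0 (sd4) online
"""

if __name__ == "__main__":
    for device, states in device_transitions(SAMPLE).items():
        print(device, "->", ", ".join(states))
```

On the lines above this reports only sd4 going offline and then online again,
which matches the point that dmesg shows no I/O errors for any other disk.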

Reboots fix everything. Reboot...

Now it's just really confused.

# zpool import -f moonside
cannot import 'moonside': one or more devices is currently unavailable

Can it not make up its mind? Does it want the missing seventh device to save
it from the mean old corruption on that seventh device? What's with the
claimed I/O errors that don't show up in dmesg?

        moonside    UNAVAIL   insufficient replicas
          raidz1    UNAVAIL   insufficient replicas
            c2t0d0  ONLINE
            c2t1d0  ONLINE
            c2t2d0  FAULTED   corrupted data
            c2t3d0  ONLINE
            c2t4d0  UNAVAIL   cannot open
            c2t5d0  ONLINE
            c2t6d0  ONLINE

Oh wow, that's really special. I'm not sure what's going on at this point. I
swear there's no way I could have touched c2t2d0 by accident - this array is
really sturdy and requires moderate physical effort to remove a disk from.
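
A rough sanity check on that last listing: single-parity RAID-Z can lose one
device and keep going, so two unhealthy disks in one raidz1 would be enough
for "insufficient replicas". The Python sketch below only counts non-ONLINE
disks in the listing as pasted; the helper names are made up for
illustration, and it does not explain the earlier listings, where only one
disk was reported bad.

```python
import re

# Device lines copied from the last config listing in this post.
CONFIG = """\
moonside UNAVAIL insufficient replicas
raidz1 UNAVAIL insufficient replicas
c2t0d0 ONLINE
c2t1d0 ONLINE
c2t2d0 FAULTED corrupted data
c2t3d0 ONLINE
c2t4d0 UNAVAIL cannot open
c2t5d0 ONLINE
c2t6d0 ONLINE
"""

RAIDZ1_PARITY = 1  # single-parity RAID-Z survives the loss of one device


def unhealthy_disks(config_text):
    """Return [(disk, state)] for every cXtYdZ line that is not ONLINE."""
    bad = []
    for line in config_text.splitlines():
        fields = line.split()
        # Only leaf disks (cXtYdZ names) count; skip the pool and vdev rows.
        if len(fields) >= 2 and re.fullmatch(r"c\d+t\d+d\d+", fields[0]):
            if fields[1] != "ONLINE":
                bad.append((fields[0], fields[1]))
    return bad


def raidz1_has_enough_replicas(config_text):
    """True if the number of unhealthy disks fits within single parity."""
    return len(unhealthy_disks(config_text)) <= RAIDZ1_PARITY


if __name__ == "__main__":
    for disk, state in unhealthy_disks(CONFIG):
        print(disk, state)
    print("enough replicas:", raidz1_has_enough_replicas(CONFIG))
```

For this listing it flags c2t2d0 and c2t4d0, one more than a raidz1 can
absorb, which is consistent with the UNAVAIL state even if it says nothing
about why c2t2d0 got flagged in the first place.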

Is this behavior "expected", or is this a bug? Furthermore, should I ever
expect to be able to see my precious data again?

snv b44, Pentium III 550.

- Rich