Rince
2006-Nov-01 10:04 UTC
[zfs-discuss] RAID-Z1 pool became faulted when a disk was removed.
So I have attached to my system two 7-disk SCSI arrays, each of 18.2 GB disks. Each of them is a RAID-Z1 zpool.

I had a disk I thought was a dud, so I pulled the fifth disk in my array and put the dud in. Sure enough, Solaris started spitting errors like there was no tomorrow in dmesg, and wouldn't use the disk. Ah well. Remove it, put the original back in - hey, Solaris still thinks the disk is offline, and cfgadm -c unconfigure [disk];cfgadm -c configure [disk] didn't help - okay, sane poweroff.

Hey, this is going to take a while to rescrub, why not switch to the wide SCSI module for this disk array rather than the narrow one? Okay, fine, put the module in (this module is known working and was, in fact, pulled from the other array).

I notice it takes nigh-forever to come back up, and I'm wondering why - it literally took over 5 minutes to give me console login. Commands took at least 5s between being typed and appearing in console - it was obvious something insane was going on. Load average was 6.33, and fmd was taking most of the CPU.

zpool status took about 10 minutes to tell me that it thought c2t2d0 was missing and that c2t4d0 was corrupt, thereby screwing me. Wait, what. I didn't touch that disk, what's going on here.

I try to convince ZFS that the disk is there and usable via zpool online moonside c2t2d0, but it just claims the pool is inaccessible (great, thanks ZFS).

I figure it has to be the module swap that's confusing it, so I power off and switch back. Power back on...nope, still screwed the same way.

I try destroying the pool and importing it, but it "just" tells me the pool is corrupted because c2t4d0 has corrupt metadata.

      pool: moonside
        id: 8290331144559232496
     state: FAULTED
    status: One or more devices contains corrupted data.
    action: The pool cannot be imported due to damaged devices or data.
       see: sun.com/msg/ZFS-8000-5E
    config:

            moonside    FAULTED   corrupted data
              raidz1    FAULTED   corrupted data
                c2t0d0  ONLINE
                c2t1d0  ONLINE
                c2t2d0  ONLINE
                c2t3d0  ONLINE
                c2t4d0  FAULTED   corrupted data
                c2t5d0  ONLINE
                c2t6d0  ONLINE

Thanks, ZFS. One disk (at most, one disk and attempting to use a different SCSI connector) blew up my RAID-Z1. That's...wonderful.

I try rebooting to see if it becomes less confused...

      pool: moonside
        id: 8290331144559232496
     state: FAULTED
    status: One or more devices are missing from the system.
    action: The pool cannot be imported. Attach the missing devices and try again.
       see: sun.com/msg/ZFS-8000-3C
    config:

            moonside    FAULTED   corrupted data
              raidz1    DEGRADED
                c2t0d0  ONLINE
                c2t1d0  ONLINE
                c2t2d0  ONLINE
                c2t3d0  ONLINE
                c2t4d0  UNAVAIL   cannot open
                c2t5d0  ONLINE
                c2t6d0  ONLINE

Uh, what. So the pool is "degraded", but the state is "faulted" because it has corrupted data somewhere that it can't tell me about?

Screw this, force import.

    # zpool import -f moonside
    cannot import 'moonside': I/O error

...what!? I don't even know what that error means in this context, maybe my buddy dmesg does.

    # dmesg | tail
    Nov 1 03:28:02 maou scsi: [ID 193665 kern.info] sd2 at adp0: target 2 lun 0
    Nov 1 03:28:02 maou genunix: [ID 936769 kern.info] sd2 is /pci@0,0/pci9004,7178@b/sd@2,0
    Nov 1 03:28:06 maou genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
    Nov 1 03:28:31 maou last message repeated 7 times
    Nov 1 03:28:52 maou genunix: [ID 408114 kern.info] /pci@0,0/pci9004,7178@b/sd@4,0 (sd4) offline
    Nov 1 03:28:55 maou genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
    Nov 1 03:28:55 maou last message repeated 3 times
    Nov 1 03:29:06 maou scsi: [ID 193665 kern.info] sd4 at adp0: target 4 lun 0
    Nov 1 03:29:06 maou genunix: [ID 936769 kern.info] sd4 is /pci@0,0/pci9004,7178@b/sd@4,0
    Nov 1 03:29:06 maou genunix: [ID 408114 kern.info] /pci@0,0/pci9004,7178@b/sd@4,0 (sd4) online

Nope, dmesg doesn't know either.

Uh, what? Reboots fix everything.
Reboot... Now it's just really confused.

    # zpool import -f moonside
    cannot import 'moonside': one or more devices is currently unavailable

Can it not make up its mind? Does it want the missing seventh device to save it from the mean old corruption on that seventh device? What's with the claimed I/O errors that don't show up in dmesg?

      pool: moonside
        id: 8290331144559232496
     state: FAULTED
    status: One or more devices are missing from the system.
    action: The pool cannot be imported. Attach the missing devices and try again.
       see: sun.com/msg/ZFS-8000-3C
    config:

            moonside    UNAVAIL   insufficient replicas
              raidz1    UNAVAIL   insufficient replicas
                c2t0d0  ONLINE
                c2t1d0  ONLINE
                c2t2d0  FAULTED   corrupted data
                c2t3d0  ONLINE
                c2t4d0  UNAVAIL   cannot open
                c2t5d0  ONLINE
                c2t6d0  ONLINE

Oh wow, that's really special. I'm not sure what's going on at this point. I swear there's no way I could have touched c2t2d0 by accident - this array is really sturdy and requires moderate physical effort to remove a disk from.

Is this behavior "expected", or is this a bug? Furthermore, should I ever expect to be able to see my precious data again?

snv b44, Pentium III 550.

- Rich
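
To make the "insufficient replicas" arithmetic concrete: a single-parity RAID-Z1 vdev can lose at most one member disk. The sketch below is only an illustration of that counting, not anything ZFS itself runs. The helper names (unhealthy_devices, within_parity) are hypothetical, the parsing assumes leaf devices are named in the cNtNdN style seen in the listings above, and the two sample strings are trimmed copies of the second and third config listings.

```python
import re

# Assumption: leaf disks are named like c2t4d0, as in the listings above.
LEAF = re.compile(r"^c\d+t\d+d\d+$")

def unhealthy_devices(config_text):
    """Return (device, state) pairs for leaf disks whose state is not ONLINE."""
    bad = []
    for line in config_text.splitlines():
        fields = line.split()
        # Skip pool and vdev rows (moonside, raidz1); keep only leaf disks.
        if len(fields) >= 2 and LEAF.match(fields[0]) and fields[1] != "ONLINE":
            bad.append((fields[0], fields[1]))
    return bad

def within_parity(config_text, parity=1):
    """True if the number of unhealthy disks does not exceed the parity level."""
    return len(unhealthy_devices(config_text)) <= parity

# Trimmed copy of the second listing (after the first reboot).
SECOND_LISTING = """
moonside FAULTED corrupted data
raidz1 DEGRADED
c2t0d0 ONLINE
c2t1d0 ONLINE
c2t2d0 ONLINE
c2t3d0 ONLINE
c2t4d0 UNAVAIL cannot open
c2t5d0 ONLINE
c2t6d0 ONLINE
"""

# Trimmed copy of the third listing (after the second reboot).
THIRD_LISTING = """
moonside UNAVAIL insufficient replicas
raidz1 UNAVAIL insufficient replicas
c2t0d0 ONLINE
c2t1d0 ONLINE
c2t2d0 FAULTED corrupted data
c2t3d0 ONLINE
c2t4d0 UNAVAIL cannot open
c2t5d0 ONLINE
c2t6d0 ONLINE
"""

if __name__ == "__main__":
    for name, text in (("second", SECOND_LISTING), ("third", THIRD_LISTING)):
        print(name, unhealthy_devices(text), within_parity(text))
```

By this count the second listing has only one bad disk, which single parity should in principle survive, so the refused import there is the puzzling part. The third listing has two bad disks, which is more than RAID-Z1 can reconstruct and is consistent with the "insufficient replicas" state.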