I have a ZFS pool made of two vdevs, each using one whole physical disk, under OpenSolaris 2008.05. The disks live on a Netcell SATA/RAID controller, which has three ports (I had planned to use three disks there and configure mirrors in ZFS), but as it turned out it could only present one or two disks to the system. So I decided to create a mirror of two identical drives on the controller instead of one of the vdevs in the ZFS pool.

I did a 'dd' dump of the whole disk of the second vdev, and created an array on the controller with a second disk to mirror that one. After booting into Solaris, 'zpool status' reported that it couldn't use the disk because the label was missing or corrupted. I restored the dump made previously with dd, and 'zpool status' then reported the vdev in state FAILED, data corrupted, ~100k files with errors. Upon reboot, the controller's BIOS reported that one of the disks in the mirror was failing and needed to be replaced, so I rebuilt the array. After booting, ZFS still reports the disk as FAILED. An attempt to scrub crashes and reboots the system. Restoring the dump breaks the array from the controller's point of view. It seems the controller stores some configuration information on the disk, and that conflicts with ZFS using the whole disk. This is an example of when using a whole disk for ZFS is not a good idea.

I reverted to using a one-disk array on the controller, as it was when I created the ZFS filesystem, but that does not help either. 'zdb -l' "fails to unpack" labels 0 and 1; labels 2 and 3 look OK to me, showing correct ZFS info. 'format' reports this disk as being part of an active ZFS pool (as does the other disk, which was part of the mirror and is now connected via USB). 'zpool replace' also refuses to replace, because it thinks the second disk is part of an active pool.

Is there a way to recover from this problem? I'm pretty sure the data is still OK, it's just the labels that get "corrupted" by the controller or ZFS. :(
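(For anyone trying to reproduce the diagnosis, the checks I mean are roughly the following; the device name is just an example from my setup, substitute your own:

  # zpool status -v
  # zdb -l /dev/rdsk/c1t0d0s0

zdb -l prints all four copies of the vdev label (ZFS keeps two at the start of the device and two at the end), which is how I could see that copies 0 and 1 were gone while 2 and 3 were still intact.)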
> Is there a way to recover from this problem? I'm
> pretty sure the data is still OK, it's just the labels
> that get "corrupted" by the controller or ZFS. :(

And this is confirmed by zdb, after a looooong wait for the comparison of data and checksums: no data errors.
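(For reference, that long checksum comparison was zdb's block traversal; the pool name "tank" is just a placeholder and the exact flags may vary between builds, so treat this as a sketch:

  # zdb -cc tank

A single -c verifies the checksums of metadata blocks while gathering block statistics; repeating it as -cc should also verify the data blocks, which is what takes so long on a well-filled pool.)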
Oleg Muravskiy
2008-Oct-21 10:33 UTC
[zfs-discuss] Solution: Recover after disk labels "failure"
I recovered the pool by doing export, import and scrub. Apparently you can export a pool with a FAILED device, and import will restore the labels from the backup copies. Data errors are still there after the import, so you need to scrub the pool. After all that the filesystem is back with no errors/problems.

It would be nice if the documentation mentioned this, namely that before trying to replace disks or restore backups, you could try an export/import.

Also, it is not clear what "zpool clear" actually clears (what a nice use of the word "clear"!). It does not clear data errors recorded within the pool. In my case they were registered when I tried to read data from the pool with one device marked as FAILED (when in fact only the label was corrupted, the data itself was OK), and they disappeared after the scrub.

So my thanks go to the people on the Internet who share their findings about ZFS, and to the ZFS developers who made such a robust system (I still think it's the best of all [free] systems I have used).
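For completeness, the whole recovery boiled down to three commands (the pool name "tank" stands in for my real pool name):

  # zpool export tank
  # zpool import tank
  # zpool scrub tank

You can watch the scrub progress with "zpool status -v tank"; once it finished, the recorded errors were gone and the pool reported no data errors.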