One of the disks in my ZFS raidz2 pool developed a mechanical failure and had to be replaced. It is possible that I swapped the SATA cables during the exchange, but that has never been a problem in my previous tests. What concerns me is the output from zpool status for the c2d0 disk. The replaced disk is now c3d0 but is no longer a part of the pool?! This is build 75 on x86 with SATA disks on 3 controllers. Please advise on my further actions.

        NAME        STATE     READ WRITE CKSUM
        rz2pool     DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c2d0    FAULTED      0     0     0  corrupted data
            c2d1    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d1    ONLINE       0     0     0
            c10d0   ONLINE       0     0     0
            c10d1   ONLINE       0     0     0
            c11d0   ONLINE       0     0     0
            c11d1   ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0

errors: No known data errors
Since there is no answer yet, here's a simpler(?) question: why does zpool think that I have two c2d0 devices? Even if all disks are offline, zpool still lists two c2d0 entries instead of c2d0 and c3d0. It seems that a logical name is being confused with the physical one, or something...
OK, not a single soul knows this either; this doesn't look promising....

How can I list/edit the metadata(?) that is on my disks or in the pool, so that I can see/edit what each physical disk in the pool has registered? Since I don't know what I'm looking for yet, I can't be more specific in my question. I need some initial pointers to go on, and truss is a bit too low level as a first step....

I simply need to rename/remove one of the erroneous c2d0 entries/disks in the pool so that I can use it in full again, since at this time I can't reconnect the 10th disk in my raid, and if one more disk fails *all my data would be lost* (4 TB is a lot of disk to waste!)
Robert <slask <at> telia.com> writes:
>
> I simply need to rename/remove one of the erroneous c2d0 entries/disks in
> the pool so that I can use it in full again, since at this time I can't
> reconnect the 10th disk in my raid and if one more disk fails all my
> data would be lost (4 TB is a lot of disk to waste!)

You see an erroneous c2d0 device that you claim is in reality c3d0...
If I were you I would try:

$ zpool replace [-f] rz2pool c2d0 c3d0

The -f option may or may not be necessary. Also, what disk devices does this command display?:

$ format

-marc
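A hedged sketch of how Marc's suggestion might look in practice (the device names are assumed from the zpool status output earlier in the thread; format's actual listing will of course vary per system):

$ format </dev/null               # list every disk the OS currently sees, then exit
$ zpool replace -f rz2pool c2d0 c3d0   # tell ZFS the member formerly at c2d0 is now at c3d0
$ zpool status -v rz2pool         # watch the resilver start and the vdev state change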
Robert wrote:
> OK, not a single soul knows this either; this doesn't look promising....
>
> How can I list/edit the metadata(?) that is on my disks or in the pool, so
> that I can see/edit what each physical disk in the pool has registered?

To view but not edit you can use /usr/sbin/zdb

--
Darren J Moffat
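A hedged example of the kind of inspection Darren is pointing at (the device path and slice are assumptions and will depend on how the disks were added to the pool):

$ zdb -l /dev/dsk/c2d0s0          # dump the four ZFS labels stored on that disk
$ zdb -C                          # print the cached pool configuration from zpool.cache

Each on-disk label carries the pool GUID and the vdev GUID, which is how ZFS identifies a member independently of its c?d? device name.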
I finally found the cause of the error....

Since my disks are mounted in cassettes of four each, I had to disconnect all the cables to them in order to replace the crashed disk.

When re-attaching the cables I reversed their order by accident. In my early tests this was not a problem, since ZFS identified the disks anyway, regardless of which controller a disk was connected to (as long as the controller was visible to the pool).

What happened was some kind of a race condition(?). Since the disk on controller c2d0 crashed, it was listed as corrupt. But because I connected a healthy disk (already a member of the pool, from c3d1) to c2d0, instead of the new disk that was supposed to replace the crashed one, my problem developed.

ZFS therefore listed the original c2d0 disk as faulty, but then realized that there was a healthy disk on c2d0, i.e. c2d0 was both OK and faulty!! This means that no command acting on c2d0 would succeed, because both disks were listed as c2d0.

In this condition there seems to be no way of telling ZFS to discard the faulty disk entry, since both are assigned the name c2d0 and are/were connected to the same controller.

My resolution (after many hours of moving disks and reboots to find the error) was to make sure that only the new, not-yet-assigned disk was connected to c2d0 and none of the other disks already assigned to the pool.

I must consider it a bug that you can't remove/clear this kind of error in ZFS to be able to repair your pool.

After all my efforts it seems that my pool became corrupted in the end (probably due to scrubbing and resilvering), so I have some hours to kill, restoring my data......
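A hedged outline of the kind of recovery sequence described above (a sketch, not a transcript; whether the export/import is strictly required depends on the state the pool is in at that point):

$ zpool export rz2pool            # quiesce the pool before touching the cabling
  (physically attach only the new, blank disk to c2d0; leave every disk
   that already belongs to the pool on its own controller)
$ zpool import rz2pool            # re-import; members are matched by their label GUIDs
$ zpool replace -f rz2pool c2d0   # resilver onto the new disk now sitting at c2d0
$ zpool status -v rz2pool         # confirm the resilver completes and the vdev goes ONLINE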
Wade.Stuart at fallon.com
2008-Jan-11 16:49 UTC
[zfs-discuss] ZFS problem after disk failure
zfs-discuss-bounces at opensolaris.org wrote on 01/10/2008 08:07:37 PM:

> I finally found the cause of the error....
>
> Since my disks are mounted in cassettes of four each, I had to
> disconnect all the cables to them in order to replace the crashed disk.
>
> When re-attaching the cables I reversed their order by accident. In my
> early tests this was not a problem, since ZFS identified the disks
> anyway, regardless of which controller a disk was connected to (as long
> as the controller was visible to the pool).
>
> What happened was some kind of a race condition(?). Since the disk on
> controller c2d0 crashed, it was listed as corrupt. But because I
> connected a healthy disk (already a member of the pool, from c3d1) to
> c2d0, instead of the new disk that was supposed to replace the crashed
> one, my problem developed.
>
> ZFS therefore listed the original c2d0 disk as faulty, but then realized
> that there was a healthy disk on c2d0, i.e. c2d0 was both OK and
> faulty!! This means that no command acting on c2d0 would succeed,
> because both disks were listed as c2d0.
>
> In this condition there seems to be no way of telling ZFS to discard the
> faulty disk entry, since both are assigned the name c2d0 and are/were
> connected to the same controller.
>
> My resolution (after many hours of moving disks and reboots to find the
> error) was to make sure that only the new, not-yet-assigned disk was
> connected to c2d0 and none of the other disks already assigned to the
> pool.
>
> I must consider it a bug that you can't remove/clear this kind of error
> in ZFS to be able to repair your pool.
>
> After all my efforts it seems that my pool became corrupted in the end
> (probably due to scrubbing and resilvering), so I have some hours to
> kill, restoring my data......
>

I am confused; I was under the impression that ZFS actually checks the ZFS labels on the disks to make sure they are correct when importing (to avoid disk rename issues exactly like the above)? Is this an edge case that has not been accounted for, or am I misunderstanding the disk label/name semantics?

-Wade
To me it seems it's a special case that has not been accounted for...

While ZFS does seem to check the disks against the pool and handle them nicely using the labels/metadata, even when they are mounted on different controllers, the problem I encountered is that a specific device was flagged faulty while an already valid member of the pool was sitting at that same controller location. ZFS was telling me that c2d0 went bad, but ZFS also wants you to clear the error once you have 'fixed' it. With a faulty disk and swapped controllers you get two mechanisms fighting over the same cause -- c2d0 is faulty and c2d0 is OK.

This is why I want zpool to have some override option to clear an old entry that may no longer be valid, due to moved disks or something similar.
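For reference, the closest existing knobs (a sketch only; as the thread shows, none of these could disambiguate two entries that both carried the name c2d0, since they take the device name on the command line, while ZFS only tells the two entries apart internally by their vdev GUIDs -- whether a raw GUID is accepted in place of the name on this build is not something I have verified):

$ zpool clear rz2pool c2d0        # clear the error counters for a device
$ zpool offline rz2pool c2d0      # take a device offline
$ zpool detach rz2pool c2d0       # only valid for mirrors and hot spares, not raidz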