One of the disks in my ZFS raidz2 pool developed a mechanical failure and had to be replaced. It is possible that I swapped the SATA cables during the exchange, but that has never been a problem in my previous tests. What concerns me is the output from zpool status for the c2d0 disk. The replaced disk is now c3d0 but is no longer a part of the pool?! This is build 75 on x86 with SATA disks on 3 controllers. Please advise on my further actions.

        NAME        STATE     READ WRITE CKSUM
        rz2pool     DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c2d0    FAULTED      0     0     0  corrupted data
            c2d1    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d1    ONLINE       0     0     0
            c10d0   ONLINE       0     0     0
            c10d1   ONLINE       0     0     0
            c11d0   ONLINE       0     0     0
            c11d1   ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0

errors: No known data errors
Since there is no answer yet, here's a simpler(?) question: why does zpool think that I have two c2d0 devices? Even if all disks are offline, zpool still lists two c2d0 entries instead of c2d0 and c3d0. It seems that a logical name is being confused with the physical one, or something...
OK, not a single soul knows this either; this doesn't look promising....

How can I list/edit the metadata(?) that is on my disks or in the pool, so that I can see/edit what each physical disk in the pool has registered? Since I don't know what I'm looking for yet, I can't be more specific in my question. I need some initial pointers to go on, and truss is a bit too low level as a first step....

I simply need to rename/remove one of the erroneous c2d0 entries/disks in the pool so that I can use it in full again, since at this time I can't reconnect the 10th disk in my raid, and if one more disk fails *all my data would be lost* (4 TB is a lot of disk to waste!)
Robert <slask <at> telia.com> writes:
>
> I simply need to rename/remove one of the erroneous c2d0 entries/disks in
> the pool so that I can use it in full again, since at this time I can't
> reconnect the 10th disk in my raid and if one more disk fails all my
> data would be lost (4 TB is a lot of disk to waste!)

You see an erroneous c2d0 device that you claim is in reality c3d0...
If I were you I would try:

$ zpool replace [-f] rz2pool c2d0 c3d0

The -f option may or may not be necessary. Also, what disk devices does this command display?:

$ format

-marc
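A hedged sketch of how Marc's suggestion might look in practice (the device names are assumed from the zpool status output earlier in the thread; format's actual listing will of course vary per system):

$ format </dev/null               # list every disk the OS currently sees, then exit
$ zpool replace -f rz2pool c2d0 c3d0   # tell ZFS the member formerly at c2d0 is now at c3d0
$ zpool status -v rz2pool         # watch the resilver start and the vdev state change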
Robert wrote:
> OK, not a single soul knows this either; this doesn't look promising....
>
> How can I list/edit the metadata(?) that is on my disks or in the pool, so
> that I can see/edit what each physical disk in the pool has registered?

To view but not edit you can use /usr/sbin/zdb

--
Darren J Moffat
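A hedged example of the kind of inspection Darren is pointing at (the device path and slice are assumptions and will depend on how the disks were added to the pool):

$ zdb -l /dev/dsk/c2d0s0          # dump the four ZFS labels stored on that disk
$ zdb -C                          # print the cached pool configuration from zpool.cache

Each on-disk label carries the pool GUID and the vdev GUID, which is how ZFS identifies a member independently of its c?d? device name.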
I finally found the cause of the error....

Since my disks are mounted in cassettes of four each, I had to disconnect all the cables to them in order to replace the crashed disk.

When re-attaching the cables I reversed their order by accident. In my early tests this was not a problem, since ZFS identified the disks anyway, regardless of which controller a disk was connected to (as long as the controller was visible to the pool).

What happened was some kind of a race condition(?). Since the disk on controller c2d0 crashed, it was listed as corrupt. But because I connected a healthy disk (already a member of the pool, from c3d1) to c2d0, instead of the new disk that was supposed to replace the crashed one, my problem developed.

ZFS therefore listed the original c2d0 disk as faulty, but then realized that there was a healthy disk on c2d0, i.e. c2d0 was both OK and faulty!! This means that no command acting on c2d0 would succeed, because both disks were listed as c2d0.

In this condition there seems to be no way of telling ZFS to discard the faulty disk entry, since both are assigned the name c2d0 and are/were connected to the same controller.

My resolution (after many hours of moving disks and reboots to find the error) was to make sure that only the new, not-yet-assigned disk was connected to c2d0 and none of the other disks already assigned to the pool.

I must consider it a bug that you can't remove/clear this kind of error in ZFS to be able to repair your pool.

After all my efforts it seems that my pool became corrupted in the end (probably due to scrubbing and resilvering), so I have some hours to kill, restoring my data......
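A hedged outline of the kind of recovery sequence described above (a sketch, not a transcript; whether the export/import is strictly required depends on the state the pool is in at that point):

$ zpool export rz2pool            # quiesce the pool before touching the cabling
  (physically attach only the new, blank disk to c2d0; leave every disk
   that already belongs to the pool on its own controller)
$ zpool import rz2pool            # re-import; members are matched by their label GUIDs
$ zpool replace -f rz2pool c2d0   # resilver onto the new disk now sitting at c2d0
$ zpool status -v rz2pool         # confirm the resilver completes and the vdev goes ONLINE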
Wade.Stuart at fallon.com
2008-Jan-11 16:49 UTC
[zfs-discuss] ZFS problem after disk failure
zfs-discuss-bounces at opensolaris.org wrote on 01/10/2008 08:07:37 PM:

> I finally found the cause of the error....
>
> Since my disks are mounted in cassettes of four each, I had to
> disconnect all the cables to them in order to replace the crashed disk.
>
> When re-attaching the cables I reversed their order by accident. In my
> early tests this was not a problem, since ZFS identified the disks
> anyway, regardless of which controller a disk was connected to (as long
> as the controller was visible to the pool).
>
> What happened was some kind of a race condition(?). Since the disk on
> controller c2d0 crashed, it was listed as corrupt. But because I
> connected a healthy disk (already a member of the pool, from c3d1) to
> c2d0, instead of the new disk that was supposed to replace the crashed
> one, my problem developed.
>
> ZFS therefore listed the original c2d0 disk as faulty, but then realized
> that there was a healthy disk on c2d0, i.e. c2d0 was both OK and
> faulty!! This means that no command acting on c2d0 would succeed,
> because both disks were listed as c2d0.
>
> In this condition there seems to be no way of telling ZFS to discard the
> faulty disk entry, since both are assigned the name c2d0 and are/were
> connected to the same controller.
>
> My resolution (after many hours of moving disks and reboots to find the
> error) was to make sure that only the new, not-yet-assigned disk was
> connected to c2d0 and none of the other disks already assigned to the
> pool.
>
> I must consider it a bug that you can't remove/clear this kind of error
> in ZFS to be able to repair your pool.
>
> After all my efforts it seems that my pool became corrupted in the end
> (probably due to scrubbing and resilvering), so I have some hours to
> kill, restoring my data......
>

I am confused; I was under the impression that ZFS actually checks the ZFS labels on the disks to make sure they are correct when importing (to avoid disk rename issues exactly like the above)? Is this an edge case that has not been accounted for, or am I misunderstanding the disk label/name semantics?

-Wade
To me it seems it's a special case that has not been accounted for...

While ZFS does seem to check the disks against the pool and handle them nicely using the labels/metadata, even when they are mounted on different controllers, the problem I encountered is that a specific device was flagged faulty while an already valid member of the pool was sitting at that same controller location. ZFS was telling me that c2d0 went bad, but ZFS also wants you to clear the error once you have 'fixed' it. With a faulty disk and swapped controllers you get two mechanisms fighting over the same cause -- c2d0 is faulty and c2d0 is OK.

This is why I want zpool to have some override option to clear an old entry that may no longer be valid, due to moved disks or something similar.
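For reference, the closest existing knobs (a sketch only; as the thread shows, none of these could disambiguate two entries that both carried the name c2d0, since they take the device name on the command line, while ZFS only tells the two entries apart internally by their vdev GUIDs -- whether a raw GUID is accepted in place of the name on this build is not something I have verified):

$ zpool clear rz2pool c2d0        # clear the error counters for a device
$ zpool offline rz2pool c2d0      # take a device offline
$ zpool detach rz2pool c2d0       # only valid for mirrors and hot spares, not raidz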