Scott L. Burson
2009-Jan-23 02:08 UTC
[zfs-discuss] Bug report: disk replacement confusion
This is in snv_86. I have a four-drive raidz pool. One of the drives died. I replaced it, but wasn't careful to put the new drive on the same controller port; one of the existing drives wound up on the port that had previously been used by the failed drive, and the new drive wound up on the port previously used by that drive.

I powered up and booted, and ZFS started a resilver automatically, but the pool status was confused. It looked like this, even after the resilver completed:

    NAME          STATE     READ WRITE CKSUM
    pool0         DEGRADED     0     0     0
      raidz1      DEGRADED     0     0     0
        c5t1d0    FAULTED      0     0     0  too many errors
        c5t1d0    ONLINE       0     0     0
        c5t3d0    ONLINE       0     0     0
        c5t0d0    ONLINE       0     0     0

Doing 'zpool clear' just changed the "too many errors" to "corrupted data".

I then tried 'zpool replace pool0 c5t1d0 c5t2d0' to see if that would straighten things out (hoping it wouldn't screw things up any further!). It started another resilver, during which the status looked like this:

    NAME            STATE     READ WRITE CKSUM
    pool0           DEGRADED     0     0     0
      raidz1        DEGRADED     0     0     0
        replacing   DEGRADED     0     0     0
          c5t1d0    FAULTED      0     0     0  corrupted data
          c5t2d0    ONLINE       0     0     0
        c5t1d0      ONLINE       0     0     0
        c5t3d0      ONLINE       0     0     0
        c5t0d0      ONLINE       0     0     0

Maybe this will work, but -- doesn't ZFS put unique IDs on the drives so it can track them in case they wind up on different ports? If so, seems like it needs to back-map that information to the device names when mounting. Or something :)

-- Scott
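As an aside on that last question: ZFS does store a unique GUID in each device's vdev label, and an export/import cycle makes it re-scan all devices and rebind pool members by GUID rather than by cached device path. A minimal way to inspect the labels and clean up stale path bindings, assuming the pool can be taken offline briefly (c5t1d0s0 here is just the slice holding the label on this particular system):

    zdb -l /dev/dsk/c5t1d0s0     # dump the vdev label; look for the 'guid' and 'path' fields
    zpool export pool0           # release the pool
    zpool import pool0           # re-scan devices; members are matched by GUID, not by old path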
On Thu, 22 Jan 2009, Scott L. Burson wrote:

> Maybe this will work, but -- doesn't ZFS put unique IDs on the drives so
> it can track them in case they wind up on different ports? If so, seems
> like it needs to back-map that information to the device names when
> mounting. Or something :)

Did your resilver complete successfully? I had a similar problem: the array was showing thousands of write errors against the missing drive, and the resilver of the new drive would basically reset after a few minutes. And you can't cancel a replacement for a nonexistent drive in this case. I ended up having to create a sparse device labeled with the proper guid; only then could I remove one of the devices and properly initiate a replacement. Similar to bug id #6782540...
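For anyone hitting the same wall, a rough outline of that workaround, with heavy caveats: mkfile -n creates a sparse file cheaply, but getting a vdev label carrying the phantom device's GUID onto that file is the hard part, and there is no standard one-liner for it; it apparently took manual label surgery in the case above. The size and path below are placeholders:

    zdb -l /dev/dsk/c5t0d0s0             # read a surviving disk's labels to find
                                         # the guid of the missing/phantom vdev
    mkfile -n 500g /var/tmp/fakedisk     # sparse file at least as large as the dead disk
    # ... write a label carrying the phantom vdev's guid onto /var/tmp/fakedisk ...
    zpool replace pool0 /var/tmp/fakedisk c5t2d0   # then swap the stand-in for the real disk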
Scott L. Burson
2009-Jan-23 05:10 UTC
[zfs-discuss] Bug report: disk replacement confusion
Well, the second resilver finished, and everything looks okay now. Doing one more scrub to be sure...

-- Scott
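For completeness, that verification step is just a scrub followed by a status check once it finishes:

    zpool scrub pool0
    zpool status -v pool0    # expect "errors: No known data errors" once the scrub completes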
Scott L. Burson
2009-Jan-23 18:56 UTC
[zfs-discuss] Bug report: disk replacement confusion
Yes, everything seems to be fine, but that was still scary, and the fix was not completely obvious. At the very least, I would suggest adding text such as the following to the page at http://www.sun.com/msg/ZFS-8000-FD :

    When physically replacing the failed device, it is best to use the same
    controller port, so that the new device will have the same device name
    as the failed one.

-- Scott
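With that advice followed, the replacement collapses to the simple one-device form of zpool replace, since the new disk inherits the old device name. A sketch using this thread's pool and device names (the offline step is unnecessary if the disk is already FAULTED):

    zpool offline pool0 c5t1d0    # take the failed disk out of service
    # physically swap the disk, reusing the same controller port
    zpool replace pool0 c5t1d0    # one-argument form: new disk, same device name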