thr3ads.net - zfs discuss - [zfs-discuss] Replacing a disk never completes [Sep 2010]


Ben Miller

2010-Sep-16 12:36 UTC

[zfs-discuss] Replacing a disk never completes

I have an X4540 running b134 where I'm replacing 500GB disks with 2TB disks
(Seagate Constellation) and the pool seems sick now.  The pool has four
raidz2 vdevs (8+2) where the first set of 10 disks were replaced a few
months ago.  I replaced two disks in the second set (c2t0d0, c3t0d0) a
couple of weeks ago, but have been unable to get the third disk to finish
replacing (c4t0d0).

I have tried the resilver for c4t0d0 four times now and the pool also comes
up with checksum errors and a permanent error (<metadata>:<0x0>).  The
first resilver was from 'zpool replace', which came up with checksum
errors.  I cleared the errors, which triggered the second resilver (same
result).  I then did a 'zpool scrub', which started the third resilver and
also identified three permanent errors (the two additional were in files in
snapshots which I then destroyed).  I then did a 'zpool clear' and then
another scrub, which started the fourth resilver attempt.  This last attempt
identified another file with errors in a snapshot that I have now destroyed.

Any ideas how to get this disk finished being replaced without rebuilding 
the pool and restoring from backup?  The pool is working, but is reporting 
as degraded and with checksum errors.

Here is what the pool currently looks like:

  # zpool status -v pool2
   pool: pool2
  state: DEGRADED
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: resilver completed after 33h9m with 4 errors on Thu Sep 16 00:28:14
config:

         NAME              STATE     READ WRITE CKSUM
         pool2             DEGRADED     0     0     8
           raidz2-0        ONLINE       0     0     0
             c0t4d0        ONLINE       0     0     0
             c1t4d0        ONLINE       0     0     0
             c2t4d0        ONLINE       0     0     0
             c3t4d0        ONLINE       0     0     0
             c4t4d0        ONLINE       0     0     0
             c5t4d0        ONLINE       0     0     0
             c2t5d0        ONLINE       0     0     0
             c3t5d0        ONLINE       0     0     0
             c4t5d0        ONLINE       0     0     0
             c5t5d0        ONLINE       0     0     0
           raidz2-1        DEGRADED     0     0    14
             c0t5d0        ONLINE       0     0     0
             c1t5d0        ONLINE       0     0     0
             c2t1d0        ONLINE       0     0     0
             c3t1d0        ONLINE       0     0     0
             c4t1d0        ONLINE       0     0     0
             c5t1d0        ONLINE       0     0     0
             c2t0d0        ONLINE       0     0     0
             c3t0d0        ONLINE       0     0     0
             replacing-8   DEGRADED     0     0     0
               c4t0d0s0/o  OFFLINE      0     0     0
               c4t0d0      ONLINE       0     0     0  268G resilvered
             c5t0d0        ONLINE       0     0     0
           raidz2-2        ONLINE       0     0     0
             c0t6d0        ONLINE       0     0     0
             c1t6d0        ONLINE       0     0     0
             c2t6d0        ONLINE       0     0     0
             c3t6d0        ONLINE       0     0     0
             c4t6d0        ONLINE       0     0     0
             c5t6d0        ONLINE       0     0     0
             c2t7d0        ONLINE       0     0     0
             c3t7d0        ONLINE       0     0     0
             c4t7d0        ONLINE       0     0     0
             c5t7d0        ONLINE       0     0     0
           raidz2-3        ONLINE       0     0     0
             c0t7d0        ONLINE       0     0     0
             c1t7d0        ONLINE       0     0     0
             c2t3d0        ONLINE       0     0     0
             c3t3d0        ONLINE       0     0     0
             c4t3d0        ONLINE       0     0     0
             c5t3d0        ONLINE       0     0     0
             c2t2d0        ONLINE       0     0     0
             c3t2d0        ONLINE       0     0     0
             c4t2d0        ONLINE       0     0     0
             c5t2d0        ONLINE       0     0     0
         logs
           mirror-4        ONLINE       0     0     0
             c0t1d0s0      ONLINE       0     0     0
             c1t3d0s0      ONLINE       0     0     0
         cache
           c0t3d0s7        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

         <metadata>:<0x0>
         <0x167a2>:<0x552ed>
         (This second file was in a snapshot I destroyed after the resilver 
completed).

# zpool list pool2
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
pool2  31.8T  13.8T  17.9T    43%  1.65x  DEGRADED  -

The slog is a mirror of two SLC SSDs and the L2ARC is an MLC SSD.

thanks,
Ben

Giovanni Tirloni

2010-Sep-20 14:45 UTC

[zfs-discuss] Replacing a disk never completes

On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmiller at mail.eecis.udel.edu> wrote:
>[...]

Try to run a `zpool clear pool2` and see if it clears the errors. If not, you
may have to detach `c4t0d0s0/o`.

I believe it's a bug that was fixed in recent builds.

-- 
Giovanni Tirloni
gtirloni at sysdroid.com
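
The two steps suggested above can be written down as a small dry-run helper. This is only a sketch: the function name `stuck_replace_commands` is hypothetical, and the pool and device names are the ones from this thread. It builds the command lines and prints them without executing anything.

```python
def stuck_replace_commands(pool, old_device):
    """Return the two commands suggested in the thread, in order.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        # Step 1: clear the pool's error counters and see if the state recovers.
        ["zpool", "clear", pool],
        # Step 2: if that does not help, detach the old half of the
        # "replacing" vdev (shown as c4t0d0s0/o in the status output).
        ["zpool", "detach", pool, old_device],
    ]


commands = stuck_replace_commands("pool2", "c4t0d0s0/o")
for argv in commands:
    print(" ".join(argv))
```

Keeping the commands as argument lists rather than one shell string makes it easy to review them first and pass them to `subprocess.run` later, if that is wanted.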

Ben Miller

2010-Sep-21 13:16 UTC

[zfs-discuss] Replacing a disk never completes

On 09/20/10 10:45 AM, Giovanni Tirloni wrote:
> On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmiller at mail.eecis.udel.edu> wrote:
>
> [...]
>
> Try to run a `zpool clear pool2` and see if it clears the errors. If not, you
> may have to detach `c4t0d0s0/o`.
>
> I believe it's a bug that was fixed in recent builds.

I had tried a clear a few times with no luck.  I just did a detach, which
removed the old disk and has now triggered another resilver that will
hopefully work.  I had tried a remove rather than a detach before, but
that doesn't work on raidz2...

thanks,
Ben

Ben Miller

2010-Sep-22 20:27 UTC

[zfs-discuss] Replacing a disk never completes

On 09/21/10 09:16 AM, Ben Miller wrote:
> On 09/20/10 10:45 AM, Giovanni Tirloni wrote:
>> On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller <bmiller at mail.eecis.udel.edu> wrote:
>>
>> [...]
>>
>> Try to run a `zpool clear pool2` and see if it clears the errors. If not, you
>> may have to detach `c4t0d0s0/o`.
>>
>> I believe it's a bug that was fixed in recent builds.
>>
> I had tried a clear a few times with no luck. I just did a detach and that
> did remove the old disk and has now triggered another resilver which
> hopefully works. I had tried a remove rather than a detach before, but that
> doesn't work on raidz2...
>
> thanks,
> Ben

I made some progress.  That resilver completed with 4 errors.  I cleared
those and still had the one error "<metadata>:<0x0>", so I started a scrub.
The scrub restarted the resilver on c4t0d0 again though!  There currently
are no errors anyway, but the resilver will be running for the next day+.
Is this another bug, or will doing a scrub eventually lead to a scrub of the
pool instead of the resilver?

Ben

Ben Miller

2010-Sep-30 19:00 UTC

[zfs-discuss] Disk keeps resilvering, was: Replacing a disk never completes

On 09/22/10 04:27 PM, Ben Miller wrote:
> On 09/21/10 09:16 AM, Ben Miller wrote:
>> I had tried a clear a few times with no luck. I just did a detach and that
>> did remove the old disk and has now triggered another resilver which
>> hopefully works. I had tried a remove rather than a detach before, but that
>> doesn't work on raidz2...
>>
>> thanks,
>> Ben
>>
> I made some progress. That resilver completed with 4 errors. I cleared
> those and still had the one error "<metadata>:<0x0>" so I started a scrub.
> The scrub restarted the resilver on c4t0d0 again though! There currently
> are no errors anyway, but the resilver will be running for the next day+.
> Is this another bug or will doing a scrub eventually lead to a scrub of the
> pool instead of the resilver?
>
> Ben

Well, not much progress.  The one permanent error "<metadata>:<0x0>" came
back.  And the disk keeps wanting to resilver when trying to do a scrub.
Now after the last resilver I have more checksum errors on the pool, but
not on any disks:
         NAME              STATE     READ WRITE CKSUM
         pool2             ONLINE      0     0    37
...
           raidz2-1        ONLINE      0     0    74

All other checksum totals are 0.  So three problems:
	1. How to get the disk to stop resilvering?

	2. How do you get checksum errors on the pool, but no disk is identified?
If I clear them and let the resilver go again, more checksum errors
appear.  So how to get rid of these errors?

	3. How to get rid of the metadata:0x0 error?  I'm currently destroying old
snapshots (though that bug was fixed quite a while ago and I'm running
b134).  I can try unmounting filesystems and remounting next (all are
currently mounted).  I can also schedule a reboot for next week if anyone
thinks that would help.

thanks,
Ben

Victor Latushkin

2010-Oct-01 14:17 UTC

[zfs-discuss] Disk keeps resilvering, was: Replacing a disk never completes

On Sep 30, 2010, at 11:00 PM, Ben Miller wrote:
> On 09/22/10 04:27 PM, Ben Miller wrote:
>> On 09/21/10 09:16 AM, Ben Miller wrote:
>>> I had tried a clear a few times with no luck. I just did a detach and that
>>> did remove the old disk and has now triggered another resilver which
>>> hopefully works. I had tried a remove rather than a detach before, but that
>>> doesn't work on raidz2...
>>>
>>> thanks,
>>> Ben
>>>
>> I made some progress. That resilver completed with 4 errors. I cleared
>> those and still had the one error "<metadata>:<0x0>" so I started a scrub.
>> The scrub restarted the resilver on c4t0d0 again though! There currently
>> are no errors anyway, but the resilver will be running for the next day+.
>> Is this another bug or will doing a scrub eventually lead to a scrub of the
>> pool instead of the resilver?
>>
>> Ben
>
> 	Well not much progress.  The one permanent error "<metadata>:<0x0>" came
> back.  And the disk keeps wanting to resilver when trying to do a scrub.
> Now after the last resilver I have more checksum errors on the pool, but
> not on any disks:
>        NAME              STATE     READ WRITE CKSUM
>        pool2             ONLINE      0     0    37
> ...
>          raidz2-1        ONLINE      0     0    74
>
> All other checksum totals are 0.  So three problems:
> 	1. How to get the disk to stop resilvering?

This is a known bug, which is fixed in build 135:

6887372 DTLs not cleared after resilver if permanent errors present

> 	2. How do you get checksum errors on the pool, but no disk is identified?
> If I clear them and let the resilver go again more checksum errors appear.
> So how to get rid of these errors?

It may not be possible to determine which disk(s) are responsible for the
errors; in that case you'll see a 0 counter at the disk level and a non-zero
counter at the raidz level. It may mean that there are more errors than your
raidz can recover from, or that data was corrupted in RAM after checksumming
but before writing... Check your FMA data for any signs of disk issues.

> 	3. How to get rid of the metadata:0x0 error?  I'm currently destroying old
> snapshots (though that bug was fixed quite a while ago and I'm running
> b134).  I can try unmounting filesystems and remounting next (all are
> currently mounted).  I can also schedule a reboot for next week if anyone
> thinks that would help.

This is an error in metadata, and the only way to get rid of it is to recreate
your pool.

Regards
Victor
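
To illustrate the second answer, here is a minimal sketch that scans the config rows of `zpool status` output and reports which entries carry a non-zero CKSUM counter. The function name `nonzero_cksum` is hypothetical, and the sample text is assembled from the excerpts posted in this thread (the two disk rows stand in for "all other checksum totals are 0"). It assumes plain integer counters; real output may abbreviate large counts, which this sketch does not handle.

```python
def nonzero_cksum(status_text):
    """Return {name: cksum} for config rows whose CKSUM counter is non-zero."""
    hits = {}
    for line in status_text.splitlines():
        fields = line.split()
        # A config row looks like: NAME STATE READ WRITE CKSUM [notes...]
        # The header row is skipped because its counters are not digits.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            cksum = int(fields[4])
            if cksum > 0:
                hits[fields[0]] = cksum
    return hits


sample = """\
NAME              STATE     READ WRITE CKSUM
pool2             ONLINE      0     0    37
  raidz2-1        ONLINE      0     0    74
    c0t5d0        ONLINE      0     0     0
    c4t0d0        ONLINE      0     0     0
"""

errors = nonzero_cksum(sample)
print(errors)
```

With this sample only the pool and the raidz2-1 vdev are reported, while every disk row is skipped. That is the pattern described above, where the errors cannot be pinned on an individual disk.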
