I'm on S11E 150.0.1.9. I replaced one of the drives and the pool seems to
be stuck in a resilvering loop. I performed a 'zpool clear' and 'zpool
scrub', and it just complains that the drives I didn't replace are
degraded because of too many errors. Oddly, the replaced drive is
reported as being fine. The CKSUM counts get up to about 108 or so by the
time the resilver completes. I'm now trying to evacuate the pool onto
another pool, but the zfs send/receive dies 380GB into sending the first
dataset. Here is some output. Any help or insights will be helpful.

Thanks

cfs

  pool: dpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 26 15:03:32 2011
        63.4G scanned out of 5.02T at 6.81M/s, 212h12m to go
        15.1G resilvered, 1.23% done
config:

        NAME        STATE     READ WRITE CKSUM
        dpool       DEGRADED     0     0     6
          raidz1-0  DEGRADED     0     0    12
            c9t0d0  DEGRADED     0     0     0  too many errors
            c9t1d0  DEGRADED     0     0     0  too many errors
            c9t3d0  DEGRADED     0     0     0  too many errors
            c9t2d0  ONLINE       0     0     0  (resilvering)

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        [redacted list of 20 files, mostly in the same directory]
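Since the send/receive dies partway through, one pragmatic workaround is to evacuate dataset by dataset rather than in a single stream, so a failure only loses one dataset's transfer and can be retried. The sketch below is illustrative, not from the thread: the destination pool name ("npool") and snapshot name are placeholders, and the function itself is hypothetical.

```shell
# Hedged sketch: evacuate a pool one dataset at a time so that a single
# failing transfer does not abort the whole migration. Pool and snapshot
# names here are illustrative placeholders.
send_datasets() {
    src="$1" dst="$2" snap="$3"
    # Walk every dataset in the source pool, snapshot it, and send each
    # snapshot individually to the destination pool.
    zfs list -H -o name -r "$src" | while read -r ds; do
        zfs snapshot "${ds}@${snap}" 2>/dev/null
        if zfs send "${ds}@${snap}" | zfs receive -F "${dst}/${ds#*/}"; then
            echo "OK   $ds"
        else
            echo "FAIL $ds (retry this one individually)"
        fi
    done
}
# On a live system, something like:
# send_datasets dpool npool evac1
```

Datasets that fail can then be retried on their own, or recovered file by file if the send keeps hitting the same corrupt blocks.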
On Tue, 26 Jul 2011, Charles Stephens wrote:

> I'm on S11E 150.0.1.9 and I replaced one of the drives and the pool
> seems to be stuck in a resilvering loop. I performed a 'zpool clear'
> and 'zpool scrub' and just complains that the drives I didn't replace
> are degraded because of too many errors. Oddly the replaced drive is
> reported as being fine. The CKSUM counts get up to about 108 or so
> when the resilver is completed.

This sort of problem (disks failing during a recovery) is a good reason
not to use raidz1 in modern systems. Use raidz2 or raidz3.

Assuming that the system is good and it really is a problem with the
disks experiencing bad reads, it seems the only path forward is to wait
for the resilver to complete, or to see whether creating a new pool from
a recent backup is better.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
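Bob's raidz2 suggestion can be sketched as below. The pool name, device names, and wrapper function are hypothetical, not from the thread; the point is that raidz2 carries two drives' worth of parity, so a second member failing (or returning bad reads) mid-resilver does not lose data.

```shell
# Hedged sketch: build the replacement pool as raidz2 rather than raidz1.
# Pool and device names are illustrative placeholders.
create_raidz2_pool() {
    pool="$1"; shift
    # raidz2 tolerates two simultaneous device failures, so the
    # failing-disk-during-resilver scenario above is survivable.
    zpool create "$pool" raidz2 "$@"
}
# On a live system, with six example devices:
# create_raidz2_pool npool c9t0d0 c9t1d0 c9t2d0 c9t3d0 c9t4d0 c9t5d0
```

The trade-off is one more disk's worth of capacity given up to parity in exchange for surviving a second failure during the (often days-long) resilver window.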
On Wed, Jul 27, 2011 at 08:00:43PM -0500, Bob Friesenhahn wrote:

> On Tue, 26 Jul 2011, Charles Stephens wrote:
>
>> I'm on S11E 150.0.1.9 and I replaced one of the drives and the pool
>> seems to be stuck in a resilvering loop. I performed a 'zpool clear'
>> and 'zpool scrub' and just complains that the drives I didn't replace
>> are degraded because of too many errors. Oddly the replaced drive is
>> reported as being fine. The CKSUM counts get up to about 108 or so
>> when the resilver is completed.
>
> This sort of problem (failing disks during a recovery) is a good reason
> not to use raidz1 in modern systems. Use raidz2 or raidz3.
>
> Assuming that the system is good and it is really a problem with the
> disks experiencing bad reads, it seems that the only path forward is to
> wait for the resilver to complete or see if creating a new pool from a
> recent backup is better.

Indeed, but that assumption may be too strong. If you're getting errors
across all the members, you likely have some other systemic problem,
such as:

 * bad RAM / CPU / motherboard
 * a too-weak power supply
 * a faulty disk controller / driver

Had you scrubbed the pool regularly before the replacement? Were those
scrubs clean? If not, it is possible that the scrubs are telling you bad
data was written originally, especially if the errors repeat on the same
files. If each scrub hits different counts and files, you may instead be
seeing corruption on reads, due to the same causes. Or you may have both.

--
Dan.
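Dan's point about regular scrubs can be put into practice with a scheduled scrub plus a simple health check. The cron schedule, pool name, and helper function below are illustrative assumptions, not from the thread.

```shell
# Hedged sketch: scrub regularly so latent corruption is found while
# redundancy still exists, and check pool state before trusting it.
#
# Example root crontab entry (schedule is an arbitrary illustration):
#   0 2 * * 0 /usr/sbin/zpool scrub dpool
check_pool() {
    pool="$1"
    # Inspect "zpool status -v" output; only an ONLINE pool with no
    # errors should be considered healthy.
    if zpool status -v "$pool" | grep -q "state: ONLINE"; then
        echo "$pool healthy"
    else
        echo "$pool needs attention"
    fi
}
# On a live system:
# check_pool dpool
```

If scrubs were clean before the replacement and errors now appear on every member at once, that points toward the systemic causes listed above rather than three disks all failing together.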