After replacing a bad disk and waiting for the resilver to complete, I started a scrub of the pool. Currently, I have the pool mounted read-only, yet almost a quarter of the I/O is writes to the new disk. In fact, it looks like there are so many checksum errors that zpool doesn't even list them properly:

  pool: p
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 18.71% done, 2h17m to go
config:

        NAME        STATE     READ WRITE CKSUM
        p           ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c5d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0 231.5

errors: No known data errors

I assume that that should be followed by a K. Is my brand new replacement disk really returning gigabyte after gigabyte of silently corrupted data? I find that quite amazing, and I thought that I would inquire here. This is on snv_60.

Chris
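For what it's worth, the "231.5" reading makes sense as a truncated "231.5K": zpool status abbreviates large counters with K/M/G suffixes. Here is a minimal illustrative sketch of that kind of formatting (not the actual ZFS code, which may use different thresholds and rounding):

```python
def abbreviate(n):
    """Format a counter roughly the way zpool status abbreviates
    large values (illustrative sketch only, not the real nicenum())."""
    for suffix in ("", "K", "M", "G", "T"):
        if n < 1000:
            # Once a suffix is in use, show one decimal place.
            if suffix and n != int(n):
                return f"{n:.1f}{suffix}"
            return f"{int(n)}{suffix}"
        n /= 1000.0
    return f"{n:.1f}P"

print(abbreviate(231500))  # → 231.5K
print(abbreviate(231))     # → 231
```

So a value of "231.5" in the CKSUM column with the suffix dropped would correspond to roughly 231,500 checksum errors, which matches the "gigabyte after gigabyte" scale of corruption.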
I have some further data now, and I don't think that it is a hardware problem. Halfway through the scrub, I rebooted and exchanged the controller and cable used with the "bad" disk. After restarting the scrub, it proceeded error-free until about the point where it left off, and then it resumed the exact same behavior. Basically, almost exactly one fourth of the amount of data that is read from the resilvered disk is written back to the same disk, and this ratio was constant throughout the scrub. Meanwhile, fmd writes ereport.fs.zfs.io events to errlog until the disk is full. At this point, it seems as if the resilvering code in snv_60 is broken, and one fourth of the data was not reconstructed properly. I have an iosnoop trace of the disk in question, if anyone is interested. I will try to make some sense of it, but that probably won't happen today.

Chris
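In case anyone wants to reproduce the one-fourth measurement from a trace like mine, here is a small sketch that sums read and write bytes per iosnoop line and reports the write/read ratio. It assumes iosnoop's default column layout (UID PID D BLOCK SIZE COMM PATHNAME, with D being R or W and SIZE in bytes); adjust the field indices if your trace options differ:

```python
def rw_ratio(lines):
    """Sum bytes read and written from iosnoop output lines and
    return the writes/reads ratio.  Assumes the default columns:
    UID PID D BLOCK SIZE COMM PATHNAME (D is R or W, SIZE in bytes)."""
    totals = {"R": 0, "W": 0}
    for line in lines:
        fields = line.split()
        # Skip the header line and anything malformed.
        if len(fields) < 5 or fields[2] not in totals:
            continue
        totals[fields[2]] += int(fields[4])
    return totals["W"] / totals["R"]

# Hypothetical trace excerpt, for illustration only.
sample = [
    "UID PID D BLOCK SIZE COMM PATHNAME",
    "0 3 R 1024 8192 sched <none>",
    "0 3 R 2048 8192 sched <none>",
    "0 3 R 3072 8192 sched <none>",
    "0 3 R 4096 8192 sched <none>",
    "0 3 W 1024 8192 sched <none>",
]
print(rw_ratio(sample))  # → 0.25
```

A ratio hovering near 0.25 over the whole trace would confirm that one fourth of what the scrub reads from the resilvered disk fails verification and gets rewritten.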