As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...

Bug alert: it seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source seemed to work, but in fact failed: subsequent reads of the files produced IO errors. Apparently (my wild guess), upon writing the blocks, checksums were calculated and the matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it still increased the DDT counter.

The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.

I can think of a couple of solutions:

If the block is detected to be corrupt (checksum mismatches the data), the checksum value in block pointers and the DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, when the error is detected.

Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).

Alas, so far there is anyway no guarantee that it was not the checksum itself that got corrupted (except for using ZDB to retrieve the block contents and matching that with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.

//Jim Klimov

(Bug reported to Illumos: https://www.illumos.org/issues/1981)
On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:

> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>
> Bug alert: it seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source seemed to work, but in fact failed: subsequent reads of the files produced IO errors. Apparently (my wild guess), upon writing the blocks, checksums were calculated and the matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it still increased the DDT counter.
>
> The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.
>
> I can think of a couple of solutions:

In theory, the verify option will correct this going forward.

> If the block is detected to be corrupt (checksum mismatches the data), the checksum value in block pointers and the DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, when the error is detected.

What if it is a transient fault?

> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).

verify eliminates this failure mode.

> Alas, so far there is anyway no guarantee that it was not the checksum itself that got corrupted (except for using ZDB to retrieve the block contents and matching that with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.

Extremely unlikely. The metadata is also checksummed. To arrive here you will have to have two corruptions, each of which generates the proper checksum. Not impossible, but... I'd buy a lottery ticket instead.

See also dedupditto. I could argue that the default value of dedupditto should be 2 rather than "off".

> //Jim Klimov
>
> (Bug reported to Illumos: https://www.illumos.org/issues/1981)

Thanks!
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
2012-01-13 4:26, Richard Elling wrote:
> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>
>> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>>
>> Bug alert: it seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source seemed to work, but in fact failed: subsequent reads of the files produced IO errors. Apparently (my wild guess), upon writing the blocks, checksums were calculated and the matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it still increased the DDT counter.
>>
>> The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.
>>
>> I can think of a couple of solutions:
>
> In theory, the verify option will correct this going forward.

But in practice there are many suggestions to disable verification because it slows down writes beyond what the DDT already does to performance, and since there is just some 10^-77 chance that two blocks would have the same checksum values, it is supposedly there only for the paranoid. (A quick back-of-the-envelope on that figure follows this message.)

>> If the block is detected to be corrupt (checksum mismatches the data), the checksum value in block pointers and the DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, when the error is detected.
>
> What if it is a transient fault?

Reread the disk, retest the checksums?.. I don't know... :)

>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>
> verify eliminates this failure mode.

Sounds true; I didn't try that, though. But my scrub is not yet complete, maybe there will be more test subjects ;)

>> Alas, so far there is anyway no guarantee that it was not the checksum itself that got corrupted (except for using ZDB to retrieve the block contents and matching that with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.
>
> Extremely unlikely. The metadata is also checksummed. To arrive here you will have to have two corruptions, each of which generates the proper checksum. Not impossible, but... I'd buy a lottery ticket instead.

I rather meant the opposite: the file data is actually good, but the checksums (apparently both the DDT and block-pointer ones, with all their ditto copies) are bad, either due to disk rot or RAM failures. For example, are the "blockpointer" and "dedup" versions of the sha256 checksum recalculated by both stages, or reused, on writes of a block?..

> See also dedupditto. I could argue that the default value of dedupditto should be 2 rather than "off".

I couldn't set it to smallish values (like 64) on an oi_148a LiveUSB:

root@openindiana:~# zpool set dedupditto=64 pool
cannot set property for 'pool': invalid argument for this pool operation
root@openindiana:~# zpool set dedupditto=2 pool
cannot set property for 'pool': invalid argument for this pool operation
root@openindiana:~# zpool set dedupditto=127 pool
root@openindiana:~# zpool get dedupditto pool
NAME  PROPERTY    VALUE  SOURCE
pool  dedupditto  127    local

Thanks,
//Jim
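As an aside on the "10^-77" figure and the lottery-ticket comparison above, the arithmetic can be sketched in a couple of lines of Python. The pool size below is an invented example; this says nothing about ZFS itself, only about an ideal 256-bit hash:

from math import log10

bits = 256
pair = 2.0 ** -bits                    # chance that two given blocks collide
# Birthday-style bound: with n unique blocks, the chance of any collision at
# all is roughly n^2 / 2^(bits+1). n = 2^35 (~34 billion 128K blocks, ~4 PiB
# of unique data) is a made-up pool size.
n = 2 ** 35
anywhere = n * n / 2.0 ** (bits + 1)
print("per-pair  ~ 1e%d" % round(log10(pair)))       # about 1e-77
print("pool-wide ~ 1e%d" % round(log10(anywhere)))   # about 1e-56

Of course, as this thread shows, the practical risk is not hash collisions but blocks that silently stop matching their recorded checksum.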
2012-01-13 4:26, Richard Elling wrote:
> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>
> verify eliminates this failure mode.

Thinking about it... I have more questions.

In this case: the DDT/BP contain multiple references with correct checksums, but the on-disk block is bad. A newly written block has the same checksum, and verification proves that the on-disk data differs byte for byte.

1) How does the write stack interact with those checksums that do not match the data? Would any checksum be tested at all for this verification read of existing data?

2) It would make sense for the failed verification to have the new block committed to disk, and a new DDT entry with the same checksum created. I would normally expect this to become the new unique block of a new file, with no influence on existing data (block chains). However, in the problematic case discussed here, this safe behaviour would also mean not contributing to the repair of those existing block chains which include the mismatching on-disk block.

Either I misunderstand some of the above, or I fail to see how verification would eliminate this failure mode (namely, as per my suggestion, replace the bad block with a good one and have all references updated and block chains -> files fixed in one shot). Would you please explain?

Thanks,
//Jim Klimov
On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
> 2012-01-13 4:26, Richard Elling wrote:
>> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>>
>> verify eliminates this failure mode.
>
> Thinking about it... I have more questions.
>
> In this case: the DDT/BP contain multiple references with correct checksums, but the on-disk block is bad. A newly written block has the same checksum, and verification proves that the on-disk data differs byte for byte.
>
> 1) How does the write stack interact with those checksums that do not match the data? Would any checksum be tested at all for this verification read of existing data?
>
> 2) It would make sense for the failed verification to have the new block committed to disk, and a new DDT entry with the same checksum created. I would normally expect this to become the new unique block of a new file, with no influence on existing data (block chains). However, in the problematic case discussed here, this safe behaviour would also mean not contributing to the repair of those existing block chains which include the mismatching on-disk block.
>
> Either I misunderstand some of the above, or I fail to see how verification would eliminate this failure mode (namely, as per my suggestion, replace the bad block with a good one and have all references updated and block chains -> files fixed in one shot).

It doesn't update past data.

It gets treated as if there were a hash collision: the new data is really different despite having the same checksum, and so gets written out instead of incrementing the existing DDT pointer. So it addresses your ability to recover the primary filesystem by overwriting with the same data, which dedup was previously defeating.

--
Dan.
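To restate Dan's point in pseudocode: the following is only a conceptual sketch of the dedup write decision being discussed. It is not the actual ZFS code path, and all names here are invented for illustration.

def dedup_write(new_data, checksum, ddt, verify, write_block, read_block):
    # Conceptual sketch only -- not actual ZFS code. ddt, write_block and
    # read_block stand in for the real machinery.
    entry = ddt.lookup(checksum)
    if entry is None:
        # Checksum not seen before: store the block and start a DDT entry.
        dva = write_block(new_data)
        ddt.insert(checksum, dva, refcount=1)
        return dva
    if verify:
        # dedup=verify: read the already-stored block and compare byte for byte.
        if read_block(entry.dva) != new_data:
            # Treated like a hash collision (or, as in this thread, like silent
            # on-disk corruption): the new data is written out as its own block
            # instead of bumping the stale entry. How the colliding block is
            # tracked in the DDT is omitted here.
            return write_block(new_data)
    # No verify (or the bytes really match): just take another reference --
    # which is why overwriting with known-good data could not repair the bad copy.
    entry.refcount += 1
    return entry.dva

Presumably this is also the answer to question (1) above: without verify, the stored block is never read back at write time, only referenced, so its mismatch with the recorded checksum goes unnoticed until a later read or scrub.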
2012-01-13 5:34, Daniel Carosone wrote:
> On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
>> Either I misunderstand some of the above, or I fail to see how verification would eliminate this failure mode (namely, as per my suggestion, replace the bad block with a good one and have all references updated and block chains -> files fixed in one shot).
>
> It doesn't update past data.
>
> It gets treated as if there were a hash collision: the new data is really different despite having the same checksum, and so gets written out instead of incrementing the existing DDT pointer. So it addresses your ability to recover the primary filesystem by overwriting with the same data, which dedup was previously defeating.

But (yes/no?) I have to do this repair file by file, either with dedup=off or dedup=verify. Actually, that is what I properly should do if there is such a serious error - but what if the original data is not available, so I can't fix it file by file? Or if there are very many errors (read: DDT references from a number of files just under the dedupditto value) and such a match-and-repair procedure is prohibitively inconvenient, slow, whatever?

Say, previously we trusted the hash algorithm: the same checksums mean identical blocks. With such trust the user might want to replace the faulty block with another one (matching the checksum) and expect ALL deduped files that used this block to become automagically recovered. Chances are, they actually would be correct (by external verification). And if we trust unverified dedup in the first place, there is nothing wrong with such an approach to repair. It would not make possible errors worse than there were in the originally saved on-disk data (even if there were hash collisions of really-different blocks - the user had discarded that difference long ago).

I think the user should be given an (informed) ability to shoot himself in the foot or recover data, depending on his luck. Anyway, people are doing it thanks to Max Bruning's or Viktor Latushkin's posts and direct help, or they research the hardcore internals of ZFS. We might as well play along and increase their chances of success, even if unsupported and unguaranteed - no?

This situation with "obscured" recovery methods reminds me of prohibited changes of firmware on cell phones: customers are allowed to sit on a phone or drop it into a sink, and perhaps have it replaced, but they are not allowed to install different software. Many still do.

//Jim Klimov
2012-01-13 4:26, Richard Elling wrote:
> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>> The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.
>
> In theory, the verify option will correct this going forward.

Well, I have got more complaining blocks, and even new errors in files that I had previously "repaired" with rsync before I figured out the problem with dedup today.

Now I have set the verify flag instead of dedup=off, and the rsync replacement from external storage seems to happen a lot faster. It also seems to persist even a few minutes after the copying ;)

Thanks for the tip, Richard!
//Jim Klimov
2012-01-13 4:12, Jim Klimov wrote:
> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...

Well, after some crawling over my data with zdb, od and dd, I guess ZFS was right about finding checksum errors - the metadata's checksum matched that of a block on the original system, and the data block was indeed erring.

Just in case it helps others, the SHA256 checksums can be tested with openssl as I show below. I am still searching for a command-line fletcher4/fletcher2 checker (as that weaker hash is used on metadata; I wonder why).

Here's a tail from the on-disk blkptr_t, the bytes with the checksum:

# tail -2 /tmp/osx.l0+1100000.blkptr.txt
000460 1f 6f 4c 73 5d c1 ab 15 00 cc 56 90 38 8e b4 dd
000470 a9 8e 54 6f f1 a7 db 43 7d 61 9e 01 23 45 2e 70

In byte 0x435 I have the value 0x8 - SHA256.

And here is the SHA256 hash for the excerpt from the original file (128KB cut out with dd):

# dd if=osx.zip of=/tmp/osx.l0+1100000.bin.orig bs=512 skip=34816 count=256
# openssl dgst -sha256 < /tmp/osx.l0+1100000.bin.orig
15abc15d734c6f1fddb48e389056cc0043dba7f16f548ea9702e4523019e617d

As my x86 is little-endian, the four 8-byte words of the checksum appear reversed. But you can see it matches, so my source file is okay.

I did not find the DDT entries (yet), so I don't know what hash is there or which addresses it references for how many files. The block pointer has the dedup bit set, though. However, of all my files with errors, there are no DVA overlaps.

I hexdumped (with od) the two 128KB excerpts (one from the original file, another fetched with zdb) and diffed them, and while some lines matched, others did not. What is more interesting is that most of the error area contains a repeating pattern like this, sometimes with "extra" chars thrown in:

fc 42 fc 42 fc 42 fc 42 fc 42 fc fc 42 1f fc 42
fc 42 42 ff fc 42 fc 42 fc 42 fc 42

I have seen similar patterns when I zdb-dumped compressed blocks without decompression, so I guess this could be a miswrite of compressed data and/or parity destined for another file (which also did not get it).

The erroneous data starts and ends at "round" offsets like 0x1000-0x2000, 0x9000-0xa000, 0x11000-0x12000 (step 0x8000 between both sets of mismatches; 4KB is my disks' sector size), which also suggests a non-coincidental problem. However, part of the differing data is "normal-looking random noise", while some part is that pattern above, starting and ending at a seemingly random location mid-sector.

Here's about all I have to say and share so far :) Open to suggestions on how to compute fletcher checksums on blocks...

Thanks,
//Jim
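Since a ready-made command-line fletcher checker was mentioned as missing above, here is a small stand-alone sketch of one in Python. It follows my understanding of fletcher_4 (four 64-bit accumulators running over the buffer as little-endian 32-bit words, modulo 2^64); fletcher_2 differs (as far as I recall it works over 64-bit words with two accumulator pairs) and is not covered. Treat the algorithm details as an assumption and cross-check the output against a checksum zdb reports for a known-good block before trusting it.

#!/usr/bin/env python3
# fletcher4.py - sketch of a command-line fletcher4 checker (assumed algorithm:
# four 64-bit accumulators over little-endian 32-bit words, all modulo 2^64).
# The input must be a whole block dumped to a file (length a multiple of 4).
import struct
import sys

def fletcher4(buf):
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (word,) in struct.iter_unpack('<I', buf):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d

if __name__ == '__main__':
    data = open(sys.argv[1], 'rb').read()
    # Print as four hex words, the way zdb displays zio checksums.
    print(' '.join('%016x' % w for w in fletcher4(data)))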
2012-01-21 0:33, Jim Klimov wrote:
> 2012-01-13 4:12, Jim Klimov wrote:
>> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>
> Well, after some crawling over my data with zdb, od and dd, I guess ZFS was right about finding checksum errors - the metadata's checksum matched that of a block on the original system, and the data block was indeed erring.

Well, as I'm moving to close my quest with broken data, I'd like to draw up some conclusions and RFEs. I am still not sure they are factually true - I'm still learning the ZFS internals. So "it currently seems to me that":

1) My on-disk data could get corrupted for whatever reason ZFS tries to protect it from, at least once probably from misdirected writes (i.e. the head landed not where it was asked to write). It cannot be ruled out that the checksums got broken in non-ECC RAM before the writes of block pointers for some of my data, thus leading to mismatches. One way or another, ZFS noted the discrepancy during scrubs and "normal" file accesses. There is no (automatic) way to tell which part is faulty - the checksum or the data.

2) In the case where on-disk data did get corrupted, the checksum in the block pointer was correct (matching the original data), but the raidz2 redundancy did not aid recovery.

3) The file in question was created on a dataset with deduplication enabled, so at the very least the dedup bit was set on the corrupted block's pointer and a DDT entry likely existed. Attempts to rewrite the block with the original one (having "dedup=on") failed in fact, probably because the matching checksum was already in the DDT. Rewrites of such blocks with "dedup=off" or "dedup=verify" succeeded.

   Failure/success was tested by "sync; md5sum FILE" some time after the fix attempt. (When done just after the fix, the test tends to return success even if the on-disk data is bad, "thanks" to caching.)

   My last attempt was to set "dedup=on" and write the block again and sync; the (remote) computer hung instantly :(

3*) The RFE stands: deduped blocks found to be invalid and not recovered by redundancy should somehow be evicted from the DDT (or marked for required verification-before-write) so as not to pollute further writes, including repair attempts. Alternatively, "dedup=verify" takes care of the situation and should be the recommended option.

3**) It was suggested to set "dedupditto" to small values, like "2". My oi_148a refused to set values smaller than 100. Moreover, it seems reasonable to have two dedupditto values: for example, to make a ditto copy when the DDT reference counter exceeds some small value (2-5), and add ditto copies every "N" values for frequently-referenced data (every 64-128).

4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum. I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.

5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes. (A rough sketch of the idea follows this message.)

   If the verification fails, currently cached data from recent TXGs can be recovered from on-disk redundancy and/or may still exist in the RAM cache, and can be rewritten again (and tested again). More importantly, a failed test *may* mean that the write landed on disk randomly, and the pool should be scrubbed ASAP. It may be guessed that the yet-unknown error lies within "epsilon" tracks (sector numbers) from the currently found non-written data, so if it is possible to scrub just a portion of the pool based on DVAs - that's a preferred start. It is possible that some data can be recovered if it is tended to ASAP (i.e. on mirror, raidz, copies>1)...

Finally, I should say I'm sorry for lame questions arising from not reading the format spec and zdb blogs carefully ;)

In particular, it was my understanding for a long time that block pointers each have a sector of their own, leading to the overheads that I've seen. Now I know (and checked) that most of the blockpointer tree is made of larger groupings (128 blkptr_t's in a single 16KB block), reducing the impact of BPs on fragmentation and/or the slack waste of large sectors that I predicted and expected for the past year. Sad that nobody ever contradicted that (mis)understanding of mine.

//Jim Klimov
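To make RFE (5) above a bit more concrete, here is a rough pseudocode sketch of the proposed delayed read-back check. Nothing like this exists in ZFS today; every hook and name below is invented, and the shape of the idea is taken only from the paragraphs above.

from time import sleep

def verify_recent_txg(pool, txg, checksum_fn, delay_s=5):
    # Pseudocode for the proposed "verify-after-write" pool option.
    # pool, its methods and checksum_fn are all hypothetical stand-ins.
    sleep(delay_s)                       # give drive write caches time to drain
    for bp in pool.blocks_written_in(txg):
        data = pool.read_dva(bp.dva, bypass_arc=True)   # must hit media, not ARC
        if checksum_fn(data) != bp.checksum:
            # Rewrite from redundancy or from a still-cached copy...
            pool.repair_block(bp)
            # ...and scrub the neighbourhood: a misdirected write may have
            # damaged nearby sectors ("epsilon" tracks) as well.
            pool.schedule_partial_scrub(near=bp.dva)

The caching objections raised in the following messages (testing a TXG at least one behind the last one written, and devices that ignore cache controls) are exactly the parts hand-waved here by delay_s and bypass_arc.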
On Sat, 21 Jan 2012, Jim Klimov wrote:
> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.

This is an interesting idea. I think that you would want to do a mini-scrub on a TXG at least one behind the last one written, since otherwise any test would surely be foiled by caching. The ability to restore data from RAM is doubtful since TXGs get forgotten from memory as soon as they are written.

Bob
--
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-21 19:18, Bob Friesenhahn wrote:
> On Sat, 21 Jan 2012, Jim Klimov wrote:
>> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.
>
> This is an interesting idea. I think that you would want to do a mini-scrub on a TXG at least one behind the last one written, since otherwise any test would surely be foiled by caching. The ability to restore data from RAM is doubtful since TXGs get forgotten from memory as soon as they are written.

That could be rearranged as part of the bug/RFE resolution ;)

Regarding the written data, I believe it may find a place in the ARC, and for the past few TXGs it could still remain there. I am not sure it is feasible to "guarantee" that it remains in RAM for a certain time. Also, there should be a way to enforce media reads and not ARC re-reads when verifying writes...

//Jim
On Sat, 21 Jan 2012, Jim Klimov wrote:
> Regarding the written data, I believe it may find a place in the ARC, and for the past few TXGs it could still remain there.

Any data in the ARC is subject to being overwritten with updated data just a millisecond later. It is a live cache.

> I am not sure it is feasible to "guarantee" that it remains in RAM for a certain time. Also, there should be a way to enforce media reads and not ARC re-reads when verifying writes...

Zfs already knows how to by-pass the ARC. However, any "media" reads are subject to caching, since the underlying devices try very hard to cache data in order to improve read performance.

As an extreme case of caching, consider a device represented by an iSCSI LUN on an OpenSolaris server with 512GB of RAM. If you request to read data, you are exceedingly likely to read data from the zfs ARC on that server rather than from the underlying "media".

Bob
--
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-21 20:50, Bob Friesenhahn wrote:
> TXGs get forgotten from memory as soon as they are written.

As I said, that could be rearranged - i.e. free the TXG cache only after the corresponding TXG number has been verified? The point about the ARC being overwritten seems valid...

> Zfs already knows how to by-pass the ARC. However, any "media" reads are subject to caching, since the underlying devices try very hard to cache data in order to improve read performance.

As a pointer, the "format" command presents options to disable (separately) read and write caching on the drives it sees. MAYBE there is some option to explicitly read data from media, like sync-writes. Whether the drive firmwares honour that (disabling caching and/or such hypothetical sync-reads) - that is something out of ZFS's control. But we can make a best effort...

> As an extreme case of caching, consider a device represented by an iSCSI LUN on an OpenSolaris server with 512GB of RAM. If you request to read data, you are exceedingly likely to read data from the zfs ARC on that server rather than from the underlying "media".

So far I have rather been considering "flaky" hardware with lousy consumer qualities. The server you describe is likely to exceed that bar ;) Besides, if this OpenSolaris server is up to date, it would do such media checks itself, and/or honour the sync-read requests or temporary cache disabling ;) Of course, this can't be guaranteed of other devices, so in general ZFS can do best-effort verification.

//Jim
On Sun, 22 Jan 2012, Jim Klimov wrote:
> So far I have rather been considering "flaky" hardware with lousy consumer qualities. The server you describe is likely to exceed that bar ;)

The most common "flaky" behavior of consumer hardware which causes trouble for zfs is not honoring cache-related requests. Unfortunately, it is not possible for zfs to fix such hardware. Zfs works best with hardware which does what it is told.

Bob
--
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-22 0:55, Bob Friesenhahn wrote:
> On Sun, 22 Jan 2012, Jim Klimov wrote:
>> So far I have rather been considering "flaky" hardware with lousy consumer qualities. The server you describe is likely to exceed that bar ;)
>
> The most common "flaky" behavior of consumer hardware which causes trouble for zfs is not honoring cache-related requests. Unfortunately, it is not possible for zfs to fix such hardware. Zfs works best with hardware which does what it is told.

Also true. That's what the "option" stood for in my proposal: since the verification feature is going to be expensive and add random IOs, we don't want to enforce it on everybody. Besides, the user might choose to trust his reliable and expensive hardware, like a SAN/NAS with battery-backed NVRAM, which is indeed likely better than a home-brewn NAS box with random HDDs thrown in with no measure, but with a desire for some reliability nonetheless ;)

We can "expect" the individual HDDs' caches to expire after some time (i.e. after we have sent 64MB worth of writes to a particular disk with a 64MB cache), and after that we are likely to get true media reads. That's when the verification reads are likely to return the most relevant (on-disk) sectors...

//Jim
On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:

> Well, as I'm moving to close my quest with broken data, I'd like to draw up some conclusions and RFEs. I am still not sure they are factually true - I'm still learning the ZFS internals. So "it currently seems to me that":
>
> 1) My on-disk data could get corrupted for whatever reason ZFS tries to protect it from, at least once probably from misdirected writes (i.e. the head landed not where it was asked to write). It cannot be ruled out that the checksums got broken in non-ECC RAM before the writes of block pointers for some of my data, thus leading to mismatches. One way or another, ZFS noted the discrepancy during scrubs and "normal" file accesses. There is no (automatic) way to tell which part is faulty - the checksum or the data.

Untrue. If a block pointer is corrupted, then on read it will be logged and ignored. I'm not sure you have grasped the concept of checksums in the parent object.

> 2) In the case where on-disk data did get corrupted, the checksum in the block pointer was correct (matching the original data), but the raidz2 redundancy did not aid recovery.

I think your analysis is incomplete. Have you determined the root cause?

> 3) The file in question was created on a dataset with deduplication enabled, so at the very least the dedup bit was set on the corrupted block's pointer and a DDT entry likely existed. Attempts to rewrite the block with the original one (having "dedup=on") failed in fact, probably because the matching checksum was already in the DDT.

Works as designed.

> Rewrites of such blocks with "dedup=off" or "dedup=verify" succeeded.
>
> Failure/success was tested by "sync; md5sum FILE" some time after the fix attempt. (When done just after the fix, the test tends to return success even if the on-disk data is bad, "thanks" to caching.)

No, I think you've missed the root cause. By default, data that does not match its checksum is not used.

> My last attempt was to set "dedup=on" and write the block again and sync; the (remote) computer hung instantly :(
>
> 3*) The RFE stands: deduped blocks found to be invalid and not recovered by redundancy should somehow be evicted from the DDT (or marked for required verification-before-write) so as not to pollute further writes, including repair attempts. Alternatively, "dedup=verify" takes care of the situation and should be the recommended option.

I have lobbied for this, but so far people prefer performance to dependability.

> 3**) It was suggested to set "dedupditto" to small values, like "2". My oi_148a refused to set values smaller than 100. Moreover, it seems reasonable to have two dedupditto values: for example, to make a ditto copy when the DDT reference counter exceeds some small value (2-5), and add ditto copies every "N" values for frequently-referenced data (every 64-128).
>
> 4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum.

All checksum mismatches are handled the same way.

> I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.

Checksum mismatches are logged; what was your root cause?

> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.

There are highly-reliable systems that do this in the fault-tolerant market.

> If the verification fails, currently cached data from recent TXGs can be recovered from on-disk redundancy and/or may still exist in the RAM cache, and can be rewritten again (and tested again). More importantly, a failed test *may* mean that the write landed on disk randomly, and the pool should be scrubbed ASAP. It may be guessed that the yet-unknown error lies within "epsilon" tracks (sector numbers) from the currently found non-written data, so if it is possible to scrub just a portion of the pool based on DVAs - that's a preferred start. It is possible that some data can be recovered if it is tended to ASAP (i.e. on mirror, raidz, copies>1)...
>
> Finally, I should say I'm sorry for lame questions arising from not reading the format spec and zdb blogs carefully ;)
>
> In particular, it was my understanding for a long time that block pointers each have a sector of their own, leading to the overheads that I've seen. Now I know (and checked) that most of the blockpointer tree is made of larger groupings (128 blkptr_t's in a single 16KB block), reducing the impact of BPs on fragmentation and/or the slack waste of large sectors that I predicted and expected for the past year. Sad that nobody ever contradicted that (mis)understanding of mine.

Perhaps some day you can become a ZFS guru, but the journey is long...
 -- richard

--
ZFS Performance and Training
Richard.Elling@RichardElling.com
+1-760-896-4422
2012-01-22 22:58, Richard Elling wrote:
> On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:
>> ... So "it currently seems to me that":
>>
>> 1) My on-disk data could get corrupted for whatever reason ZFS tries to protect it from, at least once probably from misdirected writes (i.e. the head landed not where it was asked to write). It cannot be ruled out that the checksums got broken in non-ECC RAM before the writes of block pointers for some of my data, thus leading to mismatches. One way or another, ZFS noted the discrepancy during scrubs and "normal" file accesses. There is no (automatic) way to tell which part is faulty - the checksum or the data.
>
> Untrue. If a block pointer is corrupted, then on read it will be logged and ignored. I'm not sure you have grasped the concept of checksums in the parent object.

If a block pointer is corrupted on disk after the write - then yes, it will not match the parent's checksum, and there would be another 1 or 2 ditto copies with possibly correct data. Is that the correct grasping of the concept? ;)

Now, the (non-zero-probability) scenario I meant was that the checksum for the block was calculated and then corrupted in RAM/CPU before the ditto blocks were fanned out to disks, and before the parent block checksums were calculated. In this case the on-disk data block is correct as compared to other sources (if it is copies=2 - it may even be the same as its other copy), but it does not match the BP's checksum, while the BP tree seems valid (all tree checksums match). I believe in this case ZFS should flag the data checksum mismatch, although in reality (with minuscule probability) it is the bad checksum mismatching the good data.

Anyway, the situation would seem the same if the data block was corrupted in RAM before fanning out with copies>1, and that is more probable given the size of this block compared to the 256 bits of checksum. Just *HOW* probable that is on an ECC and a non-ECC system, with or without an overclocked, overheated CPU in an enthusiast's overpumped workstation or an unsuspecting consumer's dusty closet - that is a separate maths question, with different answers for different models. Random answer - on par with the disk UBER errors which ZFS by design considers serious enough to combat.

>> 2) In the case where on-disk data did get corrupted, the checksum in the block pointer was correct (matching the original data), but the raidz2 redundancy did not aid recovery.
>
> I think your analysis is incomplete.

As I last wrote, I dumped the blocks with ZDB and compared the bytes with the same block from a good copy. In particular, that copy had the same SHA256 checksum as was stored in my problematic pool's blockpointer entry for the corrupt block. These blocks differed in three sets of 4096 bytes starting at "round" offsets at even intervals (4KB, 36KB, 68KB). 4KB is my disks' block size. It seems that some disk(s?) overwrote existing data, or got scratched, or whatever (no IO errors in dmesg though).

I am not certain why raidz2 did not suffice to fix the block, and what garbage or data exists on all 6 drives - I did not get zdb to dump all 0x30000 bytes of raidz2 raw data to try permutations myself. Possibly, for whatever reason (such as a cable error, or some firmware error given the same model of the drives), several drives got the same erroneous write command at once, and ultimately invalidated parts of the same stripe. Many of the files in peril now have existed on the pool for some time, and scrubs completed successfully many times.

> Have you determined the root cause?

Unfortunately, I'm currently in another country, away from my home-NAS server. So all physical maintenance, including pushing the reset button, is done by friends living in the apartment. And there is not much physical examination that can be done this way.

At one point in time recently (during a scrub in January), one of the disks got lost and was not seen by the motherboard even after reboots, so I had my friends take out and replug the SATA cables. This helped, so connector noise was possibly the root cause. It might also account for an incorrect address for a certain write that slashed randomly on the platter.

The PSU is excessive for the box's requirements, with slack performance to degrade ;) The P4 CPU is not overclocked. RAM is non-ECC, and that is not changeable given the Intel CPU, chipset and motherboard. The HDDs are on the MB's controller. The 6 HDDs in the raidz2 pool are consumer-grade SATA Seagate ST2000DL003-9VT166, firmware CC32. Degrading cabling and/or connectors can indeed be one of about two main causes, the other being non-ECC RAM. Or an aging CPU.

>> 3) The file in question was created on a dataset with deduplication enabled, so at the very least the dedup bit was set on the corrupted block's pointer and a DDT entry likely existed. Attempts to rewrite the block with the original one (having "dedup=on") failed in fact, probably because the matching checksum was already in the DDT.
>
> Works as designed.

If this is the case, the design does not account for finding an error (ZFS found it first, not I) - and still trusts the DDT entry although it points to garbage now. The design should be fixed then ;)

>> Rewrites of such blocks with "dedup=off" or "dedup=verify" succeeded.
>>
>> Failure/success was tested by "sync; md5sum FILE" some time after the fix attempt. (When done just after the fix, the test tends to return success even if the on-disk data is bad, "thanks" to caching.)
>
> No, I think you've missed the root cause. By default, data that does not match its checksum is not used.

I suspected that reads subsequent to a write (with dedup=on) came from the ARC cache. At least, it was my expectation that, like on many other caching systems, recent writes' buffers are moved into the read cache for the addresses in question. Thus, if the system only incremented the DDT counter, it has no information that the on-disk data mismatched the checksum.

Anyhow, the fact is that in this test case, when I read the "fixed" local files just after the rsync from a good remote source, I got no IO errors and correct md5sums. When I re-read them after a few minutes, the IO errors were there again. With "dedup=off" and "dedup=verify" the errors did not return after a while. I have explained my interpretation of these facts based on my current understanding of the system; hopefully you have some better-informed explanation ;)

>> My last attempt was to set "dedup=on" and write the block again and sync; the (remote) computer hung instantly :(
>>
>> 3*) The RFE stands: deduped blocks found to be invalid and not recovered by redundancy should somehow be evicted from the DDT (or marked for required verification-before-write) so as not to pollute further writes, including repair attempts. Alternatively, "dedup=verify" takes care of the situation and should be the recommended option.
>
> I have lobbied for this, but so far people prefer performance to dependability.

At the very least, in the docs outlining deduplication, this possible situation could be marked as an incentive to use "verify". And if errors are found (by scrub/read), something should be done - like forcing "verify"? Or at least suggesting that? Logged in the illumos bugtracker [1].

[1] https://www.illumos.org/issues/1981

Regarding performance... it seems that some of the design decisions were influenced by certain customers and their setups. There's nothing inherently wrong with tuning the system (by default) with reference to real-life situations, until such cases dictate the one-and-only required policy for everybody else. Like with the vdev-prefetch code slated for removal [2] - not every illumos user has hundreds of disks on memoryless head nodes. And those who don't might have more benefit from prefetch than they lose by dedicating a few MB of RAM to that cache. So I think for that case it was sufficient to have the zeroed cache size by default, while leaving the ability to use more if we desire. Perhaps there is even room for vdev-prefetching improvement, also logged in the bugtracker [2],[3] ;)

[2] https://www.illumos.org/issues/175
[3] https://www.illumos.org/issues/2017

>> 3**) It was suggested to set "dedupditto" to small values, like "2". My oi_148a refused to set values smaller than 100. Moreover, it seems reasonable to have two dedupditto values: for example, to make a ditto copy when the DDT reference counter exceeds some small value (2-5), and add ditto copies every "N" values for frequently-referenced data (every 64-128).

Also logged in the bugtracker [4] ;) (A small sketch of such a policy follows this message.)

[4] https://www.illumos.org/issues/2016

>> 4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum.
>
> All checksum mismatches are handled the same way.
>
>> I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.
>
> Checksum mismatches are logged; what was your root cause?

As written above, at least for one case it was probably a random write by a disk over existing sectors, invalidating the block.

I have yet to test (to be certain) whether writing over a block which is invalid on disk and marked as deduped, with dedup=verify, would increase the CKSUM counter. Waiting for friends to reboot that computer now to make the test ;(

Still, according to "Works as designed" above, logging the mismatch so far has no effect on not-using the old DDT entry pointing to corrupt data. Just in case, logged as https://www.illumos.org/issues/2015

>> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.
>
> There are highly-reliable systems that do this in the fault-tolerant market.

Do you want to sell your systems (and/or OS), or others to fill this niche? ;) At least, having the option for such checks (as much as the disks would obey it) is better than not having it. Whether to enable it is another matter. I logged an RFE in the illumos bugtracker yesterday [5], lest the idea be forgotten, at the very least...

[5] https://www.illumos.org/issues/2008

And, ummm, however much you dislike tunables, I think this situation calls for an on-off switch and two code paths, because, like dedup=verify, this feature will by design deal a heavy hit to performance. Still, one size cannot fit all. Those who like purple likely won't wear white ;) Some want performance; others just want their photo archive to survive the decade, and say 10MBps would suffice to copy their new set of pictures from the CF card ;)

If the verification code gets into the OS, it can be turned on or off by a tunable (switch) depending on the other components' {hardware} reliability and performance requirements. Perhaps we should trust consumer drives less than enterprise ones (a couple of orders of magnitude in UBER difference, better materials, more in-factory testing) and request such verification only for the former?.. That's not up to us to decide (one size for all), but up to end-users or their integrators.

>> Sad that nobody ever contradicted that (mis)understanding of mine.
>
> Perhaps some day you can become a ZFS guru, but the journey is long...

Well, I do look forward to that, and hope I can learn from the likes of you. And there are not many "ZFS under the hood" type of textbooks out there. So I go on asking and asking ;)

//Jim Klimov
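A tiny sketch of the two-value dedupditto policy proposed in [4] above. This is not current ZFS behaviour (today there is a single dedupditto cutoff, which the pool above refused to set below 100), and the default thresholds here are invented examples:

def desired_copies(refcount, first_ditto=3, every_n=100, max_copies=3):
    # How many copies of a deduped block we would like to keep, given how many
    # block pointers reference it. Thresholds are made-up examples.
    copies = 1
    if refcount >= first_ditto:       # early extra copy for mildly popular blocks
        copies += 1
    copies += refcount // every_n     # another copy every N references
    return min(copies, max_copies)

# e.g. desired_copies(2) == 1, desired_copies(5) == 2, desired_copies(250) == 3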
2012-01-23 18:25, Jim Klimov wrote:
>>> 4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum.
>>
>> All checksum mismatches are handled the same way.
>
> I have yet to test (to be certain) whether writing over a block which is invalid on disk and marked as deduped, with dedup=verify, would increase the CKSUM counter.

I checked (oi_148a LiveUSB), by writing a correct block (128KB) instead of the corrupted one into the file, and:

* "dedup=on" neither fixed the on-disk file nor logged an error, and subsequent reads produced IO errors (and increased the counter). Probably just the DDT counter was increased during the write (that's the "works as designed" part);

* "dedup=verify" *doesn't* log a checksum error if it finds a block whose assumed checksum matches the newly written block, but whose contents differ from the new block during dedup verification and in fact do not match the checksum either (at least, not the one in the block pointer). Reading the block produced no errors;

* what's worse, re-enabling "dedup=on" and writing the same block again crashes (reboots) the system instantly. Possibly because now there are two DDT entries pointing to the same checksum in different blocks, and no verification was explicitly requested?

A reenactment of the test (as a hopefully reproducible test case) constitutes the remainder of the post, and thus it is going to be lengthy... Analyze that! ;)

>>> I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.

Statement/desire still stands.

>> Checksum mismatches are logged,

No, they are not (in this case).

>> what was your root cause?

Probably the same as before - some sort of existing on-disk data corruption which overwrote some sectors, and raidz2 failed to reconstruct the stripe. I seem to have had about a dozen such files. I fixed some by rsync with different dedup settings, before going into it all deeper. I am not sure if any of them had overlapping DVAs (those which remain corrupted now don't), but many addresses lie in very roughly similar address ranges (within several GBs or so).

> As written above, at least for one case it was probably a random write by a disk over existing sectors, invalidating the block.
>
> Still, according to "Works as designed" above, logging the mismatch so far has no effect on not-using the old DDT entry pointing to corrupt data.
>
> Just in case, logged as https://www.illumos.org/issues/2015

REENACTMENT OF THE TEST CASE

Besides illustrating my error for those who decide to take on the bug, I hope this post will also help others in their data recovery attempts, zfs research, etc. If my methodology is faulty, I hope someone points that out ;)

1) Uh, I have unrecoverable errors!

The computer was freshly rebooted, pool imported (with rollback), no newly known CKSUM errors (but we have the nvlist of existing mismatching files):

# zpool status -vx
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 244K in 138h57m with 31 errors on Sat Jan 14 01:50:16 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        ...
        pool/mymedia:/incoming/DSCF1234.AVI

NOTE: I don't yet have full detail on the <metadata>:<0x0> error; I have asked about it numerous times on the list.

2) Mine some information about the file and error location

* mount the dataset
# zfs mount pool/mymedia

* find the inode number
# ls -i /pool/mymedia/incoming/DSCF1234.AVI
6313 /pool/mymedia/incoming/DSCF1234.AVI

* dump ZDB info
# zdb -dddddd pool/mymedia 6313 > /tmp/zdb.txt

* find the bad block offset
# dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/dev/null \
    bs=512 conv=noerror,notrunc
dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
58880+0 records in
58880+0 records out
30146560 bytes (30 MB) copied, 676.772 s, 44.5 kB/s
(error repeated 256 times)
239145+1 records in
239145+1 records out
122442738 bytes (122 MB) copied, 2136.19 s, 57.3 kB/s

So the error is at offset 58880*512 bytes = 0x1CC0000, and its size is 512b*256 = 128KB.

3) Review the /tmp/zdb.txt information

We need the L0 entry for the erroneous block and its parent L1 entry:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
      6313    3    16K   128K   117M   117M  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 935
...
Indirect blocks:
         0 L2    0:ac25a001000:3000 0:927ba8e7000:3000 4000L/400P F=936 B=325191092/325191092
         0 L1    0:ac23ecbf000:6000 0:927b38c1000:6000 4000L/1800P F=128 B=325191082/325191082
...
   1000000 L1    0:ac243ea6000:6000 0:927b474c000:6000 4000L/1800P F=128 B=325191084/325191084
   1cc0000 L0    0:aca016fa000:30000 20000L/20000P F=1 B=325191083/325191083
...

For readers not in the know: ZFS indirect blocks at level L0 address the "userdata" blocks. Indirect blocks are stored in sets of up to 128 entries, so if more blocks are needed to store the file, there appears a tree of indirect blocks, up to 6 levels deep or so. That's a lot of possibly addressable userdata blocks (2^64, they say, IIRC).

In my case, the file is comprised of 936 "userdata" blocks (numbered from 0 to 935 = maxblkid) and requires an L2 tree to store it. The file is conveniently made of 128KB blocks sized 0x20000 bytes (logical and physical userdata, no compression here), taking up 0x30000 bytes for the allocation of the data on disk (including raidz2 redundancy on 6 disks).

Each L1 metadata block covers 0x1000000 bytes of this file, so there are 8 L1 blocks and one L2 block to address them.

Our L0 block starting at 0x1CC0000 is number 0x66 (102) in its parent L1 block starting at 0x1000000:
(0x1CC0000 - 0x1000000) / 0x20000 = 0x66 = 102

Its L0 metadata starts at byte offset 0x3300 in the L1 metadata block, since all blkptr_t entries in Ln are 0x80 (128) bytes long:
0x66 * 0x80 = 0x3300 = 13056 bytes

* Extract the raw L1 information - for checksum information, compression and dedup flags, etc. The metadata is compressed (as seen via 4000L/1800P in zdb.txt), so we decompress it on the fly:
# zdb -R pool 0:ac243ea6000:1800:dr > /tmp/l1
Found vdev type: raidz

* Confirm that we can't read the block directly:
# dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/tmp/blk-bad.bin \
    bs=131072 count=1 skip=230
dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0995645 s, 0.0 kB/s

* Dump the (corrupt) raw block data with ZDB, uncompressed:
# zdb -R pool 0:aca016fa000:20000:r > /tmp/blk-bad.bin
Found vdev type: raidz

* Use a "dd" command similar to the one above to extract the same block offset and size from the known-good copy of the file, saved as /tmp/blk-orig.bin.

4) Inspect the L1 entry

NOTES: See the definition of the blkptr_t structure here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/sys/spa.h#blkptr
and some more details here:
http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/ondiskformat0822.pdf

* Dump the blkptr_t entry 102 (offset 0x3300) for inspection:
# od -Ax -tx1 /tmp/l1 | egrep '^0033' | head -8
003300 80 01 00 00 00 00 00 00 d0 b7 00 65 05 00 00 00
003310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003330 ff 00 ff 00 02 08 13 c0 00 00 00 00 00 00 00 00
003340 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003350 ab 05 62 13 00 00 00 00 01 00 00 00 00 00 00 00
003360 14 ad 26 24 f2 89 8e 28 f1 ae 5d 99 2d 3b 3c 58
003370 c1 91 74 21 09 28 48 5b 55 6f b4 8f 7e ed 3a 52

Cutting to the chase, I'm interested in these bytes now:
0x3334  02  compression = off
0x3335  08  checksum algo = sha256
0x3336  13  data type = 19 = file data (DMU_OT_PLAIN_FILE_CONTENTS)
0x3337  c0  11000000 binary; bits: endianness=1, dedup=1

Bytes 0x3360-0x337f contain the checksum value in an inverted form of 4 64-bit words. Read right-to-left in four 8-byte groups (3367-3360, 336f-3368, etc.). Most of the other information about the userblock was presented by ZDB in its report about the L0 entry (converted DVA, sizes, TXGs, etc.).

Checksum value in the correct order of bytes:
28 8e 89 f2 24 26 ad 14
58 3c 3b 2d 99 5d ae f1
5b 48 28 09 21 74 91 c1
52 3a ed 7e 8f b4 6f 55

5) Test the checksums

# openssl dgst -sha256 < /tmp/blk-orig.bin
288e89f22426ad14583c3b2d995daef15b482809217491c1523aed7e8fb46f55
# openssl dgst -sha256 < /tmp/blk-bad.bin
ab34939a6342e3f353927a395a04aa37e50f1e74e7a9b993e731c790ba554a59

As we see, the checksums differ, and the "orig" one matches the checksum value in the L1 block above. So that copy is good indeed.

6) Compare the good and bad blocks

* Convert the files into readable forms:
# od -Ax -tx1 /tmp/blk-bad.bin > /tmp/blk-bad.txt
# od -Ax -tx1 /tmp/blk-orig.bin > /tmp/blk-orig.txt

* Compare the files:
# diff -bu /tmp/blk-*txt | less
--- /tmp/blk-bad.txt    2012-01-23 22:20:56.669957059 +0000
+++ /tmp/blk-orig.txt   2012-01-23 22:20:42.886726737 +0000
@@ -6142,135 +6142,262 @@
 017fd0 bc ad fb 94 39 7d 9f e6 8b e8 ab 92 5a e5 70 b2
 017fe0 6e ed 47 e7 eb 9d e4 ed 9a 8c dd 13 66 79 aa 9e
 017ff0 8a cb d5 b4 59 f5 33 cc a8 b3 0f 52 86 64 e9 9e
-018000 00 02 00 69 f8 82 00 00 00 a8 02 00 8a 10 08 7e
-018010 10 08 a5 10 08 2a e4 10 08 d2 10 08 9f 08 08 05
-018020 00 54 1a f9 0c 38 32 10 08 29 10 08 8f 15 10 08
(...)
-0187e0 00 5b 54 00 83 00 00 00 00 00 00 00 00 00 00 00
-0187f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-*
+018000 d4 c3 94 06 b3 9a dc 89 26 24 c9 36 15 91 c2 73
+018010 1f 48 06 04 b7 57 17 d6 57 00 d7 b3 03 40 2b c1
(...)
+018fe0 32 b3 50 8c e8 c6 77 69 4e 54 7d 55 9e b7 d5 ce
+018ff0 a7 23 21 41 0c ac ea 11 82 e0 86 da 6a 62 ef a6
 019000 28 1c 74 02 66 d4 7a 21 0b 86 97 08 e0 13 bd 2a
(...)

So here we had one 4KB streak of unexpected data. No particular (ir)regularities were noticed in this case, just one block of "white noise" replaced by another.

===========================================================

So, by this time we have:
* known-good and proven-bad copies of the block;
* the known offset and size of this block in the file;
* ZDB info on the layout of the file;
* a corrupt file in a deduped dataset;
* an unanswered question: how do dedup and errors interact?

Time to squash some bugs!

1) What type of dedup was on the dataset?

# zfs get dedup pool/mymedia
NAME          PROPERTY  VALUE          SOURCE
pool/mymedia  dedup     sha256,verify  local

2) How many CKSUM errors do we have now?

# zpool status -v pool | grep ONLINE
        pool        ONLINE       0     0     8
          raidz2-0  ONLINE       0     0     8
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0

OK, by the time I got here, the pool had logged hitting CKSUM errors 8 times, but none definitely on any certain disk.

3) Let's try "fixing" the file with "dedup=on":

* Set the dedup mode on the dataset:
# zfs set dedup=on pool/mymedia

* Overwrite the byte range in the file:
# dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
1+0 records in
1+0 records out
131072 bytes (131 kB) copied, 0.000837707 s, 156 MB/s
# sleep 10; sync

* Did the known CKSUM error count increase? No, it did not:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0     8
          raidz2-0  ONLINE       0     0     8

* Try reading the file:
# md5sum /pool/mymedia/incoming/DSCF1234.AVI
md5sum: /pool/mymedia/incoming/DSCF1234.AVI: I/O error

* Did the known CKSUM error count increase? Yes, it did now:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0    10
          raidz2-0  ONLINE       0     0    10

So, in fact, this did not fix the file. Did anything change with the blockpointer tree?

* dump ZDB info:
# zdb -dddddd pool/mymedia 6313 > /tmp/zdb-2.txt

* compare the two info files:
# diff /tmp/zdb*.txt
<          0 L2    0:6a01ccb6000:3000 0:2902c091000:3000 4000L/400P F=688 B=325617005/325617005
>          0 L2    0:ac25a001000:3000 0:927ba8e7000:3000 4000L/400P F=936 B=325191092/325191092
<    1000000 L1    0:6a01ccb0000:6000 0:2902c05b000:6000 4000L/1800P F=128 B=325617005/325617005
>    1000000 L1    0:ac243ea6000:6000 0:927b474c000:6000 4000L/1800P F=128 B=325191084/325191084
<    1cc0000 L0    0:aca016fa000:30000 20000L/20000P F=1 B=325617005/325191083
>    1cc0000 L0    0:aca016fa000:30000 20000L/20000P F=1 B=325191083/325191083

So, the on-disk L0-block DVA address did not change, only a TXG entry - and accordingly the indirect blockpointer tree. Since the DVA remains unchanged, and ZFS's COW does not let it overwrite existing data, this means that the block on disk remains corrupted. As we have seen with md5sum above.

4) Now let's retry with "dedup=verify":

* Set the dedup mode on the dataset:
# zfs set dedup=sha256,verify pool/mymedia

* Overwrite the byte range in the file:
# dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
1+0 records in
1+0 records out
131072 bytes (131 kB) copied, 0.00109255 s, 120 MB/s
# sleep 10; sync

* Did the known CKSUM error count increase? No, it did not:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0    10
          raidz2-0  ONLINE       0     0    10

* Try reading the file:
# md5sum /pool/mymedia/incoming/DSCF1234.AVI
731247e7d71951eb622b979bc18f3caa  /pool/mymedia/incoming/DSCF1234.AVI

* Did the known CKSUM error count increase? No, it still did not:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0    10
          raidz2-0  ONLINE       0     0    10

So, in fact, this did fix the file. We can even attest to that with ZDB: the tree of indirect blocks has changed (levels L2, L1 and L0 of our branch, but not the L1 of another branch):

Indirect blocks:
         0 L2    0:6a27641d000:3000 0:2992ca90000:3000 4000L/400P F=936 B=325619762/325619762
         0 L1    0:ac23ecbf000:6000 0:927b38c1000:6000 4000L/1800P F=128 B=325191082/325191082
   1000000 L1    0:6a19c140000:6000 0:299204f9000:6000 4000L/1800P F=128 B=325619762/325619762
   1cc0000 L0    0:6a7df3b2000:30000 20000L/20000P F=1 B=325619762/325619762

But the CKSUM errors were not logged as such during dedup-verify - even though the block read from disk mismatched the checksum in the block pointer (and presumably in the DDT).

5) Now for the interesting part: retry with "dedup=on"

* Set the dedup mode on the dataset:
# zfs set dedup=on pool/mymedia

* Overwrite the byte range in the file:
# dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
1+0 records in
1+0 records out
131072 bytes (131 kB) copied, 0.000873926 s, 150 MB/s

The command prompt never returned; the host rebooted "instantly". According to another terminal logging "iostat -xnz 1", the new writes never made it to disk. After the reboot the pool imported cleanly, and the file's md5sum matches. The fix from dedup=verify remains in place...

HTH and thanks,
//Jim Klimov
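For anyone repeating this procedure, the by-hand inspection in step 4 can be automated with a few lines of Python. The sketch below relies only on the offsets worked out above (the property bytes at 0x34..0x37 of a 128-byte blkptr_t and the SHA256 stored as four little-endian 64-bit words at 0x60..0x7f); treat those offsets and the dedup-bit mask as assumptions to be re-checked against spa.h, and the script name is of course made up.

#!/usr/bin/env python3
# bpcheck.py - decode one blkptr_t from a decompressed L1 dump (zdb -R ...:dr)
# and compare its stored SHA256 against a candidate block read from a file.
# Offsets follow the manual analysis in this thread; double-check against spa.h.
import hashlib
import struct
import sys

BP_SIZE = 0x80  # each blkptr_t is 128 bytes; a 16KB L1 block holds 128 of them

def read_blkptr(l1_path, index):
    with open(l1_path, 'rb') as f:
        f.seek(index * BP_SIZE)
        return f.read(BP_SIZE)

def decode(bp):
    comp, cksum_algo, dtype, flags = bp[0x34], bp[0x35], bp[0x36], bp[0x37]
    # Stored checksum: four 64-bit little-endian words; re-pack big-endian to
    # get the same hex string that openssl dgst -sha256 prints.
    words = struct.unpack('<4Q', bp[0x60:0x80])
    digest = b''.join(struct.pack('>Q', w) for w in words).hex()
    return {'compression': comp, 'checksum_algo': cksum_algo, 'type': dtype,
            'dedup': bool(flags & 0x40),  # assumed dedup bit, per the 0xc0 byte above
            'sha256': digest}

if __name__ == '__main__':
    l1_dump, index, data_file = sys.argv[1], int(sys.argv[2], 0), sys.argv[3]
    info = decode(read_blkptr(l1_dump, index))
    actual = hashlib.sha256(open(data_file, 'rb').read()).hexdigest()
    print('blkptr fields:', info)
    print('stored sha256:', info['sha256'])
    print('actual sha256:', actual)
    print('MATCH' if actual == info['sha256'] else 'MISMATCH')

With the artifacts gathered above, something like "python3 bpcheck.py /tmp/l1 0x66 /tmp/blk-orig.bin" should report MATCH, and MISMATCH when pointed at /tmp/blk-bad.bin.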