Hello all,

I have a new "crazy idea" of the day ;)

Some years ago an idea was proposed in one of the ZFS developers' blogs (maybe Jeff's? sorry, I can't find a link now) that went roughly like this: modern disks keep ECC/CRC codes for each sector and use them to verify read-in data. If the disk fails to produce a sector correctly, it tries harder to read it and, if possible, reallocates the LBA to a spare-sector region. This causes extra random IO for linearly-numbered LBA sectors, and wastes platter space on spare sectors and checksums - at least compared to the better error detection and redundancy of ZFS checksums. Besides, attempts to re-read a faulty sector may succeed, may produce undetected garbage, and may take some time (seconds, perhaps) if the retries fail consistently. Then the block is marked bad and the data is lost. The article went on to suggest: "let's get an OEM vendor to sell us the same disks without these kludges, and we'll get (20%?) more platter speed and volume, better used by ZFS error-detection and repair mechanisms."

I've recently had a sort of opposite thought: yes, ZFS redundancy is good - but it is also expensive in raw disk space. This is especially bad for space-constrained hardware like laptops and home NASes, where doubling the number of HDDs (for mirrors) or adding tens of percent of storage for raidz is often impractical.

Current ZFS checksums let us detect errors, but for recovery to actually work there must be a redundant copy and/or parity block available and valid.

Hence the question: why not put ECC info into ZFS blocks?

IMHO, pluggable ECC (like pluggable compression or the choice of checksum algorithm - in this case ECC algorithms allowing recovery of 1 or 2 bits, for example) would be cheaper in disk space than redundancy (a few % instead of 25-50%), and would still allow recovery from certain errors, such as on-disk or on-wire bit rot, even in single-disk ZFS pools. This could be an inheritable per-dataset attribute like compression, encryption, dedup or checksum algorithms.

Relocation of recovered "faulted" blocks into currently free space is already part of ZFS, except that it might now have to track a notion of "permanently bad block lists" and decrease the space available for addressing on each leaf VDEV. There should also be a mechanism to retest and clear such blocks, e.g. when a faulty drive or LUN is replaced by a new one (perhaps by DD'ing the old drive onto the new one and swapping them while the pool is offline) - probably as a special scrub-like zpool command, also invoked during scrub.

This might be combined with the wish for OEM disks that drop hardware ECC/spare sectors in return for more performance; although I'm not sure how well that would work in practice - the hardware maker's in-depth knowledge of how to retry reading initially "faulty" blocks, e.g. by changing voltage or platter speed or whatever, may be invaluable and not replaceable by software.

What do you think? Doable? Useful? Why not, if not? ;)

Thanks,
//Jim Klimov
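P.S. To make the "pluggable ECC" idea a bit more concrete, here is a toy sketch (my own illustration only - none of this exists in ZFS, and a real implementation would need a much stronger code, e.g. Reed-Solomon, interleaved over whole 512B-128KB blocks). It shows the flavor of single-bit repair a hypothetical ecc= property might provide, using the textbook Hamming(7,4) code:

    /*
     * Toy Hamming(7,4) codec: 4 data bits protected by 3 parity bits,
     * any single flipped bit in the 7-bit codeword is repairable.
     */
    #include <stdio.h>
    #include <stdint.h>

    /* Encode 4 data bits into a 7-bit codeword (positions 1..7). */
    static uint8_t hamming74_encode(uint8_t d)
    {
        uint8_t d1 = (d >> 0) & 1, d2 = (d >> 1) & 1;
        uint8_t d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
        uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
        uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
        uint8_t p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */
        /* codeword positions: 1=p1 2=p2 3=d1 4=p3 5=d2 6=d3 7=d4 */
        return (uint8_t)(p1 << 0 | p2 << 1 | d1 << 2 |
                         p3 << 3 | d2 << 4 | d3 << 5 | d4 << 6);
    }

    /* Repair any single-bit error; return the 4 decoded data bits. */
    static uint8_t hamming74_decode(uint8_t cw)
    {
        uint8_t b[8];
        for (int i = 1; i <= 7; i++)
            b[i] = (cw >> (i - 1)) & 1;
        /* syndrome = binary index of the flipped position, 0 if clean */
        int s = (b[1]^b[3]^b[5]^b[7])
              | (b[2]^b[3]^b[6]^b[7]) << 1
              | (b[4]^b[5]^b[6]^b[7]) << 2;
        if (s != 0)
            b[s] ^= 1;               /* repair the single bad bit */
        return (uint8_t)(b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3);
    }

    int main(void)
    {
        uint8_t cw = hamming74_encode(0xB);
        cw ^= 1 << 4;                /* simulate one-bit rot on disk */
        printf("recovered 0x%X from damaged codeword\n",
            hamming74_decode(cw));   /* prints 0xB */
        return 0;
    }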
On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> I've recently had a sort of opposite thought: yes, ZFS redundancy
> is good - but it is also expensive in raw disk space. This is
> especially bad for space-constrained hardware like laptops and
> home NASes, where doubling the number of HDDs (for mirrors) or
> adding tens of percent of storage for raidz is often impractical.

Redundancy through RAID-Z and mirroring is expensive for home systems and laptops, but mostly due to the cost of SATA/SAS ports, not the cost of the drives. The drives are cheap, but getting an extra disk into a laptop is either impossible or expensive. That doesn't mean you can't mirror slices or use ditto blocks, though. For laptops, just use ditto blocks and either zfs send or an external mirror that you attach/detach.

> Current ZFS checksums let us detect errors, but for recovery to
> actually work there must be a redundant copy and/or parity block
> available and valid.
>
> Hence the question: why not put ECC info into ZFS blocks?

RAID-Zn *is* an error correction system. But what you are asking for is a same-device error correction method that costs less than ditto blocks, with the error correction data baked into the blkptr_t. Are there enough free bits left in the block pointer for error correction codes for large blocks? (128KB blocks today, but eventually ZFS needs to support even larger blocks, so keep that in mind.) My guess is: no. Error correction data might have to be stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking at it the right way. Perhaps there is a killer enterprise feature for ECC here: stretching MTTDL in the face of a device failure in a mirror or raid-z configuration (but if failures are typically of whole drives rather than individual blocks, this wouldn't help). Without a good answer for where to store the ECC for the largest blocks, though, I don't see this happening.

Nico
--
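P.S. A back-of-the-envelope on the bit-count question (my own arithmetic, not anything in ZFS): by the Hamming bound, 2^r >= m + r + 1, a single-correct/double-detect (SECDED) code needs only ~20 bits even over a 128KB block, so the raw bit count is not the obstacle:

    #include <stdio.h>

    /* Minimum SECDED parity bits per whole block, per 2^r >= m+r+1. */
    int main(void)
    {
        for (unsigned long bytes = 512; bytes <= 131072; bytes *= 2) {
            unsigned long m = bytes * 8;    /* data bits */
            int r = 1;
            while ((1UL << r) < m + r + 1)
                r++;
            /* +1 parity bit upgrades single-correct to SECDED */
            printf("%7lu-byte block: %2d SECDED parity bits\n",
                bytes, r + 1);
        }
        return 0;
    }

The catch is that one SECDED word over 128KB corrects exactly one flipped bit in the whole block, while a failing sector takes out thousands of contiguous bits; any practical scheme would interleave many codewords (or use Reed-Solomon), which is where the few-percent overhead Jim mentioned comes from.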
2012-01-11 20:40, Nico Williams wrote:
> Redundancy through RAID-Z and mirroring is expensive for home systems
> and laptops, but mostly due to the cost of SATA/SAS ports, not the
> cost of the drives. ... For laptops, just use ditto blocks and either
> zfs send or an external mirror that you attach/detach.

Yes, basically that's what we do now, and it halves the available disk space and increases latency (extra seeks) ;)

I get (and share) your concern about ECC entry size for larger blocks. NOTE: I don't know the ECC algorithms deeply enough to speculate about space requirements, except that, as used in ECC RAM, a SECDED code for 64 bits of user data is 8 bits long.

I'm reading the "ZFS On-Disk Format" PDF (dated 2006 - are there newer releases?), and on page 15 the blkptr_t structure has 192 bits of padding before the TXG field. Couldn't that be used for a reasonably large ECC code?

Besides, I see that blkptr_t is 128 bytes in size. This leaves some slack space in a physical sector, which could be "abused" at no extra cost - (512-128) or (4096-128) bytes worth of {ECC} data. Perhaps the padding space (near the TXG entry) could specify that the blkptr_t bytes are immediately followed by ECC bytes (and their size, probably dependent on data block length), so that larger on-disk block pointer blocks could be used on legacy systems as well (spanning several contiguous 512-byte sectors). After a successful read from disk, this ECC data could be discarded to save space in the ARC/L2ARC allocation (especially if every byte of memory is ECC-protected anyway).

Even if the ideas/storage above are not practical, perhaps ECC codes could be used for smaller blocks (i.e. {indirect} block pointer contents and metadata might be "guaranteed" to be small enough). If nothing else, this could save mechanical seek time: when a CKSUM error is detected, as is normal for ZFS reads, but the built-in/referring block's ECC code information is enough to repair the block, we don't need to re-request the data from another disk... and we gain some error resiliency beyond ditto blocks (already enforced for metadata) or raidz/mirrors. While it is (barely) possible that all ditto replicas are broken, there's a non-zero chance that at least one is recoverable :)

> > Current ZFS checksums let us detect errors ...
> > Hence the question: why not put ECC info into ZFS blocks?
>
> RAID-Zn *is* an error correction system. But what you are asking for
> is a same-device error correction method that costs less than ditto
> blocks, with the error correction data baked into the blkptr_t. Are
> there enough free bits left in the block pointer for error correction
> codes for large blocks? (128KB blocks today, but eventually ZFS needs
> to support even larger blocks, so keep that in mind.) My guess is:
> no. Error correction data might have to be stored elsewhere.
>
> I don't find this terribly attractive, but maybe I'm just not looking
> at it the right way. Perhaps there is a killer enterprise feature for
> ECC here: stretching MTTDL in the face of a device failure in a mirror
> or raid-z configuration (but if failures are typically of whole drives
> rather than individual blocks, this wouldn't help). Without a good
> answer for where to store the ECC for the largest blocks, though, I
> don't see this happening.

Well, it is often mentioned that (by Murphy's Law if nothing else) device failures in RAID are frequently not single-device failures. Traditional RAID5 sets tended to die while rebuilding onto a spare, upon hitting an error on a surviving, now-unreplicated disk. Per-block ECC could recover from such bit-rot errors on the remaining live disks when RAID-Zn or mirror redundancy can't help, decreasing the chance that tape backup is the only recovery option left...

//Jim Klimov
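P.S. To make the padding/slack idea above concrete, a purely hypothetical sketch - the field names and the split of the pad words are my own illustration, not the real on-disk format; only the overall 128-byte size and field order follow the 2006 spec as I read it:

    #include <stdint.h>

    /* HYPOTHETICAL layout, not actual ZFS: the 192 pad bits inside
     * the 128-byte block pointer describe a trailing ECC area that
     * fills the sector slack after the blkptr itself. */
    typedef struct dva {
        uint64_t dva_word[2];       /* vdev, offset, asize, ...    */
    } dva_t;

    typedef struct blkptr_ecc {
        dva_t    blk_dva[3];        /* up to three ditto copies    */
        uint64_t blk_prop;          /* lsize/psize/comp/cksum ids  */
        uint64_t blk_ecc_algo;      /* was padding: ECC algorithm  */
        uint64_t blk_ecc_len;       /* was padding: ECC byte count */
        uint64_t blk_pad;           /* remaining padding           */
        uint64_t blk_birth;         /* TXG of birth                */
        uint64_t blk_fill;          /* fill count                  */
        uint64_t blk_cksum[4];      /* 256-bit checksum            */
        /* blk_ecc_len bytes of ECC would follow here, in the
         * (512-128) or (4096-128) bytes of sector slack.          */
    } blkptr_ecc_t;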
On Wed, January 11, 2012 11:40, Nico Williams wrote:
> I don't find this terribly attractive, but maybe I'm just not looking
> at it the right way. Perhaps there is a killer enterprise feature for
> ECC here: stretching MTTDL in the face of a device failure in a mirror
> or raid-z configuration (but if failures are typically of whole drives
> rather than individual blocks, this wouldn't help). Without a good
> answer for where to store the ECC for the largest blocks, though, I
> don't see this happening.

Not so much for blocks, but speaking of sectors, there's the T10 (SCSI) Data Integrity Field (DIF):

http://www.usenix.org/event/lsf07/tech/petersen.pdf

DIF is a controller-to-drive specification. For host-to-controller communication, the Data Integrity Extensions (DIX) have been defined:

http://oss.oracle.com/~mkp/docs/ols2008-petersen.pdf

It's a pity that the field is only eight bytes; if it were larger, a useful cryptographic [HCUG]MAC could be stored there by disk encryption software. Perhaps with 4K-sector "Advanced Format" drives a similar, larger field will be defined.
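For reference, the eight bytes in question are three tags appended to each 512-byte sector - a sketch of the Type-1 DIF tuple as I understand the spec:

    #include <stdint.h>

    /* The 8-byte T10 DIF tuple appended to each 512-byte sector. */
    typedef struct t10_dif_tuple {
        uint16_t guard_tag;     /* CRC-16 of the 512 data bytes        */
        uint16_t app_tag;       /* application-defined; opaque to disk */
        uint32_t ref_tag;       /* low 32 bits of the target LBA --    */
                                /* catches misdirected writes          */
    } t10_dif_tuple_t;          /* fields are big-endian on the wire   */

Note that only the guard tag detects corruption of the payload; the reference tag exists to catch a different failure mode (a correct sector written to the wrong LBA), which ZFS also catches for free because the checksum lives in the parent pointer.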
I guess I have another practical rationale for a second checksum, be it ECC or not: my scrubbing pool found some "unrecoverable errors". Luckily, for those files I still have external originals, so I rsynced them over. Still, there is one file whose broken prehistory is referenced in snapshots, and properly fixing that would probably require me to resend the whole stack of snapshots. That's uncool, but a subject for another thread.

This thread is about checksums - namely, what are our options when they mismatch the data? As reported in many blog posts exploring ZDB, there are cases where the checksums are broken (i.e. bit rot in the block pointers - or rather in RAM while the checksum was being calculated, so every ditto copy of the BP carries the error) but the file data is in fact intact (extracted from disk with ZDB or DD and compared to other copies). For these cases bloggers have asked (in vain): why is an admin not allowed to confirm the validity of the end-user data and have the system reconstruct (re-checksum) the metadata for it?.. IMHO, that's a valid RFE.

While the system was scrubbing, I was reading up on theory. I found a nice text, "Keeping Bits Safe: How Hard Can It Be?" by David Rosenthal [1], where I stumbled upon an interesting thought:

    The bits forming the digest are no different from the bits forming
    the data; neither is magically incorruptible. ...Applications need
    to know whether the digest has been changed.

In our case, where the original checksum in the block pointer could have been corrupted in the (non-ECC) RAM of my home NAS just before it was dittoed to disk, another checksum - a copy of the same one, or a differently calculated one - could give ZFS the means to determine whether the data or one of the checksums got corrupted (or all of them). Of course, this is not an absolute protection method, but it could reduce the cases where pools have to be "destroyed, recreated and recovered from tape".

It is my belief that using dedup contributed to my issue: there's a lot more updating of block pointers and their checksums, so it gradually becomes more likely that the metadata (checksum) blocks get broken (e.g. in non-ECC RAM), while the written-once user data remains intact...

--
[1] http://queue.acm.org/detail.cfm?id=1866298
While the text discusses what all ZFSers mostly know already - bit rot, MTTDL and such - it does so in great detail with many examples, and gave me a better understanding of it all even though I have dealt with this for several years now. A good read; I suggest it to others ;)

//Jim Klimov
2012-01-13 2:34, Jim Klimov wrote:
> I guess I have another practical rationale for a second
> checksum, be it ECC or not: my scrubbing pool found some
> "unrecoverable errors". ...
> ...Applications need to know whether the digest has
> been changed.

As Richard reminded me in another thread, both the metadata and the DDT can contain checksums, hopefully of the same data block. So for deduped data we may already have a means to test whether the data or the checksum is incorrect...

Incidentally, the problem also seems more critical for deduped data ;)

Just a thought...
//Jim
On Fri, Jan 13, 2012 at 04:48:44AM +0400, Jim Klimov wrote:
> As Richard reminded me in another thread, both the metadata
> and the DDT can contain checksums, hopefully of the same data
> block. So for deduped data we may already have a means to test
> whether the data or the checksum is incorrect...

It's the same checksum, calculated once - this is why turning dedup=on implies setting checksum=sha256.

> Incidentally, the problem also seems more critical for
> deduped data ;)

Yes. Add this to the list of reasons to use ECC, and add 'have ECC' to the list of constraints on the circumstances where using dedup is appropriate.

--
Dan.
On Jan 12, 2012, at 2:34 PM, Jim Klimov wrote:
> This thread is about checksums - namely, what are our options when
> they mismatch the data? As reported in many blog posts exploring
> ZDB, there are cases where the checksums are broken (i.e. bit rot
> in the block pointers - or rather in RAM while the checksum was
> being calculated, so every ditto copy of the BP carries the error)
> but the file data is in fact intact (extracted from disk with ZDB
> or DD and compared to other copies).

Metadata is at least doubly redundant and checksummed. Can you provide links to posts that describe this failure mode?

> For these cases bloggers have asked (in vain): why is an admin not
> allowed to confirm the validity of the end-user data and have the
> system reconstruct (re-checksum) the metadata for it?.. IMHO,
> that's a valid RFE.

Metadata is COW, too. Rewriting the data also rewrites the metadata.

> While the system was scrubbing, I was reading up on theory. ...
>     The bits forming the digest are no different from the bits
>     forming the data; neither is magically incorruptible.
>     ...Applications need to know whether the digest has been changed.

Hence for ZFS, the checksum (digest) is kept in the parent metadata. The condition described above can affect T10 DIF-style checksums, but not ZFS.

> In our case, where the original checksum in the block pointer could
> have been corrupted in the (non-ECC) RAM of my home NAS just before
> it was dittoed to disk, another checksum ... could give ZFS the
> means to determine whether the data or one of the checksums got
> corrupted (or all of them). ... it could reduce the cases where
> pools have to be "destroyed, recreated and recovered from tape".

Nope.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
SCALE 10x, Los Angeles, Jan 20-22, 2012
On Thu, Jan 12, 2012 at 05:01:48PM -0800, Richard Elling wrote:
> > This thread is about checksums - namely, what are our options
> > when they mismatch the data? As reported in many blog posts
> > exploring ZDB, there are cases where the checksums are broken
> > (i.e. bit rot in the block pointers - or rather in RAM while the
> > checksum was being calculated, so every ditto copy of the BP
> > carries the error) but the file data is in fact intact.
>
> Metadata is at least doubly redundant and checksummed.

The implication is that the original calculation of the checksum went bad in RAM (undetected due to lack of ECC), and was then written out redundantly and fed as bad input to the rest of the Merkle construct. The data blocks on disk are correct, but they fail to verify against the bad metadata.

The complaint appears to be that ZFS makes this 'worse' because the (independently verified) valid data blocks are inaccessible. Worse than what? Corrupted file data that is then accurately checksummed and readable as valid? Accurate data that is read without any assertion of validity, as in a traditional filesystem? There's an inherent value judgement here that will vary by judge, but in each case it's as much a judgement on the value of ECC and reliable hardware, and of your data and your time enacting various kinds of recovery, as it is on the value of ZFS.

The same circumstance could, in principle, happen due to a bad CPU even with ECC. In either case, the value of ZFS includes that an error has been detected that you would otherwise have been unaware of, and you get a clue that you need to fix hardware and spend time.

--
Dan.
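P.S. In pseudo-C, the failure mode looks like this (a sketch of the logic only, not actual ZFS source):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t word[4]; } cksum_t;

    extern cksum_t checksum(const void *buf, size_t len);

    void write_block(const void *data, size_t len, cksum_t ditto_cksum[2])
    {
        cksum_t c = checksum(data, len); /* a bit flips in RAM here... */
        ditto_cksum[0] = c;              /* ...and the bad value goes  */
        ditto_cksum[1] = c;              /* into EVERY ditto copy of   */
    }                                    /* the parent block pointer   */

    bool read_block(const void *data, size_t len,
        const cksum_t ditto_cksum[2])
    {
        cksum_t c = checksum(data, len); /* intact data, correct sum   */
        for (int i = 0; i < 2; i++)
            if (memcmp(&c, &ditto_cksum[i], sizeof (c)) == 0)
                return true;
        return false;                    /* every copy carries the     */
    }                                    /* same bad value: "fatal"    */

On-disk redundancy can't help here because the corruption happened before replication: all copies agree with each other and disagree with the (good) data.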
2012-01-13 5:30, Daniel Carosone wrote:
> The implication is that the original calculation of the checksum went
> bad in RAM (undetected due to lack of ECC), and was then written out
> redundantly and fed as bad input to the rest of the Merkle construct.
> The data blocks on disk are correct, but they fail to verify against
> the bad metadata.

Implication is correct; that was the outlined scenario :)

> The complaint appears to be that ZFS makes this 'worse' because the
> (independently verified) valid data blocks are inaccessible.

Also correct - a frequent "woe", generally raised in discussions about the lack of a ZFS fsck (though many of those discussions tend to descend into flame wars and/or detailed descriptions of how COW and the transaction engine keep {meta}data intact - right up until some fatal bit rot after which recreating the pool is the only "recovery" option).

> Worse than what?

Worse than not having a (relatively easy-to-use) ability to confirm to the system which part to trust - the data or the checksum (which returns us to the subject of automating this with ECC and/or other checksums). My data, my checks on it - my word should be final in case of dispute ;)

> Corrupted file data that is then accurately checksummed and readable
> as valid? Accurate data that is read without any assertion of
> validity, as in a traditional filesystem?

If done by the ZFS automaton itself - without my ability to intervene - then probably not. That would make ZFS no better than the others.

> There's an inherent value judgement here that will vary by judge, but
> in each case it's as much a judgement on the value of ECC and reliable
> hardware, and of your data and your time enacting various kinds of
> recovery, as it is on the value of ZFS.

Perhaps so. I might read through a text file to see whether it is garbage or text. I might parse or display image files and many other formats. I might compare to another copy, if one is available. I just don't have a mechanism to do any of that with ZFS. A view into the data "as it seems to be", without checksum enforcement, would speed up data comparison, eye-reading and other validation methods. People do this with LOST+FOUND and similar directories on other filesystems, but usually after an irreversible recovery attempt, correct or not... Heck, with ZFS I could have a snapshot-like view of my recovery options (accessible to programs like image viewers) without changing the on-disk data until I pick a variant.

Yes, okay, ZFS did inform me of some inconsistency (and even then it is not necessarily the data that is bad) and prompted me to fix the hardware and find other copies of the data. Kudos to the team, really! But then it stops there, without giving me options to recover whatever is on disk (at my own risk).

As a Solaris example, admins are allowed to confirm which side of a broken UFS+SVM mirror to trust, even if there is no quorum of metadb replicas. This trust in the human is common in the industry, and allows accounting for whatever could not be done in software as a one-size-fits-all solution. It is also the user's final choice to kill or save the data - not the programmer's, whatever cryptic intentions he had.

> The same circumstance could, in principle, happen due to a bad CPU
> even with ECC. In either case, the value of ZFS includes that an
> error has been detected that you would otherwise have been unaware
> of, and you get a clue that you need to fix hardware and spend time.

True, whenever that is possible. Hardware will always be faulty to some extent; we can only reduce that extent. Not all implementation options (see laptops and ECC RAM) or budgets can bring it down to "reasonable" levels, though. Software must be the more resilient part, I guess - as long as its error-detection algorithm can execute on that CPU... :)

//Jim
2012-01-13 5:01, Richard Elling wrote:
> Metadata is at least doubly redundant and checksummed.

True, and this helps if it was valid in the first place (in RAM).

>> As reported in many blog posts exploring ZDB, there are cases where
>> the checksums are broken ... but the file data is in fact intact
>
> Can you provide links to posts that describe this failure mode?

I'll try in another message; that would take some googling time ;) The most apparent ones are the ZDB tutorials whose authors poisoned their VDEVs in the sectors holding metadata (all copies), so that the file data was factually intact but inaccessible due to mismatching checksums along the metadata path. Right now I can't think of other posts like that, but nature can produce the same phenomena, and I think it has been discussed online. I've read too much during the past weeks :(

>> For these cases bloggers have asked (in vain): why is an admin not
>> allowed to confirm the validity of the end-user data and have the
>> system reconstruct (re-checksum) the metadata for it?.. IMHO,
>> that's a valid RFE.
>
> Metadata is COW, too. Rewriting the data also rewrites the metadata.

COW does not help much against mis-targeted hardware writes, bit rot, solar storms, etc. that break existing on-disk data. Random bit errors can happen anywhere - RAM buffers and committed disks alike. It is a fact (since the first blog posts about ZDB and ZFS internals by Marcelo Leal, Max Bruning, Ben Rockwood and countless other kind samaritans) that inquisitive users - or those repairing their systems - can determine the DVA and ultimately the LBA addresses of their data, extract the userdata blocks, and (sometimes) confirm that their data is intact and the problem is in the metadata path.

>>     The bits forming the digest are no different from the bits
>>     forming the data; neither is magically incorruptible.
>>     ...Applications need to know whether the digest has been changed.
>
> Hence for ZFS, the checksum (digest) is kept in the parent metadata.

But it can still rot. And for a while the checksum and the data sit in the same RAM, which might lie. Probably the one good effect is that the checksum is stored away from the data, so *likely* a head crash won't scratch both at once ;) Unless they were coalesced to nearby storage... Hm...

So: if the checksum in the metadata has bit-rotted on disk, that metadata block would first fail to match its own parent (it is the parent's checksummed data), triggering a re-read of a ditto copy. But if the checksum got broken in RAM just before the write, so that both ditto blocks carry the bad checksum value - yet match their metadata parents - then the data is considered bad :(

Granted, data is larger, so there is seemingly a higher chance of it catching a one-bit error; but as I wrote, metadata blocks are rewritten more often, so in fact they could suffer errors more frequently. Does your practice or theory prove this statement of mine fundamentally wrong?

>> In our case, where the original checksum in the block pointer could
>> have been corrupted in (non-ECC) RAM ... another checksum - a copy
>> of the same one, or a differently calculated one - could give ZFS
>> the means to determine whether the data or one of the checksums got
>> corrupted (or all of them). ...
>
> Nope.

Maybe so... As I elaborate below, there are indeed scenarios with several checksums of the data where we cannot unambiguously determine the correctness of either.

Say we have a data block D in RAM, which can fail at any time (more probably without ECC, as on consumer devices like laptops or home NASes). We produce two checksums D' and then D" with different algorithms while preparing to write (these checksum values would go into all ditto blocks). During this time a bit flips, or whatever undetected (non-ECC) RAM failure happens, at least once. Variants:

1) Block D got broken before both checksum calculations - we're out of luck: the checksums will probably match, but the data is still wrong.

2) Block D got broken between the checksum calculations - one checksum (always D") matches the data, the other (always D') doesn't.

3) Block D is okay, but one of the checksums broke - one checksum matches the data, the other doesn't. About 50% similarity to case (2).

4) Block D is okay, but both checksums broke - the block is considered broken even though it isn't...

The idea needs rethinking, indeed ;) Perhaps we could checksum or ECC the checksums, or keep a digest of the (primary) checksum and the data? Maybe we could presume that bit flips produce small differences (one to a few bits at random locations, 0xdeadbeef -> 0xdeafbeef), so that with fuzzy logic the data would still "likely match" the checksum? I refuse to believe there is no solution, no hope! ;)

//Jim
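P.S. The disambiguation logic the variants above imply, sketched in C (my own illustration, not ZFS code) - two independent checksums can at best vote, and the one-matches cases stay genuinely ambiguous:

    /* Mapping of the four variants above onto read-time verdicts. */
    typedef enum {
        DATA_GOOD,      /* both match: trust the data -- or the
                           unlucky variant (1) where D broke early  */
        DATA_AMBIGUOUS, /* one matches: either D changed between
                           the two calculations (2) or one checksum
                           rotted (3); cannot tell which            */
        DATA_BAD        /* neither matches: data is bad, or both
                           checksums rotted (4); indistinguishable  */
    } verdict_t;

    verdict_t judge(int primary_matches, int secondary_matches)
    {
        if (primary_matches && secondary_matches)
            return DATA_GOOD;
        if (primary_matches || secondary_matches)
            return DATA_AMBIGUOUS;
        return DATA_BAD;
    }

A third independent checksum would turn the ambiguous cases into a majority vote, at yet more metadata cost - the same trade-off as everywhere else in this thread.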
2012-01-13 5:30, Daniel Carosone wrote:
> Corrupted file data that is then accurately checksummed and readable
> as valid?

Speaking of which, is there currently any simple way to disable checksum validation during data reads (without causing a kernel panic when reading garbage in the guise of metadata)?

Some posts suggested setting checksum=off on the dataset. It doesn't work: reads of files whose blocks mismatch their checksums still return IO errors ;)

Thanks,
//Jim