thr3ads.net - zfs code - [zfs-code] correcting single-bit errors in fletcher4 checksums [Apr 2009]

Jonathan Adams

2009-Apr-16 18:41 UTC

[zfs-code] correcting single-bit errors in fletcher4 checksums

Hi all,

Just a quick note; with a fletcher-4 checksum (the current version), the
following algorithm determines the position of any single-bit error.

/*
 * base: checksum of the good data; bad: checksum of the damaged data.
 * a..d stand for the four 64-bit fletcher-4 accumulators.
 */
bool
has_1bit_err(zio_cksum_t *base, zio_cksum_t *bad, size_t bufsize, int bswap)
{
	uint64_t a, b, c, d;
	int neg;

	size_t nwords = bufsize / sizeof (uint32_t);
	uint64_t word;
	uint32_t pattern;

	/* subtract in whichever direction keeps the difference in a positive */
	if (base->a < bad->a) {
		neg = 0;
		a = bad->a - base->a;
		b = bad->b - base->b;
		c = bad->c - base->c;
		d = bad->d - base->d;
	} else {
		neg = 1;
		a = base->a - bad->a;
		b = base->b - bad->b;
		c = base->c - bad->c;
		d = base->d - bad->d;
	}

	if (a != (uint64_t)(uint32_t)a ||
	    a == 0 || (a & (a - 1)) != 0)
		return (0);	/* high bits set, or not a power-of-2 */

	if ((b & (a - 1)) != 0)
		return (0);	/* b not a multiple of a */

	/* distance of the damaged word from the end of the buffer */
	word = b / a;
	if (word == 0 || word > nwords)
		return (0);	/* b out of range */

	if (c != (word * (word + 1)) / 2 * a ||
	    d != ((word * (word + 1) / 2) * (word + 2) / 3) * a)
		return (0);	/* c and d don't match up */

	pattern = bswap ? BSWAP_32((uint32_t)a) : (uint32_t)a;

	printf("error is %c%x in word %zu\n",
	    neg ? '-' : '+', (unsigned int)pattern,
	    (size_t)(nwords - word));
	return (1);
}

Handling multi-bit errors would be more complicated, of course, and this
would need a slight update to be used with (mod 2^32 - 1) calculations.
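
The checks in has_1bit_err() follow from how fletcher-4 weights each input
word. As a sketch, assuming n native-endian 32-bit words w_0 .. w_{n-1}, no
byteswap, and k = n - i for a damaged word at index i (the symbols are
introduced here for illustration and are not from the code above):

```latex
\begin{aligned}
a &= \sum_{i=0}^{n-1} w_i, &
b &= \sum_{i=0}^{n-1} (n-i)\, w_i, \\
c &= \sum_{i=0}^{n-1} \frac{(n-i)(n-i+1)}{2}\, w_i, &
d &= \sum_{i=0}^{n-1} \frac{(n-i)(n-i+1)(n-i+2)}{6}\, w_i
\pmod{2^{64}}
\end{aligned}

% A single change of \delta = \pm 2^p in word i, with k = n - i, gives
\Delta a = \delta, \qquad
\Delta b = k\,\delta, \qquad
\Delta c = \frac{k(k+1)}{2}\,\delta, \qquad
\Delta d = \frac{k(k+1)(k+2)}{6}\,\delta
```

So b / a recovers k (the variable `word`), the c and d comparisons confirm
the same k, and nwords - word is the index i of the damaged word.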

In any case, this is effectively zero work; it wouldn't be hard to correct
single-bit errors when we read in a bad block, and self-heal the source.
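
A minimal sketch of what such a repair might look like, assuming
native-endian data (no byteswap) and a stand-in checksum struct. The names
cksum_t, fletcher4() and repair_1bit() are hypothetical and are not ZFS
interfaces; the repair undoes the arithmetic difference and then re-verifies
the whole buffer before trusting the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for zio_cksum_t (assumed: four 64-bit accumulators). */
typedef struct cksum { uint64_t a, b, c, d; } cksum_t;

/* Plain fletcher-4 over native-endian 32-bit words. */
static void
fletcher4(const uint32_t *buf, size_t nwords, cksum_t *ck)
{
	uint64_t a = 0, b = 0, c = 0, d = 0;

	assert(buf != NULL && ck != NULL);
	for (size_t i = 0; i < nwords; i++) {
		a += buf[i];
		b += a;
		c += b;
		d += c;
	}
	ck->a = a; ck->b = b; ck->c = c; ck->d = d;
}

/*
 * Try to repair buf in place so that it matches the expected checksum.
 * Returns 1 if one power-of-two difference was undone and the buffer
 * now verifies; returns 0 otherwise, leaving buf as it was.
 */
static int
repair_1bit(uint32_t *buf, size_t nwords, const cksum_t *want)
{
	cksum_t got, chk;
	uint64_t a, b, c, d, k, tri;
	uint32_t *w, old;
	int added;

	fletcher4(buf, nwords, &got);
	added = (want->a < got.a);	/* damaged data sums higher */
	if (added) {
		a = got.a - want->a; b = got.b - want->b;
		c = got.c - want->c; d = got.d - want->d;
	} else {
		a = want->a - got.a; b = want->b - got.b;
		c = want->c - got.c; d = want->d - got.d;
	}
	if (a == 0 || a > UINT32_MAX || (a & (a - 1)) != 0)
		return (0);		/* not a 32-bit power of two */
	if ((b & (a - 1)) != 0)
		return (0);		/* b not a multiple of a */
	k = b / a;			/* distance from end of buffer */
	if (k == 0 || k > nwords)
		return (0);
	tri = k * (k + 1) / 2;
	if (c != tri * a || d != tri * (k + 2) / 3 * a)
		return (0);		/* c and d disagree with k */

	w = &buf[nwords - k];
	old = *w;
	*w = added ? old - (uint32_t)a : old + (uint32_t)a;
	fletcher4(buf, nwords, &chk);
	if (chk.a == want->a && chk.b == want->b &&
	    chk.c == want->c && chk.d == want->d)
		return (1);
	*w = old;			/* repair did not verify; undo */
	return (0);
}
```

A caller would compute or look up the expected checksum, call
repair_1bit(buf, nwords, &want), and only write the buffer back when it
returns 1.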

Cheers,
- jonathan

paul

2009-May-03 19:53 UTC

Very nice; and ultimately will be very interesting to see what percentage of
checksum errors within a particular deployment turn out to most likely be
correctable single bit errors. (And thereby possibly even measurably help
improve the integrity of non-redundant array configurations, short of the
catastrophic failure of a sector or drive itself.)

After reviewing the code (and presuming you intended "if (base->a
less-than bad->a)"), I can't quite seem to convince myself the
implementation is immune from misdiagnosing a double/triple bit error as a
single bit error in general (although likely staring me in the face; as all
correct, single, and double bit error checksums are warranted to be unique; as
should also be all 4 and 5 bit error checksums for a corrected fletcher4
implementation to my understanding)?
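
One case that may be worth checking while working through this: the test
compares arithmetic differences, so damage confined to one word whose net
change is plus or minus a power of two passes it regardless of how many bits
flipped (for example 0111 -> 1000). Two equal flips in different words, by
contrast, are caught by the c comparison. The sketch below illustrates both,
assuming native-endian data; cksum_t, fletcher4(), looks_like_1bit() and
bits_changed() are hypothetical names, not ZFS interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for zio_cksum_t (assumed: four 64-bit accumulators). */
typedef struct cksum { uint64_t a, b, c, d; } cksum_t;

/* Plain fletcher-4 over native-endian 32-bit words. */
static void
fletcher4(const uint32_t *buf, size_t nwords, cksum_t *ck)
{
	uint64_t a = 0, b = 0, c = 0, d = 0;

	assert(buf != NULL && ck != NULL);
	for (size_t i = 0; i < nwords; i++) {
		a += buf[i];
		b += a;
		c += b;
		d += c;
	}
	ck->a = a; ck->b = b; ck->c = c; ck->d = d;
}

/* Same acceptance test as has_1bit_err(), without the printf. */
static int
looks_like_1bit(const cksum_t *base, const cksum_t *bad, size_t nwords)
{
	uint64_t a, b, c, d, k, tri;

	if (base->a < bad->a) {
		a = bad->a - base->a; b = bad->b - base->b;
		c = bad->c - base->c; d = bad->d - base->d;
	} else {
		a = base->a - bad->a; b = base->b - bad->b;
		c = base->c - bad->c; d = base->d - bad->d;
	}
	if (a == 0 || a > UINT32_MAX || (a & (a - 1)) != 0)
		return (0);
	if ((b & (a - 1)) != 0)
		return (0);
	k = b / a;
	if (k == 0 || k > nwords)
		return (0);
	tri = k * (k + 1) / 2;
	return (c == tri * a && d == tri * (k + 2) / 3 * a);
}

/* Number of bits that really differ between two buffers. */
static unsigned
bits_changed(const uint32_t *x, const uint32_t *y, size_t nwords)
{
	unsigned n = 0;

	for (size_t i = 0; i < nwords; i++)
		for (uint32_t v = x[i] ^ y[i]; v != 0; v &= v - 1)
			n++;
	return (n);
}
```

With good = {7, 1, 9, 10} and the second word changed from 1 to 2,
bits_changed() reports two flipped bits while looks_like_1bit() still
accepts. That suggests a repair should add or subtract the reported pattern
rather than XOR a single bit, though whether this matters in practice is
not established here.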
-- 
This message posted from opensolaris.org

Jonathan Adams

2009-May-04 17:03 UTC

On Sun, May 03, 2009 at 12:53:31PM -0700, paul wrote:
> Very nice; and ultimately will be very interesting to see what
> percentage of checksum errors within a particular deployment turn out
> to most likely be correctable single bit errors. (And thereby possibly
> even measurably help improve the integrity of non-redundant array
> configurations, short of the catastrophic failure of a sector or drive
> itself.)

The other thing I'm working on is getting better FMA ereports for checksum
errors; one thing that's currently missing in the case of a mirrored
or raid-z configuration is the information on the difference between the
correct content and the bad content.

That way, we'll have a better idea of what's actually happening, and the
FMA responses may also get better.

> After reviewing the code (and presuming you intended "if (base->a
> less-than bad->a)"), I can't quite seem to convince myself the
> implementation is immune from misdiagnosing a double/triple bit
> error as a single bit error in general (although likely staring me
> in the face; as all correct, single, and double bit error checksums
> are warranted to be unique; as should also be all 4 and 5 bit
> error checksums for a corrected fletcher4 implementation to my
> understanding)?
Let me work on the math some and get back to you.

Cheers,
- jonathan

Daniel Carosone

2009-May-05 01:28 UTC

> Very nice; and ultimately will be very interesting to
> see what percentage of checksum errors within a
> particular deployment turn out to most likely be
> correctable single bit errors. 
Very promising and very nice, indeed.

One thing needs to be established carefully in the course of this analysis, and
that's resiliency with respect to where the errors are introduced (regardless
of how).  Do these properties hold up when the error is introduced in the
checksum block rather than in the data block(s)?
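
For what it's worth, has_1bit_err() as posted returns 0 when the only damage
is one flipped bit in the stored checksum: a flip in a leaves b unchanged, so
word comes out as 0, and a flip in b, c or d leaves the difference in a at 0.
A separate test for that case could look like the sketch below. It is only
one plausible check; cksum_t, popcount64() and cksum_differs_by_1bit() are
hypothetical names, and nothing here describes what ZFS actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for zio_cksum_t (assumed: four 64-bit accumulators). */
typedef struct cksum { uint64_t a, b, c, d; } cksum_t;

/* Portable count of set bits in a 64-bit value. */
static unsigned
popcount64(uint64_t x)
{
	unsigned n = 0;

	for (; x != 0; x &= x - 1)
		n++;
	return (n);
}

/*
 * Returns 1 when the stored and recomputed checksums differ in exactly
 * one of their 256 bits, which hints that the checksum (not the data)
 * took the hit; returns 0 otherwise.
 */
static int
cksum_differs_by_1bit(const cksum_t *stored, const cksum_t *computed)
{
	unsigned n;

	assert(stored != NULL && computed != NULL);
	n = popcount64(stored->a ^ computed->a) +
	    popcount64(stored->b ^ computed->b) +
	    popcount64(stored->c ^ computed->c) +
	    popcount64(stored->d ^ computed->d);
	return (n == 1);
}
```

A single-bit error in the data changes all four accumulators at once, so it
would not normally be mistaken for this case.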

This brings up another question, too. ZFS uses ditto blocks for metadata, which
covers checksums, and these are in turn covered by enclosing checksums in parent
blocks, so in theory all checksum data should be verified before being used to
verify user data.

If a data block fails to verify, does ZFS consider the possibility that the
damage may be in the checksum data at all (perhaps corrupted in bad memory since
being read)?  Does it attempt to re-read or reverify checksums at the same time
as looking for alternate copies of user data when trying to correct a checksum
failure?


> (And thereby possibly even measurably help improve the integrity of
> non-redundant array configurations, short of the
> catastrophic failure of a sector or drive itself.)
> 
> After reviewing the code (and presuming you intended
> "if (base->a less-than bad->a)"), I can't quite seem
> to convince myself the implementation is immune from
> misdiagnosing a double/triple bit error as a single
> bit error in general (although likely staring me in
> the face; as all correct, single, and double bit
> error checksums are warranted to be unique; as should
> also be all 4 and 5 bit error checksums for a
> corrected fletcher4 implementation to my
> understanding)?
-- 
This message posted from opensolaris.org
