As a heads up, the Fletcher class of checksum algorithms is reasonably strong because it is specified to perform 1's complement sums (i.e. end-around carries: (a + b) mod (2^n - 1)), so that sums don't overflow into oblivion and thereby lose critical information. Unsigned additions in C are not 1's complement additions. As presently implemented (and arguably not even Fletcher checksums), they are incredibly weak, and not even capable of generating unique checksums across, for example, large blocks of zero-valued data containing single bit errors, which are arguably not uncommon.

If one carefully reviews correct Fletcher checksum implementations, it becomes apparent that they either use add-with-carry instructions when implemented in assembly language, or compute the sum in larger accumulators (so that upper carries are not lost) and then iteratively add the upper portion of the accumulator to the lower portion of the sum until no further carries are generated, thereby effectively performing a 1's complement modular addition.

Please consider fixing the implementation, as it is presently deceptively weak while perceived to be otherwise, riding on the coattails of the relative strength of Fletcher checksums, which the implementation is not.

[As an aside: since I believe that most otherwise-undetected checksum mismatches are most likely caused by single bit errors introduced during the transport of data to or from storage, a stronger checksum implementation with a hamming distance of at least 2 (although more ideally 3) provides the opportunity to potentially correct single bit errors with reasonable probability if all else fails, in lieu of relying on the data hopefully being archived somewhere else, or a catastrophic result otherwise.]

-- This message posted from opensolaris.org
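The end-around-carry behavior described above can be shown in a few lines of C. This is an illustrative sketch of a 1's complement sum (16-bit for brevity, with a hypothetical function name), not the ZFS code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 1's complement (end-around-carry) 16-bit sum: carries out of the top
 * bit are folded back into the low end, so they are never lost.  A wider
 * accumulator collects the carries, and the fold loop performs the
 * end-around addition, i.e. a sum mod (2^16 - 1). */
uint16_t ones_complement_sum16(const uint16_t *data, size_t n)
{
    uint32_t sum = 0;               /* wider accumulator catches the carries */

    for (size_t i = 0; i < n; i++)
        sum += data[i];
    while (sum >> 16)               /* fold carries back in until none remain */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

For example, summing {0xffff, 0x0001} or {0x8000, 0x8000} with a plain truncating 16-bit add yields 0 (the carry is lost), while the fold preserves the carry and yields 1.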
paul
2008-Aug-18 23:41 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed (correction)
CORRECTION: As fletcher4 actually sums 32bit (not 64bit) data using 64bit accumulators, the first two checksum terms (a and b) will not overflow for data block sizes as large as 128KB, and it should therefore be considered at least as strong as a 32/64bit Fletcher checksum (although the remaining terms, c and d, may overflow). It thereby has a worst-case hamming distance of at least 3 for zfs's maximum 128KB block size, which is a good thing. (Sorry for my earlier misdiagnosis.)

fletcher2 (zfs's default), however, remains arguably flawed, but may easily be strengthened by correspondingly restricting it to summing 32bit data using 64bit accumulators, or possibly alternatively summing 64bit data using 128bit accumulators, if that is viewed as reasonably supportable by the target architectures/compilers likely to host zfs.
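The overflow claim above can be sanity-checked from the closed-form worst cases of the four accumulators: with n all-ones 32-bit words and m = 2^32-1, the maxima are a = n*m, b = n(n+1)/2 * m, c = n(n+1)(n+2)/6 * m, d = n(n+1)(n+2)(n+3)/24 * m. A hypothetical helper (not ZFS code; relies on the unsigned __int128 GCC/Clang extension):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the given fletcher4 term ('a'..'d') can exceed a 64-bit
 * accumulator after summing n all-ones 32-bit words, 0 if it cannot.
 * Illustrative sketch using the closed-form coefficient sums. */
int fletcher4_term_can_overflow(uint64_t n, char term)
{
    unsigned __int128 m = 0xffffffffu;   /* worst-case 32-bit datum */
    unsigned __int128 N = n, max = 0;

    switch (term) {
    case 'a': max = N * m; break;
    case 'b': max = N * (N + 1) / 2 * m; break;
    case 'c': max = N * (N + 1) * (N + 2) / 6 * m; break;
    case 'd': max = N * (N + 1) * (N + 2) * (N + 3) / 24 * m; break;
    }
    return max > (unsigned __int128)UINT64_MAX;
}
```

With n = 32768 (a 128KB block of 32-bit words), a and b stay within 64 bits while c and d can exceed them, matching the correction above.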
Jonathan Adams
2008-Aug-22 21:24 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed (correction)
On Mon, Aug 18, 2008 at 04:41:35PM -0700, paul wrote:
> CORRECTION: As fletcher4 actually sums 32bit (not 64bit) data using 64bit accumulators,
> the first two checksum terms (a and b) will not overflow for data block sizes as large as
> 128KB, and therefore should be considered at least as strong as that expected of a 32/64bit
> Fletcher checksum (although the remaining terms (c and d) may overflow); and thereby have
> a worst case hamming distance of at least 3 for zfs's maximum 128KB block size; which is a
> good thing. (sorry for my earlier misdiagnosis)
>
> fletcher2 (zfs's default) however remains arguably flawed, however may be easily strengthened
> by correspondingly being restricted to summing 32bit data using 64bit accumulators, or possibly
> alternatively summing 64bit data using 128bit accumulators if it's viewed reasonably supportable
> by target architectures/compilers likely to host zfs.

Thanks for the report. I've filed:

	6740597 zfs fletcher-2 is losing its carries

to track this.

Cheers,
- jonathan
Mark Butler
2009-Mar-28 03:42 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
As I understand it, there is no way to fix a problem with the algorithm of a defined checksum without invalidating existing zfs filesystems. Any fix to fletcher2 will have to be given a new name. Given how incredibly weak the current fletcher2 is, perhaps the first thing that should be done is to change the default to fletcher4.

The flawed fletcher2 appears to be 32768 times weaker than the 16 bit TCP checksum algorithm, i.e. it appears to have only a 50% chance of catching a single bit error, or any series of single bit errors, in the most significant bit of any 64 bit word in a disk block. For those bits, it is equivalent to a *1 bit* checksum.
Richard Elling
2009-Mar-28 04:02 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
Mark Butler wrote:
> As I understand it, there is no way to fix a problem with the algorithm of a defined checksum without invalidating existing zfs filesystems. Any fix to fletcher2 will have to be given a new name.

Other than the name, fletcher, is there actually a problem here?

> Given how incredibly weak the current fletcher2 is, perhaps the first thing that should be done is to change the default to fletcher4. The flawed fletcher2 appears to be 32768 times weaker than the 16 bit TCP checksum algorithm, i.e. it appears to only have a 50% chance of catching a single bit error or any series of single bit errors in the most significant bit of any 64 bit word in a disk block. For those bits, it is equivalent to a *1 bit* checksum.

The way I see it, the current "fletcher2" is better than a simple xor, but has the same computational cost. Anecdotal evidence suggests that the current fletcher2 catches a large number of faults. This makes some sense for magnetic disk drives, which already have significant single bit correction. If you know of a formal study which shows the distribution of errors in the current population, I'd be very interested in seeing it.

NB, there are already some RFEs for improving the reporting of checksum mismatches, but there may be more work we can do there, too.

-- richard
Mark Butler
2009-Mar-28 05:41 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
The problem is that for 1 out of 64 bits, "fletcher2" is *not* better than a simple XOR: a 50% probability of a false negative for any number of bit errors in those bits. If you have a phantom write to a previously used block that differs only in bits that occupy that bit position, you have only a 50% chance of catching it. That is pathetic.

By way of suggestion, considering that hardware support is on its way in next-generation Intel x86 processors, CRC32-C checksum support (as used in iSCSI and SCTP) would be an excellent addition. There are decent software implementations as well that process 8 bits at a time using an 8KB lookup table.
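For reference, CRC32-C differs from the common CRC-32 only in its polynomial (0x1EDC6F41, reflected form 0x82F63B78). A minimal byte-at-a-time software sketch with a 256-entry (1KB) table, smaller and slower than the 8-bits-per-table slicing variant mentioned above, checked against the standard "123456789" test vector:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC32-C (Castagnoli), reflected polynomial 0x82F63B78, as used by
 * iSCSI and SCTP.  Byte-at-a-time table-driven sketch. */
static uint32_t crc32c_table[256];

static void crc32c_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0x82F63B78u : c >> 1;
        crc32c_table[i] = c;
    }
}

static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xffffffffu;        /* standard initial value */

    while (len--)
        crc = crc32c_table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return crc ^ 0xffffffffu;          /* standard final xor */
}
```

The well-known check value for this polynomial is crc32c("123456789") == 0xE3069283.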
Mark Butler
2009-Mar-29 22:32 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
Here is a better example. Suppose you are checksumming an 8KB block that consists of all one bits. fletcher2 runs two parallel quasi-fletcher checksums on alternating 64 bit words, such that the linear term of each checksum over an 8KB block is the sum of 512 additions. The highest linear term you can get for an 8KB block this way is 512 * (2^64-1), which is somewhat less than 2^73. However, fletcher2 uses a 64bit accumulator, so those extra nine bits are effectively discarded. The equivalent bits are discarded even more quickly from the non-linear term.

Because of that weakness, with fletcher2, errors in any of the *eight* most significant bits of any of the 64bit words in the first half of an 8KB block consisting of all one bits will be completely ignored. In addition, errors in the seven most significant bits of words in the next quarter of the block will be ignored, then errors in the six most significant bits of the next eighth of the block, and so on until you reach the end of the block. All told, errors in just under 8192 bit positions of an 8KB block consisting of all ones will be completely ignored due to carry overflow. With larger blocks, the problem is naturally worse.

I therefore propose an alternative to fletcher2 that I call "fletcher8" that doesn't have fletcher2's weakness in detecting problems near the MSB of each 64 bit word. The only difference here is that the high 32 bits of each 64 bit word are added twice, first in their normal position, and then shifted downward 32 bits. On an Intel Core 2 Q6600 (2.4 Ghz) in 32 bit mode with gcc 4.1.2, "fletcher8" runs 3.3x faster than fletcher4, and at about half the speed of the flawed fletcher2. Of course, if you don't care very much about detecting errors near the MSB of each 64 bit word, the flawed fletcher2 is much preferable.
I ran each checksum 500,000 times on an 8K block of quasi-random data and measured the following:

fletcher2: 4137 MB/s
fletcher4: 612 MB/s
fletcher8 (proposed): 2038 MB/s

---

void
fletcher_8_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
	const uint64_t *ip = buf;
	const uint64_t *ipend = ip + (size / sizeof (uint64_t));
	uint64_t a0, b0, a1, b1;

	for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
		a0 += ip[0] + (ip[0] >> 32);
		a1 += ip[1] + (ip[1] >> 32);
		b0 += a0;
		b1 += a1;
	}

	ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1);
}
Mark Butler
2009-Mar-29 22:54 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
The code didn't come out right (presumably due to angle bracket handling), so I have included it as an attachment here.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: prop_fletcher8.c
Type: application/octet-stream
Size: 371 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-code/attachments/20090329/36e03cc3/attachment.obj>
Jeff Bonwick
2009-Mar-29 23:13 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
On Sun, Mar 29, 2009 at 03:32:20PM -0700, Mark Butler wrote:
> a0 += ip[0] + (ip[0] >> 32);

That has certain weaknesses too. What we really want is a 128-bit add across two 64-bit registers. If you write the code like this:

	value = ip[0];
	a0 += value;
	if (a0 < value)		/* 64-bit overflow implies need to carry */
		a1++;
	b0 += a0;
	if (b0 < a0)
		b1++;

then you get the desired effect. The pair a1:a0 is the 128-bit sum of the 64-bit ip[] values, and the pair b1:b0 is the 128-bit sum of the a1:a0 values. Best of all, the compiler (at least, our compiler) is smart enough to detect the carry-detection construct and turn it into branchless add-with-carry instructions. Very efficient.

We've been meaning to introduce this "fletcher2c" for some time -- it just got lost in the sea of things to do. Thanks for the reminder.

Jeff
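Jeff's construct, assembled into a self-contained sketch (the function name and output array are illustrative stand-ins for the real zio_checksum interface; the final b1 += a1 is an added step, implied by "the pair b1:b0 is the 128-bit sum of the a1:a0 values", that completes the 128-bit addition of A into B):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the "fletcher2c" idea: a1:a0 holds the 128-bit sum of the
 * 64-bit input words, and b1:b0 holds the 128-bit sum of the successive
 * a1:a0 values.  The (x < y) test after an unsigned add detects 64-bit
 * overflow, which compilers can turn into add-with-carry instructions. */
void fletcher2c(const uint64_t *buf, size_t nwords, uint64_t out[4])
{
    uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint64_t value = buf[i];

        a0 += value;
        if (a0 < value)     /* 64-bit overflow implies need to carry */
            a1++;
        b0 += a0;
        if (b0 < a0)        /* carry into the high word of B */
            b1++;
        b1 += a1;           /* high words of the 128-bit B += A */
    }
    out[0] = a0; out[1] = a1;
    out[2] = b0; out[3] = b1;
}
```

For two all-ones words, A = 2*(2^64-1) and B = 3*(2^64-1), with no carries lost.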
Mark Butler
2009-Mar-30 01:37 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
The first thing I tried was 128 bit additions of 64 bit quantities. The problem is that it doesn''t run any faster than fletcher4 on a 32 bit machine, and fletcher4 on 32bit machines is almost 7 times slower than fletcher2. I did some further testing, and fletcher2 does not look as bad as in my last estimate - no single bit error detection failures in my tests. I have confirmed failures where two 64 bit MSBs are flipped. I expect an error detection failure when any even number of 64 bit MSBs in the same vertical stripe are flipped from what they should be. With "fletcher8", I have yet to discover any dual bit failures and I don''t expect any. -- This message posted from opensolaris.org
Mark Butler
2009-Mar-31 07:36 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
I ran tests of every dual bit error case on an 8KB block containing the first 8K of a typical ELF executable with six checksum algorithms. I included two trivial checksums for comparison purposes. The undetected error counts are as follows, out of a total of 214783648 cases:

xor128: 16744448 (~3 x 10^-3 false negative rate)
add64: 15763070 (~3 x 10^-3 false negative rate)
fletcher2: 187606 (~4 x 10^-5 false negative rate)
fletcher4: 0
crc32c: 0
"fletcher8": 0

The comparative single-threaded bandwidth for these six checksums on a recent 32 bit x86 with gcc 4.1.2 w/ -O2 is as follows:

add64: 8096 MB/s
xor128: 5178 MB/s
fletcher2: 4137 MB/s
"fletcher8": 2038 MB/s
crc32c: 1228 MB/s
fletcher4: 612 MB/s

Notes:
1. xor128 is a 128 bit exclusive-or sum.
2. add64 is a 64 bit two's complement sum.
3. The crc32c implementation was a software implementation using an 8KB lookup table - no special instructions.
4. The processor was an Intel Q6600 running at 2.4 Ghz in 32 bit mode.
5. 130560 of those fletcher2 false negatives appear to be data independent error cases that occur due to the flipped parity of two 64 bit MSBs. You get exactly that many on a block consisting of all ones or all zeroes.
6. Despite its flaws, fletcher2 is two orders of magnitude better than weaker checksums of comparable speed.
Jeff Bonwick
2009-Mar-31 08:45 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
Very cool, Mark -- thanks for posting this!

I was discussing this general issue with Jonathan Adams this afternoon. He's currently coding and measuring the performance and effectiveness of a number of fletcher2 variants, which I'll let him describe in exquisite detail.

Something we observed along the way: all orders of fletcher checksums are linear combinations of the data words, so it shouldn't matter what initial data block you use when checking for two-bit error detection. Also, note that the fletcher checksum of an all-zero block is zero. Therefore, you can find two-bit vulnerabilities by setting all possible pairs of bits in an otherwise all-zero block, and checking whether any of the resulting fletcher checksums are zero.

Jeff

On Tue, Mar 31, 2009 at 12:36:48AM -0700, Mark Butler wrote:
> I ran tests of every dual bit error case on an 8KB block containing the first 8K of a typical ELF executable with six checksum algorithms. I included two trivial checksums for comparison purposes. The undetected error counts are as follows, out of a total of 214783648 cases:
>
> xor128: 16744448 (~3 x 10^-3 false negative rate)
> add64: 15763070 (~3 x 10^-3 false negative rate)
> fletcher2: 187606 (~4 x 10^-5 false negative rate)
> fletcher4: 0
> crc32c: 0
> "fletcher8": 0
>
> The comparative single threaded bandwidth for these six checksums on a recent 32 bit x86 and gcc 4.1.2 w/ -O2 is as follows:
>
> add64: 8096 MB/s
> xor128: 5178 MB/s
> fletcher2: 4137 MB/s
> "fletcher8": 2038 MB/s
> crc32c: 1228 MB/s
> fletcher4: 612 MB/s
>
> Notes:
> 1. xor128 is a 128 bit exclusive or sum.
> 2. Add64 is a 64 bit two's complement sum.
> 3. The crc32c implementation was a software implementation using an 8KB lookup table - no special instructions.
> 4. The processor was an Intel Q6600 running at 2.4 Ghz in 32 bit mode.
> 5. 130560 of those fletcher2 false negatives appear to be data independent error cases that occur due to the flipped parity of two 64 bit MSBs. You get exactly that many on a block consisting of all ones or all zeroes.
> 6. Despite its flaws, fletcher2 is two orders of magnitude better than weaker checksums of comparable speed.
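Jeff's zero-block search is easy to sketch: a toy brute force over all bit pairs of an 8-word zero block, using the carry-losing fletcher2 recurrence (function names are illustrative, not ZFS code). By the parity argument in Mark's note 5, the undetected pairs are exactly the MSB pairs within a lane whose remaining-iteration coefficients sum to an even number; with 4 iterations per lane that is 2 pairs per lane, 4 in total.

```c
#include <assert.h>
#include <stdint.h>

#define NWORDS 8  /* toy block: 8 x 64-bit words */

/* The carry-losing fletcher2 recurrence over two interleaved lanes,
 * mirroring the current implementation (a sketch, not the ZFS source). */
static void fletcher_2(const uint64_t *buf, uint64_t cks[4])
{
    uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;

    for (int i = 0; i < NWORDS; i += 2) {
        a0 += buf[i];
        a1 += buf[i + 1];
        b0 += a0;
        b1 += a1;
    }
    cks[0] = a0; cks[1] = a1; cks[2] = b0; cks[3] = b1;
}

/* Set every pair of bits in an otherwise all-zero block; an all-zero
 * checksum means the corruption is indistinguishable from clean zeros. */
int count_two_bit_collisions(void)
{
    int nbits = NWORDS * 64, count = 0;

    for (int i = 0; i < nbits; i++) {
        for (int j = i + 1; j < nbits; j++) {
            uint64_t buf[NWORDS] = { 0 }, cks[4];

            buf[i / 64] ^= 1ULL << (i % 64);
            buf[j / 64] ^= 1ULL << (j % 64);
            fletcher_2(buf, cks);
            if ((cks[0] | cks[1] | cks[2] | cks[3]) == 0)
                count++;
        }
    }
    return count;
}
```

For this 8-word block the undetected pairs are the MSBs of words {0,4}, {2,6}, {1,5}, and {3,7}; the same per-lane counting gives the 130560 figure Mark reports for a full 8KB block.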
paul
2009-Mar-31 16:07 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed

Nice.

As a minor thought with respect to naming conventions: as most aren't likely to bother attempting to understand the relative merits of the available checksum implementations, I can't help but wonder if the naming should be ordered relative to their performance/goodness. For example:

fletcher1 (future, 64b add-w/c or 128b sums, 64b data, fast/robust)*
fletcher2 (present, using 64b adds, 64b data, fast/less-robust)
fletcher3 (proposed, using 64b adds, 32b data, slower/robust)
fletcher4 (present, using 64b adds, 32b data, very-slow/robust)

*where a future fletcher1 may possibly use either 128b sums or a 64b add-with-carry implementation (coded in assembly, or leveraging compiler intrinsics if/when available), and thereby be as fast as the current fletcher2 and as robust as the proposed fletcher3 (fletcher8).

(Lastly, it seems reasonable to allow existing checksums to be checked/converted to another if later deemed more desirable.)
Jonathan Adams
2009-Apr-13 23:18 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
On Tue, Mar 31, 2009 at 09:07:31AM -0700, paul wrote:
> Nice.
>
> As a minor thought with respect to naming conventions, as most
> aren't likely to bother attempting to understand the relative merits
> of available checksum implementations, I can't help but wonder
> if the naming conventions should be ordered relative to their
> performance/goodness.
>
> For example:
>
> fletcher1 (future, 64b add-w/c or 128b sums, 64b data, fast/robust)*
> fletcher2 (present, using 64b adds, 64b data, fast/less-robust)
> fletcher3 (proposed, using 64b adds, 32b data, slower/robust)
> fletcher4 (present, using 64b adds, 32b data, very-slow/robust)
>
> *where a future fletcher1 may possibly use either 128b sums or 64b
> add-with-carry implementation (coded in assembly or leverage compiler
> intrinsics if/when available), and thereby be as fast as the current
> fletcher2 and as robust as the proposed fletcher3 (fletcher8).

I've been investigating this. My main performance results are that the checksum computation is overwhelmingly dominated by the data motion; from a performance standpoint, fletcher2 and fletcher4 are completely equivalent, despite the fact that on cached data, fletcher4 is ~4x slower than fletcher2. (checksum=none is faster, since it doesn't touch the data at all, and checksum=sha256 is slower, since it does significant amounts of computation per word.)

fletcher2 has two issues:

1. Since it is not pulling around its carries, it is vulnerable to 2-bit errors in the higher bits. It is (as is any 2nd-order fletcher) vulnerable to three-bit errors in a variety of spacings.

2. Since there is no performance boost associated with using fletcher2, there is no motivation to continue using it, given its various vulnerabilities.
Fletcher-4 is still somewhat vulnerable; here's part of a block-quoted comment I've written:

 * Fletcher-4 is defined by the recurrence relation:
 *	a_i = a_(i-1) + f_(i-1)
 *	b_i = b_(i-1) + a_i
 *	c_i = c_(i-1) + b_i
 *	d_i = d_(i-1) + c_i
 *
 * Where:
 *	a_0 = b_0 = c_0 = d_0 = 0
 * and
 *	f_0 .. f_(n-1) are the input data.
 *
 * Using standard series transformations, these become the following sums:
 *
 *	     __n_                        __n_
 *	     \   |                       \   |
 *	a  =  >     f             b  =    >     i * f
 *	 n   /___|   n - i         n     /___|       n - i
 *	     i = 1                       i = 1
 *
 *	     __n_                        __n_
 *	     \   | i*(i+1)               \   | i*(i+1)*(i+2)
 *	c  =  >    ------- f      d  =    >    ------------- f
 *	 n   /___|    2     n - i  n     /___|       6         n - i
 *	     i = 1                       i = 1
 *
 * Now, f_i are 32-bit values, and [abcd]_n are 64-bit accumulators. In
 * order to not lose data, we need to make sure that we don't overflow.
 * The worst-case for this is if f_i = 0xffffffff for all i. We can
 * conservatively estimate how big n can get before we overflow by running:
 *
 * % bc
 * a=2^32-1; for (i=1; a < 2^64; i++) { a += i*(i+1)*(i+2)/6*(2^32-1) }; i-1
 * 566
 * quit

566 * 4 bytes/datum is slightly above 2K of data, so a 128K block is ~64 times longer than the earliest truncation. There are two ways of dealing with this:

1. Leave fletcher4 alone. Our largest buffer is 128K; that's 32k * sizeof (uint32_t). That means that n = 32768, and we want to look at the factorization of:

	D_i = (i*(i+1)*(i+2))/6		i = 1..n

Writing it out as:

	D_i = (2^(E_i)) * (F_i)

where F_i is odd, we can ask about the distribution of E_i. Some scripting later, we get, for the Ds:

COUNT	2-FACTOR
8192	2^0
4096	2^1
10240	2^2
5120	2^3
2560	2^4
1280	2^5
640	2^6
320	2^7
160	2^8
80	2^9
40	2^10
20	2^11
10	2^12
5	2^13
3	2^14
2	2^15

And the Cs:

16384	2^0
8192	2^1
4096	2^2
2048	2^3
1024	2^4
512	2^5
256	2^6
128	2^7
64	2^8
32	2^9
16	2^10
8	2^11
4	2^12
2	2^13
1	2^14

This means that even in the worst case, f_i will be added at 2^15 * (32-bit quantity), which fits in a 64-bit accumulator. This does mean that early-on words in the buffer have *less* of an effect on the result.
Of course, without more math, I don't have a compelling case for this being effective.

2. Invent a fletcher4c, which does something about the overflow. My suggestion would be to add a cleanup pass, run every 512 iterations and at the end, which does:

	c_a += (a >> 32);
	a = (a & 0xffffffff) + (a >> 32);
	if ((a >> 32) != 0) {
		a = (a & 0xffffffff) + 1;
		c_a++;
	}

(and similarly for b, c, and d), then at the end, do:

	while ((c_a >> 32) != 0)
		c_a = (c_a & 0xffffffff) + (c_a >> 32);
	a |= c_a << 32;

This computes a, b, c, and d (mod 2^32 - 1), then puts the carry count in the top bits. The advantage of putting the carry count there is that it counteracts a weakness of computing (mod 2^32 - 1), which is that 0 == 2^32 - 1 (mod 2^32 - 1). But 0 won't add a carry, and 2^32 - 1 will, so the final counts will end up different.

The main complication in introducing fletcher4c is that there are several places which hardcode the use of fletcher4 (the send/receive code in particular), and so versioning could be an issue.

Thoughts?

Cheers,
- jonathan
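The cleanup pass above can be factored into a small helper for one accumulator (a sketch; the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Fold the top 32 bits of a 64-bit accumulator back into the bottom,
 * reducing it mod (2^32 - 1), while counting the folded carries in
 * *c_a, as in the proposed fletcher4c cleanup pass. */
uint64_t fletcher4c_fold(uint64_t a, uint64_t *c_a)
{
    *c_a += a >> 32;                      /* record the carries */
    a = (a & 0xffffffffULL) + (a >> 32);  /* end-around fold */
    if ((a >> 32) != 0) {                 /* the fold itself may carry once */
        a = (a & 0xffffffffULL) + 1;
        (*c_a)++;
    }
    return a;
}
```

Note how 2^64 - 1 folds to 0xffffffff with carries recorded, while 0 folds to 0 with none; the two residues that coincide mod 2^32 - 1 stay distinguishable through the carry count, which is the property Jonathan points out.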
paul
2009-Apr-14 01:34 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed

Nice analysis. I understand you've found, on the processor/memory-subsystem architecture you've experimented with, that if the data being summed isn't already cached, its access will likely dominate its use in calculations:

- Although this makes sense (presuming no mechanism is used to pre-cache data that is only subsequently check-summed once cache-resident), processor/memory system architectures are still evolving, so it may be prudent to account for that possibility; for example, future streaming vector/memory units might sum multiple data streams in parallel, and/or checksumming might be integrated into the I/O channel such that while data is being retrieved its checksum may be computed without processor intervention.

- With respect to fletcher4: as the first two running sums (representing a traditional Fletcher checksum implementation) are sufficient to yield a hamming distance of at least 3 (meaning all 1 and 2 bit errors are detectable, and 1 bit errors thereby also potentially correctable if ever desired) for block sizes somewhat larger than 256KB, larger than ZFS requires, I can't help but wonder if this may be sufficient, rather than worrying about maintaining two more running sums, each dependent on the previous, without potential overflow. (There are also potential implementation issues which may impede performance in the future: maintaining 4 sums, with the latter 3 each dependent on its predecessor's sum, means scheduling three such dependencies rather than just 1 if only 2 sums are maintained, presuming support for potentially more efficient data access on future platform implementations.)
(As a caveat, it's not clear to me how much more resilient the checksum will be by maintaining 2 more sums than traditionally used; when I previously observed that fletcher4 was fine, that was based on its first two terms not overflowing for the data sizes required, without regard to the possibility that the upper two terms may overflow, as I didn't consider them significant, or rather didn't rely on their significance.)

- In conclusion: although, without pre-fetching, streaming, and/or channel integration, a fletcher2 corrected to use 32b data with 64b sums may be no faster than the current implementation of fletcher4 (which may not be significantly more resilient than a corrected fletcher2, but which may be refined to guarantee that the upper two terms also do not overflow, improving its resilience to some degree); I personally suspect it's best to BOTH refine fletcher4 to guarantee the upper two sums do not overflow, by wrapping the upper bits of the upper two sums every N (pick your N) iterations; AND fix fletcher2, because when fixed it has the potential to be significantly faster than the fixed fletcher4 on future or other platforms leveraging mechanisms to pre-fetch data and/or synchronize computations with access. (Both being relatively easy fixes, so why not?)

Again, merely IMHO.
Jonathan Adams
2009-Apr-14 19:41 UTC
[zfs-code] fletcher2/4 implementations fundamentally flawed
On Mon, Apr 13, 2009 at 06:34:36PM -0700, paul wrote:
> Nice analysis. I understand you've found on the
> processor/memory-subsystem architecture you've experimented with, that
> if the data being summed isn't already cached, its access will likely
> dominate its use in calculations:
>
> - Although this makes sense (presuming no mechanism may be used to
> pre-cache data to only then be subsequently check-summed when cache
> resident in the future), as processor/memory system architectures are
> still evolving, it may be prudent to account for that possibility; for
> example possible future use of streaming vector/memory units, where
> multiple data streams may be potentially summed in parallel and/or
> possibly integrated into the I/O channel such that while data is being
> retrieved its checksum may be computed without processor intervention
> may be desirable.

True; my experience with the performance was that any fix to fletcher2 to make it stronger slowed things down by ~2-2.5x *when the data was in-cache*, which is most of the way to fletcher4 (~3-3.5x slower). And that was with hand-optimized assembly, because the compiler's inline assembly has bugs.

(That said, the bugs are being fixed; even with them fixed, the algorithms will be slower by a significant amount.)

> - With respect to Fletcher4, as the first two running sums
> (representing a traditional Fletcher checksum implementation) are
> sufficient to yield a hamming distance of at least 3 (meaning all
> 1 and 2 bit errors are detectable, and thereby 1 bit errors also
> potentially correctable if ever desired) for block sizes somewhat
> larger than 256KB, being larger than required by ZFS; I can't help but
> wonder if this may be sufficient rather than worrying about trying to
> maintain two more running sums each dependent on the previous without
> potential overflow.

I've verified by exhaustive search that fletcher4 has a hamming distance of at least 6 for 128-byte buffers, and my intuition is that it is much higher. The i*(i+1)/2 and i*(i+1)*(i+2)/6 terms in the third and fourth accumulators make it much harder for cancellations to line up correctly, compared to the first two.

As an aside, the way to model undetectable errors is to split the errors into bit clears and bit sets, and make two mostly-zero data buffers: f_0..(n-1), with the "bit clears" bits set, and g_0..(n-1), with the "bit sets" bits set. Since the fletcher checksums are linear in the data being checksummed, if the two checksums are equal, then the error will be undetected. You can further refine this by noting:

1. Since all words before the first error word are zero, they will have a checksum of zero and no effect on the result.

2. If the error is undetected, the two checksums will be equal after the final error word is processed, and will remain equal through the end of the buffer. If the error is detected, the checksums will not be equal after the final error word is processed, and will remain unequal throughout the remaining processing.

Since we only care whether or not the checksums will be equal, we can shorten the buffers to just big enough to cover the errors without any loss of generality.

For fletcher2, it's fairly easy to show that at least three different error words are required, along with the precise position and value constraints. For fletcher4, you need at least five different error words, and the math quickly overwhelms my ability to pen-and-paper it. Without any access to Maple/Mathematica/etc., I'm not going to make much more progress.

> (As a caveat, it's not clear to me how much more resilient the
> checksum will be by maintaining 2 more sums than traditionally used,
> and thereby when I previously observed that fletcher4 was fine, it
> was based on its first two terms not overflowing for data sizes
> required, without regard to the possibility that the upper two terms
> may overflow, as I didn't consider them significant, or rather didn't
> rely on their significance.)

Understood.

> - However in conclusion, although without pre-fetching, streaming,
> and/or channel integration, fletcher2 (corrected to use 32b data
> with 64b sums) may be no faster than the current implementation of
> fletcher4; I personally suspect it's best to BOTH refine fletcher4
> to warrant the upper two sums do not overflow by wrapping the upper
> bits of the upper two sums every N (pick your N) iterations; AND fix
> fletcher2, because when fixed it has the potential to be significantly
> faster than the fixed fletcher4 on future or other platforms. (Both
> being relatively easy fixes, so why not?)

It's mainly a question of defaults and not giving dangerous options. If fletcher4 is around the same speed as our fixed fletcher2, and provides better fault detection (and possibly correction), why offer them both?

Perhaps what we should do is:

	rename fletcher2 and fletcher4 to fletcher2-old and fletcher4-old.
	introduce new "fletcher2" and "fletcher4" algorithms, fixed as above.
	repoint "checksum=on" to "fletcher4".

So old data will continue to be checked using the old algorithms, new data will use the new fletcher2 and 4, and we'll default to fletcher4.

Thoughts?

- jonathan