Hello all,

I was describing how raidzN works recently, and got myself wondering:
does zpool scrub verify all the parity sectors and the mirror halves?
That is, IIRC, the scrub should try to read all allocated blocks, and
if they read back OK - fine; if not - fix them in place with redundant
data or copies, if available.

On the other hand, if the first tested mirror half/block copy/raidzN
permutation has yielded no errors, are other variants still checked?

The descriptions I've seen so far put an emphasis on verifying all
copies of the blocks and probably all the mirror halves - perhaps
because these examples make it easier to describe the concept.

I don't think I saw a statement that for raidzN blocks all of the
combinations are verified to work (plain userdata sectors, and parity
permutations with these sectors). Can someone in the know say "yes"?
;)

Thanks,
//Jim
I can only speak anecdotally, but I believe it does.

Watching zpool iostat it does read all data on both disks in a mirrored
pair.

Logically, it would not make sense not to verify all redundant data.
The point of a scrub is to ensure all data is correct.

On 2012-10-25 10:25, Jim Klimov wrote:
> Hello all,
>
> I was describing how raidzN works recently, and got myself wondering:
> does zpool scrub verify all the parity sectors and the mirror halves?
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-25 11:30 UTC
[zfs-discuss] Scrub and checksum permutations
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Karl Wagner
>
> I can only speak anecdotally, but I believe it does.
>
> Watching zpool iostat it does read all data on both disks in a mirrored
> pair.
>
> Logically, it would not make sense not to verify all redundant data.
> The point of a scrub is to ensure all data is correct.

Same for me.

Think about it: When you write some block, it computes parity bits, and
writes them to the redundant parity disks. When you later scrub the same
data, it wouldn't make sense to do anything other than repeating this
process, to verify all the disks including parity.
2012-10-25 15:30, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Karl Wagner
>>
>> I can only speak anecdotally, but I believe it does.
>>
>> Watching zpool iostat it does read all data on both disks in a mirrored
>> pair.
>>
>> Logically, it would not make sense not to verify all redundant data.
>> The point of a scrub is to ensure all data is correct.
>
> Same for me.
>
> Think about it: When you write some block, it computes parity bits, and
> writes them to the redundant parity disks. When you later scrub the same
> data, it wouldn't make sense to do anything other than repeating this
> process, to verify all the disks including parity.

Logically, yes - I agree this is what we expect to be done.
However, at least with the normal ZFS reading pipeline, reads of
redundant copies and parities only kick in if the first read variant
of the block had errors (HW IO errors, checksum mismatch).

In the case of raidzN, normal reads should first try to use just the
plain userdata sectors (with IO speeds like striping), and if there
are errors - retry with permutations based on parity sectors and
different combinations of userdata sectors, until it makes a block
whose checksum matches the expected value - or fails by running out
of combinations.

If scrubbing works the way we "logically" expect it to, it should
enforce validation of such combinations for each read of each copy
of a block, in order to ensure that parity sectors are intact and can
be used for data recovery if a plain sector fails. Likely, raidzN
scrubs should show as compute-intensive tasks compared to similar
mirror scrubs.

I thought about it, wondered and posted the question, and went on to
my other work. I did not (yet) research the code to find out
first-hand, partly because gurus might know the answer and reply
faster than I can dig into it ;)

Thanks,
//Jim Klimov
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-25 14:23 UTC
[zfs-discuss] Scrub and checksum permutations
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Logically, yes - I agree this is what we expect to be done.
> However, at least with the normal ZFS reading pipeline, reads
> of redundant copies and parities only kick in if the first
> read variant of the block had errors (HW IO errors, checksum
> mismatch).

I haven't read or written the code myself personally, so I'm not
authoritative. But I certainly know I've heard it said on this list
before that when you read a mirror, it only reads one side (as you
said) unless there's an error; this allows a mirror to read 2x faster
than a single disk (which I confirm by benchmarking). However, a scrub
reads both sides - all redundant copies of the data.

I'm personally comfortably confident assuming this is true also for
reading the redundant copies of raidzN data.
On Thu, Oct 25, 2012 at 7:35 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> If scrubbing works the way we "logically" expect it to, it
> should enforce validation of such combinations for each read
> of each copy of a block, in order to ensure that parity sectors
> are intact and can be used for data recovery if a plain sector
> fails. Likely, raidzN scrubs should show as compute-intensive
> tasks compared to similar mirror scrubs.

It should only be as compute intensive as writes - it can read the
userdata and parity sectors, ensure the userdata checksum matches
(reconstruct and do a fresh write in the rare cases it is not), and then
recalculate the parity sectors from the verified user sectors, and
compare them to the parity sectors it actually read. The only reason it
would need to use the combinatorial approach is if it has a checksum
mismatch and needs to rebuild the data in the presence of bit rot.

> I thought about it, wondered and posted the question, and went
> on to my other work. I did not (yet) research the code to find
> first-hand, partly because gurus might know the answer and reply
> faster than I can dig into it ;)

I recently wondered this also and am glad you asked. I hope someone can
answer definitively.

Tim
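P.S. A minimal sketch in C of what I mean, for single (XOR) parity
only - the "block checksum" here is a made-up stand-in and none of the
names come from the ZFS code; it just shows the recompute-and-compare
flow:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR    512
#define DATA_COLS 3

/* toy stand-in for the block checksum kept in the parent block pointer */
static uint64_t toy_checksum(uint8_t d[][SECTOR], int cols)
{
    uint64_t sum = 0;
    for (int c = 0; c < cols; c++)
        for (int i = 0; i < SECTOR; i++)
            sum = sum * 31 + d[c][i];
    return sum;
}

/* XOR all data columns together into one parity column */
static void xor_parity(uint8_t p[SECTOR], uint8_t d[][SECTOR], int cols)
{
    memset(p, 0, SECTOR);
    for (int c = 0; c < cols; c++)
        for (int i = 0; i < SECTOR; i++)
            p[i] ^= d[c][i];
}

int main(void)
{
    uint8_t data[DATA_COLS][SECTOR];
    uint8_t parity_on_disk[SECTOR], parity_recomputed[SECTOR];

    /* fabricate "on-disk" contents for the sketch */
    for (int c = 0; c < DATA_COLS; c++)
        memset(data[c], c + 1, SECTOR);
    xor_parity(parity_on_disk, data, DATA_COLS);
    uint64_t bp_checksum = toy_checksum(data, DATA_COLS);

    /* scrub, as described above: read everything, verify the data
     * against the block checksum, then recompute parity and compare */
    if (toy_checksum(data, DATA_COLS) != bp_checksum) {
        puts("data bad: only now fall back to combinatorial repair");
        return 1;
    }
    xor_parity(parity_recomputed, data, DATA_COLS);
    if (memcmp(parity_recomputed, parity_on_disk, SECTOR) == 0)
        puts("parity verified, no permutations needed");
    else
        puts("parity column stale/corrupt: rewrite it");
    return 0;
}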
2012-10-25 21:17, Timothy Coalson wrote:
> On Thu, Oct 25, 2012 at 7:35 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>
>> If scrubbing works the way we "logically" expect it to, it
>> should enforce validation of such combinations for each read
>> of each copy of a block, in order to ensure that parity sectors
>> are intact and can be used for data recovery if a plain sector
>> fails. Likely, raidzN scrubs should show as compute-intensive
>> tasks compared to similar mirror scrubs.
>
> It should only be as compute intensive as writes - it can read the
> userdata and parity sectors, ensure the userdata checksum matches
> (reconstruct and do a fresh write in the rare cases it is not), and then
> recalculate the parity sectors from the verified user sectors, and
> compare them to the parity sectors it actually read.

Hmmm, that's another way to skin the cat, and it makes more sense ;)

//Jim
Does it not store a separate checksum for a parity block? If so, it
should not even need to recalculate the parity: assuming checksums
match for all data and parity blocks, the data is good.

I could understand why it would not store a checksum for a parity
block. It is not really necessary: parity is only used to reconstruct
a corrupted block, so you can reconstruct the block and verify the data
checksum. But I can also see why they would: simplified logic, faster
identification of corrupt parity blocks (more useful for RAIDZ2 and
greater), and the general principle that all blocks are checksummed.

If this were the case, it should mean that a RAIDZ scrub is faster than
a mirror scrub, which I don't think it is. So this post is probably
redundant (pun intended).

On 2012-10-25 20:46, Jim Klimov wrote:
> 2012-10-25 21:17, Timothy Coalson wrote:
>> It should only be as compute intensive as writes - it can read the
>> userdata and parity sectors, ensure the userdata checksum matches
>> (reconstruct and do a fresh write in the rare cases it is not), and then
>> recalculate the parity sectors from the verified user sectors, and
>> compare them to the parity sectors it actually read.
>
> Hmmm, that's another way to skin the cat, and it makes more sense ;)
> //Jim
2012-10-26 12:29, Karl Wagner wrote:
> Does it not store a separate checksum for a parity block? If so, it
> should not even need to recalculate the parity: assuming checksums
> match for all data and parity blocks, the data is good.

No. For the on-disk sector allocation over M disks, zfs raidzN writes
N parity sectors and up to M-N data sectors (possibly fewer for small
blocks and for tails of blocks) which that parity covers, then repeats
the process until it runs out of userdata sectors in the to-write
buffer. Overall, the parity and data sectors together make up the
on-disk block's contents.

Matching the contents (plain userdata, or permutations with parities)
against the expected checksum of the logical block (of userdata) is
what guarantees that the block was not corrupted - or that corruption
was detected and repaired (or not).

I am not sure how far permutations go, however - does it try to repair
per-stripe or per-column? If it tries hard, there are quite many
combinations to try for a 128Kb block spread over 512b sectors in a
raidz3 set ;)

> I could understand why it would not store a checksum for a parity block.

To summarize, parities are not separate blocks. They are additional
sectors prepended to (and intermixed with) the sectors of userdata,
all together saved as the on-disk block's contents.

//Jim
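P.S. To put rough numbers on that layout, a toy model (this deliberately
ignores the real allocator's padding and skew details, so treat it as a
sketch of the shape rather than the exact on-disk math):

#include <stdio.h>

int main(void)
{
    const int disks        = 6;    /* M: disks in the raidz vdev        */
    const int parity       = 2;    /* N: raidz2                         */
    const int data_sectors = 256;  /* a 128Kb block in 512-byte sectors */

    /* each "row" carries N parity sectors plus up to M-N data sectors */
    int per_row        = disks - parity;
    int rows           = (data_sectors + per_row - 1) / per_row;
    int parity_sectors = rows * parity;

    printf("rows                : %d\n", rows);
    printf("data sectors        : %d\n", data_sectors);
    printf("parity sectors      : %d\n", parity_sectors);
    printf("total sectors/block : %d\n", data_sectors + parity_sectors);
    return 0;
}

For a 128Kb block on a 6-disk raidz2 that comes out to 64 rows, i.e.
128 parity sectors riding along with the 256 data sectors.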
On 10/26/2012 04:29 AM, Karl Wagner wrote:
> Does it not store a separate checksum for a parity block? If so, it
> should not even need to recalculate the parity: assuming checksums
> match for all data and parity blocks, the data is good.
>
> I could understand why it would not store a checksum for a parity
> block. It is not really necessary: parity is only used to reconstruct
> a corrupted block, so you can reconstruct the block and verify the data
> checksum. But I can also see why they would: simplified logic, faster
> identification of corrupt parity blocks (more useful for RAIDZ2 and
> greater), and the general principle that all blocks are checksummed.
>
> If this were the case, it should mean that a RAIDZ scrub is faster than
> a mirror scrub, which I don't think it is. So this post is probably
> redundant (pun intended).

Parity is very simple to calculate and doesn't use a lot of CPU - just
slightly more work than reading all the blocks: read all the stripe
blocks on all the drives involved in a stripe, then do a simple XOR
operation across all the data. The actual checksums are more expensive
as they're MD5 - much nicer when these can be hardware accelerated.

Also, on x86, there are SSE block operations that make XORing a whole
block a lot faster by doing a whole chunk at a time, so you don't need
a loop to do it - not sure which ZFS implementations take advantage of
these, but in the end XOR is not an expensive operation. MD5 is more
expensive by several orders of magnitude.
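To illustrate how trivial the XOR pass is, here is a plain-C sketch that
does it a 64-bit word at a time (a compiler may well vectorize this on
its own; hand-written SSE just widens the same loop - this is an
illustration, not the ZFS or kernel code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_WORDS (512 / sizeof(uint64_t))
#define COLS 4

/* XOR all data columns of one stripe into the parity buffer,
 * one 64-bit word at a time */
static void xor_parity(uint64_t parity[SECTOR_WORDS],
                       uint64_t data[COLS][SECTOR_WORDS])
{
    memcpy(parity, data[0], 512);
    for (int c = 1; c < COLS; c++)
        for (size_t w = 0; w < SECTOR_WORDS; w++)
            parity[w] ^= data[c][w];
}

int main(void)
{
    static uint64_t data[COLS][SECTOR_WORDS];
    static uint64_t parity[SECTOR_WORDS];

    /* fill each column with a recognizable byte pattern */
    for (int c = 0; c < COLS; c++)
        memset(data[c], 0x10 + c, sizeof(data[c]));

    xor_parity(parity, data);
    printf("first parity word: 0x%016llx\n",
           (unsigned long long)parity[0]);
    return 0;
}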
On 27/10/12 11:56 AM, Ray Arachelian wrote:
> On 10/26/2012 04:29 AM, Karl Wagner wrote:
>> Does it not store a separate checksum for a parity block? If so, it
>> should not even need to recalculate the parity: assuming checksums
>> match for all data and parity blocks, the data is good.
> ...
> Parity is very simple to calculate and doesn't use a lot of CPU - just
> slightly more work than reading all the blocks: read all the stripe
> blocks on all the drives involved in a stripe, then do a simple XOR
> operation across all the data. The actual checksums are more expensive
> as they're MD5 - much nicer when these can be hardware accelerated.

Checksums are MD5??

--Toby

> Also, on x86, ...
2012-10-27 20:54, Toby Thain wrote:
>> Parity is very simple to calculate and doesn't use a lot of CPU - just
>> slightly more work than reading all the blocks: read all the stripe
>> blocks on all the drives involved in a stripe, then do a simple XOR
>> operation across all the data. The actual checksums are more expensive
>> as they're MD5 - much nicer when these can be hardware accelerated.
>
> Checksums are MD5??

No, they are fletcher variants or sha256, with more probably coming
up soon, and some of these might also be boosted by certain hardware
capabilities, but I tend to agree that parity calculations likely
are faster (even if not all parities are simple XORs - that would
be silly for double- or triple-parity sets which may use different
algos just to be sure).

//Jim
On Sat, Oct 27, 2012 at 12:35 PM, Jim Klimov <jimklimov at cos.ru> wrote:

> 2012-10-27 20:54, Toby Thain wrote:
>>
>> Checksums are MD5??
>
> No, they are fletcher variants or sha256, with more probably coming
> up soon, and some of these might also be boosted by certain hardware
> capabilities, but I tend to agree that parity calculations likely
> are faster (even if not all parities are simple XORs - that would
> be silly for double- or triple-parity sets which may use different
> algos just to be sure).

I would expect raidz2 and 3 to use the same math as traditional raid6
for parity: https://en.wikipedia.org/wiki/Raid6#RAID_6 . In particular,
the sentence "For a computer scientist, a good way to think about this
is that ⊕ is a bitwise XOR operator and g^i is the action of a linear
feedback shift register on a chunk of data." If I understood it
correctly, it does a different number of iterations of the LFSR on each
sector, depending on which sector among the data sectors it is, and the
LFSR is applied independently to small groups of bytes in each sector;
it then does the XOR to get the second parity sector (and for third
parity, I believe it needs to use a different generator polynomial for
the LFSR).

For small numbers of iterations, multiple iterations of the LFSR can be
optimized to a single shift and an XOR with a lookup value on the lowest
bits. For larger numbers of iterations (if you have, say, 28 disks in a
raidz3), it could construct the 25th iteration by doing 10, 10, 5, but I
have no idea how ZFS actually implements it.

As I understand it, fletcher checksums are extremely simple and are
basically 2 additions and 2 modulus operations per however many bytes it
processes at a time, so I wouldn't be surprised if fletcher was about
the same speed as computing second/third parity. SHA256 I don't know; I
would expect it to be more expensive, simply because it is a
cryptographic hash.

Tim
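P.S. In code, my reading of that Wikipedia construction looks like the
following - generator g = 2 over GF(2^8) with the usual 0x11d
polynomial, Q accumulated with Horner's rule. This is the textbook
RAID-6 math as I understand it, not a claim about what vdev_raidz.c
actually does:

#include <stdint.h>
#include <stdio.h>

/* multiply by 2 in GF(2^8) with the RAID-6 polynomial 0x11d: shift
 * left, and fold the carried-out bit back in with 0x1d - this is the
 * single "LFSR step" mentioned above */
static uint8_t gf_mul2(uint8_t b)
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1d : 0x00));
}

int main(void)
{
    uint8_t d[4] = { 0xde, 0xad, 0xbe, 0xef };  /* one byte per data column */
    uint8_t p = 0, q = 0;

    /* P is plain XOR; Q = d[0] + g*d[1] + g^2*d[2] + ... via Horner */
    for (int i = 3; i >= 0; i--) {
        p ^= d[i];
        q = gf_mul2(q) ^ d[i];
    }
    printf("P = 0x%02x, Q = 0x%02x\n", p, q);
    return 0;
}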
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-28 20:08 UTC
[zfs-discuss] Scrub and checksum permutations
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I tend to agree that parity calculations likely
> are faster (even if not all parities are simple XORs - that would
> be silly for double- or triple-parity sets which may use different
> algos just to be sure).

Even though parity calculation is faster than fletcher, which is faster
than sha256, it's all irrelevant, except in the hugest of file servers.
Go write to disk or read from disk as fast as you can, and see how much
CPU you use. Even on moderate fileservers that I've done this on (a
dozen disks in parallel), the CPU load is negligible.

If you ever get up to a scale where the CPU load becomes significant,
you solve it by adding more CPUs. There is a limit somewhere, but it's
huge.
On 28 October, 2012 - Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) sent me these 1,0K bytes:

> Even though parity calculation is faster than fletcher, which is faster
> than sha256, it's all irrelevant, except in the hugest of file servers.
> Go write to disk or read from disk as fast as you can, and see how much
> CPU you use. Even on moderate fileservers that I've done this on (a
> dozen disks in parallel), the CPU load is negligible.
>
> If you ever get up to a scale where the CPU load becomes significant,
> you solve it by adding more CPUs. There is a limit somewhere, but it's
> huge.

For just the parity part, here are the kernel's numbers on an older
Linux box with quite an old CPU (one of the first dual-core Athlon64s):

[961655.168961] xor: automatically using best checksumming function: generic_sse
[961655.188007]    generic_sse:  6128.000 MB/sec
[961655.188010] xor: using function: generic_sse (6128.000 MB/sec)
[961655.256025] raid6: int64x1   1867 MB/s
[961655.324020] raid6: int64x2   2372 MB/s
[961655.392027] raid6: int64x4   1854 MB/s
[961655.460019] raid6: int64x8   1672 MB/s
[961655.528062] raid6: sse2x1     834 MB/s
[961655.596047] raid6: sse2x2    1273 MB/s
[961655.664028] raid6: sse2x4    2116 MB/s
[961655.664030] raid6: using algorithm sse2x4 (2116 MB/s)

So raid6 parity at 2 GByte/s and raid5 parity at 6 GByte/s should be
enough, even on a 6+ year old low-end desktop machine.

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On Thu, Oct 25, 2012 at 2:25 AM, Jim Klimov <jimklimov at cos.ru> wrote:

> Hello all,
>
> I was describing how raidzN works recently, and got myself wondering:
> does zpool scrub verify all the parity sectors and the mirror halves?

Yes. The ZIO_FLAG_SCRUB instructs the raidz or mirror vdev to read and
verify all parts of the blocks (parity sectors and mirror copies).

The math for RAID-Z is described in detail in the comments of
vdev_raidz.c. If there is a checksum error, we reconstitute the data by
trying all possible combinations of N incorrect sectors (N being the
number of parity disks) -- see vdev_raidz_combrec().

--matt
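P.S. For those following along, the shape (and worst-case cost) of that
search is roughly the following toy enumeration - reconstruction and
checksum verification are stubbed out, and this is an illustration
rather than the actual vdev_raidz_combrec() code:

#include <stdio.h>

static int popcount(unsigned v)
{
    int n = 0;
    for (; v; v >>= 1)
        n += v & 1;
    return n;
}

int main(void)
{
    const int cols = 8;      /* total columns in the raidz map */
    const int nparity = 2;   /* raidz2 */
    int tried = 0;

    for (unsigned mask = 1; mask < (1u << cols); mask++) {
        int bad = popcount(mask);
        if (bad < 1 || bad > nparity)
            continue;
        /* here the real code would: treat the columns in `mask` as
         * missing, rebuild them from parity, and re-verify the block
         * checksum; stop at the first combination that verifies */
        tried++;
    }
    printf("combinations to try (worst case): %d\n", tried);  /* C(8,1)+C(8,2) = 36 */
    return 0;
}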
On Wed, Oct 31, 2012 at 6:47 PM, Matthew Ahrens <mahrens at delphix.com> wrote:

> On Thu, Oct 25, 2012 at 2:25 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>
>> Hello all,
>>
>> I was describing how raidzN works recently, and got myself wondering:
>> does zpool scrub verify all the parity sectors and the mirror halves?
>
> Yes. The ZIO_FLAG_SCRUB instructs the raidz or mirror vdev to read and
> verify all parts of the blocks (parity sectors and mirror copies).

Good to know.

> The math for RAID-Z is described in detail in the comments of
> vdev_raidz.c. If there is a checksum error, we reconstitute the data by
> trying all possible combinations of N incorrect sectors (N being the
> number of parity disks) -- see vdev_raidz_combrec().

Google gave me this result:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

It had me slightly concerned, because it does the LFSR on single bytes,
though mainly for theoretical reasons - for a raidz3 of 258 devices
(ill-advised, to say the least), using single bytes in the LFSR wouldn't
allow the cycle to be long enough for the parity to protect against two
specific failures. However, I tested whether zpool create checks for
this by creating 300 100MB files and attempting to make them into a
raidz3 pool, and got this:

$ zpool create -n -o cachefile=none testlargeraidz raidz3 `pwd`/*
invalid vdev specification: raidz3 supports no more than 255 devices

Same story for raidz and raidz2. So it looks like they already thought
of this too.

Tim
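P.S. A quick sanity check of the cycle-length property I was worried
about, assuming the per-byte generator is g = 2 over GF(2^8) with the
classic RAID-6 polynomial 0x11d (my assumption - I did not verify this
against vdev_raidz.c): g only returns to 1 after 255 steps, so there
are at most 255 distinct per-column multipliers to go around.

#include <stdint.h>
#include <stdio.h>

/* multiply by 2 in GF(2^8) with the 0x11d polynomial */
static uint8_t gf_mul2(uint8_t b)
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1d : 0x00));
}

int main(void)
{
    uint8_t v = 1;
    int order = 0;

    /* count how many multiplications by 2 it takes to get back to 1 */
    do {
        v = gf_mul2(v);
        order++;
    } while (v != 1 && order < 300);

    printf("multiplicative order of 2 in GF(2^8): %d\n", order);  /* prints 255 */
    return 0;
}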