My problematic home NAS (which old list readers might still remember from a year or two ago) is back online, thanks to a friend who fixed and turned it on. I'm going to try some more research on that failure I had with the 6-disk raidz2 set, when it suddenly couldn't read and recover some blocks (presumably scratched or otherwise damaged on all disks at once while the heads hovered over similar locations).

My plan is to dig out the needed sectors of the broken block from each of the 6 disks and try any and all reasonable recombinations of redundancy and data sectors to try and match the checksum - this should give me a definite answer on whether ZFS (of that oi151.1.3-based build) does all I think it can to save data or not. Either I put the last nail into my itching question's coffin, or I'd nail a bug to yell about ;)

So... here are some applied questions:

1) With dd I found the logical offset after which it fails to read data from a damaged file, because ZFS doesn't trust the block. That's 3840 sectors (@512b), or 0x1e0000. With zdb I listed the file inode's block tree and got this, in particular:

# zdb -ddddd -bbbbbb -e 1601233584937321596/export/DUMP 27352564 \
  > /var/tmp/brokenfile.ZDBreport 2>&1
...
1c0000 L0 DVA[0]=<0:acbc2a46000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous dedup single size=20000L/4a00P (txg, cksum)
1e0000 L0 DVA[0]=<0:acbc2a4f000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous dedup single size=20000L/4c00P birth=324721364L/324721364P fill=1 cksum=705e79361b8f028e:5a45c8f863a4035f:41b2961480304d7:be685ec248f00e78
200000 L0 DVA[0]=<0:acbc2a58000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous dedup single size=20000L/4c00P (txg, cksum)
...

So... how DO I properly interpret this to select the sector ranges to dd into my test area from each of the 6 disks in the raidz2 set?

On one hand, the DVA states that the block length is 0x9000, and this matches the offsets of the neighboring blocks. On the other hand, the compressed "physical" data size is 0x4c00 for this block, and ranges from 0x4800 to 0x5000 for the other blocks of the file. Even multiplied by 1.5 (for raidz2) this is about 0x7000 and way smaller than 0x9000. For uncompressed files I think I saw entries like "size=20000L/30000P", so I'm not sure even my multiplication by 1.5x above is valid, and the discrepancy between the DVA size (and block interval) and the "physical" allocation size reaches about 2x.

So... again... how many sectors from each disk should I fetch for my research of this one block?

2) Do I understand correctly that for the offset definition, sectors in a top-level VDEV (which is all of my pool) are numbered in rows across the component disks? Like this:

    0  1  2  3  4  5
    6  7  8  9 10 11 ...

That is, "offset % setsize = disknum"?

If true, does such a numbering scheme apply all over the TLVDEV, so that for my block on a 6-disk raidz2 set its sectors start at (roughly rounded) "offset_from_DVA / 6" on each disk, right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of the first disk holding anything from this block would contain the raid-algo1 permutations of the four data sectors, the sectors of the second disk would contain the raid-algo2 for those 4 sectors, and the remaining 4 disks would contain the data sectors? The redundancy algos should in fact cover the other redundancy disks too (in order to sustain the loss of any 2 disks), correct?
Is it, in particular, true that the redundancy-protected stripes involve a single sector from each disk, repeated for the length of the block's portion on each disk (and not that some, say, 32KB from one disk are wholly the redundancy for 4*32KB of data from the other disks)? I think this is what I hope to catch - if certain non-overlapping sectors got broken on each disk, but ZFS compares larger ranges to recover data, then those two approaches work on very different data.

4) Where are the redundancy algorithms specified? Is there any simple tool that would recombine a given algo-N redundancy sector with some other 4 sectors from a 6-sector stripe in order to try and recalculate the sixth sector's contents? (Perhaps part of some unit tests?)

5) Is there any magic to the checksum algorithms? I.e. if I pass some 128KB block's logical (userdata) contents to the command-line "sha256" or "openssl sha256" - should I get the same checksum as ZFS provides and uses?

6) What exactly does a checksum apply to - the 128KB userdata block or a 15-20KB (lzjb-)compressed portion of data? I am sure it's the latter, but I ask just in case I'm not missing anything... :)

As I said, in the end I hope to have from-disk and guessed userdata sectors - a gazillion or so for given logical offsets inside a 128KB userdata block - which I would then recombine and hash with sha256 to see if any combination yields the value saved in the block pointer and ZFS missed something, or if I don't get any such combo and ZFS indeed does what it should, exhaustively and correctly ;)

Thanks a lot in advance for any info, ideas, insights, and just for reading this long post to the end ;)

//Jim Klimov
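A minimal sketch of the recombination loop described above, assuming the per-disk sector dumps have already been extracted with dd and that the checksum covers exactly the P-size bytes of the compressed payload (the question 6/9 territory discussed further down the thread). The "variants" lists, the column reassembly order and the function name are placeholders - they would have to come from the actual dumps and parity reconstructions:

import hashlib
from itertools import product

PSIZE = 0x4c00                     # physical (compressed) size from zdb
EXPECTED = bytes.fromhex(          # cksum= field from zdb; note that zdb drops
    "705e79361b8f028e"             # leading zeros, so each 64-bit word is padded
    "5a45c8f863a4035f"             # back to 16 hex digits before concatenation
    "041b2961480304d7"
    "be685ec248f00e78")

def find_matching_combo(variants):
    """variants[i] is a list of candidate byte-strings for data column i
    (e.g. the on-disk bytes plus one or more parity-reconstructed guesses)."""
    for combo in product(*variants):
        payload = b"".join(combo)[:PSIZE]   # drop the padding tail of the last column
        if hashlib.sha256(payload).digest() == EXPECTED:
            return combo
    return None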
On 2012-12-02 05:42, Jim Klimov wrote:
> So... here are some applied questions:

Well, I am ready to reply to a few of my own questions now :)

I've staged an experiment by taking a 128KB block from that file and appending it to a new file in a test dataset, where I changed the compression settings between the appendages. Thus I've got a zdb dump of three blocks with identical logical userdata and different physical data.

# zdb -ddddd -bbbbbb -e 1601233584937321596/test3 8 > /pool/test3/a.zdb
...
Indirect blocks:
     0 L1 DVA[0]=<0:59492a98000:3000> DVA[1]=<0:83e2f65000:3000> [L1 ZFS plain file] sha256 lzjb LE contiguous unique double size=4000L/400P birth=326381727L/326381727P fill=3 cksum=2ebbfb189e7ce003:166a23fd39d583ed:f527884977645395:896a967526ea9cea
     0 L0 DVA[0]=<0:590002c1000:30000> [L0 ZFS plain file] sha256 uncompressed LE contiguous unique single size=20000L/20000P birth=326381721L/326381721P fill=1 cksum=3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9
 20000 L0 DVA[0]=<0:594928b8000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous unique single size=20000L/4800P birth=326381724L/326381724P fill=1 cksum=57164faa0c1cbef4:23348aa9722f47d3:3b1b480dc731610b:7f62fce0cc18876f
 40000 L0 DVA[0]=<0:59492a92000:6000> [L0 ZFS plain file] sha256 gzip-9 LE contiguous unique single size=20000L/2800P birth=326381727L/326381727P fill=1 cksum=d68246ee846944c6:70e28f6c52e0c6ba:ea8f94fc93f8dbfd:c22ad491c1e78530

     segment [0000000000000000, 0000000000080000) size 512K

> 1) So... how DO I properly interpret this to select the sector ranges to
>    dd into my test area from each of the 6 disks in the raidz2 set?
>
>    On one hand, the DVA states that the block length is 0x9000, and this
>    matches the offsets of the neighboring blocks.
>
>    On the other hand, the compressed "physical" data size is 0x4c00 for
>    this block, and ranges from 0x4800 to 0x5000 for the other blocks of
>    the file. Even multiplied by 1.5 (for raidz2) this is about 0x7000 and
>    way smaller than 0x9000. For uncompressed files I think I saw entries
>    like "size=20000L/30000P", so I'm not sure even my multiplication
>    by 1.5x above is valid, and the discrepancy between the DVA size (and
>    block interval) and the "physical" allocation size reaches about 2x.

Apparently, my memory failed me. The values in the "size" field describe the userdata (compressed, non-redundant). Also, I forgot to consider that this pool uses 4KB sectors (ashift=12).

So my userdata, which takes up about 0x4800 bytes, requires 4.5 (rather, 5 whole) sectors, and this warrants 4 sectors of raidz2 redundancy on a 6-disk set - 2 sectors for the first 4 data sectors, and 2 sectors for the remaining half-sector's worth of data. This does sum up to 9*0x1000 bytes in whole-sector counting (as in offsets).

However, the gzip-compressed block above, which has only 0x2800 bytes of userdata and requires 3 data sectors plus 2 redundancy sectors, still has a DVA size of six 4KB sectors (0x6000). This is strange to me - I'd expect 5 sectors for this block altogether... does anyone have an explanation? Also, what should the extra userdata sector contain physically - zeroes?

> 5) Is there any magic to the checksum algorithms? I.e. if I pass some
>    128KB block's logical (userdata) contents to the command-line "sha256"
>    or "openssl sha256" - should I get the same checksum as ZFS provides
>    and uses?
The original 128KB file's sha256 checksum matches the uncompressed block's ZFS checksum, so in my further tests I can use the command-line tools to verify the recombined results:

# sha256sum /tmp/b128
3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9  /tmp/b128

No magic, as long as there are usable command-line implementations of the needed algos (sha256sum is there, fletcher[24] are not).

> 6) What exactly does a checksum apply to - the 128KB userdata block or a
>    15-20KB (lzjb-)compressed portion of data? I am sure it's the latter,
>    but I ask just in case I'm not missing anything... :)

The ZFS parent block checksum applies to the on-disk variant of the userdata payload (compression included, redundancy excluded).

NEW QUESTIONS:

7) Is there a command-line tool to do lzjb compressions and decompressions (in the same blocky manner as would be applicable to ZFS compression)?

I've also tried to gzip-compress the original 128KB file, but none of the compressed results (with varying gzip level) yielded a checksum that would match the ZFS block's one. Zero-padding to 10240 bytes (psize=0x2800) did not help.

8) When should the decompression stop - as soon as it has extracted the logical-size number of bytes (i.e. 0x20000)?

9) Physical sizes magically are in whole 512b units, so it seems... I doubt that the compressed data would always end at such a boundary. How many bytes should be covered by a checksum? Are the 512b blocks involved zero-padded at the ends (on disk and/or in RAM)?

Some OLD questions remain raised, just in case anyone answers them.

> 2) Do I understand correctly that for the offset definition, sectors
>    in a top-level VDEV (which is all of my pool) are numbered in rows
>    across the component disks? Like this:
>     0  1  2  3  4  5
>     6  7  8  9 10 11 ...
>
>    That is, "offset % setsize = disknum"?
>
>    If true, does such a numbering scheme apply all over the TLVDEV,
>    so that for my block on a 6-disk raidz2 set its sectors start at
>    (roughly rounded) "offset_from_DVA / 6" on each disk, right?
>
> 3) Then, if I read the ZFS on-disk spec correctly, the sectors of
>    the first disk holding anything from this block would contain the
>    raid-algo1 permutations of the four data sectors, the sectors of
>    the second disk would contain the raid-algo2 for those 4 sectors,
>    and the remaining 4 disks would contain the data sectors?
>    The redundancy algos should in fact cover the other redundancy
>    disks too (in order to sustain the loss of any 2 disks), correct? (...)
>
> 4) Where are the redundancy algorithms specified? Is there any simple
>    tool that would recombine a given algo-N redundancy sector with
>    some other 4 sectors from a 6-sector stripe in order to try and
>    recalculate the sixth sector's contents? (Perhaps part of some
>    unit tests?)

I'm almost ready to go and test Q2 and Q3; however, the questions which regard usable tools (and "what data should be fed into such tools?") are still on the table.

> Thanks a lot in advance for any info, ideas, insights,
> and just for reading this long post to the end ;)
> //Jim Klimov

Ditto =) //Jim
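For reference, the mapping between the zdb "cksum=" field and the sha256sum output is just concatenation of the four 64-bit words; the only wrinkle is that zdb prints them without leading zeros. A tiny sketch (the helper name is made up):

def zdb_cksum_to_hex(cksum_field):
    """Turn zdb's colon-separated cksum words into a sha256-style hex string,
    zero-padding each 64-bit word back to 16 hex digits."""
    return "".join(word.zfill(16) for word in cksum_field.split(":"))

# Matches the uncompressed block in the zdb dump above:
assert zdb_cksum_to_hex(
    "3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9"
) == "3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9"

For the lzjb and gzip blocks the comparison target would presumably be the sha256 of the compressed (P-size) payload rather than of the logical 128KB, per the answer to question 6 above.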
On 2012-12-03 18:23, Jim Klimov wrote:
> On 2012-12-02 05:42, Jim Klimov wrote:
>> So... here are some applied questions:
>
> Well, I am ready to reply to a few of my own questions now :)

Continuing the desecration of my deceased files' resting grounds...

>> 2) Do I understand correctly that for the offset definition, sectors
>>    in a top-level VDEV (which is all of my pool) are numbered in rows
>>    across the component disks? Like this:
>>     0  1  2  3  4  5
>>     6  7  8  9 10 11 ...
>>
>>    That is, "offset % setsize = disknum"?
>>
>>    If true, does such a numbering scheme apply all over the TLVDEV,
>>    so that for my block on a 6-disk raidz2 set its sectors start at
>>    (roughly rounded) "offset_from_DVA / 6" on each disk, right?
>>
>> 3) Then, if I read the ZFS on-disk spec correctly, the sectors of
>>    the first disk holding anything from this block would contain the
>>    raid-algo1 permutations of the four data sectors, the sectors of
>>    the second disk would contain the raid-algo2 for those 4 sectors,
>>    and the remaining 4 disks would contain the data sectors?

My understanding was correct. For posterity:

In the earlier set-up example I had an uncompressed 128KB block residing at the address DVA[0]=<0:590002c1000:30000>. Counting in my disks' 4KB sectors, this is 0x590002c1000/0x1000 = 0x590002C1, or 1493172929, as the logical sector offset into TLVDEV number 0 (the only one in this pool). Given that this TLVDEV is a 6-disk raidz2 set, my expected offset on each component drive is 1493172929/6 = 248862154.83 (.83 = 5/6), counting from after the ZFS header (2 labels and a reservation, amounting to 4MB = 1024*4KB sectors). So this block's allocation covers 8 4KB sectors starting at 248862154+1024 on disk 5 and at 248862155+1024 on disks 0,1,2,3,4.

As my further tests showed, the sector columns (not rows, as I had expected after doc-reading) from disks 1,2,3,4 do recombine into the original userdata (the sha256 checksum matches), so disks 5 and 0 should hold the two parities - however those are calculated:

# for D in 1 2 3 4; do dd bs=4096 count=8 conv=noerror,sync \
    if=/dev/dsk/c7t${D}d0s0 of=b1d${D}.img skip=248863179; done

# for D in 1 2 3 4; do for R in 0 1 2 3 4 5 6 7; do \
    dd if=/pool/test3/b1d${D}.img bs=4096 skip=$R count=1; \
    done; done > /tmp/d

Note that the latter can be greatly simplified as "cat", which works to the same effect and is faster:

# cat /pool/test3/b1d?.img > /tmp/d

However, I left the "difficult" notation to use in experiments later on.

That is, the original 128KB block was cut into 4 pieces (my 4 data drives in the 6-disk raidz2 set), and each 32KB strip was stored on a separate drive. Nice descriptive pictures in some presentations had suggested to me that the original block is stored sector by sector, rotating onto the next disk - the set of 4 data sectors with 2 parity sectors in my case being a single stripe for RAID purposes. This directly suggested that incomplete such "stripes", such as the ends of files or whole small files, would still have the two parity sectors and a handful of data sectors. Reality differs.

For undersized allocations, i.e. of compressed data, it is possible to see P-sizes not divisible by 4 (data disks) in 4KB sectors; however, some sectors do apparently get wasted, because the A-size in the DVA is divisible by 6*4KB. With columnar allocation across disks, it is easier to see why full stripes have to be used:

 p1 p2 d1 d2 d3 d4
  .  ,  1  5  9 13
  .  ,  2  6 10 14
  .  ,  3  7 11  x
  .  ,  4  8 12  x
In this illustration a 14-sector-long block is saved, with X being the empty leftovers, on which we can't really save anything (as would be the case with the other allocation scheme, which is likely less efficient for CPU and IOs).

The metadata blocks do have A-sizes of 0x3000 (2 parity + 1 data), at least, which on 4KB-sectored disks is also quite a lot for these miniature data objects - but not as sad as 6*4KB would have been ;)

It also seems that the instinctive desire to have raidzN sets of 4*M+N disks (e.g. 6-disk raidz2, 11-disk raidz3, etc.), which was discussed over and over on the list a couple of years ago, may still be valid with typical block sizes being powers of two... even though gurus said that this should not matter much. For IOPS - maybe not. For wasted space - likely...

> I'm almost ready to go and test Q2 and Q3; however, the questions
> which regard usable tools (and "what data should be fed into such
> tools?") are still on the table.

> Some OLD questions remain raised, just in case anyone answers them.
>> 3b) The redundancy algos should in fact cover the other redundancy
>>     disks too (in order to sustain the loss of any 2 disks), correct? (...)
>>
>> 4) Where are the redundancy algorithms specified? Is there any simple
>>    tool that would recombine a given algo-N redundancy sector with
>>    some other 4 sectors from a 6-sector stripe in order to try and
>>    recalculate the sixth sector's contents? (Perhaps part of some
>>    unit tests?)

> 7) Is there a command-line tool to do lzjb compressions and
>    decompressions (in the same blocky manner as would be applicable
>    to ZFS compression)?
>
>    I've also tried to gzip-compress the original 128KB file, but
>    none of the compressed results (with varying gzip level) yielded
>    a checksum that would match the ZFS block's one.
>    Zero-padding to 10240 bytes (psize=0x2800) did not help.
>
> 8) When should the decompression stop - as soon as it has extracted
>    the logical-size number of bytes (i.e. 0x20000)?
>
> 9) Physical sizes magically are in whole 512b units, so it seems...
>    I doubt that the compressed data would always end at such a boundary.
>
>    How many bytes should be covered by a checksum?
>    Are the 512b blocks involved zero-padded at the ends (on disk and/or in RAM)?
>
>> Thanks a lot in advance for any info, ideas, insights,
>> and just for reading this long post to the end ;)
>> //Jim Klimov

Ditto =) //Jim
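A small sketch of the offset arithmetic above, turning a raidz DVA into per-child-disk dd parameters. It assumes ashift=12, a 6-wide raidz2 top-level vdev and the 4MB of labels plus boot reservation in front of the allocatable area; which child index corresponds to which c7t*d0 device is a separate question of vdev child ordering, so treat the mapping as illustrative:

ASHIFT = 12
NDISKS = 6
LABEL_SKIP = (4 * 1024 * 1024) >> ASHIFT      # 1024 sectors of 4KB each

def dd_plan(dva_offset, asize):
    """Return {child_disk: (skip_sectors, count_sectors)} for a raidz DVA,
    mirroring the by-hand calculation in the message above."""
    first = dva_offset >> ASHIFT              # sector index within the TLVDEV
    nsect = asize >> ASHIFT                   # 0x30000 -> 48 sectors in total
    plan = {}
    for k in range(nsect):
        disk = (first + k) % NDISKS
        offset = (first + k) // NDISKS + LABEL_SKIP
        skip, count = plan.get(disk, (offset, 0))
        plan[disk] = (skip, count + 1)
    return plan

# dd_plan(0x590002c1000, 0x30000) yields disk 5 -> (248863178, 8) and
# disks 0..4 -> (248863179, 8), matching the dd commands quoted above.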
On Tue, Dec 4, 2012 at 10:52 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> On 2012-12-03 18:23, Jim Klimov wrote:
>> On 2012-12-02 05:42, Jim Klimov wrote:
>>> 4) Where are the redundancy algorithms specified? Is there any simple
>>>    tool that would recombine a given algo-N redundancy sector with
>>>    some other 4 sectors from a 6-sector stripe in order to try and
>>>    recalculate the sixth sector's contents? (Perhaps part of some
>>>    unit tests?)

I'm a bit late to the party, but from a previous list thread about redundancy algorithms, I had found this:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

Particularly the functions "vdev_raidz_reconstruct_p", "vdev_raidz_reconstruct_q", "vdev_raidz_reconstruct_pq" (and possibly "vdev_raidz_reconstruct_general") seem like what you are looking for.

As I understand it, the case where you have both redundancy blocks, but are missing two data blocks, is the hardest (if you are missing only one data block, you can either do a straight xor with the first redundancy section, or some LFSR shifting, xor, and then reverse LFSR shifting to use the second redundancy section).

Wikipedia describes the math to restore from two missing data sections here, under "computing parity": http://en.wikipedia.org/wiki/Raid6#RAID_6

I don't know any tools to do this for you from arbitrary input, sorry.

Tim
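For experimenting outside the kernel, the underlying math is small enough to sketch directly. Below is a toy version of the two single-failure paths Tim mentions - plain XOR against P, and generator multiplication in GF(2^8) against Q. The field polynomial 0x11d and generator 2 are the usual RAID-6 choices, but the coefficient ordering of Q here is my own, so this illustrates the algebra rather than reproducing byte-exact on-disk ZFS parity:

def gf_mul(a, b):
    """Multiply in GF(2^8) with the RAID-6 polynomial x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gen_pq(cols):
    """P is the XOR of the data columns, Q is the weighted GF(2^8) sum."""
    p = q = 0
    for i, d in enumerate(cols):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def rebuild_from_p(p, cols, missing):
    """One missing data column: XOR P with all surviving columns."""
    x = p
    for i, d in enumerate(cols):
        if i != missing:
            x ^= d
    return x

def rebuild_from_q(q, cols, missing):
    """One missing data column via Q: strip the surviving terms, then divide
    by the missing column's coefficient (inverse = a^254 in GF(2^8))."""
    x = q
    for i, d in enumerate(cols):
        if i != missing:
            x ^= gf_mul(gf_pow(2, i), d)
    return gf_mul(x, gf_pow(gf_pow(2, missing), 254))

# Self-check on one byte per column (real code would loop over whole sectors):
cols = [0x12, 0xab, 0x05, 0xe7]
p, q = gen_pq(cols)
assert rebuild_from_p(p, cols, 2) == cols[2]
assert rebuild_from_q(q, cols, 2) == cols[2]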
On 2012-12-05 05:52, Jim Klimov wrote:
> For undersized allocations, i.e. of compressed data, it is possible
> to see P-sizes not divisible by 4 (data disks) in 4KB sectors; however,
> some sectors do apparently get wasted, because the A-size in the DVA
> is divisible by 6*4KB. With columnar allocation across disks, it is
> easier to see why full stripes have to be used:
>
>  p1 p2 d1 d2 d3 d4
>   .  ,  1  5  9 13
>   .  ,  2  6 10 14
>   .  ,  3  7 11  x
>   .  ,  4  8 12  x
>
> In this illustration a 14-sector-long block is saved, with X being
> the empty leftovers, on which we can't really save anything (as would
> be the case with the other allocation scheme, which is likely less
> efficient for CPU and IOs).

Getting more and more puzzled with this... I have seen DVA values matching both theories now...

Interestingly, all the allocations I looked over involved a number of sectors divisible by 3... rounding to half of my 6-disk RAID set - is it merely a coincidence, or some means of balancing IOs?

Anyhow, with 4KB sectors involved, I saw many 128KB logical blocks compressed into just half a dozen sectors of userdata payload, so wasting one or two sectors here is quite a large percentage of my storage overhead.

An exposition of the found evidence follows. Take this one from my original post:

DVA[0]=<0:594928b8000:9000> ... size=20000L/4800P

It has 5 data sectors (@4KB) over the 4 data disks in my raidz2 set, so it spills over to a second row and requires additional parity sectors - overall 5d+4p = 9 sectors, which we see in the DVA A-size. This is normal, as expected.

These ones, however, differ:

DVA[0]=<0:acef500e000:c000> ... size=20000L/6a00P
DVA[0]=<0:acef501a000:c000> ... size=20000L/7200P
DVA[0]=<0:acef5026000:c000> ... size=20000L/5c00P

These neighbors, with 7, 8 and 6 sectors' worth of data, all occupy 12 sectors on disk along with their parities.

DVA[0]=<0:59492a92000:6000> ... size=20000L/2800P

With 3*4KB sectors' worth of data and 2 parity sectors, this block is allocated over 6, not 5, sectors.

DVA[0]=<0:5996bf7c000:12000> ... size=20000L/a800P

Likewise, with 11 sectors of data and likely 6 sectors of parity, this one is given 18, not 17, sectors of storage allocation.

DVA[0]=<0:5996be32000:1e000> ... size=20000L/12c00P

Here, 19 sectors of data and 10 of parity occupy 30 sectors on disk.

I did not yet research where exactly the "unused" sectors are allocated - "vertically" on the last strip, like in my yesterday's depiction quoted above, or "horizontally" across several disks - but now that I know about this, it really bothers me as wasted space with no apparent gain. I mean, the raidz code does tricks to ensure that parities are located on different disks, and in normal conditions the userdata sector reads land on all disks in a uniform manner. Why forfeit the natural "rotation" that P-sizes smaller than a multiple of the number of data disks would provide? Writes are anyway streamed and coalesced, so by not allocating these unused sectors we'd only reduce the needed write IOPS by some portion - and save disk space...

In short: can someone explain the rationale - why are allocations such as they are now, and can this be discussed as a bug, or should it be rationalized as a feature?

Thanks,
//Jim
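One candidate explanation, offered as my reading of vdev_raidz_asize() in the vdev_raidz.c source linked elsewhere in this thread rather than as a confirmed answer: raidz appears to round every allocation up to a multiple of (nparity + 1) sectors, so that any freed gap can always hold at least a 1-sector block plus its parity. A sketch that reproduces all of the A-sizes quoted above:

def raidz_asize(psize, ashift=12, ndisks=6, nparity=2):
    """Sector accounting as I read vdev_raidz_asize(): data sectors, plus
    nparity parity sectors per row of (ndisks - nparity) data columns, then
    rounded up to a multiple of (nparity + 1) sectors."""
    ndata = ndisks - nparity
    sectors = ((psize - 1) >> ashift) + 1
    sectors += nparity * ((sectors + ndata - 1) // ndata)
    sectors += (-sectors) % (nparity + 1)          # the "wasted" padding sectors
    return sectors << ashift

# Reproduces the A-sizes observed above:
for psize, asize in [(0x4800, 0x9000), (0x6a00, 0xc000), (0x7200, 0xc000),
                     (0x5c00, 0xc000), (0x2800, 0x6000), (0xa800, 0x12000),
                     (0x12c00, 0x1e000)]:
    assert raidz_asize(psize) == asize

If that reading is right, the "divisible by 3" observation is exactly the roundup to (nparity + 1) = 3 sectors - a guard against unallocatable 1- or 2-sector holes rather than a means of IO balancing.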
more below...

On 2012-12-05 23:16, Timothy Coalson wrote:
> On Tue, Dec 4, 2012 at 10:52 PM, Jim Klimov <jimklimov at cos.ru> wrote:
>> On 2012-12-03 18:23, Jim Klimov wrote:
>>> On 2012-12-02 05:42, Jim Klimov wrote:
>>>> 4) Where are the redundancy algorithms specified? Is there any simple
>>>>    tool that would recombine a given algo-N redundancy sector with
>>>>    some other 4 sectors from a 6-sector stripe in order to try and
>>>>    recalculate the sixth sector's contents? (Perhaps part of some
>>>>    unit tests?)
>
> I'm a bit late to the party, but from a previous list thread about
> redundancy algorithms, I had found this:
>
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
>
> Particularly the functions "vdev_raidz_reconstruct_p",
> "vdev_raidz_reconstruct_q", "vdev_raidz_reconstruct_pq" (and possibly
> "vdev_raidz_reconstruct_general") seem like what you are looking for.
>
> As I understand it, the case where you have both redundancy blocks, but
> are missing two data blocks, is the hardest (if you are missing only one
> data block, you can either do a straight xor with the first redundancy
> section, or some LFSR shifting, xor, and then reverse LFSR shifting to
> use the second redundancy section).
>
> Wikipedia describes the math to restore from two missing data sections
> here, under "computing parity": http://en.wikipedia.org/wiki/Raid6#RAID_6
>
> I don't know any tools to do this for you from arbitrary input, sorry.

Thanks, you are not "late" and welcome to the party ;)

I'm hacking together a simple program to look over the data sectors and XOR parity and determine how many discrepancies there are, if any, and at what offsets into the sector - byte by byte. Running it on raw ZFS block component sectors, extracted with dd in the ways I wrote of earlier in the thread, I did confirm some good sectors and the one erroneous block that I have. The latter turns out to have 4.5 sectors' worth of userdata, overall laid out like this:

     dsk0 dsk1 dsk2 dsk3 dsk4 dsk5
      _    _    _    _    _    p1
      q1   d0   d2   d3   d4*  p2
      q2   d1   _    _    _    _

Here the compressed userdata is contained in the order of my "d"-sector numbering, d0-d1-d2-d3-d4, and d4 is only partially occupied (the P-size of the block is 0x4c00), so its final quarter is all zeroes.

It also happens that on disks 1,2,3 the first data row's sectors (d0, d2, d3) are botched - the ranges from 0x9C0 to 0xFFF (the end of the 4KB sector) are zeroes. The neighboring blocks, located a few sectors away from this one, also have compressed data and show some regular-looking patterns of bytes, certainly no long stretches of zeroes.

However, the byte-by-byte XOR matching complains about the whole sector: all bytes, except some 40 single-byte locations here and there, don't XOR up to produce the expected (known from disk) value. I did not yet try the second parity algorithm.

At least in this case, it does not seem that I would find an incantation needed to recover this block - too many zeroes overlap (at least 3 disks' data is proven compromised), where I had hoped for some shortcoming in ZFS recombination exhaustiveness. In this case there is indeed too much failure to handle.

Now waiting for scrub to find me more test subjects - broken files ;)

Thanks, //Jim
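The byte-by-byte P check described above is essentially this (file names are hypothetical 4KB sector dumps produced with dd; for a full-width raidz row the P column is the plain XOR of the data columns, so any byte offset where the XOR of all columns is non-zero is suspect):

def xor_mismatches(p_path, data_paths, sector=4096):
    """Return the byte offsets within one 4KB row where P does not equal
    the XOR of the data columns (assumes a full-width row)."""
    with open(p_path, "rb") as f:
        p = f.read(sector)
    cols = []
    for path in data_paths:
        with open(path, "rb") as f:
            cols.append(f.read(sector))
    bad = []
    for i in range(sector):
        x = p[i]
        for c in cols:
            x ^= c[i]
        if x:
            bad.append(i)
    return bad

# e.g. xor_mismatches("p1.img", ["d0.img", "d2.img", "d3.img", "d4.img"])
# would report the 0x9c0..0xfff corruption described above as one long run.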
From: Jim Klimov
Date: 2012-Dec-09 00:05 UTC
Subject: [zfs-discuss] Userdata physical allocation in rows vs. columns WAS Digging in the bowels of ZFS
For those who have work to do and can't be bothered to read the detailed context, please do scroll down to the marked APPLIED QUESTION about a possible project to implement a better on-disk layout of blocks. The busy experts' opinions are highly regarded here. Thanks ;) //Jim

CONTEXT AND SPECULATION

Well, now that I've mostly completed building my tool to locate, extract from disk and verify the sectors related to any particular block, I can state with certainty: data sector numbering is columnar, as depicted in my recent mails (quote below), not row-based as I had believed earlier - which would be more compact to store.

Columns do make certain sense, but they also lead to more wasted space than would otherwise be possible - and I'm not sure the allocation in rows would really be slower to write or read, especially since HDD caching would coalesce requests to neighboring sectors, be they a contiguous quarter of my block's physical data or a series of every fourth sector from it. This would likely be more complex to code and comprehend, and it might even require more CPU cycles to account sizes properly (IF today we just quickly allocate columns of the same size - I skimmed over vdev_raidz.c, but did not look into this detail).

Saving 1-2 sectors from allocations which are some 10-30 sectors long altogether is IMHO a worthy percentage of savings to worry and bother about, especially with the compression-related paradigm of "our CPUs are slackers with nothing to do". ZFS overhead on 4KB-sectored disks is pretty "expensive" already, so I see little need to feed it extra desserts too ;)

APPLIED QUESTION:

If one were to implement a different sector allocator (rows with a more precise cutoff vs. columns as they are today) and expose it as a zfs property that can be set by users (or testing developers), would it make sense to call it a "compression" mode (in current terms) and use a bit from that field? Or would the GRID bits be more properly used for this? I am not sure feature flags are a proper mechanism for this, except to protect from import and interpretation of such "fixed" datasets and pools on incompatible (older) implementations - the allocation layout is likely going to be an attribute applied to each block at write time and noted in blkptr_t like the checksum and compression, but it would only apply to raidzN. AFAIK, the contents of userdata sectors and their ordering don't even matter to the ZFS layers until decompression - parities and checksums just apply to the prepared bulk data...

//Jim Klimov

On 2012-12-06 02:08, Jim Klimov wrote:
> On 2012-12-05 05:52, Jim Klimov wrote:
>> For undersized allocations, i.e. of compressed data, it is possible
>> to see P-sizes not divisible by 4 (data disks) in 4KB sectors; however,
>> some sectors do apparently get wasted, because the A-size in the DVA
>> is divisible by 6*4KB. With columnar allocation across disks, it is
>> easier to see why full stripes have to be used:
>>
>>  p1 p2 d1 d2 d3 d4
>>   .  ,  1  5  9 13
>>   .  ,  2  6 10 14
>>   .  ,  3  7 11  x
>>   .  ,  4  8 12  x
>>
>> In this illustration a 14-sector-long block is saved, with X being
>> the empty leftovers, on which we can't really save anything (as would
>> be the case with the other allocation scheme, which is likely less
>> efficient for CPU and IOs).
>
> Getting more and more puzzled with this... I have seen DVA values
> matching both theories now...
>
> Interestingly, all the allocations I looked over involved a number
> of sectors divisible by 3...
> rounding to half of my 6-disk RAID set - is it merely a coincidence,
> or some means of balancing IOs?

...

> I did not yet research where exactly the "unused" sectors are
> allocated - "vertically" on the last strip, like in my yesterday's
> depiction quoted above, or "horizontally" across several disks -
> but now that I know about this, it really bothers me as wasted
> space with no apparent gain. I mean, the raidz code does tricks
> to ensure that parities are located on different disks, and in
> normal conditions the userdata sector reads land on all disks
> in a uniform manner. Why forfeit the natural "rotation" that
> P-sizes smaller than a multiple of the number of data disks
> would provide?

...

> In short: can someone explain the rationale - why are allocations
> such as they are now, and can this be discussed as a bug, or should
> it be rationalized as a feature?
more below...

On 2012-12-06 03:06, Jim Klimov wrote:
> It also happens that on disks 1,2,3 the first data row's sectors
> (d0, d2, d3) are botched - the ranges from 0x9C0 to 0xFFF (the end
> of the 4KB sector) are zeroes.
>
> The neighboring blocks, located a few sectors away from this one, also
> have compressed data and show some regular-looking patterns of bytes,
> certainly no long stretches of zeroes.
>
> However, the byte-by-byte XOR matching complains about the whole sector:
> all bytes, except some 40 single-byte locations here and there, don't
> XOR up to produce the expected (known from disk) value.
>
> I did not yet try the second parity algorithm.
>
> At least in this case, it does not seem that I would find an incantation
> needed to recover this block - too many zeroes overlap (at least 3
> disks' data is proven compromised), where I had hoped for some
> shortcoming in ZFS recombination exhaustiveness. In this case there is
> indeed too much failure to handle.
>
> Now waiting for scrub to find me more test subjects - broken files ;)

So, these findings from my first tested bad file remain valid. Now that I have a couple more error locations, found again by scrub (which over the past week has progressed to just above 50% of the pool), there are some more results.

So far only one location has random-looking different data in the sectors of the block on different disks, which I might at least try to salvage as described in the beginning of this thread.

In two of the three cases, some of the sectors (in the range which mismatches the parity data) are clearly invalid - filled with long stretches of zeroes while the other sectors hold uniform-looking binary data (results of compression). Moreover, several of these sectors (4096 bytes long, at the same offsets on different drives which are data components of the same block) are literally identical, which is apparently some error upon write (perhaps some noise was interpreted by several disks at once as a command to write at that location).

The corrupted area looks like a series of "0xFC 0x42" bytes about half a kilobyte long, followed by zero bytes to the end of the sector. The start of this area is not aligned to a multiple of 512 bytes.

These disks being of an identical model and firmware, I am ready to believe that they might misinterpret the same interference in the same way. However, I was under the impression that SATA involves CRCs on commands and data in the protocol - to counter such noise?..

Question: does such a conclusion sound like a potentially possible explanation for my data corruptions (on disks which passed dozens of scrubs successfully before developing these problems nearly at once in about ten locations)?

Thanks for your attention, //Jim Klimov
On Sun, Dec 9, 2012 at 1:27 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> In two of the three cases, some of the sectors (in the range which
> mismatches the parity data) are clearly invalid - filled with long
> stretches of zeroes while the other sectors hold uniform-looking
> binary data (results of compression). Moreover, several of these
> sectors (4096 bytes long, at the same offsets on different drives
> which are data components of the same block) are literally identical,
> which is apparently some error upon write (perhaps some noise was
> interpreted by several disks at once as a command to write at that
> location).
>
> The corrupted area looks like a series of "0xFC 0x42" bytes about
> half a kilobyte long, followed by zero bytes to the end of the sector.
> The start of this area is not aligned to a multiple of 512 bytes.

Just a guess, but that might be how the sectors were when the drive came from the manufacturer, rather than filled with zeros (a test pattern while checking for bad sectors). As for why some other sectors did show zeros in your other results, perhaps those sectors got reallocated from the reserved sectors after whatever caused your problems, which may not have been written to during the manufacturer test.

> These disks being of an identical model and firmware, I am ready
> to believe that they might misinterpret the same interference in the
> same way. However, I was under the impression that SATA involves
> CRCs on commands and data in the protocol - to counter such noise?..

There is 8b/10b encoding, though I am not entirely sure how well this counters noise (it is intended to DC-balance the signal and reduce run length). I don't know if there is further integrity checking, though one would hope that noise wouldn't decode as a sensible ATA command.

> Question: does such a conclusion sound like a potentially possible
> explanation for my data corruptions (on disks which passed dozens
> of scrubs successfully before developing these problems nearly at
> once in about ten locations)?

Potentially? I suppose, but I would bet on something else (controller went haywire?).

Tim
On 2012-12-10 07:35, Timothy Coalson wrote:
>> The corrupted area looks like a series of "0xFC 0x42" bytes about
>> half a kilobyte long, followed by zero bytes to the end of the sector.
>> The start of this area is not aligned to a multiple of 512 bytes.
>
> Just a guess, but that might be how the sectors were when the drive came
> from the manufacturer, rather than filled with zeros (a test pattern
> while checking for bad sectors). As for why some other sectors did show
> zeros in your other results, perhaps those sectors got reallocated from
> the reserved sectors after whatever caused your problems, which may not
> have been written to during the manufacturer test.

Thanks for the idea. I also figured it might be some test pattern or maybe some sort of "secure wipe", and an HDD's relocation to spare sectors might be a reasonable scenario for such an error creeping into an LBA which previously held valid data - i.e. the disk tried to salvage as much of a newly corrupted sector as it could...

I dismissed it because several HDDs had the error at the same offsets, and some of them had the same contents in the corrupted sectors; however identical the disks might be, this is just too much of a coincidence for disk-internal hardware relocation to be The Reason.

A controller going haywire - that is possible, given that this box was off until it was recently repaired (broken cooling), and the controller is the nearest "centralized" SPOF location common to all disks (with the overheated CPU, non-ECC RAM and the software further along the road). I am not sure which one of these *couldn't* issue (or be interpreted to issue) a number of weird identical writes to different disks at the same offsets. Everyone is a suspect :(

Thanks, //Jim Klimov
On 2012-12-02 05:42, Jim Klimov wrote:
> My plan is to dig out the needed sectors of the broken block from
> each of the 6 disks and try any and all reasonable recombinations
> of redundancy and data sectors to try and match the checksum - this
> should give me a definite answer on whether ZFS (of that oi151.1.3-based
> build) does all I think it can to save data or not. Either I put the
> last nail into my itching question's coffin, or I'd nail a bug to
> yell about ;)

Well, I've come to a number of conclusions, though I did not yet close this matter for myself.

One regards the definition of "all reasonable recombinations" - ZFS does not do "*everything* possible" to recover corrupt data, and in fact it can't; nobody can. When I took this to an extreme, assuming that the bytes at different offsets within a sector might fail on different disks that comprise a block, the attempt to reconstruct and test a single failed sector byte by byte becomes computationally infeasible - for 4 data disks and 1 parity I got about 4^4096 combinations to test. The next Big Bang will happen sooner than I'd get a "yes or no", or so they say (yes, I did a rough estimate - well over 10^100 seconds even if I used all the computing horsepower on Earth today).

If there are R known-broken rows of data (be it bits, bytes, sectors, whole columns, or whatever quantum of data we take) on D data disks and P parity disks (all readable without HW IO errors), where "known brokenness" means both a parity mismatch in this row and a checksum mismatch for the whole userdata block, we do not know in advance how many errors there are in a row (we only hope there are not more than there are parity columns), nor where exactly the problem is. Thanks to the checksum mismatch we do know that at least one error is in the data disks' on-disk data.

We might hope to find a correct "original data" which matches the checksum by determining, for each data disk, the possible alternate byte values (computed from the bytes at the same offsets on the other data and parity disks), and checksumming the recombined userdata blocks with some of the on-disk bytes replaced by these calculated values. For each row we test 1..P alternate column values, and we must apply the alteration to all of the rows where known errors exist, in order to detect neighboring but non-overlapping errors in different components of the block's allocation. (This was the breakage scenario that was deemed possible for raidzN, with disk heads hovering over similar locations all the time.)

This can yield a very large field of combinations with a small row height (i.e. matching 1 byte per disk), or too few combinations with a row height chosen too big (i.e. the whole portion of one disk's part of the userdata - a quarter in the case of my 4-data-disk set).

For single-break-per-row tests based on hypotheses from P parities, D data disks and R broken rows, we need to checksum P*(D^R) userdata recombinations in order to determine that we can't recover the block. To catch the less probable case of several errors per row (up to the number of parities we have), we need to retry even more combinations afterwards.

My 5-year-old Pentium D tested 1000 sha256 checksums over 128KB blocks in about 2-3 seconds, so it is reasonable to keep the reconstruction loops - and thus the smallness of a step and thus the number of steps - within some arbitrarily chosen timeout (30 sec? 1 sec?). With a fixed number of parity and data disks in a particular TLVDEV, we can then determine the "reasonable" row heights.
Also, this low-level recovery at a higher number of cycles might be a job for a separate tool - i.e. "online" recovery during ZFS IO and scrubs might be limited to a few sectors, and whatever is not fixed by that could be manually fed to a programmatic number-cruncher and possibly get recovered overnight...

I now know that it is cheap and fast to determine parity mismatches for each single-byte column offset in a userdata block (leading to D*R userdata bytes whose contents we are not certain of), so even if the quantum of data for reconstructions is a sector, it is quite reasonable to start with byte-by-byte mismatch detection. The locations of the detected errors can help us determine whether the errors are colocated in a single row of sectors (so likely one or more sectors at the same offset on different disks got broken), or spread over several sectors (we might be lucky and have single errors per disk in neighboring sector numbers).

It is, after all, not reasonable to go below 512b, or even the larger HW sector size, as the quantum of data for recovery attempts. But testing *only* whole columns (*if* this is what is done today) also forfeits some chances of automated recovery - though, certainly, the recovery attempts should start with the most probable combinations, such as all errors being confined to a single disk, and then go down in step size and test possible errors on several component disks. We can afford several thousand checksum tests, which might give a chance to recover data that is not recoverable today, *if* today's tests are not so exhaustive...

To conclude, I still do not know (I did not read that deep into the code) how ZFS does its raidz recovery attempts today - is the "row height" a whole single disk's portion (i.e. 32KB from a 128KB block over 4 data disks), or some physical or logical sector size (4KB, 512b)?..

Even per-sector reconstructions of a 128KB block over 4 data disks with 512b sectors yield - with a single alternate variant per sector derived from parity - if I am not mistaken, an impressive and infeasible 4^64 combinations to test with checksums by pure brute force. Heck, just counting from 1 to 2^64 in an "i++" loop takes a lot of CPU time :)

And so far my problems have occurred in compressed blocks which have a physical allocation of about a single sector in height, and at this "resolution" it was not possible to find one broken sector and fix the userdata. I'm still waiting for the scrub to complete, so that I can get some corrupted files with parity errors in different rows of HW-sector height.

Thanks for listening, I'm out :)
//Jim Klimov
On 2012-12-11 16:44, Jim Klimov wrote:
> For single-break-per-row tests based on hypotheses from P parities,
> D data disks and R broken rows, we need to checksum P*(D^R) userdata
> recombinations in order to determine that we can't recover the block.

A small maths correction: the formula above assumes that we change one item from its on-disk value to a reconstructed hypothesis on some one data disk (column) in every broken row, or on up to P disks if we try to recover from more than one failed item in a row. Reality is worse :)

Our original info (parity errors and a checksum mismatch) only warranted that we have at least one error in the userdata. It is possible that the other (R-1) errors are on the parity disks, so the recombination should also check all variants where 0..(R-1) of the rows are left with their on-disk contents intact. This gives us something like P*(D + D^2 + ... + D^R) variants to test, which for D=4 is roughly a one-third increase in recombinations - still within the range of computationally feasible amounts of error-matching.

> Heck, just counting from 1 to 2^64 in an "i++" loop takes a lot
> of CPU time

By my estimate, even that would take until the next Big Bang, at least on my one computer ;)

Just for fun: a count to 2^32 took 42 seconds, so my computer can do about 10^8 trivial loops per second - but that's just a data point. What really matters is that 4^64 == 2^128 == 2^32 * 2^96, which is a lot. With 2^96 being roughly 10^29, the plain count from 1 to 4^64 would take about 4*10^30 seconds, or roughly 10^23 years. If the astronomers' estimates are correct, this amounts to some 10^13 lifetimes of our universe ;)

//Jim
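To put the corrected count next to the ~400 sha256 checks per second measured on the Pentium D earlier in the thread, a throwaway sketch (the rate and the cutoff commentary are just the numbers quoted above; nothing here reflects what ZFS itself does):

def variants(P, D, R):
    """Corrected estimate: P * (D + D^2 + ... + D^R) recombinations."""
    return P * sum(D**r for r in range(1, R + 1))

RATE = 400.0   # sha256 checks of a 128KB block per second (Pentium D figure)
for R in (1, 2, 4, 6, 8, 16, 64):
    n = variants(2, 4, R)
    print(f"R={R:3d}  variants={n:.3e}  time={n / RATE:.3e} s")

# Up to about R=6 broken rows the search stays within half a minute at this
# rate; R=64 (one row per 512b sector of a 128KB block on 4 data disks) is
# hopeless for brute force, as concluded above.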