My problematic home NAS (which old list readers might still remember from a year or two ago) is back online, thanks to a friend who fixed and turned it on. I'm going to try some more research on that failure I had with the 6-disk raidz2 set, when it suddenly couldn't read and recover some blocks (presumably scratched or otherwise damaged on all disks at once while the heads hovered over similar locations).

My plan is to dig out the needed sectors of the broken block from each of the 6 disks and try any and all reasonable recombinations of redundancy and data sectors to try and match the checksum - this should give me a definite answer on whether ZFS (of that oi151.1.3-based build) does all I think it can to save data or not. Either I put the last nail into my itching question's coffin, or I'd nail a bug to yell about ;)

So... here are some applied questions:

1) With dd I found the logical offset after which it fails to read data from a damaged file, because ZFS doesn't trust the block. That's 3840 sectors (@512b), or 0x1e0000. With zdb I listed the file inode's block tree and got this, in particular:

# zdb -ddddd -bbbbbb -e 1601233584937321596/export/DUMP 27352564 \
  > /var/tmp/brokenfile.ZDBreport 2>&1
...
1c0000 L0 DVA[0]=<0:acbc2a46000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous dedup single size=20000L/4a00P (txg, cksum)
1e0000 L0 DVA[0]=<0:acbc2a4f000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous dedup single size=20000L/4c00P birth=324721364L/324721364P fill=1 cksum=705e79361b8f028e:5a45c8f863a4035f:41b2961480304d7:be685ec248f00e78
200000 L0 DVA[0]=<0:acbc2a58000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous dedup single size=20000L/4c00P (txg, cksum)
...

So... how DO I properly interpret this to select the sector ranges to dd into my test area from each of the 6 disks in the raidz2 set?

On one hand, the DVA states that the block length is 0x9000, and this matches the offsets of the neighboring blocks. On the other hand, the compressed "physical" data size is 0x4c00 for this block, and ranges from 0x4800 to 0x5000 for the other blocks of the file. Even multiplied by 1.5 (for raidz2) this is about 0x7000 and way smaller than 0x9000. For uncompressed files I think I saw entries like "size=20000L/30000P", so I'm not sure even my multiplication by 1.5x above is valid, and the discrepancy between the DVA size (and block interval) and the "physical" allocation size reaches about 2x.

So... again... how many sectors from each disk should I fetch for my research of this one block?

2) Do I understand correctly that for the offset definition, sectors in a top-level VDEV (which is all of my pool) are numbered in rows across the component disks? Like this:

    0  1  2  3  4  5
    6  7  8  9 10 11 ...

That is, "offset % setsize = disknum"?

If true, does such a numbering scheme apply all over the TLVDEV, so that for my block on a 6-disk raidz2 set its sectors start at (roughly rounded) "offset_from_DVA / 6" on each disk, right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of the first disk holding anything from this block would contain the raid-algo1 permutations of the four data sectors, the sectors of the second disk would contain the raid-algo2 for those 4 sectors, and the remaining 4 disks would contain the data sectors? The redundancy algos should in fact cover the other redundancy disks too (in order to sustain the loss of any 2 disks), correct?
Is it, in particular, true that the redundancy-protected stripes involve a single sector from each disk, repeated for the length of the block's portion on each disk (and not that some, say, 32KB from one disk are wholly the redundancy for 4*32KB of data from the other disks)? I think this is what I hope to catch - if certain non-overlapping sectors got broken on each disk, but ZFS compares larger ranges to recover data, then those two approaches work on very different data.

4) Where are the redundancy algorithms specified? Is there any simple tool that would recombine a given algo-N redundancy sector with some other 4 sectors from a 6-sector stripe in order to try and recalculate the sixth sector's contents? (Perhaps part of some unit tests?)

5) Is there any magic to the checksum algorithms? I.e. if I pass some 128KB block's logical (userdata) contents to the command-line "sha256" or "openssl sha256" - should I get the same checksum as ZFS provides and uses?

6) What exactly does a checksum apply to - the 128KB userdata block or a 15-20KB (lzjb-)compressed portion of data? I am sure it's the latter, but I ask just in case I'm not missing anything... :)

As I said, in the end I hope to have from-disk and guessed userdata sectors - a gazillion or so for given logical offsets inside a 128KB userdata block - which I would then recombine and hash with sha256 to see if any combination yields the value saved in the block pointer and ZFS missed something, or if I don't get any such combo and ZFS indeed does what it should, exhaustively and correctly ;)

Thanks a lot in advance for any info, ideas, insights, and just for reading this long post to the end ;)

//Jim Klimov
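A minimal sketch of the recombination loop described above, assuming the per-disk sector dumps have already been extracted with dd and that the checksum covers exactly the P-size bytes of the compressed payload (the question 6/9 territory discussed further down the thread). The "variants" lists, the column reassembly order and the function name are placeholders - they would have to come from the actual dumps and parity reconstructions:

import hashlib
from itertools import product

PSIZE = 0x4c00                     # physical (compressed) size from zdb
EXPECTED = bytes.fromhex(          # cksum= field from zdb; note that zdb drops
    "705e79361b8f028e"             # leading zeros, so each 64-bit word is padded
    "5a45c8f863a4035f"             # back to 16 hex digits before concatenation
    "041b2961480304d7"
    "be685ec248f00e78")

def find_matching_combo(variants):
    """variants[i] is a list of candidate byte-strings for data column i
    (e.g. the on-disk bytes plus one or more parity-reconstructed guesses)."""
    for combo in product(*variants):
        payload = b"".join(combo)[:PSIZE]   # drop the padding tail of the last column
        if hashlib.sha256(payload).digest() == EXPECTED:
            return combo
    return None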
On 2012-12-02 05:42, Jim Klimov wrote:
> So... here are some applied questions:

Well, I am ready to reply to a few of my own questions now :)

I've staged an experiment by taking a 128KB block from that file and appending it to a new file in a test dataset, where I changed the compression settings between the appendages. Thus I've got a zdb dump of three blocks with identical logical userdata and different physical data.

# zdb -ddddd -bbbbbb -e 1601233584937321596/test3 8 > /pool/test3/a.zdb
...
Indirect blocks:
     0 L1 DVA[0]=<0:59492a98000:3000> DVA[1]=<0:83e2f65000:3000> [L1 ZFS plain file] sha256 lzjb LE contiguous unique double size=4000L/400P birth=326381727L/326381727P fill=3 cksum=2ebbfb189e7ce003:166a23fd39d583ed:f527884977645395:896a967526ea9cea
     0 L0 DVA[0]=<0:590002c1000:30000> [L0 ZFS plain file] sha256 uncompressed LE contiguous unique single size=20000L/20000P birth=326381721L/326381721P fill=1 cksum=3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9
 20000 L0 DVA[0]=<0:594928b8000:9000> [L0 ZFS plain file] sha256 lzjb LE contiguous unique single size=20000L/4800P birth=326381724L/326381724P fill=1 cksum=57164faa0c1cbef4:23348aa9722f47d3:3b1b480dc731610b:7f62fce0cc18876f
 40000 L0 DVA[0]=<0:59492a92000:6000> [L0 ZFS plain file] sha256 gzip-9 LE contiguous unique single size=20000L/2800P birth=326381727L/326381727P fill=1 cksum=d68246ee846944c6:70e28f6c52e0c6ba:ea8f94fc93f8dbfd:c22ad491c1e78530

     segment [0000000000000000, 0000000000080000) size 512K

> 1) So... how DO I properly interpret this to select the sector ranges to
>    dd into my test area from each of the 6 disks in the raidz2 set?
>
>    On one hand, the DVA states that the block length is 0x9000, and this
>    matches the offsets of the neighboring blocks.
>
>    On the other hand, the compressed "physical" data size is 0x4c00 for
>    this block, and ranges from 0x4800 to 0x5000 for the other blocks of
>    the file. Even multiplied by 1.5 (for raidz2) this is about 0x7000 and
>    way smaller than 0x9000. For uncompressed files I think I saw entries
>    like "size=20000L/30000P", so I'm not sure even my multiplication
>    by 1.5x above is valid, and the discrepancy between the DVA size (and
>    block interval) and the "physical" allocation size reaches about 2x.

Apparently, my memory failed me. The values in the "size" field describe the userdata (compressed, non-redundant). Also, I forgot to consider that this pool uses 4KB sectors (ashift=12).

So my userdata, which takes up about 0x4800 bytes, requires 4.5 (rather, 5 whole) sectors, and this warrants 4 sectors of raidz2 redundancy on a 6-disk set - 2 sectors for the first 4 data sectors, and 2 sectors for the remaining half-sector's worth of data. This does sum up to 9*0x1000 bytes in whole-sector counting (as in offsets).

However, the gzip-compressed block above, which has only 0x2800 bytes of userdata and requires 3 data sectors plus 2 redundancy sectors, still has a DVA size of six 4KB sectors (0x6000). This is strange to me - I'd expect 5 sectors for this block altogether... does anyone have an explanation? Also, what should the extra userdata sector contain physically - zeroes?

> 5) Is there any magic to the checksum algorithms? I.e. if I pass some
>    128KB block's logical (userdata) contents to the command-line "sha256"
>    or "openssl sha256" - should I get the same checksum as ZFS provides
>    and uses?
The original 128KB file's sha256 checksum matches the uncompressed block's ZFS checksum, so in my further tests I can use the command-line tools to verify the recombined results:

# sha256sum /tmp/b128
3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9  /tmp/b128

No magic, as long as there are usable command-line implementations of the needed algos (sha256sum is there, fletcher[24] are not).

> 6) What exactly does a checksum apply to - the 128KB userdata block or a
>    15-20KB (lzjb-)compressed portion of data? I am sure it's the latter,
>    but I ask just in case I'm not missing anything... :)

The ZFS parent block checksum applies to the on-disk variant of the userdata payload (compression included, redundancy excluded).

NEW QUESTIONS:

7) Is there a command-line tool to do lzjb compressions and decompressions (in the same blocky manner as would be applicable to ZFS compression)?

I've also tried to gzip-compress the original 128KB file, but none of the compressed results (with varying gzip level) yielded a checksum that would match the ZFS block's one. Zero-padding to 10240 bytes (psize=0x2800) did not help.

8) When should the decompression stop - as soon as it has extracted the logical-size number of bytes (i.e. 0x20000)?

9) Physical sizes magically are in whole 512b units, so it seems... I doubt that the compressed data would always end at such a boundary. How many bytes should be covered by a checksum? Are the 512b blocks involved zero-padded at the ends (on disk and/or in RAM)?

Some OLD questions remain raised, just in case anyone answers them.

> 2) Do I understand correctly that for the offset definition, sectors
>    in a top-level VDEV (which is all of my pool) are numbered in rows
>    across the component disks? Like this:
>     0  1  2  3  4  5
>     6  7  8  9 10 11 ...
>
>    That is, "offset % setsize = disknum"?
>
>    If true, does such a numbering scheme apply all over the TLVDEV,
>    so that for my block on a 6-disk raidz2 set its sectors start at
>    (roughly rounded) "offset_from_DVA / 6" on each disk, right?
>
> 3) Then, if I read the ZFS on-disk spec correctly, the sectors of
>    the first disk holding anything from this block would contain the
>    raid-algo1 permutations of the four data sectors, the sectors of
>    the second disk would contain the raid-algo2 for those 4 sectors,
>    and the remaining 4 disks would contain the data sectors?
>    The redundancy algos should in fact cover the other redundancy
>    disks too (in order to sustain the loss of any 2 disks), correct? (...)
>
> 4) Where are the redundancy algorithms specified? Is there any simple
>    tool that would recombine a given algo-N redundancy sector with
>    some other 4 sectors from a 6-sector stripe in order to try and
>    recalculate the sixth sector's contents? (Perhaps part of some
>    unit tests?)

I'm almost ready to go and test Q2 and Q3; however, the questions which regard usable tools (and "what data should be fed into such tools?") are still on the table.

> Thanks a lot in advance for any info, ideas, insights,
> and just for reading this long post to the end ;)
> //Jim Klimov

Ditto =) //Jim
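For reference, the mapping between the zdb "cksum=" field and the sha256sum output is just concatenation of the four 64-bit words; the only wrinkle is that zdb prints them without leading zeros. A tiny sketch (the helper name is made up):

def zdb_cksum_to_hex(cksum_field):
    """Turn zdb's colon-separated cksum words into a sha256-style hex string,
    zero-padding each 64-bit word back to 16 hex digits."""
    return "".join(word.zfill(16) for word in cksum_field.split(":"))

# Matches the uncompressed block in the zdb dump above:
assert zdb_cksum_to_hex(
    "3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9"
) == "3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9"

For the lzjb and gzip blocks the comparison target would presumably be the sha256 of the compressed (P-size) payload rather than of the logical 128KB, per the answer to question 6 above.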
On 2012-12-03 18:23, Jim Klimov wrote:
> On 2012-12-02 05:42, Jim Klimov wrote:
>> So... here are some applied questions:
>
> Well, I am ready to reply to a few of my own questions now :)

Continuing the desecration of my deceased files' resting grounds...

>> 2) Do I understand correctly that for the offset definition, sectors
>>    in a top-level VDEV (which is all of my pool) are numbered in rows
>>    across the component disks? Like this:
>>     0  1  2  3  4  5
>>     6  7  8  9 10 11 ...
>>
>>    That is, "offset % setsize = disknum"?
>>
>>    If true, does such a numbering scheme apply all over the TLVDEV,
>>    so that for my block on a 6-disk raidz2 set its sectors start at
>>    (roughly rounded) "offset_from_DVA / 6" on each disk, right?
>>
>> 3) Then, if I read the ZFS on-disk spec correctly, the sectors of
>>    the first disk holding anything from this block would contain the
>>    raid-algo1 permutations of the four data sectors, the sectors of
>>    the second disk would contain the raid-algo2 for those 4 sectors,
>>    and the remaining 4 disks would contain the data sectors?

My understanding was correct. For posterity:

In the earlier set-up example I had an uncompressed 128KB block residing at the address DVA[0]=<0:590002c1000:30000>. Counting in my disks' 4KB sectors, this is 0x590002c1000/0x1000 = 0x590002C1, or 1493172929, as the logical sector offset into TLVDEV number 0 (the only one in this pool). Given that this TLVDEV is a 6-disk raidz2 set, my expected offset on each component drive is 1493172929/6 = 248862154.83 (.83 = 5/6), counting from after the ZFS header (2 labels and a reservation, amounting to 4MB = 1024*4KB sectors). So this block's allocation covers 8 4KB sectors starting at 248862154+1024 on disk 5 and at 248862155+1024 on disks 0,1,2,3,4.

As my further tests showed, the sector columns (not rows, as I had expected after doc-reading) from disks 1,2,3,4 do recombine into the original userdata (the sha256 checksum matches), so disks 5 and 0 should hold the two parities - however those are calculated:

# for D in 1 2 3 4; do dd bs=4096 count=8 conv=noerror,sync \
    if=/dev/dsk/c7t${D}d0s0 of=b1d${D}.img skip=248863179; done

# for D in 1 2 3 4; do for R in 0 1 2 3 4 5 6 7; do \
    dd if=/pool/test3/b1d${D}.img bs=4096 skip=$R count=1; \
    done; done > /tmp/d

Note that the latter can be greatly simplified as "cat", which works to the same effect and is faster:

# cat /pool/test3/b1d?.img > /tmp/d

However, I left the "difficult" notation to use in experiments later on.

That is, the original 128KB block was cut into 4 pieces (my 4 data drives in the 6-disk raidz2 set), and each 32KB strip was stored on a separate drive. Nice descriptive pictures in some presentations had suggested to me that the original block is stored sector by sector, rotating onto the next disk - the set of 4 data sectors with 2 parity sectors in my case being a single stripe for RAID purposes. This directly suggested that incomplete such "stripes", such as the ends of files or whole small files, would still have the two parity sectors and a handful of data sectors. Reality differs.

For undersized allocations, i.e. of compressed data, it is possible to see P-sizes not divisible by 4 (data disks) in 4KB sectors; however, some sectors do apparently get wasted, because the A-size in the DVA is divisible by 6*4KB. With columnar allocation across disks, it is easier to see why full stripes have to be used:

 p1 p2 d1 d2 d3 d4
  .  ,  1  5  9 13
  .  ,  2  6 10 14
  .  ,  3  7 11  x
  .  ,  4  8 12  x
In this illustration a 14-sector-long block is saved, with X being the empty leftovers, on which we can't really save anything (as would be the case with the other allocation scheme, which is likely less efficient for CPU and IOs).

The metadata blocks do have A-sizes of 0x3000 (2 parity + 1 data), at least, which on 4KB-sectored disks is also quite a lot for these miniature data objects - but not as sad as 6*4KB would have been ;)

It also seems that the instinctive desire to have raidzN sets of 4*M+N disks (e.g. 6-disk raidz2, 11-disk raidz3, etc.), which was discussed over and over on the list a couple of years ago, may still be valid with typical block sizes being powers of two... even though gurus said that this should not matter much. For IOPS - maybe not. For wasted space - likely...

> I'm almost ready to go and test Q2 and Q3; however, the questions
> which regard usable tools (and "what data should be fed into such
> tools?") are still on the table.

> Some OLD questions remain raised, just in case anyone answers them.
>> 3b) The redundancy algos should in fact cover the other redundancy
>>     disks too (in order to sustain the loss of any 2 disks), correct? (...)
>>
>> 4) Where are the redundancy algorithms specified? Is there any simple
>>    tool that would recombine a given algo-N redundancy sector with
>>    some other 4 sectors from a 6-sector stripe in order to try and
>>    recalculate the sixth sector's contents? (Perhaps part of some
>>    unit tests?)

> 7) Is there a command-line tool to do lzjb compressions and
>    decompressions (in the same blocky manner as would be applicable
>    to ZFS compression)?
>
>    I've also tried to gzip-compress the original 128KB file, but
>    none of the compressed results (with varying gzip level) yielded
>    a checksum that would match the ZFS block's one.
>    Zero-padding to 10240 bytes (psize=0x2800) did not help.
>
> 8) When should the decompression stop - as soon as it has extracted
>    the logical-size number of bytes (i.e. 0x20000)?
>
> 9) Physical sizes magically are in whole 512b units, so it seems...
>    I doubt that the compressed data would always end at such a boundary.
>
>    How many bytes should be covered by a checksum?
>    Are the 512b blocks involved zero-padded at the ends (on disk and/or in RAM)?
>
>> Thanks a lot in advance for any info, ideas, insights,
>> and just for reading this long post to the end ;)
>> //Jim Klimov

Ditto =) //Jim
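A small sketch of the offset arithmetic above, turning a raidz DVA into per-child-disk dd parameters. It assumes ashift=12, a 6-wide raidz2 top-level vdev and the 4MB of labels plus boot reservation in front of the allocatable area; which child index corresponds to which c7t*d0 device is a separate question of vdev child ordering, so treat the mapping as illustrative:

ASHIFT = 12
NDISKS = 6
LABEL_SKIP = (4 * 1024 * 1024) >> ASHIFT      # 1024 sectors of 4KB each

def dd_plan(dva_offset, asize):
    """Return {child_disk: (skip_sectors, count_sectors)} for a raidz DVA,
    mirroring the by-hand calculation in the message above."""
    first = dva_offset >> ASHIFT              # sector index within the TLVDEV
    nsect = asize >> ASHIFT                   # 0x30000 -> 48 sectors in total
    plan = {}
    for k in range(nsect):
        disk = (first + k) % NDISKS
        offset = (first + k) // NDISKS + LABEL_SKIP
        skip, count = plan.get(disk, (offset, 0))
        plan[disk] = (skip, count + 1)
    return plan

# dd_plan(0x590002c1000, 0x30000) yields disk 5 -> (248863178, 8) and
# disks 0..4 -> (248863179, 8), matching the dd commands quoted above.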
On Tue, Dec 4, 2012 at 10:52 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> On 2012-12-03 18:23, Jim Klimov wrote:
>> On 2012-12-02 05:42, Jim Klimov wrote:
>>> 4) Where are the redundancy algorithms specified? Is there any simple
>>>    tool that would recombine a given algo-N redundancy sector with
>>>    some other 4 sectors from a 6-sector stripe in order to try and
>>>    recalculate the sixth sector's contents? (Perhaps part of some
>>>    unit tests?)

I'm a bit late to the party, but from a previous list thread about redundancy algorithms, I had found this:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

Particularly the functions "vdev_raidz_reconstruct_p", "vdev_raidz_reconstruct_q", "vdev_raidz_reconstruct_pq" (and possibly "vdev_raidz_reconstruct_general") seem like what you are looking for.

As I understand it, the case where you have both redundancy blocks, but are missing two data blocks, is the hardest (if you are missing only one data block, you can either do a straight xor with the first redundancy section, or some LFSR shifting, xor, and then reverse LFSR shifting to use the second redundancy section).

Wikipedia describes the math to restore from two missing data sections here, under "computing parity": http://en.wikipedia.org/wiki/Raid6#RAID_6

I don't know any tools to do this for you from arbitrary input, sorry.

Tim
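For experimenting outside the kernel, the underlying math is small enough to sketch directly. Below is a toy version of the two single-failure paths Tim mentions - plain XOR against P, and generator multiplication in GF(2^8) against Q. The field polynomial 0x11d and generator 2 are the usual RAID-6 choices, but the coefficient ordering of Q here is my own, so this illustrates the algebra rather than reproducing byte-exact on-disk ZFS parity:

def gf_mul(a, b):
    """Multiply in GF(2^8) with the RAID-6 polynomial x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gen_pq(cols):
    """P is the XOR of the data columns, Q is the weighted GF(2^8) sum."""
    p = q = 0
    for i, d in enumerate(cols):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def rebuild_from_p(p, cols, missing):
    """One missing data column: XOR P with all surviving columns."""
    x = p
    for i, d in enumerate(cols):
        if i != missing:
            x ^= d
    return x

def rebuild_from_q(q, cols, missing):
    """One missing data column via Q: strip the surviving terms, then divide
    by the missing column's coefficient (inverse = a^254 in GF(2^8))."""
    x = q
    for i, d in enumerate(cols):
        if i != missing:
            x ^= gf_mul(gf_pow(2, i), d)
    return gf_mul(x, gf_pow(gf_pow(2, missing), 254))

# Self-check on one byte per column (real code would loop over whole sectors):
cols = [0x12, 0xab, 0x05, 0xe7]
p, q = gen_pq(cols)
assert rebuild_from_p(p, cols, 2) == cols[2]
assert rebuild_from_q(q, cols, 2) == cols[2]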
On 2012-12-05 05:52, Jim Klimov wrote:
> For undersized allocations, i.e. of compressed data, it is possible
> to see P-sizes not divisible by 4 (data disks) in 4KB sectors; however,
> some sectors do apparently get wasted, because the A-size in the DVA
> is divisible by 6*4KB. With columnar allocation across disks, it is
> easier to see why full stripes have to be used:
>
>  p1 p2 d1 d2 d3 d4
>   .  ,  1  5  9 13
>   .  ,  2  6 10 14
>   .  ,  3  7 11  x
>   .  ,  4  8 12  x
>
> In this illustration a 14-sector-long block is saved, with X being
> the empty leftovers, on which we can't really save anything (as would
> be the case with the other allocation scheme, which is likely less
> efficient for CPU and IOs).

Getting more and more puzzled with this... I have seen DVA values matching both theories now...

Interestingly, all the allocations I looked over involved a number of sectors divisible by 3... rounding to half of my 6-disk RAID set - is it merely a coincidence, or some means of balancing IOs?

Anyhow, with 4KB sectors involved, I saw many 128KB logical blocks compressed into just half a dozen sectors of userdata payload, so wasting one or two sectors here is quite a large percentage of my storage overhead.

An exposition of the found evidence follows. Take this one from my original post:

DVA[0]=<0:594928b8000:9000> ... size=20000L/4800P

It has 5 data sectors (@4KB) over the 4 data disks in my raidz2 set, so it spills over to a second row and requires additional parity sectors - overall 5d+4p = 9 sectors, which we see in the DVA A-size. This is normal, as expected.

These ones, however, differ:

DVA[0]=<0:acef500e000:c000> ... size=20000L/6a00P
DVA[0]=<0:acef501a000:c000> ... size=20000L/7200P
DVA[0]=<0:acef5026000:c000> ... size=20000L/5c00P

These neighbors, with 7, 8 and 6 sectors' worth of data, all occupy 12 sectors on disk along with their parities.

DVA[0]=<0:59492a92000:6000> ... size=20000L/2800P

With 3*4KB sectors' worth of data and 2 parity sectors, this block is allocated over 6, not 5, sectors.

DVA[0]=<0:5996bf7c000:12000> ... size=20000L/a800P

Likewise, with 11 sectors of data and likely 6 sectors of parity, this one is given 18, not 17, sectors of storage allocation.

DVA[0]=<0:5996be32000:1e000> ... size=20000L/12c00P

Here, 19 sectors of data and 10 of parity occupy 30 sectors on disk.

I did not yet research where exactly the "unused" sectors are allocated - "vertically" on the last strip, like in my yesterday's depiction quoted above, or "horizontally" across several disks - but now that I know about this, it really bothers me as wasted space with no apparent gain. I mean, the raidz code does tricks to ensure that parities are located on different disks, and in normal conditions the userdata sector reads land on all disks in a uniform manner. Why forfeit the natural "rotation" that P-sizes smaller than a multiple of the number of data disks would provide? Writes are anyway streamed and coalesced, so by not allocating these unused sectors we'd only reduce the needed write IOPS by some portion - and save disk space...

In short: can someone explain the rationale - why are allocations such as they are now, and can this be discussed as a bug, or should it be rationalized as a feature?

Thanks,
//Jim
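One candidate explanation, offered as my reading of vdev_raidz_asize() in the vdev_raidz.c source linked elsewhere in this thread rather than as a confirmed answer: raidz appears to round every allocation up to a multiple of (nparity + 1) sectors, so that any freed gap can always hold at least a 1-sector block plus its parity. A sketch that reproduces all of the A-sizes quoted above:

def raidz_asize(psize, ashift=12, ndisks=6, nparity=2):
    """Sector accounting as I read vdev_raidz_asize(): data sectors, plus
    nparity parity sectors per row of (ndisks - nparity) data columns, then
    rounded up to a multiple of (nparity + 1) sectors."""
    ndata = ndisks - nparity
    sectors = ((psize - 1) >> ashift) + 1
    sectors += nparity * ((sectors + ndata - 1) // ndata)
    sectors += (-sectors) % (nparity + 1)          # the "wasted" padding sectors
    return sectors << ashift

# Reproduces the A-sizes observed above:
for psize, asize in [(0x4800, 0x9000), (0x6a00, 0xc000), (0x7200, 0xc000),
                     (0x5c00, 0xc000), (0x2800, 0x6000), (0xa800, 0x12000),
                     (0x12c00, 0x1e000)]:
    assert raidz_asize(psize) == asize

If that reading is right, the "divisible by 3" observation is exactly the roundup to (nparity + 1) = 3 sectors - a guard against unallocatable 1- or 2-sector holes rather than a means of IO balancing.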
more below...

On 2012-12-05 23:16, Timothy Coalson wrote:
> On Tue, Dec 4, 2012 at 10:52 PM, Jim Klimov <jimklimov at cos.ru> wrote:
>> On 2012-12-03 18:23, Jim Klimov wrote:
>>> On 2012-12-02 05:42, Jim Klimov wrote:
>>>> 4) Where are the redundancy algorithms specified? Is there any simple
>>>>    tool that would recombine a given algo-N redundancy sector with
>>>>    some other 4 sectors from a 6-sector stripe in order to try and
>>>>    recalculate the sixth sector's contents? (Perhaps part of some
>>>>    unit tests?)
>
> I'm a bit late to the party, but from a previous list thread about
> redundancy algorithms, I had found this:
>
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
>
> Particularly the functions "vdev_raidz_reconstruct_p",
> "vdev_raidz_reconstruct_q", "vdev_raidz_reconstruct_pq" (and possibly
> "vdev_raidz_reconstruct_general") seem like what you are looking for.
>
> As I understand it, the case where you have both redundancy blocks, but
> are missing two data blocks, is the hardest (if you are missing only one
> data block, you can either do a straight xor with the first redundancy
> section, or some LFSR shifting, xor, and then reverse LFSR shifting to
> use the second redundancy section).
>
> Wikipedia describes the math to restore from two missing data sections
> here, under "computing parity": http://en.wikipedia.org/wiki/Raid6#RAID_6
>
> I don't know any tools to do this for you from arbitrary input, sorry.

Thanks, you are not "late" and welcome to the party ;)

I'm hacking together a simple program to look over the data sectors and XOR parity and determine how many discrepancies there are, if any, and at what offsets into the sector - byte by byte. Running it on raw ZFS block component sectors, extracted with dd in the ways I wrote of earlier in the thread, I did confirm some good sectors and the one erroneous block that I have. The latter turns out to have 4.5 sectors' worth of userdata, overall laid out like this:

     dsk0 dsk1 dsk2 dsk3 dsk4 dsk5
      _    _    _    _    _    p1
      q1   d0   d2   d3   d4*  p2
      q2   d1   _    _    _    _

Here the compressed userdata is contained in the order of my "d"-sector numbering, d0-d1-d2-d3-d4, and d4 is only partially occupied (the P-size of the block is 0x4c00), so its final quarter is all zeroes.

It also happens that on disks 1,2,3 the first data row's sectors (d0, d2, d3) are botched - the ranges from 0x9C0 to 0xFFF (the end of the 4KB sector) are zeroes. The neighboring blocks, located a few sectors away from this one, also have compressed data and show some regular-looking patterns of bytes, certainly no long stretches of zeroes.

However, the byte-by-byte XOR matching complains about the whole sector: all bytes, except some 40 single-byte locations here and there, don't XOR up to produce the expected (known from disk) value. I did not yet try the second parity algorithm.

At least in this case, it does not seem that I would find an incantation needed to recover this block - too many zeroes overlap (at least 3 disks' data is proven compromised), where I had hoped for some shortcoming in ZFS recombination exhaustiveness. In this case there is indeed too much failure to handle.

Now waiting for scrub to find me more test subjects - broken files ;)

Thanks, //Jim
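The byte-by-byte P check described above is essentially this (file names are hypothetical 4KB sector dumps produced with dd; for a full-width raidz row the P column is the plain XOR of the data columns, so any byte offset where the XOR of all columns is non-zero is suspect):

def xor_mismatches(p_path, data_paths, sector=4096):
    """Return the byte offsets within one 4KB row where P does not equal
    the XOR of the data columns (assumes a full-width row)."""
    with open(p_path, "rb") as f:
        p = f.read(sector)
    cols = []
    for path in data_paths:
        with open(path, "rb") as f:
            cols.append(f.read(sector))
    bad = []
    for i in range(sector):
        x = p[i]
        for c in cols:
            x ^= c[i]
        if x:
            bad.append(i)
    return bad

# e.g. xor_mismatches("p1.img", ["d0.img", "d2.img", "d3.img", "d4.img"])
# would report the 0x9c0..0xfff corruption described above as one long run.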
From: Jim Klimov
Date: 2012-Dec-09 00:05 UTC
Subject: [zfs-discuss] Userdata physical allocation in rows vs. columns WAS Digging in the bowels of ZFS
For those who have work to do and can't be bothered to read the detailed context, please do scroll down to the marked APPLIED QUESTION about a possible project to implement a better on-disk layout of blocks. The busy experts' opinions are highly regarded here. Thanks ;) //Jim

CONTEXT AND SPECULATION

Well, now that I've mostly completed building my tool to locate, extract from disk and verify the sectors related to any particular block, I can state with certainty: data sector numbering is columnar, as depicted in my recent mails (quote below), not row-based as I had believed earlier - which would be more compact to store.

Columns do make certain sense, but they also lead to more wasted space than would otherwise be possible - and I'm not sure the allocation in rows would really be slower to write or read, especially since HDD caching would coalesce requests to neighboring sectors, be they a contiguous quarter of my block's physical data or a series of every fourth sector from it. This would likely be more complex to code and comprehend, and it might even require more CPU cycles to account sizes properly (IF today we just quickly allocate columns of the same size - I skimmed over vdev_raidz.c, but did not look into this detail).

Saving 1-2 sectors from allocations which are some 10-30 sectors long altogether is IMHO a worthy percentage of savings to worry and bother about, especially with the compression-related paradigm of "our CPUs are slackers with nothing to do". ZFS overhead on 4KB-sectored disks is pretty "expensive" already, so I see little need to feed it extra desserts too ;)

APPLIED QUESTION:

If one were to implement a different sector allocator (rows with a more precise cutoff vs. columns as they are today) and expose it as a zfs property that can be set by users (or testing developers), would it make sense to call it a "compression" mode (in current terms) and use a bit from that field? Or would the GRID bits be more properly used for this? I am not sure feature flags are a proper mechanism for this, except to protect from import and interpretation of such "fixed" datasets and pools on incompatible (older) implementations - the allocation layout is likely going to be an attribute applied to each block at write time and noted in blkptr_t like the checksum and compression, but it would only apply to raidzN. AFAIK, the contents of userdata sectors and their ordering don't even matter to the ZFS layers until decompression - parities and checksums just apply to the prepared bulk data...

//Jim Klimov

On 2012-12-06 02:08, Jim Klimov wrote:
> On 2012-12-05 05:52, Jim Klimov wrote:
>> For undersized allocations, i.e. of compressed data, it is possible
>> to see P-sizes not divisible by 4 (data disks) in 4KB sectors; however,
>> some sectors do apparently get wasted, because the A-size in the DVA
>> is divisible by 6*4KB. With columnar allocation across disks, it is
>> easier to see why full stripes have to be used:
>>
>>  p1 p2 d1 d2 d3 d4
>>   .  ,  1  5  9 13
>>   .  ,  2  6 10 14
>>   .  ,  3  7 11  x
>>   .  ,  4  8 12  x
>>
>> In this illustration a 14-sector-long block is saved, with X being
>> the empty leftovers, on which we can't really save anything (as would
>> be the case with the other allocation scheme, which is likely less
>> efficient for CPU and IOs).
>
> Getting more and more puzzled with this... I have seen DVA values
> matching both theories now...
>
> Interestingly, all the allocations I looked over involved a number
> of sectors divisible by 3...
> rounding to half of my 6-disk RAID set - is it merely a coincidence,
> or some means of balancing IOs?

...

> I did not yet research where exactly the "unused" sectors are
> allocated - "vertically" on the last strip, like in my yesterday's
> depiction quoted above, or "horizontally" across several disks -
> but now that I know about this, it really bothers me as wasted
> space with no apparent gain. I mean, the raidz code does tricks
> to ensure that parities are located on different disks, and in
> normal conditions the userdata sector reads land on all disks
> in a uniform manner. Why forfeit the natural "rotation" that
> P-sizes smaller than a multiple of the number of data disks
> would provide?

...

> In short: can someone explain the rationale - why are allocations
> such as they are now, and can this be discussed as a bug, or should
> it be rationalized as a feature?
more below...

On 2012-12-06 03:06, Jim Klimov wrote:
> It also happens that on disks 1,2,3 the first data row's sectors
> (d0, d2, d3) are botched - the ranges from 0x9C0 to 0xFFF (the end
> of the 4KB sector) are zeroes.
>
> The neighboring blocks, located a few sectors away from this one, also
> have compressed data and show some regular-looking patterns of bytes,
> certainly no long stretches of zeroes.
>
> However, the byte-by-byte XOR matching complains about the whole sector:
> all bytes, except some 40 single-byte locations here and there, don't
> XOR up to produce the expected (known from disk) value.
>
> I did not yet try the second parity algorithm.
>
> At least in this case, it does not seem that I would find an incantation
> needed to recover this block - too many zeroes overlap (at least 3
> disks' data is proven compromised), where I had hoped for some
> shortcoming in ZFS recombination exhaustiveness. In this case there is
> indeed too much failure to handle.
>
> Now waiting for scrub to find me more test subjects - broken files ;)

So, these findings from my first tested bad file remain valid. Now that I have a couple more error locations, found again by scrub (which over the past week has progressed to just above 50% of the pool), there are some more results.

So far only one location has random-looking different data in the sectors of the block on different disks, which I might at least try to salvage as described in the beginning of this thread.

In two of the three cases, some of the sectors (in the range which mismatches the parity data) are clearly invalid - filled with long stretches of zeroes while the other sectors hold uniform-looking binary data (results of compression). Moreover, several of these sectors (4096 bytes long, at the same offsets on different drives which are data components of the same block) are literally identical, which is apparently some error upon write (perhaps some noise was interpreted by several disks at once as a command to write at that location).

The corrupted area looks like a series of "0xFC 0x42" bytes about half a kilobyte long, followed by zero bytes to the end of the sector. The start of this area is not aligned to a multiple of 512 bytes.

These disks being of an identical model and firmware, I am ready to believe that they might misinterpret the same interference in the same way. However, I was under the impression that SATA involves CRCs on commands and data in the protocol - to counter such noise?..

Question: does such a conclusion sound like a potentially possible explanation for my data corruptions (on disks which passed dozens of scrubs successfully before developing these problems nearly at once in about ten locations)?

Thanks for your attention, //Jim Klimov
On Sun, Dec 9, 2012 at 1:27 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> In two of the three cases, some of the sectors (in the range which
> mismatches the parity data) are clearly invalid - filled with long
> stretches of zeroes while the other sectors hold uniform-looking
> binary data (results of compression). Moreover, several of these
> sectors (4096 bytes long, at the same offsets on different drives
> which are data components of the same block) are literally identical,
> which is apparently some error upon write (perhaps some noise was
> interpreted by several disks at once as a command to write at that
> location).
>
> The corrupted area looks like a series of "0xFC 0x42" bytes about
> half a kilobyte long, followed by zero bytes to the end of the sector.
> The start of this area is not aligned to a multiple of 512 bytes.

Just a guess, but that might be how the sectors were when the drive came from the manufacturer, rather than filled with zeros (a test pattern while checking for bad sectors). As for why some other sectors did show zeros in your other results, perhaps those sectors got reallocated from the reserved sectors after whatever caused your problems, which may not have been written to during the manufacturer test.

> These disks being of an identical model and firmware, I am ready
> to believe that they might misinterpret the same interference in the
> same way. However, I was under the impression that SATA involves
> CRCs on commands and data in the protocol - to counter such noise?..

There is 8b/10b encoding, though I am not entirely sure how well this counters noise (it is intended to DC-balance the signal and reduce run length). I don't know if there is further integrity checking, though one would hope that noise wouldn't decode as a sensible ATA command.

> Question: does such a conclusion sound like a potentially possible
> explanation for my data corruptions (on disks which passed dozens
> of scrubs successfully before developing these problems nearly at
> once in about ten locations)?

Potentially? I suppose, but I would bet on something else (controller went haywire?).

Tim
On 2012-12-10 07:35, Timothy Coalson wrote:
>> The corrupted area looks like a series of "0xFC 0x42" bytes about
>> half a kilobyte long, followed by zero bytes to the end of the sector.
>> The start of this area is not aligned to a multiple of 512 bytes.
>
> Just a guess, but that might be how the sectors were when the drive came
> from the manufacturer, rather than filled with zeros (a test pattern
> while checking for bad sectors). As for why some other sectors did show
> zeros in your other results, perhaps those sectors got reallocated from
> the reserved sectors after whatever caused your problems, which may not
> have been written to during the manufacturer test.

Thanks for the idea. I also figured it might be some test pattern or maybe some sort of "secure wipe", and an HDD's relocation to spare sectors might be a reasonable scenario for such an error creeping into an LBA which previously held valid data - i.e. the disk tried to salvage as much of a newly corrupted sector as it could...

I dismissed it because several HDDs had the error at the same offsets, and some of them had the same contents in the corrupted sectors; however identical the disks might be, this is just too much of a coincidence for disk-internal hardware relocation to be The Reason.

A controller going haywire - that is possible, given that this box was off until it was recently repaired (broken cooling), and the controller is the nearest "centralized" SPOF location common to all disks (with the overheated CPU, non-ECC RAM and the software further along the road). I am not sure which one of these *couldn't* issue (or be interpreted to issue) a number of weird identical writes to different disks at the same offsets. Everyone is a suspect :(

Thanks, //Jim Klimov
On 2012-12-02 05:42, Jim Klimov wrote:
> My plan is to dig out the needed sectors of the broken block from
> each of the 6 disks and try any and all reasonable recombinations
> of redundancy and data sectors to try and match the checksum - this
> should give me a definite answer on whether ZFS (of that oi151.1.3-based
> build) does all I think it can to save data or not. Either I put the
> last nail into my itching question's coffin, or I'd nail a bug to
> yell about ;)

Well, I've come to a number of conclusions, though I did not yet close this matter for myself.

One regards the definition of "all reasonable recombinations" - ZFS does not do "*everything* possible" to recover corrupt data, and in fact it can't; nobody can. When I took this to an extreme, assuming that the bytes at different offsets within a sector might fail on different disks that comprise a block, the attempt to reconstruct and test a single failed sector byte by byte becomes computationally infeasible - for 4 data disks and 1 parity I got about 4^4096 combinations to test. The next Big Bang will happen sooner than I'd get a "yes or no", or so they say (yes, I did a rough estimate - well over 10^100 seconds even if I used all the computing horsepower on Earth today).

If there are R known-broken rows of data (be it bits, bytes, sectors, whole columns, or whatever quantum of data we take) on D data disks and P parity disks (all readable without HW IO errors), where "known brokenness" means both a parity mismatch in this row and a checksum mismatch for the whole userdata block, we do not know in advance how many errors there are in a row (we only hope there are not more than there are parity columns), nor where exactly the problem is. Thanks to the checksum mismatch we do know that at least one error is in the data disks' on-disk data.

We might hope to find a correct "original data" which matches the checksum by determining, for each data disk, the possible alternate byte values (computed from the bytes at the same offsets on the other data and parity disks), and checksumming the recombined userdata blocks with some of the on-disk bytes replaced by these calculated values. For each row we test 1..P alternate column values, and we must apply the alteration to all of the rows where known errors exist, in order to detect neighboring but non-overlapping errors in different components of the block's allocation. (This was the breakage scenario that was deemed possible for raidzN, with disk heads hovering over similar locations all the time.)

This can yield a very large field of combinations with a small row height (i.e. matching 1 byte per disk), or too few combinations with a row height chosen too big (i.e. the whole portion of one disk's part of the userdata - a quarter in the case of my 4-data-disk set).

For single-break-per-row tests based on hypotheses from P parities, D data disks and R broken rows, we need to checksum P*(D^R) userdata recombinations in order to determine that we can't recover the block. To catch the less probable case of several errors per row (up to the number of parities we have), we need to retry even more combinations afterwards.

My 5-year-old Pentium D tested 1000 sha256 checksums over 128KB blocks in about 2-3 seconds, so it is reasonable to keep the reconstruction loops - and thus the smallness of a step and thus the number of steps - within some arbitrarily chosen timeout (30 sec? 1 sec?). With a fixed number of parity and data disks in a particular TLVDEV, we can then determine the "reasonable" row heights.
Also, this low-level recovery at a higher number of cycles might be a job for a separate tool - i.e. "online" recovery during ZFS IO and scrubs might be limited to a few sectors, and whatever is not fixed by that could be manually fed to a programmatic number-cruncher and possibly get recovered overnight...

I now know that it is cheap and fast to determine parity mismatches for each single-byte column offset in a userdata block (leading to D*R userdata bytes whose contents we are not certain of), so even if the quantum of data for reconstructions is a sector, it is quite reasonable to start with byte-by-byte mismatch detection. The locations of the detected errors can help us determine whether the errors are colocated in a single row of sectors (so likely one or more sectors at the same offset on different disks got broken), or spread over several sectors (we might be lucky and have single errors per disk in neighboring sector numbers).

It is, after all, not reasonable to go below 512b, or even the larger HW sector size, as the quantum of data for recovery attempts. But testing *only* whole columns (*if* this is what is done today) also forfeits some chances of automated recovery - though, certainly, the recovery attempts should start with the most probable combinations, such as all errors being confined to a single disk, and then go down in step size and test possible errors on several component disks. We can afford several thousand checksum tests, which might give a chance to recover data that is not recoverable today, *if* today's tests are not so exhaustive...

To conclude, I still do not know (I did not read that deep into the code) how ZFS does its raidz recovery attempts today - is the "row height" a whole single disk's portion (i.e. 32KB from a 128KB block over 4 data disks), or some physical or logical sector size (4KB, 512b)?..

Even per-sector reconstructions of a 128KB block over 4 data disks with 512b sectors yield - with a single alternate variant per sector derived from parity - if I am not mistaken, an impressive and infeasible 4^64 combinations to test with checksums by pure brute force. Heck, just counting from 1 to 2^64 in an "i++" loop takes a lot of CPU time :)

And so far my problems have occurred in compressed blocks which have a physical allocation of about a single sector in height, and at this "resolution" it was not possible to find one broken sector and fix the userdata. I'm still waiting for the scrub to complete, so that I can get some corrupted files with parity errors in different rows of HW-sector height.

Thanks for listening, I'm out :)
//Jim Klimov
On 2012-12-11 16:44, Jim Klimov wrote:
> For single-break-per-row tests based on hypotheses from P parities,
> D data disks and R broken rows, we need to checksum P*(D^R) userdata
> recombinations in order to determine that we can't recover the block.

A small maths correction: the formula above assumes that we change one item from its on-disk value to a reconstructed hypothesis on some one data disk (column) in every broken row, or on up to P disks if we try to recover from more than one failed item in a row. Reality is worse :)

Our original info (parity errors and a checksum mismatch) only warranted that we have at least one error in the userdata. It is possible that the other (R-1) errors are on the parity disks, so the recombination should also check all variants where 0..(R-1) of the rows are left with their on-disk contents intact. This gives us something like P*(D + D^2 + ... + D^R) variants to test, which for D=4 is roughly a one-third increase in recombinations - still within the range of computationally feasible amounts of error-matching.

> Heck, just counting from 1 to 2^64 in an "i++" loop takes a lot
> of CPU time

By my estimate, even that would take until the next Big Bang, at least on my one computer ;)

Just for fun: a count to 2^32 took 42 seconds, so my computer can do about 10^8 trivial loops per second - but that's just a data point. What really matters is that 4^64 == 2^128 == 2^32 * 2^96, which is a lot. With 2^96 being roughly 10^29, the plain count from 1 to 4^64 would take about 4*10^30 seconds, or roughly 10^23 years. If the astronomers' estimates are correct, this amounts to some 10^13 lifetimes of our universe ;)

//Jim
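To put the corrected count next to the ~400 sha256 checks per second measured on the Pentium D earlier in the thread, a throwaway sketch (the rate and the cutoff commentary are just the numbers quoted above; nothing here reflects what ZFS itself does):

def variants(P, D, R):
    """Corrected estimate: P * (D + D^2 + ... + D^R) recombinations."""
    return P * sum(D**r for r in range(1, R + 1))

RATE = 400.0   # sha256 checks of a 128KB block per second (Pentium D figure)
for R in (1, 2, 4, 6, 8, 16, 64):
    n = variants(2, 4, R)
    print(f"R={R:3d}  variants={n:.3e}  time={n / RATE:.3e} s")

# Up to about R=6 broken rows the search stays within half a minute at this
# rate; R=64 (one row per 512b sector of a 128KB block on 4 data disks) is
# hopeless for brute force, as concluded above.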