As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...

Bug alert: it seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source seemed to work, but in fact failed: subsequent reads of the files produced IO errors. Apparently (my wild guess), upon writing the blocks, checksums were calculated and the matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it still increased the DDT counter.

The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.

I can think of a couple of solutions:

If the block is detected to be corrupt (checksum mismatches the data), the checksum value in block pointers and the DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, when the error is detected.

Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).

Alas, so far there is anyway no guarantee that it was not the checksum itself that got corrupted (except for using ZDB to retrieve the block contents and matching that with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.

//Jim Klimov

(Bug reported to Illumos: https://www.illumos.org/issues/1981)
On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:

> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>
> Bug alert: it seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source seemed to work, but in fact failed: subsequent reads of the files produced IO errors. Apparently (my wild guess), upon writing the blocks, checksums were calculated and the matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it still increased the DDT counter.
>
> The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.
>
> I can think of a couple of solutions:

In theory, the verify option will correct this going forward.

> If the block is detected to be corrupt (checksum mismatches the data), the checksum value in block pointers and the DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, when the error is detected.

What if it is a transient fault?

> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).

verify eliminates this failure mode.

> Alas, so far there is anyway no guarantee that it was not the checksum itself that got corrupted (except for using ZDB to retrieve the block contents and matching that with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.

Extremely unlikely. The metadata is also checksummed. To arrive here you will have to have two corruptions, each of which generates the proper checksum. Not impossible, but... I'd buy a lottery ticket instead.

See also dedupditto. I could argue that the default value of dedupditto should be 2 rather than "off".

> //Jim Klimov
>
> (Bug reported to Illumos: https://www.illumos.org/issues/1981)

Thanks!
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
2012-01-13 4:26, Richard Elling wrote:
> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>
>> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>>
>> Bug alert: it seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source seemed to work, but in fact failed: subsequent reads of the files produced IO errors. Apparently (my wild guess), upon writing the blocks, checksums were calculated and the matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it still increased the DDT counter.
>>
>> The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.
>>
>> I can think of a couple of solutions:
>
> In theory, the verify option will correct this going forward.

But in practice there are many suggestions to disable verification because it slows down writes beyond what the DDT already does to performance, and since there is just some 10^-77 chance that two blocks would have the same checksum values, it is supposedly there only for the paranoid. (A quick back-of-the-envelope on that figure follows this message.)

>> If the block is detected to be corrupt (checksum mismatches the data), the checksum value in block pointers and the DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, when the error is detected.
>
> What if it is a transient fault?

Reread the disk, retest the checksums?.. I don't know... :)

>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>
> verify eliminates this failure mode.

Sounds true; I didn't try that, though. But my scrub is not yet complete, maybe there will be more test subjects ;)

>> Alas, so far there is anyway no guarantee that it was not the checksum itself that got corrupted (except for using ZDB to retrieve the block contents and matching that with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.
>
> Extremely unlikely. The metadata is also checksummed. To arrive here you will have to have two corruptions, each of which generates the proper checksum. Not impossible, but... I'd buy a lottery ticket instead.

I rather meant the opposite: the file data is actually good, but the checksums (apparently both the DDT and block-pointer ones, with all their ditto copies) are bad, either due to disk rot or RAM failures. For example, are the "blockpointer" and "dedup" versions of the sha256 checksum recalculated by both stages, or reused, on writes of a block?..

> See also dedupditto. I could argue that the default value of dedupditto should be 2 rather than "off".

I couldn't set it to smallish values (like 64) on an oi_148a LiveUSB:

root@openindiana:~# zpool set dedupditto=64 pool
cannot set property for 'pool': invalid argument for this pool operation
root@openindiana:~# zpool set dedupditto=2 pool
cannot set property for 'pool': invalid argument for this pool operation
root@openindiana:~# zpool set dedupditto=127 pool
root@openindiana:~# zpool get dedupditto pool
NAME  PROPERTY    VALUE  SOURCE
pool  dedupditto  127    local

Thanks,
//Jim
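As an aside on the "10^-77" figure and the lottery-ticket comparison above, the arithmetic can be sketched in a couple of lines of Python. The pool size below is an invented example; this says nothing about ZFS itself, only about an ideal 256-bit hash:

from math import log10

bits = 256
pair = 2.0 ** -bits                    # chance that two given blocks collide
# Birthday-style bound: with n unique blocks, the chance of any collision at
# all is roughly n^2 / 2^(bits+1). n = 2^35 (~34 billion 128K blocks, ~4 PiB
# of unique data) is a made-up pool size.
n = 2 ** 35
anywhere = n * n / 2.0 ** (bits + 1)
print("per-pair  ~ 1e%d" % round(log10(pair)))       # about 1e-77
print("pool-wide ~ 1e%d" % round(log10(anywhere)))   # about 1e-56

Of course, as this thread shows, the practical risk is not hash collisions but blocks that silently stop matching their recorded checksum.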
2012-01-13 4:26, Richard Elling wrote:
> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>
> verify eliminates this failure mode.

Thinking about it... I have more questions.

In this case: the DDT/BP contain multiple references with correct checksums, but the on-disk block is bad. A newly written block has the same checksum, and verification proves that the on-disk data differs byte for byte.

1) How does the write stack interact with those checksums that do not match the data? Would any checksum be tested at all for this verification read of existing data?

2) It would make sense for the failed verification to have the new block committed to disk, and a new DDT entry with the same checksum created. I would normally expect this to become the new unique block of a new file, with no influence on existing data (block chains). However, in the problematic case discussed here, this safe behaviour would also mean not contributing to the repair of those existing block chains which include the mismatching on-disk block.

Either I misunderstand some of the above, or I fail to see how verification would eliminate this failure mode (namely, as per my suggestion, replace the bad block with a good one and have all references updated and block chains -> files fixed in one shot). Would you please explain?

Thanks,
//Jim Klimov
On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
> 2012-01-13 4:26, Richard Elling wrote:
>> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>>> Alternatively (opportunistically), a flag might be set in the DDT entry requesting that a new write matching this stored checksum should get committed to disk - thus "repairing" all files which reference the block (at least, stopping the IO errors).
>>
>> verify eliminates this failure mode.
>
> Thinking about it... I have more questions.
>
> In this case: the DDT/BP contain multiple references with correct checksums, but the on-disk block is bad. A newly written block has the same checksum, and verification proves that the on-disk data differs byte for byte.
>
> 1) How does the write stack interact with those checksums that do not match the data? Would any checksum be tested at all for this verification read of existing data?
>
> 2) It would make sense for the failed verification to have the new block committed to disk, and a new DDT entry with the same checksum created. I would normally expect this to become the new unique block of a new file, with no influence on existing data (block chains). However, in the problematic case discussed here, this safe behaviour would also mean not contributing to the repair of those existing block chains which include the mismatching on-disk block.
>
> Either I misunderstand some of the above, or I fail to see how verification would eliminate this failure mode (namely, as per my suggestion, replace the bad block with a good one and have all references updated and block chains -> files fixed in one shot).

It doesn't update past data.

It gets treated as if there were a hash collision: the new data is really different despite having the same checksum, and so gets written out instead of incrementing the existing DDT pointer. So it addresses your ability to recover the primary filesystem by overwriting with the same data, which dedup was previously defeating.

--
Dan.
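To restate Dan's point in pseudocode: the following is only a conceptual sketch of the dedup write decision being discussed. It is not the actual ZFS code path, and all names here are invented for illustration.

def dedup_write(new_data, checksum, ddt, verify, write_block, read_block):
    # Conceptual sketch only -- not actual ZFS code. ddt, write_block and
    # read_block stand in for the real machinery.
    entry = ddt.lookup(checksum)
    if entry is None:
        # Checksum not seen before: store the block and start a DDT entry.
        dva = write_block(new_data)
        ddt.insert(checksum, dva, refcount=1)
        return dva
    if verify:
        # dedup=verify: read the already-stored block and compare byte for byte.
        if read_block(entry.dva) != new_data:
            # Treated like a hash collision (or, as in this thread, like silent
            # on-disk corruption): the new data is written out as its own block
            # instead of bumping the stale entry. How the colliding block is
            # tracked in the DDT is omitted here.
            return write_block(new_data)
    # No verify (or the bytes really match): just take another reference --
    # which is why overwriting with known-good data could not repair the bad copy.
    entry.refcount += 1
    return entry.dva

Presumably this is also the answer to question (1) above: without verify, the stored block is never read back at write time, only referenced, so its mismatch with the recorded checksum goes unnoticed until a later read or scrub.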
2012-01-13 5:34, Daniel Carosone wrote:
> On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
>> Either I misunderstand some of the above, or I fail to see how verification would eliminate this failure mode (namely, as per my suggestion, replace the bad block with a good one and have all references updated and block chains -> files fixed in one shot).
>
> It doesn't update past data.
>
> It gets treated as if there were a hash collision: the new data is really different despite having the same checksum, and so gets written out instead of incrementing the existing DDT pointer. So it addresses your ability to recover the primary filesystem by overwriting with the same data, which dedup was previously defeating.

But (yes/no?) I have to do this repair file by file, either with dedup=off or dedup=verify. Actually, that is what I properly should do if there is such a serious error - but what if the original data is not available, so I can't fix it file by file? Or if there are very many errors (read: DDT references from a number of files just under the dedupditto value) and such a match-and-repair procedure is prohibitively inconvenient, slow, whatever?

Say, previously we trusted the hash algorithm: the same checksums mean identical blocks. With such trust the user might want to replace the faulty block with another one (matching the checksum) and expect ALL deduped files that used this block to become automagically recovered. Chances are, they actually would be correct (by external verification). And if we trust unverified dedup in the first place, there is nothing wrong with such an approach to repair. It would not make possible errors worse than there were in the originally saved on-disk data (even if there were hash collisions of really-different blocks - the user had discarded that difference long ago).

I think the user should be given an (informed) ability to shoot himself in the foot or recover data, depending on his luck. Anyway, people are doing it thanks to Max Bruning's or Viktor Latushkin's posts and direct help, or they research the hardcore internals of ZFS. We might as well play along and increase their chances of success, even if unsupported and unguaranteed - no?

This situation with "obscured" recovery methods reminds me of prohibited changes of firmware on cell phones: customers are allowed to sit on a phone or drop it into a sink, and perhaps have it replaced, but they are not allowed to install different software. Many still do.

//Jim Klimov
2012-01-13 4:26, Richard Elling wrote:
> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>> The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in place. After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as expected.
>
> In theory, the verify option will correct this going forward.

Well, I have got more complaining blocks, and even new errors in files that I had previously "repaired" with rsync before I figured out the problem with dedup today.

Now I have set the verify flag instead of dedup=off, and the rsync replacement from external storage seems to happen a lot faster. It also seems to persist even a few minutes after the copying ;)

Thanks for the tip, Richard!
//Jim Klimov
2012-01-13 4:12, Jim Klimov wrote:
> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...

Well, after some crawling over my data with zdb, od and dd, I guess ZFS was right about finding checksum errors - the metadata's checksum matched that of a block on the original system, and the data block was indeed erring.

Just in case it helps others, the SHA256 checksums can be tested with openssl as I show below. I am still searching for a command-line fletcher4/fletcher2 checker (as that weaker hash is used on metadata; I wonder why).

Here's a tail from the on-disk blkptr_t, the bytes with the checksum:

# tail -2 /tmp/osx.l0+1100000.blkptr.txt
000460 1f 6f 4c 73 5d c1 ab 15 00 cc 56 90 38 8e b4 dd
000470 a9 8e 54 6f f1 a7 db 43 7d 61 9e 01 23 45 2e 70

In byte 0x435 I have the value 0x8 - SHA256.

And here is the SHA256 hash for the excerpt from the original file (128KB cut out with dd):

# dd if=osx.zip of=/tmp/osx.l0+1100000.bin.orig bs=512 skip=34816 count=256
# openssl dgst -sha256 < /tmp/osx.l0+1100000.bin.orig
15abc15d734c6f1fddb48e389056cc0043dba7f16f548ea9702e4523019e617d

As my x86 is little-endian, the four 8-byte words of the checksum appear reversed. But you can see it matches, so my source file is okay.

I did not find the DDT entries (yet), so I don't know what hash is there or which addresses it references for how many files. The block pointer has the dedup bit set, though. However, of all my files with errors, there are no DVA overlaps.

I hexdumped (with od) the two 128KB excerpts (one from the original file, another fetched with zdb) and diffed them, and while some lines matched, others did not. What is more interesting is that most of the error area contains a repeating pattern like this, sometimes with "extra" chars thrown in:

fc 42 fc 42 fc 42 fc 42 fc 42 fc fc 42 1f fc 42
fc 42 42 ff fc 42 fc 42 fc 42 fc 42

I have seen similar patterns when I zdb-dumped compressed blocks without decompression, so I guess this could be a miswrite of compressed data and/or parity destined for another file (which also did not get it).

The erroneous data starts and ends at "round" offsets like 0x1000-0x2000, 0x9000-0xa000, 0x11000-0x12000 (step 0x8000 between both sets of mismatches; 4KB is my disks' sector size), which also suggests a non-coincidental problem. However, part of the differing data is "normal-looking random noise", while some part is that pattern above, starting and ending at a seemingly random location mid-sector.

Here's about all I have to say and share so far :) Open to suggestions on how to compute fletcher checksums on blocks...

Thanks,
//Jim
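Since a ready-made command-line fletcher checker was mentioned as missing above, here is a small stand-alone sketch of one in Python. It follows my understanding of fletcher_4 (four 64-bit accumulators running over the buffer as little-endian 32-bit words, modulo 2^64); fletcher_2 differs (as far as I recall it works over 64-bit words with two accumulator pairs) and is not covered. Treat the algorithm details as an assumption and cross-check the output against a checksum zdb reports for a known-good block before trusting it.

#!/usr/bin/env python3
# fletcher4.py - sketch of a command-line fletcher4 checker (assumed algorithm:
# four 64-bit accumulators over little-endian 32-bit words, all modulo 2^64).
# The input must be a whole block dumped to a file (length a multiple of 4).
import struct
import sys

def fletcher4(buf):
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (word,) in struct.iter_unpack('<I', buf):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d

if __name__ == '__main__':
    data = open(sys.argv[1], 'rb').read()
    # Print as four hex words, the way zdb displays zio checksums.
    print(' '.join('%016x' % w for w in fletcher4(data)))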
2012-01-21 0:33, Jim Klimov wrote:
> 2012-01-13 4:12, Jim Klimov wrote:
>> As I recently wrote, my data pool has experienced some "unrecoverable errors". It seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
>
> Well, after some crawling over my data with zdb, od and dd, I guess ZFS was right about finding checksum errors - the metadata's checksum matched that of a block on the original system, and the data block was indeed erring.

Well, as I'm moving to close my quest with broken data, I'd like to draw up some conclusions and RFEs. I am still not sure they are factually true - I'm still learning the ZFS internals. So "it currently seems to me that":

1) My on-disk data could get corrupted for whatever reason ZFS tries to protect it from, at least once probably from misdirected writes (i.e. the head landed not where it was asked to write). It cannot be ruled out that the checksums got broken in non-ECC RAM before the writes of block pointers for some of my data, thus leading to mismatches. One way or another, ZFS noted the discrepancy during scrubs and "normal" file accesses. There is no (automatic) way to tell which part is faulty - the checksum or the data.

2) In the case where on-disk data did get corrupted, the checksum in the block pointer was correct (matching the original data), but the raidz2 redundancy did not aid recovery.

3) The file in question was created on a dataset with deduplication enabled, so at the very least the dedup bit was set on the corrupted block's pointer and a DDT entry likely existed. Attempts to rewrite the block with the original one (having "dedup=on") failed in fact, probably because the matching checksum was already in the DDT. Rewrites of such blocks with "dedup=off" or "dedup=verify" succeeded.

   Failure/success was tested by "sync; md5sum FILE" some time after the fix attempt. (When done just after the fix, the test tends to return success even if the on-disk data is bad, "thanks" to caching.)

   My last attempt was to set "dedup=on" and write the block again and sync; the (remote) computer hung instantly :(

3*) The RFE stands: deduped blocks found to be invalid and not recovered by redundancy should somehow be evicted from the DDT (or marked for required verification-before-write) so as not to pollute further writes, including repair attempts. Alternatively, "dedup=verify" takes care of the situation and should be the recommended option.

3**) It was suggested to set "dedupditto" to small values, like "2". My oi_148a refused to set values smaller than 100. Moreover, it seems reasonable to have two dedupditto values: for example, to make a ditto copy when the DDT reference counter exceeds some small value (2-5), and add ditto copies every "N" values for frequently-referenced data (every 64-128).

4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum. I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.

5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes. (A rough sketch of the idea follows this message.)

   If the verification fails, currently cached data from recent TXGs can be recovered from on-disk redundancy and/or may still exist in the RAM cache, and can be rewritten again (and tested again). More importantly, a failed test *may* mean that the write landed on disk randomly, and the pool should be scrubbed ASAP. It may be guessed that the yet-unknown error lies within "epsilon" tracks (sector numbers) from the currently found non-written data, so if it is possible to scrub just a portion of the pool based on DVAs - that's a preferred start. It is possible that some data can be recovered if it is tended to ASAP (i.e. on mirror, raidz, copies>1)...

Finally, I should say I'm sorry for lame questions arising from not reading the format spec and zdb blogs carefully ;)

In particular, it was my understanding for a long time that block pointers each have a sector of their own, leading to the overheads that I've seen. Now I know (and checked) that most of the blockpointer tree is made of larger groupings (128 blkptr_t's in a single 16KB block), reducing the impact of BPs on fragmentation and/or the slack waste of large sectors that I predicted and expected for the past year. Sad that nobody ever contradicted that (mis)understanding of mine.

//Jim Klimov
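To make RFE (5) above a bit more concrete, here is a rough pseudocode sketch of the proposed delayed read-back check. Nothing like this exists in ZFS today; every hook and name below is invented, and the shape of the idea is taken only from the paragraphs above.

from time import sleep

def verify_recent_txg(pool, txg, checksum_fn, delay_s=5):
    # Pseudocode for the proposed "verify-after-write" pool option.
    # pool, its methods and checksum_fn are all hypothetical stand-ins.
    sleep(delay_s)                       # give drive write caches time to drain
    for bp in pool.blocks_written_in(txg):
        data = pool.read_dva(bp.dva, bypass_arc=True)   # must hit media, not ARC
        if checksum_fn(data) != bp.checksum:
            # Rewrite from redundancy or from a still-cached copy...
            pool.repair_block(bp)
            # ...and scrub the neighbourhood: a misdirected write may have
            # damaged nearby sectors ("epsilon" tracks) as well.
            pool.schedule_partial_scrub(near=bp.dva)

The caching objections raised in the following messages (testing a TXG at least one behind the last one written, and devices that ignore cache controls) are exactly the parts hand-waved here by delay_s and bypass_arc.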
On Sat, 21 Jan 2012, Jim Klimov wrote:
> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.

This is an interesting idea. I think that you would want to do a mini-scrub on a TXG at least one behind the last one written, since otherwise any test would surely be foiled by caching. The ability to restore data from RAM is doubtful since TXGs get forgotten from memory as soon as they are written.

Bob
--
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-21 19:18, Bob Friesenhahn wrote:
> On Sat, 21 Jan 2012, Jim Klimov wrote:
>> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.
>
> This is an interesting idea. I think that you would want to do a mini-scrub on a TXG at least one behind the last one written, since otherwise any test would surely be foiled by caching. The ability to restore data from RAM is doubtful since TXGs get forgotten from memory as soon as they are written.

That could be rearranged as part of the bug/RFE resolution ;)

Regarding the written data, I believe it may find a place in the ARC, and for the past few TXGs it could still remain there. I am not sure it is feasible to "guarantee" that it remains in RAM for a certain time. Also, there should be a way to enforce media reads and not ARC re-reads when verifying writes...

//Jim
On Sat, 21 Jan 2012, Jim Klimov wrote:
> Regarding the written data, I believe it may find a place in the ARC, and for the past few TXGs it could still remain there.

Any data in the ARC is subject to being overwritten with updated data just a millisecond later. It is a live cache.

> I am not sure it is feasible to "guarantee" that it remains in RAM for a certain time. Also, there should be a way to enforce media reads and not ARC re-reads when verifying writes...

Zfs already knows how to by-pass the ARC. However, any "media" reads are subject to caching, since the underlying devices try very hard to cache data in order to improve read performance.

As an extreme case of caching, consider a device represented by an iSCSI LUN on an OpenSolaris server with 512GB of RAM. If you request to read data, you are exceedingly likely to read data from the zfs ARC on that server rather than from the underlying "media".

Bob
--
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-21 20:50, Bob Friesenhahn wrote:
> TXGs get forgotten from memory as soon as they are written.

As I said, that could be rearranged - i.e. free the TXG cache only after the corresponding TXG number has been verified? The point about the ARC being overwritten seems valid...

> Zfs already knows how to by-pass the ARC. However, any "media" reads are subject to caching, since the underlying devices try very hard to cache data in order to improve read performance.

As a pointer, the "format" command presents options to disable (separately) read and write caching on the drives it sees. MAYBE there is some option to explicitly read data from media, like sync-writes. Whether the drive firmwares honour that (disabling caching and/or such hypothetical sync-reads) - that is something out of ZFS's control. But we can make a best effort...

> As an extreme case of caching, consider a device represented by an iSCSI LUN on an OpenSolaris server with 512GB of RAM. If you request to read data, you are exceedingly likely to read data from the zfs ARC on that server rather than from the underlying "media".

So far I have rather been considering "flaky" hardware with lousy consumer qualities. The server you describe is likely to exceed that bar ;) Besides, if this OpenSolaris server is up to date, it would do such media checks itself, and/or honour the sync-read requests or temporary cache disabling ;) Of course, this can't be guaranteed of other devices, so in general ZFS can do best-effort verification.

//Jim
On Sun, 22 Jan 2012, Jim Klimov wrote:
> So far I have rather been considering "flaky" hardware with lousy consumer qualities. The server you describe is likely to exceed that bar ;)

The most common "flaky" behavior of consumer hardware which causes trouble for zfs is not honoring cache-related requests. Unfortunately, it is not possible for zfs to fix such hardware. Zfs works best with hardware which does what it is told.

Bob
--
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-22 0:55, Bob Friesenhahn wrote:
> On Sun, 22 Jan 2012, Jim Klimov wrote:
>> So far I have rather been considering "flaky" hardware with lousy consumer qualities. The server you describe is likely to exceed that bar ;)
>
> The most common "flaky" behavior of consumer hardware which causes trouble for zfs is not honoring cache-related requests. Unfortunately, it is not possible for zfs to fix such hardware. Zfs works best with hardware which does what it is told.

Also true. That's what the "option" stood for in my proposal: since the verification feature is going to be expensive and add random IOs, we don't want to enforce it on everybody. Besides, the user might choose to trust his reliable and expensive hardware, like a SAN/NAS with battery-backed NVRAM, which is indeed likely better than a home-brewn NAS box with random HDDs thrown in with no measure, but with a desire for some reliability nonetheless ;)

We can "expect" the individual HDDs' caches to expire after some time (i.e. after we have sent 64MB worth of writes to a particular disk with a 64MB cache), and after that we are likely to get true media reads. That's when the verification reads are likely to return the most relevant (on-disk) sectors...

//Jim
On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:

> Well, as I'm moving to close my quest with broken data, I'd like to draw up some conclusions and RFEs. I am still not sure they are factually true - I'm still learning the ZFS internals. So "it currently seems to me that":
>
> 1) My on-disk data could get corrupted for whatever reason ZFS tries to protect it from, at least once probably from misdirected writes (i.e. the head landed not where it was asked to write). It cannot be ruled out that the checksums got broken in non-ECC RAM before the writes of block pointers for some of my data, thus leading to mismatches. One way or another, ZFS noted the discrepancy during scrubs and "normal" file accesses. There is no (automatic) way to tell which part is faulty - the checksum or the data.

Untrue. If a block pointer is corrupted, then on read it will be logged and ignored. I'm not sure you have grasped the concept of checksums in the parent object.

> 2) In the case where on-disk data did get corrupted, the checksum in the block pointer was correct (matching the original data), but the raidz2 redundancy did not aid recovery.

I think your analysis is incomplete. Have you determined the root cause?

> 3) The file in question was created on a dataset with deduplication enabled, so at the very least the dedup bit was set on the corrupted block's pointer and a DDT entry likely existed. Attempts to rewrite the block with the original one (having "dedup=on") failed in fact, probably because the matching checksum was already in the DDT.

Works as designed.

> Rewrites of such blocks with "dedup=off" or "dedup=verify" succeeded.
>
> Failure/success was tested by "sync; md5sum FILE" some time after the fix attempt. (When done just after the fix, the test tends to return success even if the on-disk data is bad, "thanks" to caching.)

No, I think you've missed the root cause. By default, data that does not match its checksum is not used.

> My last attempt was to set "dedup=on" and write the block again and sync; the (remote) computer hung instantly :(
>
> 3*) The RFE stands: deduped blocks found to be invalid and not recovered by redundancy should somehow be evicted from the DDT (or marked for required verification-before-write) so as not to pollute further writes, including repair attempts. Alternatively, "dedup=verify" takes care of the situation and should be the recommended option.

I have lobbied for this, but so far people prefer performance to dependability.

> 3**) It was suggested to set "dedupditto" to small values, like "2". My oi_148a refused to set values smaller than 100. Moreover, it seems reasonable to have two dedupditto values: for example, to make a ditto copy when the DDT reference counter exceeds some small value (2-5), and add ditto copies every "N" values for frequently-referenced data (every 64-128).
>
> 4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum.

All checksum mismatches are handled the same way.

> I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.

Checksum mismatches are logged; what was your root cause?

> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.

There are highly-reliable systems that do this in the fault-tolerant market.

> If the verification fails, currently cached data from recent TXGs can be recovered from on-disk redundancy and/or may still exist in the RAM cache, and can be rewritten again (and tested again). More importantly, a failed test *may* mean that the write landed on disk randomly, and the pool should be scrubbed ASAP. It may be guessed that the yet-unknown error lies within "epsilon" tracks (sector numbers) from the currently found non-written data, so if it is possible to scrub just a portion of the pool based on DVAs - that's a preferred start. It is possible that some data can be recovered if it is tended to ASAP (i.e. on mirror, raidz, copies>1)...
>
> Finally, I should say I'm sorry for lame questions arising from not reading the format spec and zdb blogs carefully ;)
>
> In particular, it was my understanding for a long time that block pointers each have a sector of their own, leading to the overheads that I've seen. Now I know (and checked) that most of the blockpointer tree is made of larger groupings (128 blkptr_t's in a single 16KB block), reducing the impact of BPs on fragmentation and/or the slack waste of large sectors that I predicted and expected for the past year. Sad that nobody ever contradicted that (mis)understanding of mine.

Perhaps some day you can become a ZFS guru, but the journey is long...
 -- richard

--
ZFS Performance and Training
Richard.Elling@RichardElling.com
+1-760-896-4422
2012-01-22 22:58, Richard Elling wrote:
> On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:
>> ... So "it currently seems to me that":
>>
>> 1) My on-disk data could get corrupted for whatever reason ZFS tries to protect it from, at least once probably from misdirected writes (i.e. the head landed not where it was asked to write). It cannot be ruled out that the checksums got broken in non-ECC RAM before the writes of block pointers for some of my data, thus leading to mismatches. One way or another, ZFS noted the discrepancy during scrubs and "normal" file accesses. There is no (automatic) way to tell which part is faulty - the checksum or the data.
>
> Untrue. If a block pointer is corrupted, then on read it will be logged and ignored. I'm not sure you have grasped the concept of checksums in the parent object.

If a block pointer is corrupted on disk after the write - then yes, it will not match the parent's checksum, and there would be another 1 or 2 ditto copies with possibly correct data. Is that the correct grasping of the concept? ;)

Now, the (non-zero-probability) scenario I meant was that the checksum for the block was calculated and then corrupted in RAM/CPU before the ditto blocks were fanned out to disks, and before the parent block checksums were calculated. In this case the on-disk data block is correct as compared to other sources (if it is copies=2 - it may even be the same as its other copy), but it does not match the BP's checksum, while the BP tree seems valid (all tree checksums match). I believe in this case ZFS should flag the data checksum mismatch, although in reality (with minuscule probability) it is the bad checksum mismatching the good data.

Anyway, the situation would seem the same if the data block was corrupted in RAM before fanning out with copies>1, and that is more probable given the size of this block compared to the 256 bits of checksum. Just *HOW* probable that is on an ECC and a non-ECC system, with or without an overclocked, overheated CPU in an enthusiast's overpumped workstation or an unsuspecting consumer's dusty closet - that is a separate maths question, with different answers for different models. Random answer - on par with the disk UBER errors which ZFS by design considers serious enough to combat.

>> 2) In the case where on-disk data did get corrupted, the checksum in the block pointer was correct (matching the original data), but the raidz2 redundancy did not aid recovery.
>
> I think your analysis is incomplete.

As I last wrote, I dumped the blocks with ZDB and compared the bytes with the same block from a good copy. In particular, that copy had the same SHA256 checksum as was stored in my problematic pool's blockpointer entry for the corrupt block. These blocks differed in three sets of 4096 bytes starting at "round" offsets at even intervals (4KB, 36KB, 68KB). 4KB is my disks' block size. It seems that some disk(s?) overwrote existing data, or got scratched, or whatever (no IO errors in dmesg though).

I am not certain why raidz2 did not suffice to fix the block, and what garbage or data exists on all 6 drives - I did not get zdb to dump all 0x30000 bytes of raidz2 raw data to try permutations myself. Possibly, for whatever reason (such as a cable error, or some firmware error given the same model of the drives), several drives got the same erroneous write command at once, and ultimately invalidated parts of the same stripe. Many of the files in peril now have existed on the pool for some time, and scrubs completed successfully many times.

> Have you determined the root cause?

Unfortunately, I'm currently in another country, away from my home-NAS server. So all physical maintenance, including pushing the reset button, is done by friends living in the apartment. And there is not much physical examination that can be done this way.

At one point in time recently (during a scrub in January), one of the disks got lost and was not seen by the motherboard even after reboots, so I had my friends take out and replug the SATA cables. This helped, so connector noise was possibly the root cause. It might also account for an incorrect address for a certain write that slashed randomly on the platter.

The PSU is excessive for the box's requirements, with slack performance to degrade ;) The P4 CPU is not overclocked. RAM is non-ECC, and that is not changeable given the Intel CPU, chipset and motherboard. The HDDs are on the MB's controller. The 6 HDDs in the raidz2 pool are consumer-grade SATA Seagate ST2000DL003-9VT166, firmware CC32. Degrading cabling and/or connectors can indeed be one of about two main causes, the other being non-ECC RAM. Or an aging CPU.

>> 3) The file in question was created on a dataset with deduplication enabled, so at the very least the dedup bit was set on the corrupted block's pointer and a DDT entry likely existed. Attempts to rewrite the block with the original one (having "dedup=on") failed in fact, probably because the matching checksum was already in the DDT.
>
> Works as designed.

If this is the case, the design does not account for finding an error (ZFS found it first, not I) - and still trusts the DDT entry although it points to garbage now. The design should be fixed then ;)

>> Rewrites of such blocks with "dedup=off" or "dedup=verify" succeeded.
>>
>> Failure/success was tested by "sync; md5sum FILE" some time after the fix attempt. (When done just after the fix, the test tends to return success even if the on-disk data is bad, "thanks" to caching.)
>
> No, I think you've missed the root cause. By default, data that does not match its checksum is not used.

I suspected that reads subsequent to a write (with dedup=on) came from the ARC cache. At least, it was my expectation that, like on many other caching systems, recent writes' buffers are moved into the read cache for the addresses in question. Thus, if the system only incremented the DDT counter, it has no information that the on-disk data mismatched the checksum.

Anyhow, the fact is that in this test case, when I read the "fixed" local files just after the rsync from a good remote source, I got no IO errors and correct md5sums. When I re-read them after a few minutes, the IO errors were there again. With "dedup=off" and "dedup=verify" the errors did not return after a while. I have explained my interpretation of these facts based on my current understanding of the system; hopefully you have some better-informed explanation ;)

>> My last attempt was to set "dedup=on" and write the block again and sync; the (remote) computer hung instantly :(
>>
>> 3*) The RFE stands: deduped blocks found to be invalid and not recovered by redundancy should somehow be evicted from the DDT (or marked for required verification-before-write) so as not to pollute further writes, including repair attempts. Alternatively, "dedup=verify" takes care of the situation and should be the recommended option.
>
> I have lobbied for this, but so far people prefer performance to dependability.

At the very least, in the docs outlining deduplication, this possible situation could be marked as an incentive to use "verify". And if errors are found (by scrub/read), something should be done - like forcing "verify"? Or at least suggesting that? Logged in the illumos bugtracker [1].

[1] https://www.illumos.org/issues/1981

Regarding performance... it seems that some of the design decisions were influenced by certain customers and their setups. There's nothing inherently wrong with tuning the system (by default) with reference to real-life situations, until such cases dictate the one-and-only required policy for everybody else. Like with the vdev-prefetch code slated for removal [2] - not every illumos user has hundreds of disks on memoryless head nodes. And those who don't might have more benefit from prefetch than they lose by dedicating a few MB of RAM to that cache. So I think for that case it was sufficient to have the zeroed cache size by default, while leaving the ability to use more if we desire. Perhaps there is even room for vdev-prefetching improvement, also logged in the bugtracker [2],[3] ;)

[2] https://www.illumos.org/issues/175
[3] https://www.illumos.org/issues/2017

>> 3**) It was suggested to set "dedupditto" to small values, like "2". My oi_148a refused to set values smaller than 100. Moreover, it seems reasonable to have two dedupditto values: for example, to make a ditto copy when the DDT reference counter exceeds some small value (2-5), and add ditto copies every "N" values for frequently-referenced data (every 64-128).

Also logged in the bugtracker [4] ;) (A small sketch of such a policy follows this message.)

[4] https://www.illumos.org/issues/2016

>> 4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum.
>
> All checksum mismatches are handled the same way.
>
>> I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.
>
> Checksum mismatches are logged; what was your root cause?

As written above, at least for one case it was probably a random write by a disk over existing sectors, invalidating the block.

I have yet to test (to be certain) whether writing over a block which is invalid on disk and marked as deduped, with dedup=verify, would increase the CKSUM counter. Waiting for friends to reboot that computer now to make the test ;(

Still, according to "Works as designed" above, logging the mismatch so far has no effect on not-using the old DDT entry pointing to corrupt data. Just in case, logged as https://www.illumos.org/issues/2015

>> 5) It seems like a worthy RFE to include a pool-wide option to "verify-after-write/commit" - to test that recent TXG sync data has indeed made it to disk on (consumer-grade) hardware into the designated sector numbers. Perhaps the test should be delayed several seconds after the sync writes.
>
> There are highly-reliable systems that do this in the fault-tolerant market.

Do you want to sell your systems (and/or OS), or others to fill this niche? ;) At least, having the option for such checks (as much as the disks would obey it) is better than not having it. Whether to enable it is another matter. I logged an RFE in the illumos bugtracker yesterday [5], lest the idea be forgotten, at the very least...

[5] https://www.illumos.org/issues/2008

And, ummm, however much you dislike tunables, I think this situation calls for an on-off switch and two code paths, because, like dedup=verify, this feature will by design deal a heavy hit to performance. Still, one size cannot fit all. Those who like purple likely won't wear white ;) Some want performance; others just want their photo archive to survive the decade, and say 10MBps would suffice to copy their new set of pictures from the CF card ;)

If the verification code gets into the OS, it can be turned on or off by a tunable (switch) depending on the other components' {hardware} reliability and performance requirements. Perhaps we should trust consumer drives less than enterprise ones (a couple of orders of magnitude in UBER difference, better materials, more in-factory testing) and request such verification only for the former?.. That's not up to us to decide (one size for all), but up to end-users or their integrators.

>> Sad that nobody ever contradicted that (mis)understanding of mine.
>
> Perhaps some day you can become a ZFS guru, but the journey is long...

Well, I do look forward to that, and hope I can learn from the likes of you. And there are not many "ZFS under the hood" type of textbooks out there. So I go on asking and asking ;)

//Jim Klimov
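A tiny sketch of the two-value dedupditto policy proposed in [4] above. This is not current ZFS behaviour (today there is a single dedupditto cutoff, which the pool above refused to set below 100), and the default thresholds here are invented examples:

def desired_copies(refcount, first_ditto=3, every_n=100, max_copies=3):
    # How many copies of a deduped block we would like to keep, given how many
    # block pointers reference it. Thresholds are made-up examples.
    copies = 1
    if refcount >= first_ditto:       # early extra copy for mildly popular blocks
        copies += 1
    copies += refcount // every_n     # another copy every N references
    return min(copies, max_copies)

# e.g. desired_copies(2) == 1, desired_copies(5) == 2, desired_copies(250) == 3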
2012-01-23 18:25, Jim Klimov wrote:
>>> 4) I did not get to check whether "dedup=verify" triggers a checksum-mismatch alarm if the preexisting on-disk data does not in fact match the checksum.
>>
>> All checksum mismatches are handled the same way.
>
> I have yet to test (to be certain) whether writing over a block which is invalid on disk and marked as deduped, with dedup=verify, would increase the CKSUM counter.

I checked (oi_148a LiveUSB), by writing a correct block (128KB) instead of the corrupted one into the file, and:

* "dedup=on" neither fixed the on-disk file nor logged an error, and subsequent reads produced IO errors (and increased the counter). Probably just the DDT counter was increased during the write (that's the "works as designed" part);

* "dedup=verify" *doesn't* log a checksum error if it finds a block whose assumed checksum matches the newly written block, but whose contents differ from the new block during dedup verification and in fact do not match the checksum either (at least, not the one in the block pointer). Reading the block produced no errors;

* what's worse, re-enabling "dedup=on" and writing the same block again crashes (reboots) the system instantly. Possibly because now there are two DDT entries pointing to the same checksum in different blocks, and no verification was explicitly requested?

A reenactment of the test (as a hopefully reproducible test case) constitutes the remainder of the post, and thus it is going to be lengthy... Analyze that! ;)

>>> I think such an alarm should exist and do as much as a scrub, a read or other means of error detection and recovery would.

Statement/desire still stands.

>> Checksum mismatches are logged,

No, they are not (in this case).

>> what was your root cause?

Probably the same as before - some sort of existing on-disk data corruption which overwrote some sectors, and raidz2 failed to reconstruct the stripe. I seem to have had about a dozen such files. I fixed some by rsync with different dedup settings, before going into it all deeper. I am not sure if any of them had overlapping DVAs (those which remain corrupted now don't), but many addresses lie in very roughly similar address ranges (within several GBs or so).

> As written above, at least for one case it was probably a random write by a disk over existing sectors, invalidating the block.
>
> Still, according to "Works as designed" above, logging the mismatch so far has no effect on not-using the old DDT entry pointing to corrupt data.
>
> Just in case, logged as https://www.illumos.org/issues/2015

REENACTMENT OF THE TEST CASE

Besides illustrating my error for those who decide to take on the bug, I hope this post will also help others in their data recovery attempts, zfs research, etc. If my methodology is faulty, I hope someone points that out ;)

1) Uh, I have unrecoverable errors!

The computer was freshly rebooted, pool imported (with rollback), no newly known CKSUM errors (but we have the nvlist of existing mismatching files):

# zpool status -vx
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 244K in 138h57m with 31 errors on Sat Jan 14 01:50:16 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        ...
        pool/mymedia:/incoming/DSCF1234.AVI

NOTE: I don't yet have full detail on the <metadata>:<0x0> error; I have asked about it numerous times on the list.

2) Mine some information about the file and error location

* mount the dataset
# zfs mount pool/mymedia

* find the inode number
# ls -i /pool/mymedia/incoming/DSCF1234.AVI
6313 /pool/mymedia/incoming/DSCF1234.AVI

* dump ZDB info
# zdb -dddddd pool/mymedia 6313 > /tmp/zdb.txt

* find the bad block offset
# dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/dev/null \
    bs=512 conv=noerror,notrunc
dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
58880+0 records in
58880+0 records out
30146560 bytes (30 MB) copied, 676.772 s, 44.5 kB/s
(error repeated 256 times)
239145+1 records in
239145+1 records out
122442738 bytes (122 MB) copied, 2136.19 s, 57.3 kB/s

So the error is at offset 58880*512 bytes = 0x1CC0000, and its size is 512b*256 = 128KB.

3) Review the /tmp/zdb.txt information

We need the L0 entry for the erroneous block and its parent L1 entry:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
      6313    3    16K   128K   117M   117M  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 935
...
Indirect blocks:
         0 L2    0:ac25a001000:3000 0:927ba8e7000:3000 4000L/400P F=936 B=325191092/325191092
         0 L1    0:ac23ecbf000:6000 0:927b38c1000:6000 4000L/1800P F=128 B=325191082/325191082
...
   1000000 L1    0:ac243ea6000:6000 0:927b474c000:6000 4000L/1800P F=128 B=325191084/325191084
   1cc0000 L0    0:aca016fa000:30000 20000L/20000P F=1 B=325191083/325191083
...

For readers not in the know: ZFS indirect blocks at level L0 address the "userdata" blocks. Indirect blocks are stored in sets of up to 128 entries, so if more blocks are needed to store the file, there appears a tree of indirect blocks, up to 6 levels deep or so. That's a lot of possibly addressable userdata blocks (2^64, they say, IIRC).

In my case, the file is comprised of 936 "userdata" blocks (numbered from 0 to 935 = maxblkid) and requires an L2 tree to store it. The file is conveniently made of 128KB blocks sized 0x20000 bytes (logical and physical userdata, no compression here), taking up 0x30000 bytes for the allocation of the data on disk (including raidz2 redundancy on 6 disks).

Each L1 metadata block covers 0x1000000 bytes of this file, so there are 8 L1 blocks and one L2 block to address them.

Our L0 block starting at 0x1CC0000 is number 0x66 (102) in its parent L1 block starting at 0x1000000:
(0x1CC0000 - 0x1000000) / 0x20000 = 0x66 = 102

Its L0 metadata starts at byte offset 0x3300 in the L1 metadata block, since all blkptr_t entries in Ln are 0x80 (128) bytes long:
0x66 * 0x80 = 0x3300 = 13056 bytes

* Extract the raw L1 information - for checksum information, compression and dedup flags, etc. The metadata is compressed (as seen via 4000L/1800P in zdb.txt), so we decompress it on the fly:
# zdb -R pool 0:ac243ea6000:1800:dr > /tmp/l1
Found vdev type: raidz

* Confirm that we can't read the block directly:
# dd if=/pool/mymedia/incoming/DSCF1234.AVI of=/tmp/blk-bad.bin \
    bs=131072 count=1 skip=230
dd: reading `/pool/mymedia/incoming/DSCF1234.AVI': I/O error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0995645 s, 0.0 kB/s

* Dump the (corrupt) raw block data with ZDB, uncompressed:
# zdb -R pool 0:aca016fa000:20000:r > /tmp/blk-bad.bin
Found vdev type: raidz

* Use a "dd" command similar to the one above to extract the same block offset and size from the known-good copy of the file, saved as /tmp/blk-orig.bin.

4) Inspect the L1 entry

NOTES: See the definition of the blkptr_t structure here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/sys/spa.h#blkptr
and some more details here:
http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/ondiskformat0822.pdf

* Dump the blkptr_t entry 102 (offset 0x3300) for inspection:
# od -Ax -tx1 /tmp/l1 | egrep '^0033' | head -8
003300 80 01 00 00 00 00 00 00 d0 b7 00 65 05 00 00 00
003310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003330 ff 00 ff 00 02 08 13 c0 00 00 00 00 00 00 00 00
003340 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
003350 ab 05 62 13 00 00 00 00 01 00 00 00 00 00 00 00
003360 14 ad 26 24 f2 89 8e 28 f1 ae 5d 99 2d 3b 3c 58
003370 c1 91 74 21 09 28 48 5b 55 6f b4 8f 7e ed 3a 52

Cutting to the chase, I'm interested in these bytes now:
0x3334  02  compression = off
0x3335  08  checksum algo = sha256
0x3336  13  data type = 19 = file data (DMU_OT_PLAIN_FILE_CONTENTS)
0x3337  c0  11000000 binary; bits: endianness=1, dedup=1

Bytes 0x3360-0x337f contain the checksum value in an inverted form of 4 64-bit words. Read right-to-left in four 8-byte groups (3367-3360, 336f-3368, etc.). Most of the other information about the userblock was presented by ZDB in its report about the L0 entry (converted DVA, sizes, TXGs, etc.).

Checksum value in the correct order of bytes:
28 8e 89 f2 24 26 ad 14
58 3c 3b 2d 99 5d ae f1
5b 48 28 09 21 74 91 c1
52 3a ed 7e 8f b4 6f 55

5) Test the checksums

# openssl dgst -sha256 < /tmp/blk-orig.bin
288e89f22426ad14583c3b2d995daef15b482809217491c1523aed7e8fb46f55
# openssl dgst -sha256 < /tmp/blk-bad.bin
ab34939a6342e3f353927a395a04aa37e50f1e74e7a9b993e731c790ba554a59

As we see, the checksums differ, and the "orig" one matches the checksum value in the L1 block above. So that copy is good indeed.

6) Compare the good and bad blocks

* Convert the files into readable forms:
# od -Ax -tx1 /tmp/blk-bad.bin > /tmp/blk-bad.txt
# od -Ax -tx1 /tmp/blk-orig.bin > /tmp/blk-orig.txt

* Compare the files:
# diff -bu /tmp/blk-*txt | less
--- /tmp/blk-bad.txt    2012-01-23 22:20:56.669957059 +0000
+++ /tmp/blk-orig.txt   2012-01-23 22:20:42.886726737 +0000
@@ -6142,135 +6142,262 @@
 017fd0 bc ad fb 94 39 7d 9f e6 8b e8 ab 92 5a e5 70 b2
 017fe0 6e ed 47 e7 eb 9d e4 ed 9a 8c dd 13 66 79 aa 9e
 017ff0 8a cb d5 b4 59 f5 33 cc a8 b3 0f 52 86 64 e9 9e
-018000 00 02 00 69 f8 82 00 00 00 a8 02 00 8a 10 08 7e
-018010 10 08 a5 10 08 2a e4 10 08 d2 10 08 9f 08 08 05
-018020 00 54 1a f9 0c 38 32 10 08 29 10 08 8f 15 10 08
(...)
-0187e0 00 5b 54 00 83 00 00 00 00 00 00 00 00 00 00 00
-0187f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
-*
+018000 d4 c3 94 06 b3 9a dc 89 26 24 c9 36 15 91 c2 73
+018010 1f 48 06 04 b7 57 17 d6 57 00 d7 b3 03 40 2b c1
(...)
+018fe0 32 b3 50 8c e8 c6 77 69 4e 54 7d 55 9e b7 d5 ce
+018ff0 a7 23 21 41 0c ac ea 11 82 e0 86 da 6a 62 ef a6
 019000 28 1c 74 02 66 d4 7a 21 0b 86 97 08 e0 13 bd 2a
(...)

So here we had one 4KB streak of unexpected data. No particular (ir)regularities were noticed in this case, just one block of "white noise" replaced by another.

===========================================================

So, by this time we have:
* known-good and proven-bad copies of the block;
* the known offset and size of this block in the file;
* ZDB info on the layout of the file;
* a corrupt file in a deduped dataset;
* an unanswered question: how do dedup and errors interact?

Time to squash some bugs!

1) What type of dedup was on the dataset?

# zfs get dedup pool/mymedia
NAME          PROPERTY  VALUE          SOURCE
pool/mymedia  dedup     sha256,verify  local

2) How many CKSUM errors do we have now?

# zpool status -v pool | grep ONLINE
        pool        ONLINE       0     0     8
          raidz2-0  ONLINE       0     0     8
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c6t2d0  ONLINE       0     0     0
            c6t3d0  ONLINE       0     0     0
            c6t4d0  ONLINE       0     0     0
            c6t5d0  ONLINE       0     0     0

OK, by the time I got here, the pool had logged hitting CKSUM errors 8 times, but none definitely on any certain disk.

3) Let's try "fixing" the file with "dedup=on":

* Set the dedup mode on the dataset:
# zfs set dedup=on pool/mymedia

* Overwrite the byte range in the file:
# dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
1+0 records in
1+0 records out
131072 bytes (131 kB) copied, 0.000837707 s, 156 MB/s
# sleep 10; sync

* Did the known CKSUM error count increase? No, it did not:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0     8
          raidz2-0  ONLINE       0     0     8

* Try reading the file:
# md5sum /pool/mymedia/incoming/DSCF1234.AVI
md5sum: /pool/mymedia/incoming/DSCF1234.AVI: I/O error

* Did the known CKSUM error count increase? Yes, it did now:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0    10
          raidz2-0  ONLINE       0     0    10

So, in fact, this did not fix the file. Did anything change with the blockpointer tree?

* dump ZDB info:
# zdb -dddddd pool/mymedia 6313 > /tmp/zdb-2.txt

* compare the two info files:
# diff /tmp/zdb*.txt
<          0 L2    0:6a01ccb6000:3000 0:2902c091000:3000 4000L/400P F=688 B=325617005/325617005
>          0 L2    0:ac25a001000:3000 0:927ba8e7000:3000 4000L/400P F=936 B=325191092/325191092
<    1000000 L1    0:6a01ccb0000:6000 0:2902c05b000:6000 4000L/1800P F=128 B=325617005/325617005
>    1000000 L1    0:ac243ea6000:6000 0:927b474c000:6000 4000L/1800P F=128 B=325191084/325191084
<    1cc0000 L0    0:aca016fa000:30000 20000L/20000P F=1 B=325617005/325191083
>    1cc0000 L0    0:aca016fa000:30000 20000L/20000P F=1 B=325191083/325191083

So, the on-disk L0-block DVA address did not change, only a TXG entry - and accordingly the indirect blockpointer tree. Since the DVA remains unchanged, and ZFS's COW does not let it overwrite existing data, this means that the block on disk remains corrupted. As we have seen with md5sum above.

4) Now let's retry with "dedup=verify":

* Set the dedup mode on the dataset:
# zfs set dedup=sha256,verify pool/mymedia

* Overwrite the byte range in the file:
# dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
1+0 records in
1+0 records out
131072 bytes (131 kB) copied, 0.00109255 s, 120 MB/s
# sleep 10; sync

* Did the known CKSUM error count increase? No, it did not:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0    10
          raidz2-0  ONLINE       0     0    10

* Try reading the file:
# md5sum /pool/mymedia/incoming/DSCF1234.AVI
731247e7d71951eb622b979bc18f3caa  /pool/mymedia/incoming/DSCF1234.AVI

* Did the known CKSUM error count increase? No, it still did not:
# zpool status -v pool | grep ONLINE | grep -v c6
        pool        ONLINE       0     0    10
          raidz2-0  ONLINE       0     0    10

So, in fact, this did fix the file. We can even attest to that with ZDB: the tree of indirect blocks has changed (levels L2, L1 and L0 of our branch, but not the L1 of another branch):

Indirect blocks:
         0 L2    0:6a27641d000:3000 0:2992ca90000:3000 4000L/400P F=936 B=325619762/325619762
         0 L1    0:ac23ecbf000:6000 0:927b38c1000:6000 4000L/1800P F=128 B=325191082/325191082
   1000000 L1    0:6a19c140000:6000 0:299204f9000:6000 4000L/1800P F=128 B=325619762/325619762
   1cc0000 L0    0:6a7df3b2000:30000 20000L/20000P F=1 B=325619762/325619762

But the CKSUM errors were not logged as such during dedup-verify - even though the block read from disk mismatched the checksum in the block pointer (and presumably in the DDT).

5) Now for the interesting part: retry with "dedup=on"

* Set the dedup mode on the dataset:
# zfs set dedup=on pool/mymedia

* Overwrite the byte range in the file:
# dd if=/tmp/blk-orig.bin of=/pool/mymedia/incoming/DSCF1234.AVI \
    bs=131072 count=1 seek=230 conv=notrunc
1+0 records in
1+0 records out
131072 bytes (131 kB) copied, 0.000873926 s, 150 MB/s

The command prompt never returned; the host rebooted "instantly". According to another terminal logging "iostat -xnz 1", the new writes never made it to disk. After the reboot the pool imported cleanly, and the file's md5sum matches. The fix from dedup=verify remains in place...

HTH and thanks,
//Jim Klimov
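For anyone repeating this procedure, the by-hand inspection in step 4 can be automated with a few lines of Python. The sketch below relies only on the offsets worked out above (the property bytes at 0x34..0x37 of a 128-byte blkptr_t and the SHA256 stored as four little-endian 64-bit words at 0x60..0x7f); treat those offsets and the dedup-bit mask as assumptions to be re-checked against spa.h, and the script name is of course made up.

#!/usr/bin/env python3
# bpcheck.py - decode one blkptr_t from a decompressed L1 dump (zdb -R ...:dr)
# and compare its stored SHA256 against a candidate block read from a file.
# Offsets follow the manual analysis in this thread; double-check against spa.h.
import hashlib
import struct
import sys

BP_SIZE = 0x80  # each blkptr_t is 128 bytes; a 16KB L1 block holds 128 of them

def read_blkptr(l1_path, index):
    with open(l1_path, 'rb') as f:
        f.seek(index * BP_SIZE)
        return f.read(BP_SIZE)

def decode(bp):
    comp, cksum_algo, dtype, flags = bp[0x34], bp[0x35], bp[0x36], bp[0x37]
    # Stored checksum: four 64-bit little-endian words; re-pack big-endian to
    # get the same hex string that openssl dgst -sha256 prints.
    words = struct.unpack('<4Q', bp[0x60:0x80])
    digest = b''.join(struct.pack('>Q', w) for w in words).hex()
    return {'compression': comp, 'checksum_algo': cksum_algo, 'type': dtype,
            'dedup': bool(flags & 0x40),  # assumed dedup bit, per the 0xc0 byte above
            'sha256': digest}

if __name__ == '__main__':
    l1_dump, index, data_file = sys.argv[1], int(sys.argv[2], 0), sys.argv[3]
    info = decode(read_blkptr(l1_dump, index))
    actual = hashlib.sha256(open(data_file, 'rb').read()).hexdigest()
    print('blkptr fields:', info)
    print('stored sha256:', info['sha256'])
    print('actual sha256:', actual)
    print('MATCH' if actual == info['sha256'] else 'MISMATCH')

With the artifacts gathered above, something like "python3 bpcheck.py /tmp/l1 0x66 /tmp/blk-orig.bin" should report MATCH, and MISMATCH when pointed at /tmp/blk-bad.bin.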