Most ZFS recovery problems seem to stem from ZFS's own strict insistence that data be consistent with its corresponding checksum, which of course is good when consistent data can be recovered from somewhere else, but catastrophic otherwise. It therefore seems clear that ZFS should support an inherent worst-case recovery mechanism that brings as much of the file system back online as possible, with speculatively recovered files/blocks marked as potentially compromised so that they may be further scrutinized later as desired.

When inconsistent data is returned from storage without any other error, it seems likely that the data was subject to a soft error somewhere in its journey, so it seems reasonable (in order) that:

- first, both the presumed checksum/index blocks and the data should be re-read, in case the actual error occurred during or after retrieval from storage.

- if that doesn't work, the corruption most likely occurred before the data was stored (since the error detection/correction schemes used by disk drives are fairly good at not misidentifying corrupted data as good); it is then a good bet that either the parent or the child of the blocks containing the checksum and the data holds a single-bit error, which may be recoverable by iterating through all possible 1-bit differences in the checksum and data, or in the block pointers and corresponding child blocks, to see whether any candidate satisfies the recomputed checksum (a rough sketch of this brute-force search follows this list), and correspondingly marking the nodes so that a more comprehensive file-system consistency check can be performed later.

- although errors may have occurred that caused the wrong blocks to be written, and/or multi-bit errors may have occurred in transit, it seems unlikely to be worth exhaustively searching further for candidates; it is likely best to simply mark the terminal block and the corresponding parent file as likely corrupt and allow some other tool to attempt user-piloted file-fragment recovery.

ZFS's stringent consistency requirements are very nice, but as data is subject to soft errors throughout its transport/storage/use, a file system must be capable of at least attempting to recover what is reasonably recoverable, as sh*t will always happen, and catastrophic failure should be avoidable at all reasonable costs.
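To illustrate, a rough sketch (not actual ZFS code) of the brute-force single-bit search described above; cksum_t and the checksum callback are hypothetical stand-ins for whatever checksum the caller actually uses:

/*
 * Illustrative sketch only -- not ZFS code.  Given a block whose checksum
 * does not match, try every single-bit flip and see whether any flipped
 * copy satisfies the expected checksum.  cksum_t here resembles a 256-bit
 * zio_cksum_t, but the type and callback are hypothetical.
 */
#include <stdint.h>
#include <string.h>

typedef struct {
	uint64_t word[4];		/* 256-bit checksum value */
} cksum_t;

typedef void (*cksum_func_t)(const void *buf, size_t len, cksum_t *out);

static int
cksum_equal(const cksum_t *a, const cksum_t *b)
{
	return (memcmp(a, b, sizeof (cksum_t)) == 0);
}

/*
 * Returns the bit index that was flipped to repair the block (leaving the
 * repaired data in buf), or -1 if no single-bit flip satisfies the expected
 * checksum.  Cost is one checksum computation per bit in the block, so this
 * is strictly a last-resort path before declaring the block unrecoverable.
 */
static long
try_single_bit_repair(uint8_t *buf, size_t len, const cksum_t *expected,
    cksum_func_t cksum)
{
	cksum_t actual;

	for (size_t byte = 0; byte < len; byte++) {
		for (int bit = 0; bit < 8; bit++) {
			buf[byte] ^= (uint8_t)(1 << bit);	/* flip candidate bit */
			cksum(buf, len, &actual);
			if (cksum_equal(&actual, expected))
				return ((long)(byte * 8 + bit));
			buf[byte] ^= (uint8_t)(1 << bit);	/* restore */
		}
	}
	return (-1);
}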
Anton B. Rang
2008-Aug-12 06:48 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
That brings up another interesting idea.

ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.

If 20-odd bits of that were a Hamming code, you'd have something slightly stronger than SECDED, and ZFS could correct any single-bit errors encountered.

This could be done without changing the ZFS on-disk format.
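A minimal sketch of that idea (illustrative only, not ZFS code, and a simplified separate-check-word form rather than a true embedded Hamming code): store the XOR of the bit-indices of all set bits (20 bits for a 2^20-bit block) plus one overall parity bit, i.e. the "20-odd bits" above. The difference between the stored and recomputed values locates a single flipped bit, and the parity bit distinguishes single-bit (correctable) from double-bit (detectable only) errors. It assumes the stored check bits are themselves protected elsewhere, as a checksum embedded in a block pointer would be by the parent block's own checksum.

#include <stdint.h>
#include <stddef.h>

typedef struct {
	uint32_t syndrome;	/* XOR of bit-indices of all set bits */
	uint32_t parity;	/* overall parity of the block        */
} secded_t;

static secded_t
secded_compute(const uint8_t *buf, size_t len)
{
	secded_t s = { 0, 0 };

	for (size_t byte = 0; byte < len; byte++) {
		for (int bit = 0; bit < 8; bit++) {
			if (buf[byte] & (1 << bit)) {
				s.syndrome ^= (uint32_t)(byte * 8 + bit);
				s.parity ^= 1;
			}
		}
	}
	return (s);
}

/* Returns 0 if clean, 1 if a single bit was corrected, -1 if uncorrectable. */
static int
secded_check_and_repair(uint8_t *buf, size_t len, const secded_t *stored)
{
	secded_t now = secded_compute(buf, len);
	uint32_t diff = now.syndrome ^ stored->syndrome;

	if (diff == 0 && now.parity == stored->parity)
		return (0);				/* consistent */
	if (now.parity != stored->parity && diff < len * 8) {
		/* single data-bit error: diff is the index of the flipped bit */
		buf[diff / 8] ^= (uint8_t)(1 << (diff % 8));
		return (1);
	}
	return (-1);				/* double-bit error: detect only */
}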
Mario Goebbels (iPhone)
2008-Aug-12 07:35 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
I suppose an error-correcting code like a 256-bit Hamming or Reed-Solomon code can't substitute as a reliable checksum at the level of the default Fletcher2/4? If it can, it could be offered as an alternative algorithm where necessary and ZFS could react accordingly, or not?

Regards,
-mg

On 12-août-08, at 08:48, "Anton B. Rang" <rang at acm.org> wrote:

> That brings up another interesting idea.
>
> ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.
>
> If 20-odd bits of that were a Hamming code, you'd have something
> slightly stronger than SECDED, and ZFS could correct any single-bit
> errors encountered.
>
> This could be done without changing the ZFS on-disk format.
Richard Elling
2008-Aug-12 14:29 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
Anton B. Rang wrote:

> That brings up another interesting idea.
>
> ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.
>
> If 20-odd bits of that were a Hamming code, you'd have something slightly
> stronger than SECDED, and ZFS could correct any single-bit errors encountered.

Yes. But I'm not convinced that we will see single-bit errors, since there is already a great deal of single-bit-error detection and (often) correction capability in modern systems. It seems that when we lose a block of data, we lose more than a single bit.

It should be relatively easy to add code to the current protection schemes which will compare a bad block to a reconstructed, good block and deliver this information for us. I'll add an RFE.
 -- richard
Although I don't know for sure that most such errors are in fact single-bit in nature, I can only surmise that statistically they most likely are, absent detection otherwise. With the exception of error-corrected memory systems and/or checksummed communication channels, each transition of data between hardware interfaces at ever-increasing clock rates correspondingly increases the probability of an otherwise undetectable soft single-bit error being injected at those boundaries. Although the probability of such an error is small enough that it is not easily detectable or classifiable as a hardware failure, over the course of days/weeks/years and trillions of bits such errors will nonetheless be observable, and should be expected and planned for within reason.

Utilizing a strong error-correcting code in combination with, or in lieu of, a strong hash code would seem like a good way to more strongly warrant that data's in-memory representation at the time of its computation is resilient to transmission and subsequent retrieval. But I suspect that as technology continues to push clock rates and data pool sizes ever higher, some form of uniform data-integrity mechanism will need to be incorporated within all the processing and communications-interface data paths of a system in order to improve data's resilience to transmission and processing errors, albeit statistically very small for any single bit.

> Anton B. Rang wrote:
> > That brings up another interesting idea.
> >
> > ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.
> >
> > If 20-odd bits of that were a Hamming code, you'd have something slightly
> > stronger than SECDED, and ZFS could correct any single-bit errors encountered.
>
> Yes. But I'm not convinced that we will see single-bit errors, since
> there is already a great deal of single-bit-error detection and (often)
> correction capability in modern systems. It seems that when we lose
> a block of data, we lose more than a single bit.
>
> It should be relatively easy to add code to the current protection schemes
> which will compare a bad block to a reconstructed, good block and
> deliver this information for us. I'll add an RFE.
> -- richard
Anton B. Rang
2008-Aug-13 04:14 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
Reed-Solomon could correct multiple-bit errors, but an effective Reed-Solomon code for 128K blocks of data would be very slow if implemented in software (and, for that matter, take a lot of hardware to implement). A multi-bit Hamming code would be simpler, but I suspect that undetected multi-bit errors are quite rare.

I've seen a fair number of single-bit errors coming from SATA drives because the data is often not parity-protected through the whole data path within the drive. Some enterprise-class SATA disks have data protected (with a parity-equivalent) through the write data path, and more of these models will have this feature soon. All SAS and FibreChannel drives (that I am aware of) have data protected with ECC through the whole path for both reads and writes.

Single-bit errors can also be introduced in non-ECC DRAM, of course. In this case, it can happen either before the checksum computation (=> undetected data corruption) or after it (=> checksum failure on a later read).
Given that the checksum algorithms utilized in ZFS are already fairly CPU intensive, I can't help but wonder, if it is verified that a majority of checksum-inconsistency failures are indeed single-bit, whether it may be advantageous to utilize some computationally simpler hybrid checksum/Hamming code (as you've suggested). Although such a hybrid would not detect as high a percentage of all possible failures, it would be capable of correcting the theoretical majority of them while retaining the ability to detect a large majority of the remaining possible errors (which are correspondingly known to occur less frequently), and ideally would consume no more than the existing checksum algorithm's overhead, while simultaneously improving the apparent resilience of even non-redundantly-configured storage devices.

(Although I confess I haven't done such an analysis yet, I suspect someone more intimately familiar with error detection/correction algorithm implementations and trade-offs may have some interesting suggestions, as having a strong detection capability without the ability to recover what may otherwise be easily recoverable, in lieu of potentially catastrophic data loss, does not seem reasonable.)
Bob Friesenhahn
2008-Aug-13 19:01 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
On Wed, 13 Aug 2008, paul wrote:

> Given that the checksum algorithms utilized in zfs are already fairly CPU intensive, I
> can't help but wonder if it's verified that a majority of checksum inconsistency failures
> appear to be single bit; if it may be advantageous to utilize some computationally

The default checksum algorithm used by zfs is not very CPU intensive. The actual overhead is easily determined by testing.

Given the many hardware safeguards against single (and several) bit errors, the most common data error will be large. For example, the disk drive may return data from the wrong sector.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
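For reference, a simplified sketch modeled on ZFS's default fletcher_2 algorithm (illustrative only, not the actual zfs_fletcher.c code; it assumes the buffer length is a multiple of 16 bytes and ignores byte-order handling). The inner loop is roughly two 64-bit additions per 8-byte word, which is why its CPU overhead is low compared to a cryptographic hash:

#include <stdint.h>
#include <stddef.h>

typedef struct {
	uint64_t word[4];
} cksum256_t;

static void
fletcher2_sketch(const void *buf, size_t size, cksum256_t *out)
{
	const uint64_t *ip = buf;
	const uint64_t *ipend = ip + (size / sizeof (uint64_t));
	uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;

	/* Two interleaved sum / sum-of-sums accumulator pairs over 64-bit words. */
	for (; ip < ipend; ip += 2) {
		a0 += ip[0];
		a1 += ip[1];
		b0 += a0;
		b1 += a1;
	}

	out->word[0] = a0;
	out->word[1] = a1;
	out->word[2] = b0;
	out->word[3] = b1;
}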
Bob wrote:

> ... Given the many hardware safeguards against single (and several) bit errors,
> the most common data error will be large. For example, the disk drive may
> return data from the wrong sector.

- Actually, the data-integrity check bits that may exist within memory systems and/or communication channels are rarely propagated beyond their boundaries; data is therefore subject to corruption at every such interface traversal, including, for example, during the simple process of being read and re-written by the CPUs anywhere within the system that touches the data, including within the disk drive itself. (Unless a machine with error-detecting/correcting memory is itself detecting uncorrectable 2-bit errors, which should kill the process being run, there's no real reason to suspect that 3-or-more-bit errors are sneaking through with any measurable frequency, although it's possible.)

- Personally, I believe that errors such as the wrong sectors being written or read are themselves most likely due to single-bit errors propagating into critical things like sector-address calculations, and thereby ultimately expressing themselves as large, obvious errors although actually caused by more subtle ones. Shy of extremely noisy hardware and/or literal hard failure, most errors will most likely always be expressed as 1 bit out of some very large N number of bits.
Bob Friesenhahn
2008-Aug-13 21:08 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
On Wed, 13 Aug 2008, paul wrote:

> Shy of extremely noisy hardware and/or literal hard failure, most
> errors will most likely always be expressed as 1 bit out of some
> very large N number of bits.

This claim ignores the fact that most computers today are still based on synchronously clocked parallel bus hardware. A common failure mode is clock skew, which causes many bits to be wrong at once. This can even happen within the CPU.

As serial interfaces continue to be added to computers, the number of single bit errors (vs multi-bit errors) would tend to increase except for the fact that these serial interfaces are designed to detect and discard erroneous packets.

I do agree that the logic between the self-validating interfaces can be faulty.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2008-Aug-13 21:53 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
paul wrote:

> Bob wrote:
>
>> ... Given the many hardware safeguards against single (and several) bit errors,
>> the most common data error will be large. For example, the disk drive may
>> return data from the wrong sector.
>
> - Actually, the data-integrity check bits that may exist within memory systems and/or
> communication channels are rarely propagated beyond their boundaries; data is therefore
> subject to corruption at every such interface traversal, including, for example, during
> the simple process of being read and re-written by the CPUs anywhere within the system
> that touches the data, including within the disk drive itself. (Unless a machine with
> error-detecting/correcting memory is itself detecting uncorrectable 2-bit errors, which
> should kill the process being run, there's no real reason to suspect that 3-or-more-bit
> errors are sneaking through with any measurable frequency, although it's possible.)
>
> - Personally, I believe that errors such as the wrong sectors being written or read are
> themselves most likely due to single-bit errors propagating into critical things like
> sector-address calculations, and thereby ultimately expressing themselves as large,
> obvious errors although actually caused by more subtle ones. Shy of extremely noisy
> hardware and/or literal hard failure, most errors will most likely always be expressed
> as 1 bit out of some very large N number of bits.

Today, we can detect a large number of these using the current ZFS checksum (by default, fletcher-2). But we don't record the scope of the corruption once we correct the data. I filed RFE 6736986, bitwise failure data collection for zfs. Once implemented, we would get a better idea of how extensive corruption can be, even though the root cause cannot be determined from ZFS -- that would be a job for a different FMA DE.
 -- richard
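For illustration, a hypothetical sketch of the sort of bitwise failure data such an RFE could collect, given the bad buffer as read and a reconstructed good copy (names and layout here are made up for the example, not the RFE's actual design; __builtin_ctz is a GCC/Clang builtin):

#include <stdint.h>
#include <stddef.h>

typedef struct bit_diff_report {
	uint64_t bits_differing;	/* total number of flipped bits        */
	uint64_t first_bad_bit;		/* bit offset of the first difference  */
	uint64_t last_bad_bit;		/* bit offset of the last difference   */
} bit_diff_report_t;

/*
 * Compare a bad block to its reconstructed good copy and record how many
 * bits differ and where they fall, so the scope of the corruption (a single
 * bit vs. a large run) can be reported, e.g. via FMA telemetry.
 */
static void
collect_bit_diff(const uint8_t *bad, const uint8_t *good, size_t len,
    bit_diff_report_t *rep)
{
	rep->bits_differing = 0;
	rep->first_bad_bit = rep->last_bad_bit = 0;

	for (size_t byte = 0; byte < len; byte++) {
		uint8_t x = bad[byte] ^ good[byte];

		while (x != 0) {
			int bit = __builtin_ctz(x);	/* lowest differing bit */
			uint64_t off = (uint64_t)byte * 8 + bit;

			if (rep->bits_differing == 0)
				rep->first_bad_bit = off;
			rep->last_bad_bit = off;
			rep->bits_differing++;
			x &= (uint8_t)(x - 1);		/* clear that bit */
		}
	}
}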
bob wrote:

> On Wed, 13 Aug 2008, paul wrote:
>
>> Shy of extremely noisy hardware and/or literal hard failure, most
>> errors will most likely always be expressed as 1 bit out of some
>> very large N number of bits.
>
> This claim ignores the fact that most computers today are still based
> on synchronously clocked parallel bus hardware. A common failure mode
> is clock skew, which causes many bits to be wrong at once. This can
> even happen within the CPU.

- In my experience, clock skew/drift problems will first manifest themselves as single-bit errors even on parallel interfaces. Although all paths are logically parallel, the actual physical performance of each of the individual transistors and traces composing the data path will be ever so slightly different, and although physical CAD layout tools attempt to balance clock trees, the actual arrival time of the clock at the latch elements of the physical data-path implementation will also differ slightly (often by as much as a few picoseconds). Therefore, as a circuit approaches its maximum frequency threshold (which depends on temperature, age, etc.), some very small number of single-bit errors will begin to be generated, due to setup/hold-time violations on the bit with the least physical clock-skew tolerance; as the clock frequency and/or temperature (etc.) increases, more and more bit paths will begin to fail, until the whole path fails. Since all systems have some bits within their parallel paths that are more sensitive to one type of corruption or another, I tend to believe that single-bit failures will express themselves statistically prior to, and in greater number than, multi-bit failures even while the hardware still seems operable.

> As serial interfaces continue to be added to computers, the number of
> single bit errors (vs multi-bit errors) would tend to increase except
> for the fact that these serial interfaces are designed to detect and
> discard erroneous packets.
>
> I do agree that the logic between the self-validating interfaces can
> be faulty.
>
> Bob
Yes, thank you.
Richard Elling
2008-Aug-14 16:18 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
paul wrote:

> bob wrote:
>
>> On Wed, 13 Aug 2008, paul wrote:
>>
>>> Shy of extremely noisy hardware and/or literal hard failure, most
>>> errors will most likely always be expressed as 1 bit out of some
>>> very large N number of bits.
>>
>> This claim ignores the fact that most computers today are still based
>> on synchronously clocked parallel bus hardware. A common failure mode
>> is clock skew, which causes many bits to be wrong at once. This can
>> even happen within the CPU.
>
> - In my experience, clock skew/drift problems will first manifest themselves
> as single-bit errors even on parallel interfaces. Although all paths are
> logically parallel, the actual physical performance of each of the individual
> transistors and traces composing the data path will be ever so slightly
> different, and although physical CAD layout tools attempt to balance clock
> trees, the actual arrival time of the clock at the latch elements of the
> physical data-path implementation will also differ slightly (often by as much
> as a few picoseconds). Therefore, as a circuit approaches its maximum
> frequency threshold (which depends on temperature, age, etc.), some very small
> number of single-bit errors will begin to be generated, due to setup/hold-time
> violations on the bit with the least physical clock-skew tolerance; as the
> clock frequency and/or temperature (etc.) increases, more and more bit paths
> will begin to fail, until the whole path fails. Since all systems have some
> bits within their parallel paths that are more sensitive to one type of
> corruption or another, I tend to believe that single-bit failures will express
> themselves statistically prior to, and in greater number than, multi-bit
> failures even while the hardware still seems operable.

I'm not convinced, but perhaps it is because of the scar near my left ankle.

Long, long ago... in the SunOS 3.2 days, we had a server with two (!) ethernet interfaces which we used to serve two different subnets (router) in addition to its normal services (NFS, mail, etc.). If the server couldn't service the ethernet interrupts fast enough, the ethernet interface would zero-fill the packets. This is a really bad idea, because the symptom was random zeros intermixed with legitimate data... but only sometimes.

The lesson here is that you are often dealing with firmware or other high-level decisions about what happens to data as it flows through the system, and I doubt very seriously that the firmware developers would just flip a single bit somewhere rather than do something like ZFOD.
 -- richard
I apologize for, in effect, suggesting that which was previously suggested in an earlier thread:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046234.html

In the process I discovered that a feature to attempt worst-case single-bit recovery had apparently already been present in some form in an earlier development version of ZFS, but was removed on the grounds that it masked programming errors (which makes no sense to me); other Sun engineers in that thread likewise recognized that single-bit errors may in fact be injected by various elements of the system that touch the data beyond the drives themselves, such as CPUs, for example.

I don't know where it comes from, but there seems to be a standing assumption that most checksum errors are in fact multi-bit (some statistical testing should be able to determine whether or not this is the case). Personally I suspect the opposite: drives tend to do a fairly good job of identifying uncorrectable data and therefore tend not to erroneously return garbage as good, and the remaining hardware in most systems will not tend to generate sporadic multi-bit errors more frequently than single-bit ones. It is therefore logical to assume that most data corruption originates as single-bit errors, which, if not detected and corrected, may subsequently be used in calculations yielding potentially far more catastrophic results (and likely contribute to some percentage of wrong blocks being read/written, subsequently mistaken for multi-bit data-block errors).

Overall, please reconsider re-incorporating this feature, minimally enabled upon request if not by default. Although worst-case recovery of large-block file data may be resource intensive, it would only be invoked as a last resort, with the alternative being a catastrophic loss of data, which seems wholly unacceptable if the data is in fact recoverable in place.