Most ZFS recovery problems seem to stem from ZFS's own strict insistence that data be consistent with its corresponding checksum, which of course is good when consistent data can be recovered from somewhere else, but catastrophic otherwise. It therefore seems clear that ZFS should support an inherent worst-case recovery mechanism that brings as much of the file system back online as possible, with speculatively recovered files/blocks marked as potentially compromised so that they may be further scrutinized later as desired.

When inconsistent data is returned from storage without any other error, it seems likely that the data was subject to a soft error somewhere in its journey, so it seems reasonable (in order) that:

- first, both the presumed checksum/index blocks and the data should be re-read, in case the actual error occurred during or after retrieval from storage.

- if that doesn't work, the corruption most likely occurred before the data was stored (since the error detection/correction schemes used by disk drives are fairly good at not misidentifying corrupted data as good); it is then a good bet that either the parent or the child of the blocks containing the checksum and the data holds a single-bit error, which may be recoverable by iterating through all possible 1-bit differences in the checksum and data, or in the block pointers and corresponding child blocks, to see whether any candidate satisfies the recomputed checksum (a rough sketch of this brute-force search follows this list), and correspondingly marking the nodes so that a more comprehensive file-system consistency check can be performed later.

- although errors may have occurred that caused the wrong blocks to be written, and/or multi-bit errors may have occurred in transit, it seems unlikely to be worth exhaustively searching further for candidates; it is likely best to simply mark the terminal block and the corresponding parent file as likely corrupt and allow some other tool to attempt user-piloted file-fragment recovery.

ZFS's stringent consistency requirements are very nice, but as data is subject to soft errors throughout its transport/storage/use, a file system must be capable of at least attempting to recover what is reasonably recoverable, as sh*t will always happen, and catastrophic failure should be avoidable at all reasonable costs.
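To illustrate, a rough sketch (not actual ZFS code) of the brute-force single-bit search described above; cksum_t and the checksum callback are hypothetical stand-ins for whatever checksum the caller actually uses:

/*
 * Illustrative sketch only -- not ZFS code.  Given a block whose checksum
 * does not match, try every single-bit flip and see whether any flipped
 * copy satisfies the expected checksum.  cksum_t here resembles a 256-bit
 * zio_cksum_t, but the type and callback are hypothetical.
 */
#include <stdint.h>
#include <string.h>

typedef struct {
	uint64_t word[4];		/* 256-bit checksum value */
} cksum_t;

typedef void (*cksum_func_t)(const void *buf, size_t len, cksum_t *out);

static int
cksum_equal(const cksum_t *a, const cksum_t *b)
{
	return (memcmp(a, b, sizeof (cksum_t)) == 0);
}

/*
 * Returns the bit index that was flipped to repair the block (leaving the
 * repaired data in buf), or -1 if no single-bit flip satisfies the expected
 * checksum.  Cost is one checksum computation per bit in the block, so this
 * is strictly a last-resort path before declaring the block unrecoverable.
 */
static long
try_single_bit_repair(uint8_t *buf, size_t len, const cksum_t *expected,
    cksum_func_t cksum)
{
	cksum_t actual;

	for (size_t byte = 0; byte < len; byte++) {
		for (int bit = 0; bit < 8; bit++) {
			buf[byte] ^= (uint8_t)(1 << bit);	/* flip candidate bit */
			cksum(buf, len, &actual);
			if (cksum_equal(&actual, expected))
				return ((long)(byte * 8 + bit));
			buf[byte] ^= (uint8_t)(1 << bit);	/* restore */
		}
	}
	return (-1);
}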
Anton B. Rang
2008-Aug-12 06:48 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
That brings up another interesting idea.

ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.

If 20-odd bits of that were a Hamming code, you'd have something slightly stronger than SECDED, and ZFS could correct any single-bit errors encountered.

This could be done without changing the ZFS on-disk format.
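A minimal sketch of that idea (illustrative only, not ZFS code, and a simplified separate-check-word form rather than a true embedded Hamming code): store the XOR of the bit-indices of all set bits (20 bits for a 2^20-bit block) plus one overall parity bit, i.e. the "20-odd bits" above. The difference between the stored and recomputed values locates a single flipped bit, and the parity bit distinguishes single-bit (correctable) from double-bit (detectable only) errors. It assumes the stored check bits are themselves protected elsewhere, as a checksum embedded in a block pointer would be by the parent block's own checksum.

#include <stdint.h>
#include <stddef.h>

typedef struct {
	uint32_t syndrome;	/* XOR of bit-indices of all set bits */
	uint32_t parity;	/* overall parity of the block        */
} secded_t;

static secded_t
secded_compute(const uint8_t *buf, size_t len)
{
	secded_t s = { 0, 0 };

	for (size_t byte = 0; byte < len; byte++) {
		for (int bit = 0; bit < 8; bit++) {
			if (buf[byte] & (1 << bit)) {
				s.syndrome ^= (uint32_t)(byte * 8 + bit);
				s.parity ^= 1;
			}
		}
	}
	return (s);
}

/* Returns 0 if clean, 1 if a single bit was corrected, -1 if uncorrectable. */
static int
secded_check_and_repair(uint8_t *buf, size_t len, const secded_t *stored)
{
	secded_t now = secded_compute(buf, len);
	uint32_t diff = now.syndrome ^ stored->syndrome;

	if (diff == 0 && now.parity == stored->parity)
		return (0);				/* consistent */
	if (now.parity != stored->parity && diff < len * 8) {
		/* single data-bit error: diff is the index of the flipped bit */
		buf[diff / 8] ^= (uint8_t)(1 << (diff % 8));
		return (1);
	}
	return (-1);				/* double-bit error: detect only */
}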
Mario Goebbels (iPhone)
2008-Aug-12 07:35 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
I suppose an error-correcting code like a 256-bit Hamming or Reed-Solomon code can't substitute as a reliable checksum at the level of the default Fletcher2/4? If it can, it could be offered as an alternative algorithm where necessary and ZFS could react accordingly, or not?

Regards,
-mg

On 12-août-08, at 08:48, "Anton B. Rang" <rang at acm.org> wrote:

> That brings up another interesting idea.
>
> ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.
>
> If 20-odd bits of that were a Hamming code, you'd have something
> slightly stronger than SECDED, and ZFS could correct any single-bit
> errors encountered.
>
> This could be done without changing the ZFS on-disk format.
Richard Elling
2008-Aug-12 14:29 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
Anton B. Rang wrote:

> That brings up another interesting idea.
>
> ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.
>
> If 20-odd bits of that were a Hamming code, you'd have something slightly
> stronger than SECDED, and ZFS could correct any single-bit errors encountered.

Yes. But I'm not convinced that we will see single-bit errors, since there is already a great deal of single-bit-error detection and (often) correction capability in modern systems. It seems that when we lose a block of data, we lose more than a single bit.

It should be relatively easy to add code to the current protection schemes which will compare a bad block to a reconstructed, good block and deliver this information for us. I'll add an RFE.
 -- richard
Although I don't know for sure that most such errors are in fact single-bit in nature, I can only surmise that statistically they most likely are, absent detection otherwise. With the exception of error-corrected memory systems and/or checksummed communication channels, each transition of data between hardware interfaces at ever-increasing clock rates correspondingly increases the probability of an otherwise undetectable soft single-bit error being injected at those boundaries. Although the probability of such an error is small enough that it is not easily detectable or classifiable as a hardware failure, over the course of days/weeks/years and trillions of bits such errors will nonetheless be observable, and should be expected and planned for within reason.

Utilizing a strong error-correcting code in combination with, or in lieu of, a strong hash code would seem like a good way to more strongly warrant that data's in-memory representation at the time of its computation is resilient to transmission and subsequent retrieval. But I suspect that as technology continues to push clock rates and data pool sizes ever higher, some form of uniform data-integrity mechanism will need to be incorporated within all the processing and communications-interface data paths of a system in order to improve data's resilience to transmission and processing errors, albeit statistically very small for any single bit.

> Anton B. Rang wrote:
> > That brings up another interesting idea.
> >
> > ZFS currently uses a 128-bit checksum for blocks of up to 1048576 bits.
> >
> > If 20-odd bits of that were a Hamming code, you'd have something slightly
> > stronger than SECDED, and ZFS could correct any single-bit errors encountered.
>
> Yes. But I'm not convinced that we will see single-bit errors, since
> there is already a great deal of single-bit-error detection and (often)
> correction capability in modern systems. It seems that when we lose
> a block of data, we lose more than a single bit.
>
> It should be relatively easy to add code to the current protection schemes
> which will compare a bad block to a reconstructed, good block and
> deliver this information for us. I'll add an RFE.
> -- richard
Anton B. Rang
2008-Aug-13 04:14 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
Reed-Solomon could correct multiple-bit errors, but an effective Reed-Solomon code for 128K blocks of data would be very slow if implemented in software (and, for that matter, take a lot of hardware to implement). A multi-bit Hamming code would be simpler, but I suspect that undetected multi-bit errors are quite rare.

I've seen a fair number of single-bit errors coming from SATA drives because the data is often not parity-protected through the whole data path within the drive. Some enterprise-class SATA disks have data protected (with a parity-equivalent) through the write data path, and more of these models will have this feature soon. All SAS and FibreChannel drives (that I am aware of) have data protected with ECC through the whole path for both reads and writes.

Single-bit errors can also be introduced in non-ECC DRAM, of course. In this case, it can happen either before the checksum computation (=> undetected data corruption) or after it (=> checksum failure on a later read).
Given that the checksum algorithms utilized in ZFS are already fairly CPU intensive, I can't help but wonder, if it is verified that a majority of checksum-inconsistency failures are indeed single-bit, whether it may be advantageous to utilize some computationally simpler hybrid checksum/Hamming code (as you've suggested). Although such a hybrid would not detect as high a percentage of all possible failures, it would be capable of correcting the theoretical majority of them while retaining the ability to detect a large majority of the remaining possible errors (which are correspondingly known to occur less frequently), and ideally would consume no more than the existing checksum algorithm's overhead, while simultaneously improving the apparent resilience of even non-redundantly-configured storage devices.

(Although I confess I haven't done such an analysis yet, I suspect someone more intimately familiar with error detection/correction algorithm implementations and trade-offs may have some interesting suggestions, as having a strong detection capability without the ability to recover what may otherwise be easily recoverable, in lieu of potentially catastrophic data loss, does not seem reasonable.)
Bob Friesenhahn
2008-Aug-13 19:01 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
On Wed, 13 Aug 2008, paul wrote:

> Given that the checksum algorithms utilized in zfs are already fairly CPU intensive, I
> can't help but wonder if it's verified that a majority of checksum inconsistency failures
> appear to be single bit; if it may be advantageous to utilize some computationally

The default checksum algorithm used by zfs is not very CPU intensive. The actual overhead is easily determined by testing.

Given the many hardware safeguards against single (and several) bit errors, the most common data error will be large. For example, the disk drive may return data from the wrong sector.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
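For reference, a simplified sketch modeled on ZFS's default fletcher_2 algorithm (illustrative only, not the actual zfs_fletcher.c code; it assumes the buffer length is a multiple of 16 bytes and ignores byte-order handling). The inner loop is roughly two 64-bit additions per 8-byte word, which is why its CPU overhead is low compared to a cryptographic hash:

#include <stdint.h>
#include <stddef.h>

typedef struct {
	uint64_t word[4];
} cksum256_t;

static void
fletcher2_sketch(const void *buf, size_t size, cksum256_t *out)
{
	const uint64_t *ip = buf;
	const uint64_t *ipend = ip + (size / sizeof (uint64_t));
	uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;

	/* Two interleaved sum / sum-of-sums accumulator pairs over 64-bit words. */
	for (; ip < ipend; ip += 2) {
		a0 += ip[0];
		a1 += ip[1];
		b0 += a0;
		b1 += a1;
	}

	out->word[0] = a0;
	out->word[1] = a1;
	out->word[2] = b0;
	out->word[3] = b1;
}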
Bob wrote:

> ... Given the many hardware safeguards against single (and several) bit errors,
> the most common data error will be large. For example, the disk drive may
> return data from the wrong sector.

- Actually, the data-integrity check bits that may exist within memory systems and/or communication channels are rarely propagated beyond their boundaries; data is therefore subject to corruption at every such interface traversal, including, for example, during the simple process of being read and re-written by the CPUs anywhere within the system that touches the data, including within the disk drive itself. (Unless a machine with error-detecting/correcting memory is itself detecting uncorrectable 2-bit errors, which should kill the process being run, there's no real reason to suspect that 3-or-more-bit errors are sneaking through with any measurable frequency, although it's possible.)

- Personally, I believe that errors such as the wrong sectors being written or read are themselves most likely due to single-bit errors propagating into critical things like sector-address calculations, and thereby ultimately expressing themselves as large, obvious errors although actually caused by more subtle ones. Shy of extremely noisy hardware and/or literal hard failure, most errors will most likely always be expressed as 1 bit out of some very large N number of bits.
Bob Friesenhahn
2008-Aug-13 21:08 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
On Wed, 13 Aug 2008, paul wrote:

> Shy of extremely noisy hardware and/or literal hard failure, most
> errors will most likely always be expressed as 1 bit out of some
> very large N number of bits.

This claim ignores the fact that most computers today are still based on synchronously clocked parallel bus hardware. A common failure mode is clock skew, which causes many bits to be wrong at once. This can even happen within the CPU.

As serial interfaces continue to be added to computers, the number of single bit errors (vs multi-bit errors) would tend to increase except for the fact that these serial interfaces are designed to detect and discard erroneous packets.

I do agree that the logic between the self-validating interfaces can be faulty.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2008-Aug-13 21:53 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
paul wrote:

> Bob wrote:
>
>> ... Given the many hardware safeguards against single (and several) bit errors,
>> the most common data error will be large. For example, the disk drive may
>> return data from the wrong sector.
>
> - Actually, the data-integrity check bits that may exist within memory systems and/or
> communication channels are rarely propagated beyond their boundaries; data is therefore
> subject to corruption at every such interface traversal, including, for example, during
> the simple process of being read and re-written by the CPUs anywhere within the system
> that touches the data, including within the disk drive itself. (Unless a machine with
> error-detecting/correcting memory is itself detecting uncorrectable 2-bit errors, which
> should kill the process being run, there's no real reason to suspect that 3-or-more-bit
> errors are sneaking through with any measurable frequency, although it's possible.)
>
> - Personally, I believe that errors such as the wrong sectors being written or read are
> themselves most likely due to single-bit errors propagating into critical things like
> sector-address calculations, and thereby ultimately expressing themselves as large,
> obvious errors although actually caused by more subtle ones. Shy of extremely noisy
> hardware and/or literal hard failure, most errors will most likely always be expressed
> as 1 bit out of some very large N number of bits.

Today, we can detect a large number of these using the current ZFS checksum (by default, fletcher-2). But we don't record the scope of the corruption once we correct the data. I filed RFE 6736986, bitwise failure data collection for zfs. Once implemented, we would get a better idea of how extensive corruption can be, even though the root cause cannot be determined from ZFS -- that would be a job for a different FMA DE.
 -- richard
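For illustration, a hypothetical sketch of the sort of bitwise failure data such an RFE could collect, given the bad buffer as read and a reconstructed good copy (names and layout here are made up for the example, not the RFE's actual design; __builtin_ctz is a GCC/Clang builtin):

#include <stdint.h>
#include <stddef.h>

typedef struct bit_diff_report {
	uint64_t bits_differing;	/* total number of flipped bits        */
	uint64_t first_bad_bit;		/* bit offset of the first difference  */
	uint64_t last_bad_bit;		/* bit offset of the last difference   */
} bit_diff_report_t;

/*
 * Compare a bad block to its reconstructed good copy and record how many
 * bits differ and where they fall, so the scope of the corruption (a single
 * bit vs. a large run) can be reported, e.g. via FMA telemetry.
 */
static void
collect_bit_diff(const uint8_t *bad, const uint8_t *good, size_t len,
    bit_diff_report_t *rep)
{
	rep->bits_differing = 0;
	rep->first_bad_bit = rep->last_bad_bit = 0;

	for (size_t byte = 0; byte < len; byte++) {
		uint8_t x = bad[byte] ^ good[byte];

		while (x != 0) {
			int bit = __builtin_ctz(x);	/* lowest differing bit */
			uint64_t off = (uint64_t)byte * 8 + bit;

			if (rep->bits_differing == 0)
				rep->first_bad_bit = off;
			rep->last_bad_bit = off;
			rep->bits_differing++;
			x &= (uint8_t)(x - 1);		/* clear that bit */
		}
	}
}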
bob wrote:

> On Wed, 13 Aug 2008, paul wrote:
>
>> Shy of extremely noisy hardware and/or literal hard failure, most
>> errors will most likely always be expressed as 1 bit out of some
>> very large N number of bits.
>
> This claim ignores the fact that most computers today are still based
> on synchronously clocked parallel bus hardware. A common failure mode
> is clock skew, which causes many bits to be wrong at once. This can
> even happen within the CPU.

- In my experience, clock skew/drift problems will first manifest themselves as single-bit errors even on parallel interfaces. Although all paths are logically parallel, the actual physical performance of each of the individual transistors and traces composing the data path will be ever so slightly different, and although physical CAD layout tools attempt to balance clock trees, the actual arrival time of the clock at the latch elements of the physical data-path implementation will also differ slightly (often by as much as a few picoseconds). Therefore, as a circuit approaches its maximum frequency threshold (which depends on temperature, age, etc.), some very small number of single-bit errors will begin to be generated, due to setup/hold-time violations on the bit with the least physical clock-skew tolerance; as the clock frequency and/or temperature (etc.) increases, more and more bit paths will begin to fail, until the whole path fails. Since all systems have some bits within their parallel paths that are more sensitive to one type of corruption or another, I tend to believe that single-bit failures will express themselves statistically prior to, and in greater number than, multi-bit failures even while the hardware still seems operable.

> As serial interfaces continue to be added to computers, the number of
> single bit errors (vs multi-bit errors) would tend to increase except
> for the fact that these serial interfaces are designed to detect and
> discard erroneous packets.
>
> I do agree that the logic between the self-validating interfaces can
> be faulty.
>
> Bob
Yes, thank you.
Richard Elling
2008-Aug-14 16:18 UTC
[zfs-discuss] integrated failure recovery thoughts (single-bit
paul wrote:

> bob wrote:
>
>> On Wed, 13 Aug 2008, paul wrote:
>>
>>> Shy of extremely noisy hardware and/or literal hard failure, most
>>> errors will most likely always be expressed as 1 bit out of some
>>> very large N number of bits.
>>
>> This claim ignores the fact that most computers today are still based
>> on synchronously clocked parallel bus hardware. A common failure mode
>> is clock skew, which causes many bits to be wrong at once. This can
>> even happen within the CPU.
>
> - In my experience, clock skew/drift problems will first manifest themselves
> as single-bit errors even on parallel interfaces. Although all paths are
> logically parallel, the actual physical performance of each of the individual
> transistors and traces composing the data path will be ever so slightly
> different, and although physical CAD layout tools attempt to balance clock
> trees, the actual arrival time of the clock at the latch elements of the
> physical data-path implementation will also differ slightly (often by as much
> as a few picoseconds). Therefore, as a circuit approaches its maximum
> frequency threshold (which depends on temperature, age, etc.), some very small
> number of single-bit errors will begin to be generated, due to setup/hold-time
> violations on the bit with the least physical clock-skew tolerance; as the
> clock frequency and/or temperature (etc.) increases, more and more bit paths
> will begin to fail, until the whole path fails. Since all systems have some
> bits within their parallel paths that are more sensitive to one type of
> corruption or another, I tend to believe that single-bit failures will express
> themselves statistically prior to, and in greater number than, multi-bit
> failures even while the hardware still seems operable.

I'm not convinced, but perhaps it is because of the scar near my left ankle.

Long, long ago... in the SunOS 3.2 days, we had a server with two (!) ethernet interfaces which we used to serve two different subnets (router) in addition to its normal services (NFS, mail, etc.). If the server couldn't service the ethernet interrupts fast enough, the ethernet interface would zero-fill the packets. This is a really bad idea, because the symptom was random zeros intermixed with legitimate data... but only sometimes.

The lesson here is that you are often dealing with firmware or other high-level decisions about what happens to data as it flows through the system, and I doubt very seriously that the firmware developers would just flip a single bit somewhere rather than do something like ZFOD.
 -- richard
I apologize for, in effect, suggesting that which was previously suggested in an earlier thread:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-March/046234.html

In the process I discovered that a feature to attempt worst-case single-bit recovery had apparently already been present in some form in an earlier development version of ZFS, but was removed on the grounds that it masked programming errors (which makes no sense to me); other Sun engineers in that thread likewise recognized that single-bit errors may in fact be injected by various elements of the system that touch the data beyond the drives themselves, such as CPUs, for example.

I don't know where it comes from, but there seems to be a standing assumption that most checksum errors are in fact multi-bit (some statistical testing should be able to determine whether or not this is the case). Personally I suspect the opposite: drives tend to do a fairly good job of identifying uncorrectable data and therefore tend not to erroneously return garbage as good, and the remaining hardware in most systems will not tend to generate sporadic multi-bit errors more frequently than single-bit ones. It is therefore logical to assume that most data corruption originates as single-bit errors, which, if not detected and corrected, may subsequently be used in calculations yielding potentially far more catastrophic results (and likely contribute to some percentage of wrong blocks being read/written, subsequently mistaken for multi-bit data-block errors).

Overall, please reconsider re-incorporating this feature, minimally enabled upon request if not by default. Although worst-case recovery of large-block file data may be resource intensive, it would only be invoked as a last resort, with the alternative being a catastrophic loss of data, which seems wholly unacceptable if the data is in fact recoverable in place.