ZFS is able to detect corruption thanks to checksumming, but for single drives (regular folks' PCs) it doesn't help much unless it can also correct it. I've been searching and can't find anything on the topic, so here goes:

1. Can ZFS do parity data on a single drive? e.g. x% parity for all writes, recover on checksum error.
2. If not, why not? I imagine it would have been a killer feature.

I guess you could possibly do it by partitioning the single drive and running raidz(2) on the partitions, but that would lose you way more space than e.g. 10%. Also not practical for an OS drive.
-- 
This message posted from opensolaris.org
Christian Auby wrote:
> ZFS is able to detect corruption thanks to checksumming, but for single drives (regular folks' PCs) it doesn't help much unless it can also correct it. I've been searching and can't find anything on the topic, so here goes:
>
> 1. Can ZFS do parity data on a single drive? e.g. x% parity for all writes, recover on checksum error.
> 2. If not, why not? I imagine it would have been a killer feature.
>
> I guess you could possibly do it by partitioning the single drive and running raidz(2) on the partitions, but that would lose you way more space than e.g. 10%. Also not practical for an OS drive.

You are describing the copies parameter. It really helps to describe it in pictures, rather than words. So I did that.
http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection
-- richard
On Tue, 2009-07-07 at 17:42 -0700, Richard Elling wrote:
> Christian Auby wrote:
> > ZFS is able to detect corruption thanks to checksumming, but for single drives (regular folks' PCs) it doesn't help much unless it can also correct it. I've been searching and can't find anything on the topic, so here goes:
> >
> > 1. Can ZFS do parity data on a single drive? e.g. x% parity for all writes, recover on checksum error.
> > 2. If not, why not? I imagine it would have been a killer feature.
> >
> > I guess you could possibly do it by partitioning the single drive and running raidz(2) on the partitions, but that would lose you way more space than e.g. 10%. Also not practical for an OS drive.
>
> You are describing the copies parameter. It really helps to describe
> it in pictures, rather than words. So I did that.
> http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection
> -- richard

I think one solution to what Christian is asking for is copies. But I think he is asking if there is a way to do something like a 'raid' of the block, so that your capacity isn't cut in half. For example, write 5 blocks to the disk, 4 data and one parity; then if any one of the blocks gets corrupted or is unreadable, you can reconstruct the missing block. In this example you would only lose 20% of your capacity, not 50%.

I think this option would only really be useful for home users or simple workstations. It also could have some performance implications.

-Jebnor
-- 
Louis-Frédéric Feuillette <jebnor at gmail.com>
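For readers who want to see the mechanics, here is a minimal sketch in Python of the single-parity idea Jebnor describes (purely illustrative -- not how ZFS lays anything out on disk): four data chunks plus one XOR parity chunk, where any one chunk can be rebuilt provided its position is known (e.g. flagged by a failed per-chunk checksum).

```python
import os

CHUNKS = 4              # data chunks per logical block (assumed for illustration)
CHUNK_SIZE = 32 * 1024  # 4 x 32 KiB = one 128 KiB ZFS-sized block

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(chunks):
    """Parity chunk = XOR of all data chunks (RAID-5 style, single parity)."""
    parity = bytes(CHUNK_SIZE)
    for c in chunks:
        parity = xor_bytes(parity, c)
    return parity

def rebuild(chunks, parity, lost_index):
    """Recover the chunk at lost_index from the survivors plus parity."""
    survivor = parity
    for i, c in enumerate(chunks):
        if i != lost_index:
            survivor = xor_bytes(survivor, c)
    return survivor

block = os.urandom(CHUNKS * CHUNK_SIZE)
chunks = [block[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] for i in range(CHUNKS)]
parity = make_parity(chunks)

# Simulate losing chunk 2 (position known, e.g. via a failed checksum).
recovered = rebuild(chunks, parity, lost_index=2)
assert recovered == chunks[2]
print("rebuilt chunk 2; parity is 1 of %d pieces written (%.0f%% of capacity)"
      % (CHUNKS + 1, 100.0 / (CHUNKS + 1)))
```

Single XOR parity only covers one unreadable piece per block; anything stronger needs a real error-correcting code, which is where the later Reed-Solomon discussion in this thread comes in.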
There was a discussion in zfs-code around error-correcting (rather than just error-detecting) properties of the checksums currently kept, and of potential additional checksum methods with stronger properties. It came out of another discussion about fletcher2 being both weaker than desired and flawed in its present implementation. Sorry, don't have a thread reference to hand just now.
-- 
This message posted from opensolaris.org
> You are describing the copies parameter. It really
> helps to describe
> it in pictures, rather than words. So I did that.
> http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection
> -- richard

It's not quite like copies, as it's not actually a copy of the data I'm talking about. 10% parity or even 5% could easily fix most disk errors that won't result in a total disk loss. Basically something like par archives.

"Only useful for simple workstations" isn't exactly "only" in my book. That would account for what, 90% of all computers? (97% of statistics are made up)

Uep: Yeah, it could be implemented by extending the checksum in some way. I don't see a performance issue if it's not enabled by default though.
-- 
This message posted from opensolaris.org
Christian Auby wrote:
>> You are describing the copies parameter. It really
>> helps to describe
>> it in pictures, rather than words. So I did that.
>> http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection
>> -- richard
>
> It's not quite like copies, as it's not actually a copy of the data I'm talking about. 10% parity or even 5% could easily fix most disk errors that won't result in a total disk loss.

Do you have data to back this up?

> Basically something like par archives.

par archives are trying to solve a different problem.

> "Only useful for simple workstations" isn't exactly "only" in my book. That would account for what, 90% of all computers? (97% of statistics are made up)
>
> Uep: Yeah, it could be implemented by extending the checksum in some way.

Pedantically, if the checksum has the same number of bits, then it is a copy.

> I don't see a performance issue if it's not enabled by default though.

Eh?
-- richard
> Do you have data to back this up?

It's more of a logical observation. The random data corruption I've had over the years has generally involved either a single sector or two, or a full disk failure.

5% parity on a 128KB block size would allow you to lose 6.4KB, or ~10 512-byte sectors. Unless I messed up. In which case tell me how/when/where/what.
-- 
This message posted from opensolaris.org
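A quick back-of-the-envelope check of that estimate, assuming a 128 KiB block and 512-byte sectors (it comes out nearer 12-13 sectors, and a code that does not know which sectors are bad could repair only about half as much):

```python
BLOCK = 128 * 1024      # bytes per ZFS block (maximum recordsize)
SECTOR = 512            # bytes per disk sector
PARITY_FRACTION = 0.05  # the 5% figure from the post

parity_bytes = BLOCK * PARITY_FRACTION
parity_sectors = parity_bytes / SECTOR

print("parity per block: %.0f bytes (~%.1f sectors)" % (parity_bytes, parity_sectors))
# With known-bad ("erasure") sectors, roughly that many sectors per 128 KiB
# block could be repaired; with unknown error positions, roughly half that.
```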
> Sorry, don't have a thread reference
> to hand just now.

http://www.opensolaris.org/jive/thread.jspa?threadID=100296

Note that there's little empirical evidence that this is directly applicable to the kinds of errors (single bit, or otherwise) that a single failing disk medium would produce. Modern disks already include and rely on a lot of ECC as part of ordinary operation, below the level usually seen by the host. These mechanisms seem unlikely to return a read with just one (or a few) bit errors.

This strikes me, if implemented, as potentially more applicable to errors introduced from other sources (controller/bus transfer errors, non-ECC memory, weak power supply, etc). Still handy.
-- 
This message posted from opensolaris.org
Christian Auby wrote:
> It's not quite like copies as it's not actually a copy of the data I'm
> talking about. 10% parity or even 5% could easily fix most disk errors
> that won't result in a total disk loss.
(snip)
> I don't see a performance issue if it's not enabled by default though.

The copies code is nice because it tries to put each copy "far away" from the others. This does have a significant performance impact when on a single spindle, however, because each logical write will be written "here" and then a disk seek performed to write it "there".

With an N+K parity (ECC) scheme, you would turn 1 logical write into at least K disk seeks, which are by several orders of magnitude the slowest part of I/O (unless you're using flash media, but that's not a common case yet). If you don't spread out the writes across the platter(s), you run the risk of the common-case disk failure mode where many consecutive sectors are damaged.

It would not hurt when it's disabled, but it would cripple a system when it is enabled.

--Joe
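A rough, illustrative model of Joe's point, using assumed drive numbers (about 8 ms per far seek, about 80 MB/s sequential transfer -- both hypothetical): on a single spindle the extra far-away writes are dominated by seek time, so a small parity piece written "far away" costs nearly as much as a second full copy. Transaction-group batching, as markm notes in the next message, amortizes some of this across many blocks.

```python
SEEK_MS = 8.0     # assumed average seek + rotational latency, ms
XFER_MB_S = 80.0  # assumed sequential transfer rate, MB/s
BLOCK_KB = 128.0  # one maximum-size ZFS block

def write_cost_ms(extra_far_writes, extra_kb):
    """Time to write one block plus 'extra_far_writes' distant pieces."""
    transfer = (BLOCK_KB + extra_kb) / 1024.0 / XFER_MB_S * 1000.0
    seeks = extra_far_writes * SEEK_MS
    return transfer + seeks

plain   = write_cost_ms(0, 0)                # single block, written in place
parity  = write_cost_ms(1, 0.05 * BLOCK_KB)  # plus one far-away 5% parity piece
copies2 = write_cost_ms(1, BLOCK_KB)         # plus one far-away full copy

print("plain write : %5.2f ms" % plain)
print("+5%% parity  : %5.2f ms" % parity)
print("copies=2    : %5.2f ms" % copies2)
```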
On Wed, 8 Jul 2009, Moore, Joe wrote:
> The copies code is nice because it tries to put each copy "far away"
> from the others. This does have a significant performance impact when
> on a single spindle, however, because each logical write will be written
> "here" and then a disk seek performed to write it "there".

That's true for the worst case, but zfs mitigates that somewhat by batching i/o into a transaction group. This means that i/o is done every 30 seconds (or 5 seconds, depending on the version you're running), allowing multiple writes to be written together in the disparate locations.

Regards,
markm
Daniel Carosone wrote:
>> Sorry, don't have a thread reference
>> to hand just now.
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=100296
>
> Note that there's little empirical evidence that this is directly applicable to the kinds of errors (single bit, or otherwise) that a single failing disk medium would produce. Modern disks already include and rely on a lot of ECC as part of ordinary operation, below the level usually seen by the host. These mechanisms seem unlikely to return a read with just one (or a few) bit errors.
>
> This strikes me, if implemented, as potentially more applicable to errors introduced from other sources (controller/bus transfer errors, non-ECC memory, weak power supply, etc). Still handy.

Adding additional data protection options is commendable. On the other hand, I feel there are important gaps in the existing feature set that are worthy of a higher priority, not the least of which is the automatic recovery of uberblock / transaction group problems (see Victor Latushkin's recovery technique, which I linked to in a recent post), followed closely by a zpool shrink or zpool remove command that lets you resize pools and disconnect devices without replacing them. I saw postings or blog entries from about 6 months ago that this code was 'near' as part of solving a resilvering bug, but have not seen anything else since. I think many users would like to see improved resilience in the existing features and the addition of frequently and long requested features before other new features are added. (Exceptions can readily be made for new features that are trivially easy to implement and/or are not competing for developer time with higher priority features.)

In the meantime, there is the copies flag option that you can use on single disks. With immense drives, even losing 1/2 the capacity to copies isn't as traumatic for many people as it was in days gone by. (E.g. consider a 500 GB hard drive with copies=2 versus a 128 GB SSD.) Of course, if you need all that space then it is a no-go.

Related threads that also had ideas on using spare CPU cycles for brute force recovery of single bit errors using the checksum:

[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg14720.html

[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg18540.html
Haudy Kazemi wrote:
> Daniel Carosone wrote:
>>> Sorry, don't have a thread reference
>>> to hand just now.
>>
>> http://www.opensolaris.org/jive/thread.jspa?threadID=100296
>>
>> Note that there's little empirical evidence that this is directly applicable to the kinds of errors (single bit, or otherwise) that a single failing disk medium would produce. Modern disks already include and rely on a lot of ECC as part of ordinary operation, below the level usually seen by the host. These mechanisms seem unlikely to return a read with just one (or a few) bit errors.
>>
>> This strikes me, if implemented, as potentially more applicable to errors introduced from other sources (controller/bus transfer errors, non-ECC memory, weak power supply, etc). Still handy.
>
> Adding additional data protection options is commendable. On the
> other hand, I feel there are important gaps in the existing feature set
> that are worthy of a higher priority, not the least of which is the
> automatic recovery of uberblock / transaction group problems (see
> Victor Latushkin's recovery technique, which I linked to in a recent
> post),

This does not seem to be a widespread problem. We do see the occasional complaint on this forum, but considering the substantial number of ZFS implementations in existence today, the rate seems to be quite low. In other words, the impact does not seem to be high. Perhaps someone at Sun could comment on the call rate for such conditions?

> followed closely by a zpool shrink or zpool remove command that lets
> you resize pools and disconnect devices without replacing them. I saw
> postings or blog entries from about 6 months ago that this code was
> 'near' as part of solving a resilvering bug, but have not seen anything
> else since. I think many users would like to see improved resilience
> in the existing features and the addition of frequently and long requested
> features before other new features are added. (Exceptions can readily
> be made for new features that are trivially easy to implement and/or
> are not competing for developer time with higher priority features.)
>
> In the meantime, there is the copies flag option that you can use on
> single disks. With immense drives, even losing 1/2 the capacity to
> copies isn't as traumatic for many people as it was in days gone by.
> (E.g. consider a 500 GB hard drive with copies=2 versus a 128 GB
> SSD.) Of course, if you need all that space then it is a no-go.

Space, performance, dependability: you can pick any two.

> Related threads that also had ideas on using spare CPU cycles for
> brute force recovery of single bit errors using the checksum:

There is no evidence that the type of unrecoverable read errors we see are single bit errors. And while it is possible for an error handling code to correct single bit flips, multiple bit flips would remain as a large problem space. There are error codes which can correct multiple flips, but they quickly become expensive. This is one reason why nobody does RAID-2.

BTW, if you do have the case where unprotected data is not readable, then I have a little DTrace script that I'd like you to run which would help determine the extent of the corruption. This is one of those studies which doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon

The data we do have suggests that magnetic hard disk failures tend to be spatially clustered. So there is still the problem of spatial diversity, which is rather nicely handled by copies, today.
-- richard
>> Adding additional data protection options is commendable. On the
>> other hand, I feel there are important gaps in the existing feature
>> set that are worthy of a higher priority, not the least of which is
>> the automatic recovery of uberblock / transaction group problems (see
>> Victor Latushkin's recovery technique, which I linked to in a recent
>> post),
>
> This does not seem to be a widespread problem. We do see the
> occasional complaint on this forum, but considering the substantial
> number of ZFS implementations in existence today, the rate seems
> to be quite low. In other words, the impact does not seem to be high.
> Perhaps someone at Sun could comment on the call rate for such
> conditions?

I counter this. The user impact is very high when the pool is completely inaccessible due to a minor glitch in the ZFS metadata and the user is told to restore from backups, particularly if they've been considering snapshots to be their backups (I know they're not the same thing). The incidence rate may be low, but the impact is still high, and anecdotally there have been enough reports on list to know it is a real, non-zero event probability.

Think earth-asteroid collisions: they don't happen very often, but are catastrophic when they do happen. Graceful handling of low-incidence, high-impact events plays a role in real-world robustness and is important in widescale adoption of a filesystem. It is about software robustness in the face of failure vs. brittleness. (In another area, I and others found MythTV's dependence on MySQL to be a source of system brittleness.) Google adopts robustness principles in its Google File System (GFS) by not trusting the hardware at all and then keeping a minimum of three copies of everything on three separate computers.

Consider the user's/admin's dilemma of choosing between a filesystem that offers all the great features of ZFS but can be broken (and is documented to have broken) with a few miswritten bytes, or choosing a filesystem with no great features but which is also generally robust to a wide variety of minor metadata corruption issues. Complex filesystems need to take special measures so that their complexity doesn't compromise their efforts at ensuring reliability. ZFS's extra metadata copies provide this, versus simply duplicating the file allocation table as is done in FAT16/32 filesystems (a basic filesystem). The extra filesystem complexity also makes users more dependent upon built-in recovery mechanisms and makes manual recovery more challenging. (This is an unavoidable result of more complicated filesystem design.)

More below.

>> followed closely by a zpool shrink or zpool remove command that lets
>> you resize pools and disconnect devices without replacing them. I
>> saw postings or blog entries from about 6 months ago that this code
>> was 'near' as part of solving a resilvering bug but have not seen
>> anything else since. I think many users would like to see improved
>> resilience in the existing features and the addition of frequently
>> and long requested features before other new features are added.
>> (Exceptions can readily be made for new features that are trivially
>> easy to implement and/or are not competing for developer time with
>> higher priority features.)
>>
>> In the meantime, there is the copies flag option that you can use on
>> single disks. With immense drives, even losing 1/2 the capacity to
>> copies isn't as traumatic for many people as it was in days gone by.
>> (E.g. consider a 500 GB hard drive with copies=2 versus a 128 GB
>> SSD.) Of course, if you need all that space then it is a no-go.
>
> Space, performance, dependability: you can pick any two.
>
>> Related threads that also had ideas on using spare CPU cycles for
>> brute force recovery of single bit errors using the checksum:
>
> There is no evidence that the type of unrecoverable read errors we
> see are single bit errors. And while it is possible for an error handling
> code to correct single bit flips, multiple bit flips would remain as a
> large problem space. There are error codes which can correct multiple
> flips, but they quickly become expensive. This is one reason why nobody
> does RAID-2.

Expensive in CPU cycles or engineering resources or hardware or dollars? If the argument is CPU cycles, then that is the same case made against software RAID as a whole, and an argument increasingly broken by modern high-performance CPUs. If the argument is engineering resources, consider the complexity of ZFS itself. If the argument is hardware (e.g. you need a lot of disks), why not run it at the block level? Dollars is going to be a function of engineering resources, hardware, and licenses.

There are many error-correcting codes available. RAID2 used Hamming codes, but that's just one of many options out there. Par2 uses configurable-strength Reed-Solomon to get multi-bit error correction. The par2 source is available, although from a ZFS perspective it is hindered by the CDDL-GPL license incompatibility problem.

It is possible to write a FUSE filesystem using Reed-Solomon (like par2) as the underlying protection. A quick search of the FUSE website turns up the Reed-Solomon FS (a FUSE-based filesystem): "Shielding your files with Reed-Solomon codes"
http://ttsiodras.googlepages.com/rsbep.html

While most FUSE work is on Linux, and there is a ZFS-FUSE project for it, there has also been FUSE work done for OpenSolaris:
http://www.opensolaris.org/os/project/fuse/

> BTW, if you do have the case where unprotected data is not
> readable, then I have a little DTrace script that I'd like you to run
> which would help determine the extent of the corruption. This is
> one of those studies which doesn't like induced errors ;-)
> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon

Is this intended as a general monitoring script, or only after one has otherwise experienced corruption problems?

To be pedantic, wouldn't protected data also be affected if all copies are damaged at the same time, especially if also damaged in the same place?

-hk

> The data we do have suggests that magnetic hard disk failures tend
> to be spatially clustered. So there is still the problem of spatial diversity,
> which is rather nicely handled by copies, today.
> -- richard
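For anyone who wants to experiment with the par2-style approach outside the filesystem, here is a minimal sketch using the third-party pure-Python reedsolo package (an assumption: it is not part of ZFS, par2, or OpenSolaris; install with pip). It appends a configurable number of Reed-Solomon parity bytes to each ~255-byte codeword, which is roughly 5% overhead at 12 parity bytes, and can correct up to half that many corrupted bytes per codeword at unknown positions:

```python
from reedsolo import RSCodec  # third-party package: pip install reedsolo

NSYM = 12          # parity bytes per 255-byte codeword (~5% overhead)
rsc = RSCodec(NSYM)

data = bytes(range(256)) * 16            # a 4 KiB pretend "block"
protected = rsc.encode(data)             # data with parity appended per codeword

# Flip a short burst of bytes, as a marginal sector might.
damaged = bytearray(protected)
for i in range(100, 104):
    damaged[i] ^= 0xFF

result = rsc.decode(bytes(damaged))
# Recent reedsolo versions return a tuple whose first element is the message;
# older versions return just the repaired message.
repaired = result[0] if isinstance(result, tuple) else result
assert bytes(repaired) == data
print("repaired a 4-byte burst using %d parity bytes per codeword" % NSYM)
```

This is only a user-land illustration of the coding math; hooking anything like it into the DMU or SPA layers is the hard part, as discussed later in the thread.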
Haudy Kazemi wrote:
>>> Adding additional data protection options is commendable. On the
>>> other hand, I feel there are important gaps in the existing feature
>>> set that are worthy of a higher priority, not the least of which is
>>> the automatic recovery of uberblock / transaction group problems
>>> (see Victor Latushkin's recovery technique, which I linked to in a
>>> recent post),
>>
>> This does not seem to be a widespread problem. We do see the
>> occasional complaint on this forum, but considering the substantial
>> number of ZFS implementations in existence today, the rate seems
>> to be quite low. In other words, the impact does not seem to be high.
>> Perhaps someone at Sun could comment on the call rate for such
>> conditions?
>
> I counter this. The user impact is very high when the pool is
> completely inaccessible due to a minor glitch in the ZFS metadata and
> the user is told to restore from backups, particularly if they've been
> considering snapshots to be their backups (I know they're not the same
> thing). The incidence rate may be low, but the impact is still high,
> and anecdotally there have been enough reports on list to know it is a
> real, non-zero event probability.

Impact in my context is statistical. If everyone was hitting this problem, then it would have been automated long ago. Sun does track such reports and will know their rate.

> Think earth-asteroid collisions: they don't happen very often, but are
> catastrophic when they do happen. Graceful handling of low-incidence,
> high-impact events plays a role in real-world robustness and is
> important in widescale adoption of a filesystem. It is about software
> robustness in the face of failure vs. brittleness. (In another area,
> I and others found MythTV's dependence on MySQL to be a source of system
> brittleness.) Google adopts robustness principles in its Google File
> System (GFS) by not trusting the hardware at all and then keeping a
> minimum of three copies of everything on three separate computers.

Right, so you also know that the reports of this problem are for non-mirrored pools. I agree with Google, mirrors work.

> Consider the user's/admin's dilemma of choosing between a filesystem
> that offers all the great features of ZFS but can be broken (and is
> documented to have broken) with a few miswritten bytes, or choosing a
> filesystem with no great features but which is also generally robust to a
> wide variety of minor metadata corruption issues. Complex filesystems need to
> take special measures so that their complexity doesn't compromise their
> efforts at ensuring reliability. ZFS's extra metadata copies provide
> this, versus simply duplicating the file allocation table as is done in
> FAT16/32 filesystems (a basic filesystem). The extra filesystem
> complexity also makes users more dependent upon built-in recovery
> mechanisms and makes manual recovery more challenging. (This is an
> unavoidable result of more complicated filesystem design.)

I agree 100%. But the question here is manual vs automated, not possible vs impossible. Even the venerable UFS fsck defers to manual if things are really messed up.

> More below.
>
>>> followed closely by a zpool shrink or zpool remove command that lets
>>> you resize pools and disconnect devices without replacing them. I
>>> saw postings or blog entries from about 6 months ago that this code
>>> was 'near' as part of solving a resilvering bug but have not seen
>>> anything else since. I think many users would like to see improved
>>> resilience in the existing features and the addition of frequently
>>> and long requested features before other new features are added.
>>> (Exceptions can readily be made for new features that are trivially
>>> easy to implement and/or are not competing for developer time with
>>> higher priority features.)
>>>
>>> In the meantime, there is the copies flag option that you can use on
>>> single disks. With immense drives, even losing 1/2 the capacity to
>>> copies isn't as traumatic for many people as it was in days gone
>>> by. (E.g. consider a 500 GB hard drive with copies=2 versus a 128
>>> GB SSD.) Of course, if you need all that space then it is a no-go.
>>
>> Space, performance, dependability: you can pick any two.
>>
>>> Related threads that also had ideas on using spare CPU cycles for
>>> brute force recovery of single bit errors using the checksum:
>>
>> There is no evidence that the type of unrecoverable read errors we
>> see are single bit errors. And while it is possible for an error handling
>> code to correct single bit flips, multiple bit flips would remain as a
>> large problem space. There are error codes which can correct multiple
>> flips, but they quickly become expensive. This is one reason why nobody
>> does RAID-2.
>
> Expensive in CPU cycles or engineering resources or hardware or
> dollars? If the argument is CPU cycles, then that is the same case
> made against software RAID as a whole, and an argument increasingly
> broken by modern high-performance CPUs. If the argument is
> engineering resources, consider the complexity of ZFS itself. If the
> argument is hardware (e.g. you need a lot of disks), why not run it at
> the block level? Dollars is going to be a function of engineering
> resources, hardware, and licenses.

All algorithms are not created equal. A CPU can do XOR at memory bandwidth rates. Even the special case of BCH called Reed-Solomon, used for raidz2, has a reputation for slowness. Simple redundancy works pretty well. Space, speed, dependability: pick two.

> There are many error-correcting codes available. RAID2 used Hamming
> codes, but that's just one of many options out there. Par2 uses
> configurable-strength Reed-Solomon to get multi-bit error
> correction. The par2 source is available, although from a ZFS
> perspective it is hindered by the CDDL-GPL license incompatibility problem.
>
> It is possible to write a FUSE filesystem using Reed-Solomon (like
> par2) as the underlying protection. A quick search of the FUSE
> website turns up the Reed-Solomon FS (a FUSE-based filesystem):
> "Shielding your files with Reed-Solomon codes"
> http://ttsiodras.googlepages.com/rsbep.html
>
> While most FUSE work is on Linux, and there is a ZFS-FUSE project for
> it, there has also been FUSE work done for OpenSolaris:
> http://www.opensolaris.org/os/project/fuse/
>
>> BTW, if you do have the case where unprotected data is not
>> readable, then I have a little DTrace script that I'd like you to run
>> which would help determine the extent of the corruption. This is
>> one of those studies which doesn't like induced errors ;-)
>> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
>
> Is this intended as a general monitoring script, or only after one has
> otherwise experienced corruption problems?

It is intended to try to answer the question of whether the errors we see in real life might be single bit errors. I do not believe they will be single bit errors, but we don't have the data.

> To be pedantic, wouldn't protected data also be affected if all copies
> are damaged at the same time, especially if also damaged in the same
> place?

Yep. Which is why there is RFE CR 6674679: complain if all data copies are identical and corrupt.
-- richard
> On Wed, 8 Jul 2009, Moore, Joe wrote:
>
> That's true for the worst case, but zfs mitigates that somewhat by
> batching i/o into a transaction group. This means that i/o is done every
> 30 seconds (or 5 seconds, depending on the version you're running),
> allowing multiple writes to be written together in the disparate
> locations.

I'd think that writing the same data two or three times is a much larger performance hit anyway. Calculating 5% parity and writing it in addition to the stripe might be heaps faster. Might try to do some tests on this.
-- 
This message posted from opensolaris.org
Christian Auby wrote:
>> On Wed, 8 Jul 2009, Moore, Joe wrote:
>>
>> That's true for the worst case, but zfs mitigates that somewhat by
>> batching i/o into a transaction group. This means that i/o is done every
>> 30 seconds (or 5 seconds, depending on the version you're running),
>> allowing multiple writes to be written together in the disparate
>> locations.
>
> I'd think that writing the same data two or three times is a much larger performance hit anyway. Calculating 5% parity and writing it in addition to the stripe might be heaps faster. Might try to do some tests on this.

Before you get too happy, you should look at the current constraints. The minimum disk block size is 512 bytes for most disks, but there has been talk in the industry of cranking this up to 2 or 4 kBytes. For small files, your 5% becomes 100%, and you might as well be happy now and set copies=2. The largest ZFS block size is 128 kBytes, so perhaps you could do something with 5% overhead there, but you couldn't correct very many bits with only 5%. How many bits do you need to correct? I don't know... that is the big elephant in the room shaped like a question mark. Maybe zcksummon data will help us figure out what color the elephant might be.

If you were to implement something at the DMU layer, which is where copies are, then without major structural changes to the blkptr you are restricted to 3 DVAs. So the best you could do there is 50% overhead, which would be a 200% overhead for small files.

If you were to implement at the SPA layer, then you might be able to get back to a more consistently small overhead, but that would require implementing a whole new vdev type, which means integration with install, grub, and friends. You would need to manage spatial diversity, which might impact the allocation code in strange ways, but it surely is possible. The spatial diversity requirement means you basically can't gain much by replacing a compressor with additional data redundancy, though it might be an interesting proposal for the summer of code. Or you could just do it in user land, like par2.

Bottom line: until you understand the failure modes you're trying to survive, you can't make significant progress except by accident. We know that redundant copies allow us to correct all bits for very little performance impact, but cost space. Trying to increase space without sacrificing dependability will cost something -- most likely performance.

NB, one nice thing about copies is that you can set it per file system. For my laptop, I don't set copies for the OS, but I do for my home directory. This is a case where I trade off dependability of read-only data, which is available on CD or on the net, to gain a little bit of space. But I don't compromise on dependability for my data.
-- richard
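Richard's small-file point is easy to make concrete. A sketch, assuming 512-byte sectors and that any parity must occupy at least one whole sector:

```python
import math

SECTOR = 512            # assumed sector size in bytes
PARITY_FRACTION = 0.05  # nominal 5% parity

def overhead(block_bytes):
    """Actual overhead once 5% parity is rounded up to whole sectors."""
    parity = max(1, math.ceil(block_bytes * PARITY_FRACTION / SECTOR)) * SECTOR
    return parity, 100.0 * parity / block_bytes

for size in (512, 4 * 1024, 16 * 1024, 128 * 1024):
    parity, pct = overhead(size)
    print("%7d-byte block: %5d parity bytes (%5.1f%% overhead)" % (size, parity, pct))
```

For a 512-byte block the "5%" parity is a full extra sector, i.e. 100% overhead, and only at the largest block sizes does it approach the nominal figure.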
Richard Elling wrote:
>> There are many error-correcting codes available. RAID2 used Hamming
>> codes, but that's just one of many options out there. Par2 uses
>> configurable-strength Reed-Solomon to get multi-bit error
>> correction. The par2 source is available, although from a ZFS
>> perspective it is hindered by the CDDL-GPL license incompatibility problem.
>>
>> It is possible to write a FUSE filesystem using Reed-Solomon (like
>> par2) as the underlying protection. A quick search of the FUSE
>> website turns up the Reed-Solomon FS (a FUSE-based filesystem):
>> "Shielding your files with Reed-Solomon codes"
>> http://ttsiodras.googlepages.com/rsbep.html
>>
>> While most FUSE work is on Linux, and there is a ZFS-FUSE project for
>> it, there has also been FUSE work done for OpenSolaris:
>> http://www.opensolaris.org/os/project/fuse/
>>
>>> BTW, if you do have the case where unprotected data is not
>>> readable, then I have a little DTrace script that I'd like you to run
>>> which would help determine the extent of the corruption. This is
>>> one of those studies which doesn't like induced errors ;-)
>>> http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
>>
>> Is this intended as a general monitoring script, or only after one has
>> otherwise experienced corruption problems?
>
> It is intended to try to answer the question of whether the errors we see
> in real life might be single bit errors. I do not believe they will be
> single bit errors, but we don't have the data.
>
>> To be pedantic, wouldn't protected data also be affected if all copies
>> are damaged at the same time, especially if also damaged in the same
>> place?
>
> Yep. Which is why there is RFE CR 6674679: complain if all data
> copies are identical and corrupt.
> -- richard

There is a related but unlikely scenario that is also probably not covered yet. I'm not sure what kind of common cause would lead to it. Maybe a disk array turning into swiss cheese, with bad sectors suddenly showing up on multiple drives? Its probability increases with larger logical block sizes (e.g. 128k blocks are at higher risk than 4k blocks; a block being the smallest piece of storage real estate used by the filesystem). It is the edge case of multiple damaged copies where the damage is unreadable bad sectors on different corresponding sectors of a block. This could be recovered from by copying the readable sectors from each copy and filling in the holes using the appropriate sectors from the other copies. The final result, a rebuilt block, should pass the checksum tests, assuming there were not any other problems with the still-readable sectors.

---

A bad-sector-specific recovery technique is to instruct the disk to return raw read data rather than trying to correct it. The READ LONG command can do this (though the specs say it only works on 28-bit LBA). (READ LONG corresponds to writes done with WRITE LONG (28 bit) or WRITE UNCORRECTABLE EXT (48 bit). Linux HDPARM uses these write commands when it is used to create bad sectors with the --make-bad-sector command. The resulting sectors are low-level logically bad, where the sector's data and ECC do not match; they are not physically bad.) With multiple read attempts, a statistical distribution of the likely 'true' contents of the sector can be found. Spinrite claims to do this. Linux 'HDPARM --read-sector' can sometimes return data from nominally bad sectors too, but it doesn't have a built-in statistical recovery method (a wrapper script could probably solve that).
I don't know if HDPARM --read-sector uses READ LONG or not.

HDPARM man page:
http://linuxreviews.org/man/hdparm/

Good description of IDE commands including READ LONG and WRITE LONG (specs say they are 28 bit only):
http://www.repairfaq.org/filipg/LINK/F_IDE-tech.html

SCSI versions of READ LONG and WRITE LONG:
http://en.wikipedia.org/wiki/SCSI_Read_Commands#Read_Long
http://en.wikipedia.org/wiki/SCSI_Write_Commands#Write_Long

Here is a report by forum member "qubit" modifying his Linux taskfile driver to use READ LONG for data recovery purposes, and his subsequent analysis:
http://forums.storagereview.net/index.php?showtopic=5910
http://www.tech-report.com/news_reply.x/3035
http://techreport.com/ja.zz?comments=3035&page=5

------ quote ------
318. Posted at 07:00 am on Jun 6th 2002 by qubit

My DTLA-307075 (75GB 75GXP) went bad 6 months ago. But I didn't write off the data as being unrecoverable. I used WinHex to make a ghost image of the drive onto my new larger one, zeroing out the bad sectors in the target while logging each bad sector. (There were bad sectors in the FAT, so I combined the good parts from FATs 1 and 2.) At this point I had a working mirror of the drive that went bad, with zeroed-out 512 byte holes in files where the bad sectors were.

Then I set the 75GXP aside, because I knew it was possible to recover some of the data *on* the bad sectors, but I didn't have the tools to do it. So I decided to wait until then to RMA it. I did write a program to parse the bad sector list along with the partition's FAT, to create a list of files with bad sectors in them, so at least I knew which files were affected. There are 8516 bad sectors, and 722 files affected.

But this week, I got Linux working on my new computer (upgraded not too long after the 75GXP went bad) and modified the IDE taskfile driver to allow me to use READ LONG on the bad sectors -- thus allowing me to salvage data from the bad sectors, while avoiding the nasty click-click-click and delay of retrying (I can now repeat reads of a bad sector about twice per second), and I can also get the 40 bytes of ECC data. Each read of one sector turns up different data, and by comparing them I can try to divine what the original was. That part I'm still working on (it'd help a lot to know what encoding method the drive uses - it's not RLL(2,7), which is the only one I've been able to get the details on).

But today I did a different kind of analysis, with VERY interesting results. I wrote a program to convert the list of bad sectors into a graphics file, using the data on zones and sectors per track found in IBM's specification. After some time and manipulation, I discovered that all the bad sectors are in a line going from the outer edge 1/3 of the way to the inner edge, on one platter surface! It's actually a spiral, because of the platter rotation. But this explains why all the sectors went bad at once. One of the heads must have executed a write cycle while seeking! I could even measure the seek speed from my bad sector data -- it's 4.475 ms/track! (assuming precisely 7200 rpm) And there are evenly spaced nodes along the line where larger chunks were corrupted -- starting 300 ms apart, gradually fading to where they actually are *less* corrupted than the line itself, at 750 ms apart.

I don't know if anyone else will find this interesting, but I found it fascinating, and it explained a lot.
If you'd like to talk to me about the technical aspects of 75GXP failure, please email me at quSPAMLESSbitATinorNOSPAMbitDOTcom (remove the chunks of spam, change AT and DOT to their respective symbols).

For completeness, I should say that I had the drive for a year before it developed the rash of bad sectors. It's made in Hungary, SEP-2000. I wasn't using it too heavily until I got an HDTV card, then I was recording HDTV onto the drive; this heavy usage might have helped it along to failure. (2.4 MB/sec sustained writing -- and it was quite noisy too.) I updated the drive's firmware not too long after it developed the bad sectors; of course this didn't let me read them any better -- I didn't expect it to. I'm not sure if the firmware update will make the drive safe to use after a reformat, but I'll surely try it once I've recovered as much of the bad sectors as I can. Even if I still RMA the drive, I'd like to know.
------ end quote ------
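The statistical recovery technique described above (repeated raw reads of a marginal sector, then comparing them) boils down to a per-byte majority vote. A minimal sketch of just the voting step, with simulated noisy reads; actually issuing READ LONG requires low-level driver support, as the quoted post explains:

```python
import os
import random
from collections import Counter

def majority_vote(reads):
    """Combine several raw reads of the same sector, offset by offset,
    keeping the most common value seen at each byte position."""
    assert reads and all(len(r) == len(reads[0]) for r in reads)
    recovered = bytearray(len(reads[0]))
    for offset in range(len(reads[0])):
        votes = Counter(r[offset] for r in reads)
        recovered[offset], _ = votes.most_common(1)[0]
    return bytes(recovered)

# Toy demonstration: five noisy reads of a 512-byte sector, each with a few
# random single-byte errors, still vote their way back to the true contents.
truth = os.urandom(512)
reads = []
for _ in range(5):
    noisy = bytearray(truth)
    for pos in random.sample(range(512), 4):   # 4 corrupted bytes per read
        noisy[pos] ^= random.randrange(1, 256)
    reads.append(bytes(noisy))

print("recovered correctly:", majority_vote(reads) == truth)
```

Voting only works if the errors move around between reads; a sector that returns the same wrong bytes every time (or that the drive refuses to return at all) still needs redundancy elsewhere, which brings the discussion back to copies and parity.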