I'm getting a warm fuzzy feeling about this new ZFS, because it addresses many concerns I've always had about computer file systems, even as a non-expert user. It even looks like a decisive argument for choosing an operating system, much like Chipkill memory would be for choosing server hardware. After reading a bit here and there, including zfsadmin_1016.pdf, and searching specifically for answers (though not yet in the source code), I still have several questions, currently mostly out of curiosity. It's probably easiest to keep them together in one thread if a developer could answer all of them at once, perhaps just by pointing to existing documentation.

What are the numbers for the performance penalty of doing all that RAID stuff in software? Does this model assume that some processing load is moved from a dedicated card to free cycles on a second core or chip, or is it really negligible on any current single-core CPU? What about checksum verification?

Which of those weird RAID combinations like 1+0 does ZFS support now and in the future, and why (not)?

What are the characteristics of fletcher2 and fletcher4 for computing checksums compared to sha256, with regard to performance and the ability to detect (un)intentional data corruption? Is there such a thing as CRC-256, and how would it compare?

Do you have anything to say about the following? Currently, e.g., rsync suffers a severe performance penalty if you want to be certain, by using --checksum, that everything gets replicated correctly even in the presence of programs that manipulate the modification date of files. Resolving make dependencies is vulnerable as well. These problems could be addressed by having a good checksum at the file level (not the block level), computed on demand by the file system and retained as long as the file remains unchanged.

When doing self-healing, does ZFS support error correction at the block level that improves on any such support by a lower layer, or will it use or discard a block as a whole based only on a comparison with the checksum value stored in the referring node?

What is a clone? zfsadmin_1016.pdf neglects to address them in "1.1.5 Snapshots and Clones", and "Chapter 6: ZFS Snapshots and Clones" isn't very informative either. Why do they keep referring to an initial snapshot and prevent it from being freed? What's their purpose other than a quick COW copy, which might just as well be symmetrical? Speaking of COW, can individual files and directories be easily copied cheaply, or is that only possible through snapshots and clones?

Does ZFS regard all storage as the same? Some disks have lower latency and could be preferred for faster computing, some older disks have higher error rates and could be set aside for a cache or so, some media might be really slow (tapes), ... Does ZFS support using faster storage as a cache for slower storage?

Is it possible to migrate some data set to a particular disk or set of disks and then move that disk or set of disks to a different system, or would that require copying outside of ZFS (to some separately mounted extra disks or to the network)?
On Sun, Nov 20, 2005 at 07:45:47AM -0800, Raf Schietekat wrote:

> What are the numbers for the performance penalty of doing all that RAID
> stuff in software? Does this model assume that some processing load is
> moved from a dedicated card to free cycles on a second core or chip, or
> is it really negligible on any current single-core CPU? What about
> checksum verification?

Yes, software RAID and checksums in software will consume additional CPU. I don't think anyone has done a performance comparison of RAID-Z to some other form of hardware RAID, but there are many aspects of ZFS and RAID-Z that provide performance much higher than any other software RAID solution. We have not found the CPU overhead of checksums to be noticeable on any workload, though you can turn checksums off (for user data only) and test this for yourself.

> Which of those weird RAID combinations like 1+0 does ZFS support now
> and in the future, and why (not)?

ZFS only supports RAID-Z. See Jeff's blog for reasons why:

http://blogs.sun.com/roller/page/bonwick?entry=raid_z

> What are the characteristics of fletcher2 and fletcher4 for computing
> checksums compared to sha256, with regard to performance and the
> ability to detect (un)intentional data corruption? Is there such a
> thing as CRC-256, and how would it compare?

I don't have the numbers offhand; I believe that Jeff and Bill have done these calculations at some point. We do support SHA-256, which is a strong cryptographic checksum, if you want even higher data integrity.

> Do you have anything to say about the following? Currently, e.g., rsync
> suffers a severe performance penalty if you want to be certain, by
> using --checksum, that everything gets replicated correctly even in the
> presence of programs that manipulate the modification date of files.
> Resolving make dependencies is vulnerable as well. These problems could
> be addressed by having a good checksum at the file level (not the block
> level), computed on demand by the file system and retained as long as
> the file remains unchanged.

I'm not sure how this relates to ZFS - do you have something in mind? ZFS checksums are done per block.

> When doing self-healing, does ZFS support error correction at the block
> level that improves on any such support by a lower layer, or will it
> use or discard a block as a whole based only on a comparison with the
> checksum value stored in the referring node?

The current checksum algorithms do not support error correction - all blocks are either valid or invalid. Future checksum enhancements may allow for this ability.

> What is a clone? zfsadmin_1016.pdf neglects to address them in "1.1.5
> Snapshots and Clones", and "Chapter 6: ZFS Snapshots and Clones" isn't
> very informative either. Why do they keep referring to an initial
> snapshot and prevent it from being freed? What's their purpose other
> than a quick COW copy, which might just as well be symmetrical?
> Speaking of COW, can individual files and directories be easily copied
> cheaply, or is that only possible through snapshots and clones?

Clones share space with an underlying snapshot. You cannot delete the underlying snapshot because the snapshot "owns" the blocks that are being shared. The data will diverge over time, so if you rewrite all the blocks in the clone, it will end up taking the same amount of space as a full copy. Clones are extremely useful for quickly provisioning writable copies of a static source (such as workspaces, zone roots, upgrade scenarios, etc.).
While it would be nice to have all blocks reference counted (so that the original snapshot could be deleted), that is untenable for a variety of reasons. Hopefully one of the DMU junkies (Matt, Mark) will blog about this at some point.

Individual files and directories cannot be copied in this manner.

> Does ZFS regard all storage as the same? Some disks have lower latency
> and could be preferred for faster computing, some older disks have
> higher error rates and could be set aside for a cache or so, some media
> might be really slow (tapes), ... Does ZFS support using faster storage
> as a cache for slower storage?

Currently, dynamic striping in ZFS bases allocations on space usage patterns, trying to distribute data across all vdevs in the pool. Eventually, it would be nice to factor in performance characteristics (or fault characteristics) of the various vdevs. We have some ideas in this space, but there's no straightforward answer. The ZFS I/O pipeline makes this enhancement simple, once we decide exactly what we want to do.

> Is it possible to migrate some data set to a particular disk or set of
> disks and then move that disk or set of disks to a different system, or
> would that require copying outside of ZFS (to some separately mounted
> extra disks or to the network)?

Entire pools can be moved between systems using 'zpool export' and 'zpool import'. We have discussed the idea of a 'zpool split', which would let you export a subset of disks and import them on a different system.

Hope that helps.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
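As an aside on the fletcher2/fletcher4 versus sha256 question above: the sketch below shows the general shape of a fletcher4-style checksum (four running 64-bit sums over the data viewed as 32-bit words). It is an illustration of the algorithm family, not the exact ZFS kernel implementation; fletcher2 is similar but works on 64-bit input words with fewer accumulators. Function and variable names here are illustrative only.

#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of a fletcher4-style checksum: four running sums over the
 * buffer viewed as 32-bit words.  Assumes size is a multiple of 4
 * bytes (ZFS block sizes always are).  Not the actual ZFS code.
 */
static void
fletcher4_sketch(const void *buf, size_t size, uint64_t cksum[4])
{
	const uint32_t *ip = buf;
	const uint32_t *end = ip + size / sizeof (uint32_t);
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (; ip < end; ip++) {
		a += *ip;	/* plain sum of the words          */
		b += a;		/* position-weighted sum           */
		c += b;		/* second-order weighting          */
		d += c;		/* third-order weighting           */
	}
	cksum[0] = a;
	cksum[1] = b;
	cksum[2] = c;
	cksum[3] = d;
}

The performance trade-off Eric alludes to follows from the shape of that loop: a Fletcher-style pass is a few integer additions per word, while SHA-256 costs far more work per byte but gives cryptographic collision resistance, which matters if you also want to detect intentional corruption.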
Eric Schrock wrote:

> On Sun, Nov 20, 2005 at 07:45:47AM -0800, Raf Schietekat wrote:
>
> [...]
>
>> Do you have anything to say about the following? Currently, e.g., rsync
>> suffers a severe performance penalty if you want to be certain, by
>> using --checksum, that everything gets replicated correctly even in the
>> presence of programs that manipulate the modification date of files.
>> Resolving make dependencies is vulnerable as well. These problems could
>> be addressed by having a good checksum at the file level (not the block
>> level), computed on demand by the file system and retained as long as
>> the file remains unchanged.
>
> I'm not sure how this relates to ZFS - do you have something in mind?
> ZFS checksums are done per block.

Such extra read-only metadata (maybe that's another difference: block+invisible vs. file+readable?) would require implementation by the filesystem, to make sure it is invalidated whenever a file changes, and ZFS is new and already does checksums (though not MD5), so... Just an idea.

> [...]
On Sun, 2005-11-20 at 14:47, Eric Schrock wrote:

> ZFS only supports RAID-Z. See Jeff's blog for reasons why:
>
> http://blogs.sun.com/roller/page/bonwick?entry=raid_z

Hmm. The top-level vdevs form an implicit stripe (RAID 0). RAID 1 is just simple mirroring, so a pool with a single mirror vdev is RAID 1, and a pool consisting of multiple mirrors is pretty much RAID 1+0 (a stripe of mirrors).

So, under this interpretation, with ZFS you've got RAID levels 0, 1, and Z, plus striped-1 and striped-Z. With most other systems, you have 0, 1, 5, striped-1, and maybe striped-5.

Did I get this wrong?

- Bill
Raf Schietekat wrote:

> Eric Schrock wrote:
>
>> On Sun, Nov 20, 2005 at 07:45:47AM -0800, Raf Schietekat wrote:
>>
>> [...]
>>
>>> Do you have anything to say about the following? Currently, e.g., rsync
>>> suffers a severe performance penalty if you want to be certain, by
>>> using --checksum, that everything gets replicated correctly even in the
>>> presence of programs that manipulate the modification date of files.
>>> Resolving make dependencies is vulnerable as well. These problems could
>>> be addressed by having a good checksum at the file level (not the block
>>> level), computed on demand by the file system and retained as long as
>>> the file remains unchanged.
>>
>> I'm not sure how this relates to ZFS - do you have something in mind?
>> ZFS checksums are done per block.
>
> Such extra read-only metadata (maybe that's another difference:
> block+invisible vs. file+readable?) would require implementation by the
> filesystem, to make sure it is invalidated whenever a file changes, and
> ZFS is new and already does checksums (though not MD5), so... Just an
> idea.

You would still need to compare the checksums at the file level after the transfer, as the files from one system are more than likely not going to end up in the same block locations as the files on the other. In some rare cases you might even have different block sizes, and that would really mess things up.

Also, many would argue that replicating a file from one system to another would require a new checksum, as the metadata has changed (the file's location).
Raf Schietekat wrote:

> I'm getting a warm fuzzy feeling about this new ZFS, because it
> addresses many concerns I've always had about computer file systems,
> even as a non-expert user. It even looks like a decisive argument for
> choosing an operating system, much like Chipkill memory would be for
> choosing server hardware.

Really? I look at field failure data regularly. I don't see a strong correlation between memory failures and dead DRAM chips. While chipkill is better than non-chipkill, it doesn't seem to improve system reliability nearly as much as the marketing hype would lead you to believe. OTOH, SECDED ECC *does* have a strong, positive influence on reliability. Do you have data which indicates otherwise?

-- richard
Torrey McMahon wrote:

> [...]
>
> You would still need to compare the checksums at the file level after
> the transfer, as the files from one system are more than likely not
> going to end up in the same block locations as the files on the other.
> In some rare cases you might even have different block sizes, and that
> would really mess things up.

Absolutely; that's what we were talking about: the file level, where the file is the logical array of bytes implied by the blocks.

> Also, many would argue that replicating a file from one system to
> another would require a new checksum, as the metadata has changed (the
> file's location).

Location and attributes are probably far less expensive to compare than multimegabyte file content. I would not be surprised if there were (almost?) no benefit to maintaining checksums for entire subtrees, much like trying to achieve 100% defragmentation has not turned out to be a worthwhile strategy, even to the contrary (running the defragmenter to completion on NTFS probably causes a suboptimal data layout). A checksum would probably also help as an associative key to detect renaming to a different location, if it is a very safe hash.
Richard Elling - PAE wrote:

> Raf Schietekat wrote:
>
>> I'm getting a warm fuzzy feeling about this new ZFS, because it
>> addresses many concerns I've always had about computer file systems,
>> even as a non-expert user. It even looks like a decisive argument for
>> choosing an operating system, much like Chipkill memory would be for
>> choosing server hardware.
>
> Really? I look at field failure data regularly. I don't see a strong
> correlation between memory failures and dead DRAM chips. While chipkill
> is better than non-chipkill, it doesn't seem to improve system
> reliability nearly as much as the marketing hype would lead you to
> believe. OTOH, SECDED ECC *does* have a strong, positive influence on
> reliability. Do you have data which indicates otherwise?

No, just a warm fuzzy feeling (I've already confessed to being a non-expert). But I suppose that you mean that Chipkill does not improve markedly on ordinary SECDED with regard to the incidence of memory problems, not that SECDED is better than Chipkill, as a careless reader may interpret from your reply? Any objective statistics about that would obviously be interesting.

Chipkill may or may not have gotten its name from the notion of having a dead memory chip (how likely is that?), and it has specific provisions for that situation, but to me it mainly seems like SECDED on steroids, with multiple-error correctability. Memory scrubbing is probably orthogonal to both. If I'm mistaken, prey tell.

(What's it called when, e.g., a CNN journalist writes about AOL, an ancestor company, and dutifully points out that relation in his article? Richard Elling has an @sun.com address, and Sun doesn't sell systems with Chipkill. No attack, and maybe it's just me who doesn't automatically know your affiliation, but this is one of the things I find exemplary about American journalistic practice.)
Errata: careless -> casual reader; prey -> pray tell; Sun doesn't sell -> make systems with Chipkill; multi-megabyte (hyphenated, elsewhere in this thread).

Also, to clarify where I'm coming from: with regard to RAM I've never heard of any simpler ECC than SECDED ECC, the first step up from parity memory, with ECC currently kind of being synonymous with SECDED ECC. I've now clicked through to Richard Elling's profile, and while recognising his obvious expert status, I stand by my comments.
Sun does indeed sell systems with chipkill, and has since 1997. More clarifications below...

Raf Schietekat wrote:

> Richard Elling - PAE wrote:
>
>> Raf Schietekat wrote:
>>
>>> I'm getting a warm fuzzy feeling about this new ZFS, because it
>>> addresses many concerns I've always had about computer file systems,
>>> even as a non-expert user. It even looks like a decisive argument for
>>> choosing an operating system, much like Chipkill memory would be for
>>> choosing server hardware.
>>
>> Really? I look at field failure data regularly. I don't see a strong
>> correlation between memory failures and dead DRAM chips. While chipkill
>> is better than non-chipkill, it doesn't seem to improve system
>> reliability nearly as much as the marketing hype would lead you to
>> believe. OTOH, SECDED ECC *does* have a strong, positive influence on
>> reliability. Do you have data which indicates otherwise?
>
> No, just a warm fuzzy feeling (I've already confessed to being a
> non-expert). But I suppose that you mean that Chipkill does not improve
> markedly on ordinary SECDED with regard to the incidence of memory
> problems, not that SECDED is better than Chipkill, as a careless reader
> may interpret from your reply?

I thought that I had chosen my words carefully.

> Any objective statistics about that would obviously be interesting.

Indeed, hence my request for data. I have data on Sun systems (obviously) but would always like to correlate with others.

> Chipkill may or may not have gotten its name from the notion of having
> a dead memory chip (how likely is that?), and it has specific
> provisions for that situation, but to me it mainly seems like SECDED on
> steroids, with multiple-error correctability.

Technically, yes and no. Chipkill extends SECDED to be able to also detect errors in the collection of N bits which would be in a single chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you can correct any single-bit error and 4-bit errors on 4-bit boundaries. This is the "steroids" part :-)
This is not really multiple-error correction, because a DRAM failure is a single failure, even though it affects N bits.

> Memory scrubbing is probably orthogonal to both. If I'm mistaken, prey
> tell.

Yes, scrubbing is orthogonal. We scrub to detect, and hopefully correct, data that we aren't looking at. If a tree falls in the forest, we want to hear it. Scrubbing is imperfect, though, because basically you're just wandering through the forest and hoping to be within earshot before you need the data. There are several trade-offs here, and we'll talk about those with respect to ZFS later.

> (What's it called when, e.g., a CNN journalist writes about AOL, an
> ancestor company, and dutifully points out that relation in his
> article? Richard Elling has an @sun.com address, and Sun doesn't sell
> systems with Chipkill. No attack, and maybe it's just me who doesn't
> automatically know your affiliation, but this is one of the things I
> find exemplary about American journalistic practice.)

Due diligence? But since Sun sells systems with chipkill, I'm not sure where you are going...

If you have data which shows that chipkill offers significantly improved fault coverage, or data which shows actual chipkill reliability rates, I'd be very interested. If you have data which shows that correctable errors usually precede uncorrectable errors, I'd be even more interested. I'll even buy the beer :-)

-- richard
> [...]
> Also, to clarify where I'm coming from: with regard to RAM I've never
> heard of any simpler ECC than SECDED ECC, the first step up from parity
> memory, with ECC currently kind of being synonymous with SECDED ECC.

...and apparently SECDED is indeed higher than SEC?

> [...]
> I stand by my comments.

Oh boy...
Richard Elling - PAE wrote:

> Sun does indeed sell systems with chipkill, and has since 1997.

I stand corrected. I took one source (which I now can't find anymore, of course) at face value, coupled with a vague memory of not finding an alternative with Chipkill for those xSeries servers I bought a couple of years ago, and went with it, sorry. I didn't even double-check my sources, so what was I doing commenting about disclosure?

> [...]
>
> Raf Schietekat wrote:
>
> [...]
>
>> Chipkill may or may not have gotten its name from the notion of having
>> a dead memory chip (how likely is that?), and it has specific
>> provisions for that situation, but to me it mainly seems like SECDED
>> on steroids, with multiple-error correctability.
>
> Technically, yes and no. Chipkill extends SECDED to be able to also
> detect errors in the collection of N bits which would be in a single
> chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you
> can correct any single-bit error and 4-bit errors on 4-bit boundaries.
> This is the "steroids" part :-)
> This is not really multiple-error correction, because a DRAM failure is
> a single failure, even though it affects N bits.

I'll have to digest this first.

> [...]
>
> If you have data which shows that chipkill offers significantly
> improved fault coverage, or data which shows actual chipkill
> reliability rates, I'd be very interested.

I don't have any data myself. Should I not believe fig. 1 in IBM's white paper <www-03.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf>? Or is it meaningless without SECDED in the picture?

> If you have data which shows that correctable errors usually precede
> uncorrectable errors, I'd be even more interested. I'll even buy the
> beer :-)
> ...and apparently SECDED is indeed higher than SEC?

But only trivially so (just add a parity bit per ECC word, according to http://www.hackersdelight.org/ecc.pdf), and 64/72 is enough, so I think it is reasonable to assume all such ECC systems are in fact SEC/DED. Right?

> > I stand by my comments.
>
> Oh boy...

Oh well, it's not as bad as all that, I think; so far only my product knowledge has been shown to be deficient (I hope...).
> Richard Elling - PAE wrote:
>
> [...]
>
>> Technically, yes and no. Chipkill extends SECDED to be able to also
>> detect errors in the collection of N bits which would be in a single
>> chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you
>> can correct any single-bit error and 4-bit errors on 4-bit boundaries.
>> This is the "steroids" part :-)
>> This is not really multiple-error correction, because a DRAM failure
>> is a single failure, even though it affects N bits.
>
> I'll have to digest this first.

It's a big read, that white paper I mentioned, and I admit I've only read it cursorily so far. Apparently, multi-bit soft errors do tend to occur within a single chip (it's not just about entire chips or chip regions going dead), so it's not a bad idea to build in specific resilience against that.

>> [...]
>>
>> If you have data which shows that chipkill offers significantly
>> improved fault coverage, or data which shows actual chipkill
>> reliability rates, I'd be very interested.
>
> I don't have any data myself. Should I not believe fig. 1 in IBM's
> white paper <www-03.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf>?
> Or is it meaningless without SECDED in the picture?

Since SEC/DED is apparently only trivially more complicated than SEC, I'm assuming that SEC is indeed SEC/DED. So if you believe the Monte Carlo simulation and its assumptions, Chipkill is spectacularly more effective than SEC/DED. Not so? A pity, though: now I'll have to rely on the lottery instead for getting that big money transfer to my account. Well, that's if actual field data supports this.

> [...]
[apologies for seeming to drift away from ZFS, but we'll tie the loose ends together later]

Raf Schietekat wrote:

>> Richard Elling - PAE wrote:
>>
>> [...]
>>
>>> Technically, yes and no. Chipkill extends SECDED to be able to also
>>> detect errors in the collection of N bits which would be in a single
>>> chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you
>>> can correct any single-bit error and 4-bit errors on 4-bit boundaries.
>>> This is the "steroids" part :-)
>>> This is not really multiple-error correction, because a DRAM failure
>>> is a single failure, even though it affects N bits.
>>
>> I'll have to digest this first.
>
> It's a big read, that white paper I mentioned, and I admit I've only
> read it cursorily so far. Apparently, multi-bit soft errors do tend to
> occur within a single chip (it's not just about entire chips or chip
> regions going dead), so it's not a bad idea to build in specific
> resilience against that.

Yes, soft errors can cluster around an event and knock out several bits. However, these are usually detected and corrected as single-bit errors, because the adjacent bits in the array are not adjacent in the symbol or word.

>>> [...]
>>>
>>> If you have data which shows that chipkill offers significantly
>>> improved fault coverage, or data which shows actual chipkill
>>> reliability rates, I'd be very interested.
>>
>> I don't have any data myself. Should I not believe fig. 1 in IBM's
>> white paper <www-03.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf>?
>> Or is it meaningless without SECDED in the picture?

This data is 10 years old, and seems to be highly variant. I'm always looking for newer field data.

> Since SEC/DED is apparently only trivially more complicated than SEC,
> I'm assuming that SEC is indeed SEC/DED. So if you believe the Monte
> Carlo simulation and its assumptions, Chipkill is spectacularly more
> effective than SEC/DED. Not so? A pity, though: now I'll have to rely
> on the lottery instead for getting that big money transfer to my
> account. Well, that's if actual field data supports this.

For the masses:
  SEC == single (bit) error correction
  DED == double (bit) error detection

I don't know anyone who implements SEC without also DED, as the DED part is basically free. Some people may or may not implement chipkill. The trade-off for chipkill also affects the cost of DIMMs. For example, a 9-chip DIMM can't implement chipkill (in any modern system I'm aware of), but is less expensive and more reliable than an 18-chip DIMM. Cheap, fast, reliable: pick two.

If anyone has real experience seeing a chipkill event, and perhaps even some data, I'd be very interested in talking with you.

-- richard
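To illustrate why "the DED part is basically free": a SEC Hamming code becomes SEC/DED by adding one overall parity bit per word, exactly as the hackersdelight.org paper cited earlier describes. Below is a toy (8,4) version of that construction as a sketch; real memory controllers use a (72,64) code of the same family, and the function names here are illustrative only.

#include <stdio.h>

/*
 * Toy (8,4) extended Hamming code: single-error-correcting,
 * double-error-detecting (SEC/DED).  bit[0] is the overall (DED)
 * parity bit; bits 1..7 are the classic Hamming(7,4) positions
 * (parity at 1, 2, 4; data at 3, 5, 6, 7).
 */
static void
encode(const int d[4], int bit[8])
{
	bit[3] = d[0]; bit[5] = d[1]; bit[6] = d[2]; bit[7] = d[3];
	bit[1] = bit[3] ^ bit[5] ^ bit[7];   /* covers positions 1,3,5,7 */
	bit[2] = bit[3] ^ bit[6] ^ bit[7];   /* covers positions 2,3,6,7 */
	bit[4] = bit[5] ^ bit[6] ^ bit[7];   /* covers positions 4,5,6,7 */
	bit[0] = 0;
	for (int i = 1; i < 8; i++)          /* the "free" DED parity    */
		bit[0] ^= bit[i];
}

/* Returns 0: clean, 1: single error corrected, 2: double error detected. */
static int
decode(int bit[8])
{
	int s = 0, p = 0;

	for (int i = 1; i < 8; i++)
		if (bit[i])
			s ^= i;              /* Hamming syndrome         */
	for (int i = 0; i < 8; i++)
		p ^= bit[i];                 /* overall parity check     */

	if (s == 0 && p == 0)
		return (0);                  /* no error                 */
	if (p == 1) {                        /* odd number of flips: assume one */
		if (s != 0)
			bit[s] ^= 1;         /* correct data/parity bit  */
		else
			bit[0] ^= 1;         /* overall parity bit flipped */
		return (1);
	}
	return (2);                          /* even number of flips: detect only */
}

int
main(void)
{
	int d[4] = { 1, 0, 1, 1 }, w[8];

	encode(d, w);
	w[6] ^= 1;                           /* flip one bit             */
	printf("one flip  -> %d\n", decode(w));  /* 1: corrected         */
	w[3] ^= 1; w[5] ^= 1;                /* flip two bits            */
	printf("two flips -> %d\n", decode(w));  /* 2: detected, not fixed */
	return (0);
}

The single extra parity bit is what lets the decoder tell "one flip, correct it" apart from "two flips, don't guess" - which is the whole difference between SEC and SEC/DED.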
> [...]
>
> If anyone has real experience seeing a chipkill event, and perhaps even
> some data, I'd be very interested in talking with you.

I would also be interested to know more. Meanwhile, Richard Elling has been successful in eroding my warm fuzzy feeling...

BTW, I forgot to mention some more motivations for that file-level checksum I mentioned earlier (one-way, read-only, computed on demand, retained in the file system)... Next to helping with replication (and verification too) and make dependency tracking, it could make scripts launch faster (even extending to languages not traditionally thought of as meant for scripting, like C or C++, thus reducing clutter and administrative burden and increasing trustworthiness), Java programs (I suppose!), Java IDEs perhaps ("Scanning Project Classpaths" is now a repeated occurrence), etc., by offering a reliable key (the extent of that reliability has to be examined and described) for use with an application-specific cache. Well, it's just an idea, but I keep bumping into the absence of its implementation.

But that's all from me for now.
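To make the "checksum as a cache key" idea concrete, here is a minimal userland sketch of what applications have to do today: cache a content hash keyed on the file's stat() identity and recompute only when that identity changes. The names file_key_cache_t, cached_file_hash and hash_file_contents are hypothetical (any strong hash would do for the latter). The point of the proposal above is that this stat()-based heuristic is exactly what breaks when a program rewrites a file and restores its mtime, whereas a checksum maintained and invalidated by the filesystem itself would not have that hole.

#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>

/* Hypothetical helper: compute a 32-byte content hash of a file. */
extern int hash_file_contents(const char *path, unsigned char hash[32]);

typedef struct {
	dev_t         dev;
	ino_t         ino;
	off_t         size;
	time_t        mtime;
	unsigned char hash[32];      /* e.g. SHA-256 of the contents */
	int           valid;
} file_key_cache_t;

/*
 * Return a content hash for path, recomputing it only when the file's
 * stat() identity (device, inode, size, mtime) has changed since the
 * cached entry was filled in.  Returns 0 on success, -1 on error.
 */
static int
cached_file_hash(const char *path, file_key_cache_t *c,
    unsigned char hash_out[32])
{
	struct stat st;

	if (stat(path, &st) != 0)
		return (-1);

	if (!c->valid || c->dev != st.st_dev || c->ino != st.st_ino ||
	    c->size != st.st_size || c->mtime != st.st_mtime) {
		if (hash_file_contents(path, c->hash) != 0)
			return (-1);
		c->dev = st.st_dev;
		c->ino = st.st_ino;
		c->size = st.st_size;
		c->mtime = st.st_mtime;
		c->valid = 1;
	}
	memcpy(hash_out, c->hash, 32);
	return (0);
}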
Raf Schietekat wrote:

>> [...]
>>
>> If anyone has real experience seeing a chipkill event, and perhaps
>> even some data, I'd be very interested in talking with you.
>
> I would also be interested to know more. Meanwhile, Richard Elling has
> been successful in eroding my warm fuzzy feeling...

I didn't intend to frighten, but us RAS guys are always thinking about how things break... makes for odd dinner conversations with the family :-)

What you will see more of in Solaris, with ZFS as an example, is tight integration between the hardware capabilities and the OS for fault management. We have to move away from reliance on purely hardware forms of resilience. But doing everything in software is also difficult. We leverage the two in the Solaris Fault Management Architecture. For example, if we see a number of correctable errors in a memory page, Solaris can relocate the data and stop using the page. For an uncorrectable error, rather than panic, we can restart the application using the page. We're studying this in the field to try to measure the impact on availability. So far we can say that FMA has positively contributed to better availability. This is all goodness.

-- richard
Richard Elling wrote:

> I didn't intend to frighten, but us RAS guys are always thinking about
> how things break... makes for odd dinner conversations with the family
> :-)

Oh, I'm not frightened; I'm still confident that Chipkill is strictly better than ordinary SEC/DED, I'm just not sure anymore how relevant it really is.
> > When doing self-healing, does ZFS support error correction at the
> > block level that improves on any such support by a lower layer, or
> > will it use or discard a block as a whole based only on a comparison
> > with the checksum value stored in the referring node?
>
> The current checksum algorithms do not support error correction - all
> blocks are either valid or invalid. Future checksum enhancements may
> allow for this ability.

I think you're getting into dangerous territory if you believe that ZFS block-level checksums could be used for error correction, not just error detection. Consider that the bits on the disk are not raw bits to begin with: they're RLL- or PRML- or whatever-encoded on the media, so bit-level media failures get translated into much more complex bit transliterations by the time software sees them. Also, the disk itself has already applied its own error correction, to the best of its ability, before it handed the data back to you.

That means you're by no means seeing the raw bits from the platter and trying to model their potential failure modes with your checksum algorithm. Instead, you're seeing either (1) some bits which have already been mangled by the disk's error-correction algorithm (if the drive thinks it was able to restore the data), or (2) some perfectly good bits which ended up in the wrong place on the drive -- which is to say, no disk-level error correction was involved, but the ZFS checksum has no relationship whatever to the data actually found at the desired location. In either case, trying to correct the data is unlikely to yield the hoped-for results.

The upshot of all this is that ZFS should be considered entirely inappropriate for use with a non-replicated storage pool. In such a deployment, without an fsck-like repair facility, any storage failure with local effect cannot be corrected, and will leave the filesystem permanently damaged. (I'm assuming that scrubbing won't modify anything, even just to repair the consistency of the filesystem, because it has no second copy of the data to compare against.) At best, all later encounters with the damaged area would be detected and result in an application-level I/O error. Hopefully there are no scenarios in which the damaged area would ever be trusted. But if you lost a high-level directory in such a failure, there would be no hope of recovering anything deeper in that file tree -- meaning a local failure could manifest as something truly massive, with no way out except restoring the entire tree from the last tape backup (or perhaps from a recent snapshot, if you're lucky). It's not even clear, though, how ZFS would handle the updating of a damaged directory during such a restore operation, if the application-level restore program gets I/O errors when trying to read the existing damaged directory.

> > Speaking of COW, can individual files and directories be easily
> > copied cheaply, or is that only possible through snapshots and
> > clones?
>
> Individual files and directories cannot be copied in this manner.

From what I understand of ZFS, that's not quite accurate in a practical sense. Nested filesystems are easy and efficient to create, and you can have tons of them. So if you know beforehand that you will eventually want to clone or snapshot a particular file tree, just create its root as a filesystem instead of as a simple directory.
> I think you're getting into dangerous territory if you believe that ZFS
> block-level checksums could be used for error correction, not just
> error detection. [...] In either case, trying to correct the data is
> unlikely to yield the hoped-for results.

There's nothing dangerous about it. If the problem is corrupt media, then you're right: it's highly unlikely that it's only off by one bit. But for a bit flip in the drive's cache, or during the DMA, or on some bundle of wire that you'd just assume has parity but often doesn't, checksum-directed single-bit correction is perfectly reasonable (assuming the checksum is strong enough). We actually prototyped this a few years ago, but didn't get around to reimplementing it after rewriting some related code. It'll be back in the near future.

> The upshot of all this is that ZFS should be considered entirely
> inappropriate for use with a non-replicated storage pool.

Don't ask, don't tell? I agree that correctable errors are better than uncorrectable, but either is preferable to an undiagnosed error.

That said, to make single-copy ZFS pools more fault-tolerant we're planning to provide metadata replication even in pools that aren't otherwise replicated. The blkptr_t structure defined here:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/sys/spa.h

provides a hint as to how it'll work. Bill Moore, whose baby this is, may have more to say about it as the work progresses.

Jeff
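For readers wondering what "checksum-directed single-bit correction" means mechanically, here is a minimal sketch of the idea Jeff describes (not the Sun prototype): if a block fails verification, flip each bit in turn and test whether the stored checksum then matches. The cksum_t type, the callback, and try_single_bit_repair are placeholder names, not the ZFS zio interface; the approach only makes sense with a checksum strong enough that an accidental match is negligible (e.g. SHA-256), and it costs one checksum computation per bit tried.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
	uint64_t word[4];            /* 256-bit checksum, for example */
} cksum_t;

typedef void (*cksum_fn_t)(const void *buf, size_t size, cksum_t *out);

/*
 * Try to repair a block that failed verification by flipping one bit
 * at a time and re-checking against the stored checksum.  Returns the
 * bit index that was corrected, or -1 if no single-bit flip matches
 * (multi-bit damage, media corruption, etc. -- fall back to a mirror
 * or RAID-Z reconstruction instead).
 */
static long
try_single_bit_repair(uint8_t *data, size_t size,
    const cksum_t *stored, cksum_fn_t cksum)
{
	cksum_t actual;

	for (size_t bit = 0; bit < size * 8; bit++) {
		data[bit / 8] ^= (uint8_t)(1 << (bit % 8));   /* flip   */
		cksum(data, size, &actual);
		if (memcmp(&actual, stored, sizeof (actual)) == 0)
			return ((long)bit);                   /* repaired */
		data[bit / 8] ^= (uint8_t)(1 << (bit % 8));   /* undo   */
	}
	return (-1);
}

Note that the candidate repair is accepted only when the data re-verifies against the same stored checksum, which is why the strength of that checksum is the whole ballgame here.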
Jeff Bonwick wrote:

>> I think you're getting into dangerous territory if you believe that
>> ZFS block-level checksums could be used for error correction, not just
>> error detection. [...] In either case, trying to correct the data is
>> unlikely to yield the hoped-for results.
>
> There's nothing dangerous about it. If the problem is corrupt media,
> then you're right: it's highly unlikely that it's only off by one bit.
> But for a bit flip in the drive's cache, or during the DMA, or on some
> bundle of wire that you'd just assume has parity but often doesn't,
> checksum-directed single-bit correction is perfectly reasonable
> (assuming the checksum is strong enough).

We often do such ECC on the hardware side of the industry. To wit, many disk drive manufacturers use an ECC which can correct up to 5 bytes of data per sector. This is largely hidden from view, so you might not recognize that it occurs. What you do see is better drive robustness, which is a good thing.

[I can't speak for the ZFS development team, but I can dream :-)]

From a relative FIT-rate perspective, we do expect that disk drive media-related faults occur more often than other faults in the system. So it is not unreasonable for the system design to include multiple levels of ECC around each of the fault isolation zones. Indeed, this has been the basis on which we've built the industry to date. With ZFS's end-to-end checksumming, we move to the next level of robustness.

Given ZFS's architecture, it should be fairly simple to add any arbitrary ECC code into the mix to enhance robustness. Think of it like the compression or encryption options; we could have an enhanced ECC option.

-- richard
> > I think you're getting into dangerous territory if you believe that
> > ZFS block-level checksums could be used for error correction, not
> > just error detection. [...] In either case, trying to correct the
> > data is unlikely to yield the hoped-for results.
>
> There's nothing dangerous about it. If the problem is corrupt media,
> then you're right: it's highly unlikely that it's only off by one bit.
> But for a bit flip in the drive's cache, or during the DMA, or on some
> bundle of wire that you'd just assume has parity but often doesn't,
> checksum-directed single-bit correction is perfectly reasonable
> (assuming the checksum is strong enough).

That might be fine if you could distinguish corrupt-media failures from other failures, but in real life these failure modes will be conflated.

I've seen both extremes of storage failure. While cutting some copies of the Solaris 9 CDs, I verified each disc afterward. One disc had precisely one bad bit in 600 MB. On another disc, it was as though someone had, in a number of areas, intermittently scratched out bit 6 of a lot of bytes. In both cases the data was "readable" without I/O error, so I assume the failure happened on the way to the disc and the CD's own ECC codes encoded the bad data.

If this had happened on a ZFS filesystem, the one-bad-bit failure would have been correctable, and that's fine. The other failure I assume was due to some kind of cable, connector, or circuit failure in a byte-wide channel somewhere in the hardware. What scares me about trying to correct a failure like this is that it is way beyond a single bit flip. Your comment about assuming the checksum is strong enough definitely applies here. I would want to know the probability that the restoration could yield bad data because the damage exceeded the detection capabilities of the algorithm.

Question: would the checksum used for correction be the only checksum on the block (replacing, say, the fletcher2 checksum), or would it be a separate checksum in addition to the checksum currently implemented? A separate checksum would relieve my worries, because you would want to compare the repaired data against the original checksum, and that should provide reasonable reliability.