I'm getting a warm fuzzy feeling about this new ZFS, because it addresses many concerns I've always had about computer file systems, even as a non-expert user. It even looks like a decisive argument for choosing an operating system, much like Chipkill memory would be for choosing server hardware. After reading a bit here and there, including zfsadmin_1016.pdf, and searching specifically for answers (though not yet in the source code), I still have several questions, currently mostly out of curiosity. It's probably easiest to keep them together in one thread if a developer could answer all of them at once, perhaps just by pointing to existing documentation.

What are the numbers for the performance penalty of doing all that RAID stuff in software? Does this model assume that some processing load is moved from a dedicated card to free cycles on a second core or chip, or is it really negligible on any current single-core CPU? What about checksum verification?

Which of those weird RAID combinations like 1+0 does ZFS support now and in the future, and why (not)?

What are the characteristics of fletcher2 and fletcher4 for computing checksums compared to sha256, with regard to performance and the ability to detect (un)intentional data corruption? Is there such a thing as CRC-256, and how would it compare?

Do you have anything to say about the following? Currently, e.g., rsync suffers a severe performance penalty if you want to be certain, by using --checksum, that everything gets replicated correctly even in the presence of programs that manipulate the modification date of files. Resolving make dependencies is vulnerable as well. These problems could be addressed by having a good checksum at the file level (not the block level), computed on demand by the file system and retained as long as the file remains unchanged.

When doing self-healing, does ZFS support error correction at the block level that improves on any such support by a lower layer, or will it use or discard a block as a whole based only on a comparison with the checksum value stored in the referring node?

What is a clone? zfsadmin_1016.pdf neglects to address them in "1.1.5 Snapshots and Clones", and "Chapter 6: ZFS Snapshots and Clones" isn't very informative either. Why do they keep referring to an initial snapshot and prevent it from being freed? What's their purpose other than a quick COW copy, which might just as well be symmetrical? Speaking of COW, can individual files and directories be easily copied cheaply, or is that only possible through snapshots and clones?

Does ZFS regard all storage as the same? Some disks have lower latency and could be preferred for faster computing, some older disks have higher error rates and could be set aside for a cache or so, some media might be really slow (tapes), ... Does ZFS support using faster storage as a cache for slower storage?

Is it possible to migrate some data set to a particular disk or set of disks and then move that disk or set of disks to a different system, or would that require copying outside of ZFS (to some separately mounted extra disks or to the network)?
On Sun, Nov 20, 2005 at 07:45:47AM -0800, Raf Schietekat wrote:

> What are the numbers for the performance penalty of doing all that RAID
> stuff in software? Does this model assume that some processing load is
> moved from a dedicated card to free cycles on a second core or chip, or
> is it really negligible on any current single-core CPU? What about
> checksum verification?

Yes, software RAID and checksums in software will consume additional CPU. I don't think anyone has done a performance comparison of RAID-Z to some other form of hardware RAID, but there are many aspects of ZFS and RAID-Z that provide performance much higher than any other software RAID solution. We have not found the CPU overhead of checksums to be noticeable on any workload, though you can turn checksums off (for user data only) and test this for yourself.

> Which of those weird RAID combinations like 1+0 does ZFS support now
> and in the future, and why (not)?

ZFS only supports RAID-Z. See Jeff's blog for reasons why:

http://blogs.sun.com/roller/page/bonwick?entry=raid_z

> What are the characteristics of fletcher2 and fletcher4 for computing
> checksums compared to sha256, with regard to performance and the
> ability to detect (un)intentional data corruption? Is there such a
> thing as CRC-256, and how would it compare?

I don't have the numbers offhand; I believe that Jeff and Bill have done these calculations at some point. We do support SHA-256, which is a strong cryptographic checksum, if you want even higher data integrity.

> Do you have anything to say about the following? Currently, e.g., rsync
> suffers a severe performance penalty if you want to be certain, by
> using --checksum, that everything gets replicated correctly even in the
> presence of programs that manipulate the modification date of files.
> Resolving make dependencies is vulnerable as well. These problems could
> be addressed by having a good checksum at the file level (not the block
> level), computed on demand by the file system and retained as long as
> the file remains unchanged.

I'm not sure how this relates to ZFS - do you have something in mind? ZFS checksums are done per block.

> When doing self-healing, does ZFS support error correction at the block
> level that improves on any such support by a lower layer, or will it
> use or discard a block as a whole based only on a comparison with the
> checksum value stored in the referring node?

The current checksum algorithms do not support error correction - all blocks are either valid or invalid. Future checksum enhancements may allow for this ability.

> What is a clone? zfsadmin_1016.pdf neglects to address them in "1.1.5
> Snapshots and Clones", and "Chapter 6: ZFS Snapshots and Clones" isn't
> very informative either. Why do they keep referring to an initial
> snapshot and prevent it from being freed? What's their purpose other
> than a quick COW copy, which might just as well be symmetrical?
> Speaking of COW, can individual files and directories be easily copied
> cheaply, or is that only possible through snapshots and clones?

Clones share space with an underlying snapshot. You cannot delete the underlying snapshot because the snapshot "owns" the blocks that are being shared. The data will diverge over time, so if you rewrite all the blocks in the clone, it will end up taking the same amount of space as a full copy. Clones are extremely useful for quickly provisioning writable copies of a static source (such as workspaces, zone roots, upgrade scenarios, etc.).
While it would be nice to have all blocks reference counted (so that the original snapshot could be deleted), that is untenable for a variety of reasons. Hopefully one of the DMU junkies (Matt, Mark) will blog about this at some point.

Individual files and directories cannot be copied in this manner.

> Does ZFS regard all storage as the same? Some disks have lower latency
> and could be preferred for faster computing, some older disks have
> higher error rates and could be set aside for a cache or so, some media
> might be really slow (tapes), ... Does ZFS support using faster storage
> as a cache for slower storage?

Currently, dynamic striping in ZFS bases allocations on space usage patterns, trying to distribute data across all vdevs in the pool. Eventually, it would be nice to factor in performance characteristics (or fault characteristics) of the various vdevs. We have some ideas in this space, but there's no straightforward answer. The ZFS I/O pipeline makes this enhancement simple, once we decide exactly what we want to do.

> Is it possible to migrate some data set to a particular disk or set of
> disks and then move that disk or set of disks to a different system, or
> would that require copying outside of ZFS (to some separately mounted
> extra disks or to the network)?

Entire pools can be moved between systems using 'zpool export' and 'zpool import'. We have discussed the idea of a 'zpool split', which would let you export a subset of disks and import them on a different system.

Hope that helps.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
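As an aside on the fletcher2/fletcher4 versus sha256 question above: the sketch below shows the general shape of a fletcher4-style checksum (four running 64-bit sums over the data viewed as 32-bit words). It is an illustration of the algorithm family, not the exact ZFS kernel implementation; fletcher2 is similar but works on 64-bit input words with fewer accumulators. Function and variable names here are illustrative only.

#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of a fletcher4-style checksum: four running sums over the
 * buffer viewed as 32-bit words.  Assumes size is a multiple of 4
 * bytes (ZFS block sizes always are).  Not the actual ZFS code.
 */
static void
fletcher4_sketch(const void *buf, size_t size, uint64_t cksum[4])
{
	const uint32_t *ip = buf;
	const uint32_t *end = ip + size / sizeof (uint32_t);
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (; ip < end; ip++) {
		a += *ip;	/* plain sum of the words          */
		b += a;		/* position-weighted sum           */
		c += b;		/* second-order weighting          */
		d += c;		/* third-order weighting           */
	}
	cksum[0] = a;
	cksum[1] = b;
	cksum[2] = c;
	cksum[3] = d;
}

The performance trade-off Eric alludes to follows from the shape of that loop: a Fletcher-style pass is a few integer additions per word, while SHA-256 costs far more work per byte but gives cryptographic collision resistance, which matters if you also want to detect intentional corruption.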
Eric Schrock wrote:

> On Sun, Nov 20, 2005 at 07:45:47AM -0800, Raf Schietekat wrote:
>
> [...]
>
>> Do you have anything to say about the following? Currently, e.g., rsync
>> suffers a severe performance penalty if you want to be certain, by
>> using --checksum, that everything gets replicated correctly even in the
>> presence of programs that manipulate the modification date of files.
>> Resolving make dependencies is vulnerable as well. These problems could
>> be addressed by having a good checksum at the file level (not the block
>> level), computed on demand by the file system and retained as long as
>> the file remains unchanged.
>
> I'm not sure how this relates to ZFS - do you have something in mind?
> ZFS checksums are done per block.

Such extra read-only metadata (maybe that's another difference: block+invisible vs. file+readable?) would require implementation by the filesystem, to make sure it is invalidated whenever a file changes, and ZFS is new and already does checksums (though not MD5), so... Just an idea.

> [...]
On Sun, 2005-11-20 at 14:47, Eric Schrock wrote:

> ZFS only supports RAID-Z. See Jeff's blog for reasons why:
>
> http://blogs.sun.com/roller/page/bonwick?entry=raid_z

Hmm. The top-level vdevs form an implicit stripe (RAID 0). RAID 1 is just simple mirroring, so a pool with a single mirror vdev is RAID 1, and a pool consisting of multiple mirrors is pretty much RAID 1+0 (a stripe of mirrors).

So, under this interpretation, with ZFS you've got RAID levels 0, 1, and Z, plus striped-1 and striped-Z. With most other systems, you have 0, 1, 5, striped-1, and maybe striped-5.

Did I get this wrong?

- Bill
Raf Schietekat wrote:

> Eric Schrock wrote:
>
>> On Sun, Nov 20, 2005 at 07:45:47AM -0800, Raf Schietekat wrote:
>>
>> [...]
>>
>>> Do you have anything to say about the following? Currently, e.g., rsync
>>> suffers a severe performance penalty if you want to be certain, by
>>> using --checksum, that everything gets replicated correctly even in the
>>> presence of programs that manipulate the modification date of files.
>>> Resolving make dependencies is vulnerable as well. These problems could
>>> be addressed by having a good checksum at the file level (not the block
>>> level), computed on demand by the file system and retained as long as
>>> the file remains unchanged.
>>
>> I'm not sure how this relates to ZFS - do you have something in mind?
>> ZFS checksums are done per block.
>
> Such extra read-only metadata (maybe that's another difference:
> block+invisible vs. file+readable?) would require implementation by the
> filesystem, to make sure it is invalidated whenever a file changes, and
> ZFS is new and already does checksums (though not MD5), so... Just an
> idea.

You would still need to compare the checksums at the file level after the transfer, as the files from one system are more than likely not going to end up in the same block locations as the files on the other. In some rare cases you might even have different block sizes, and that would really mess things up.

Also, many would argue that replicating a file from one system to another would require a new checksum, as the metadata has changed (the file's location).
Raf Schietekat wrote:

> I'm getting a warm fuzzy feeling about this new ZFS, because it
> addresses many concerns I've always had about computer file systems,
> even as a non-expert user. It even looks like a decisive argument for
> choosing an operating system, much like Chipkill memory would be for
> choosing server hardware.

Really? I look at field failure data regularly. I don't see a strong correlation between memory failures and dead DRAM chips. While chipkill is better than non-chipkill, it doesn't seem to improve system reliability nearly as much as the marketing hype would lead you to believe. OTOH, SECDED ECC *does* have a strong, positive influence on reliability. Do you have data which indicates otherwise?

-- richard
Torrey McMahon wrote:

> [...]
>
> You would still need to compare the checksums at the file level after
> the transfer, as the files from one system are more than likely not
> going to end up in the same block locations as the files on the other.
> In some rare cases you might even have different block sizes, and that
> would really mess things up.

Absolutely; that's what we were talking about: the file level, where the file is the logical array of bytes implied by the blocks.

> Also, many would argue that replicating a file from one system to
> another would require a new checksum, as the metadata has changed (the
> file's location).

Location and attributes are probably far less expensive to compare than multimegabyte file content. I would not be surprised if there were (almost?) no benefit to maintaining checksums for entire subtrees, much like trying to achieve 100% defragmentation has not turned out to be a worthwhile strategy, even to the contrary (running the defragmenter to completion on NTFS probably causes a suboptimal data layout). A checksum would probably also help as an associative key to detect renaming to a different location, if it is a very safe hash.
Richard Elling - PAE wrote:

> Raf Schietekat wrote:
>
>> I'm getting a warm fuzzy feeling about this new ZFS, because it
>> addresses many concerns I've always had about computer file systems,
>> even as a non-expert user. It even looks like a decisive argument for
>> choosing an operating system, much like Chipkill memory would be for
>> choosing server hardware.
>
> Really? I look at field failure data regularly. I don't see a strong
> correlation between memory failures and dead DRAM chips. While chipkill
> is better than non-chipkill, it doesn't seem to improve system
> reliability nearly as much as the marketing hype would lead you to
> believe. OTOH, SECDED ECC *does* have a strong, positive influence on
> reliability. Do you have data which indicates otherwise?

No, just a warm fuzzy feeling (I've already confessed to being a non-expert). But I suppose that you mean that Chipkill does not improve markedly on ordinary SECDED with regard to the incidence of memory problems, not that SECDED is better than Chipkill, as a careless reader may interpret from your reply? Any objective statistics about that would obviously be interesting.

Chipkill may or may not have gotten its name from the notion of having a dead memory chip (how likely is that?), and it has specific provisions for that situation, but to me it mainly seems like SECDED on steroids, with multiple-error correctability. Memory scrubbing is probably orthogonal to both. If I'm mistaken, prey tell.

(What's it called when, e.g., a CNN journalist writes about AOL, an ancestor company, and dutifully points out that relation in his article? Richard Elling has an @sun.com address, and Sun doesn't sell systems with Chipkill. No attack, and maybe it's just me who doesn't automatically know your affiliation, but this is one of the things I find exemplary about American journalistic practice.)
Errata: careless -> casual reader; prey -> pray tell; Sun doesn't sell -> make systems with Chipkill; multi-megabyte (hyphenated, elsewhere in this thread).

Also, to clarify where I'm coming from: with regard to RAM I've never heard of any simpler ECC than SECDED ECC, the first step up from parity memory, with ECC currently kind of being synonymous with SECDED ECC. I've now clicked through to Richard Elling's profile, and while recognising his obvious expert status, I stand by my comments.
Sun does indeed sell systems with chipkill, and has since 1997. More clarifications below...

Raf Schietekat wrote:

> Richard Elling - PAE wrote:
>
>> Raf Schietekat wrote:
>>
>>> I'm getting a warm fuzzy feeling about this new ZFS, because it
>>> addresses many concerns I've always had about computer file systems,
>>> even as a non-expert user. It even looks like a decisive argument for
>>> choosing an operating system, much like Chipkill memory would be for
>>> choosing server hardware.
>>
>> Really? I look at field failure data regularly. I don't see a strong
>> correlation between memory failures and dead DRAM chips. While chipkill
>> is better than non-chipkill, it doesn't seem to improve system
>> reliability nearly as much as the marketing hype would lead you to
>> believe. OTOH, SECDED ECC *does* have a strong, positive influence on
>> reliability. Do you have data which indicates otherwise?
>
> No, just a warm fuzzy feeling (I've already confessed to being a
> non-expert). But I suppose that you mean that Chipkill does not improve
> markedly on ordinary SECDED with regard to the incidence of memory
> problems, not that SECDED is better than Chipkill, as a careless reader
> may interpret from your reply?

I thought that I had chosen my words carefully.

> Any objective statistics about that would obviously be interesting.

Indeed, hence my request for data. I have data on Sun systems (obviously) but would always like to correlate with others.

> Chipkill may or may not have gotten its name from the notion of having
> a dead memory chip (how likely is that?), and it has specific
> provisions for that situation, but to me it mainly seems like SECDED on
> steroids, with multiple-error correctability.

Technically, yes and no. Chipkill extends SECDED to be able to also detect errors in the collection of N bits which would be in a single chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you can correct any single-bit error and 4-bit errors on 4-bit boundaries. This is the "steroids" part :-)
This is not really multiple-error correction, because a DRAM failure is a single failure, even though it affects N bits.

> Memory scrubbing is probably orthogonal to both. If I'm mistaken, prey
> tell.

Yes, scrubbing is orthogonal. We scrub to detect, and hopefully correct, data that we aren't looking at. If a tree falls in the forest, we want to hear it. Scrubbing is imperfect, though, because basically you're just wandering through the forest and hoping to be within earshot before you need the data. There are several trade-offs here, and we'll talk about those with respect to ZFS later.

> (What's it called when, e.g., a CNN journalist writes about AOL, an
> ancestor company, and dutifully points out that relation in his
> article? Richard Elling has an @sun.com address, and Sun doesn't sell
> systems with Chipkill. No attack, and maybe it's just me who doesn't
> automatically know your affiliation, but this is one of the things I
> find exemplary about American journalistic practice.)

Due diligence? But since Sun sells systems with chipkill, I'm not sure where you are going...

If you have data which shows that chipkill offers significantly improved fault coverage, or data which shows actual chipkill reliability rates, I'd be very interested. If you have data which shows that correctable errors usually precede uncorrectable errors, I'd be even more interested. I'll even buy the beer :-)

-- richard
> [...]
> Also, to clarify where I'm coming from: with regard to RAM I've never
> heard of any simpler ECC than SECDED ECC, the first step up from parity
> memory, with ECC currently kind of being synonymous with SECDED ECC.

...and apparently SECDED is indeed higher than SEC?

> [...]
> I stand by my comments.

Oh boy...
Richard Elling - PAE wrote:

> Sun does indeed sell systems with chipkill, and has since 1997.

I stand corrected. I took one source (which I now can't find anymore, of course) at face value, coupled with a vague memory of not finding an alternative with Chipkill for those xSeries servers I bought a couple of years ago, and went with it, sorry. I didn't even double-check my sources, so what was I doing commenting about disclosure?

> [...]
>
> Raf Schietekat wrote:
>
> [...]
>
>> Chipkill may or may not have gotten its name from the notion of having
>> a dead memory chip (how likely is that?), and it has specific
>> provisions for that situation, but to me it mainly seems like SECDED
>> on steroids, with multiple-error correctability.
>
> Technically, yes and no. Chipkill extends SECDED to be able to also
> detect errors in the collection of N bits which would be in a single
> chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you
> can correct any single-bit error and 4-bit errors on 4-bit boundaries.
> This is the "steroids" part :-)
> This is not really multiple-error correction, because a DRAM failure is
> a single failure, even though it affects N bits.

I'll have to digest this first.

> [...]
>
> If you have data which shows that chipkill offers significantly
> improved fault coverage, or data which shows actual chipkill
> reliability rates, I'd be very interested.

I don't have any data myself. Should I not believe fig. 1 in IBM's white paper <www-03.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf>? Or is it meaningless without SECDED in the picture?

> If you have data which shows that correctable errors usually precede
> uncorrectable errors, I'd be even more interested. I'll even buy the
> beer :-)
> ...and apparently SECDED is indeed higher than SEC?

But only trivially so (just add a parity bit per ECC word, according to http://www.hackersdelight.org/ecc.pdf), and 64/72 is enough, so I think it is reasonable to assume all such ECC systems are in fact SEC/DED. Right?

> > I stand by my comments.
>
> Oh boy...

Oh well, it's not as bad as all that, I think; so far only my product knowledge has been shown to be deficient (I hope...).
> Richard Elling - PAE wrote:
>
> [...]
>
>> Technically, yes and no. Chipkill extends SECDED to be able to also
>> detect errors in the collection of N bits which would be in a single
>> chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you
>> can correct any single-bit error and 4-bit errors on 4-bit boundaries.
>> This is the "steroids" part :-)
>> This is not really multiple-error correction, because a DRAM failure
>> is a single failure, even though it affects N bits.
>
> I'll have to digest this first.

It's a big read, that white paper I mentioned, and I admit I've only read it cursorily so far. Apparently, multi-bit soft errors do tend to occur within a single chip (it's not just about entire chips or chip regions going dead), so it's not a bad idea to build in specific resilience against that.

>> [...]
>>
>> If you have data which shows that chipkill offers significantly
>> improved fault coverage, or data which shows actual chipkill
>> reliability rates, I'd be very interested.
>
> I don't have any data myself. Should I not believe fig. 1 in IBM's
> white paper <www-03.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf>?
> Or is it meaningless without SECDED in the picture?

Since SEC/DED is apparently only trivially more complicated than SEC, I'm assuming that SEC is indeed SEC/DED. So if you believe the Monte Carlo simulation and its assumptions, Chipkill is spectacularly more effective than SEC/DED. Not so? A pity, though: now I'll have to rely on the lottery instead for getting that big money transfer to my account. Well, that's if actual field data supports this.

> [...]
[apologies for seeming to drift away from ZFS, but we'll tie the loose ends together later]

Raf Schietekat wrote:

>> Richard Elling - PAE wrote:
>>
>> [...]
>>
>>> Technically, yes and no. Chipkill extends SECDED to be able to also
>>> detect errors in the collection of N bits which would be in a single
>>> chip. Typically we see this as 4 bits (x4 DRAMs). In such systems you
>>> can correct any single-bit error and 4-bit errors on 4-bit boundaries.
>>> This is the "steroids" part :-)
>>> This is not really multiple-error correction, because a DRAM failure
>>> is a single failure, even though it affects N bits.
>>
>> I'll have to digest this first.
>
> It's a big read, that white paper I mentioned, and I admit I've only
> read it cursorily so far. Apparently, multi-bit soft errors do tend to
> occur within a single chip (it's not just about entire chips or chip
> regions going dead), so it's not a bad idea to build in specific
> resilience against that.

Yes, soft errors can cluster around an event and knock out several bits. However, these are usually detected and corrected as single-bit errors, because the adjacent bits in the array are not adjacent in the symbol or word.

>>> [...]
>>>
>>> If you have data which shows that chipkill offers significantly
>>> improved fault coverage, or data which shows actual chipkill
>>> reliability rates, I'd be very interested.
>>
>> I don't have any data myself. Should I not believe fig. 1 in IBM's
>> white paper <www-03.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf>?
>> Or is it meaningless without SECDED in the picture?

This data is 10 years old, and seems to be highly variant. I'm always looking for newer field data.

> Since SEC/DED is apparently only trivially more complicated than SEC,
> I'm assuming that SEC is indeed SEC/DED. So if you believe the Monte
> Carlo simulation and its assumptions, Chipkill is spectacularly more
> effective than SEC/DED. Not so? A pity, though: now I'll have to rely
> on the lottery instead for getting that big money transfer to my
> account. Well, that's if actual field data supports this.

For the masses:
  SEC == single (bit) error correction
  DED == double (bit) error detection

I don't know anyone who implements SEC without also DED, as the DED part is basically free. Some people may or may not implement chipkill. The trade-off for chipkill also affects the cost of DIMMs. For example, a 9-chip DIMM can't implement chipkill (in any modern system I'm aware of), but is less expensive and more reliable than an 18-chip DIMM. Cheap, fast, reliable: pick two.

If anyone has real experience seeing a chipkill event, and perhaps even some data, I'd be very interested in talking with you.

-- richard
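To illustrate why "the DED part is basically free": a SEC Hamming code becomes SEC/DED by adding one overall parity bit per word, exactly as the hackersdelight.org paper cited earlier describes. Below is a toy (8,4) version of that construction as a sketch; real memory controllers use a (72,64) code of the same family, and the function names here are illustrative only.

#include <stdio.h>

/*
 * Toy (8,4) extended Hamming code: single-error-correcting,
 * double-error-detecting (SEC/DED).  bit[0] is the overall (DED)
 * parity bit; bits 1..7 are the classic Hamming(7,4) positions
 * (parity at 1, 2, 4; data at 3, 5, 6, 7).
 */
static void
encode(const int d[4], int bit[8])
{
	bit[3] = d[0]; bit[5] = d[1]; bit[6] = d[2]; bit[7] = d[3];
	bit[1] = bit[3] ^ bit[5] ^ bit[7];   /* covers positions 1,3,5,7 */
	bit[2] = bit[3] ^ bit[6] ^ bit[7];   /* covers positions 2,3,6,7 */
	bit[4] = bit[5] ^ bit[6] ^ bit[7];   /* covers positions 4,5,6,7 */
	bit[0] = 0;
	for (int i = 1; i < 8; i++)          /* the "free" DED parity    */
		bit[0] ^= bit[i];
}

/* Returns 0: clean, 1: single error corrected, 2: double error detected. */
static int
decode(int bit[8])
{
	int s = 0, p = 0;

	for (int i = 1; i < 8; i++)
		if (bit[i])
			s ^= i;              /* Hamming syndrome         */
	for (int i = 0; i < 8; i++)
		p ^= bit[i];                 /* overall parity check     */

	if (s == 0 && p == 0)
		return (0);                  /* no error                 */
	if (p == 1) {                        /* odd number of flips: assume one */
		if (s != 0)
			bit[s] ^= 1;         /* correct data/parity bit  */
		else
			bit[0] ^= 1;         /* overall parity bit flipped */
		return (1);
	}
	return (2);                          /* even number of flips: detect only */
}

int
main(void)
{
	int d[4] = { 1, 0, 1, 1 }, w[8];

	encode(d, w);
	w[6] ^= 1;                           /* flip one bit             */
	printf("one flip  -> %d\n", decode(w));  /* 1: corrected         */
	w[3] ^= 1; w[5] ^= 1;                /* flip two bits            */
	printf("two flips -> %d\n", decode(w));  /* 2: detected, not fixed */
	return (0);
}

The single extra parity bit is what lets the decoder tell "one flip, correct it" apart from "two flips, don't guess" - which is the whole difference between SEC and SEC/DED.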
> [...]
>
> If anyone has real experience seeing a chipkill event, and perhaps even
> some data, I'd be very interested in talking with you.

I would also be interested to know more. Meanwhile, Richard Elling has been successful in eroding my warm fuzzy feeling...

BTW, I forgot to mention some more motivations for that file-level checksum I mentioned earlier (one-way, read-only, computed on demand, retained in the file system)... Next to helping with replication (and verification too) and make dependency tracking, it could make scripts launch faster (even extending to languages not traditionally thought of as meant for scripting, like C or C++, thus reducing clutter and administrative burden and increasing trustworthiness), Java programs (I suppose!), Java IDEs perhaps ("Scanning Project Classpaths" is now a repeated occurrence), etc., by offering a reliable key (the extent of that reliability has to be examined and described) for use with an application-specific cache. Well, it's just an idea, but I keep bumping into the absence of its implementation.

But that's all from me for now.
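To make the "checksum as a cache key" idea concrete, here is a minimal userland sketch of what applications have to do today: cache a content hash keyed on the file's stat() identity and recompute only when that identity changes. The names file_key_cache_t, cached_file_hash and hash_file_contents are hypothetical (any strong hash would do for the latter). The point of the proposal above is that this stat()-based heuristic is exactly what breaks when a program rewrites a file and restores its mtime, whereas a checksum maintained and invalidated by the filesystem itself would not have that hole.

#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>

/* Hypothetical helper: compute a 32-byte content hash of a file. */
extern int hash_file_contents(const char *path, unsigned char hash[32]);

typedef struct {
	dev_t         dev;
	ino_t         ino;
	off_t         size;
	time_t        mtime;
	unsigned char hash[32];      /* e.g. SHA-256 of the contents */
	int           valid;
} file_key_cache_t;

/*
 * Return a content hash for path, recomputing it only when the file's
 * stat() identity (device, inode, size, mtime) has changed since the
 * cached entry was filled in.  Returns 0 on success, -1 on error.
 */
static int
cached_file_hash(const char *path, file_key_cache_t *c,
    unsigned char hash_out[32])
{
	struct stat st;

	if (stat(path, &st) != 0)
		return (-1);

	if (!c->valid || c->dev != st.st_dev || c->ino != st.st_ino ||
	    c->size != st.st_size || c->mtime != st.st_mtime) {
		if (hash_file_contents(path, c->hash) != 0)
			return (-1);
		c->dev = st.st_dev;
		c->ino = st.st_ino;
		c->size = st.st_size;
		c->mtime = st.st_mtime;
		c->valid = 1;
	}
	memcpy(hash_out, c->hash, 32);
	return (0);
}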
Raf Schietekat wrote:

>> [...]
>>
>> If anyone has real experience seeing a chipkill event, and perhaps
>> even some data, I'd be very interested in talking with you.
>
> I would also be interested to know more. Meanwhile, Richard Elling has
> been successful in eroding my warm fuzzy feeling...

I didn't intend to frighten, but us RAS guys are always thinking about how things break... makes for odd dinner conversations with the family :-)

What you will see more of in Solaris, with ZFS as an example, is tight integration between the hardware capabilities and the OS for fault management. We have to move away from reliance on purely hardware forms of resilience. But doing everything in software is also difficult. We leverage the two in the Solaris Fault Management Architecture. For example, if we see a number of correctable errors in a memory page, Solaris can relocate the data and stop using the page. For an uncorrectable error, rather than panic, we can restart the application using the page. We're studying this in the field to try to measure the impact on availability. So far we can say that FMA has positively contributed to better availability. This is all goodness.

-- richard
Richard Elling wrote:

> I didn't intend to frighten, but us RAS guys are always thinking about
> how things break... makes for odd dinner conversations with the family
> :-)

Oh, I'm not frightened; I'm still confident that Chipkill is strictly better than ordinary SEC/DED, I'm just not sure anymore how relevant it really is.
> > When doing self-healing, does ZFS support error correction at the
> > block level that improves on any such support by a lower layer, or
> > will it use or discard a block as a whole based only on a comparison
> > with the checksum value stored in the referring node?
>
> The current checksum algorithms do not support error correction - all
> blocks are either valid or invalid. Future checksum enhancements may
> allow for this ability.

I think you're getting into dangerous territory if you believe that ZFS block-level checksums could be used for error correction, not just error detection. Consider that the bits on the disk are not raw bits to begin with: they're RLL- or PRML- or whatever-encoded on the media, so bit-level media failures get translated into much more complex bit transliterations by the time software sees them. Also, the disk itself has already applied its own error correction, to the best of its ability, before it handed the data back to you.

That means you're by no means seeing the raw bits from the platter and trying to model their potential failure modes with your checksum algorithm. Instead, you're seeing either (1) some bits which have already been mangled by the disk's error-correction algorithm (if the drive thinks it was able to restore the data), or (2) some perfectly good bits which ended up in the wrong place on the drive -- which is to say, no disk-level error correction was involved, but the ZFS checksum has no relationship whatever to the data actually found at the desired location. In either case, trying to correct the data is unlikely to yield the hoped-for results.

The upshot of all this is that ZFS should be considered entirely inappropriate for use with a non-replicated storage pool. In such a deployment, without an fsck-like repair facility, any storage failure with local effect cannot be corrected, and will leave the filesystem permanently damaged. (I'm assuming that scrubbing won't modify anything, even just to repair the consistency of the filesystem, because it has no second copy of the data to compare against.) At best, all later encounters with the damaged area would be detected and result in an application-level I/O error. Hopefully there are no scenarios in which the damaged area would ever be trusted. But if you lost a high-level directory in such a failure, there would be no hope of recovering anything deeper in that file tree -- meaning a local failure could manifest as something truly massive, with no way out except restoring the entire tree from the last tape backup (or perhaps from a recent snapshot, if you're lucky). It's not even clear, though, how ZFS would handle the updating of a damaged directory during such a restore operation, if the application-level restore program gets I/O errors when trying to read the existing damaged directory.

> > Speaking of COW, can individual files and directories be easily
> > copied cheaply, or is that only possible through snapshots and
> > clones?
>
> Individual files and directories cannot be copied in this manner.

From what I understand of ZFS, that's not quite accurate in a practical sense. Nested filesystems are easy and efficient to create, and you can have tons of them. So if you know beforehand that you will eventually want to clone or snapshot a particular file tree, just create its root as a filesystem instead of as a simple directory.
> I think you're getting into dangerous territory if you believe that ZFS
> block-level checksums could be used for error correction, not just
> error detection. [...] In either case, trying to correct the data is
> unlikely to yield the hoped-for results.

There's nothing dangerous about it. If the problem is corrupt media, then you're right: it's highly unlikely that it's only off by one bit. But for a bit flip in the drive's cache, or during the DMA, or on some bundle of wire that you'd just assume has parity but often doesn't, checksum-directed single-bit correction is perfectly reasonable (assuming the checksum is strong enough). We actually prototyped this a few years ago, but didn't get around to reimplementing it after rewriting some related code. It'll be back in the near future.

> The upshot of all this is that ZFS should be considered entirely
> inappropriate for use with a non-replicated storage pool.

Don't ask, don't tell? I agree that correctable errors are better than uncorrectable, but either is preferable to an undiagnosed error.

That said, to make single-copy ZFS pools more fault-tolerant we're planning to provide metadata replication even in pools that aren't otherwise replicated. The blkptr_t structure defined here:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/sys/spa.h

provides a hint as to how it'll work. Bill Moore, whose baby this is, may have more to say about it as the work progresses.

Jeff
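For readers wondering what "checksum-directed single-bit correction" means mechanically, here is a minimal sketch of the idea Jeff describes (not the Sun prototype): if a block fails verification, flip each bit in turn and test whether the stored checksum then matches. The cksum_t type, the callback, and try_single_bit_repair are placeholder names, not the ZFS zio interface; the approach only makes sense with a checksum strong enough that an accidental match is negligible (e.g. SHA-256), and it costs one checksum computation per bit tried.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
	uint64_t word[4];            /* 256-bit checksum, for example */
} cksum_t;

typedef void (*cksum_fn_t)(const void *buf, size_t size, cksum_t *out);

/*
 * Try to repair a block that failed verification by flipping one bit
 * at a time and re-checking against the stored checksum.  Returns the
 * bit index that was corrected, or -1 if no single-bit flip matches
 * (multi-bit damage, media corruption, etc. -- fall back to a mirror
 * or RAID-Z reconstruction instead).
 */
static long
try_single_bit_repair(uint8_t *data, size_t size,
    const cksum_t *stored, cksum_fn_t cksum)
{
	cksum_t actual;

	for (size_t bit = 0; bit < size * 8; bit++) {
		data[bit / 8] ^= (uint8_t)(1 << (bit % 8));   /* flip   */
		cksum(data, size, &actual);
		if (memcmp(&actual, stored, sizeof (actual)) == 0)
			return ((long)bit);                   /* repaired */
		data[bit / 8] ^= (uint8_t)(1 << (bit % 8));   /* undo   */
	}
	return (-1);
}

Note that the candidate repair is accepted only when the data re-verifies against the same stored checksum, which is why the strength of that checksum is the whole ballgame here.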
Jeff Bonwick wrote:

>> I think you're getting into dangerous territory if you believe that
>> ZFS block-level checksums could be used for error correction, not just
>> error detection. [...] In either case, trying to correct the data is
>> unlikely to yield the hoped-for results.
>
> There's nothing dangerous about it. If the problem is corrupt media,
> then you're right: it's highly unlikely that it's only off by one bit.
> But for a bit flip in the drive's cache, or during the DMA, or on some
> bundle of wire that you'd just assume has parity but often doesn't,
> checksum-directed single-bit correction is perfectly reasonable
> (assuming the checksum is strong enough).

We often do such ECC on the hardware side of the industry. To wit, many disk drive manufacturers use an ECC which can correct up to 5 bytes of data per sector. This is largely hidden from view, so you might not recognize that it occurs. What you do see is better drive robustness, which is a good thing.

[I can't speak for the ZFS development team, but I can dream :-)]

From a relative FIT-rate perspective, we do expect that disk drive media-related faults occur more often than other faults in the system. So it is not unreasonable for the system design to include multiple levels of ECC around each of the fault isolation zones. Indeed, this has been the basis on which we've built the industry to date. With ZFS's end-to-end checksumming, we move to the next level of robustness.

Given ZFS's architecture, it should be fairly simple to add any arbitrary ECC code into the mix to enhance robustness. Think of it like the compression or encryption options; we could have an enhanced ECC option.

-- richard
> > I think you're getting into dangerous territory if you believe that
> > ZFS block-level checksums could be used for error correction, not
> > just error detection. [...] In either case, trying to correct the
> > data is unlikely to yield the hoped-for results.
>
> There's nothing dangerous about it. If the problem is corrupt media,
> then you're right: it's highly unlikely that it's only off by one bit.
> But for a bit flip in the drive's cache, or during the DMA, or on some
> bundle of wire that you'd just assume has parity but often doesn't,
> checksum-directed single-bit correction is perfectly reasonable
> (assuming the checksum is strong enough).

That might be fine if you could distinguish corrupt-media failures from other failures, but in real life these failure modes will be conflated.

I've seen both extremes of storage failure. While cutting some copies of the Solaris 9 CDs, I verified each disc afterward. One disc had precisely one bad bit in 600 MB. On another disc, it was as though someone had, in a number of areas, intermittently scratched out bit 6 of a lot of bytes. In both cases the data was "readable" without I/O error, so I assume the failure happened on the way to the disc and the CD's own ECC codes encoded the bad data.

If this had happened on a ZFS filesystem, the one-bad-bit failure would have been correctable, and that's fine. The other failure I assume was due to some kind of cable, connector, or circuit failure in a byte-wide channel somewhere in the hardware. What scares me about trying to correct a failure like this is that it is way beyond a single bit flip. Your comment about assuming the checksum is strong enough definitely applies here. I would want to know the probability that the restoration could yield bad data because the damage exceeded the detection capabilities of the algorithm.

Question: would the checksum used for correction be the only checksum on the block (replacing, say, the fletcher2 checksum), or would it be a separate checksum in addition to the checksum currently implemented? A separate checksum would relieve my worries, because you would want to compare the repaired data against the original checksum, and that should provide reasonable reliability.