Inspired by the paper "End-to-end Data Integrity for File Systems: A ZFS
Case Study" [1], I've been wondering whether it is possible to devise a way
in which a minimal in-memory data corruption would cause massive data loss.
I could imagine a scenario where an entire directory branch drops off the
tree structure, for example. Since I know too little about ZFS's structure,
I'm also asking myself whether it is possible to make old snapshots
disappear via memory corruption, or to lose data blocks to leakage (blocks
not containing data, but not marked as available either).

I'd appreciate it if someone with a good understanding of ZFS's internals
and principles could comment on the possibility of such scenarios.

[1] http://www.usenix.org/event/fast10/tech/full_papers/zhang.pdf
2012-01-14 18:36, Stefan Ring wrote:
> Inspired by the paper "End-to-end Data Integrity for File Systems: A
> ZFS Case Study" [1], I've been thinking if it is possible to devise a way
> in which a minimal in-memory data corruption would cause massive data
> loss. I could imagine a scenario where an entire directory branch
> drops off the tree structure, for example. Since I know too little
> about ZFS's structure, I'm also asking myself if it is possible to
> make old snapshots disappear via memory corruption or lose data blocks
> to leakage (not containing data, but not marked as available).
>
> I'd appreciate it if someone with a good understanding of ZFS's
> internals and principles could comment on the possibility of such
> scenarios.
>
> [1] http://www.usenix.org/event/fast10/tech/full_papers/zhang.pdf

By no means am I an expert like the ones you seek, but I'm asking similar
questions, and more keep popping up ;) I do have some reported corruptions
on my non-ECC system despite raidz2 on disk, so I have a keen interest in
how stuff works and why it sometimes doesn't ;)

As for block leakage, according to the error messages I'm seeing now,
leaked blocks are at least expected and checked for: "allocating allocated
segment" and "freeing free segment". How my system got here -- that's the
puzzle...

It does seem possible that in-memory corruption of the data payload and/or
checksum of a block before writing it to disk would render it invalid on
read (data doesn't match checksum, ZFS returns EIO). Maybe even worse: if
the in-memory block is corrupted before the checksumming, seemingly valid
garbage gets stored on disk, read afterwards, and used with blind trust.
If it is a leaf block (userdata), you just get a corrupted file. If it is
a metadata block, and the corruption happened before it was ditto-written
to several disk locations, you're in trouble.
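The two corruption windows described above -- a flip after checksumming
(detected, EIO) versus a flip before checksumming ("valid garbage") -- can
be sketched in a few lines. This is a toy model only: SHA-256 stands in
for ZFS's actual fletcher/SHA-256 block checksums, and the buffer handling
is nothing like the real ARC write path.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # Stand-in for ZFS's per-block checksum (fletcher4/sha256 in reality)
    return hashlib.sha256(data).digest()

# A data block about to be written out
block = bytearray(b"important user data" * 8)

# Window 1: bit flips AFTER the checksum is computed -- detected on read
good_sum = checksum(bytes(block))
flipped = bytearray(block)
flipped[5] ^= 0x01                           # single in-memory bit flip
assert checksum(bytes(flipped)) != good_sum  # verification fails -> EIO

# Window 2: bit flips BEFORE the checksum is computed -- "valid garbage"
corrupted = bytearray(block)
corrupted[5] ^= 0x01
bad_sum = checksum(bytes(corrupted))         # checksum of corrupt data
# On later reads the corrupt block verifies cleanly and is blindly trusted
assert checksum(bytes(corrupted)) == bad_sum
```

In window 2 the checksum is self-consistent with the garbage, so no amount
of on-disk redundancy can flag it afterwards.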
It is likewise possible that data in RAM gets corrupted after reading from
disk and checksum-checking, but before being used as a metadata block or
whatever. If you're "lucky" enough to have irreparable (by ditto blocks)
corruption in a metadata block near the root of a tree, you can end up in
bad trouble.

In all these cases RAM is the SPOF (single point of failure), so all ZFS
recommendations involve using ECC systems. Alas, even though ECC chips and
chipsets are cheap nowadays, not all architectures use them (i.e.
desktops, laptops, etc.), and the tagline of running ZFS for "reliable
storage on consumer-grade hardware" is poisoned by this fact. Other
filesystems obviously suffer the same from bad components, but ZFS reports
the errors it detects, and unlike other systems that let you dismiss the
errors (i.e. free all blocks and files touched by a corrupt block, leaving
you with a smaller but consistent tree of data blocks), or don't even
notice them, ZFS tends to get really upset about many of them and ask for
recovery from backups (as if those are 100% reliable).

I do wonder, however, whether it is possible to make a software ECC to
detect and/or repair small memory corruptions on consumer-grade systems.
And where would such a part fit -- in ZFS (i.e. some ECC bits appended to
every zfs_*_t structure) or in the {Solaris} kernel for general VM
management? And even then there's the question whether this would solve
more problems than it creates: it could give the appearance of a solution
while hiding problems that still exist (because there would be some
non-ECC parts of the data path, and the GIGO principle can apply at any
point). In the bad case, you ECC an invalid piece of memory and afterwards
trust it because it matches the checksum. On the good side, there is a
smaller window in which data is exposed unprotected, so statistically this
solution should help.

HTH,
//Jim Klimov
On Sun, 15 Jan 2012, Jim Klimov wrote:
> It does seem possible that in-memory corruption of data payload
> and/or checksum of a block before writing it to disk would render
> it invalid on read (data doesn't match checksum, ZFS returns EIO).
> Maybe even worse if the in-memory block is corrupted before the
> checksumming, and seemingly valid garbage gets stored on disk,
> read afterwards, and used with blind trust.

Please don't under-state the actual issue. ZFS assumes that RAM is 100%
reliable. ZFS uses an in-memory cache called the ARC, which can span many
tens of gigabytes on busy large-memory systems. User data is stored in
this ARC, and the cached data becomes the reference copy of the data until
it is evicted. This means that user data can be silently and undetectably
corrupted by memory corruption. The effects that ZFS's checksums can
detect are just a small subset of the problems which may occur if memory
returns wrong values.

> In all these cases RAM is the SPOF (single point of failure)
> so all ZFS recommendations involve using ECC systems. Alas,
> even though ECC chips and chipsets are cheap nowadays, not all
> architectures use them anyway (i.e. desktops, laptops, etc.),
> and the tagline of running ZFS for "reliable storage on consumer
> grade hardware" is poisoned by this fact.

Feel free to blame Intel for this, since they seem to be primarily
responsible for delivering CPUs and chipsets which don't support ECC. AMD
has not been such a perpetrator, although it is possible to buy AMD-based
systems which don't provide ECC.

> I do wonder, however, if it is possible to make a software ECC
> to detect-and/or-repair small memory corruptions on consumer
> grade systems.

This could be done for part of the memory, but it would obviously result
in a huge performance loss. I/O to memory would have to become
block-oriented rather than random access.
It is still necessary for random access to be used in a large part of the
memory, since that is a requirement for running programs, and there would
be no way to defend that part of the memory.

> some ECC bits appended in every zfs_*_t structure) or in the
> {Solaris} kernel for general VM management. And even then
> there's a question whether this would solve more problems than
> create a greater one. In the bad case, you ECC an invalid
> piece of memory, and afterwards trust it as it matches the
> checksum. On the good side, there is a smaller window that
> data is exposed unprotected, so statistically this solution
> should help.

The problem is that with unreliable memory, the software-based ECC would
not be able to correct the content of the memory, since the ECC itself
might have been computed incorrectly (due to unreliable memory). You are
then faced with notifications of problems that the user can't fix. The
proper solution (regardless of filesystem used) is to ensure that ECC is
included in any computer that you buy.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
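For concreteness, the kind of software ECC being discussed would look
something like the classic Hamming(7,4) code below. This is a toy sketch,
not anything ZFS or Solaris actually implements: it corrects any
single-bit flip in a 7-bit codeword, but a double flip -- or a flip while
the parity computation itself runs through bad RAM -- silently "corrects"
to the wrong value, which is exactly the failure mode described above.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # parity over codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]   # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]   # parity over positions 4,5,6,7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(code: int) -> int:
    """Correct up to one flipped bit, then return the 4 data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)  # 1-based error position, 0 = clean
    if syndrome:
        bits[syndrome - 1] ^= 1            # repair the single flipped bit
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

# Every single-bit flip of every codeword is corrected...
for nibble in range(16):
    cw = hamming74_encode(nibble)
    for pos in range(7):
        assert hamming74_decode(cw ^ (1 << pos)) == nibble

# ...but a double flip decodes silently to WRONG data (no detection):
assert hamming74_decode(hamming74_encode(0b1011) ^ 0b11) != 0b1011
```

Real ECC DIMMs add an eighth parity bit per Hamming word (SECDED) so double
flips are at least detected; a pure software scheme would also pay the
block-oriented access cost Bob describes on every load and store.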
On Sun, 2012-01-15 at 16:28 +0400, Jim Klimov wrote:
> 2012-01-14 18:36, Stefan Ring wrote:
> > Inspired by the paper "End-to-end Data Integrity for File Systems: A
> > ZFS Case Study" [1], I've been thinking if it is possible to devise a
> > way in which a minimal in-memory data corruption would cause massive
> > data loss. I could imagine a scenario where an entire directory branch
> > drops off the tree structure, for example. Since I know too little
> > about ZFS's structure, I'm also asking myself if it is possible to
> > make old snapshots disappear via memory corruption or lose data blocks
> > to leakage (not containing data, but not marked as available).

I've never understood why these conclusions are considered so
interesting -- it's as though ZFS were analyzed as a system, but the
conclusions weren't drawn systematically. If you don't protect buffer
integrity elsewhere on the system, what would it be worth for ZFS to
provide in-core integrity for its kernel pages? The vast preponderance of
consumers of ZFS data have to use buffers outside of the ZFS kernel
subsystem, leaving you with only a trivial added assurance from protecting
against in-core corruption. Compare the effort of doing that to the cost
of using ECC, and there doesn't seem to be anything like a compelling case
for putting all that work into ZFS or accepting the overhead that would
result. Put into a more reasonable context, there may still be something
there, but it looks very different from how the authors seemed to pitch
it. Or have I missed something?

> Alas, even though ECC chips and chipsets are cheap nowadays, not all
> architectures use them anyway (i.e. desktops, laptops, etc.),
> and the tagline of running ZFS for "reliable storage on consumer
> grade hardware" is poisoned by this fact.

Yes, you can get reliable and probably performant ZFS storage without
having to buy enterprise-class components.
But you still have to treat midrange or consumer components as
differentiated on reliability and performance if you want to achieve those
things meaningfully. ZFS is good, but it's not magic.
On Jan 14, 2012, at 6:36 AM, Stefan Ring wrote:
> Inspired by the paper "End-to-end Data Integrity for File Systems: A
> ZFS Case Study" [1], I've been thinking if it is possible to devise a
> way in which a minimal in-memory data corruption would cause massive
> data loss.

For enterprise-class systems, you will find hardware protection such as
ECC and other mechanisms all the way up and down the datapath. For
example, if you build an ALU, you can add a few transistors to also detect
the various failure modes that afflict data flowing through an ALU. This
is one of the things that differentiates a mainframe or SPARC64 processor
from a run-of-the-mill PeeCee processor.

> I could imagine a scenario where an entire directory branch
> drops off the tree structure, for example. Since I know too little
> about ZFS's structure, I'm also asking myself if it is possible to
> make old snapshots disappear via memory corruption or lose data blocks
> to leakage (not containing data, but not marked as available).

Sure. If you'd like a fright, read the errata sheet for a modern
microprocessor :-)

> I'd appreciate it if someone with a good understanding of ZFS's
> internals and principles could comment on the possibility of such
> scenarios.

ZFS does expect that the processor, memory, and I/O systems work to some
degree. The only way to get beyond this sort of dependency is to implement
a system like we do for avionics.

> [1] http://www.usenix.org/event/fast10/tech/full_papers/zhang.pdf

Yes. NetApp has funded those researchers in the past. Looks like a FUD
piece to me. Look out, everyone, the memory system you bought from Intel
might suck!
-- richard
On Mon, January 16, 2012 01:19, Richard Elling wrote:
>> [1] http://www.usenix.org/event/fast10/tech/full_papers/zhang.pdf
>
> Yes. NetApp has funded those researchers in the past. Looks like a FUD
> piece to me. Look out everyone, the memory system you bought from Intel
> might suck!

From the paper:

> This material is based upon work supported by the National Science
> Foundation under the following grants: CCF-0621487, CNS-0509474,
> CNS-0834392, CCF-0811697, CCF-0811697, CCF-0937959, as well as by
> generous donations from NetApp, Inc, Sun Microsystems, and Google.

So Sun paid to FUD themselves? The conclusions are hardly unreasonable:

> While the reliability mechanisms in ZFS are able to provide reasonable
> robustness against disk corruptions, memory corruptions still remain a
> serious problem to data integrity.

I've heard the same thing said ("use ECC!") on this list many times over
the years.
On 01/16/12 11:08, David Magda wrote:
> The conclusions are hardly unreasonable:
>
>> While the reliability mechanisms in ZFS are able to provide reasonable
>> robustness against disk corruptions, memory corruptions still remain a
>> serious problem to data integrity.
>
> I've heard the same thing said ("use ECC!") on this list many times over
> the years.

I believe the whole paragraph quoted from the USENIX paper is important:

    While the reliability mechanisms in ZFS are able to provide
    reasonable robustness against disk corruptions, memory corruptions
    still remain a serious problem to data integrity. Our results for
    memory corruptions indicate cases where bad data is returned to the
    user, operations silently fail, and the whole system crashes. Our
    probability analysis shows that one single bit flip has small but
    non-negligible chances to cause failures such as reading/writing
    corrupt data and system crashing.

The authors provide probability calculations in section 6.3 for single
bit flips. ECC provides detection and correction of single-bit flips.
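The paper's own numbers live in its section 6.3, but the generic
back-of-envelope arithmetic behind "small but non-negligible" is easy to
show. The FIT rate below is purely hypothetical (published DRAM soft-error
estimates vary by orders of magnitude); the point is the shape of the
calculation, not the figure.

```python
# Hypothetical soft-error rate; NOT a figure from the paper.
fit_per_mbit = 1000            # failures per 10^9 device-hours, per Mbit
mem_gib = 16                   # example machine
hours = 24 * 30                # one month of uptime

mbits = mem_gib * 1024 * 8     # GiB -> MiB -> Mbit
expected_flips = fit_per_mbit * mbits / 1e9 * hours
print(f"expected bit flips per month: {expected_flips:.1f}")
# -> about 94 flips/month under these assumed numbers
```

Even if the real rate is a hundred times lower, a long-lived pool on a
large-memory non-ECC box accumulates a meaningful number of flips.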
On Jan 16, 2012, at 8:08 AM, David Magda wrote:
> On Mon, January 16, 2012 01:19, Richard Elling wrote:
>>> [1] http://www.usenix.org/event/fast10/tech/full_papers/zhang.pdf
>>
>> Yes. NetApp has funded those researchers in the past. Looks like a FUD
>> piece to me. Look out everyone, the memory system you bought from Intel
>> might suck!
>
> From the paper:
>
>> This material is based upon work supported by the National Science
>> Foundation under the following grants: CCF-0621487, CNS-0509474,
>> CNS-0834392, CCF-0811697, CCF-0811697, CCF-0937959, as well as by
>> generous donations from NetApp, Inc, Sun Microsystems, and Google.
>
> So Sun paid to FUD themselves?

Wouldn't be the first time...

> The conclusions are hardly unreasonable:
>
>> While the reliability mechanisms in ZFS are able to provide reasonable
>> robustness against disk corruptions, memory corruptions still remain a
>> serious problem to data integrity.
>
> I've heard the same thing said ("use ECC!") on this list many times over
> the years.

Agree with the ECC comment :-) If we can classify this as encouragement to
use ECC, then you don't need to drag ZFS into the conversation.
Interestingly, the only market that doesn't use ECC is the PeeCee market.
Embedded and enterprise markets use ECC.
-- richard
On Tue, 17 Jan 2012, Richard Elling wrote:
> Agree with the ECC comment :-)
>
> If we can classify this as encouragement to use ECC, then you don't need
> to drag ZFS into the conversation. Interestingly, the only market that
> doesn't use ECC is the PeeCee market. Embedded and enterprise markets
> use ECC.

The issue is definitely not specific to ZFS. For example, the whole OS
depends on reliable memory content in order to function. Likewise, no one
likes it if characters mysteriously change in their word-processing
documents.

Most of the blame seems to focus on Intel, with its objective to spew CPUs
with the highest-clocking performance at the lowest possible price point
for the desktop market. AMD CPUs seem to usually be slower but include ECC
as standard in the CPU or AMD-supplied chipset. If it can be believed (and
even if some may doubt it), Intel sells Xeon-branded CPUs which lack ECC
support.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> The issue is definitely not specific to ZFS. For example, the whole OS
> depends on reliable memory content in order to function. Likewise, no
> one likes it if characters mysteriously change in their word processing
> documents.

I don't care too much if a single document gets corrupted -- there'll
always be a good copy in a snapshot. I do care, however, if a whole
directory branch or old snapshots were to disappear.

> Most of the blame seems to focus on Intel, with its objective to spew
> CPUs with the highest-clocking performance at the lowest possible price
> point for the desktop market. AMD CPUs seem to usually be slower but
> include ECC as standard in the CPU or AMD-supplied chipset.

Agreed. I originally bought an AMD-based system for that reason alone,
with the intention of running OpenSolaris on it. Alas, it performed
abysmally, so it was quickly swapped for an Intel-based one (without ECC).
Additionally, consider that Joyent's port of KVM supports only Intel
systems, AFAIK.
On Tue, 17 Jan 2012, Stefan Ring wrote:
> Additionally, consider that Joyent's port of KVM supports only Intel
> systems, AFAIK.

Hopefully that will be a short-term issue. 64-core AMD Opteron systems
are affordable now.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
2012-01-18 1:20, Stefan Ring wrote:
>> The issue is definitely not specific to ZFS. For example, the whole OS
>> depends on reliable memory content in order to function. Likewise, no
>> one likes it if characters mysteriously change in their word processing
>> documents.
>
> I don't care too much if a single document gets corrupted -- there'll
> always be a good copy in a snapshot. I do care however if a whole
> directory branch or old snapshots were to disappear.

Well, as far as this problem "relies" on random memory corruptions, you
don't get to choose whether your document gets broken or some low-level
part of the metadata tree ;)

Besides, what if that document you don't care about is your account's
entry in a banking system (as if they had no other redundancy and
double-checks)? And suddenly you "don't exist" because of some EIOIO, or
your balance is zeroed (or worse, highly negative)? ;)

//Jim
On Wed, Jan 18, 2012 at 4:53 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-01-18 1:20, Stefan Ring wrote:
>> I don't care too much if a single document gets corrupted -- there'll
>> always be a good copy in a snapshot. I do care however if a whole
>> directory branch or old snapshots were to disappear.
>
> Well, as far as this problem "relies" on random memory corruptions,
> you don't get to choose whether your document gets broken or some
> low-level part of metadata tree ;)

Other filesystems tend to be much more tolerant of bit rot of all types
precisely because they have no block checksums.

But I'd rather have ZFS -- *with* redundancy, of course, and with ECC.

It might be useful to have a way to recover from checksum mismatches by
involving a human. I'm imagining a tool that tests whether accepting a
block's actual contents results in making data available that the human
thinks checks out, and if so, rewriting that block. Some bit errors might
simply result in meaningless metadata, but in some cases this can be
corrected (e.g., ridiculous block addresses). But if ECC takes care of the
problem, then why waste the effort? (Partial answer: because it'd be a
very neat GSoC-type project!)

> Besides, what if that document you don't care about is your account's
> entry in a banking system (as if they had no other redundancy and
> double-checks)? And suddenly you "don't exist" because of some EIOIO,
> or your balance is zeroed (or worse, highly negative)? ;)

This is why we have paper trails, logs, backups, redundancy at various
levels, ...

Nico
--
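The mechanical half of such a recovery tool is easy to sketch. Assuming
the stored checksum itself is intact, an exhaustive search over single-bit
flips of the failing block can propose candidate repairs for the human to
accept or reject. This is a toy sketch only: SHA-256 stands in for the
real block checksum, and nothing here knows about actual ZFS on-disk
structures.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # Stand-in for the block checksum stored in the parent block pointer
    return hashlib.sha256(data).digest()

def candidate_single_bit_repairs(block: bytes, expected: bytes):
    """Yield (byte_index, bit, repaired_block) for every single-bit flip
    of `block` whose checksum matches `expected` -- candidates for a
    human to review, per the tool idea above."""
    for byte_idx in range(len(block)):
        for bit in range(8):
            fixed = bytearray(block)
            fixed[byte_idx] ^= 1 << bit
            if checksum(bytes(fixed)) == expected:
                yield byte_idx, bit, bytes(fixed)

# Demo: corrupt one bit of a block, then recover it from the checksum
original = b"zfs metadata block" * 4
expected = checksum(original)
damaged = bytearray(original)
damaged[7] ^= 0x10                       # simulate a single bit flip
repairs = list(candidate_single_bit_repairs(bytes(damaged), expected))
assert repairs and repairs[0][2] == original
```

The search is O(8n) checksum computations per block, so it is cheap for
single flips; multi-bit damage would need the human-judgment half of the
tool (or ditto copies) instead.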
2012-01-18 20:36, Nico Williams wrote:
> On Wed, Jan 18, 2012 at 4:53 AM, Jim Klimov <jimklimov at cos.ru> wrote:
>> Well, as far as this problem "relies" on random memory corruptions,
>> you don't get to choose whether your document gets broken or some
>> low-level part of metadata tree ;)
>
> Other filesystems tend to be much more tolerant of bit rot of all
> types precisely because they have no block checksums.
>
> But I'd rather have ZFS -- *with* redundancy, of course, and with ECC.
>
> It might be useful to have a way to recover from checksum mismatches
> by involving a human. I'm imagining a tool that tests whether
> accepting a block's actual contents results in making data available
> that the human thinks checks out, and if so, then rewriting that
> block. Some bit errors might simply result in meaningless metadata,
> but in some cases this can be corrected (e.g., ridiculous block
> addresses). But if ECC takes care of the problem then why waste the
> effort?

Because RAM ECC only decreases the probability of one type of corruption?
You still have CPUs (i.e. overclocked and overheated ones, as is likely in
enthusiast systems or in laptops with blocked vents, sometimes generating
random garbage). Many other parts need not be a SPOF in a good design:
noise on the wire, or bugs in HBA and HDD firmware, can be mitigated by
hardware redundancy (multipathing, mixed vendors) in higher-end systems,
and by ZFS's own approaches in other systems, such as ditto copies for
metadata and vdev redundancy; but these can still corrupt copies=1 data
(i.e. on single-disk laptops without explicit copies=2).

> (Partial answer: because it'd be a very neat GSoC type project!)
Good point for at least one motivator ;) "I don't care how it is done --
but it should be! This time you may even use sorcery, I'll not ask
questions!" ;)

>> Besides, what if that document you don't care about is your account's
>> entry in a banking system (as if they had no other redundancy and
>> double-checks)? And suddenly you "don't exist" because of some EIOIO,
>> or your balance is zeroed (or worse, highly negative)? ;)
>
> This is why we have paper trails, logs, backups, redundancy at various
> levels, ...

As if any of them is 100% good and reliable and readily
accessible-available ;)

//Jim
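As an aside on the copies=2 setting mentioned earlier in the thread: on a
single-disk machine you can ask ZFS to keep extra ditto copies of user
data too, trading space for redundancy. The dataset name below is just an
example.

```shell
# Keep two copies of every user-data block on this dataset
# (metadata already gets ditto copies by default).
# "rpool/export/home" is a hypothetical dataset name.
zfs set copies=2 rpool/export/home

# Verify the property took effect:
zfs get copies rpool/export/home
```

Note that the property applies only to blocks written after it is set, and
it cannot help against corruption that happens in RAM before the copies
are written out.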