Peter Braam
2010-Jul-02 18:53 UTC
[Lustre-discuss] Integrity and corruption - can file systems be scalable?
I wrote a blog post that pertains to Lustre scalability and data
integrity. You can find it here:

http://braamstorage.blogspot.com

Regards,

Peter
Dmitry Zogin
2010-Jul-02 20:52 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Hello Peter,

These are really good questions posted there, but I don't think they are
Lustre-specific. These issues are common to any file system. Some of the
mature file systems, like Veritas, have already addressed this by:

1. Integrating the volume management and the file system. The file
system can be spread across many volumes.

2. Dividing the file system into a group of filesets (like data,
metadata, checkpoints), and allowing policies to keep different filesets
on different volumes.

3. Creating checkpoints (they are sort of like volume snapshots, but
they are created inside the file system itself). The checkpoints are
simply copy-on-write filesets created instantly inside the fs itself.
Using copy-on-write techniques saves physical space and makes fileset
creation instantaneous. They allow reverting to a certain point
instantaneously, as the modified blocks are kept aside, and the only
thing that has to be done is to point back to the old blocks of
information.

4. Parallel fsck - if the file system consists of allocation units (a
sort of sub-file system, or cylinder group), then fsck can be started in
parallel on those units.

Well, ZFS does solve many of these issues too, but in a different way.
So, my point is that this probably has to be solved on the backend side
of Lustre, rather than inside Lustre.

Best regards,

Dmitry

Peter Braam wrote:
> I wrote a blog post that pertains to Lustre scalability and data
> integrity. You can find it here:
>
> http://braamstorage.blogspot.com
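As a concrete illustration of the copy-on-write checkpoint in point 3,
here is a minimal Python sketch (not VxFS or Lustre code; the class and
names are invented): a checkpoint copies only the block-pointer map, and
reverting just points back at the old blocks.

# Illustrative sketch only: a checkpoint is a frozen copy of the
# block-pointer map; data blocks are never copied or moved.
class CowFileset:
    def __init__(self, blocks):
        self.block_map = dict(enumerate(blocks))   # logical block -> data
        self.checkpoints = {}

    def checkpoint(self, name):
        # Instantaneous: only the pointer map is copied, not the data.
        self.checkpoints[name] = dict(self.block_map)

    def write(self, lbn, data):
        # COW: the old block stays in place, still referenced by any
        # checkpoint; the new contents simply occupy a new block.
        self.block_map[lbn] = data

    def revert(self, name):
        # Reverting only points back at the old blocks.
        self.block_map = dict(self.checkpoints[name])

fs = CowFileset(["A0", "B0", "C0"])
fs.checkpoint("before-change")
fs.write(1, "B1")                  # modified block kept aside
fs.revert("before-change")
assert fs.block_map[1] == "B0"     # old data was never copied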
Peter Braam
2010-Jul-02 20:59 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Dmitry,

The point of the note is the opposite of what you write, namely that
backend systems in fact do not solve this, unless they are guaranteed to
be bug free.

Peter

On Fri, Jul 2, 2010 at 2:52 PM, Dmitry Zogin <dmitry.zoguine at oracle.com> wrote:
> So, my point is that this probably has to be solved on the backend
> side of Lustre, rather than inside Lustre.
Nicolas Williams
2010-Jul-02 21:09 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 02, 2010 at 02:59:00PM -0600, Peter Braam wrote:
> The point of the note is the opposite of what you write, namely that
> backend systems in fact do not solve this, unless they are guaranteed
> to be bug free.

Fsck tools can also be buggy. Consider them redundant code run
asynchronously.

Is it possible to fsck petabytes in reasonable time? Not if storage
capacity grows faster than storage bandwidth. The obvious alternatives
are: test, test, test, and/or run redundant fsck-like code
synchronously. The latter could be done by reading just-written
transactions to check that the filesystem is consistent.

Nico
--
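One way to picture the synchronous, redundant checking suggested here (a
Python sketch only, with invented names; not Lustre or ZFS code): the
transaction is read back from storage immediately after it is written
and re-verified by independently written checking code before the write
is acknowledged.

import hashlib, json

def check_transaction(txn):
    # Redundant, independently written consistency check of one record.
    assert txn["type"] in ("create", "write", "unlink")
    assert all(b >= 0 for b in txn["blocks"])
    digest = hashlib.sha256(json.dumps(txn["blocks"]).encode()).hexdigest()
    assert txn["checksum"] == digest

def commit(disk, txid, txn):
    txn["checksum"] = hashlib.sha256(
        json.dumps(txn["blocks"]).encode()).hexdigest()
    disk[txid] = json.dumps(txn)        # write the transaction
    readback = json.loads(disk[txid])   # read back what was just written
    check_transaction(readback)         # verify synchronously, before ack
    return txid

disk = {}
commit(disk, 1, {"type": "write", "blocks": [10, 11, 12]})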
Dmitry Zogin
2010-Jul-02 21:18 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Peter,

That is right - some of them do not. My point was that the Veritas fs
already has many of these things implemented, like parallel fsck,
copy-on-write checkpoints, etc. If it was used as a backend for Lustre,
that would be the perfect match. ZFS has some of its features, but not
all.

But, let's say, adding things like that into Lustre itself will make it
even more complex, and it is already very complex. Certainly, things
like checkpoints can be added at the MDT level - consider an inode on
the MDT pointing to another MDT inode, instead of to the OST objects -
that would be a clone. If the file is modified, the MDT inode then
points to an OST object which keeps only the changed file blocks. This
would be a sort of checkpoint allowing the file to be reverted. Well,
while this is known to help restore data in the case of human error or
an application bug, it won't help protect against HW-induced errors.

But the parallel fsck issue stands somewhat alone - if we want fsck to
be faster, we had better make it parallel at every OST level - that's
why I think this has to be done on the backend side.

Dmitry

Peter Braam wrote:
> The point of the note is the opposite of what you write, namely that
> backend systems in fact do not solve this, unless they are guaranteed
> to be bug free.
Peter Braam
2010-Jul-02 21:39 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 2, 2010 at 3:18 PM, Dmitry Zogin <dmitry.zoguine at oracle.com> wrote:
> That is right - some of them do not. My point was that the Veritas fs
> already has many of these things implemented, like parallel fsck,
> copy-on-write checkpoints, etc. If it was used as a backend for
> Lustre, that would be the perfect match. ZFS has some of its features,
> but not all.

Parallel fsck doesn't help once you are down to one disk (as pointed out
in the post).

The post also mentions copy-on-write checkpoints, and their usefulness
has not been proven. There has been no study about this, and certainly
in many cases they are implemented in such a way that bugs in the
software can corrupt them. For example, most volume-level copy-on-write
schemes actually copy the old data instead of leaving it in place, which
is a vulnerability. Shadow copies are vulnerable to software bugs;
things would get better if there was something similar to page
protection for disk blocks.

> Certainly, things like checkpoints can be added at the MDT level -
> consider an inode on the MDT pointing to another MDT inode, instead of
> to the OST objects - that would be a clone. If the file is modified,
> the MDT inode then points to an OST object which keeps only the
> changed file blocks. This would be a sort of checkpoint allowing the
> file to be reverted. Well, while this is known to help restore data in
> the case of human error or an application bug, it won't help protect
> against HW-induced errors.

Again, pointing to other objects is subject to possible software bugs.

I wrote this post because I'm unconvinced by the barrage of by-now
endlessly repeated ideas like checkpoints, checksums, etc., and by the
falsehood of the claim that advanced file systems address these issues -
they only address some, and leave critical vulnerabilities. Nicolas'
post is more along the lines that I think will lead to a solution.

Peter

> But the parallel fsck issue stands somewhat alone - if we want fsck to
> be faster, we had better make it parallel at every OST level - that's
> why I think this has to be done on the backend side.
>
> Dmitry
Nicolas Williams
2010-Jul-02 22:21 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 02, 2010 at 03:39:42PM -0600, Peter Braam wrote:
> The post also mentions copy-on-write checkpoints, and their usefulness
> has not been proven. There has been no study about this, and certainly
> in many cases they are implemented in such a way that bugs in the
> software can corrupt them. For example, most volume-level
> copy-on-write schemes actually copy the old data instead of leaving it
> in place, which is a vulnerability. Shadow copies are vulnerable to
> software bugs; things would get better if there was something similar
> to page protection for disk blocks.

Well-delineated transactions are certainly useful. The reason: you can
fsck each transaction discretely and incrementally. That means that you
know exactly how much work must be done to fsck a priori. Sure, you
still have to be confident that N correct transactions == correct
filesystem, but that's much easier to be confident of than software
correctness. (It'd be interesting to apply theorem provers to theorems
related to on-disk data formats!)

Another problem, incidentally, is software correctness on the read side.
It's nice to know that no bugs on the write side will corrupt your
filesystem, but read-side bugs that cause your data to be unavailable
are not good either. The distinction between bugs on the write vs. read
sides is subtle: recovery from the latter is just a patch away, while
recovery from the former might require long fscks, or even more manual
intervention (e.g., writing a better fsck).

> I wrote this post because I'm unconvinced by the barrage of by-now
> endlessly repeated ideas like checkpoints, checksums, etc., and by the
> falsehood of the claim that advanced file systems address these issues
> - they only address some, and leave critical vulnerabilities.

I do believe COW transactions + Merkle hash trees are _the_ key aspect
of the solution. Because only by making fscks incremental and discrete
can we get a handle on the amount of time that must be spent waiting for
fscks to complete. Without incremental fscks there'd be no hope as
storage capacity outstrips storage and compute bandwidth.

If you believe that COW, transactional, Merkle trees are an
anti-solution, or if you believe that they are only a tiny part of the
solution, please argue that view. Otherwise I think your use of
"barrage" here is a bit over the top (nay, a lot over the top). It's one
thing to be missing a part of the solution, and it's another to be on
the wrong track, or missing the largest part of the solution.
Extraordinary claims and all that...

(And no, manually partitioning storage into discrete "filesystems",
"filesets", "datasets", whatever, is not a solution; at most it's a
bandaid.)

Nico
--
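To make the "known amount of work a priori" point concrete, here is a
Python sketch (illustrative only, not any real fsck; the structures are
invented): checking a transaction touches only the blocks that
transaction modified, so the cost is proportional to the transaction
size, not to the size of the filesystem.

def check_incremental(live_blocks, txn):
    # Work is proportional to len(txn["writes"]), known before we start,
    # no matter how large live_blocks (i.e., the whole filesystem) is.
    for blk, data in txn["writes"]:
        assert blk not in live_blocks, "transaction overwrote a live block"
        assert data is not None
    # Only after the bounded check passes does the transaction go live.
    live_blocks.update(blk for blk, _ in txn["writes"])

live = {0, 1, 2}                                              # already in use
check_incremental(live, {"writes": [(7, "new"), (8, "new")]})   # passes
try:
    check_incremental(live, {"writes": [(1, "oops")]})          # caught
except AssertionError as err:
    print("incremental fsck caught:", err)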
Nicolas Williams
2010-Jul-02 22:35 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
I explained why well-delineated transactions help, but didn't really
explain why COW and Merkle hash trees help.

COW helps ensure that correct transactions cannot result in incorrect
filesystems -- fsck need only ensure that a transaction hasn't
overwritten live blocks to guarantee that one can at least roll back to
that transaction. Merkle hash trees help detect (and recover from) bit
rot and hardware errors, which in turn helps ensure that those
incremental fscks are dealing with correct meta-data (correct fsck code
+ bad meta-data == bad fsck).

It's much harder to ensure that there are no errors in parts of the
system that are exposed due to lack of special protection features (such
as ECC memory), in system buses and CPUs, that might be difficult or
impossible to protect against in software. One option is to run the
fscks on different hosts than the ones doing the writing (this means
multi-pathing though, which complicates the overall system, but at least
we currently depend on multi-pathing anyways). But even that won't
protect against such unprotectable errors in _data_ (originating in
faraway clients, say).

Nico
--
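A minimal Python sketch of the Merkle-tree detection described above
(illustrative only; a flat two-level tree rather than a real on-disk
layout): every parent records the hash of its children, so a flipped bit
anywhere below changes the recomputed root and is detected.

import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build(leaves):
    # Parent = hash over the children's hashes (two levels for brevity).
    leaf_hashes = [h(b) for b in leaves]
    root = h("".join(leaf_hashes).encode())
    return leaf_hashes, root

def verify(leaves, leaf_hashes, root):
    # Recompute bottom-up; any flipped bit changes a leaf hash and
    # therefore the root, so the corruption cannot go unnoticed.
    if [h(b) for b in leaves] != leaf_hashes:
        return False
    return h("".join(leaf_hashes).encode()) == root

blocks = [b"data0", b"data1", b"data2"]
hashes, root = build(blocks)
assert verify(blocks, hashes, root)
blocks[1] = b"dat\x00a1"                  # simulated bit rot
assert not verify(blocks, hashes, root)   # detected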
Dmitry Zogin
2010-Jul-03 03:37 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Nicolas Williams wrote:
> I do believe COW transactions + Merkle hash trees are _the_ key aspect
> of the solution. Because only by making fscks incremental and discrete
> can we get a handle on the amount of time that must be spent waiting
> for fscks to complete. Without incremental fscks there'd be no hope as
> storage capacity outstrips storage and compute bandwidth.
>
> If you believe that COW, transactional, Merkle trees are an
> anti-solution, or if you believe that they are only a tiny part of the
> solution, please argue that view.

Well, the hash trees certainly help to achieve data integrity, but at a
performance cost. Eventually, the file system becomes fragmented, and
moving the data around implies more random seeks with Merkle hash trees.
Peter Grandi
2010-Jul-03 20:03 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
>> I wrote a blog post that pertains to Lustre scalability and
>> data integrity. You can find it here:
>> http://braamstorage.blogspot.com

Ah amusing, but a bit late to the party. The DBMS community have been
dealing with these issues for a very long time; consider the canonical
definitions of "database" and "very large database":

* "database": a mass of data whose working set cannot be held in
  memory; a mass of data where every access involves at least one
  physical IO.

* "very large database": a mass of data that cannot be realistically
  taken offline for maintenance; a mass of data that takes "too long"
  to backup or check.

But I am very pleased that the "fsck wall" is getting wider exposure; I
have been pointing it out in my little corner for years.

> [ ... ] like Veritas, have already addressed this by:

> 1. Integrating the volume management and the file system. The file
> system can be spread across many volumes.

That's both crazy and nearly pointless. It is at best a dubious
convenience.

> 2. Dividing the file system into a group of filesets (like data,
> metadata, checkpoints), and allowing policies to keep different
> filesets on different volumes.

That's also crazy and nearly pointless, as described.

> 3. Creating checkpoints (they are sort of like volume snapshots,
> but they are created inside the file system itself). [ ... ]

These are an ancient feature of many fs designs, and for various reasons
versioned filesystems have never been that popular. In part because of
performance, in part because it is not that useful, in part because it
is the wrong abstraction level.

> 4. Parallel fsck - if the file system consists of allocation units
> (a sort of sub-file system, or cylinder group), then fsck can be
> started in parallel on those units.

This either is pointless or not that useful. This can be done fairly
trivially by using many filesystems, and creating a single namespace by
"mounting" them together; of course then one does not have a single free
storage pool, even if the namespace is stitched together. But it is
exceptionally difficult to have a single storage pool *and* chunking (as
soon as object contents are spread across multiple chunks 'fsck' becomes
hard, and if object contents are not spread across multiple chunks, you
don't really have a single storage pool).

The fundamental problem with 'fsck' is that:

* Data access scales up by using RAID, as N disks, with suitable access
  patterns, give a speedup of up to N (either in bandwidth or IOPS), so
  it is feasible to create very large storage systems by driving
  parallelism up at the data level.

* Unfortunately, while data performance *can* scale with the number of
  disks, metadata access cannot, because it is driven by wholly
  different access patterns, usually more graph-like than stream-like.

In essence 'fsck' is a garbage collector, and thus it is both
unavoidable and exceptionally hard to parallelize.

Note also that the "IOPS wall" (similar to the "memory wall"), where
storage device capacity and bandwidth grow faster than IOPS, eventually
calls into question even data scalability, and in some applications
(like the Lustre MDS) that is already quite apparent.

> Well, ZFS does solve many of these issues too, but in a different
> way.

ZFS is not the solution to almost any problem, except perhaps sysadmin
convenience. The UNIX lesson is that the main job of a file system is to
provide a simple, trivial "dataspace" abstraction layer, and that trying
to have it address storage concerns (for example checksumming) or
application-layer concerns (for example indices) is poor design. It does
seem quite convenient, though (to the sort of people who want to do
triple-parity RAID and 46+2 RAID6 arrays, or build large filesystems as
LVM2 concats [VGs] spanning several disks).

> So, my point is that this probably has to be solved on the backend
> side of Lustre, rather than inside Lustre.

Lustre embodies a very specific set of tradeoffs aimed at a specific
"sweet spot", as described by PeterB in his blog post. Violating design
integrity usually is very painful. A wholly new design is probably
needed.

As to scalability, there is a proof of existence for extremely scalable
file system designs, and that is GoogleFS, and it embodies pretty
extreme tradeoffs (far more extreme than Lustre) in pursuit of
scalability. If GoogleFS is the state of the art, then I suspect that
very scalable, fine grained, and highly efficient are incompatible goals
(and very, very rarely a requirement either).

BTW I am occasionally reminded of two ancient MIT TRs, one by Peter
Bishop about distributed persistent garbage collection, and one by
Svobodova on object histories in the Swallow repository.
Peter Grandi
2010-Jul-03 20:18 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
[ ... ]

>> Shadow copies are vulnerable to software bugs; things would
>> get better if there was something similar to page protection
>> for disk blocks.

Somewhat agreeable, but I hope that everybody involved in this
discussion has read the reports by CERN on invisible data corruptions,
and has meditated on the implications (real data integrity can only be
end-to-end).

[ ... ]

> [ ... ] you can fsck each transaction discretely and
> incrementally. That means that you know exactly how much work
> must be done to fsck a priori. Sure, you still have to be
> confident that N correct transactions == correct filesystem,
> but that's much easier to be confident of than software
> correctness.

That to me seems very naive, like some old claims that journals obviate
the need for 'fsck'. Nothing can obviate the need for 'fsck', and it is
essentially an auditing tool; "proving" that a sequence of correct
operations results in a correct outcome and thus that no auditing is
required, or is required only once, to me sounds extraordinarily
unrealistic (and Peter Braam uses the killer argument of bugs, but
that's not even the strongest), as it is based on this delusion:

> [ ... ] Because only by making fscks incremental and discrete
> can we get a handle on the amount of time that must be spent
> waiting for fscks to complete.

Auditing of metadata cannot be incremental. I wonder how little real
world experience backs this kind of delusion; in the real world,
existing, already checked metadata and data can be corrupted by faulty
IO directed at other data and metadata.

> Without incremental fscks there'd be no hope as storage
> capacity outstrips storage and compute bandwidth.

And it is not capacity vs. bandwidth; it is really the intrinsic ability
to parallelize data access vs. the much lesser ability to parallelize
garbage collection. Something has got to give, and if GoogleFS is the
state of the art, what has to give is functionality and efficiency.

[ ... ]
Nicolas Williams
2010-Jul-04 23:56 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 02, 2010 at 11:37:52PM -0400, Dmitry Zogin wrote:
> Well, the hash trees certainly help to achieve data integrity, but
> at a performance cost.

Merkle hash trees cost more CPU cycles, not more I/O. Indeed, they
result in _less_ I/O in the case of RAID-Zn because there's no need to
read the parity unless the checksum doesn't match. Also, how much CPU
depends on the hash function. And HW could help if this became enough of
a problem for us.

> Eventually, the file system becomes fragmented, and moving the data
> around implies more random seeks with Merkle hash trees.

Yes, fragmentation is a problem for COW, but that has nothing to do with
Merkle trees. But practically every modern filesystem coalesces writes
into contiguous writes on disk to reach streaming write performance, and
that, like COW, results in filesystem fragmentation.

(Of course, you needn't get fragmentation if you never delete or
overwrite files. You'll get some fragmentation of meta-data, but that's
much easier to garbage collect since meta-data will amount to much less
on disk than data.)

Everything we do involves trade-offs.

Nico
--
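A minimal Python sketch of the read path being argued here (illustrative
only; a plain redundant copy stands in for RAID-Zn parity
reconstruction, which would combine parity with the rest of the stripe):
the checksum is verified first, and the redundancy is read, and the bad
copy repaired, only when the checksum does not match, so the common case
costs no extra I/O.

import hashlib

def sha(b):
    return hashlib.sha256(b).hexdigest()

def read_block(data_dev, redundant_dev, idx, expected_sum):
    # Common case: one data read, checksum verifies, done.
    block = data_dev[idx]
    if sha(block) == expected_sum:
        return block
    # Mismatch: only now touch the redundancy (standing in here for
    # parity reconstruction) and self-heal the bad copy.
    repaired = redundant_dev[idx]
    if sha(repaired) != expected_sum:
        raise IOError("unrecoverable block %d" % idx)
    data_dev[idx] = repaired
    return repaired

data = {0: b"good", 1: b"r0t"}      # block 1 has rotted
mirror = {0: b"good", 1: b"rot"}
sums = {0: sha(b"good"), 1: sha(b"rot")}
assert read_block(data, mirror, 1, sums[1]) == b"rot"
assert data[1] == b"rot"            # the bad copy was repaired in place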
Nicolas Williams
2010-Jul-05 01:33 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Sat, Jul 03, 2010 at 09:18:47PM +0100, pg_lus at lus.for.sabi.co.UK wrote:
> That to me seems very naive, like some old claims that journals
> obviate the need for 'fsck'.
>
> Nothing can obviate the need for 'fsck', and it is essentially an
> auditing tool; "proving" that a sequence of correct operations
> results in a correct outcome and thus that no auditing is required,
> or is required only once, to me sounds extraordinarily unrealistic
> (and Peter Braam uses the killer argument of bugs, but that's not
> even the strongest), as it is based on this delusion:

Just because I didn't mention what ZFS calls "scrubbing" doesn't mean
that I think it's not desirable or not needed. Indeed, ZFS can do
exactly what you suggest by "scrubbing" pools, a process that traverses
all meta-data and reads all data and verifies integrity, and which can
be done concurrently with normal filesystem operation.

However, scrubbing is not "fsck" as we've always understood "fsck". The
traditional "fsck" runs before you can mount a filesystem, and it reads
at least all meta-data. That is either not feasible or not acceptable
today. Scrubbing is. As is incremental fsck.

Perhaps I misunderstood what Peter B. was getting at; perhaps Peter B.
was referring to "scrub" rather than "traditional fsck" and simply used
terminology that confused me. Or perhaps you misunderstood what "fsck"
means to me.

> Auditing of metadata cannot be incremental. I wonder how little
> real world experience backs this kind of delusion; in the real
> world, existing, already checked metadata and data can be
> corrupted by faulty IO directed at other data and metadata.

I think it's much too early for you to speak of delusion on anyone's
part here. Resorting to personal attacks is not exactly a good approach
to exchanging ideas.

Nico
--
Dmitry Zogin
2010-Jul-05 03:53 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Nicolas Williams wrote:
> Merkle hash trees cost more CPU cycles, not more I/O. Indeed, they
> result in _less_ I/O in the case of RAID-Zn because there's no need to
> read the parity unless the checksum doesn't match. Also, how much CPU
> depends on the hash function. And HW could help if this became enough
> of a problem for us.
>
> Yes, fragmentation is a problem for COW, but that has nothing to do
> with Merkle trees. But practically every modern filesystem coalesces
> writes into contiguous writes on disk to reach streaming write
> performance, and that, like COW, results in filesystem fragmentation.

What I really mean is the defragmentation issue and not the
fragmentation itself. All file systems become fragmented, as it is
unavoidable. But defragmentation of a file system that uses hash trees
really becomes a problem.

> (Of course, you needn't get fragmentation if you never delete or
> overwrite files. You'll get some fragmentation of meta-data, but
> that's much easier to garbage collect since meta-data will amount to
> much less on disk than data.)

Well, that really never happens, unless the file system is read-only.
Files are deleted and created all the time.

> Everything we do involves trade-offs.

Yes, but if the performance drop becomes unacceptable, any gain in
integrity is of little comfort.
Mitchell Erblich
2010-Jul-05 07:11 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Jul 4, 2010, at 8:53 PM, Dmitry Zogin wrote:
> What I really mean is the defragmentation issue and not the
> fragmentation itself. All file systems become fragmented, as it is
> unavoidable. But defragmentation of a file system that uses hash trees
> really becomes a problem.

Stupid me. I thought the FS fragmentation issue had a solution over a
decade ago.

When the write doesn't change the offset, then do nothing. If it is a
concatenating write, locate the best-fit block for the new size/offset,
update the metadata/inode, then free the old block. Since writes are
mostly async, who cares how long it takes, as long as there are no
commits waiting.

Mitchell Erblich
Nicolas Williams
2010-Jul-05 17:58 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Sun, Jul 04, 2010 at 11:53:29PM -0400, Dmitry Zogin wrote:
> What I really mean is the defragmentation issue and not the
> fragmentation itself. All file systems become fragmented, as it is
> unavoidable. But defragmentation of a file system that uses hash trees
> really becomes a problem.

That is emphatically not true. To defragment a ZFS-like filesystem all
you need to do is traverse the metadata looking for live blocks from old
transaction groups, then relocate those by writing them out again almost
as if an application had written to them (except with no mtime updates).
In ZFS we call this block pointer rewrite, or bp rewrite.

> > Everything we do involves trade-offs.
>
> Yes, but if the performance drop becomes unacceptable, any gain in
> integrity is of little comfort.

I believe ZFS has shown that unacceptable performance losses are not
required in order to get the additional integrity protection.

Nico
--
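A minimal Python sketch of the relocation pass described above
(illustrative only, not the actual bp rewrite implementation; the helper
names are invented): walk the block pointers, and any live block born
before a cutoff transaction group is rewritten through the normal COW
allocator, with no mtime update.

def bp_rewrite(block_pointers, read_block, cow_write, cutoff_txg):
    # Each block pointer records the block's address and birth txg.
    # Live blocks born before cutoff_txg are simply written out again,
    # as if an application had rewritten them (but with no mtime update).
    for bp in block_pointers:
        if bp["birth_txg"] < cutoff_txg:
            data = read_block(bp["addr"])
            bp["addr"], bp["birth_txg"] = cow_write(data)

# Toy backing store: old blocks are scattered, new writes go to the tail.
store = {100: b"a", 907: b"b", 5: b"c"}
next_addr = [1000]

def read_block(addr):
    return store[addr]

def cow_write(data):
    addr = next_addr[0]
    next_addr[0] += 1
    store[addr] = data
    return addr, 42                # new address, current txg

bps = [{"addr": 100, "birth_txg": 7},
       {"addr": 907, "birth_txg": 9},
       {"addr": 5, "birth_txg": 41}]
bp_rewrite(bps, read_block, cow_write, cutoff_txg=40)
print([bp["addr"] for bp in bps])  # the two old blocks now sit at 1000, 1001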
Andreas Dilger
2010-Jul-07 06:57 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On 2010-07-02, at 15:39, Peter Braam wrote:
> I wrote a blog post that pertains to Lustre scalability and data
> integrity.
>
> http://braamstorage.blogspot.com

In your blog you write:

> Unfortunately once file system check and repair is required, the
> scalability of all file systems becomes questionable. The repair tool
> needs to iterate over all objects stored in the file system, and this
> can take unacceptably long on advanced file systems like ZFS and btrfs
> just as much as on more traditional ones like ext4.
>
> This shows the shortcoming of the Lustre-ZFS proposal to address
> scalability. It merely addresses data integrity.

I agree that ZFS checksums will help detect and recover data integrity,
and we are leveraging this to provide data integrity (as described in
"End to End Data Integrity Design" on the Lustre wiki). However,
contrary to your statement, we are not depending on the checksums for
checking and fixing the distributed filesystem consistency. The
Integrity design you referenced describes the process for doing the
(largely) single-pass parallel consistency checking of the ZFS backing
filesystems at the same time as doing the distributed Lustre filesystem
consistency check, while the filesystem is active.

In the years since you have been working on Lustre, we have already
implemented ideas similar to those in ChunkFS/TileFS, using
back-references to avoid the need to keep the full filesystem state in
memory when doing checks and recovering from corruption. The OST
filesystem inodes contain their own object IDs (for recreating the OST
namespace in case of directory corruption, as anyone who's used
ll_recover_lost_found_objs can attest), and a back-pointer to the MDT
inode FID to be used for fast orphan and layout inconsistency detection.
With 2.0 the MDT inodes will also contain the FID number for
reconstructing the object index, should it be corrupted, and also the
list of hard links to the inode for doing O(1) path construction and
nlink verification. With CMD the remotely referenced MDT inodes will
have back-pointers to the originating MDT to allow local consistency
checking, similar to the shadow inodes proposed for ChunkFS.

As you pointed out, scaling fsck to be able to check a filesystem with
10^12 files within 100h is difficult. It turns out that the metadata
requirements for doing a full check within this time period exceed the
metadata requirements specified for normal operation. It of course isn't
possible to do a consistency check of a filesystem without actually
checking each of the items in that filesystem, so each one has to be
visited at least (and preferably at most) once. That said, the
requirements are not beyond what is possible with the hardware that will
be needed to host a filesystem this large in the first place, assuming
the local and distributed consistency checking can run in parallel and
utilize the full bandwidth of the filesystem.

What is also important to note is that both ZFS and the new lfsck are
designed to be able to validate the filesystem continuously as it is
being used, so there is no need to take a 100h outage before putting the
filesystem back into use.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
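For illustration, a minimal Python sketch (not the lfsck code; the
structures and names are invented) of the back-pointer idea described
above: because each OST object records the MDT FID it belongs to, a
single pass over the OST objects can flag orphans and layout mismatches
without holding the whole filesystem state in memory.

def check_backrefs(mdt_inodes, ost_objects):
    # mdt_inodes: FID -> list of OST object ids in the file's layout
    # ost_objects: object id -> back-pointer to the owning MDT FID
    orphans, mismatches = [], []
    for objid, parent_fid in ost_objects.items():
        layout = mdt_inodes.get(parent_fid)
        if layout is None:
            orphans.append(objid)            # no such MDT inode
        elif objid not in layout:
            mismatches.append(objid)         # MDT layout disagrees
    return orphans, mismatches

mdt = {"FID:1": ["obj-a", "obj-b"], "FID:2": ["obj-c"]}
ost = {"obj-a": "FID:1", "obj-b": "FID:1",
       "obj-c": "FID:2", "obj-x": "FID:9",   # orphan
       "obj-d": "FID:2"}                     # not in FID:2's layout
print(check_backrefs(mdt, ost))              # (['obj-x'], ['obj-d'])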