Peter Braam
2010-Jul-02 18:53 UTC
[Lustre-discuss] Integrity and corruption - can file systems be scalable?
I wrote a blog post that pertains to Lustre scalability and data
integrity. You can find it here:

http://braamstorage.blogspot.com

Regards,

Peter
Dmitry Zogin
2010-Jul-02 20:52 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Hello Peter,

These are really good questions posted there, but I don't think they are
Lustre-specific. These issues are common to any file system. Some of the
mature file systems, like Veritas, have already addressed this by:

1. Integrating the volume management and the file system. The file
system can be spread across many volumes.

2. Dividing the file system into a group of filesets (like data,
metadata, checkpoints), and allowing policies to keep different filesets
on different volumes.

3. Creating checkpoints (they are sort of like volume snapshots, but
they are created inside the file system itself). The checkpoints are
simply copy-on-write filesets created instantly inside the fs itself.
Using copy-on-write techniques saves physical space and makes fileset
creation instantaneous. They allow reverting to a certain point
instantaneously, as the modified blocks are kept aside, and the only
thing that has to be done is to point back to the old blocks of
information.

4. Parallel fsck - if the file system consists of allocation units (a
sort of sub-file system, or cylinder group), then fsck can be started in
parallel on those units.

Well, ZFS does solve many of these issues too, but in a different way.
So, my point is that this probably has to be solved on the backend side
of Lustre, rather than inside Lustre.

Best regards,

Dmitry

Peter Braam wrote:
> I wrote a blog post that pertains to Lustre scalability and data
> integrity. You can find it here:
>
> http://braamstorage.blogspot.com
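As a concrete illustration of the copy-on-write checkpoint in point 3,
here is a minimal Python sketch (not VxFS or Lustre code; the class and
names are invented): a checkpoint copies only the block-pointer map, and
reverting just points back at the old blocks.

# Illustrative sketch only: a checkpoint is a frozen copy of the
# block-pointer map; data blocks are never copied or moved.
class CowFileset:
    def __init__(self, blocks):
        self.block_map = dict(enumerate(blocks))   # logical block -> data
        self.checkpoints = {}

    def checkpoint(self, name):
        # Instantaneous: only the pointer map is copied, not the data.
        self.checkpoints[name] = dict(self.block_map)

    def write(self, lbn, data):
        # COW: the old block stays in place, still referenced by any
        # checkpoint; the new contents simply occupy a new block.
        self.block_map[lbn] = data

    def revert(self, name):
        # Reverting only points back at the old blocks.
        self.block_map = dict(self.checkpoints[name])

fs = CowFileset(["A0", "B0", "C0"])
fs.checkpoint("before-change")
fs.write(1, "B1")                  # modified block kept aside
fs.revert("before-change")
assert fs.block_map[1] == "B0"     # old data was never copied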
Peter Braam
2010-Jul-02 20:59 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Dmitry,

The point of the note is the opposite of what you write, namely that
backend systems in fact do not solve this, unless they are guaranteed to
be bug free.

Peter

On Fri, Jul 2, 2010 at 2:52 PM, Dmitry Zogin <dmitry.zoguine at oracle.com> wrote:
> So, my point is that this probably has to be solved on the backend
> side of Lustre, rather than inside Lustre.
Nicolas Williams
2010-Jul-02 21:09 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 02, 2010 at 02:59:00PM -0600, Peter Braam wrote:
> The point of the note is the opposite of what you write, namely that
> backend systems in fact do not solve this, unless they are guaranteed
> to be bug free.

Fsck tools can also be buggy. Consider them redundant code run
asynchronously.

Is it possible to fsck petabytes in reasonable time? Not if storage
capacity grows faster than storage bandwidth. The obvious alternatives
are: test, test, test, and/or run redundant fsck-like code
synchronously. The latter could be done by reading just-written
transactions to check that the filesystem is consistent.

Nico
--
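One way to picture the synchronous, redundant checking suggested here (a
Python sketch only, with invented names; not Lustre or ZFS code): the
transaction is read back from storage immediately after it is written
and re-verified by independently written checking code before the write
is acknowledged.

import hashlib, json

def check_transaction(txn):
    # Redundant, independently written consistency check of one record.
    assert txn["type"] in ("create", "write", "unlink")
    assert all(b >= 0 for b in txn["blocks"])
    digest = hashlib.sha256(json.dumps(txn["blocks"]).encode()).hexdigest()
    assert txn["checksum"] == digest

def commit(disk, txid, txn):
    txn["checksum"] = hashlib.sha256(
        json.dumps(txn["blocks"]).encode()).hexdigest()
    disk[txid] = json.dumps(txn)        # write the transaction
    readback = json.loads(disk[txid])   # read back what was just written
    check_transaction(readback)         # verify synchronously, before ack
    return txid

disk = {}
commit(disk, 1, {"type": "write", "blocks": [10, 11, 12]})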
Dmitry Zogin
2010-Jul-02 21:18 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Peter,

That is right - some of them do not. My point was that the Veritas fs
already has many of these things implemented, like parallel fsck,
copy-on-write checkpoints, etc. If it was used as a backend for Lustre,
that would be the perfect match. ZFS has some of its features, but not
all.

But, let's say, adding things like that into Lustre itself will make it
even more complex, and it is already very complex. Certainly, things
like checkpoints can be added at the MDT level - consider an inode on
the MDT pointing to another MDT inode, instead of to the OST objects -
that would be a clone. If the file is modified, the MDT inode then
points to an OST object which keeps only the changed file blocks. This
would be a sort of checkpoint allowing the file to be reverted. Well,
while this is known to help restore data in the case of human error or
an application bug, it won't help protect against HW-induced errors.

But the parallel fsck issue stands somewhat alone - if we want fsck to
be faster, we had better make it parallel at every OST level - that's
why I think this has to be done on the backend side.

Dmitry

Peter Braam wrote:
> The point of the note is the opposite of what you write, namely that
> backend systems in fact do not solve this, unless they are guaranteed
> to be bug free.
Peter Braam
2010-Jul-02 21:39 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 2, 2010 at 3:18 PM, Dmitry Zogin <dmitry.zoguine at oracle.com> wrote:
> That is right - some of them do not. My point was that the Veritas fs
> already has many of these things implemented, like parallel fsck,
> copy-on-write checkpoints, etc. If it was used as a backend for
> Lustre, that would be the perfect match. ZFS has some of its features,
> but not all.

Parallel fsck doesn't help once you are down to one disk (as pointed out
in the post).

The post also mentions copy-on-write checkpoints, and their usefulness
has not been proven. There has been no study about this, and certainly
in many cases they are implemented in such a way that bugs in the
software can corrupt them. For example, most volume-level copy-on-write
schemes actually copy the old data instead of leaving it in place, which
is a vulnerability. Shadow copies are vulnerable to software bugs;
things would get better if there was something similar to page
protection for disk blocks.

> Certainly, things like checkpoints can be added at the MDT level -
> consider an inode on the MDT pointing to another MDT inode, instead of
> to the OST objects - that would be a clone. If the file is modified,
> the MDT inode then points to an OST object which keeps only the
> changed file blocks. This would be a sort of checkpoint allowing the
> file to be reverted. Well, while this is known to help restore data in
> the case of human error or an application bug, it won't help protect
> against HW-induced errors.

Again, pointing to other objects is subject to possible software bugs.

I wrote this post because I'm unconvinced by the barrage of by-now
endlessly repeated ideas like checkpoints, checksums, etc., and by the
falsehood of the claim that advanced file systems address these issues -
they only address some, and leave critical vulnerabilities. Nicolas'
post is more along the lines that I think will lead to a solution.

Peter

> But the parallel fsck issue stands somewhat alone - if we want fsck to
> be faster, we had better make it parallel at every OST level - that's
> why I think this has to be done on the backend side.
>
> Dmitry
Nicolas Williams
2010-Jul-02 22:21 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 02, 2010 at 03:39:42PM -0600, Peter Braam wrote:
> The post also mentions copy-on-write checkpoints, and their usefulness
> has not been proven. There has been no study about this, and certainly
> in many cases they are implemented in such a way that bugs in the
> software can corrupt them. For example, most volume-level
> copy-on-write schemes actually copy the old data instead of leaving it
> in place, which is a vulnerability. Shadow copies are vulnerable to
> software bugs; things would get better if there was something similar
> to page protection for disk blocks.

Well-delineated transactions are certainly useful. The reason: you can
fsck each transaction discretely and incrementally. That means that you
know exactly how much work must be done to fsck a priori. Sure, you
still have to be confident that N correct transactions == correct
filesystem, but that's much easier to be confident of than software
correctness. (It'd be interesting to apply theorem provers to theorems
related to on-disk data formats!)

Another problem, incidentally, is software correctness on the read side.
It's nice to know that no bugs on the write side will corrupt your
filesystem, but read-side bugs that cause your data to be unavailable
are not good either. The distinction between bugs on the write vs. read
sides is subtle: recovery from the latter is just a patch away, while
recovery from the former might require long fscks, or even more manual
intervention (e.g., writing a better fsck).

> I wrote this post because I'm unconvinced by the barrage of by-now
> endlessly repeated ideas like checkpoints, checksums, etc., and by the
> falsehood of the claim that advanced file systems address these issues
> - they only address some, and leave critical vulnerabilities.

I do believe COW transactions + Merkle hash trees are _the_ key aspect
of the solution. Because only by making fscks incremental and discrete
can we get a handle on the amount of time that must be spent waiting for
fscks to complete. Without incremental fscks there'd be no hope as
storage capacity outstrips storage and compute bandwidth.

If you believe that COW, transactional, Merkle trees are an
anti-solution, or if you believe that they are only a tiny part of the
solution, please argue that view. Otherwise I think your use of
"barrage" here is a bit over the top (nay, a lot over the top). It's one
thing to be missing a part of the solution, and it's another to be on
the wrong track, or missing the largest part of the solution.
Extraordinary claims and all that...

(And no, manually partitioning storage into discrete "filesystems",
"filesets", "datasets", whatever, is not a solution; at most it's a
bandaid.)

Nico
--
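To make the "known amount of work a priori" point concrete, here is a
Python sketch (illustrative only, not any real fsck; the structures are
invented): checking a transaction touches only the blocks that
transaction modified, so the cost is proportional to the transaction
size, not to the size of the filesystem.

def check_incremental(live_blocks, txn):
    # Work is proportional to len(txn["writes"]), known before we start,
    # no matter how large live_blocks (i.e., the whole filesystem) is.
    for blk, data in txn["writes"]:
        assert blk not in live_blocks, "transaction overwrote a live block"
        assert data is not None
    # Only after the bounded check passes does the transaction go live.
    live_blocks.update(blk for blk, _ in txn["writes"])

live = {0, 1, 2}                                              # already in use
check_incremental(live, {"writes": [(7, "new"), (8, "new")]})   # passes
try:
    check_incremental(live, {"writes": [(1, "oops")]})          # caught
except AssertionError as err:
    print("incremental fsck caught:", err)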
Nicolas Williams
2010-Jul-02 22:35 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
I explained why well-delineated transactions help, but didn't really
explain why COW and Merkle hash trees help.

COW helps ensure that correct transactions cannot result in incorrect
filesystems -- fsck need only ensure that a transaction hasn't
overwritten live blocks to guarantee that one can at least roll back to
that transaction. Merkle hash trees help detect (and recover from) bit
rot and hardware errors, which in turn helps ensure that those
incremental fscks are dealing with correct meta-data (correct fsck code
+ bad meta-data == bad fsck).

It's much harder to ensure that there are no errors in parts of the
system that are exposed due to lack of special protection features (such
as ECC memory), in system buses and CPUs, that might be difficult or
impossible to protect against in software. One option is to run the
fscks on different hosts than the ones doing the writing (this means
multi-pathing though, which complicates the overall system, but at least
we currently depend on multi-pathing anyways). But even that won't
protect against such unprotectable errors in _data_ (originating in
faraway clients, say).

Nico
--
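A minimal Python sketch of the Merkle-tree detection described above
(illustrative only; a flat two-level tree rather than a real on-disk
layout): every parent records the hash of its children, so a flipped bit
anywhere below changes the recomputed root and is detected.

import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build(leaves):
    # Parent = hash over the children's hashes (two levels for brevity).
    leaf_hashes = [h(b) for b in leaves]
    root = h("".join(leaf_hashes).encode())
    return leaf_hashes, root

def verify(leaves, leaf_hashes, root):
    # Recompute bottom-up; any flipped bit changes a leaf hash and
    # therefore the root, so the corruption cannot go unnoticed.
    if [h(b) for b in leaves] != leaf_hashes:
        return False
    return h("".join(leaf_hashes).encode()) == root

blocks = [b"data0", b"data1", b"data2"]
hashes, root = build(blocks)
assert verify(blocks, hashes, root)
blocks[1] = b"dat\x00a1"                  # simulated bit rot
assert not verify(blocks, hashes, root)   # detected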
Dmitry Zogin
2010-Jul-03 03:37 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Nicolas Williams wrote:
> I do believe COW transactions + Merkle hash trees are _the_ key aspect
> of the solution. Because only by making fscks incremental and discrete
> can we get a handle on the amount of time that must be spent waiting
> for fscks to complete. Without incremental fscks there'd be no hope as
> storage capacity outstrips storage and compute bandwidth.
>
> If you believe that COW, transactional, Merkle trees are an
> anti-solution, or if you believe that they are only a tiny part of the
> solution, please argue that view.

Well, the hash trees certainly help to achieve data integrity, but at a
performance cost. Eventually, the file system becomes fragmented, and
moving the data around implies more random seeks with Merkle hash trees.
Peter Grandi
2010-Jul-03 20:03 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
>> I wrote a blog post that pertains to Lustre scalability and
>> data integrity. You can find it here:
>> http://braamstorage.blogspot.com

Ah amusing, but a bit late to the party. The DBMS community have been
dealing with these issues for a very long time; consider the canonical
definitions of "database" and "very large database":

* "database": a mass of data whose working set cannot be held in
  memory; a mass of data where every access involves at least one
  physical IO.

* "very large database": a mass of data that cannot be realistically
  taken offline for maintenance; a mass of data that takes "too long"
  to backup or check.

But I am very pleased that the "fsck wall" is getting wider exposure; I
have been pointing it out in my little corner for years.

> [ ... ] like Veritas, have already addressed this by:

> 1. Integrating the volume management and the file system. The file
> system can be spread across many volumes.

That's both crazy and nearly pointless. It is at best a dubious
convenience.

> 2. Dividing the file system into a group of filesets (like data,
> metadata, checkpoints), and allowing policies to keep different
> filesets on different volumes.

That's also crazy and nearly pointless, as described.

> 3. Creating checkpoints (they are sort of like volume snapshots,
> but they are created inside the file system itself). [ ... ]

These are an ancient feature of many fs designs, and for various reasons
versioned filesystems have never been that popular. In part because of
performance, in part because it is not that useful, in part because it
is the wrong abstraction level.

> 4. Parallel fsck - if the file system consists of allocation units
> (a sort of sub-file system, or cylinder group), then fsck can be
> started in parallel on those units.

This either is pointless or not that useful. This can be done fairly
trivially by using many filesystems, and creating a single namespace by
"mounting" them together; of course then one does not have a single free
storage pool, even if the namespace is stitched together. But it is
exceptionally difficult to have a single storage pool *and* chunking (as
soon as object contents are spread across multiple chunks 'fsck' becomes
hard, and if object contents are not spread across multiple chunks, you
don't really have a single storage pool).

The fundamental problem with 'fsck' is that:

* Data access scales up by using RAID, as N disks, with suitable access
  patterns, give a speedup of up to N (either in bandwidth or IOPS), so
  it is feasible to create very large storage systems by driving
  parallelism up at the data level.

* Unfortunately, while data performance *can* scale with the number of
  disks, metadata access cannot, because it is driven by wholly
  different access patterns, usually more graph-like than stream-like.

In essence 'fsck' is a garbage collector, and thus it is both
unavoidable and exceptionally hard to parallelize.

Note also that the "IOPS wall" (similar to the "memory wall"), where
storage device capacity and bandwidth grow faster than IOPS, eventually
calls into question even data scalability, and in some applications
(like the Lustre MDS) that is already quite apparent.

> Well, ZFS does solve many of these issues too, but in a different
> way.

ZFS is not the solution to almost any problem, except perhaps sysadmin
convenience. The UNIX lesson is that the main job of a file system is to
provide a simple, trivial "dataspace" abstraction layer, and that trying
to have it address storage concerns (for example checksumming) or
application-layer concerns (for example indices) is poor design. It does
seem quite convenient, though (to the sort of people who want to do
triple-parity RAID and 46+2 RAID6 arrays, or build large filesystems as
LVM2 concats [VGs] spanning several disks).

> So, my point is that this probably has to be solved on the backend
> side of Lustre, rather than inside Lustre.

Lustre embodies a very specific set of tradeoffs aimed at a specific
"sweet spot", as described by PeterB in his blog post. Violating design
integrity usually is very painful. A wholly new design is probably
needed.

As to scalability, there is a proof of existence for extremely scalable
file system designs, and that is GoogleFS, and it embodies pretty
extreme tradeoffs (far more extreme than Lustre) in pursuit of
scalability. If GoogleFS is the state of the art, then I suspect that
very scalable, fine grained, and highly efficient are incompatible goals
(and very, very rarely a requirement either).

BTW I am occasionally reminded of two ancient MIT TRs, one by Peter
Bishop about distributed persistent garbage collection, and one by
Svobodova on object histories in the Swallow repository.
Peter Grandi
2010-Jul-03 20:18 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
[ ... ]

>> Shadow copies are vulnerable to software bugs; things would
>> get better if there was something similar to page protection
>> for disk blocks.

Somewhat agreeable, but I hope that everybody involved in this
discussion has read the reports by CERN on invisible data corruptions,
and has meditated on the implications (real data integrity can only be
end-to-end).

[ ... ]

> [ ... ] you can fsck each transaction discretely and
> incrementally. That means that you know exactly how much work
> must be done to fsck a priori. Sure, you still have to be
> confident that N correct transactions == correct filesystem,
> but that's much easier to be confident of than software
> correctness.

That to me seems very naive, like some old claims that journals obviate
the need for 'fsck'. Nothing can obviate the need for 'fsck', and it is
essentially an auditing tool; "proving" that a sequence of correct
operations results in a correct outcome and thus that no auditing is
required, or is required only once, to me sounds extraordinarily
unrealistic (and Peter Braam uses the killer argument of bugs, but
that's not even the strongest), as it is based on this delusion:

> [ ... ] Because only by making fscks incremental and discrete
> can we get a handle on the amount of time that must be spent
> waiting for fscks to complete.

Auditing of metadata cannot be incremental. I wonder how little real
world experience backs this kind of delusion; in the real world,
existing, already checked metadata and data can be corrupted by faulty
IO directed at other data and metadata.

> Without incremental fscks there'd be no hope as storage
> capacity outstrips storage and compute bandwidth.

And it is not capacity vs. bandwidth; it is really the intrinsic ability
to parallelize data access vs. the much lesser ability to parallelize
garbage collection. Something has got to give, and if GoogleFS is the
state of the art, what has to give is functionality and efficiency.

[ ... ]
Nicolas Williams
2010-Jul-04 23:56 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Fri, Jul 02, 2010 at 11:37:52PM -0400, Dmitry Zogin wrote:
> Well, the hash trees certainly help to achieve data integrity, but
> at a performance cost.

Merkle hash trees cost more CPU cycles, not more I/O. Indeed, they
result in _less_ I/O in the case of RAID-Zn because there's no need to
read the parity unless the checksum doesn't match. Also, how much CPU
depends on the hash function. And HW could help if this became enough of
a problem for us.

> Eventually, the file system becomes fragmented, and moving the data
> around implies more random seeks with Merkle hash trees.

Yes, fragmentation is a problem for COW, but that has nothing to do with
Merkle trees. But practically every modern filesystem coalesces writes
into contiguous writes on disk to reach streaming write performance, and
that, like COW, results in filesystem fragmentation.

(Of course, you needn't get fragmentation if you never delete or
overwrite files. You'll get some fragmentation of meta-data, but that's
much easier to garbage collect since meta-data will amount to much less
on disk than data.)

Everything we do involves trade-offs.

Nico
--
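A minimal Python sketch of the read path being argued here (illustrative
only; a plain redundant copy stands in for RAID-Zn parity
reconstruction, which would combine parity with the rest of the stripe):
the checksum is verified first, and the redundancy is read, and the bad
copy repaired, only when the checksum does not match, so the common case
costs no extra I/O.

import hashlib

def sha(b):
    return hashlib.sha256(b).hexdigest()

def read_block(data_dev, redundant_dev, idx, expected_sum):
    # Common case: one data read, checksum verifies, done.
    block = data_dev[idx]
    if sha(block) == expected_sum:
        return block
    # Mismatch: only now touch the redundancy (standing in here for
    # parity reconstruction) and self-heal the bad copy.
    repaired = redundant_dev[idx]
    if sha(repaired) != expected_sum:
        raise IOError("unrecoverable block %d" % idx)
    data_dev[idx] = repaired
    return repaired

data = {0: b"good", 1: b"r0t"}      # block 1 has rotted
mirror = {0: b"good", 1: b"rot"}
sums = {0: sha(b"good"), 1: sha(b"rot")}
assert read_block(data, mirror, 1, sums[1]) == b"rot"
assert data[1] == b"rot"            # the bad copy was repaired in place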
Nicolas Williams
2010-Jul-05 01:33 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Sat, Jul 03, 2010 at 09:18:47PM +0100, pg_lus at lus.for.sabi.co.UK wrote:
> That to me seems very naive, like some old claims that journals
> obviate the need for 'fsck'.
>
> Nothing can obviate the need for 'fsck', and it is essentially an
> auditing tool; "proving" that a sequence of correct operations
> results in a correct outcome and thus that no auditing is required,
> or is required only once, to me sounds extraordinarily unrealistic
> (and Peter Braam uses the killer argument of bugs, but that's not
> even the strongest), as it is based on this delusion:

Just because I didn't mention what ZFS calls "scrubbing" doesn't mean
that I think it's not desirable or not needed. Indeed, ZFS can do
exactly what you suggest by "scrubbing" pools, a process that traverses
all meta-data and reads all data and verifies integrity, and which can
be done concurrently with normal filesystem operation.

However, scrubbing is not "fsck" as we've always understood "fsck". The
traditional "fsck" runs before you can mount a filesystem, and it reads
at least all meta-data. That is either not feasible or not acceptable
today. Scrubbing is. As is incremental fsck.

Perhaps I misunderstood what Peter B. was getting at; perhaps Peter B.
was referring to "scrub" rather than "traditional fsck" and simply used
terminology that confused me. Or perhaps you misunderstood what "fsck"
means to me.

> Auditing of metadata cannot be incremental. I wonder how little
> real world experience backs this kind of delusion; in the real
> world, existing, already checked metadata and data can be
> corrupted by faulty IO directed at other data and metadata.

I think it's much too early for you to speak of delusion on anyone's
part here. Resorting to personal attacks is not exactly a good approach
to exchanging ideas.

Nico
--
Dmitry Zogin
2010-Jul-05 03:53 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
Nicolas Williams wrote:
> Merkle hash trees cost more CPU cycles, not more I/O. Indeed, they
> result in _less_ I/O in the case of RAID-Zn because there's no need to
> read the parity unless the checksum doesn't match. Also, how much CPU
> depends on the hash function. And HW could help if this became enough
> of a problem for us.
>
> Yes, fragmentation is a problem for COW, but that has nothing to do
> with Merkle trees. But practically every modern filesystem coalesces
> writes into contiguous writes on disk to reach streaming write
> performance, and that, like COW, results in filesystem fragmentation.

What I really mean is the defragmentation issue and not the
fragmentation itself. All file systems become fragmented, as it is
unavoidable. But defragmentation of a file system that uses hash trees
really becomes a problem.

> (Of course, you needn't get fragmentation if you never delete or
> overwrite files. You'll get some fragmentation of meta-data, but
> that's much easier to garbage collect since meta-data will amount to
> much less on disk than data.)

Well, that really never happens, unless the file system is read-only.
Files are deleted and created all the time.

> Everything we do involves trade-offs.

Yes, but if the performance drop becomes unacceptable, any gain in
integrity is of little comfort.
Mitchell Erblich
2010-Jul-05 07:11 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Jul 4, 2010, at 8:53 PM, Dmitry Zogin wrote:
> What I really mean is the defragmentation issue and not the
> fragmentation itself. All file systems become fragmented, as it is
> unavoidable. But defragmentation of a file system that uses hash trees
> really becomes a problem.

Stupid me. I thought the FS fragmentation issue had a solution over a
decade ago.

When the write doesn't change the offset, then do nothing. If it is a
concatenating write, locate the best-fit block for the new size/offset,
update the metadata/inode, then free the old block. Since writes are
mostly async, who cares how long it takes, as long as there are no
commits waiting.

Mitchell Erblich
Nicolas Williams
2010-Jul-05 17:58 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On Sun, Jul 04, 2010 at 11:53:29PM -0400, Dmitry Zogin wrote:
> What I really mean is the defragmentation issue and not the
> fragmentation itself. All file systems become fragmented, as it is
> unavoidable. But defragmentation of a file system that uses hash trees
> really becomes a problem.

That is emphatically not true. To defragment a ZFS-like filesystem all
you need to do is traverse the metadata looking for live blocks from old
transaction groups, then relocate those by writing them out again almost
as if an application had written to them (except with no mtime updates).
In ZFS we call this block pointer rewrite, or bp rewrite.

> > Everything we do involves trade-offs.
>
> Yes, but if the performance drop becomes unacceptable, any gain in
> integrity is of little comfort.

I believe ZFS has shown that unacceptable performance losses are not
required in order to get the additional integrity protection.

Nico
--
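A minimal Python sketch of the relocation pass described above
(illustrative only, not the actual bp rewrite implementation; the helper
names are invented): walk the block pointers, and any live block born
before a cutoff transaction group is rewritten through the normal COW
allocator, with no mtime update.

def bp_rewrite(block_pointers, read_block, cow_write, cutoff_txg):
    # Each block pointer records the block's address and birth txg.
    # Live blocks born before cutoff_txg are simply written out again,
    # as if an application had rewritten them (but with no mtime update).
    for bp in block_pointers:
        if bp["birth_txg"] < cutoff_txg:
            data = read_block(bp["addr"])
            bp["addr"], bp["birth_txg"] = cow_write(data)

# Toy backing store: old blocks are scattered, new writes go to the tail.
store = {100: b"a", 907: b"b", 5: b"c"}
next_addr = [1000]

def read_block(addr):
    return store[addr]

def cow_write(data):
    addr = next_addr[0]
    next_addr[0] += 1
    store[addr] = data
    return addr, 42                # new address, current txg

bps = [{"addr": 100, "birth_txg": 7},
       {"addr": 907, "birth_txg": 9},
       {"addr": 5, "birth_txg": 41}]
bp_rewrite(bps, read_block, cow_write, cutoff_txg=40)
print([bp["addr"] for bp in bps])  # the two old blocks now sit at 1000, 1001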
Andreas Dilger
2010-Jul-07 06:57 UTC
[Lustre-discuss] [Lustre-devel] Integrity and corruption - can file systems be scalable?
On 2010-07-02, at 15:39, Peter Braam wrote:
> I wrote a blog post that pertains to Lustre scalability and data
> integrity.
>
> http://braamstorage.blogspot.com

In your blog you write:

> Unfortunately once file system check and repair is required, the
> scalability of all file systems becomes questionable. The repair tool
> needs to iterate over all objects stored in the file system, and this
> can take unacceptably long on advanced file systems like ZFS and btrfs
> just as much as on more traditional ones like ext4.
>
> This shows the shortcoming of the Lustre-ZFS proposal to address
> scalability. It merely addresses data integrity.

I agree that ZFS checksums will help detect and recover data integrity,
and we are leveraging this to provide data integrity (as described in
"End to End Data Integrity Design" on the Lustre wiki). However,
contrary to your statement, we are not depending on the checksums for
checking and fixing the distributed filesystem consistency. The
Integrity design you referenced describes the process for doing the
(largely) single-pass parallel consistency checking of the ZFS backing
filesystems at the same time as doing the distributed Lustre filesystem
consistency check, while the filesystem is active.

In the years since you have been working on Lustre, we have already
implemented ideas similar to those in ChunkFS/TileFS, using
back-references to avoid the need to keep the full filesystem state in
memory when doing checks and recovering from corruption. The OST
filesystem inodes contain their own object IDs (for recreating the OST
namespace in case of directory corruption, as anyone who's used
ll_recover_lost_found_objs can attest), and a back-pointer to the MDT
inode FID to be used for fast orphan and layout inconsistency detection.
With 2.0 the MDT inodes will also contain the FID number for
reconstructing the object index, should it be corrupted, and also the
list of hard links to the inode for doing O(1) path construction and
nlink verification. With CMD the remotely referenced MDT inodes will
have back-pointers to the originating MDT to allow local consistency
checking, similar to the shadow inodes proposed for ChunkFS.

As you pointed out, scaling fsck to be able to check a filesystem with
10^12 files within 100h is difficult. It turns out that the metadata
requirements for doing a full check within this time period exceed the
metadata requirements specified for normal operation. It of course isn't
possible to do a consistency check of a filesystem without actually
checking each of the items in that filesystem, so each one has to be
visited at least (and preferably at most) once. That said, the
requirements are not beyond what is possible with the hardware that will
be needed to host a filesystem this large in the first place, assuming
the local and distributed consistency checking can run in parallel and
utilize the full bandwidth of the filesystem.

What is also important to note is that both ZFS and the new lfsck are
designed to be able to validate the filesystem continuously as it is
being used, so there is no need to take a 100h outage before putting the
filesystem back into use.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
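For illustration, a minimal Python sketch (not the lfsck code; the
structures and names are invented) of the back-pointer idea described
above: because each OST object records the MDT FID it belongs to, a
single pass over the OST objects can flag orphans and layout mismatches
without holding the whole filesystem state in memory.

def check_backrefs(mdt_inodes, ost_objects):
    # mdt_inodes: FID -> list of OST object ids in the file's layout
    # ost_objects: object id -> back-pointer to the owning MDT FID
    orphans, mismatches = [], []
    for objid, parent_fid in ost_objects.items():
        layout = mdt_inodes.get(parent_fid)
        if layout is None:
            orphans.append(objid)            # no such MDT inode
        elif objid not in layout:
            mismatches.append(objid)         # MDT layout disagrees
    return orphans, mismatches

mdt = {"FID:1": ["obj-a", "obj-b"], "FID:2": ["obj-c"]}
ost = {"obj-a": "FID:1", "obj-b": "FID:1",
       "obj-c": "FID:2", "obj-x": "FID:9",   # orphan
       "obj-d": "FID:2"}                     # not in FID:2's layout
print(check_backrefs(mdt, ost))              # (['obj-x'], ['obj-d'])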