Hello, some questions for CFS people:

I am just fresh from a workshop where one whole day was dedicated to file
systems, and Lustre was one of the key solutions. One person attending this
workshop asked every speaker to comment on the fsck downtime for each of the
file systems discussed, and we looked at Lustre from this point of view as
well.

In the case of Lustre, this downtime is estimated as 1-3 hours per 1 TB of
ext3 in use on an OST (depending on the underlying hardware), and may last
up to several days per 1 PB for the metadata part.

Could someone from CFS suggest a sort of formula to calculate the fsck
downtime in a more accurate manner? This is often important when planning
for service levels. If a file system is spread over multiple OSTs, which
fsck operations run in parallel? Can metadata checking be parallelized?

Thanks ahead - Andrei.
On Apr 27, 2007 14:50 +0200, Andrei Maslennikov wrote:
> I am just fresh from a workshop where one whole day was dedicated to
> file systems, and Lustre was one of the key solutions. One person
> attending this workshop asked every speaker to comment on the fsck
> downtime for each of the file systems discussed, and we looked at
> Lustre from this point of view as well.
>
> In the case of Lustre, this downtime is estimated as 1-3 hours per 1 TB
> of ext3 in use on an OST (depending on the underlying hardware), and may
> last up to several days per 1 PB for the metadata part.

While the 1-3h per TB is reasonable, what is important to note is that
this checking happens IN PARALLEL for Lustre. If you have 500 2TB OSTs
= 1PB, then you can still check all of them in 2-4 hours.

CFS has also recently developed patches to improve the e2fsck speed for
ext3 filesystems by 2-20x (depends on filesystem usage). What used to
take 1h to check has been shown for production filesystems to take only
10 minutes...

> Could someone from CFS suggest a sort of formula to calculate the fsck
> downtime in a more accurate manner? This is often important when
> planning for service levels. If a file system is spread over multiple
> OSTs, which fsck operations run in parallel? Can metadata checking be
> parallelized?

Yes, the OST and MDS e2fsck checking can be done in parallel. The
distributed checking phase (lfsck) is not needed before returning the
filesystem to service, and can also be run while the filesystem is in use.
We are planning to eliminate the need for running a separate lfsck
entirely, and the filesystem will just do "scrubbing" internally all the
time during idle periods or as a low-priority task.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
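Since Andrei asked for a formula, a minimal back-of-envelope sketch of the
estimate Andreas describes might look like the Python below. It only encodes
the figures quoted in this thread (1-3 h/TB serial e2fsck rate, parallel
OST/MDS checks, a claimed 2-20x speedup from the CFS patches); the assumed
MDS check time is an illustrative placeholder, not a measured value.

# Back-of-envelope Lustre fsck downtime estimate, assuming (as stated in
# this thread) that all OST e2fsck runs and the MDS e2fsck run in parallel,
# so total downtime is dominated by the slowest single device.
# All rates here are assumptions taken from the thread, not measurements.

def fsck_downtime_hours(ost_size_tb, mds_hours, hours_per_tb=2.0, speedup=1.0):
    """Rough downtime: the slower of one OST check and the MDS check."""
    per_ost = ost_size_tb * hours_per_tb / speedup
    return max(per_ost, mds_hours / speedup)

if __name__ == "__main__":
    # 500 x 2TB OSTs (= 1PB) on separate hosts; MDS check assumed ~2h.
    for rate in (1.0, 2.0, 3.0):
        est = fsck_downtime_hours(2, 2, hours_per_tb=rate)
        print(f"{rate} h/TB -> ~{est:.1f} h downtime")

With those inputs the estimate lands in the 2-6 hour range, consistent with
the "2-4 hours" figure above at the lower per-TB rates.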
>>> On Fri, 27 Apr 2007 14:13:56 -0600, Andreas Dilger
>>> <adilger@clusterfs.com> said:

[ ... 'fsck' times ... ]

adilger> While the 1-3h per TB is reasonable, what is important to
adilger> note is that this checking happens IN PARALLEL for Lustre.
adilger> If you have 500 2TB OSTs = 1PB, then you can still check
adilger> all of them in 2-4 hours.

Ahhh interesting. But yes, that is if they are on separate hosts; I, for
example, have only one host, with 12TB on a RAID10. My main reason to
look at Lustre is not to take advantage of the cluster-based parallelism,
but to have 6x2TB OSTs on the same machine and hope that if there are
active updates to only one, then only one needs 'fsck'ing. Basically my
main reason is to reduce post-crash service unavailability due to 'fsck'.

My particular application would have 12TB of 20-80MB files, let's say
around 200,000-700,000 inodes in total.

adilger> CFS has also recently developed patches to improve the
adilger> e2fsck speed for ext3 filesystems by 2-20x (depends on
adilger> filesystem usage). What used to take 1h to check has been
adilger> shown for production filesystems to take only 10
adilger> minutes...

Well, that would be nice, but it also sounds a bit implausible.
Production filesystems tend to be full, with metadata scattered all over
the place; 'ext3' has quite a bit of widely scattered metadata and
becomes very fragmented quite rapidly.

>> Could someone from CFS suggest a sort of formula to calculate
>> the fsck downtime in a more accurate manner? This is often
>> important when planning for service levels. If a file system is
>> spread over multiple OSTs, which fsck operations run in
>> parallel? Can metadata checking be parallelized?

adilger> Yes, the OST and MDS e2fsck checking can be done in
adilger> parallel.

I wonder whether, if one had those 6x2TB OSTs on the same RAID10,
parallel checking would be faster thanks to all those arms.

adilger> The distributed checking phase (lfsck) is not needed
adilger> before returning the filesystem to service, and can also
adilger> be run while the filesystem is in use. We are planning to
adilger> eliminate the need for running a separate lfsck entirely,
adilger> and the filesystem will just do "scrubbing" internally
adilger> all the time during idle periods or as a low-priority task.

Ahh, interesting too, but this may not always be feasible: the
application I am thinking of has 24x7 simultaneous read and write rates
of around 100MB/s each (and yes, using just a single system is
unfortunately non-negotiable right now).
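As a rough sanity check on the numbers above, here is a small Python sketch
of the file-count arithmetic and of the shared-spindle question; the 2 h/TB
rate and the "fully serialized" worst case are purely illustrative
assumptions on my part, not anything measured on that hardware.

# Quick check of the 12TB / 20-80MB-file example, plus a pessimistic bound
# on running six e2fscks concurrently against one shared RAID10.
TB = 1000**4
MB = 1000**2

total_bytes = 12 * TB
for avg_file_mb in (20, 80):
    files = total_bytes / (avg_file_mb * MB)
    print(f"avg {avg_file_mb}MB -> ~{files:,.0f} files")
# -> roughly 150,000 to 600,000 files, in line with the 200,000-700,000 guess

# If six 2TB OSTs share one RAID10, six concurrent e2fscks still compete
# for the same disk arms, so (pessimistically) the total may approach the
# serial sum rather than the single-OST time.
hours_per_tb = 2.0            # assumed serial e2fsck rate from the thread
per_ost = 2 * hours_per_tb    # one 2TB OST
print(f"best case (fully parallel): ~{per_ost:.0f} h")
print(f"worst case (spindle-bound): ~{6 * per_ost:.0f} h")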
>>> On Sat, 28 Apr 2007 20:42:25 +0100,
>>> pg_lus@lus.for.sabi.co.UK (Peter Grandi) said:

>>> On Fri, 27 Apr 2007 14:13:56 -0600, Andreas Dilger
>>> <adilger@clusterfs.com> said:

pg> [ ... 'fsck' times ... ] My main reason to look at Lustre is
pg> not to take advantage of the cluster-based parallelism, but
pg> to have 6x2TB OSTs on the same machine and hope that if
pg> there are active updates to only one then only one needs
pg> 'fsck'ing. Basically my main reason is to reduce post-crash
pg> service unavailability due to 'fsck'.

Just noticed a recent thread in 'comp.arch.storage' that demonstrates the
same worry, only on a much, much bigger scale:

http://groups.google.com/group/comp.arch.storage/browse_thread/thread/d6f10ec24c07ed53/ecd3a745fbf561e6

"The system's storage is based on code which writes many files to the
file system, with overall storage needs currently around 40TB and
expected to reach hundreds of TBs. The average file size of the system
is ~100K, which translates to ~500 million files today, and billions of
files in the future."

"We're looking for an alternative solution, in an attempt to improve
performance and ability to recover from disasters (fsck on 2^42 files
isn't practical, and I'm getting pretty worried due to this fact - even
the smallest filesystem inconsistency will leave me lots of useless
bits)."

Ehehehe, "pretty worried" :-). From the numbers that I have seen, a Lustre
cluster might support that kind of requirement, even if the recovery time
might be some days. I hope.
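For what it's worth, the file counts in the quoted posting follow directly
from its size figures; a tiny sketch (the 300TB "future" figure below is
just a hypothetical stand-in for "hundreds of TBs"):

# Rough check of the comp.arch.storage figures: ~40TB at ~100K per file.
TB, KB = 1000**4, 1000
files_today = 40 * TB / (100 * KB)
print(f"~{files_today:,.0f} files")    # ~400 million, same order as "~500 million"
files_future = 300 * TB / (100 * KB)   # hypothetical "hundreds of TBs" case
print(f"~{files_future:,.0f} files")   # low billions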