Hello, some questions for CFS people:

I am just fresh from a workshop where one whole day was dedicated to file
systems, and Lustre was one of the key solutions. One person attending this
workshop asked every speaker to comment on the fsck downtime for each of the
file systems discussed, and we looked at Lustre from this point of view as
well.

In the case of Lustre, this downtime is estimated as 1-3 hours per 1 TB of
ext3 in use on an OST (depending on the underlying hardware), and may last
up to several days per 1 PB for the metadata part.

Could someone from CFS suggest a sort of formula to calculate the fsck
downtime in a more accurate manner? This is often important when planning
for service levels. If a file system is spread over multiple OSTs, which
fsck operations run in parallel? Can metadata checking be parallelized?

Thanks ahead - Andrei.
On Apr 27, 2007 14:50 +0200, Andrei Maslennikov wrote:
> I am just fresh from a workshop where one whole day was dedicated to
> file systems, and Lustre was one of the key solutions. One person
> attending this workshop asked every speaker to comment on the fsck
> downtime for each of the file systems discussed, and we looked at
> Lustre from this point of view as well.
>
> In the case of Lustre, this downtime is estimated as 1-3 hours per 1 TB
> of ext3 in use on an OST (depending on the underlying hardware), and may
> last up to several days per 1 PB for the metadata part.

While the 1-3h per TB is reasonable, what is important to note is that
this checking happens IN PARALLEL for Lustre. If you have 500 2TB OSTs
= 1PB, then you can still check all of them in 2-4 hours.

CFS has also recently developed patches to improve the e2fsck speed for
ext3 filesystems by 2-20x (depends on filesystem usage). What used to
take 1h to check has been shown for production filesystems to take only
10 minutes...

> Could someone from CFS suggest a sort of formula to calculate the fsck
> downtime in a more accurate manner? This is often important when
> planning for service levels. If a file system is spread over multiple
> OSTs, which fsck operations run in parallel? Can metadata checking be
> parallelized?

Yes, the OST and MDS e2fsck checking can be done in parallel. The
distributed checking phase (lfsck) is not needed before returning the
filesystem to service, and can also be run while the filesystem is in use.
We are planning to eliminate the need for running a separate lfsck
entirely, and the filesystem will just do "scrubbing" internally all the
time during idle periods or as a low-priority task.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
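Since Andrei asked for a formula, a minimal back-of-envelope sketch of the
estimate Andreas describes might look like the Python below. It only encodes
the figures quoted in this thread (1-3 h/TB serial e2fsck rate, parallel
OST/MDS checks, a claimed 2-20x speedup from the CFS patches); the assumed
MDS check time is an illustrative placeholder, not a measured value.

# Back-of-envelope Lustre fsck downtime estimate, assuming (as stated in
# this thread) that all OST e2fsck runs and the MDS e2fsck run in parallel,
# so total downtime is dominated by the slowest single device.
# All rates here are assumptions taken from the thread, not measurements.

def fsck_downtime_hours(ost_size_tb, mds_hours, hours_per_tb=2.0, speedup=1.0):
    """Rough downtime: the slower of one OST check and the MDS check."""
    per_ost = ost_size_tb * hours_per_tb / speedup
    return max(per_ost, mds_hours / speedup)

if __name__ == "__main__":
    # 500 x 2TB OSTs (= 1PB) on separate hosts; MDS check assumed ~2h.
    for rate in (1.0, 2.0, 3.0):
        est = fsck_downtime_hours(2, 2, hours_per_tb=rate)
        print(f"{rate} h/TB -> ~{est:.1f} h downtime")

With those inputs the estimate lands in the 2-6 hour range, consistent with
the "2-4 hours" figure above at the lower per-TB rates.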
>>> On Fri, 27 Apr 2007 14:13:56 -0600, Andreas Dilger
>>> <adilger@clusterfs.com> said:

[ ... 'fsck' times ... ]

adilger> While the 1-3h per TB is reasonable, what is important to
adilger> note is that this checking happens IN PARALLEL for Lustre.
adilger> If you have 500 2TB OSTs = 1PB, then you can still check
adilger> all of them in 2-4 hours.

Ahhh interesting. But yes, that is if they are on separate hosts; I, for
example, have only one host, with 12TB on a RAID10. My main reason to
look at Lustre is not to take advantage of the cluster-based parallelism,
but to have 6x2TB OSTs on the same machine and hope that if there are
active updates to only one, then only one needs 'fsck'ing. Basically my
main reason is to reduce post-crash service unavailability due to 'fsck'.

My particular application would have 12TB of 20-80MB files, let's say
around 200,000-700,000 inodes in total.

adilger> CFS has also recently developed patches to improve the
adilger> e2fsck speed for ext3 filesystems by 2-20x (depends on
adilger> filesystem usage). What used to take 1h to check has been
adilger> shown for production filesystems to take only 10
adilger> minutes...

Well, that would be nice, but it also sounds a bit implausible.
Production filesystems tend to be full, with metadata scattered all over
the place; 'ext3' has quite a bit of widely scattered metadata and
becomes very fragmented quite rapidly.

>> Could someone from CFS suggest a sort of formula to calculate
>> the fsck downtime in a more accurate manner? This is often
>> important when planning for service levels. If a file system is
>> spread over multiple OSTs, which fsck operations run in
>> parallel? Can metadata checking be parallelized?

adilger> Yes, the OST and MDS e2fsck checking can be done in
adilger> parallel.

I wonder whether, if one had those 6x2TB OSTs on the same RAID10,
parallel checking would be faster thanks to all those arms.

adilger> The distributed checking phase (lfsck) is not needed
adilger> before returning the filesystem to service, and can also
adilger> be run while the filesystem is in use. We are planning to
adilger> eliminate the need for running a separate lfsck entirely,
adilger> and the filesystem will just do "scrubbing" internally
adilger> all the time during idle periods or as a low-priority task.

Ahh, interesting too, but this may not always be feasible: the
application I am thinking of has 24x7 simultaneous read and write rates
of around 100MB/s each (and yes, using just a single system is
unfortunately non-negotiable right now).
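As a rough sanity check on the numbers above, here is a small Python sketch
of the file-count arithmetic and of the shared-spindle question; the 2 h/TB
rate and the "fully serialized" worst case are purely illustrative
assumptions on my part, not anything measured on that hardware.

# Quick check of the 12TB / 20-80MB-file example, plus a pessimistic bound
# on running six e2fscks concurrently against one shared RAID10.
TB = 1000**4
MB = 1000**2

total_bytes = 12 * TB
for avg_file_mb in (20, 80):
    files = total_bytes / (avg_file_mb * MB)
    print(f"avg {avg_file_mb}MB -> ~{files:,.0f} files")
# -> roughly 150,000 to 600,000 files, in line with the 200,000-700,000 guess

# If six 2TB OSTs share one RAID10, six concurrent e2fscks still compete
# for the same disk arms, so (pessimistically) the total may approach the
# serial sum rather than the single-OST time.
hours_per_tb = 2.0            # assumed serial e2fsck rate from the thread
per_ost = 2 * hours_per_tb    # one 2TB OST
print(f"best case (fully parallel): ~{per_ost:.0f} h")
print(f"worst case (spindle-bound): ~{6 * per_ost:.0f} h")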
>>> On Sat, 28 Apr 2007 20:42:25 +0100,
>>> pg_lus@lus.for.sabi.co.UK (Peter Grandi) said:

>>> On Fri, 27 Apr 2007 14:13:56 -0600, Andreas Dilger
>>> <adilger@clusterfs.com> said:

pg> [ ... 'fsck' times ... ] My main reason to look at Lustre is
pg> not to take advantage of the cluster-based parallelism, but
pg> to have 6x2TB OSTs on the same machine and hope that if
pg> there are active updates to only one then only one needs
pg> 'fsck'ing. Basically my main reason is to reduce post-crash
pg> service unavailability due to 'fsck'.

Just noticed a recent thread in 'comp.arch.storage' that demonstrates the
same worry, only on a much, much bigger scale:

http://groups.google.com/group/comp.arch.storage/browse_thread/thread/d6f10ec24c07ed53/ecd3a745fbf561e6

"The system's storage is based on code which writes many files to the
file system, with overall storage needs currently around 40TB and
expected to reach hundreds of TBs. The average file size of the system
is ~100K, which translates to ~500 million files today, and billions of
files in the future."

"We're looking for an alternative solution, in an attempt to improve
performance and ability to recover from disasters (fsck on 2^42 files
isn't practical, and I'm getting pretty worried due to this fact - even
the smallest filesystem inconsistency will leave me lots of useless
bits)."

Ehehehe, "pretty worried" :-). From the numbers that I have seen, a Lustre
cluster might support that kind of requirement, even if the recovery time
might be some days. I hope.
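For what it's worth, the file counts in the quoted posting follow directly
from its size figures; a tiny sketch (the 300TB "future" figure below is
just a hypothetical stand-in for "hundreds of TBs"):

# Rough check of the comp.arch.storage figures: ~40TB at ~100K per file.
TB, KB = 1000**4, 1000
files_today = 40 * TB / (100 * KB)
print(f"~{files_today:,.0f} files")    # ~400 million, same order as "~500 million"
files_future = 300 * TB / (100 * KB)   # hypothetical "hundreds of TBs" case
print(f"~{files_future:,.0f} files")   # low billions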