Hello folks,

I wonder if anyone has any failure metrics on their specific installations.
We're quite new to the Lustre space and wanted to get a feel for what we
might be in for downtime-wise. In particular, does anyone have numbers for
mean time between failures and mean time to repair? Any info would be
greatly appreciated, thanks.

----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
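P.S. The figures below are made up, purely to show how I'd turn MTBF/MTTR
numbers into an availability and expected-downtime estimate:

    # hypothetical figures: MTBF = 720 h (about a month), MTTR = 4 h
    awk 'BEGIN {
        mtbf = 720; mttr = 4
        avail = mtbf / (mtbf + mttr)
        printf "availability ~ %.2f%%, expected downtime ~ %.0f h/year\n",
               100 * avail, 8760 * (1 - avail)
    }'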
On Fri, 2009-04-24 at 09:48 -0700, John White wrote:
> I wonder if anyone has any failure metrics on their specific
> installations. We're quite new to the Lustre space and wanted to get
> a feel for what we might be in for downtime-wise. In particular, does
> anyone have numbers for mean time between failures and mean time to
> repair?

I think this is a very subjective question. To a great degree it's going
to depend on how much you spend on your infrastructure. If you buy
cheap(ly built) hardware, it will most likely fail more often than
better-built hardware.

Additionally, given Lustre's HA abilities, uptime is something you can
throw money at (or not). If you build a high degree of redundancy into
your architecture, including failover pairs and so on, downtime shrinks
because the redundant hardware kicks in to keep the filesystem up where
it otherwise would have gone down.

There are probably lots of places where the same kind of argument can be
made, which makes the question all the more subjective.

b.
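For what it's worth, since failover pairs came up: a minimal sketch of
what declaring a failover partner looks like on the server side. The
hostnames, NIDs and device path below are made up, and the actual
failover (mounting the target on the partner node) is still driven by an
external HA stack such as Heartbeat:

    # at format time, on the OSS that normally serves the target
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs01@tcp0 \
        --failnode=oss02@tcp0 /dev/sdb

    # or added later to an already-formatted target
    tunefs.lustre --failnode=oss02@tcp0 /dev/sdb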
On Apr 24, 2009, at 9:59 AM, Brian J. Murrell wrote:
> On Fri, 2009-04-24 at 09:48 -0700, John White wrote:
>> I wonder if anyone has any failure metrics on their specific
>> installations. We're quite new to the Lustre space and wanted to get
>> a feel for what we might be in for downtime-wise. In particular, does
>> anyone have numbers for mean time between failures and mean time to
>> repair?
>
> I think this is a very subjective question. To a great degree it's going
> to depend on how much you spend on your infrastructure. If you buy
> cheap(ly built) hardware, it will most likely fail more often than
> better-built hardware.

Oh, naturally. I suppose I was short on details. The question is more
geared toward the software side of things. Of course you can build in
hardware redundancy on the back end, set up failover on the server end,
and so on. Beyond that, I'm curious how often the Lustre software itself
unavoidably "flips out" and how long those incidents commonly take to
recover from (the lock manager acting up, for instance).

I know this is a rather difficult metric to quantify, especially after
experiences with.. other.. parallel filesystems. Perhaps people have
numbers for their specific configuration?
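On the "how long does it take to recover" side, one number you can watch
on a live system is the per-target recovery status after a failover or
server restart. A rough example, assuming a 1.6/1.8-era layout (the exact
/proc paths and parameter names vary by version):

    # on the OSSes
    cat /proc/fs/lustre/obdfilter/*/recovery_status
    # on the MDS
    cat /proc/fs/lustre/mds/*/recovery_status
    # or, if your lctl supports get_param
    lctl get_param "*.*.recovery_status"

The output includes things like the recovery state, the number of
completed clients and the time remaining, which is about as close to a
per-incident repair-time figure as Lustre itself will give you.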
Hi John:

From our experience (1 year), we simply love Lustre. A couple of lessons
I learned regarding stability:

1) Always have the latest e2fsprogs.
2) I recommend not running the latest available Lustre version.
3) Run on good hardware. Test it with iozone and bonnie for 72 hours
   straight so you know it is solid hardware (rough sketch of what I mean
   at the bottom of this message).
4) Have a reliable network (naturally).
5) Compile your own kernel for the OSTs/MDT and patch it, applying the
   patches from the Bugzilla "release tickets".

Also, make sure you take the time to appreciate what Lustre is really
doing :-)

Hope this helps!

On Fri, Apr 24, 2009 at 1:11 PM, John White <jwhite at lbl.gov> wrote:
> Oh, naturally. I suppose I was short on details. The question is more
> geared toward the software side of things. Of course you can build in
> hardware redundancy on the back end, set up failover on the server end,
> and so on. Beyond that, I'm curious how often the Lustre software itself
> unavoidably "flips out" and how long those incidents commonly take to
> recover from (the lock manager acting up, for instance).
>
> I know this is a rather difficult metric to quantify, especially after
> experiences with.. other.. parallel filesystems. Perhaps people have
> numbers for their specific configuration?
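Re: point 3, here is a rough sketch of the kind of burn-in loop I mean.
The mount point, file sizes and the 72-hour window below are placeholders
to adjust for your own hardware:

    # run iozone and bonnie++ back to back for roughly 72 hours
    END=$(( $(date +%s) + 72*3600 ))
    while [ "$(date +%s)" -lt "$END" ]; do
        iozone -a -g 8g -f /mnt/scratch/iozone.tmp
        # bonnie++ refuses to run as root without -u; the directory must
        # be writable by that user
        bonnie++ -d /mnt/scratch -s 16g -n 64 -u nobody
    done

Keep an eye on syslog/dmesg on the servers while it runs; a disk or HBA
that is going to misbehave under Lustre usually shows itself here first.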