Hello folks,

I wonder if anyone has any failure metrics on their specific installations.
We're quite new to the Lustre space and wanted to get a feel for what we
might be in for downtime-wise. In particular, does anyone have numbers for
mean time between failures and mean time to repair? Any info would be
greatly appreciated, thanks.

----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
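P.S. The figures below are made up, purely to show how I'd turn MTBF/MTTR
numbers into an availability and expected-downtime estimate:

    # hypothetical figures: MTBF = 720 h (about a month), MTTR = 4 h
    awk 'BEGIN {
        mtbf = 720; mttr = 4
        avail = mtbf / (mtbf + mttr)
        printf "availability ~ %.2f%%, expected downtime ~ %.0f h/year\n",
               100 * avail, 8760 * (1 - avail)
    }'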
On Fri, 2009-04-24 at 09:48 -0700, John White wrote:
> I wonder if anyone has any failure metrics on their specific
> installations. We're quite new to the Lustre space and wanted to get
> a feel for what we might be in for downtime-wise. In particular, does
> anyone have numbers for mean time between failures and mean time to
> repair?

I think this is a very subjective question. To a great degree it's going
to depend on how much you spend on your infrastructure. If you buy
cheap(ly built) hardware, it will most likely fail more often than
better-built hardware.

Additionally, given Lustre's HA abilities, uptime is something you can
throw money at (or not). If you build a high degree of redundancy into
your architecture, including failover pairs and so on, downtime shrinks
because the redundant hardware kicks in to keep the filesystem up where
it otherwise would have gone down.

There are probably lots of places where the same kind of argument can be
made, which makes the question all the more subjective.

b.
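For what it's worth, since failover pairs came up: a minimal sketch of
what declaring a failover partner looks like on the server side. The
hostnames, NIDs and device path below are made up, and the actual
failover (mounting the target on the partner node) is still driven by an
external HA stack such as Heartbeat:

    # at format time, on the OSS that normally serves the target
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs01@tcp0 \
        --failnode=oss02@tcp0 /dev/sdb

    # or added later to an already-formatted target
    tunefs.lustre --failnode=oss02@tcp0 /dev/sdb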
On Apr 24, 2009, at 9:59 AM, Brian J. Murrell wrote:
> On Fri, 2009-04-24 at 09:48 -0700, John White wrote:
>> I wonder if anyone has any failure metrics on their specific
>> installations. We're quite new to the Lustre space and wanted to get
>> a feel for what we might be in for downtime-wise. In particular, does
>> anyone have numbers for mean time between failures and mean time to
>> repair?
>
> I think this is a very subjective question. To a great degree it's going
> to depend on how much you spend on your infrastructure. If you buy
> cheap(ly built) hardware, it will most likely fail more often than
> better-built hardware.

Oh, naturally. I suppose I was short on details. The question is more
geared toward the software side of things. Of course you can build in
hardware redundancy on the back end, set up failover on the server end,
and so on. Beyond that, I'm curious how often the Lustre software itself
unavoidably "flips out" and how long those incidents commonly take to
recover from (the lock manager acting up, for instance).

I know this is a rather difficult metric to quantify, especially after
experiences with.. other.. parallel filesystems. Perhaps people have
numbers for their specific configuration?
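On the "how long does it take to recover" side, one number you can watch
on a live system is the per-target recovery status after a failover or
server restart. A rough example, assuming a 1.6/1.8-era layout (the exact
/proc paths and parameter names vary by version):

    # on the OSSes
    cat /proc/fs/lustre/obdfilter/*/recovery_status
    # on the MDS
    cat /proc/fs/lustre/mds/*/recovery_status
    # or, if your lctl supports get_param
    lctl get_param "*.*.recovery_status"

The output includes things like the recovery state, the number of
completed clients and the time remaining, which is about as close to a
per-incident repair-time figure as Lustre itself will give you.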
Hi John:

From our experience (1 year), we simply love Lustre. A couple of lessons
I learned regarding stability:

1) Always have the latest e2fsprogs.
2) I recommend not running the latest available Lustre version.
3) Run on good hardware. Test it with iozone and bonnie for 72 hours
   straight so you know it is solid hardware (rough sketch of what I mean
   at the bottom of this message).
4) Have a reliable network (naturally).
5) Compile your own kernel for the OSTs/MDT and patch it, applying the
   patches from the Bugzilla "release tickets".

Also, make sure you take the time to appreciate what Lustre is really
doing :-)

Hope this helps!

On Fri, Apr 24, 2009 at 1:11 PM, John White <jwhite at lbl.gov> wrote:
> Oh, naturally. I suppose I was short on details. The question is more
> geared toward the software side of things. Of course you can build in
> hardware redundancy on the back end, set up failover on the server end,
> and so on. Beyond that, I'm curious how often the Lustre software itself
> unavoidably "flips out" and how long those incidents commonly take to
> recover from (the lock manager acting up, for instance).
>
> I know this is a rather difficult metric to quantify, especially after
> experiences with.. other.. parallel filesystems. Perhaps people have
> numbers for their specific configuration?
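Re: point 3, here is a rough sketch of the kind of burn-in loop I mean.
The mount point, file sizes and the 72-hour window below are placeholders
to adjust for your own hardware:

    # run iozone and bonnie++ back to back for roughly 72 hours
    END=$(( $(date +%s) + 72*3600 ))
    while [ "$(date +%s)" -lt "$END" ]; do
        iozone -a -g 8g -f /mnt/scratch/iozone.tmp
        # bonnie++ refuses to run as root without -u; the directory must
        # be writable by that user
        bonnie++ -d /mnt/scratch -s 16g -n 64 -u nobody
    done

Keep an eye on syslog/dmesg on the servers while it runs; a disk or HBA
that is going to misbehave under Lustre usually shows itself here first.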