Tim Moran wrote:
> I'm doing some research to see if Lustre is applicable to my company.
> Specifically, my question is: how does Lustre handle a full-out,
> data-is-not-coming-back OST failure? Reading through the docs I can find, it
> appears to just rely on the journalling filesystem - just reboot it and it
> will come back. But under some circumstances this is not true - data has to
> be restored from tape, say.

Lustre 1.0.x provides OST failover, which relies on shared storage and, as you point out, also relies on that storage device being redundant enough to cope with failures on its own. But what if that device catches on fire?

On our roadmap for later this year is a RAID-1 OST, in which file data is synchronized between multiple, totally separate pieces of storage. When one catches on fire, the other keeps going.

But what if the whole building collapses in an earthquake? You're quite right: sometimes data must be put on, and restored from, external storage.

> And more generally, how does the filesystem deal with backup and restore to
> tape?

Today, you have two choices: back up the file system from a high level, via a mounted client; or do a dumpe2fs of the individual file systems on each MDS and OST.

If you do the former, backup and restore are relatively easy: you do both from a normally mounted client. It's hard, in that case, to restore just one failed OST.

If you dumpe2fs the individual partitions, you need a way to bring the file system back in sync. Lustre has a handful of fairly sophisticated protocols which keep everything in sync, without fsck, across a "simple" failure like a crash, network disconnect, or a power outage. For a restore from tape following a hardware failure, you will need to fsck.

Lustre 1.0.x does not include this cluster-wide fsck tool, but Lustre 1.2 or 1.4 will. The tool is already being tested now.

Does this answer your question?

-Phil
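To make the client-level option above concrete: since a mounted client just presents an ordinary POSIX namespace, any standard archiver works. A minimal sketch, assuming a client already mounted at /mnt/lustre and a tape drive at /dev/st0 (both paths are illustrative assumptions, nothing Lustre-specific):

    # On a normally mounted Lustre client; /mnt/lustre and /dev/st0 are assumptions.
    # Back up the whole namespace to tape with an ordinary archiver:
    tar -cf /dev/st0 -C /mnt/lustre .

    # Restore goes back through the same mounted client:
    tar -xf /dev/st0 -C /mnt/lustre

As Phil notes, this is easy but coarse-grained: it restores files through the normal I/O path, so there is no simple way to target only the objects that lived on one failed OST.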
Hi Tim--

Tim Moran wrote:
> The alternatives all seem lacking. Backing up the whole filesystem (say 100
> TB) is impossible. Even dumping a filesystem (say 2 TB) is difficult. How
> would the filesystem deal with that, by the way? That is, the backup of the
> partition will contain old data - some files may have been changed, added,
> or deleted. How does the file system deal with those and maintain a
> consistent file system (with data loss, of course)?

The Lustre fsck tool is meant to deal with exactly that problem of inconsistent backups.

But there is a better answer. My last email stopped with RAID 1, but that's not really the end of the story. You will be happy to hear that snapshots are also on the way, which will allow you to back up a consistent snapshot of your whole file system by backing up N consistent snapshots of your servers' backend file systems.

> I guess of the alternatives, the RAID1 is most appealing. That would double
> storage costs, which is a difficult pill to swallow, but at least it would
> be real-time. Any plan for a RAID5-style stripe over multiple OSTs? n+1 is
> much more palatable than 2n.

That is not on our roadmap right now, no. It's not clear how feasible a RAID-5 OST would be -- RAID 5 is a tricky beast. If someone is interested in funding such work, the first step would be an exploration of whether we can produce a scalable and performant design.

> In my perfect world, the file system would be integrated with the backup
> system: it would back up new/changed data, and be able to restore from tape
> to an alternate OST (with data loss since the backup, of course), since it
> would know what data was on that OST. Not an easy proposition, I know, but
> you have to expect hardware failures: double disk failures, RAID cards that
> corrupt data, etc., in the real world. It doesn't seem to me that the system
> is really able to handle these cases well - that it is a cluster for those
> that buy expensive equipment and expect to be able to recover from faults
> due to hardware investments.

You are not the only one who wants this, of course, and advanced backup features are on the roadmap in the Lustre 2.x timeframe. We have designed an HSM layer to integrate with backup systems and support automatic migration to and from offline storage.

> Perhaps even knowing the data that was on the OST and being able to manually
> put Humpty-Dumpty back together again could work out. Can I query the
> filesystem, get a list of files on the OST, restore them manually (from
> manual client-level nightly archives of changes), and re-insert the OST?
> Would the file system be able to handle the reinserted node?

Yes, you can search for files which were stored on a given OST, with "lfs find" and the --obd parameter. Embarrassingly, it appears to have slipped through our test net and does not behave exactly as it should. This has been filed as issue 2510, and will be fixed in 1.0.3.

If you removed an OST and replaced it with an empty one, Lustre would create sparse holes in your files where that data used to be. You could restore backups of the affected files as you describe.

Did I cover everything? Thanks for the good questions.

-Phil
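A rough sketch of the manual "Humpty-Dumpty" procedure described above. The lfs find --obd usage comes from Phil's reply (and is buggy until 1.0.3); the OST name, mount point, and archive location are made-up placeholders, and this assumes the nightly archive mirrors the client-visible paths:

    # List files that have objects on the failed OST (ost3_UUID is a placeholder):
    lfs find --obd ost3_UUID /mnt/lustre > /tmp/files-on-ost3

    # After the empty replacement OST is back in service, restore just those
    # files from the client-level nightly archives, through a mounted client:
    while read f; do
        cp -a "/backup/nightly$f" "$f"    # assumes the archive mirrors client paths
    done < /tmp/files-on-ost3

Anything not restored this way would remain as the sparse holes Phil describes, so the list from lfs find is effectively your damage report.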
I wonder if DRBD (http://www.drbd.org/) would be a good solution for OST redundancy until Lustre 1.2 arrives with OST RAID 1?
Pavlica, Nick wrote:
> I wonder if DRBD (http://www.drbd.org/) would be a good solution for
> OST redundancy until Lustre 1.2 arrives with OST RAID 1?

Yes, that should work, although it may require some tuning of flight group size and iov size to keep performance reasonable. The OST simply uses a file system on a block device, so anything in line with that will work.

- Peter -
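For anyone trying the DRBD route, the idea is simply that the OST's backend filesystem lives on the replicated DRBD device instead of the raw disk. The following is only a hedged sketch: the resource name, hostnames, devices, and addresses are invented, and the exact drbd.conf and drbdadm syntax depends on your DRBD version, so check the DRBD documentation rather than copying this verbatim:

    # /etc/drbd.conf (illustrative only; syntax varies by DRBD release)
    resource ost1 {
        protocol C;                 # synchronous replication, so both copies stay current
        on oss-a {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on oss-b {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }

    # On whichever OSS node is currently primary, the OST's backend
    # filesystem is created on the mirrored device, not the raw disk:
    mkfs.ext3 /dev/drbd0

The trade-off Peter hints at is latency: every OST write has to cross the replication link before it completes, which is why the tuning matters.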
Well, yes, that does answer my question. Just not the way I'd like it ;-)

The alternatives all seem lacking. Backing up the whole filesystem (say 100 TB) is impossible. Even dumping a filesystem (say 2 TB) is difficult. How would the filesystem deal with that, by the way? That is, the backup of the partition will contain old data - some files may have been changed, added, or deleted. How does the file system deal with those and maintain a consistent file system (with data loss, of course)?

I guess of the alternatives, the RAID 1 is most appealing. That would double storage costs, which is a difficult pill to swallow, but at least it would be real-time. Any plan for a RAID5-style stripe over multiple OSTs? n+1 is much more palatable than 2n.

In my perfect world, the file system would be integrated with the backup system: it would back up new/changed data, and be able to restore from tape to an alternate OST (with data loss since the backup, of course), since it would know what data was on that OST. Not an easy proposition, I know, but you have to expect hardware failures: double disk failures, RAID cards that corrupt data, etc., in the real world. It doesn't seem to me that the system is really able to handle these cases well - that it is a cluster for those that buy expensive equipment and expect to be able to recover from faults due to hardware investments.

Perhaps even knowing the data that was on the OST and being able to manually put Humpty-Dumpty back together again could work out. Can I query the filesystem, get a list of files on the OST, restore them manually (from manual client-level nightly archives of changes), and re-insert the OST? Would the file system be able to handle the reinserted node?

Any further thoughts or ideas?

Tim
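For the "nightly archives of changes" mentioned above, one simple client-level approach is a timestamp-based incremental backup. This is a sketch under assumptions (the /mnt/lustre mount point, the /backup paths, and the marker files are all invented), and it catches new and modified files but not deletions:

    # On a mounted client: archive everything changed since the previous run.
    touch /backup/.this-run
    find /mnt/lustre -type f -newer /backup/.last-run -print0 \
        | tar --null -T - -czf /backup/nightly-$(date +%F).tar.gz
    mv /backup/.this-run /backup/.last-run

Combined with the per-OST file list from "lfs find --obd", such archives would be enough to rebuild the contents of a single lost OST by hand, at the cost of losing whatever changed since the last run.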
Howdy-

I'm doing some research to see if Lustre is applicable to my company. Specifically, my question is: how does Lustre handle a full-out, data-is-not-coming-back OST failure? Reading through the docs I can find, it appears to just rely on the journalling filesystem - just reboot it and it will come back. But under some circumstances this is not true - data has to be restored from tape, say.

And more generally, how does the filesystem deal with backup and restore to tape?

Tim