Tim Moran wrote:
> I'm doing some research to see if Lustre is applicable to my company.
> Specifically, my question is: how does Lustre handle a full-out,
> data-is-not-coming-back OST failure? Reading through the docs I can find, it
> appears to just rely on the journalling filesystem - just reboot it and it
> will come back. But under some circumstances this is not true - data has to
> be restored from tape, say.

Lustre 1.0.x provides OST failover, which relies on shared storage and, as you point out, also relies on that storage device being redundant enough to cope with failures on its own. But what if that device catches on fire?

On our roadmap for later this year is a RAID-1 OST, in which file data is synchronized between multiple, totally separate pieces of storage. When one catches on fire, the other keeps going.

But what if the whole building collapses in an earthquake? You're quite right: sometimes data must be put on, and restored from, external storage.

> And more generally, how does the filesystem deal with backup and restore to
> tape?

Today, you have two choices: back up the file system from a high level, via a mounted client; or do a dumpe2fs of the individual file systems on each MDS and OST.

If you do the former, backup and restore are relatively easy: you do both from a normally mounted client. It's hard, in that case, to restore just one failed OST.

If you dumpe2fs the individual partitions, you need a way to bring the file system back in sync. Lustre has a handful of fairly sophisticated protocols which keep everything in sync, without fsck, across a "simple" failure like a crash, network disconnect, or a power outage. For a restore from tape following a hardware failure, you will need to fsck.

Lustre 1.0.x does not include this cluster-wide fsck tool, but Lustre 1.2 or 1.4 will. The tool is already being tested now.

Does this answer your question?

-Phil
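To make the client-level option above concrete: since a mounted client just presents an ordinary POSIX namespace, any standard archiver works. A minimal sketch, assuming a client already mounted at /mnt/lustre and a tape drive at /dev/st0 (both paths are illustrative assumptions, nothing Lustre-specific):

    # On a normally mounted Lustre client; /mnt/lustre and /dev/st0 are assumptions.
    # Back up the whole namespace to tape with an ordinary archiver:
    tar -cf /dev/st0 -C /mnt/lustre .

    # Restore goes back through the same mounted client:
    tar -xf /dev/st0 -C /mnt/lustre

As Phil notes, this is easy but coarse-grained: it restores files through the normal I/O path, so there is no simple way to target only the objects that lived on one failed OST.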
Hi Tim--

Tim Moran wrote:
> The alternatives all seem lacking. Backing up the whole filesystem (say 100
> TB) is impossible. Even dumping a filesystem (say 2 TB) is difficult. How
> would the filesystem deal with that, by the way? That is, the backup of the
> partition will contain old data - some files may have been changed, added,
> or deleted. How does the file system deal with those and maintain a
> consistent file system (with data loss, of course)?

The Lustre fsck tool is meant to deal with exactly that problem of inconsistent backups.

But there is a better answer. My last email stopped with RAID 1, but that's not really the end of the story. You will be happy to hear that snapshots are also on the way, which will allow you to back up a consistent snapshot of your whole file system by backing up N consistent snapshots of your servers' backend file systems.

> I guess of the alternatives, the RAID1 is most appealing. That would double
> storage costs, which is a difficult pill to swallow, but at least it would
> be real-time. Any plan for a RAID5-style stripe over multiple OSTs? n+1 is
> much more palatable than 2n.

That is not on our roadmap right now, no. It's not clear how feasible a RAID-5 OST would be -- RAID 5 is a tricky beast. If someone is interested in funding such work, the first step would be an exploration of whether we can produce a scalable and performant design.

> In my perfect world, the file system would be integrated with the backup
> system: it would back up new/changed data, and be able to restore from tape
> to an alternate OST (with data loss since the backup, of course), since it
> would know what data was on that OST. Not an easy proposition, I know, but
> you have to expect hardware failures: double disk failures, RAID cards that
> corrupt data, etc., in the real world. It doesn't seem to me that the system
> is really able to handle these cases well - that it is a cluster for those
> that buy expensive equipment and expect to be able to recover from faults
> due to hardware investments.

You are not the only one who wants this, of course, and advanced backup features are on the roadmap in the Lustre 2.x timeframe. We have designed an HSM layer to integrate with backup systems and support automatic migration to and from offline storage.

> Perhaps even knowing the data that was on the OST and being able to manually
> put Humpty-Dumpty back together again could work out. Can I query the
> filesystem, get a list of files on the OST, restore them manually (from
> manual client-level nightly archives of changes), and re-insert the OST?
> Would the file system be able to handle the reinserted node?

Yes, you can search for files which were stored on a given OST, with "lfs find" and the --obd parameter. Embarrassingly, it appears to have slipped through our test net and does not behave exactly as it should. This has been filed as issue 2510, and will be fixed in 1.0.3.

If you removed an OST and replaced it with an empty one, Lustre would create sparse holes in your files where that data used to be. You could restore backups of the affected files as you describe.

Did I cover everything? Thanks for the good questions.

-Phil
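A rough sketch of the manual "Humpty-Dumpty" procedure described above. The lfs find --obd usage comes from Phil's reply (and is buggy until 1.0.3); the OST name, mount point, and archive location are made-up placeholders, and this assumes the nightly archive mirrors the client-visible paths:

    # List files that have objects on the failed OST (ost3_UUID is a placeholder):
    lfs find --obd ost3_UUID /mnt/lustre > /tmp/files-on-ost3

    # After the empty replacement OST is back in service, restore just those
    # files from the client-level nightly archives, through a mounted client:
    while read f; do
        cp -a "/backup/nightly$f" "$f"    # assumes the archive mirrors client paths
    done < /tmp/files-on-ost3

Anything not restored this way would remain as the sparse holes Phil describes, so the list from lfs find is effectively your damage report.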
I wonder if DRBD (http://www.drbd.org/) would be a good solution for OST redundancy until Lustre 1.2 arrives with OST RAID 1?
Pavlica, Nick wrote:
> I wonder if DRBD (http://www.drbd.org/) would be a good solution for
> OST redundancy until Lustre 1.2 arrives with OST RAID 1?

Yes, that should work, although it may require some tuning of flight group size and iov size to keep performance reasonable. The OST simply uses a file system on a block device, so anything in line with that will work.

- Peter -
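For anyone trying the DRBD route, the idea is simply that the OST's backend filesystem lives on the replicated DRBD device instead of the raw disk. The following is only a hedged sketch: the resource name, hostnames, devices, and addresses are invented, and the exact drbd.conf and drbdadm syntax depends on your DRBD version, so check the DRBD documentation rather than copying this verbatim:

    # /etc/drbd.conf (illustrative only; syntax varies by DRBD release)
    resource ost1 {
        protocol C;                 # synchronous replication, so both copies stay current
        on oss-a {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on oss-b {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }

    # On whichever OSS node is currently primary, the OST's backend
    # filesystem is created on the mirrored device, not the raw disk:
    mkfs.ext3 /dev/drbd0

The trade-off Peter hints at is latency: every OST write has to cross the replication link before it completes, which is why the tuning matters.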
Well, yes, that does answer my question. Just not the way I'd like it ;-)

The alternatives all seem lacking. Backing up the whole filesystem (say 100 TB) is impossible. Even dumping a filesystem (say 2 TB) is difficult. How would the filesystem deal with that, by the way? That is, the backup of the partition will contain old data - some files may have been changed, added, or deleted. How does the file system deal with those and maintain a consistent file system (with data loss, of course)?

I guess of the alternatives, the RAID 1 is most appealing. That would double storage costs, which is a difficult pill to swallow, but at least it would be real-time. Any plan for a RAID5-style stripe over multiple OSTs? n+1 is much more palatable than 2n.

In my perfect world, the file system would be integrated with the backup system: it would back up new/changed data, and be able to restore from tape to an alternate OST (with data loss since the backup, of course), since it would know what data was on that OST. Not an easy proposition, I know, but you have to expect hardware failures: double disk failures, RAID cards that corrupt data, etc., in the real world. It doesn't seem to me that the system is really able to handle these cases well - that it is a cluster for those that buy expensive equipment and expect to be able to recover from faults due to hardware investments.

Perhaps even knowing the data that was on the OST and being able to manually put Humpty-Dumpty back together again could work out. Can I query the filesystem, get a list of files on the OST, restore them manually (from manual client-level nightly archives of changes), and re-insert the OST? Would the file system be able to handle the reinserted node?

Any further thoughts or ideas?

Tim
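For the "nightly archives of changes" mentioned above, one simple client-level approach is a timestamp-based incremental backup. This is a sketch under assumptions (the /mnt/lustre mount point, the /backup paths, and the marker files are all invented), and it catches new and modified files but not deletions:

    # On a mounted client: archive everything changed since the previous run.
    touch /backup/.this-run
    find /mnt/lustre -type f -newer /backup/.last-run -print0 \
        | tar --null -T - -czf /backup/nightly-$(date +%F).tar.gz
    mv /backup/.this-run /backup/.last-run

Combined with the per-OST file list from "lfs find --obd", such archives would be enough to rebuild the contents of a single lost OST by hand, at the cost of losing whatever changed since the last run.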
Howdy-

I'm doing some research to see if Lustre is applicable to my company. Specifically, my question is: how does Lustre handle a full-out, data-is-not-coming-back OST failure? Reading through the docs I can find, it appears to just rely on the journalling filesystem - just reboot it and it will come back. But under some circumstances this is not true - data has to be restored from tape, say.

And more generally, how does the filesystem deal with backup and restore to tape?

Tim