All, I'm hoping to set up a Lustre file system using 1 MGS and 3 OSTs on 4 separate pieces of hardware, and I'd like to gain some reliability in the process. The problem I see is that if any one piece of the 4-node system fails, the whole system fails.

Is it possible to configure Lustre to write objects to more than one node simultaneously, such that I am guaranteed that if one node goes down all files are still accessible? This would effectively mean that I would use 2 times the storage space for each object written, and would require that every cluster have a minimum of 2 nodes.

I understand the concepts of using DRBD to replicate block devices, as well as creating a fully separate cluster for fail-over, but I'm hoping to build redundancy into a single cluster without having to duplicate my network with a bunch of active/passive machine combinations.

-- Dante
On Mon, 2007-12-03 at 20:32 -0600, D. Dante Lorenso wrote:

> The problem I see is that if any one piece of the 4-node system
> fails, the whole system fails.

Not quite. If an OST fails, only the objects on that OST become unavailable. The filesystem will continue to run and service requests from the available OSTs. That means, depending on striping policies, that some or all of a file's contents might not be available.

> Is it possible to configure Lustre to write objects to more than one
> node simultaneously, such that I am guaranteed that if one node goes
> down all files are still accessible?

That's called RAID, and right now, no. It's on the roadmap though.

> This would effectively mean that I would use 2 times the storage
> space for each object written, and would require that every cluster
> have a minimum of 2 nodes.

This is a description of mirroring.

> I understand the concepts of using DRBD to replicate block devices,
> as well as creating a fully separate cluster for fail-over, but I'm
> hoping to build redundancy into a single cluster without having to
> duplicate my network with a bunch of active/passive machine
> combinations.

Using a reliable (some form of RAID -- which DRBD qualifies as) shared storage device is the only way to mitigate the SPOF scenario with Lustre currently.

As for duplication, there is no reason why you cannot mirror your DRBD devices amongst the hardware you have currently (i.e. pair your machines up and create DRBD-based, mirrored devices), along with some form of failover (e.g. HA heartbeat), to get some redundancy. If you were already resigned to halving your storage by mirroring OSTs at the Lustre layer, dropping that mirroring down to DRBD should not impose any more significant costs. Or maybe I'm misunderstanding your concerns with using DRBD.

b.
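For illustration, a minimal drbd.conf resource along the lines Brian suggests: two paired OSS machines keeping a synchronous mirror of the block device that will back one OST. The hostnames, disks, and addresses here are hypothetical, and the exact syntax depends on your DRBD version:

    # /etc/drbd.conf -- one synchronously mirrored resource backing an OST
    resource ost1 {
      protocol C;                # synchronous: a write completes on both nodes
      on oss-a {
        device    /dev/drbd0;
        disk      /dev/sdb1;     # local partition holding the OST data
        address   10.0.0.1:7788; # DRBD replication link
        meta-disk internal;
      }
      on oss-b {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

Whichever node is DRBD primary formats and mounts /dev/drbd0 as the OST; on failure, heartbeat promotes the peer to primary and mounts it there instead.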
Brian J. Murrell wrote:

> Not quite. If an OST fails, only the objects on that OST become
> unavailable. The filesystem will continue to run and service requests
> from the available OSTs. That means, depending on striping policies,
> that some or all of a file's contents might not be available.

What happens when you try to read a file from the OST that is down? I'm guessing that read will hang for a considerable period of time. Likely that hanging will eventually occur for many files on a box simultaneously, and the whole box will lock up waiting on I/O it will never get ... essentially taking the whole shebang down.

> That's called RAID, and right now, no. It's on the roadmap though.

Is the roadmap posted somewhere? URL? Any timeline I might want to watch and wait for?

> This is a description of mirroring.

Right, like RAID 1, but at the network level.

> Using a reliable (some form of RAID -- which DRBD qualifies as) shared
> storage device is the only way to mitigate the SPOF scenario with
> Lustre currently.

I have configured a DRBD system with heartbeat in my lab tests and it seems to work well enough, but I haven't tied it into Lustre just yet. I was concerned about the frailty of a system that requires all three (Lustre, DRBD, and heartbeat) to magically work in unison. It is a delicate mounting/unmounting game to ensure that partitions are monitored, mounted, and failed over in just the right order. Eliminating all the moving parts by using one solution like Lustre was what I was hoping for.

I'm leaning toward doing the L,D,H solution, but was really hoping for something easier. Are there any online howtos that demonstrate that configuration?

-- Dante
On Tue, 2007-12-04 at 17:59 -0600, D. Dante Lorenso wrote:

> What happens when you try to read a file from the OST that is down?

That depends on whether the OST has been configured for failout or failover. In failover mode, the assumption is that another node will resume service for that OST, so I/O to objects on the failed OST will block, waiting for the service to be resumed. In failout mode, I/O to the failed OST will return EIOs.

> I'm guessing that read will hang for a considerable period of time.

Forever, or until the OST is repaired, in the case of failover, yes.

> Likely that hanging will eventually occur for many files on a box

On a given client, yes.

> simultaneously and the whole box will lock up waiting on I/O it will
> never get

No. Having even a lot of processes blocked on I/O to a failed OST will not "lock up" a whole client. The client will continue to run and complete tasks that are not dependent on the failed OST.

> ... essentially taking the whole shebang down.

I guess it depends on how you define shebang.

> Is the roadmap posted somewhere?

First (non-ad-sponsored) hit on Google for "lustre roadmap": http://www.clusterfs.com/roadmap.html

> URL? Any timeline I might want to watch and wait for?

Server Network Striping. Looks like 2.0 in Q4 2008.

> Right, like RAID 1, but at the network level.

Which is what DRBD is, effectively.

> I have configured a DRBD system with heartbeat in my lab tests and it
> seems to work well enough, but I haven't tied it into Lustre just yet.

Adding Lustre should not be a big hurdle.

> It is a delicate mounting/unmounting game to ensure that partitions
> are monitored, mounted, and failed over in just the right order.

Indeed.

> I'm leaning toward doing the L,D,H solution, but was really hoping
> for something easier. Are there any online howtos that demonstrate
> that configuration?

I don't know of any HOWTO/cookbook for it. If you implement it, perhaps you could create the HOWTO. :-)

b.
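To make the failover/failout distinction concrete, here is roughly how it is chosen at format time with the 1.6-era tools. The filesystem name, NIDs, and devices are hypothetical; check mkfs.lustre(8) for your release:

    # Failover mode (the default): client I/O to the failed OST blocks
    # until service resumes, possibly on the node named by --failnode.
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
                --failnode=oss-b@tcp0 /dev/drbd0

    # Failout mode: client I/O to objects on the failed OST returns
    # EIO instead of blocking.
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
                --param="failover.mode=failout" /dev/sdb1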
D. Dante Lorenso wrote:

> Is it possible to configure Lustre to write objects to more than one
> node simultaneously, such that I am guaranteed that if one node goes
> down all files are still accessible?

As Brian Murrell said earlier, if the data for a certain OST or MDS is visible to only one node, then you will lose access to that data when that node is down. Continuous replication of the data is one approach, but commercial Lustre implementations today typically use shared storage hardware instead.

HP's Lustre-based product (SFS), for example, places all Lustre data on shared disks and uses clustering software to nominate one node as the primary for each Lustre service and another node as the backup. We configure the server nodes in pairs for redundancy: node A is the primary server for OST1 and secondary for OST2; node B is primary for OST2 and secondary for OST1. This means that as long as either A or B is up, clients will have access to both OST1 and OST2. This sounds like the sort of configuration you are looking for. To make it work you absolutely need both A and B to be able to see the data for both OST1 and OST2, though only one of them will be serving each OST at a given time, of course (if both nodes try to serve the same OST at the same time, the underlying ext3 filesystem will get corrupted so fast it'll make your head spin).

> It is a delicate mounting/unmounting game to ensure that partitions
> are monitored, mounted, and failed over in just the right order.

Absolutely right, this is the hard bit.

I have no personal experience of DRBD, but from their website I see that it's remote disk mirroring software that works by sending notifications of all changes on a local disk to a remote node. The remote node makes the same changes to one of its local disks, making that disk a sort of remote mirror of the one on the original node. Like long-distance RAID 1. You could also think of it as a shared storage emulator in software, and with that in mind you can see where it would fit into the architecture I outlined above.

Having said that, I'm not aware of anyone using DRBD in a Lustre environment, so I can't comment on how well it works. Maybe others on this list have experience with it and can comment better. I'd be a bit concerned about the timeliness of updates to the remote mirror, and whether the latency would cause problems after a failover (though DRBD does support ext3, and these are ext3 filesystems under the hood, albeit heavily modified). I'd also wonder about performance, with change notifications for every write being sent over ethernet to the other node, though I'm sure you've thought about that aspect already.

Joe.
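A rough sketch of the crossed-pair arrangement Joe describes, expressed in generic Lustre 1.6 terms rather than SFS's own tooling. Node names, devices, and NIDs are hypothetical; the key point is that each OST names the other node of the pair as its failover partner, and each shared device is only ever mounted on one node at a time:

    # OST1: node A is primary, node B is the failover partner
    mkfs.lustre --fsname=sfs --ost --mgsnode=mgs@tcp0 \
                --failnode=node-b@tcp0 /dev/shared/ost1
    mount -t lustre /dev/shared/ost1 /mnt/ost1   # normally run on node A

    # OST2: node B is primary, node A is the failover partner
    mkfs.lustre --fsname=sfs --ost --mgsnode=mgs@tcp0 \
                --failnode=node-a@tcp0 /dev/shared/ost2
    mount -t lustre /dev/shared/ost2 /mnt/ost2   # normally run on node B

    # If node A dies, the clustering software mounts /dev/shared/ost1
    # on node B -- never on both nodes at once.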
Dear Joe, Dante,

Apologies in advance for not replying inline to your comments.

I am getting the impression here that DRBD is being considered as a "remote" mirroring solution, which makes it seem like the secondary OSS housing the backup OST is sitting far, far away, making it unreliable or inefficient. Side note: DRBD+ does have the provision of allowing mirroring of data to a third node, which replicates asynchronously (quite customizable, really).

One can configure independent network routes for DRBD replication, which is synchronous, by the way, and with heartbeat in the picture and an NPS accounted for, the overall deployment can absolutely have a very reliable, highly available and robust architecture coupling the various technologies being discussed.

Our company uses a small Lustre cluster in the above configuration, and two of our clients (both financial houses) have similar clustered solutions, which admittedly are small (approximately 3 TB, serving no more than 20 clients), catering to core applications. DRBD / local storage / HA and Lustre require a bit of know-how to put together; however, if cost is an issue (or even sometimes when it's not), it's absolutely worth a look. We've been running happily for months now -- with many, many fail-overs :)

mustafa.
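For reference, a minimal heartbeat (v1-style) configuration in the spirit of what Mustafa describes: a dedicated heartbeat link separate from the DRBD replication link, with the resource ordering (promote DRBD to primary, then mount the OST) encoded in haresources. Interface names, hostnames, resource names, and mount points are hypothetical:

    # /etc/ha.d/ha.cf (identical on both nodes)
    keepalive 2
    deadtime 30
    bcast eth2          # dedicated heartbeat link; DRBD replicates over eth1
    auto_failback off
    node oss-a
    node oss-b

    # /etc/ha.d/haresources -- resources start left-to-right and stop in
    # reverse, so DRBD is promoted to primary before the OST is mounted:
    oss-a drbddisk::ost1 Filesystem::/dev/drbd0::/mnt/ost1::lustre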