All, I'm hoping to set up a Lustre file system using 1 MGS and 3 OSTs on 4 separate pieces of hardware, and I'd like to gain some reliability in the process. The problem I see is that if any one piece of the 4-node system fails, the whole system fails.

Is it possible to configure Lustre to write objects to more than one node simultaneously, such that I am guaranteed that if one node goes down all files are still accessible? This would effectively mean that I would use 2 times the storage space for each object written, and would require that every cluster have a minimum of 2 nodes.

I understand the concepts of using DRBD to replicate block devices, as well as creating a fully separate cluster for fail-over, but I'm hoping to build redundancy into a single cluster without having to duplicate my network with a bunch of active/passive machine combinations.

-- Dante
On Mon, 2007-12-03 at 20:32 -0600, D. Dante Lorenso wrote:

> The problem I see is that if any one piece of the 4-node system
> fails, the whole system fails.

Not quite. If an OST fails, only the objects on that OST become unavailable. The filesystem will continue to run and service requests from the available OSTs. That means, depending on striping policies, that some or all of a file's contents might not be available.

> Is it possible to configure Lustre to write objects to more than one
> node simultaneously, such that I am guaranteed that if one node goes
> down all files are still accessible?

That's called RAID, and right now, no. It's on the roadmap though.

> This would effectively mean that I would use 2 times the storage
> space for each object written, and would require that every cluster
> have a minimum of 2 nodes.

This is a description of mirroring.

> I understand the concepts of using DRBD to replicate block devices,
> as well as creating a fully separate cluster for fail-over, but I'm
> hoping to build redundancy into a single cluster without having to
> duplicate my network with a bunch of active/passive machine
> combinations.

Using a reliable (some form of RAID -- which DRBD qualifies as) shared storage device is the only way to mitigate the SPOF scenario with Lustre currently.

As for duplication, there is no reason why you cannot mirror your DRBD devices amongst the hardware you have currently (i.e. pair your machines up and create DRBD-based, mirrored devices), along with some form of failover (e.g. HA heartbeat), to get some redundancy. If you were already resigned to halving your storage by mirroring OSTs at the Lustre layer, dropping that mirroring down to DRBD should not impose any more significant costs. Or maybe I'm misunderstanding your concerns with using DRBD.

b.
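For illustration, a minimal drbd.conf resource along the lines Brian suggests: two paired OSS machines keeping a synchronous mirror of the block device that will back one OST. The hostnames, disks, and addresses here are hypothetical, and the exact syntax depends on your DRBD version:

    # /etc/drbd.conf -- one synchronously mirrored resource backing an OST
    resource ost1 {
      protocol C;                # synchronous: a write completes on both nodes
      on oss-a {
        device    /dev/drbd0;
        disk      /dev/sdb1;     # local partition holding the OST data
        address   10.0.0.1:7788; # DRBD replication link
        meta-disk internal;
      }
      on oss-b {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

Whichever node is DRBD primary formats and mounts /dev/drbd0 as the OST; on failure, heartbeat promotes the peer to primary and mounts it there instead.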
Brian J. Murrell wrote:

> Not quite. If an OST fails, only the objects on that OST become
> unavailable. The filesystem will continue to run and service requests
> from the available OSTs. That means, depending on striping policies,
> that some or all of a file's contents might not be available.

What happens when you try to read a file from the OST that is down? I'm guessing that read will hang for a considerable period of time. Likely that hanging will eventually occur for many files on a box simultaneously, and the whole box will lock up waiting on I/O it will never get ... essentially taking the whole shebang down.

> That's called RAID, and right now, no. It's on the roadmap though.

Is the roadmap posted somewhere? URL? Any timeline I might want to watch and wait for?

> This is a description of mirroring.

Right, like RAID 1, but at the network level.

> Using a reliable (some form of RAID -- which DRBD qualifies as) shared
> storage device is the only way to mitigate the SPOF scenario with
> Lustre currently.

I have configured a DRBD system with heartbeat in my lab tests and it seems to work well enough, but I haven't tied it into Lustre just yet. I was concerned about the frailty of a system that requires all three (Lustre, DRBD, and heartbeat) to magically work in unison. It is a delicate mounting/unmounting game to ensure that partitions are monitored, mounted, and failed over in just the right order. Eliminating all the moving parts by using one solution like Lustre was what I was hoping for.

I'm leaning toward doing the L,D,H solution, but was really hoping for something easier. Are there any online howtos that demonstrate that configuration?

-- Dante
On Tue, 2007-12-04 at 17:59 -0600, D. Dante Lorenso wrote:

> What happens when you try to read a file from the OST that is down?

That depends on whether the OST has been configured for failout or failover. In failover mode, the assumption is that another node will resume service for that OST, so I/O to objects on the failed OST will block, waiting for the service to be resumed. In failout mode, I/O to the failed OST will return EIOs.

> I'm guessing that read will hang for a considerable period of time.

Forever, or until the OST is repaired, in the case of failover, yes.

> Likely that hanging will eventually occur for many files on a box

On a given client, yes.

> simultaneously and the whole box will lock up waiting on I/O it will
> never get

No. Having even a lot of processes blocked on I/O to a failed OST will not "lock up" a whole client. The client will continue to run and complete tasks that are not dependent on the failed OST.

> ... essentially taking the whole shebang down.

I guess it depends on how you define shebang.

> Is the roadmap posted somewhere?

First (non-ad-sponsored) hit on Google for "lustre roadmap": http://www.clusterfs.com/roadmap.html

> URL? Any timeline I might want to watch and wait for?

Server Network Striping. Looks like 2.0 in Q4 2008.

> Right, like RAID 1, but at the network level.

Which is what DRBD is, effectively.

> I have configured a DRBD system with heartbeat in my lab tests and it
> seems to work well enough, but I haven't tied it into Lustre just yet.

Adding Lustre should not be a big hurdle.

> It is a delicate mounting/unmounting game to ensure that partitions
> are monitored, mounted, and failed over in just the right order.

Indeed.

> I'm leaning toward doing the L,D,H solution, but was really hoping
> for something easier. Are there any online howtos that demonstrate
> that configuration?

I don't know of any HOWTO/cookbook for it. If you implement it, perhaps you could create the HOWTO. :-)

b.
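To make the failover/failout distinction concrete, here is roughly how it is chosen at format time with the 1.6-era tools. The filesystem name, NIDs, and devices are hypothetical; check mkfs.lustre(8) for your release:

    # Failover mode (the default): client I/O to the failed OST blocks
    # until service resumes, possibly on the node named by --failnode.
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
                --failnode=oss-b@tcp0 /dev/drbd0

    # Failout mode: client I/O to objects on the failed OST returns
    # EIO instead of blocking.
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
                --param="failover.mode=failout" /dev/sdb1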
D. Dante Lorenso wrote:

> Is it possible to configure Lustre to write objects to more than one
> node simultaneously, such that I am guaranteed that if one node goes
> down all files are still accessible?

As Brian Murrell said earlier, if the data for a certain OST or MDS is visible to only one node, then you will lose access to that data when that node is down. Continuous replication of the data is one approach, but commercial Lustre implementations today typically use shared storage hardware instead.

HP's Lustre-based product (SFS), for example, places all Lustre data on shared disks and uses clustering software to nominate one node as the primary for each Lustre service and another node as the backup. We configure the server nodes in pairs for redundancy: node A is the primary server for OST1 and secondary for OST2; node B is primary for OST2 and secondary for OST1. This means that as long as either A or B is up, clients will have access to both OST1 and OST2. This sounds like the sort of configuration you are looking for. To make it work you absolutely need both A and B to be able to see the data for both OST1 and OST2, though only one of them will be serving each OST at a given time, of course (if both nodes try to serve the same OST at the same time, the underlying ext3 filesystem will get corrupted so fast it'll make your head spin).

> It is a delicate mounting/unmounting game to ensure that partitions
> are monitored, mounted, and failed over in just the right order.

Absolutely right, this is the hard bit.

I have no personal experience of DRBD, but from their website I see that it's remote disk mirroring software that works by sending notifications of all changes on a local disk to a remote node. The remote node makes the same changes to one of its local disks, making that disk a sort of remote mirror of the one on the original node. Like long-distance RAID 1. You could also think of it as a shared storage emulator in software, and with that in mind you can see where it would fit into the architecture I outlined above.

Having said that, I'm not aware of anyone using DRBD in a Lustre environment, so I can't comment on how well it works. Maybe others on this list have experience with it and can comment better. I'd be a bit concerned about the timeliness of updates to the remote mirror, and whether the latency would cause problems after a failover (though DRBD does support ext3, and these are ext3 filesystems under the hood, albeit heavily modified). I'd also wonder about performance, with change notifications for every write being sent over ethernet to the other node, though I'm sure you've thought about that aspect already.

Joe.
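A rough sketch of the crossed-pair arrangement Joe describes, expressed in generic Lustre 1.6 terms rather than SFS's own tooling. Node names, devices, and NIDs are hypothetical; the key point is that each OST names the other node of the pair as its failover partner, and each shared device is only ever mounted on one node at a time:

    # OST1: node A is primary, node B is the failover partner
    mkfs.lustre --fsname=sfs --ost --mgsnode=mgs@tcp0 \
                --failnode=node-b@tcp0 /dev/shared/ost1
    mount -t lustre /dev/shared/ost1 /mnt/ost1   # normally run on node A

    # OST2: node B is primary, node A is the failover partner
    mkfs.lustre --fsname=sfs --ost --mgsnode=mgs@tcp0 \
                --failnode=node-a@tcp0 /dev/shared/ost2
    mount -t lustre /dev/shared/ost2 /mnt/ost2   # normally run on node B

    # If node A dies, the clustering software mounts /dev/shared/ost1
    # on node B -- never on both nodes at once.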
Dear Joe, Dante,

Apologies in advance for not replying inline to your comments.

I am getting the impression here that DRBD is being considered as a "remote" mirroring solution, which makes it seem like the secondary OSS housing the backup OST is sitting far, far away, making it unreliable or inefficient. Side note: DRBD+ does have the provision of allowing mirroring of data to a third node, which replicates asynchronously (quite customizable, really).

One can configure independent network routes for DRBD replication, which is synchronous, by the way, and with heartbeat in the picture and an NPS accounted for, the overall deployment can absolutely have a very reliable, highly available and robust architecture coupling the various technologies being discussed.

Our company uses a small Lustre cluster in the above configuration, and two of our clients (both financial houses) have similar clustered solutions, which admittedly are small (approximately 3 TB, serving no more than 20 clients), catering to core applications. DRBD / local storage / HA and Lustre require a bit of know-how to put together; however, if cost is an issue (or even sometimes when it's not), it's absolutely worth a look. We've been running happily for months now -- with many, many fail-overs :)

mustafa.
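For reference, a minimal heartbeat (v1-style) configuration in the spirit of what Mustafa describes: a dedicated heartbeat link separate from the DRBD replication link, with the resource ordering (promote DRBD to primary, then mount the OST) encoded in haresources. Interface names, hostnames, resource names, and mount points are hypothetical:

    # /etc/ha.d/ha.cf (identical on both nodes)
    keepalive 2
    deadtime 30
    bcast eth2          # dedicated heartbeat link; DRBD replicates over eth1
    auto_failback off
    node oss-a
    node oss-b

    # /etc/ha.d/haresources -- resources start left-to-right and stop in
    # reverse, so DRBD is promoted to primary before the OST is mounted:
    oss-a drbddisk::ost1 Filesystem::/dev/drbd0::/mnt/ost1::lustre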