Tyler Hawes
2011-Jul-21 21:42 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Apologies if this is a bit newbie, but I'm just getting started, really. I'm still in design / testing stage and looking to wrap my head around a few things.

I'm most familiar with Fibre Channel storage. As I understand it, you configure a pair of OSSs per OST, one actively serving it, the other passively waiting in case the primary OSS fails. Please correct me if I'm wrong...

With SAS/SATA direct-attached storage (DAS), though, it's a little less clear to me. With SATA, I imagine that if an OSS goes down, all its OSTs go down with it (whether they be internal or external mounted drives), since there is no multipathing. Also, I suppose I'd want a hardware RAID controller PCIe card, which would also preclude failover since it's not going to have cache and configuration mirrored in another OSS's RAID card.

With SAS, there seems to be a new way of doing this that I'm just starting to learn about, but it is a bit fuzzy still to me. I see that with things like Storage Bridge Bay storage servers from the likes of Supermicro, there is a method of putting two server motherboards in one enclosure, having an internal 10GigE link between them to keep cache coherency, some sort of software layer to manage that (?), and then you can use inexpensive SAS drives internally and through external JBOD chassis. Is anyone using something like this with Lustre?

Or perhaps I'm not seeing the forest for the trees and Lustre has software features built in that negate the need for this (such as parity of objects at the server level, so you can lose an OSS, N+1 style)?

Bottom line, what I'm after is figuring out what architecture works with inexpensive internal and/or JBOD SAS storage that won't risk data loss with the failure of a single drive or server RAID array...

Thanks,

Tyler
Tyler Hawes
2011-Jul-21 21:43 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Perhaps that was a Freudian slip that I titled the thread "SAD Direct Storage" :)
Kevin Van Maren
2011-Jul-22 15:00 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Tyler Hawes wrote:
> Apologies if this is a bit newbie, but I'm just getting started,
> really. I'm still in design / testing stage and looking to wrap my
> head around a few things.
>
> I'm most familiar with Fibre Channel storage. As I understand it, you
> configure a pair of OSSs per OST, one actively serving it, the other
> passively waiting in case the primary OSS fails. Please correct me if
> I'm wrong...

No, that's basically it. Lustre works well with FC storage, although a full SAN configuration (redundant switch fabrics) is not often used: with only 2 servers needing access to each LUN, and bandwidth to storage being key, servers are most often directly attached to the FC storage, with multiple paths to handle controller/path failure and improve BW.

But to clarify one point, Lustre is not waiting passively on the backup server. Lustre can only be active on one server for a given OST at a time. Some high-availability package, external to Lustre, is responsible for ensuring Lustre is active on one server (the OST is mounted on one server). Heartbeat was quite popular, but more people have been moving to more modern packages like Pacemaker. It is left to the HA package to perform failover as necessary, even though most HA packages do not perform failover by default if the network or back-end storage link goes down (which is where bonded networks and storage multipath could come in).
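To make the HA-package side concrete, here is a minimal sketch of what a Pacemaker resource definition for one OST might look like, in crm-shell syntax. It is only an illustration, not a configuration from this thread: the node names oss1/oss2, device path, mount point, and timeouts are all placeholders.

  # One OST managed as an ordinary Filesystem resource; Pacemaker mounts
  # it on exactly one node of the failover pair at a time.
  primitive fs-ost0 ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ost0" directory="/mnt/lustre/ost0" fstype="lustre" \
      op monitor interval="120s" timeout="60s" \
      op start interval="0" timeout="300s" \
      op stop interval="0" timeout="300s"
  # Mild preference for oss1; if oss1 fails, Pacemaker restarts (remounts)
  # fs-ost0 on oss2.
  location fs-ost0-prefers-oss1 fs-ost0 50: oss1
  # A fencing (STONITH) resource is also required in practice, so a
  # half-dead server cannot keep writing to the shared LUN after failover.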
> With SAS/SATA direct-attached storage (DAS), though, it's a little
> less clear to me. With SATA, I imagine that if an OSS goes down, all
> its OSTs go down with it (whether they be internal or external
> mounted drives), since there is no multipathing. Also, I suppose I'd
> want a hardware RAID controller PCIe card, which would also preclude
> failover since it's not going to have cache and configuration mirrored
> in another OSS's RAID card.

Normally, yes. Sun shipped quite a bit of Lustre storage with failover using SATA in external enclosures (J4400), but that was special in that there were (2) SAS expanders per enclosure, and each drive was connected to a SATA MUX to allow both servers access to the SATA drives.

I am glad you understand the hazards of connecting two servers using internal RAID controllers with external storage. Until a RAID card is designed specifically with that in mind (and strictly uses a write-through cache), it is a very bad idea. [For others, please consider what would happen to the file system if the RAID card has a battery-backed cache with a bunch of pending writes that get replayed at some point _after_ the other server completes recovery.]

If you are using a SAS-attached external RAID enclosure, then it is not much different than using an FC-attached RAID. I.e., the direct-attached ST2530 (SAS) can be used in place of a direct-attached ST2540 (FC), with the only architecture change being the use of a SAS card/cables instead of an FC card/cables. The big difference between SAS and FC is that people are not (yet) building SAS-based SANs. Already many FC arrays have moved to SAS drives on the back end.
http://www.oracle.com/us/products/servers-storage/storage/disk-storage/sun-storage-2500-m2-array-407918.html

> With SAS, there seems to be a new way of doing this that I'm just
> starting to learn about, but it is a bit fuzzy still to me. I see that
> with things like Storage Bridge Bay storage servers from the likes of
> Supermicro, there is a method of putting two server motherboards in
> one enclosure, having an internal 10GigE link between them to keep
> cache coherency, some sort of software layer to manage that (?), and
> then you can use inexpensive SAS drives internally and through
> external JBOD chassis. Is anyone using something like this with Lustre?

Some people have used (or at least toyed with using) DRBD and Lustre, but I would not say it is fast, recommended, or a mainstream Lustre configuration. But that is one way to replicate internal storage across servers, to allow Lustre failover.

With SAS drives in an external enclosure, it is possible to configure shared storage for use with Lustre, although if you are using a JBOD rather than a RAID controller, there are the normal issues (Linux SW RAID/LVM layers are not "clustered", so you have to ensure they are only active on one node at a time).

> Or perhaps I'm not seeing the forest for the trees and Lustre has
> software features built in that negate the need for this (such as
> parity of objects at the server level, so you can lose an OSS, N+1
> style)? Bottom line, what I'm after is figuring out what architecture
> works with inexpensive internal and/or JBOD SAS storage that won't
> risk data loss with the failure of a single drive or server RAID
> array...

Lustre does not support redundancy in the file system. All data availability is through RAID protection, combined with server failover.

With internal storage, you lose the failover part. Sun also delivered quite a bit of storage without failover, based on the x4500/x4540 servers. If your servers do not crash often, and you can live with the file system being down until it is rebooted, that is also an option [note that in non-failover mode the file system defaults to returning errors rather than hanging, but that can be changed].

Kevin

> Thanks,
>
> Tyler
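For reference, the failover pairing itself is declared when the target is formatted, so clients know both possible servers for each OST. A minimal sketch with hypothetical NIDs, fsname, and device, not taken from this thread:

  # Format an OST, naming the MGS and the failover partner
  # (addresses, fsname, and device are placeholders):
  mkfs.lustre --ost --fsname=testfs \
      --mgsnode=192.168.1.1@tcp0 \
      --failnode=192.168.1.12@tcp0 \
      /dev/mapper/ost0

  # The OST is started by mounting it -- on only one server at a time;
  # the HA package decides which:
  mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0

The --failnode NID is what clients use to reconnect once the backup server has mounted the target.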
Tyler Hawes
2011-Jul-23 18:34 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Thank you for the detailed response, Kevin. It seems an external fibre or SAS raid is needed, as the idea of losing the file system if one node goes down doesn't seem good, even if temporary. If Lustre allowed for a single downed node I'd feel differently.

However, it does have me thinking of building a 2nd cluster for backup/replication, and that one could use cheap SATA internal storage since it is only for nearline use, really.

Tyler
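If the second cluster is only a nearline copy, the replication can be as simple as a periodic rsync run from a node that mounts both file systems; Lustre versions with changelog support also include a lustre_rsync tool that avoids a full tree scan. A trivial sketch with placeholder mount points, not from this thread:

  # Periodic nearline sync from a data-mover node that mounts both
  # file systems (paths are placeholders):
  rsync -a --delete /mnt/lustre-primary/ /mnt/lustre-backup/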
Mark Hahn
2011-Jul-23 19:06 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
> It seems an external fibre or SAS raid is needed,

to be precise, a redundant-path SAN is needed. you could do it with commodity disks and Gb, or you can spend almost unlimited amounts on gold-plated disks, FC switches, etc.

the range of costs is really quite remarkable, I guess O(100x). compare this to cars where even VERY nice production cars are only a few times more expensive than the most cost-effective ones.

> as the idea of losing the file system if one node goes down doesn't
> seem good, even if temporary.

how often do you expect nodes to fail, and why?

regards, mark hahn.
Kevin Van Maren
2011-Jul-24 14:25 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Mark Hahn wrote:
>> It seems an external fibre or SAS raid is needed,
>
> to be precise, a redundant-path SAN is needed. you could do it with
> commodity disks and Gb, or you can spend almost unlimited amounts on
> gold-plated disks, FC switches, etc.

Many deployments are done without redundant paths, which offer additional insurance.

> the range of costs is really quite remarkable, I guess O(100x).
> compare this to cars where even VERY nice production cars are only
> a few times more expensive than the most cost-effective ones.

You're comparing two mass-market cars: there is a nearly 1000x difference in price between a cheap dune buggy and a Bugatti, but both provide transportation for 1-2 people.

>> as the idea of losing the file system if one
>> node goes down doesn't seem good, even if temporary.

The clients should just hang on the file system until the server is again available. This is not so different from using NFS with hard mounts.

Note that even with failover, the Lustre file system will be down for several minutes, as the HA package has to first detect a problem, then safely start up Lustre on the backup server, and then Lustre recovery has to occur.

> how often do you expect nodes to fail, and why?
>
> regards, mark hahn.
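For reference, the client side of that hang-and-recover behaviour is set up at mount time: the client is given every MGS NID it may need, and learns the OST failover NIDs from the configuration recorded at format time (--failnode). A minimal sketch with placeholder addresses and fsname, not from this thread:

  # Client mount naming both MGS nodes, so the mount works across an
  # MGS failover (addresses and fsname are placeholders):
  mount -t lustre 192.168.1.1@tcp0:192.168.1.2@tcp0:/testfs /mnt/testfs

During an OSS failover the client simply blocks on I/O to the affected OSTs, reconnects to the backup server, and replays outstanding requests once recovery completes.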