Tyler Hawes
2011-Jul-21 21:42 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Apologies if this is a bit newbie, but I'm just getting started, really. I'm still in design / testing stage and looking to wrap my head around a few things.

I'm most familiar with Fibre Channel storage. As I understand it, you configure a pair of OSSs per OST, one actively serving it, the other passively waiting in case the primary OSS fails. Please correct me if I'm wrong...

With SAS/SATA direct-attached storage (DAS), though, it's a little less clear to me. With SATA, I imagine that if an OSS goes down, all its OSTs go down with it (whether they be internal or external mounted drives), since there is no multipathing. Also, I suppose I'd want a hardware RAID controller PCIe card, which would also preclude failover since it's not going to have cache and configuration mirrored in another OSS's RAID card.

With SAS, there seems to be a new way of doing this that I'm just starting to learn about, but it is a bit fuzzy still to me. I see that with things like Storage Bridge Bay storage servers from the likes of Supermicro, there is a method of putting two server motherboards in one enclosure, having an internal 10GigE link between them to keep cache coherency, some sort of software layer to manage that (?), and then you can use inexpensive SAS drives internally and through external JBOD chassis. Is anyone using something like this with Lustre?

Or perhaps I'm not seeing the forest for the trees and Lustre has software features built in that negate the need for this (such as parity of objects at the server level, so you can lose an OSS, N+1 style)?

Bottom line, what I'm after is figuring out what architecture works with inexpensive internal and/or JBOD SAS storage that won't risk data loss with the failure of a single drive or server RAID array...

Thanks,

Tyler
Tyler Hawes
2011-Jul-21 21:43 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Perhaps that was a Freudian slip that I titled the thread "SAD Direct Storage" :)
Kevin Van Maren
2011-Jul-22 15:00 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Tyler Hawes wrote:
> Apologies if this is a bit newbie, but I'm just getting started,
> really. I'm still in design / testing stage and looking to wrap my
> head around a few things.
>
> I'm most familiar with Fibre Channel storage. As I understand it, you
> configure a pair of OSSs per OST, one actively serving it, the other
> passively waiting in case the primary OSS fails. Please correct me if
> I'm wrong...

No, that's basically it. Lustre works well with FC storage, although a full SAN configuration (redundant switch fabrics) is not often used: with only 2 servers needing access to each LUN, and bandwidth to storage being key, servers are most often directly attached to the FC storage, with multiple paths to handle controller/path failure and improve BW.

But to clarify one point, Lustre is not waiting passively on the backup server. Lustre can only be active on one server for a given OST at a time. Some high-availability package, external to Lustre, is responsible for ensuring Lustre is active on one server (the OST is mounted on one server). Heartbeat was quite popular, but more people have been moving to more modern packages like Pacemaker. It is left to the HA package to perform failover as necessary, even though most HA packages do not perform failover by default if the network or back-end storage link goes down (which is where bonded networks and storage multipath could come in).
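To make the HA-package side concrete, here is a minimal sketch of what a Pacemaker resource definition for one OST might look like, in crm-shell syntax. It is only an illustration, not a configuration from this thread: the node names oss1/oss2, device path, mount point, and timeouts are all placeholders.

  # One OST managed as an ordinary Filesystem resource; Pacemaker mounts
  # it on exactly one node of the failover pair at a time.
  primitive fs-ost0 ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ost0" directory="/mnt/lustre/ost0" fstype="lustre" \
      op monitor interval="120s" timeout="60s" \
      op start interval="0" timeout="300s" \
      op stop interval="0" timeout="300s"
  # Mild preference for oss1; if oss1 fails, Pacemaker restarts (remounts)
  # fs-ost0 on oss2.
  location fs-ost0-prefers-oss1 fs-ost0 50: oss1
  # A fencing (STONITH) resource is also required in practice, so a
  # half-dead server cannot keep writing to the shared LUN after failover.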
> With SAS/SATA direct-attached storage (DAS), though, it's a little
> less clear to me. With SATA, I imagine that if an OSS goes down, all
> its OSTs go down with it (whether they be internal or external
> mounted drives), since there is no multipathing. Also, I suppose I'd
> want a hardware RAID controller PCIe card, which would also preclude
> failover since it's not going to have cache and configuration mirrored
> in another OSS's RAID card.

Normally, yes. Sun shipped quite a bit of Lustre storage with failover using SATA in external enclosures (J4400), but that was special in that there were (2) SAS expanders per enclosure, and each drive was connected to a SATA MUX to allow both servers access to the SATA drives.

I am glad you understand the hazards of connecting two servers using internal RAID controllers with external storage. Until a RAID card is designed specifically with that in mind (and strictly uses a write-through cache), it is a very bad idea. [For others, please consider what would happen to the file system if the RAID card has a battery-backed cache with a bunch of pending writes that get replayed at some point _after_ the other server completes recovery.]

If you are using a SAS-attached external RAID enclosure, then it is not much different than using an FC-attached RAID. I.e., the direct-attached ST2530 (SAS) can be used in place of a direct-attached ST2540 (FC), with the only architecture change being the use of a SAS card/cables instead of an FC card/cables. The big difference between SAS and FC is that people are not (yet) building SAS-based SANs. Already many FC arrays have moved to SAS drives on the back end.
http://www.oracle.com/us/products/servers-storage/storage/disk-storage/sun-storage-2500-m2-array-407918.html

> With SAS, there seems to be a new way of doing this that I'm just
> starting to learn about, but it is a bit fuzzy still to me. I see that
> with things like Storage Bridge Bay storage servers from the likes of
> Supermicro, there is a method of putting two server motherboards in
> one enclosure, having an internal 10GigE link between them to keep
> cache coherency, some sort of software layer to manage that (?), and
> then you can use inexpensive SAS drives internally and through
> external JBOD chassis. Is anyone using something like this with Lustre?

Some people have used (or at least toyed with using) DRBD and Lustre, but I would not say it is fast, recommended, or a mainstream Lustre configuration. But that is one way to replicate internal storage across servers, to allow Lustre failover.

With SAS drives in an external enclosure, it is possible to configure shared storage for use with Lustre, although if you are using a JBOD rather than a RAID controller, there are the normal issues (Linux SW RAID/LVM layers are not "clustered", so you have to ensure they are only active on one node at a time).

> Or perhaps I'm not seeing the forest for the trees and Lustre has
> software features built in that negate the need for this (such as
> parity of objects at the server level, so you can lose an OSS, N+1
> style)? Bottom line, what I'm after is figuring out what architecture
> works with inexpensive internal and/or JBOD SAS storage that won't
> risk data loss with the failure of a single drive or server RAID
> array...

Lustre does not support redundancy in the file system. All data availability is through RAID protection, combined with server failover.

With internal storage, you lose the failover part. Sun also delivered quite a bit of storage without failover, based on the x4500/x4540 servers. If your servers do not crash often, and you can live with the file system being down until it is rebooted, that is also an option [note that in non-failover mode the file system defaults to returning errors rather than hanging, but that can be changed].

Kevin

> Thanks,
>
> Tyler
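For reference, the failover pairing itself is declared when the target is formatted, so clients know both possible servers for each OST. A minimal sketch with hypothetical NIDs, fsname, and device, not taken from this thread:

  # Format an OST, naming the MGS and the failover partner
  # (addresses, fsname, and device are placeholders):
  mkfs.lustre --ost --fsname=testfs \
      --mgsnode=192.168.1.1@tcp0 \
      --failnode=192.168.1.12@tcp0 \
      /dev/mapper/ost0

  # The OST is started by mounting it -- on only one server at a time;
  # the HA package decides which:
  mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0

The --failnode NID is what clients use to reconnect once the backup server has mounted the target.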
Tyler Hawes
2011-Jul-23 18:34 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Thank you for the detailed response, Kevin. It seems an external fibre or SAS raid is needed, as the idea of losing the file system if one node goes down doesn't seem good, even if temporary. If Lustre allowed for a single downed node I'd feel differently.

However, it does have me thinking of building a 2nd cluster for backup/replication, and that one could use cheap SATA internal storage since it is only for nearline use, really.

Tyler
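If the second cluster is only a nearline copy, the replication can be as simple as a periodic rsync run from a node that mounts both file systems; Lustre versions with changelog support also include a lustre_rsync tool that avoids a full tree scan. A trivial sketch with placeholder mount points, not from this thread:

  # Periodic nearline sync from a data-mover node that mounts both
  # file systems (paths are placeholders):
  rsync -a --delete /mnt/lustre-primary/ /mnt/lustre-backup/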
Mark Hahn
2011-Jul-23 19:06 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
> It seems an external fibre or SAS raid is needed,

to be precise, a redundant-path SAN is needed. you could do it with commodity disks and Gb, or you can spend almost unlimited amounts on gold-plated disks, FC switches, etc.

the range of costs is really quite remarkable, I guess O(100x). compare this to cars where even VERY nice production cars are only a few times more expensive than the most cost-effective ones.

> as the idea of losing the file system if one node goes down doesn't
> seem good, even if temporary.

how often do you expect nodes to fail, and why?

regards, mark hahn.
Kevin Van Maren
2011-Jul-24 14:25 UTC
[Lustre-discuss] Failover / reliability using SAD direct-attached storage
Mark Hahn wrote:
>> It seems an external fibre or SAS raid is needed,
>
> to be precise, a redundant-path SAN is needed. you could do it with
> commodity disks and Gb, or you can spend almost unlimited amounts on
> gold-plated disks, FC switches, etc.

Many deployments are done without redundant paths, which offer additional insurance.

> the range of costs is really quite remarkable, I guess O(100x).
> compare this to cars where even VERY nice production cars are only
> a few times more expensive than the most cost-effective ones.

You're comparing two mass-market cars: there is a nearly 1000x difference in price between a cheap dune buggy and a Bugatti, but both provide transportation for 1-2 people.

>> as the idea of losing the file system if one
>> node goes down doesn't seem good, even if temporary.

The clients should just hang on the file system until the server is again available. This is not so different from using NFS with hard mounts.

Note that even with failover, the Lustre file system will be down for several minutes, as the HA package has to first detect a problem, then safely start up Lustre on the backup server, and then Lustre recovery has to occur.

> how often do you expect nodes to fail, and why?
>
> regards, mark hahn.
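For reference, the client side of that hang-and-recover behaviour is set up at mount time: the client is given every MGS NID it may need, and learns the OST failover NIDs from the configuration recorded at format time (--failnode). A minimal sketch with placeholder addresses and fsname, not from this thread:

  # Client mount naming both MGS nodes, so the mount works across an
  # MGS failover (addresses and fsname are placeholders):
  mount -t lustre 192.168.1.1@tcp0:192.168.1.2@tcp0:/testfs /mnt/testfs

During an OSS failover the client simply blocks on I/O to the affected OSTs, reconnects to the backup server, and replays outstanding requests once recovery completes.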