Emmanuel Noobadmin
2010-Jun-26 21:13 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
I'm looking at using Lustre to implement centralized storage for several virtualized machines. The key considerations are reliability and ease of increasing/replacing capacity. However, I'm still quite confused and haven't read the manual fully because I'm tripping on this: what exactly happens if a piece of hardware fails? Perhaps it's because I haven't yet tried to set up Lustre, so the terms used don't quite translate for me yet. So I'll appreciate some newbie hand-holding here :)

For example, say I have a simple 5-machine cluster: one MDS/MDT and one failover MDS/MDT, plus three OSS/OST machines with 4 drives each, giving 2 sets of MD RAID 1 block devices per machine and a total of 6 OSTs, if I haven't misunderstood the terms.

What happens if one of the OSS/OST machines dies, say from a motherboard failure? Because the manual mentions data striping across multiple OSTs, it sounds like either networked RAID 0 or RAID 5. In the case of network RAID 0, a single machine failure means the whole cluster is dead. It doesn't seem to make sense for Lustre to fail in this manner. Whereas if Lustre implemented network RAID 5, the cluster would continue to serve all data despite the dead machine.

Yet the manual warns that Lustre does not have redundancy and relies entirely on some kind of hardware RAID being used. So it seems to imply that network RAID 0 is what's implemented. This appears to be the case given the example in the manual of a simple combined MGS/MDT with two OSS/OSTs using the same fsname "temp", which combines the two 16MB OSTs into a single 30MB filesystem mounted as /lustre on the client.

Does this then mean that if I want redundancy on the storage, I would basically need a failover machine for every OSS/OST?

I'm also confused because the manual says an OST is a block device such as /dev/sda1, but an OSS can be configured to provide failover services. If the OSS machine which houses the OST dies, how would another OSS take over, since it would not be able to access the other set of data? Or does that mean this functionality is only available if the OSTs in the cluster are standalone SAN devices?
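In case it helps make the question concrete, here is roughly how I imagine the layout would be formatted, going by the manual's examples (all host names, device names and the fsname are made up, and I haven't actually run any of this yet):

    # On the MDS node (say "mds1"): combined MGS/MDT on a RAID 1 pair
    # (the failover MDS would somehow also need to see this device,
    #  which is part of my confusion)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.lustre --fsname=temp --mgs --mdt /dev/md0
    mount -t lustre /dev/md0 /mnt/mdt

    # On each OSS node (say "oss1".."oss3"): two RAID 1 pairs, each one OST
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdd /dev/sde
    mkfs.lustre --fsname=temp --ost --mgsnode=mds1@tcp0 /dev/md0
    mkfs.lustre --fsname=temp --ost --mgsnode=mds1@tcp0 /dev/md1
    mount -t lustre /dev/md0 /mnt/ost0
    mount -t lustre /dev/md1 /mnt/ost1

    # On a client: mount the whole filesystem
    mount -t lustre mds1@tcp0:/temp /lustre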
Peter Grandi
2010-Jun-28 08:28 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
> I'm looking at using Lustre to implement a centralized storage
> for several virtualized machines.

That's such a cliche, and Lustre is very suitable for it if you don't mind network latency :-), or if you use a very low latency fabric.

In general I am surprised (or perhaps not :->) by how many "clever" people choose to provide resource virtualization and parallelization at the lower levels of abstraction (e.g. block device) and not at the higher ones (service protocol), thus enjoying all the "benefits" of centralization. But then probably they don't care about availability and in particular about latency (and sometimes not even about throughput).

> The key consideration being reliability

Data "reliability" is not a Lustre concern as such. Eventually Lustre on ZFS will gain what ZFS offers in that regard. Also Lustre 2.x will have object-level redundancy (sort of like RAID1), somewhat compromising the purity of its design.

For overall service availability, choose the Lustre version and patches carefully, that is, do extensive integration testing before production use. Lots of sites have reported spending a few months figuring out the combination of firmware, OS, and Lustre versions that actually works well together. Lustre setups tend to be demanding and to exercise corner cases that less ambitious systems don't reach.

> and ease of increasing/replacing capacity.

Add more OSSes with more OSTs. Some sites have hundreds or thousands. While avoiding having too few MDSes, which means more than one Lustre instance (which can be done in some cool ways, as nothing prevents a node from being a frontend for more than one instance). Note that, as in many other cases, in your specific application there is no significant benefit from having a single storage pool (Lustre instance).

> However, I'm still quite confused and haven't read the manual
> fully because I'm tripping on this: what exactly happens if a
> piece of hardware fails?

The manual, the Wiki, a number of papers and presentations, and this mailing list have extensive discussions of various schemes. Keep in mind that Lustre is fundamentally aimed at being an HPC filesystem, not an HA one. That is, the primary use of multiple hardware resources is parallelism, not redundancy.

> For example, if I have a simple 5 machine cluster, one
> MDS/MDT and one failover MDS/MDT. Three OSS/OST machines with 4
> drives each, for 2 sets of MD RAID 1 block devices [ ... ]

That's somewhat unusual, as this leaves parallelization entirely up to the Lustre layer striping, which perhaps is not wise. It is surely wiser than using parity RAID for OSTs (which is what the Lustre docs suggest for data, while RAID10 is recommended for metadata).

> What happens if one of the OSS/OST dies, say motherboard
> failure? Because the manual mentions data striping across
> multiple OST, it sounds like either networked RAID 0 or RAID 5.

Sort of like RAID0, but at the object (file or file section) level instead of the block level.

> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead. It doesn't seem to make sense for Lustre to
> fail in this manner. [ ... ]

Perhaps it does make sense to others. :-)

> Yet the manual warns that Lustre does not have redundancy and
> relies entirely on some kind of hardware RAID being used. So it
> seems to imply that the network RAID 0 is what's implemented.

The manual is pretty clear on that.

> Does this then mean that if I want redundancy on the storage,
> I would basically need to have a failover machine for every
> OSS/OST?

Depending on how much redundancy you want to achieve, you may need both failover machines and failover drives.

> I'm also confused because the manual says an OST is a block
> device such as /dev/sda1 but OSS can be configured to provide
> failover services. [ ... ] Or does that mean this
> functionality is only available if the OST in the cluster are
> standalone SAN devices?

Any storage device that can be shared across multiple servers will do, in a hot/warm setup. There are detailed discussions of frontend server failover (various HA schemes) and storage backend replication (DRBD, for example) setups in the Lustre Wiki and several papers.
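To illustrate the "RAID0 at the object level" point above: striping is a per-file/per-directory layout attribute set from the client with lfs, not block-level RAID. A rough sketch (directory names and stripe counts invented):

    # objects of new files in this directory are spread across 3 OSTs;
    # such a file is unavailable if any OSS serving one of those OSTs is down
    lfs setstripe -c 3 /lustre/striped_dir

    # keep each file on a single OST: losing one OSS then only affects
    # the files whose objects happen to live on its OSTs
    lfs setstripe -c 1 /lustre/unstriped_dir

    # show which OST(s) hold a given file's objects
    lfs getstripe /lustre/striped_dir/somefile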
William Olson
2010-Jun-28 13:44 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
Hello, being a newbie myself I've just recently worked through all these questions, so here's what I've learned.

On 6/26/2010 2:13 PM, Emmanuel Noobadmin wrote:

> I'm looking at using Lustre to implement a centralized storage for
> several virtualized machines. The key consideration being reliability
> and ease of increasing/replacing capacity.

Increasing capacity is easy; replacing it will take some practice and careful reading of the manual and the mailing list archives.

> However, I'm still quite confused and haven't read the manual fully
> because I'm tripping on this: what exactly happens if a piece of
> hardware fails?
> Perhaps it's because I haven't yet tried to set up Lustre so the terms
> used don't quite translate for me yet. So I'll appreciate some newbie
> hand holding here :)
>
> For example, if I have a simple 5 machine cluster, one MDS/MDT and one
> failover MDS/MDT. Three OSS/OST machines with 4 drives each, for 2
> sets of MD RAID 1 block devices and so total of 6 OST if I didn't
> understand the term wrongly.

I think you understood it correctly there.

> What happens if one of the OSS/OST dies, say motherboard failure?
> Because the manual mentions data striping across multiple OST, it
> sounds like either networked RAID 0 or RAID 5.

Networked RAID 0 is the closest analogy.

> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead. It doesn't seem to make sense for Lustre to
> fail in this manner. Whereas if Lustre implements network RAID 5, the
> cluster would continue to serve all data despite the dead machine.

This is why the manual points out that it's important to have reliable hardware on the back-end. I would strongly suggest a SAN/NAS solution, or at least a well-tested and well-executed backup strategy.

> Yet the manual warns that Lustre does not have redundancy and relies
> entirely on some kind of hardware RAID being used. So it seems to
> imply that the network RAID 0 is what's implemented.

Yup.

> This appears to be the case given the example in the manual of a
> simple combined MGS/MDT with two OSS/OST which uses the same fsname
> "temp" for the OSTs, which then combines the two 16MB OST into a
> single 30MB filesystem mounted as /lustre on the client.
>
> Does this then mean that if I want redundancy on the storage, I would
> basically need to have a failover machine for every OSS/OST?

Correct. However, if you are using a 5-node cluster with 2 mgs/mds and 3 oss, then the 3 oss servers could be configured to back each other up in the event of a failure, assuming you were using a SAN/NAS solution for the storage. If not, then I would recommend extra drives in each machine that a backup of the failed OST could be restored to.

> I'm also confused because the manual says an OST is a block device
> such as /dev/sda1 but OSS can be configured to provide failover
> services. But if the OSS machine which houses the OST dies, how would
> another OSS take over anyway since it would not be able to access the
> other set of data?
>
> Or does that mean this functionality is only available if the OST in
> the cluster are standalone SAN devices?

That would be the most advisable hardware configuration from my experience. If, on the other hand, you have spare hardware for the production servers (such as a replacement mobo, drives, etc.), then you can be fairly safe as long as you ensure that you have a proper RAID configuration on your Lustre partitions. You will experience downtime while you replace failed core components (mobo, proc, RAM, etc.), but if it's just a RAID member HD, then Lustre can keep on truckin'. Downtime should only be as long as it takes to replace the part.

We make it a point to always have a hot spare of any core production machine that we have in the rack. So if you only have 5 machines to work with (and no NAS/SAN), I would suggest moving to a 4-node Lustre environment and keeping the 5th server as a hot spare.

Good luck!

-Billy Olson
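P.S. For reference, the back-each-other-up arrangement is declared when formatting the OSTs, using mkfs.lustre --failnode. Something along these lines (host and device names made up; both nodes must be able to reach the shared device, e.g. via SAN or DRBD):

    # OST normally served by oss1, with oss2 declared as its failover server
    mkfs.lustre --fsname=temp --ost --mgsnode=mds1@tcp0 \
        --failnode=oss2@tcp0 /dev/mapper/shared_ost0

    # on oss1 (normal operation): mount, and therefore serve, the OST
    mount -t lustre /dev/mapper/shared_ost0 /mnt/ost0

    # on oss2, only after oss1 dies: mount the same device and take over
    # (the device must never be mounted by both nodes at once)
    mount -t lustre /dev/mapper/shared_ost0 /mnt/ost0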
Brian J. Murrell
2010-Jun-28 15:10 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
On Sun, 2010-06-27 at 05:13 +0800, Emmanuel Noobadmin wrote:

> However, I'm still quite confused and haven't read the manual fully
> because I'm tripping on this: what exactly happens if a piece of
> hardware fails?

What happens depends on which piece of hardware fails. If it's an OSS configured for failover, the backup OSS takes over serving the OSTs. Ditto for an MDS. If it's a disk in a RAID LUN, well, you replace the disk and let RAID rebuild the LUN.

> For example, if I have a simple 5 machine cluster, one MDS/MDT and one
> failover MDS/MDT.

We should get you started out correctly with nomenclature and concepts.

For any given filesystem there can be only 1 MDT. The MDT is the actual device/disk and associated processes that stores and serves the metadata. You can have 1 or more MDSes configured to provide service for it. Of course, if you have more than one, then somehow, usually through shared storage, all of those machines must be able to see the MDT (the disk).

An MDS is a physical machine that hosts (can provide) MDT services. You can only have one active MDS at a time -- that is, only one MDS can have the MDT mounted at a time. This is paramount. No more than one machine can mount the MDT at a time.

> Three OSS/OST machines

They are usually just called OSSes.

> with 4 drives each, for 2
> sets of MD RAID 1 block devices and so total of 6 OST if I didn't
> understand the term wrongly.
>
> What happens if one of the OSS/OST dies, say motherboard failure?

In order to survive such a failure, the OST must be visible to another OSS, which can then mount it and provide service for it.

> Because the manual mentions data striping across multiple OST, it
> sounds like either networked RAID 0 or RAID 5.

Lustre does not provide any form of data redundancy and expects the storage below it to provide that, so yes, if you value your data, you put your OSTs on RAID disk.

> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead.

No. Even if you didn't configure failover (so that another machine can provide service for the OST(s)), the filesystem is still available for access to any data that is not on the OSTs of a failed, non-failover-configured OSS. Any access to data on the failed OSS's OSTs will either just block (i.e. hang) the client's request until the OSS is brought back into service, or I/O to failed OSTs can return an EIO to the client. That is configurable by the administrator.

> It doesn't seem to make sense for Lustre to
> fail in this manner. Whereas if Lustre implements network RAID 5, the
> cluster would continue to serve all data despite the dead machine.

I think you are missing the point of failover (with shared disk). A failure of an OSS is survivable in that case.

> Yet the manual warns that Lustre does not have redundancy and relies
> entirely on some kind of hardware RAID being used. So it seems to
> imply that the network RAID 0 is what's implemented.

No. Lustre provides no RAID at all.

> Does this then mean that if I want redundancy on the storage, I would
> basically need to have a failover machine for every OSS/OST?

Yes. Typically people configure active/active failover for OSTs. That is, if they have enough disk for 12 OSTs, they configure two OSSes and put 6 OSTs on each, with each OST also being configured to provide service for the other OSS's 6. So normally each OSS actively provides service for 6 OSTs, but if one of the OSSes fails, the survivor takes over service for, and provides, all 12 OSTs.

> I'm also confused because the manual says an OST is a block device
> such as /dev/sda1 but OSS can be configured to provide failover
> services. But if the OSS machine which houses the OST dies, how would
> another OSS take over anyway since it would not be able to access the
> other set of data?

You need to be using some sort of shared storage where two computers can both see the same disk. This is typically achieved with FC SCSI type configurations; however, it can be done at the lower end with FireWire (which supports shared access, to the extent of various hardware and software implementations). Others here are also using DRBD, but we (Oracle) don't really have any experience with the robustness of such a solution, so you will need to test it for yourself to your level of satisfaction.

> Or does that mean this functionality is only available if the OST in
> the cluster are standalone SAN devices?

Well, not so much actual SAN devices -- which, IIUC, usually implies a filesystem service, not a block device -- but yes, you are typically referring to disks that are physically outside of the OSSes and connected via some sharable medium such as FC SCSI or InfiniBand, etc.

b.
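Regarding the hang-vs-EIO point above, a sketch of the sort of thing the administrator can do on a client when an OST is going to be down for a while (device numbers and names here are hypothetical; by default the client simply waits for recovery):

    # find the client-side OSC device for the dead OST
    lctl dl | grep temp-OST0003-osc

    # deactivate it: I/O needing that OST now returns EIO to applications
    # instead of blocking until the OSS (or its failover partner) recovers
    lctl --device <devno> deactivate

    # reactivate once the OST is being served again
    lctl --device <devno> activate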
Emmanuel Noobadmin
2010-Jun-28 17:10 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
> We should get you started out correctly with nomenclature and concepts.

Thanks for the clarification, it really helped my understanding :)

Unfortunately, after all the much-appreciated responses, it seems that Lustre is not the solution I'm looking for. I was hoping to use it as an easily expandable storage cluster with the equivalent of network RAID 5 across 3 machines with RAID 1 physical disks. This storage cluster/SAN would then hold VM images for several VM servers.

This way, I thought, it would make recovery of any machine easy: I just have to mount the network storage on a working/replacement server and boot up the VMs originally hosted on the failed server. Somebody else pointed out that I might be looking for OpenFiler instead.
Emmanuel Noobadmin
2010-Jun-28 18:03 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
On 6/28/10, Peter Grandi <pg_lus at lus.for.sabi.co.uk> wrote:

>> I'm looking at using Lustre to implement a centralized storage
>> for several virtualized machines.
>
> That's such a cliche, and Lustre is very suitable for it if you
> don't mind network latency :-), or if you use a very low latency
> fabric.
>
> In general I am surprised (or perhaps not :->) by how many
> "clever" people choose to provide resource virtualization and
> parallelization at the lower levels of abstraction (e.g. block
> device) and not at the higher ones (service protocol), thus
> enjoying all the "benefits" of centralization. But then probably
> they don't care about availability and in particular about latency
> (and sometimes not even about throughput).

Am I correct to understand that you mean the approach I am considering is stupid, then? Which wouldn't be too surprising, since I'm a newbie at this, so I'll appreciate any pointers in the right direction :)

What do you mean by higher levels of abstraction and benefits of centralization? Would it be correct to understand that to mean that, instead of trying to provide redundant storage, I should be looking at providing several servers that would simply fail over to each other? e.g.

S1 (VM1, VM2, VM3) failover to S2
S2 (VM4, VM5, VM6) failover to S3
S3 (VM7, VM8, VM9) failover to S1
Peter Grandi
2010-Jun-29 07:40 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
> Lustre is not the solution I'm looking for. I was hoping to use
> it as an easily expandable storage cluster

But it is that; it is just a particular type of storage cluster system with a specific performance profile. As to "expandable", consider again whether your requirements involve a single storage pool or whether you can do multiple instances.

> with the equivalent of network RAID 5 across 3 machines with
> RAID 1 physical disks.

That seems to me quite a peculiar setup with some strong performance anisotropy, and it is difficult for me to imagine the requirements driving it.

> This storage cluster/SAN would then hold VM images for several
> VM servers.

The images can be relatively small things. What about the storage for those VMs? Virtual disks (more images), or do you mount the filesystems from a NAS server (e.g. Lustre) while the VM is booting?

> This way, I thought it would make recovery of any machine
> easy, I just have to mount the network storage on a
> working/replacement server and boot up the VMs originally
> hosted on a failed server.

Ah, that's an interesting point, as you have implicitly stated some availability requirements and expected failure modes. Apparently you don't need continuous VM availability, and recovery can be manual and take some time. Also, you think that loss of a compute server is more likely, or easier to recover from, than loss of a storage server or a storage device (even if you want to provide two levels of redundancy). You also seem to imply that network latency and bandwidth are not a big issue for VM performance.

> Somebody else pointed out that I might be looking for
> OpenFiler instead.

Or perhaps GlusterFS. Or perhaps check your requirements again and simplify your design a bit. The ideal application for Lustre is massively parallel (many-to-many) IO of large, sequentially accessed datasets, and down from there. Scalability is bought at the price of network latency and traffic (in this it is a smaller-scale version of the GoogleFS, where the tradeoff is even more extreme), and careful design of the underlying storage layer (in this the GoogleFS is the opposite). It can also do decently the same workloads that
Peter Grandi
2010-Jun-29 08:25 UTC
[Lustre-discuss] Question on lustre redundancy/failure features
>>> I'm looking at using Lustre to implement a centralized
>>> storage for several virtualized machines.

>> That's such a cliche, and Lustre is very suitable for it if
>> you don't mind network latency :-), or if you use a very low
>> latency fabric.

>> [ ... ] choose to provide resource virtualization and
>> parallelization at the lower levels of abstraction (e.g. block
>> device) and not at the higher ones (service protocol), thus
>> enjoying all the "benefits" of centralization. But then
>> probably they don't care about availability and in particular
>> about latency (and sometimes not even about throughput).

> Am I correct to understand that you mean the approach I am
> considering is stupid then?

Not quite; there are some legitimate applications in which availability, latency or throughput matter little, or less than other goals, and then low-level virtualization and parallelization are acceptable design choices. But it is difficult for me to imagine the requirements that justify a choice of network RAID5 on RAID1 arrays.

> [ ... ] pointers in the right direction :)

It depends on the requirements, on what the priority is, and on the budget for the hardware layer.

> What do you mean by higher levels of abstraction and benefits
> of centralization?

Well, consider the case of something like a data repository, e.g. RDBMS tablespaces or a local mail store. The choice could be between virtualizing and sharing the disks using a low-level, block-oriented protocol (e.g. GFS/GFS2), or having two redundant RDBMS or mail storage systems, each with its own local storage and application-specific sync; that is, whether to virtualize the storage used by the service, or the service itself. I think the latter is preferable in most cases.

Another popular choice is to have a central SAN server, a central NAS (NFS, Lustre, ...) server using it, and a central compute or timesharing server mounting the latter, instead of three computers each with local storage and filesystem and each serving a third of the load. Network latency and throughput limitations usually matter more than realtime continuous sharing and availability, and unless one wants to invest in HPC-style fabrics, network latency and throughput issues are best avoided and local access at low levels of abstraction/virtualization is vastly preferable.

Note: there are some people who do need massive shared systems with very high continuous realtime sharing and availability requirements, and there are very expensive and difficult ways to address those requirements properly.

> Would it be correct to understand that to mean instead of
> trying to provide redundant storage, I should be looking at
> providing several servers that would simply fail over to each
> other? e.g.
> S1 (VM1, VM2, VM3) failover to S2
> S2 (VM4, VM5, VM6) failover to S3
> S3 (VM7, VM8, VM9) failover to S1

I presume that this means that S1 is running VM1, VM2, VM3 from local disks. This might be a good alternative, and you could be using DRBD to mirror the images in realtime across machines. The advantage would be a lot less network latency (with the "main" image being on local storage and only writes, and those queued, going over the network) and less network traffic (all reads being local).

Another issue is whether you have different requirements for the VM images (e.g. the '/' filesystem) and/or the filesystems they access (e.g. '/home' or '/var/www'), and whether the latter should be shared across two or more VMs. In that case a network filesystem could be handy, and Lustre is a good choice even if one does not need its massively parallel (many-to-many) streaming performance.

Note: for VMs, regrettably, block-level virtualization over the network might be better than mounting filesystems over the network, because in the former case the network traffic is done by the real system, in the latter by the virtual system, and many VM implementations don't do network traffic that well.
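As a rough illustration of the DRBD option above (host names, addresses and devices are invented; this mirrors only the block device holding the VM images and is independent of Lustre):

    # /etc/drbd.d/vmimages.res, identical on both s1 and s2
    resource vmimages {
        protocol C;              # synchronous: a write completes on both nodes
        on s1 {
            device    /dev/drbd0;
            disk      /dev/md0;  # local RAID1 volume holding the VM images
            address   10.0.0.1:7789;
            meta-disk internal;
        }
        on s2 {
            device    /dev/drbd0;
            disk      /dev/md0;
            address   10.0.0.2:7789;
            meta-disk internal;
        }
    }

    # bring it up, then make s1 (the node running the VMs) primary:
    #   drbdadm create-md vmimages && drbdadm up vmimages   (on both nodes)
    #   drbdadm primary vmimages                             (on s1 only)

Reads then come from the local /dev/drbd0 on s1 while writes are replicated to s2; if s1 dies, s2 can be promoted to primary and the VMs restarted there.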