We have come across two situations where we've had to rebuild our Lustre
filesystem. Both happened when one of the OSTs' hard drives failed. We did
set the OSTs up for failover; however, the network was never interrupted,
so the switch to the failover node never happened.

How exactly should failover work? We are running 1.6.0.1 on 20 OSTs and
one dedicated MGS/MDT.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: 210-567-2672
On Jan 03, 2008 16:35 -0600, Jeremy Mann wrote:
> We have come across two situations where we've had to rebuild our Lustre
> filesystem. Both happened when one of the OSTs' hard drives failed. We
> did set the OSTs up for failover, however the network was never
> interrupted so the switch to the failover node never happened.
>
> How exactly should failover work?

To be clear - Lustre failover has nothing to do with data replication.
It is meant only as a mechanism to allow high availability of shared
disk. This means more than one node can serve a shared disk from a
SAN or from multi-port FC/SCSI disks.

You currently need another mechanism (hardware or software RAID) to
provide data redundancy in case of disk failure. We are working to
provide data replication at the Lustre level, but that is not yet
available.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
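[Editorial note: as a rough sketch of the shared-disk failover model
described above - the node names oss1/oss2/mgs, the NIDs, the filesystem
name, and /dev/shared_lun are made-up placeholders, and exact options vary
by installation - an OST on a dual-ported LUN is typically formatted with a
--failnode entry so that a second OSS can take over serving it:]

    # Format the OST on the shared LUN, declaring the backup OSS as failnode
    # (placeholder names; real NIDs depend on your network setup):
    mkfs.lustre --ost --fsname=testfs \
        --mgsnode=mgs@tcp0 \
        --failnode=oss2@tcp0 \
        /dev/shared_lun

    # Normal operation: the primary OSS mounts (serves) the OST.
    [root@oss1]# mount -t lustre /dev/shared_lun /mnt/ost0

    # If the primary dies, an HA framework (or the administrator) mounts the
    # same shared device on the backup node, and clients reconnect to it:
    [root@oss2]# mount -t lustre /dev/shared_lun /mnt/ost0

[Failover only changes which server exports the LUN; the data still lives
on that single LUN, which is why RAID underneath it is still required.]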
> You currently need another mechanism (hardware or software RAID) to
> provide data redundancy in case of disk failure. We are working to
> provide data replication at the Lustre level, but that is not yet
> available.

I should say. That technology has me pretty excited.

Right now, unless I bend over backwards and do something like "vertical"
RAID stripes/mirrors across multiple disk trays in a storage cluster, I can
end up with a very bad situation if I lose an entire tray. This can have a
potentially devastating impact on my entire storage tier.

A few companies here and there (XIV, Isilon) are starting to abandon
hardware RAID and are doing block replication across the entire storage
cluster. With that, I can forget worrying about specific disks (except to
replace them), and don't even have to worry about whole trays (insofar as
I have spare capacity). This is a pretty neat capability. If you add to it
the ability to "rebalance" your cluster on the fly as new nodes are added,
what you end up with is a self-healing storage cluster. Pretty compelling
for those availability figures, and it can help with the disk-service
pattern as well.

Joe Kraska
San Diego CA
USA
Data protection and business continuity come at a price, both money-wise
and performance-wise. The architecture and components of a Lustre system
should be determined by deciding which component failures the system should
be able to tolerate.

If you want to tolerate OSS server failure, you will need:
1) 2x OSS clustered servers
2) More complex installation
3) FC/SAS non-caching HBAs
4) Shared disk with NVRAM capability
5) Less performance per dollar

If you want to tolerate disk failures:
1) Soft or hard RAID on LUNs
2) RAID costs some performance

If you want to tolerate tray failure:
1) Vertical RAID sets
2) However, if a tray fails in a vertical RAID 5 setup, this will cause a
   lot of problems during rebuilding

If you want to tolerate an entire storage system failure:
1) Multiple storage systems behind an OSS (pair); you might use software
   RAID across the individual disk systems
2) Rebuild will take a lot of time (ZFS will offer dirty time logging and
   will cure this)

If you want to tolerate an OSS system failure:
1) You need to wait for Lustre network RAID; however, this will put some
   pressure on clients

Today, distributed disk systems do not offer much advantage when used with
Lustre. Replication across limited bandwidth and coherency problems limit
the performance, and the cost is high.

Lustre systems are normally tuned for performance, not HA, and Lustre can
tolerate an OSS failure without much problem. So it might not be wise to
spend a lot of money on Lustre HA when all you are going to gain is 1-2
hours more availability per year.

Best regards,
Mertol

From: lustre-discuss-bounces at clusterfs.com
[mailto:lustre-discuss-bounces at clusterfs.com] On Behalf Of Joe Kraska
Sent: 04 Ocak 2008 Cuma 05:59
To: Jeremy Mann; lustre-discuss at clusterfs.com
Subject: Re: [Lustre-discuss] Problems with failover

You currently need another mechanism (hardware or software RAID) to
provide data redundancy in case of disk failure. We are working to provide
data replication at the Lustre level, but that is not yet available.

I should say. That technology has me pretty excited. Right now, unless I
bend over backwards and do something like "vertical" RAID stripes/mirrors
across multiple disk trays in a storage cluster, I can end up with a very
bad situation if I lose an entire tray. This can have a potentially
devastating impact on my entire storage tier. A few companies here and
there (XIV, Isilon) are starting to abandon hardware RAID and are doing
block replication across the entire storage cluster. With that, I can
forget worrying about specific disks (except to replace them), and don't
even have to worry about whole trays (insofar as I have spare capacity).
This is a pretty neat capability. If you add to it the ability to
"rebalance" your cluster on the fly as new nodes are added, what you end up
with is a self-healing storage cluster. Pretty compelling for those
availability figures, and it can help with the disk-service pattern as
well.

Joe Kraska
San Diego CA
USA
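[Editorial note: as a rough illustration of the "vertical RAID set" idea
mentioned above - the device names are invented placeholders, and the real
layout depends entirely on how the trays are cabled and presented - each
software RAID array takes one member disk from each tray, so losing a whole
tray degrades every array by one member rather than destroying any single
array outright:]

    # Placeholder device names: /dev/trayN_disk0 stands for one disk taken
    # from each of five separate trays.
    mdadm --create /dev/md10 --level=5 --raid-devices=5 \
        /dev/tray1_disk0 /dev/tray2_disk0 /dev/tray3_disk0 \
        /dev/tray4_disk0 /dev/tray5_disk0

    # After a failed tray is replaced, re-add its member and let the array
    # rebuild (this is the long, painful rebuild Mertol warns about):
    mdadm --manage /dev/md10 --add /dev/tray2_disk0

[The trade-off, as noted above, is that a tray failure in such a RAID 5
layout leaves every array degraded at once and the rebuild window is
correspondingly risky.]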
On Thu, 2008-01-03 at 17:34 -0700, Andreas Dilger wrote:
> To be clear - Lustre failover has nothing to do with data replication.
> It is meant only as a mechanism to allow high-availability of shared
> disk. This means - more than one node can serve shared disk from a
> SAN or multi-port FC/SCSI disks.
>
> You currently need another mechanism (hardware or software RAID) to
> provide data redundancy in case of disk failure. We are working to
> provide data replication at the Lustre level, but that is not yet
> available.

Thank you Andreas, we were under the impression it was data replication as
well. We will have to redesign our filesystem based on this.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: 210-567-2672
On Thu, 2008-01-03 at 17:34 -0700, Andreas Dilger wrote:
> To be clear - Lustre failover has nothing to do with data replication.
> It is meant only as a mechanism to allow high-availability of shared
> disk. This means - more than one node can serve shared disk from a
> SAN or multi-port FC/SCSI disks.

How would one build a reliable system with 20 OSTs? Our system contains 20
compute nodes, each with 2 200GB drives in a RAID0 configuration. Each node
acts as an OST and as a failover partner for another, i.e. 0-1, 1-2, 3-4,
etc.

I can start from scratch, so I'm thinking of rebuilding the RAID arrays
with RAID1 to compensate for disk failures. But that still leaves me
questioning whether, if a node goes down or we lose another drive, we'll be
back to the same problems we've been having.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: 210-567-2672
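[Editorial note: a minimal sketch of the RAID1 rebuild being considered
here - device names, the filesystem name, and the NIDs are placeholders,
and the exact mkfs options depend on the existing configuration. Note that
because each node's disks are local, a --failnode entry on its own does not
give real failover: the partner node can only take over an OST that lives
on storage both nodes can physically reach, per Andreas's explanation
above.]

    # Mirror the two 200GB drives instead of striping them
    # (placeholder device names):
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

    # Reformat the OST on the mirror; --failnode is only useful if the
    # partner node can actually reach this same storage:
    mkfs.lustre --ost --fsname=testfs --mgsnode=mds@tcp0 \
        --failnode=node2@tcp0 /dev/md0
    mount -t lustre /dev/md0 /mnt/ost0

[RAID1 covers the single-drive-failure case that forced the earlier
rebuilds; a whole-node failure still takes that OST's data offline until
the node comes back.]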
Personally I strongly advise against using compute nodes to host any type
of storage service. If a user job crashes a compute node (it will actually
usually take out several), you're once again up a creek. I don't know of
any filesystem that could handle the failure of more than two or three
underlying storage components. Separating storage from computation was the
best decision I've ever made because it allows both to be scaled
independently. Am I totally missing the mark here?

If you still want to do this, try the gfarm filesystem; there's another
one, but I can't think of the name. If I find it I'll let you know.

On Jan 4, 2008, at 11:10 AM, Jeremy Mann wrote:

> On Thu, 2008-01-03 at 17:34 -0700, Andreas Dilger wrote:
>
>> To be clear - Lustre failover has nothing to do with data replication.
>> It is meant only as a mechanism to allow high-availability of shared
>> disk. This means - more than one node can serve shared disk from a
>> SAN or multi-port FC/SCSI disks.
>
> How would one build a reliable system with 20 OSTs? Our system contains
> 20 compute nodes, each with 2 200GB drives in a RAID0 configuration.
> Each node acts as an OST and a failover of each other, i.e. 0-1, 1-2,
> 3-4, etc.
>
> I can start from scratch, so I'm thinking of rebuilding the RAID arrays
> with RAID1 to compensate for disk failures. But that still leaves me
> questioning if a node goes down, or we lose another drive, if we'll be
> back to the same problems we've been having.
>
> --
> Jeremy Mann
> jeremy at biochem.uthscsa.edu
>
> University of Texas Health Science Center
> Bioinformatics Core Facility
> http://www.bioinformatics.uthscsa.edu
> Phone: 210-567-2672

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org