We have come across two situations where we've had to rebuild our Lustre
filesystem. Both happened when one of the OSTs' hard drives failed. We did
set the OSTs up for failover; however, the network was never interrupted,
so the switch to the failover node never happened.

How exactly should failover work? We are running 1.6.0.1 on 20 OSTs and
one dedicated MGS/MDT.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: 210-567-2672
On Jan 03, 2008 16:35 -0600, Jeremy Mann wrote:
> We have come across two situations where we've had to rebuild our Lustre
> filesystem. Both happened when one of the OSTs' hard drives failed. We
> did set the OSTs up for failover, however the network was never
> interrupted so the switch to the failover node never happened.
>
> How exactly should failover work?

To be clear - Lustre failover has nothing to do with data replication.
It is meant only as a mechanism to allow high availability of shared
disk. This means more than one node can serve a shared disk from a
SAN or from multi-port FC/SCSI disks.

You currently need another mechanism (hardware or software RAID) to
provide data redundancy in case of disk failure. We are working to
provide data replication at the Lustre level, but that is not yet
available.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
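[Editorial note: as a rough sketch of the shared-disk failover model
described above - the node names oss1/oss2/mgs, the NIDs, the filesystem
name, and /dev/shared_lun are made-up placeholders, and exact options vary
by installation - an OST on a dual-ported LUN is typically formatted with a
--failnode entry so that a second OSS can take over serving it:]

    # Format the OST on the shared LUN, declaring the backup OSS as failnode
    # (placeholder names; real NIDs depend on your network setup):
    mkfs.lustre --ost --fsname=testfs \
        --mgsnode=mgs@tcp0 \
        --failnode=oss2@tcp0 \
        /dev/shared_lun

    # Normal operation: the primary OSS mounts (serves) the OST.
    [root@oss1]# mount -t lustre /dev/shared_lun /mnt/ost0

    # If the primary dies, an HA framework (or the administrator) mounts the
    # same shared device on the backup node, and clients reconnect to it:
    [root@oss2]# mount -t lustre /dev/shared_lun /mnt/ost0

[Failover only changes which server exports the LUN; the data still lives
on that single LUN, which is why RAID underneath it is still required.]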
> You currently need another mechanism (hardware or software RAID) to
> provide data redundancy in case of disk failure. We are working to
> provide data replication at the Lustre level, but that is not yet
> available.

I should say. That technology has me pretty excited.

Right now, unless I bend over backwards and do something like "vertical"
RAID stripes/mirrors across multiple disk trays in a storage cluster, I can
end up with a very bad situation if I lose an entire tray. This can have a
potentially devastating impact on my entire storage tier.

A few companies here and there (XIV, Isilon) are starting to abandon
hardware RAID and are doing block replication across the entire storage
cluster. With that, I can forget worrying about specific disks (except to
replace them), and don't even have to worry about whole trays (insofar as
I have spare capacity). This is a pretty neat capability. If you add to it
the ability to "rebalance" your cluster on the fly as new nodes are added,
what you end up with is a self-healing storage cluster. Pretty compelling
for those availability figures, and it can help with the disk-service
pattern as well.

Joe Kraska
San Diego CA
USA
Data protection and business continuity come at a price, both money-wise
and performance-wise. The architecture and components of a Lustre system
should be determined by deciding which component failures the system should
be able to tolerate.

If you want to tolerate OSS server failure, you will need:
1) 2x OSS clustered servers
2) More complex installation
3) FC/SAS non-caching HBAs
4) Shared disk with NVRAM capability
5) Less performance per dollar

If you want to tolerate disk failures:
1) Soft or hard RAID on LUNs
2) RAID costs some performance

If you want to tolerate tray failure:
1) Vertical RAID sets
2) However, if a tray fails in a vertical RAID 5 setup, this will cause a
   lot of problems during rebuilding

If you want to tolerate an entire storage system failure:
1) Multiple storage systems behind an OSS (pair); you might use software
   RAID across the individual disk systems
2) Rebuild will take a lot of time (ZFS will offer dirty time logging and
   will cure this)

If you want to tolerate an OSS system failure:
1) You need to wait for Lustre network RAID; however, this will put some
   pressure on clients

Today, distributed disk systems do not offer much advantage when used with
Lustre. Replication across limited bandwidth and coherency problems limit
the performance, and the cost is high.

Lustre systems are normally tuned for performance, not HA, and Lustre can
tolerate an OSS failure without much problem. So it might not be wise to
spend a lot of money on Lustre HA when all you are going to gain is 1-2
hours more availability per year.

Best regards,
Mertol

From: lustre-discuss-bounces at clusterfs.com
[mailto:lustre-discuss-bounces at clusterfs.com] On Behalf Of Joe Kraska
Sent: 04 Ocak 2008 Cuma 05:59
To: Jeremy Mann; lustre-discuss at clusterfs.com
Subject: Re: [Lustre-discuss] Problems with failover

You currently need another mechanism (hardware or software RAID) to
provide data redundancy in case of disk failure. We are working to provide
data replication at the Lustre level, but that is not yet available.

I should say. That technology has me pretty excited. Right now, unless I
bend over backwards and do something like "vertical" RAID stripes/mirrors
across multiple disk trays in a storage cluster, I can end up with a very
bad situation if I lose an entire tray. This can have a potentially
devastating impact on my entire storage tier. A few companies here and
there (XIV, Isilon) are starting to abandon hardware RAID and are doing
block replication across the entire storage cluster. With that, I can
forget worrying about specific disks (except to replace them), and don't
even have to worry about whole trays (insofar as I have spare capacity).
This is a pretty neat capability. If you add to it the ability to
"rebalance" your cluster on the fly as new nodes are added, what you end up
with is a self-healing storage cluster. Pretty compelling for those
availability figures, and it can help with the disk-service pattern as
well.

Joe Kraska
San Diego CA
USA
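[Editorial note: as a rough illustration of the "vertical RAID set" idea
mentioned above - the device names are invented placeholders, and the real
layout depends entirely on how the trays are cabled and presented - each
software RAID array takes one member disk from each tray, so losing a whole
tray degrades every array by one member rather than destroying any single
array outright:]

    # Placeholder device names: /dev/trayN_disk0 stands for one disk taken
    # from each of five separate trays.
    mdadm --create /dev/md10 --level=5 --raid-devices=5 \
        /dev/tray1_disk0 /dev/tray2_disk0 /dev/tray3_disk0 \
        /dev/tray4_disk0 /dev/tray5_disk0

    # After a failed tray is replaced, re-add its member and let the array
    # rebuild (this is the long, painful rebuild Mertol warns about):
    mdadm --manage /dev/md10 --add /dev/tray2_disk0

[The trade-off, as noted above, is that a tray failure in such a RAID 5
layout leaves every array degraded at once and the rebuild window is
correspondingly risky.]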
On Thu, 2008-01-03 at 17:34 -0700, Andreas Dilger wrote:
> To be clear - Lustre failover has nothing to do with data replication.
> It is meant only as a mechanism to allow high-availability of shared
> disk. This means - more than one node can serve shared disk from a
> SAN or multi-port FC/SCSI disks.
>
> You currently need another mechanism (hardware or software RAID) to
> provide data redundancy in case of disk failure. We are working to
> provide data replication at the Lustre level, but that is not yet
> available.

Thank you Andreas, we were under the impression it was data replication as
well. We will have to redesign our filesystem based on this.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: 210-567-2672
On Thu, 2008-01-03 at 17:34 -0700, Andreas Dilger wrote:
> To be clear - Lustre failover has nothing to do with data replication.
> It is meant only as a mechanism to allow high-availability of shared
> disk. This means - more than one node can serve shared disk from a
> SAN or multi-port FC/SCSI disks.

How would one build a reliable system with 20 OSTs? Our system contains 20
compute nodes, each with 2 200GB drives in a RAID0 configuration. Each node
acts as an OST and as a failover partner for another, i.e. 0-1, 1-2, 3-4,
etc.

I can start from scratch, so I'm thinking of rebuilding the RAID arrays
with RAID1 to compensate for disk failures. But that still leaves me
questioning whether, if a node goes down or we lose another drive, we'll be
back to the same problems we've been having.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: 210-567-2672
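[Editorial note: a minimal sketch of the RAID1 rebuild being considered
here - device names, the filesystem name, and the NIDs are placeholders,
and the exact mkfs options depend on the existing configuration. Note that
because each node's disks are local, a --failnode entry on its own does not
give real failover: the partner node can only take over an OST that lives
on storage both nodes can physically reach, per Andreas's explanation
above.]

    # Mirror the two 200GB drives instead of striping them
    # (placeholder device names):
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

    # Reformat the OST on the mirror; --failnode is only useful if the
    # partner node can actually reach this same storage:
    mkfs.lustre --ost --fsname=testfs --mgsnode=mds@tcp0 \
        --failnode=node2@tcp0 /dev/md0
    mount -t lustre /dev/md0 /mnt/ost0

[RAID1 covers the single-drive-failure case that forced the earlier
rebuilds; a whole-node failure still takes that OST's data offline until
the node comes back.]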
Personally I strongly advise against using compute nodes to host any type
of storage service. If a user job crashes a compute node (it will actually
usually take out several), you're once again up a creek. I don't know of
any filesystem that could handle the failure of more than two or three
underlying storage components. Separating storage from computation was the
best decision I've ever made because it allows both to be scaled
independently. Am I totally missing the mark here?

If you still want to do this, try the gfarm filesystem; there's another
one, but I can't think of the name. If I find it I'll let you know.

On Jan 4, 2008, at 11:10 AM, Jeremy Mann wrote:

> On Thu, 2008-01-03 at 17:34 -0700, Andreas Dilger wrote:
>
>> To be clear - Lustre failover has nothing to do with data replication.
>> It is meant only as a mechanism to allow high-availability of shared
>> disk. This means - more than one node can serve shared disk from a
>> SAN or multi-port FC/SCSI disks.
>
> How would one build a reliable system with 20 OSTs? Our system contains
> 20 compute nodes, each with 2 200GB drives in a RAID0 configuration.
> Each node acts as an OST and a failover of each other, i.e. 0-1, 1-2,
> 3-4, etc.
>
> I can start from scratch, so I'm thinking of rebuilding the RAID arrays
> with RAID1 to compensate for disk failures. But that still leaves me
> questioning if a node goes down, or we lose another drive, if we'll be
> back to the same problems we've been having.
>
> --
> Jeremy Mann
> jeremy at biochem.uthscsa.edu
>
> University of Texas Health Science Center
> Bioinformatics Core Facility
> http://www.bioinformatics.uthscsa.edu
> Phone: 210-567-2672

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org