This has probably been asked and answered. Is software RAID (md) still considered bad practice?

I would like to use SSD drives for an MDT, but using fast SSD drives behind a RAID controller seems to defeat the purpose.

There was some thought that the decision not to support software RAID was mostly about Sun/Oracle trying to sell hardware RAID.

Thoughts?

--
Brian O'Connor
-----------------------------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA
http://www.sgi.com/support/services
-----------------------------------------------------------------------
I believe the bias against software RAID is mostly historical. I use software RAID exclusively for my Lustre installations here and have never seen any problem with it. The argument used to be that dedicated hardware removed the overhead of the OS having to control the arrays, and that RAID in general took too much CPU and memory, but the md stack has been drastically improved since then (over a decade ago), and now I see very little evidence of this being a problem.

My argument against hardware RAID is that if you lose a controller, you lose the RAID completely.

Just my 2 cents.

Jason

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Thursday, 24 March 2011 03:55
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] software raid

> Is software raid (md) still considered bad practice?
>
> I would like to use ssd drives for an mdt, but using fast ssd drives
> behind a raid controller seems to defeat the purpose.
Hi Brian

Long time no speak.

Anyway, we used to use software RAID exclusively but have slowly stopped. We are using 3ware cards in Rackable nodes now. All going well so far. For our MDS, though, we are running a 3-way mirror on SAS disks.

md has a few issues... all of them tend to end at the same place: losing data. We have had situations where md returns crap data because it's getting it from one disk but doesn't actually verify it against the other disks (and the disk hasn't actually thrown hardware errors)... you manually fail the disk and all of a sudden the file is no longer corrupt.

We have also had situations where md says the write occurred successfully, but really it has just hit the cache on the disk and hasn't been committed to the platter... and a short time later the disk reports the error to md, but for a much earlier read/write. The data is now corrupt on disk and flushed from all of Lustre's caches.

With all our software RAID we now run /sbin/hdparm -W 0 "$dev" to disable write caching on the disks. This has helped, but obviously hurts performance.

--
Dr Stuart Midgley
sdm900 at gmail.com

On 24/03/2011, at 10:54 AM, Brian O'Connor wrote:

> Is software raid(md) still considered bad practice?
>
> I would like to use ssd drives for an mdt, but using fast ssd drives
> behind a raid controller seems to defeat the purpose.
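For reference, a minimal sketch of disabling the on-disk write cache across a set of md member disks, in the spirit of Stuart's hdparm command. The device list /dev/sd[a-d] and running it from rc.local are assumptions for illustration, not details from his setup:

  # Disable the volatile write cache on each disk behind the md arrays.
  # Run this at boot (e.g. from /etc/rc.local), since some drives revert
  # to their default cache setting after a power cycle.
  for dev in /dev/sd[a-d]; do        # adjust to match your member disks
      /sbin/hdparm -W 0 "$dev"
  done

Turning the cache off trades write performance for the guarantee that an acknowledged write is actually on the platter, which is exactly the failure mode Stuart describes.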
Historically, Linux software RAID had multiple issues, so we did not advise using it. Those issues were, afaik, fixed long ago, and we changed the advice. Sun/Oracle sold a product that was based on software RAID - there are no unique issues using soft RAID with Lustre.

Performance/reliability is a whole 'nother set of topics - there are reasons why people buy the expensive flavors.

cliffw

On Thu, Mar 24, 2011 at 3:34 AM, Stuart Midgley <sdm900 at gmail.com> wrote:

> md has a few issues... all of them tend to end at the same place: losing data.

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
Choosing SW vs HW RAID depends entirely on your systems, pocketbook and taste. I think there are two edge cases which are pretty unambiguous:

- modern systems have obscene CPU power and memory bandwidth, at least compared to disks, and even compared to the embedded CPU in RAID cards. This means that software RAID is very fast and attractive for at least moderate numbers of disks. Because disks are so incredibly cheap, it's almost a shame not to use the 6+ SATA ports present on every motherboard, for instance.

- if you need to minimize CPU and memory-bandwidth overheads, or address very large numbers of disks, you want as much hardware assist as you can get, even though it's expensive and wimpy. Having 100 15k rpm SAS disks as JBOD under SW RAID would make little sense, since the disks, expanders, backplanes and controllers overwhelm the cost savings.

I think it boils down to your personal weighting of factors in TCO. "Classic" best practice, for instance, emphasizes device reliability to maintain extreme uptime and minimize admin monkeywork. That's fine, but it's completely opposite to the less ideological, more market-reality-driven approach that recognizes that disks cost $30/TB and dropping, and that with appropriate use of redundancy, mass-market hardware can still achieve however many nines you set your heart on.

It is convenient that a 2U node supporting 6-12 disks can be done with the free/builtin controller under SW RAID and delivers bandwidth that matches the relevant network interfaces (10G, IB). I like the fact that a single unit like that has no "extra" firmware to maintain, and no over-smart controllers to go bonkers. IPMI power control includes the disks. SMART works directly. And in a pinch, the content can be brought online via any old PC.

I've used MD since it was new in the kernel, and never had problems with it.

regards, mark hahn
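A minimal sketch of the kind of single-node setup Mark describes, a software array built directly on the motherboard's SATA ports. The device names, RAID level and chunk size below are illustrative assumptions, not a recommendation:

  # Build a 6-disk RAID-6 from the onboard SATA ports.
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=128 \
        /dev/sd[b-g]

  # Record the array so it is assembled automatically at boot.
  mdadm --detail --scan >> /etc/mdadm.conf

  # Watch the initial resync.
  cat /proc/mdstat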
I have done both SW and HW RAID across both OSTs and MDTs. As part of your choice, look into what happens when you have to replace a failed disk in a SW configuration. My negatives for SW RAID are all about management at this point.

When you pull a bad disk out of a Linux box (/dev/sde for example) and insert a new disk, the new disk will not always come back as sde; it will come back as the first available device letter. When you reconfigure your partitions and add the disk back into your array, you will have to remember to tweak the partitions on the new drive letter. When you reboot, your device letters will sort themselves back out and that new disk will again go back to sde, if that is its placement on the controller. If your machine has been up for a long time with a few failed disks, you may have multiple holes in your dev lettering. Not a big deal for one or two, but when you have hundreds of machines, you will probably have an ops team that does the work, not you.

When you reboot a machine that has a failed disk in the array (degraded), the array will not start by default in a degraded state. If you have LVMs on top of your RAID arrays, they will also not start. You will need to log into the machine, manually force-start the array in a degraded state, and then manually start the LVM on top of the SW RAID array.

By default, grub does not install on multiple disks. Assuming you also RAID your boot disks, you will need to manually put your boot loader on the front of each bootable disk.

Some controllers have a memory of which disks are inserted into which slots. They will not present disks beyond a certain number to the BIOS for booting. If you replace the boot disks too many times, they will no longer present a bootable disk to the BIOS. The only way to correct this for the controller I have worked with is to pull all but one non-bootable disk, then boot into the controller firmware and clear the device memory, then reconnect all of the disks. (We only discovered this issue in the lab, and haven't seen it yet in production.)

In my experience, maintenance for Linux SW RAID is significantly more difficult than for HW RAID.

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Wednesday, March 23, 2011 8:55 PM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] software raid
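For what it's worth, a rough sketch of the manual steps Andrew is describing: force-starting a degraded array, bringing up the LVM on top of it, and putting the boot loader on every bootable disk. The array, volume group and device names are placeholders:

  # Force the degraded array to start despite the missing member.
  mdadm --assemble --run /dev/md0 /dev/sda1 /dev/sdc1

  # Activate the volume group sitting on top of it.
  vgchange -ay vg_mdt

  # Install grub on each member of a mirrored boot array, so the box
  # can still boot if the first disk is the one that fails.
  grub-install /dev/sda
  grub-install /dev/sdb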
Hello!

On Mar 28, 2011, at 4:43 PM, Lundgren, Andrew wrote:
> When you reboot a machine that has a failed disk in the array (degraded), the array will not start by default in a degraded state. If you have LVMs on top of your raid arrays, they will also not start. You will need to log into the machine, manually force start the array in a degraded state and then manually start the LVM on top of the SW raid array.

I am with you on everything but this point. In my experience Linux SW RAID does start when the array is degraded, unless you have --no-degraded as a default mdadm option, of course.

There is a subtle case where it does behave strangely, and I see it on just one of my nodes: all devices claim they were stopped cleanly, yet they disagree about the number of events processed. In this case the array still starts in degraded mode, but the one disk with the outlying event counter is kicked from the array and is not rebuilt until you manually re-add it. I have seen it only with RAID5 so far, and the theory is that a disk controller (or the disks themselves?) in that particular node is bad and does not flush its cache when asked and on power off. Of course, if you miss this degraded state and don't re-add anything, there is a chance that on the next reboot the two remaining disks will get out of sync as well, and then the array will fail to start completely.

Surprisingly, what totally fixed this issue for me was enabling bitmaps (of course, if you don't want the negative performance impact of those, you need to set them up on a separate device).

Bye,
Oleg
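A sketch of the bitmap setup Oleg mentions. The internal variant is the simple one; the external bitmap avoids the write overhead but must live on a filesystem that is not on the array itself (the file path here is just an example):

  # Add a write-intent bitmap to an existing array (internal, simplest):
  mdadm --grow /dev/md0 --bitmap=internal

  # Or keep the bitmap on a separate device to avoid the overhead:
  mdadm --grow /dev/md0 --bitmap=/boot/md0-bitmap

  # With a bitmap, re-adding a kicked member only resyncs the blocks
  # that changed while it was out:
  mdadm /dev/md0 --re-add /dev/sdb1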
We have also had a few kernel panics at the same time as a failed disk. I don't know which came first, but anecdotally it seems we might be seeing an occasional kernel panic with a disk failure on SW RAID... Though that is still just FUD, so don't put stock in it unless you see it.

-----Original Message-----
From: Oleg Drokin [mailto:green at whamcloud.com]
Sent: Monday, March 28, 2011 3:57 PM
To: Lundgren, Andrew
Cc: Brian O'Connor; lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] software raid
Hello

For me, hardware RAID (3ware card) is important for hotplug and having no downtime. I hate shutting down / rebooting a server :)

But software RAID works fine and consumes few resources if it's a mirror RAID.

I think using software RAID on a cluster is a source of problems we can avoid. It's just my opinion :)

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Lundgren, Andrew
Sent: Tuesday, 29 March 2011 00:18
To: Oleg Drokin
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] software raid
Hi,

Just as a clarification/update: I have done both software and hardware RAID. The issue with a replacement device not coming back at the same drive letter or position was mitigated by using LABEL=disk5 (or whatever string) so that the mounts are placed into position by label. Newer versions of software RAID use the physical drive serial number (s/n) or another unique identifying number obtained from the hardware itself, for example root=UUID=21c81788-30ea-4e5d-ad9b-a00a0be5ce7e.

I have had hardware RAID cards early on that were not capable of this behavior. Now the choice is entirely up to the administrator/user as to preference.

Cheers!
megan
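A small sketch of the label/UUID approach megan describes, so that a replaced disk's new device letter doesn't matter. The label, filesystem and mount point are made-up examples:

  # Label the filesystem on the array once:
  e2label /dev/md0 disk5

  # Find its UUID if you prefer UUID= references:
  blkid /dev/md0

  # Then mount by label or UUID rather than by device letter,
  # e.g. in /etc/fstab:
  #   LABEL=disk5   /mnt/disk5   ext3   defaults   0 0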