This has probably been asked and answered. Is software RAID (md) still considered bad practice?

I would like to use SSD drives for an MDT, but using fast SSD drives behind a RAID controller seems to defeat the purpose.

There was some thought that the decision not to support software RAID was mostly about Sun/Oracle trying to sell hardware RAID.

Thoughts?

--
Brian O'Connor
-----------------------------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA
http://www.sgi.com/support/services
-----------------------------------------------------------------------
I believe the bias against software RAID is mostly historical. I use software RAID exclusively for my Lustre installations here and have never seen any problem with it. The argument used to be that dedicated hardware removed the overhead of the OS having to control the arrays, and that RAID in general took too much CPU and memory, but the md stack has been drastically improved since then (over a decade ago), and now I see very little evidence of this being a problem.

My argument against hardware RAID is that if you lose a controller, you lose the RAID completely.

Just my 2 cents.

Jason

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Thursday, 24 March 2011 03:55
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] software raid

> Is software raid (md) still considered bad practice?
>
> I would like to use ssd drives for an mdt, but using fast ssd drives
> behind a raid controller seems to defeat the purpose.
Hi Brian

Long time no speak.

Anyway, we used to use software RAID exclusively but have slowly stopped. We are using 3ware cards in Rackable nodes now. All going well so far. For our MDS, though, we are running a 3-way mirror on SAS disks.

md has a few issues... all of them tend to end at the same place: losing data. We have had situations where md returns crap data because it's getting it from one disk but doesn't actually verify it against the other disks (and the disk hasn't actually thrown hardware errors)... you manually fail the disk and all of a sudden the file is no longer corrupt.

We have also had situations where md says the write occurred successfully, but really it has just hit the cache on the disk and hasn't been committed to the platter... and a short time later the disk reports the error to md, but for a much earlier read/write. The data is now corrupt on disk and flushed from all of Lustre's caches.

With all our software RAID we now run /sbin/hdparm -W 0 "$dev" to disable write caching on the disks. This has helped, but obviously hurts performance.

--
Dr Stuart Midgley
sdm900 at gmail.com

On 24/03/2011, at 10:54 AM, Brian O'Connor wrote:

> Is software raid(md) still considered bad practice?
>
> I would like to use ssd drives for an mdt, but using fast ssd drives
> behind a raid controller seems to defeat the purpose.
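For reference, a minimal sketch of disabling the on-disk write cache across a set of md member disks, in the spirit of Stuart's hdparm command. The device list /dev/sd[a-d] and running it from rc.local are assumptions for illustration, not details from his setup:

  # Disable the volatile write cache on each disk behind the md arrays.
  # Run this at boot (e.g. from /etc/rc.local), since some drives revert
  # to their default cache setting after a power cycle.
  for dev in /dev/sd[a-d]; do        # adjust to match your member disks
      /sbin/hdparm -W 0 "$dev"
  done

Turning the cache off trades write performance for the guarantee that an acknowledged write is actually on the platter, which is exactly the failure mode Stuart describes.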
Historically, Linux software RAID had multiple issues, so we did not advise using it. Those issues were, afaik, fixed long ago, and we changed the advice. Sun/Oracle sold a product that was based on software RAID - there are no unique issues using soft RAID with Lustre.

Performance/reliability is a whole 'nother set of topics - there are reasons why people buy the expensive flavors.

cliffw

On Thu, Mar 24, 2011 at 3:34 AM, Stuart Midgley <sdm900 at gmail.com> wrote:

> md has a few issues... all of them tend to end at the same place: losing data.

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
Choosing SW vs HW RAID depends entirely on your systems, pocketbook and taste. I think there are two edge cases which are pretty unambiguous:

- modern systems have obscene CPU power and memory bandwidth, at least compared to disks, and even compared to the embedded CPU in RAID cards. This means that software RAID is very fast and attractive for at least moderate numbers of disks. Because disks are so incredibly cheap, it's almost a shame not to use the 6+ SATA ports present on every motherboard, for instance.

- if you need to minimize CPU and memory-bandwidth overheads, or address very large numbers of disks, you want as much hardware assist as you can get, even though it's expensive and wimpy. Having 100 15k rpm SAS disks as JBOD under SW RAID would make little sense, since the disks, expanders, backplanes and controllers overwhelm the cost savings.

I think it boils down to your personal weighting of factors in TCO. "Classic" best practice, for instance, emphasizes device reliability to maintain extreme uptime and minimize admin monkeywork. That's fine, but it's completely opposite to the less ideological, more market-reality-driven approach that recognizes that disks cost $30/TB and dropping, and that with appropriate use of redundancy, mass-market hardware can still achieve however many nines you set your heart on.

It is convenient that a 2U node supporting 6-12 disks can be done with the free/builtin controller under SW RAID and delivers bandwidth that matches the relevant network interfaces (10G, IB). I like the fact that a single unit like that has no "extra" firmware to maintain, and no over-smart controllers to go bonkers. IPMI power control includes the disks. SMART works directly. And in a pinch, the content can be brought online via any old PC.

I've used MD since it was new in the kernel, and never had problems with it.

regards, mark hahn
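A minimal sketch of the kind of single-node setup Mark describes, a software array built directly on the motherboard's SATA ports. The device names, RAID level and chunk size below are illustrative assumptions, not a recommendation:

  # Build a 6-disk RAID-6 from the onboard SATA ports.
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=128 \
        /dev/sd[b-g]

  # Record the array so it is assembled automatically at boot.
  mdadm --detail --scan >> /etc/mdadm.conf

  # Watch the initial resync.
  cat /proc/mdstat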
I have done both SW and HW RAID across both OSTs and MDTs. As part of your choice, look into what happens when you have to replace a failed disk in a SW configuration. My negatives for SW RAID are all about management at this point.

When you pull a bad disk out of a Linux box (/dev/sde for example) and insert a new disk, the new disk will not always come back as sde; it will come back as the first available device letter. When you reconfigure your partitions and add the disk back into your array, you will have to remember to tweak the partitions on the new drive letter. When you reboot, your device letters will sort themselves back out and that new disk will again go back to sde, if that is its placement on the controller. If your machine has been up for a long time with a few failed disks, you may have multiple holes in your dev lettering. Not a big deal for one or two, but when you have hundreds of machines, you will probably have an ops team that does the work, not you.

When you reboot a machine that has a failed disk in the array (degraded), the array will not start by default in a degraded state. If you have LVMs on top of your RAID arrays, they will also not start. You will need to log into the machine, manually force-start the array in a degraded state, and then manually start the LVM on top of the SW RAID array.

By default, grub does not install on multiple disks. Assuming you also RAID your boot disks, you will need to manually put your boot loader on the front of each bootable disk.

Some controllers have a memory of which disks are inserted into which slots. They will not present disks beyond a certain number to the BIOS for booting. If you replace the boot disks too many times, they will no longer present a bootable disk to the BIOS. The only way to correct this for the controller I have worked with is to pull all but one non-bootable disk, then boot into the controller firmware and clear the device memory, then reconnect all of the disks. (We only discovered this issue in the lab, and haven't seen it yet in production.)

In my experience, maintenance for Linux SW RAID is significantly more difficult than for HW RAID.

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Wednesday, March 23, 2011 8:55 PM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] software raid
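For what it's worth, a rough sketch of the manual steps Andrew is describing: force-starting a degraded array, bringing up the LVM on top of it, and putting the boot loader on every bootable disk. The array, volume group and device names are placeholders:

  # Force the degraded array to start despite the missing member.
  mdadm --assemble --run /dev/md0 /dev/sda1 /dev/sdc1

  # Activate the volume group sitting on top of it.
  vgchange -ay vg_mdt

  # Install grub on each member of a mirrored boot array, so the box
  # can still boot if the first disk is the one that fails.
  grub-install /dev/sda
  grub-install /dev/sdb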
Hello!

On Mar 28, 2011, at 4:43 PM, Lundgren, Andrew wrote:
> When you reboot a machine that has a failed disk in the array (degraded), the array will not start by default in a degraded state. If you have LVMs on top of your raid arrays, they will also not start. You will need to log into the machine, manually force start the array in a degraded state and then manually start the LVM on top of the SW raid array.

I am with you on everything but this point. In my experience Linux SW RAID does start when the array is degraded, unless you have --no-degraded as a default mdadm option, of course.

There is a subtle case where it does behave strangely, and I see it on just one of my nodes: all devices claim they were stopped cleanly, yet they disagree about the number of events processed. In this case the array still starts in degraded mode, but the one disk with the outlying event counter is kicked from the array and is not rebuilt until you manually re-add it. I have seen it only with RAID5 so far, and the theory is that a disk controller (or the disks themselves?) in that particular node is bad and does not flush its cache when asked and on power off. Of course, if you miss this degraded state and don't re-add anything, there is a chance that on the next reboot the two remaining disks will get out of sync as well, and then the array will fail to start completely.

Surprisingly, what totally fixed this issue for me was enabling bitmaps (of course, if you don't want the negative performance impact of those, you need to set them up on a separate device).

Bye,
Oleg
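A sketch of the bitmap setup Oleg mentions. The internal variant is the simple one; the external bitmap avoids the write overhead but must live on a filesystem that is not on the array itself (the file path here is just an example):

  # Add a write-intent bitmap to an existing array (internal, simplest):
  mdadm --grow /dev/md0 --bitmap=internal

  # Or keep the bitmap on a separate device to avoid the overhead:
  mdadm --grow /dev/md0 --bitmap=/boot/md0-bitmap

  # With a bitmap, re-adding a kicked member only resyncs the blocks
  # that changed while it was out:
  mdadm /dev/md0 --re-add /dev/sdb1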
We have also had a few kernel panics at the same time as a failed disk. I don't know which came first, but anecdotally it seems we might be seeing an occasional kernel panic with a disk failure on SW RAID... Though that is still just FUD, so don't put stock in it unless you see it.

-----Original Message-----
From: Oleg Drokin [mailto:green at whamcloud.com]
Sent: Monday, March 28, 2011 3:57 PM
To: Lundgren, Andrew
Cc: Brian O'Connor; lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] software raid
Hello

For me, hardware RAID (3ware card) is important for hotplug and having no downtime. I hate shutting down / rebooting a server :)

But software RAID works fine and consumes few resources if it's a mirror RAID.

I think using software RAID on a cluster is a source of problems we can avoid. It's just my opinion :)

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Lundgren, Andrew
Sent: Tuesday, 29 March 2011 00:18
To: Oleg Drokin
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] software raid
Hi,

Just as a clarification/update: I have done both software and hardware RAID. The issue with a replacement device not coming back at the same drive letter or position was mitigated by using LABEL=disk5 (or whatever string) so that the mounts are placed into position by label. Newer versions of software RAID use the physical drive serial number (s/n) or another unique identifying number obtained from the hardware itself, for example root=UUID=21c81788-30ea-4e5d-ad9b-a00a0be5ce7e.

I have had hardware RAID cards early on that were not capable of this behavior. Now the choice is entirely up to the administrator/user as to preference.

Cheers!
megan
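A small sketch of the label/UUID approach megan describes, so that a replaced disk's new device letter doesn't matter. The label, filesystem and mount point are made-up examples:

  # Label the filesystem on the array once:
  e2label /dev/md0 disk5

  # Find its UUID if you prefer UUID= references:
  blkid /dev/md0

  # Then mount by label or UUID rather than by device letter,
  # e.g. in /etc/fstab:
  #   LABEL=disk5   /mnt/disk5   ext3   defaults   0 0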