Hi,

Which method of RAID is preferred with Lustre: hardware or software?

This might seem like a daft question, but I'm a newbie. The list, the
operations guide and various "best practice" papers do not appear to
express a preference.

I'm thinking about a system with:

* 2x failover MDS with a RAID1 or RAID10 volume
* 2x failover OSS with RAID5 or RAID6 volumes.

I'm trying to gauge whether it's worth having shared storage arrays for
each failover set with hardware RAID, or just leaving them as
dual-attached JBODs.

My instinct says that hardware RAID from a reputable vendor is best -
particularly because there's a battery-backed cache - but I see from the
lists that Lustre has put a lot of effort into improving the Linux MD
RAID layer.

Thanks,

Mark

-- 
-----------------------------------------------------------------
Mark Dixon                       Email : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
On Nov 20, 2008 12:26 +0000, Mark Dixon wrote:
> Which method of RAID is preferred with Lustre: hardware or software?
>
> This might seem like a daft question, but I'm a newbie. The list,
> operations guide and various "best practice" papers do not appear to
> express a preference.

The great thing about Lustre is that you can make this decision entirely
based on what kind of price/performance/reliability you need, and not
because of a specific hardware requirement.

> I'm thinking about a system with:
>
> * 2x failover MDS with a RAID1 or RAID10 volume
> * 2x failover OSS with RAID5 or RAID6 volumes.
>
> I'm trying to gauge whether it's worth having shared storage arrays for
> each failover set with hardware RAID, or just leave them as
> dual-attached JBODs.
>
> My instinct says that hardware RAID from a reputable vendor is best -
> particularly because there's a battery-backed cache - but I see from the
> lists that Lustre has put a lot of effort in improving the Linux MD RAID
> layer.

There are a number of large clusters (TACC Ranger in particular) that use
software RAID on JBODs, but the majority of systems use hardware RAID in
order to maximize performance (at an increased cost, of course).

These days I would tend to recommend RAID-6 over RAID-5, just because
today's large disks take a long time to rebuild, and there is a non-zero
risk of a second disk failing during that time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
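Andreas's rebuild-window argument can be put into rough numbers. A
minimal back-of-envelope sketch in Python, assuming an exponential
failure model, an illustrative per-drive MTBF of 1,000,000 hours and a
sustained rebuild rate of 50 MB/s (both figures are my assumptions for
illustration, not numbers from this thread):

```python
from math import exp

# Illustrative assumptions (not figures from the thread):
MTBF_HOURS = 1000000     # assumed per-drive mean time between failures
REBUILD_MBPS = 50.0      # assumed sustained rebuild rate

def second_failure_risk(surviving_drives, drive_tb):
    """Rough P(another drive fails before the rebuild completes).

    Models each surviving drive as failing independently at an
    exponential rate of 1/MTBF; the window of vulnerability is the
    time needed to rewrite one drive's worth of data.
    """
    rebuild_hours = drive_tb * 1e6 / REBUILD_MBPS / 3600.0
    failure_rate = surviving_drives / MTBF_HOURS  # failures per hour
    return 1.0 - exp(-failure_rate * rebuild_hours)

# Bigger drives mean a longer rebuild window, hence more exposure:
print(second_failure_risk(10, 0.5))  # 500 GB drives
print(second_failure_risk(10, 1.0))  # 1 TB drives: roughly twice the risk
```

The absolute numbers are only as good as the assumed MTBF and rebuild
rate, but the scaling is the point: the risk grows linearly with drive
size, which is what makes the second parity disk of RAID-6 attractive.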
Thanks Klaus, Andreas,

"Correct and slow" sounds generally preferable to "wrong and fast" - so I
guess in this situation hardware RAID still wins.

I get worried when I hear Sun's marketing department talk about the
"RAID5 write hole" whenever they tout ZFS. Clearly, moving to RAID6 does
not solve this. Even if you have a UPS, systems sometimes still come down
hard.

It's interesting that TACC is using software RAID on JBODs, as the cost
of HW RAID would clearly not have been an issue for them.

Many thanks once again,

Mark

-- 
-----------------------------------------------------------------
Mark Dixon                       Email : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
Andreas Dilger wrote:
> There are a number of large clusters (TACC Ranger in particular) that
> use software RAID on JBODs, but the majority of systems use hardware
> RAID in order to maximize performance (at an increased cost of course).
>
> These days I would tend to recommend using RAID-6 over RAID-5 just
> because the large disks available take a long time to rebuild, and
> there is a non-zero risk of a second disk failing during that time.

What about using more, but smaller, RAID groups? For example, perhaps
4-5 drives in a RAID-5? That way, if a disk fails, the rebuilds are
faster since there is less data?

Jeff
On Monday 24 November 2008, Jeff Layton wrote:
> Andreas Dilger wrote:
> > There are a number of large clusters (TACC Ranger in particular) that
> > use software RAID on JBODs, but the majority of systems use hardware
> > RAID in order to maximize performance (at an increased cost of course).
> >
> > These days I would tend to recommend using RAID-6 over RAID-5 just
> > because the large disks available take a long time to rebuild, and
> > there is a non-zero risk of a second disk failing during that time.
>
> What about using more, but smaller raid groups? For example,
> perhaps 4-5 drives in a RAID-5? That way if a disk fails, the
> rebuilds are faster since there is less data?

I'd pick RAID-6 not so much for the time-window/"drive fail" risk as for
the read error rate. I've seen numbers for SATA drives of about one
unrecoverable read error every 10 TB or so. If so, then rebuilding a
10+1 RAID-5 is likely (on average) to see one sector read error - and
you're allowed zero to manage a perfect rebuild. A 5+1 set would be
about 50% likely to hit one, and so on.

How your RAID controller (or software) reacts to a failed sector read
varies, but behaviours include: continuing as if nothing happened (you
now have bad data on your RAID set), failing the offending drive (and
with it the rebuild and the entire RAID set), ...

/Peter
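Peter's arithmetic is easy to reproduce. A minimal sketch assuming the
roughly one-URE-per-10-TB rate he quotes; the Poisson step is my
addition, to turn the expected error count into a probability:

```python
from math import exp

URE_PER_TB = 1.0 / 10.0  # ~1 unrecoverable read error per 10 TB, as quoted

def expected_ures(data_read_tb):
    """Expected unrecoverable read errors while reading data_read_tb TB."""
    return data_read_tb * URE_PER_TB

def p_at_least_one(data_read_tb):
    """P(>= 1 read error), treating errors as a Poisson process."""
    return 1.0 - exp(-expected_ures(data_read_tb))

# Rebuilding a 10+1 RAID-5 of 1 TB drives reads the 10 surviving drives:
print(expected_ures(10.0))    # 1.0 expected error - and you're allowed zero
print(p_at_least_one(10.0))   # ~63% chance of hitting at least one
# A 5+1 set of the same drives reads only 5 TB:
print(expected_ures(5.0))     # 0.5, roughly the "50% likely" above
```

The drive size here (1 TB) is illustrative; the expected error count
scales with the total data read during the rebuild, which is why wider
RAID-5 sets of large drives are the worst case.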
Hey Jeff,

Smaller _drives_ usually have shorter rebuild times, but having _fewer_
drives normally makes little difference to rebuild times. Both a 4+1 and
an 8+1 need to write out "one drive" worth of data to the replacement
drive (and read every other drive in the RAID set once, but those reads
all happen in parallel).

Kevin

Jeff Layton wrote:
> What about using more, but smaller raid groups? For example,
> perhaps 4-5 drives in a RAID-5? That way if a disk fails, the
> rebuilds are faster since there is less data?
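Kevin's observation falls straight out of the arithmetic: if you model a
rebuild as rewriting one failed drive at a fixed rate, the group width
never appears in the formula. A sketch with an assumed 50 MB/s rebuild
rate (my figure, for illustration only):

```python
def rebuild_hours(drive_tb, rate_mbps=50.0):
    """Hours to rewrite one failed drive's worth of data.

    The surviving drives are read in parallel, so group width
    (4+1 vs 8+1) drops out: only drive size and rebuild rate matter.
    """
    return drive_tb * 1e6 / rate_mbps / 3600.0

# A 4+1 and an 8+1 of the same 1 TB drives rebuild in the same time;
# halving the drive size is what halves the rebuild:
print(rebuild_hours(1.0))
print(rebuild_hours(0.5))
```

In practice the read side can bottleneck on a shared bus or controller,
so very wide sets may rebuild somewhat slower than this model suggests,
but drive size remains the dominant term.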