I am currently having a debate about the best way to carve up Dell MD3200s to be used as OSTs in a Lustre file system, and I invite this community to weigh in...

I am of the opinion that it should be set up as multiple RAID groups each having a single LUN, with each RAID group representing an OST, while my colleague feels that it should be set up as a single RAID group across the whole array with multiple LUNs, with each LUN representing an OST.

Does anyone in this group have an opinion (one way or another)?

Regards,
Ron Jerome
Michael Shuey
2013-Mar-09 15:43 UTC
[Lustre-discuss] [HPDD-discuss] Disk array setup opinions
If you can, I'd advocate the route you suggest - multiple RAID groups, each group maps to a unique LUN, and each LUN is an OST. Note that you'll likely want the number of data disks in each RAID to be a power of 2 (e.g., 6- or 10-disk raid6, 5- or 9-disk raid5). Obviously, you'll be spending more spindles on overhead (RAID parity), but performance is more predictable.

The other way (single RAID, multiple LUNs, LUN == OST) means the performance of the OSTs isn't independent - they all bottleneck on the same RAID array. If you have enough RAID controller bandwidth, this can (theoretically) work, but it makes hunting and fixing performance problems more complex. In Lustre, if it's not writing fast enough, you can just stripe over more OSTs. However, if your OSTs aren't really independent, that may or may not help - you'll get different bandwidth depending on how many OSTs are sharing the same pool of physical disks. I'd expect two OSTs that don't share drives to write faster than two that do, and so on.

BTW, if you have multiple controllers, and the LUN platform has a sense of controller affinity (i.e., a LUN uses one controller as "primary" and another as "secondary" or "backup"), try to balance your RAIDs across the two controllers in your array. For instance, stick even-numbered LUNs on one, odd-numbered LUNs on the other. Also, if you're doing multi-pathing into your OSSes, make sure your multipath drivers are aware of this arrangement, and respect it.

Most midrange disk trays will do multipath, and cache mirroring between controllers - but if you read the fine print, you often find that access through the secondary controller is MUCH slower. It's usually implemented as a write-through to the primary, or has its cache disabled while the primary is active, etc. Cache mirroring at high speed is hard, complicated, and expensive, so vendors often implement only what's minimally necessary to do failover - even if it means the secondary controller doesn't cache a LUN unless the primary dies. If you have one of these (and I've no idea if Dell's 3200 does this, but this behavior is common enough that I'd think about it), you'll want to split LUNs evenly between controllers to maximize the cache use. You'll also want to make sure the OSS knows which path is primary for which LUN, so it doesn't send traffic down the wrong path (or worse, down both - round-robin balancing is a bad idea when the paths are asymmetric) unless there's been a hardware failure.

BTW, if you implemented a single RAID group and exported multiple LUNs, any multi-controller effects can get way more complicated - and are highly implementation-dependent.

TL;DR - Multi-raid, RAID group == LUN == OST. Keep OSTs as independent as you can, and watch your controller and OSS multipath settings (if used).

--
Mike Shuey
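The controller-balancing advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything specific to the MD3200: the controller labels "A" and "B", the path names, and both function names are hypothetical. It shows even-numbered LUNs going to one controller and odd-numbered LUNs to the other, and an OSS using the non-preferred controller only when no preferred path is alive.

```python
# Hypothetical controller labels; a real array names its controllers differently.
CONTROLLERS = ("A", "B")


def assign_preferred_controllers(lun_ids, controllers=CONTROLLERS):
    """Map each LUN to a preferred controller: even LUNs to the first, odd to the second."""
    return {lun: controllers[lun % len(controllers)] for lun in lun_ids}


def choose_paths(lun, owner_by_lun, live_paths):
    """Pick the paths an OSS should use for one LUN.

    live_paths is a list of (path_name, controller) tuples that are currently up.
    Paths through the preferred controller are used exclusively; paths through
    the other controller are used only when no preferred path is alive.
    """
    preferred = [path for path, ctrl in live_paths if ctrl == owner_by_lun[lun]]
    if preferred:
        return preferred
    # Failover case: the preferred controller is unreachable, so use what is left.
    return [path for path, _ in live_paths]


owners = assign_preferred_controllers(range(6))
print(owners)
```

With six LUNs this puts three on each controller. Note that `choose_paths` never mixes preferred and non-preferred paths, which is the point about avoiding round-robin across asymmetric paths.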
1 RG -> 1 LUN -> 1 OST is the way to go. Also, max out the cache block size (should be 32k), and pay for the High Performance Keys. Both of those changes make a huge difference in performance.

-Ben Evans
The best configuration to go for should be based on your file sizes, file count and read / write patterns. That said, you are right, Jerome.

*In general -*
-----------------

The best RAID configuration would be one that aligns with the 1MB I/O size of Lustre.

Say you have 1 x MD3200 and 4 x MD1200 expansion arrays. That would give you 60 disks. So the best option here would be to create 6 RAID6 arrays of 10 disks each. In this case, you would end up having 6 RAID6 arrays = 6 LUNs = 6 OSTs.

In each 10-disk RAID6 array, you would get 8 data disks and 2 parity disks. Now if you divide 1MB (I/O size) by 8 data disks, it gives you *128KB* - the segment / chunk size you should go for. This config will align the I/O reads / writes with full RAID stripes, giving you the best performance possible with the disk set.

Divide the 6 OSTs among 2 OSS nodes, and configure the nodes to act as failover for each other. Just ensure that each node has enough RAM to support all 6 OSTs, in case the other node fails.

Hope this helps.

Regards,
Indivar Nair
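The arithmetic in this message can be checked with a short Python sketch. It is a rough illustration, not a sizing tool: the function name `plan_raid_groups` is hypothetical, and the figure of 12 disks per enclosure is an assumption inferred from the example above (5 enclosures giving 60 disks).

```python
LUSTRE_IO_BYTES = 1024 * 1024   # the 1 MB Lustre I/O size cited in the thread
DISKS_PER_ENCLOSURE = 12        # assumption: 5 enclosures -> 60 disks, as in the example


def plan_raid_groups(enclosures, disks_per_group=10, parity_disks=2, oss_nodes=2):
    """Work out the RAID group count and segment size for full-stripe 1 MB writes."""
    total_disks = enclosures * DISKS_PER_ENCLOSURE
    groups, leftover = divmod(total_disks, disks_per_group)
    data_disks = disks_per_group - parity_disks
    # A power-of-two data-disk count divides the 1 MB I/O evenly across the stripe.
    if data_disks <= 0 or data_disks & (data_disks - 1):
        raise ValueError(f"{data_disks} data disks is not a power of 2")
    return {
        "total_disks": total_disks,
        "osts": groups,                       # one RAID group == one LUN == one OST
        "unused_disks": leftover,
        "data_disks": data_disks,
        "segment_kib": LUSTRE_IO_BYTES // data_disks // 1024,
        "osts_per_oss_normal": -(-groups // oss_nodes),  # ceiling division
        "osts_per_oss_failover": groups,      # the surviving node serves every OST
    }


plan = plan_raid_groups(enclosures=5)
print(plan)
```

For 5 enclosures this gives 6 OSTs with 8 data disks each and a 128 KiB segment size, matching the figures above. Each OSS serves 3 OSTs normally and all 6 after a failover, which is why each node needs RAM for 6.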
Is capacity potential or performance potential more important to you? RAID type and segment size are characteristics of a disk group, not of a virtual disk... so the answer is right there, depending on your needs. Then we could move the conversation on to disaster recovery as well...

Regards,
Charles
On 11/03/13 15:30, Indivar Nair wrote:
> Divide the 6 OSTs among 2 OSS Nodes, and configure the nodes to act as
> failover to each other.
> Just ensure that each node has enough RAM to support 6 OSTs, in case one
> of them fails.

http://content.dell.com/uk/en/enterprise/d/hpcc/cambridge-hpc-solution-centre

has links to a couple of white papers with Dell MD3200s and MD1200s in a failover configuration. They use 2*(MD3200 + 4*MD1200 + server) to give a failover solution - which looks like exactly the setup you describe.

Chris
Yes, that's good stuff. Anyone planning to implement Lustre for the first time should read it. It helps you visualize your storage requirements very nicely.

And yes, the examples are similar. These configuration options are quite common when using Dell storage. This is also covered in the 'Lustre 2.0 Operations Manual' (821-2076-10.pdf), 10.1.1 Selecting Storage for the MDS or OSTs, Page 175. *(May have been moved to some other page in the newer version of the doc)*

I had implemented a similar configuration (with fewer disks) for one of my clients.

Regards,
Indivar Nair