Daire Byrne
2011-Jul-05 16:33 UTC
[Lustre-discuss] RAID cards - what works well with Lustre?
Hi,

I have been testing some LSI 9260 RAID cards for use with Lustre v1.8.6 but have found that the "megaraid_sas" driver is not really able to facilitate the 1MB full stripe IOs that Lustre likes. This topic has also come up recently in the following two email threads:

http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/65a1fdc312b0eccb#
http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/fcf39d85b7e945ab

I was able to raise max_hw_sectors_kb to 1024 by setting the "max_sectors" megaraid_sas module option, but found that the IOs were still being pretty fragmented:

  disk I/O size      ios   % cum % |    ios   % cum %
  4K:               3060   0     0 |   2611   0     0
  8K:               3261   0     0 |   2664   0     0
  16K:              6408   0     1 |   5296   0     1
  32K:             13025   1     2 |  10692   1     2
  64K:             48397   4     6 |  26417   2     4
  128K:            50166   4    10 |  42218   4     9
  256K:           113124   9    20 |  86516   8    17
  512K:           677242  57    78 | 448231  45    63
  1M:             254195  21   100 | 355804  36   100

So next I looked at the sg_tablesize and found it was being set to "80" by the driver (which queries the firmware). I tried to hack the driver and increase this value but bad things happened, so it looks like it is a genuine hardware limit with these cards.

The overall throughput isn't exactly terrible because the RAID write-back cache does a reasonable job, but I suspect it could be better, e.g.

  ost 3 sz 201326592K rsz 1024K obj 192 thr 192 write 1100.52 [ 231.75, 529.96] read  940.26 [ 275.70, 357.60]
  ost 3 sz 201326592K rsz 1024K obj 192 thr 384 write 1112.19 [ 184.80, 546.43] read 1169.20 [ 337.63, 462.52]
  ost 3 sz 201326592K rsz 1024K obj 192 thr 768 write 1217.79 [ 219.77, 665.32] read 1532.47 [ 403.58, 552.43]
  ost 3 sz 201326592K rsz 1024K obj 384 thr 384 write  920.87 [ 171.82, 466.77] read  901.03 [ 257.73, 372.87]
  ost 3 sz 201326592K rsz 1024K obj 384 thr 768 write 1058.11 [ 166.83, 681.25] read 1309.63 [ 346.64, 484.51]

All of this brings me to my main question - what internal cards have people here used which work well with Lustre? 3ware, Areca or other models of LSI?

Cheers,

Daire
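P.S. For anyone who wants to poke at the same knobs on their own hardware, these are roughly the commands involved (a sketch only - sdX, host0 and the OST name are placeholders for whatever your system uses, and the /proc path is the usual Lustre 1.8 location):

  # set the megaraid_sas request size limit at module load time
  # (e.g. in /etc/modprobe.conf)
  options megaraid_sas max_sectors=1024

  # check what the block layer actually negotiated
  cat /sys/block/sdX/queue/max_hw_sectors_kb
  cat /sys/block/sdX/queue/max_sectors_kb

  # scatter-gather table size reported by the HBA/driver
  cat /sys/class/scsi_host/host0/sg_tablesize

  # per-OST disk I/O size histogram (the table quoted above)
  cat /proc/fs/lustre/obdfilter/fsname-OST0000/brw_stats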
Charles Taylor
2011-Jul-05 17:58 UTC
[Lustre-discuss] RAID cards - what works well with Lustre?
We use Adaptec 51245s and 51645s with

1. max_hw_sectors_kb=512
2. RAID5 4+1 or RAID6 4+2
3. RAID chunk size = 128KB

So each 1 MB Lustre RPC results in two 4-way striped writes with no read-modify-write penalty. We can further improve write performance by matching the max_pages_per_rpc (per OST on the client side), i.e. the max RPC size, to the max_hw_sectors_kb setting for the block devices. In this case

max_pages_per_rpc=128

instead of the default 256, at which point you have one RAID-stripe write per RPC.

If you put your OSTs atop LVs (LVM2) as we do, you will want to take the additional step of making sure your LVs are aligned as well:

pvcreate --dataalignment 1024S /dev/sd$driveChar

You need a fairly new version of LVM2 that supports the --dataalignment option. We are using lvm2-2.02.56-8.el5_5.6.x86_64.

Note that we attempted to increase the max_hw_sectors_kb for the block devices (RAID LDs) to 1024, but in order to do so we needed to change the Adaptec driver (aacraid) kernel parameter to acbsize=8192, which we found to be unstable. For our Adaptec drivers we use:

options aacraid cache=7 msi=2 expose_physicals=-1 acbsize=4096

Note that most of the information above was the result of testing and tuning performed here by Craig Prescott. We now have close to a PB of such storage in production here at the UF HPC Center. We used Areca cards at first but found them to be a bit too flaky for our needs. The Adaptecs seem to have some infant mortality issues - we RMA about 10% to 12% of newly purchased cards - but if they make it past initial burn-in testing, they tend to be pretty reliable.

Regards,

Charlie Taylor
UF HPC Center

On Jul 5, 2011, at 12:33 PM, Daire Byrne wrote:

> [...]
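P.S. To spell out the arithmetic: 4 data disks x 128KB chunk = 512KB full stripe, so a 1MB RPC is exactly two full-stripe writes, and capping the RPC at 128 pages (512KB) makes it one. Applying the settings looks roughly like this (a sketch only - sdX and "fsname" are placeholders):

  # OSS side: cap request size at one full RAID stripe (512 KB)
  echo 512 > /sys/block/sdX/queue/max_sectors_kb

  # client side: limit the Lustre RPC to 128 pages (128 x 4 KB = 512 KB)
  lctl set_param osc.fsname-OST*.max_pages_per_rpc=128

  # align LVM physical volumes to the 512 KB (1024-sector) stripe
  pvcreate --dataalignment 1024S /dev/sdX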
Daire Byrne
2011-Jul-18 14:54 UTC
[Lustre-discuss] RAID cards - what works well with Lustre?
Thanks for the replies and insight. I played around with various sg_tablesize and max_hw_sectors_kb values but it didn't seem to make much actual difference to the overall performance (using obdfilter-survey). I guess the RAM cache in these cards helps to smooth things out despite the fact that the IOs going to disk are rarely ever 1MB.

I'll test an Adaptec card too, but we don't really want to do 4+2 and lose so many disks to RAID6. I'll see how they perform with 8+2.

Thanks again,

Daire

On Tue, Jul 5, 2011 at 6:58 PM, Charles Taylor <taylor at hpc.ufl.edu> wrote:

> [...]
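P.S. With 8+2 and a 128KB chunk the full stripe would be 8 x 128KB = 1MB, i.e. one full-stripe write per 1MB RPC. For anyone curious, the obdfilter-survey runs are plain local-disk surveys; the invocation looks roughly like this (a sketch using the lustre-iokit script's parameter names and a placeholder OST name - scale size and the object/thread counts up for a real run):

  # local (disk) mode survey of a single OST
  nobjhi=2 thrhi=2 size=1024 targets="fsname-OST0003" sh obdfilter-survey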
lior amar
2011-Jul-19 06:19 UTC
[Lustre-discuss] RAID cards - what works well with Lustre?
Hi Daire,

I am currently building a Lustre system using the LSI 9265 (the successor of the LSI 9260); it has 1 GB of RAM and a dual-core processor. My experience with the card is quite good (RAID6 performance of this card is very good).

I am using it on my OSSs with the following configuration:

18 x 1TB (7200 rpm) SAS disks in a RAID6 (16 data + 2) with a 256KB stripe unit
2 x 15K SAS disks in RAID1 for the journal

I know the Lustre manual recommends using a stripe unit that will give a total of 1MB, but the performance I got when using 64KB stripe units (16 * 64KB = 1MB) was lower than with 256KB.

The max_sectors_kb is limited to 280 so I set it to 256, and I tested various I/O schedulers without seeing any difference.

Raw performance of the RAID6: ~960MB/sec write, 1.5GB/sec read. Lustre performance using obdfilter-survey (local mode) is also good. I will email you the results later (not in the office now).

Please let me know if there are any more details I can provide that might help you.

Regards

--lior

----------------------oo--o(:-:)o--oo----------------
Lior Amar, Ph.D.
Cluster Logic Ltd --> The Art of HPC
www.clusterlogic.net
----------------------------------------------------------
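P.S. The block-device tuning above is just sysfs knobs; this is the sort of thing I mean (a sketch with a placeholder device name - the settings do not persist across reboots unless you put them in an init script):

  # cap request size at the RAID chunk size (the card limit here is 280 KB)
  echo 256 > /sys/block/sdX/queue/max_sectors_kb

  # try a different elevator (made no measurable difference in my tests)
  echo deadline > /sys/block/sdX/queue/scheduler
  cat /sys/block/sdX/queue/scheduler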
Daire Byrne
2011-Jul-19 17:49 UTC
[Lustre-discuss] RAID cards - what works well with Lustre?
Lior,

I am curious how you have the 16+2 drives connected to the LSI-9265. It has eight internal ports so I'm guessing you are using some sort of SAS fan-out expander?

We are thinking of using 3TB drives in a 36-bay enclosure/server, so 3 x RAID cards (8+2 RAID6 + 1 spare each) seemed like a good fit. It also means the resulting volumes are pretty much bang on the 24TB that Lustre v1.8.6 now supports.

Daire

On Tue, Jul 19, 2011 at 7:19 AM, lior amar <liororama at gmail.com> wrote:

> [...]
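P.S. The back-of-the-envelope numbers for that layout (ignoring the TB/TiB distinction and filesystem overhead):

  # 36-bay chassis split across 3 RAID cards:
  # 8 data + 2 parity (RAID6) + 1 hot spare = 11 drives per card
  echo $((3 * 11))   # 33 bays used, 3 free
  # usable capacity of one 8+2 set with 3TB drives
  echo $((8 * 3))    # 24 TB, right at the Lustre v1.8.6 OST size limit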
lior amar
2011-Jul-19 18:38 UTC
[Lustre-discuss] RAID cards - what works well with Lustre?
Hi Daire,

On Tue, Jul 19, 2011 at 8:49 PM, Daire Byrne <daire.byrne at gmail.com> wrote:

> I am curious how you have the 16+2 drives connected to the LSI-9265. It has
> eight internal ports so I'm guessing you are using some sort of SAS fan-out
> expander?

Correct. We are using a SuperMicro server with 24 drive bays. The server has only one backplane to the disks, so SAS expanders are used. We use 18 drives (+1 hot spare) for data and 2 (+1 hot spare) for the journal. We don't use failover at the OSS level.

--lior