Edward Walter
2010-Oct-19 20:42 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Hello All,

We're doing a fresh Lustre 1.8.4 install using Sun StorageTek 2540 arrays for our OST targets. We've configured these as RAID6 with no spares, which means we have the equivalent of 10 data disks and 2 parity disks in play on each OST.

We configured the "Segment Size" on these arrays at 512 KB. I believe this is equivalent to the "chunk size" in the Lustre operations manual (section 10.1.1). Based on the formulae in the manual, in order to have my stripe width fall below 1MB I need to reconfigure my "Segment Size" like this:

Segment Size <= 1024KB/(12-2) = 102.4 KB

so 16KB, 32KB or 64KB are optimal values. Does this seem right?

Do I really need to do this (reinitialize the arrays/volumes) to get my Segment Size below 1MB? What impact will/won't this have on performance?

When I format the OST filesystem, I need to provide options for both stripe and stride. The manual indicates that the units for these values are 4096-byte (4KB) blocks. Given that, I should use something like:

-E stride= (one of)
16KB/4KB = 4
32KB/4KB = 8
64KB/4KB = 16

stripe= (one of)
16KB*10/4KB = 40
32KB*10/4KB = 80
64KB*10/4KB = 160

so for example I would issue the following:

mkfs.lustre --mountfsoptions="stripe=160" --mkfsoptions="-E stride=16 -m 1" ...

Is it better to opt for the higher values or lower values here?

Also, does anyone have recommendations for "aligning" the filesystem so that the fs blocks align with the RAID chunks? We've done things like this for SSD drives. We'd normally give Lustre the entire RAID device (without partitions) so this hasn't been an issue in the past. For this installation though, we're creating multiple volumes (for size/space reasons) so partitioning is a necessary evil now.

Thanks for any feedback!

-Ed Walter
Carnegie Mellon University
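For readers following along, the manual's relationships can be written out as a small shell sketch. This is only an illustration of the arithmetic, assuming a 4KB ldiskfs block size and Lustre's 1MB client RPCs; the variable names are not from any tool.

    # Sketch of the section 10.1.1 arithmetic (assumes 4KB blocks, 1MB Lustre RPCs).
    chunk_kb=64            # per-disk segment size reported by the array, in KB
    data_disks=10          # disks carrying data (RAID6: total disks - 2)
    block_kb=4             # ldiskfs block size in KB

    stride=$((chunk_kb / block_kb))                 # mke2fs -E stride=...
    stripe_width_blocks=$((stride * data_disks))    # full RAID stripe, in 4KB blocks
    stripe_width_kb=$((chunk_kb * data_disks))      # full RAID stripe, in KB

    echo "stride=$stride stripe_width=$stripe_width_blocks (${stripe_width_kb}KB per full stripe)"
    # With 10 data disks this prints a 640KB full stripe, which does not divide
    # 1MB evenly -- which is why the replies below suggest 8 data disks instead.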
Paul Nowoczynski
2010-Oct-19 21:43 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Ed,

Does 'segment size' refer to the amount of data written to each disk before proceeding to the next disk (i.e. the stride)? This is my guess, since these values are usually powers of two and therefore 51.2 KB [512KB / 10 data disks] is probably not the stride size. In any event, I think you'll get the most bang for your buck by creating a RAID stripe where n_data_disks * stride = 1MB. My recent experience with our software RAID6 systems here is that eliminating read-modify-write is key to achieving good performance. I would recommend exploring configurations where the number of data disks is a power of 2 so that you can configure the stripe size to be 1MB. I wouldn't be surprised if you see better performance by dividing the 12 disks into 2x(4+2) RAID6 LUNs.

paul
Dennis Nelson
2010-Oct-19 21:56 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Segment size should be 128 KB: 128 KB * 8 data drives = 1 MB.
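Following the command pattern in the original post, an 8+2 RAID6 LUN with 128KB segments works out to something like the sketch below. This assumes 4KB filesystem blocks; the trailing "..." stands for the usual target options such as --ost and --mgsnode, and the values are illustrative rather than prescriptive.

    # 8 data disks x 128KB segments = 1MB full stripe
    #   stride       = 128KB / 4KB         = 32 blocks per disk
    #   stripe width = 32 blocks * 8 disks = 256 blocks (1MB)
    mkfs.lustre --mountfsoptions="stripe=256" \
        --mkfsoptions="-E stride=32 -m 1" ...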
Andreas Dilger
2010-Oct-19 23:16 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
On 2010-10-19, at 14:42, Edward Walter wrote:
> We're doing a fresh Lustre 1.8.4 install using Sun StorageTek 2540 arrays for our OST targets. We've configured these as RAID6 with no spares which means we have the equivalent of 10 data disks and 2 parity disks in play on each OST.

As Paul mentioned, using something other than 8 data + N parity disks is bad for performance. It is doubly bad if the stripe width (ndata * segment size) is > 1MB in size, because that means EVERY WRITE will be a read-modify-write, and kill performance.

> Also, does anyone have recommendations for "aligning" the filesystem so that the fs blocks align with the RAID chunks? [...]

Partitioning is doubly evil (unless done extremely carefully) because it will further mis-align the IO (due to the partition table and crazy MS-DOS odd-sector alignment), so that you will always partially modify extra blocks at the beginning/end of each write (possibly causing data corruption in case of incomplete writes/cache loss/etc).

If you stick with 8 data disks, and assuming 2TB drives or smaller, with 1.8.4 you can use the ext4-based ldiskfs (in a separate ldiskfs RPM on the download site) to format up to 16TB LUNs for a single OST. That is really the best configuration, and will probably double your write performance.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
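If partitioning really is unavoidable, one way to keep the data area aligned is a GPT label with the partition start forced onto a 1MiB boundary, so it coincides with a full RAID stripe of the 8+2/128KB layout discussed here. This is only a sketch; the device name is a placeholder.

    # Hypothetical device name; adjust to your OST LUN.
    DEV=/dev/sdb

    # GPT label, single partition starting at 1MiB so it lands on a full
    # RAID stripe boundary (1MiB is a multiple of the 1MB stripe width).
    parted -s "$DEV" mklabel gpt
    parted -s "$DEV" mkpart primary 1MiB 100%

    # Verify the start sector is stripe-aligned (2048 * 512B = 1MiB).
    parted -s "$DEV" unit s print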
Edward Walter
2010-Oct-20 01:00 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Hi Dennis,

That seems to validate how I'm interpreting the parameters. We have 10 data disks and 2 parity disks per array so it looks like we need to be at 64 KB or less. I'm guessing I'll just need to run some tests to see how performance changes as I adjust the segment size.

Thanks,
-Ed

On Oct 19, 2010, at 5:56 PM, Dennis Nelson <dnelson at sgi.com> wrote:
> Segment size should be 128 KB: 128 KB * 8 data drives = 1 MB.
Brian J. Murrell
2010-Oct-20 12:27 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
On Tue, 2010-10-19 at 21:00 -0400, Edward Walter wrote:

Ed,

> That seems to validate how I'm interpreting the parameters. We have 10 data disks and 2 parity disks per array so it looks like we need to be at 64 KB or less.

I think you have been missing everyone's point in this thread. The magic value is not "anything below 1MB", it's 1MB exactly. No more, no less (although I guess technically 256KB or 512KB would work).

The reason is that Lustre attempts to package up I/Os from the client to the OST in 1MB chunks. If the RAID stripe matches that 1MB, then when the OSS writes that 1MB to the OST it's a single write to the RAID device underlying the OST of 1MB of data plus the parity.

Conversely, if the OSS receives 1MB of data for the OST and the RAID stripe under the OST is less than 1MB, then the first <raid_stripe_size> of data will be written as a full stripe of data+parity, but the remaining 1MB - <raid_stripe_size> of data from the client will only partially fill the next RAID stripe, forcing the RAID layer to first read that whole stripe, insert the new data, calculate a new parity, and then write that whole RAID stripe back out to disk.

So as you can see, when your RAID stripe is not exactly 1MB, the RAID code has to do a lot more I/O, which impacts performance, obviously.

This is why the recommendations in this thread have continued to be to use a number of data disks that divides evenly into 1MB (i.e. powers of 2: 2, 4, 8, etc.). So for RAID6: 4+2 or 8+2, etc.

b.
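To make the read-modify-write condition concrete, here is a small sketch (the variable names are purely illustrative) that checks whether Lustre's 1MB RPCs land on full RAID stripes:

    segment_kb=128
    data_disks=8
    rpc_kb=1024   # Lustre client RPC size

    stripe_kb=$((segment_kb * data_disks))
    remainder=$((rpc_kb % stripe_kb))
    if [ "$remainder" -eq 0 ]; then
        echo "1MB writes land on full ${stripe_kb}KB stripes: no read-modify-write"
    else
        echo "each 1MB write leaves a ${remainder}KB partial stripe: read-modify-write"
    fi

With segment_kb=64 and data_disks=10, the same check reports a 384KB partial stripe, which is exactly the case described above.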
Edward Walter
2010-Oct-20 14:19 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Hi Brian,

Thanks for the clarification. It didn't click that the optimal data size is exactly 1MB... Everything you're saying makes sense though.

Obviously with 12-disk arrays there's tension between maximizing space and maximizing performance. I was hoping/trying to get the best of both. The difference between doing 10 data and 2 parity vs 4+2 or 8+2 works out to a difference of 2 data disks (4 TB) per shelf for us, or 24 TB in total, which is why I was trying to figure out how to make this work with more data disks.

Thanks to everyone for the input. This has been very helpful.

-Ed
Charland, Denis
2010-Oct-20 15:49 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Brian J. Murrell wrote:

> This is why the recommendations in this thread have continued to be to use a number of data disks that divides evenly into 1MB (i.e. powers of 2: 2, 4, 8, etc.). So for RAID6: 4+2 or 8+2, etc.

What about RAID5?

Denis
Edward Walter
2010-Oct-20 16:30 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Hi Denis,

Changing the number of parity disks (RAID5 = 1, RAID6 = 2) doesn't change the math on the data disks and data segment size. You still need a power-of-2 number of data disks to ensure that the product of the RAID chunk size and the number of data disks is 1MB.

Aside from that, I wouldn't comfortably rely on RAID5 to protect my data at this point. We've seen too many dual-disk failures to trust it.

Thanks.

-Ed

Charland, Denis wrote:

> What about RAID5?
>
> Denis
Bernd Schubert
2010-Oct-20 16:36 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
On Wednesday, October 20, 2010, Charland, Denis wrote:

> What about RAID5?

Personally I don't like RAID5 too much, but with RAID5 it is obviously +1 instead of +2.

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
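For completeness, the RAID5 equivalents of the same full-stripe arithmetic (a sketch only, not a recommendation for RAID5):

    # RAID5 keeps the same rule: the data disks must still multiply out to 1MB.
    #   4+1 with 256KB segments: 4 * 256KB = 1MB
    #   8+1 with 128KB segments: 8 * 128KB = 1MB
    echo $(( 4 * 256 )) $(( 8 * 128 ))   # both print 1024 (KB)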
Wojciech Turek
2010-Oct-20 16:50 UTC
[Lustre-discuss] mkfs options/tuning for RAID based OSTs
Hi Edward,

As Andreas mentioned earlier, the maximum OST size is 16TB if one uses the ext4-based ldiskfs, so creating a RAID group bigger than that will definitely hurt your performance, because you would have to split the large array into smaller logical disks and that randomises I/Os on the RAID controller. With 2TB disks, RAID6 is the way to go, as the rebuild time of a failed disk is quite long, which increases the chance of double disk failure to an uncomfortable level. So taking that into consideration, I think that 8+2 RAID6 with a 128KB segment size is the right choice. The remaining disks can be used as hot spares or for an external journal.

On 20 October 2010 15:19, Edward Walter <ewalter at cs.cmu.edu> wrote:
> Hi Brian,
>
> Thanks for the clarification. It didn't click that the optimal data size is exactly 1MB... [...]
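For the external-journal option mentioned above, a rough sketch: the device names are placeholders, journal sizing should follow the operations manual, and the exact mke2fs options passed through --mkfsoptions may vary with your e2fsprogs version.

    # Placeholders throughout: /dev/sdc1 is a spare-disk partition for the
    # journal, and the trailing "..." stands for the usual target options.

    # Create the external journal device (block size should match the OST fs).
    mke2fs -O journal_dev -b 4096 /dev/sdc1

    # Point the OST at it by adding -J device=... to the mke2fs options,
    # alongside the RAID geometry options discussed earlier in the thread.
    mkfs.lustre --mountfsoptions="stripe=256" \
        --mkfsoptions="-J device=/dev/sdc1 -E stride=32 -m 1" ...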
Hello,

I read this thread
http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg07791.html
and the Lustre Operations manual "10.1 Considerations for Backend Storage" in order to determine the best performing setup for our OSS hardware.

HP DL180 G6
- CentOS 5.5 and Lustre 1.8.4
- HP Smart Array P410 controller (512 MB cache, 25% Read / 75% Write)
- 600 GB SAS drives

The stripesizes available on the P410 array controllers (in HP P410 terminology, the stripesize is the amount of data read/written to each disk) are 8, 16, 32, 64, 128, 256.

Two of the scenarios that we tested are:

1) 1 x 12 disk RAID6 lun
   chunksize = 1024 / 10 disks = 102.4 => use stripesize=64
   - not optimally aligned but maximum space usage
   - setup on oss[2-4]
   - sgpdd_survey results: http://www.sharcnet.ca/~kaizaad/orcafs/unaligned.html

2) 1 x 10 disk RAID6 lun
   chunksize = 1024 / 8 = 128 => use stripesize=128
   - optimally aligned but at the sacrifice of 2 disks of space
   - setup on oss[8-10]
   - sgpdd_survey results: http://www.sharcnet.ca/~kaizaad/orcafs/aligned.html

In our case, the graphs seem to indicate that the underlying RAID alignment setup doesn't matter much, which is totally counter-intuitive to the recommendations by the Lustre list and manual.

Is there something we are missing here? Maybe I misunderstood the recommendations? Or are we just bottlenecking on a component in the setup so that proper RAID alignment doesn't show as being beneficial? Any insight is appreciated.

thanks
-k
Is your write cache persistent?

One major factor in having Lustre read and write alignment is that any misaligned write will cause read-modify-write, and misaligned reads will cause 2x reads if the RAID layer is doing parity verification.

If your RAID layer is hiding this overhead via cache, you need to be absolutely sure that it is safe in case of crashes and failover of either or both the OSS and RAID controller.

Cheers, Andreas

On 2010-12-06, at 9:44, Kaizaad Bilimorya <kaizaad at sharcnet.ca> wrote:
> [...]
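If you want to confirm how the P410 cache is configured and whether its battery is healthy, HP's hpacucli utility can report this. The commands below are a sketch from memory of that tool, and the slot number is a placeholder.

    # Controller, cache and battery status (slot number is an example).
    hpacucli ctrl all show status
    hpacucli ctrl slot=0 show config detail | egrep -i 'cache|battery|accelerator'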
Hello Andreas,

Thanks for your reply.

On Mon, 6 Dec 2010, Andreas Dilger wrote:

> Is your write cache persistent?

Yes. It is 512 MB battery backed.

> One major factor in having Lustre read and write alignment is that any misaligned write will cause read-modify-write, and misaligned reads will cause 2x reads if the RAID layer is doing parity verification.
>
> If your RAID layer is hiding this overhead via cache, you need to be absolutely sure that it is safe in case of crashes and failover of either or both the OSS and RAID controller.

The HP Smart Array P410 controller also has a setting called "Accelerator Ratio" which determines the amount of cache devoted to either reads or writes. Currently it is set (default) as follows:

Accelerator Ratio: 25% Read / 75% Write

We can try setting it to one extreme and the other to see what difference it makes. This Lustre system is going to be used as /scratch for a broad range of HPC codes with diverse requirements (large files, small files, many files, mostly reading, mostly writing), so I don't know how much we can tune this cache setting to help specific access patterns to the detriment of others; we are just looking for an appropriate middle ground here. But for thread completeness, I'll post the sgpdd_survey results if there are any large differences in performance.

thanks a bunch
-k