Sebastian Gutierrez
2010-Jul-30 06:42 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

This will probably be a pretty basic question, however I wanted to verify that I was understanding the documentation correctly.

The enclosure I am using supports 15 disks. We have initially ordered 6 disks + a hot spare. I think my recommendation is going to be a 6-disk RAID-6, which will give us a RAID set with 4 data disks plus 2 parity disks. Later we will have the option to create another 6-disk RAID-6, or to expand the current RAID set to a 10-disk RAID-6, then move the journal to a 4-disk RAID-1/0 and keep 1 disk as a hot spare.

The current RAID-6 will have a 128k chunksize (Lustre terminology). This gives us:

<stripe_width> = <chunksize> * (<disks> - <parity_disks>) <= 1MB
512K <= 128k*4

<chunksize> <= 1024kB/4; either 256k, 128k, 64k
256k = < = 1024k/4k

<chunk_blocks> = <chunksize (decided above)> / 4k
32 = 256/4

128k = 512k / 4k
<stripe_width_blocks> = <stripe_width> / 4k

Therefore my mkfs options should be:

--mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb

Upgrade options (the filesystem chunk_block options would still be valid, since the chunk_blocks would stay the same):

Purchase 6 more disks: this will allow for a new RAID-6 that is aligned the same way, with 3 hot spares.

or (I need clarification if this understanding is correct)

Purchase 10 more disks: expand the current RAID-6 to a larger 10-disk RAID-6, with a 4-disk RAID-1/0 for an external journal plus a hot spare.

Is my understanding of the documentation accurate?
Do both of these options seem like potential upgrade options?

Cheers,
Sebastian
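[As a point of reference, a minimal sketch of how that --mkfsoptions string would sit in a full OST format command; the fsname and MGS NID below are placeholders rather than values from this setup, and the handling of the "-E" options themselves is taken up in the replies that follow.]

    # hypothetical OST format command; fsname and MGS NID are placeholders
    mkfs.lustre --ost --fsname=lfs01 --mgsnode=10.0.0.1@tcp \
        --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb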
Sebastian Gutierrez
2010-Jul-30 17:33 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

After reading my post I realized I had some typos.

> <chunksize> <= 1024kB/4; either 256k, 128k, 64k
> 256k = < = 1024k/4k

This was supposed to be 128k, per the RAID-6 configuration.

> <chunk_blocks> = <chunksize (decided above)> / 4k
> 32 = 256/4

This was supposed to be 32 = 128k/4k.

> 128k = 512k / 4k
> <stripe_width_blocks> = <stripe_width> / 4k
>
> Therefore my mkfs options should be:
>
> --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb

I should have waited until I was awake to send this out.

Sebastian
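[Restating the corrected arithmetic in one place, in the same notation as the original post (128k RAID chunk, 4 data disks, 4k filesystem blocks):]

    <chunk_blocks>        = 128k / 4k  = 32    (the mke2fs "stride")
    <stripe_width>        = 128k * 4   = 512k
    <stripe_width_blocks> = 512k / 4k  = 128   (the mke2fs stripe width)

which is consistent with the stride=32 / stripe=128 values in the mkfs options quoted above.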
Andreas Dilger
2010-Jul-30 17:35 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On 2010-07-30, at 00:42, Sebastian Gutierrez wrote:

> The enclosure I am using supports 15 disks. We have initially ordered 6 disks + a hot spare. I think my recommendation is going to be a 6-disk RAID-6, which will give us a RAID set with 4 data disks plus 2 parity disks. Later we will have the option to create another 6-disk RAID-6, or to expand the current RAID set to a 10-disk RAID-6, then move the journal to a 4-disk RAID-1/0 and keep 1 disk as a hot spare.
>
> The current RAID-6 will have a 128k chunksize (Lustre terminology). This gives us:
>
> <stripe_width> = <chunksize> * (<disks> - <parity_disks>) <= 1MB
> 512K <= 128k*4
>
> <chunksize> <= 1024kB/4; either 256k, 128k, 64k
> 256k = < = 1024k/4k
>
> <chunk_blocks> = <chunksize (decided above)> / 4k
> 32 = 256/4
>
> 128k = 512k / 4k
> <stripe_width_blocks> = <stripe_width> / 4k
>
> Therefore my mkfs options should be:
>
> --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb

If you are planning on expanding this at the RAID-6 level to be an 8+2 configuration, you should specify "-E stripe=256,stride=64".

Note that you cannot specify multiple separate "-E" options to mke2fs; it would only use the last one specified.

> Purchase 6 more disks: this will allow for a new RAID-6 that is aligned the same way, with 3 hot spares.
>
> or (I need clarification if this understanding is correct)
>
> Purchase 10 more disks: expand the current RAID-6 to a larger 10-disk RAID-6, with a 4-disk RAID-1/0 for an external journal plus a hot spare.

Using a 4-disk RAID-10 external journal is unlikely to give you any extra performance, since journal IO is nearly sequential (though sometimes small block writes if there are few clients and you are not using async journal).

Also, 16TB LUN support is only available with ext4, so if you have 2TB drives you need to make sure to download the right ldiskfs package.

Depending on the hardware options on your RAID, it may be that you need a separate hot spare for each LUN, in which case having 2 hot spares makes sense. Otherwise, you can probably use only 13 or 14 drives.

> Is my understanding of the documentation accurate?
> Do both of these options seem like potential upgrade options?

Either of them seems reasonable.

If the hardware allows in-place RAID reshaping then it is possible. I'd always recommend making a backup before doing this, because one never knows what might happen if this operation is interrupted for some reason.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
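[A sketch of the difference Andreas is pointing out, using the values he gives for the 8+2 case; only the --mkfsoptions fragment is shown:]

    # single -E with comma-separated values: both settings take effect
    --mkfsoptions="-E stripe=256,stride=64"

    # two separate -E options: mke2fs keeps only the last one (stride here),
    # so the stripe width setting would be silently dropped
    --mkfsoptions="-E stripe=256 -E stride=64"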
Sebastian Gutierrez
2010-Jul-30 20:14 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello Andreas,

Thank you very much for your help. I do have a few more questions.

>> Therefore my mkfs options should be:
>>
>> --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb
>
> If you are planning on expanding this at the RAID-6 level to be an 8+2 configuration, you should specify "-E stripe=256,stride=64".

Are there any potential negatives here? I initially used a 6-disk RAID-10, but I ended up with wasted space on the filesystem since I could not fit the 1MB Lustre I/O into the number of active disks cleanly. Would there be a way to minimize the amount of wasted space if I wanted to stick to RAID-1/0? I assume that aligned I/O is always preferred.

> Note that you cannot specify multiple separate "-E" options to mke2fs; it would only use the last one specified.

I want to say I tried mkfs.lustre with "-E stripe=xx,stride=xx" but I received some error. I will see if this happens again.

>> Purchase 6 more disks: this will allow for a new RAID-6 that is aligned the same way, with 3 hot spares.
>>
>> or (I need clarification if this understanding is correct)
>>
>> Purchase 10 more disks: expand the current RAID-6 to a larger 10-disk RAID-6, with a 4-disk RAID-1/0 for an external journal plus a hot spare.
>
> Using a 4-disk RAID-10 external journal is unlikely to give you any extra performance, since journal IO is nearly sequential (though sometimes small block writes if there are few clients and you are not using async journal).

I see.

> Also, 16TB LUN support is only available with ext4, so if you have 2TB drives you need to make sure to download the right ldiskfs package.

I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?

I guess this brings up another question: should I also use the flexible block group options on the OSTs?
-O flex_bg and -G nr_merged_groups

> Depending on the hardware options on your RAID, it may be that you need a separate hot spare for each LUN, in which case having 2 hot spares makes sense. Otherwise, you can probably use only 13 or 14 drives.
>
>> Is my understanding of the documentation accurate?
>> Do both of these options seem like potential upgrade options?
>
> Either of them seems reasonable.
>
> If the hardware allows in-place RAID reshaping then it is possible. I'd always recommend making a backup before doing this, because one never knows what might happen if this operation is interrupted for some reason.

Understood.

Thanks again,
Sebastian
Sebastian Gutierrez
2010-Jul-31 03:55 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

I did not know that the full ext4 support was a different build. I will download and install the correct build before I move forward. I assume that the ext4 version will work fine with older OSTs. I will test in my virtual environment.

Thanks,
Sebastian

> I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?
>
> I guess this brings up another question: should I also use the flexible block group options on the OSTs?
> -O flex_bg and -G nr_merged_groups
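[One way to see which ldiskfs flavour and Lustre packages are currently installed on a server, assuming an RPM-based install; exact package names vary between releases and vendor kernels, so treat this only as a quick check:]

    rpm -qa | grep -Ei 'ldiskfs|lustre'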
Andreas Dilger
2010-Jul-31 06:31 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On 2010-07-30, at 13:14, Sebastian Gutierrez <gutseb at cs.stanford.edu> wrote:

>> If you are planning on expanding this at the RAID-6 level to be an 8+2 configuration, you should specify "-E stripe=256,stride=64".
>
> Are there any potential negatives here? I initially used a 6-disk RAID-10, but I ended up with wasted space on the filesystem since I could not fit the 1MB Lustre I/O into the number of active disks cleanly. Would there be a way to minimize the amount of wasted space if I wanted to stick to RAID-1/0?
> I assume that aligned I/O is always preferred.

For RAID-1+0 the alignment is much less important. While there is still some negative effect if the 1MB read or write is not aligned (because it will make an extra pair of disks active to fill the RPC), this is not nearly as bad as RAID-5/6, where it will also cause the parity chunk to be rewritten.

If you are using a 6-disk RAID-1+0 then it would be OK, for example, to configure the RAID chunksize to be 128kB, even though this means that a 1MB IO would handle 3*128kB from two pairs of disks and 4*128kB from the third pair of disks (each IO would be sequential though).

It means a given pair of disks would do a bit more work than the others for a given RPC, but since the IO is sequential (assuming the request itself is sequential) it will not need an extra seek for the last disk, and the extra IO is a minimal effort.

>> Also, 16TB LUN support is only available with ext4, so if you have 2TB drives you need to make sure to download the right ldiskfs package.
>
> I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?

There should be an ldiskfs-ext4 RPM available for download with 1.8.3 and later (for specific vendor kernels). The ext3 code has had a lot more testing than ext4, so we recommend using ext3 unless there is a reason to use ext4 (e.g. > 8TB LUN size).

> Should I also use the flexible block group options on the OSTs?
> -O flex_bg and -G nr_merged_groups

The flex_bg feature is only available with ext4. We haven't done any testing with this feature yet, but in theory it can help.

Cheers, Andreas
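[A quick recap of the arithmetic behind the RAID-1+0 point, using the 6-disk (3 mirrored pair) layout and 128kB chunk from the example:]

    1MB RPC / 128kB chunk = 8 chunks per RPC
    8 chunks over 3 mirrored pairs -> does not divide evenly, so one pair
    services more chunks of each RPC than the others; because those extra
    chunks are contiguous on that pair, they cost extra transfer but no
    extra seek.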
Johann Lombardi
2010-Aug-02 21:27 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On Fri, Jul 30, 2010 at 11:31:44PM -0700, Andreas Dilger wrote:

>> I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?
>
> There should be an ldiskfs-ext4 RPM available for download with 1.8.3 and later (for specific vendor kernels).

Unfortunately, the lustre-modules package has to be changed too, since the fsfilt_ldiskfs.ko module is packaged in this RPM and depends on ldiskfs. We should consider packaging fsfilt_ldiskfs.ko into the ldiskfs RPM.

Johann
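[One way to see where the installed fsfilt module lives and what it depends on, assuming a standard 1.8 server install with the module present for the running kernel; the output shown by "depends:" is what ties it to a particular ldiskfs build:]

    modinfo fsfilt_ldiskfs | grep -E '^(filename|depends)'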
Sebastian Gutierrez
2010-Aug-03 05:06 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

During this upgrade my plan was to add an OSS and new OSTs to my filesystem, deactivate my old OSTs, and migrate the data by copying the old data into /lustre/tmp. Once most of the data was moved over, I was going to schedule an outage to rsync the deltas. I was then going to empty the old OSTs, upgrade the disks in those OSTs, copy the /CONFIG* data over to the newly created disks, and rebalance the data across the OSTs. However, I ran into an issue caused by a typo, detailed below.

> For RAID-1+0 the alignment is much less important. While there is still some negative effect if the 1MB read or write is not aligned (because it will make an extra pair of disks active to fill the RPC), this is not nearly as bad as RAID-5/6, where it will also cause the parity chunk to be rewritten.
>
> If you are using a 6-disk RAID-1+0 then it would be OK, for example, to configure the RAID chunksize to be 128kB, even though this means that a 1MB IO would handle 3*128kB from two pairs of disks and 4*128kB from the third pair of disks (each IO would be sequential though).
>
> It means a given pair of disks would do a bit more work than the others for a given RPC, but since the IO is sequential (assuming the request itself is sequential) it will not need an extra seek for the last disk, and the extra IO is a minimal effort.

It looks like I had a typo in my config the first time I created the FS. I had planned ahead and have extra disks to shuffle things around. While migrating data off of the old OSTs, these settings seem to leave me missing about 5TB out of 20TB. The first time I created the FS, I created my 6-disk RAID-1/0 filesystem with the following options:

--mkfsoptions="-E stripe=256 -E stride=32"

To recover from this I am performing the following: I am recreating the FS with the settings below, then cp'ing the contents of OST.old to OST.new, and then remounting OST.new as OST.old.

<stripe-width> = (<chunk> * <data disks>) / <4k>
96 = (128*3)/4k

--mkfsoptions="-E stripe-width=96,stride=32"

I have a couple of sanity-check questions. If I have the old OST and new OST side by side, would it be enough to do a cp -ar of the /ost/O dir, or should I use a different migration procedure?

I have found some mention on lustre-discuss that using a tool that does a backup of the xattrs is preferable. I am assuming that the cp -a should be sufficient since it is supposed to preserve everything. In the lustre-discuss articles I only saw a mention of the patched tar and rsync. Is there any reason not to trust cp?

On a related tangent, I also found that the documentation in the manual is a bit out of date. The manual refers to <stripe-width> as <stripe>; the current version of mkfs.lustre only takes <stripe-width> as a valid option. I will submit a documentation bug for this tomorrow.

Thank you,
Sebastian
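[A rough sketch of the recreate-and-copy step described above, with placeholder fsname, MGS NID, device names and mount points. It assumes the old and new OST devices can be mounted as ldiskfs on the same server at the same time; it is not a complete or validated OST backup/restore recipe, just the shape of the operation:]

    # format the new LUN with the corrected geometry (placeholders throughout)
    mkfs.lustre --ost --fsname=lfs01 --mgsnode=10.0.0.1@tcp \
        --mkfsoptions="-E stripe-width=96,stride=32" /dev/sdY

    # mount both OSTs as plain ldiskfs and copy the object tree and configs
    mount -t ldiskfs /dev/sdX /mnt/ost_old
    mount -t ldiskfs /dev/sdY /mnt/ost_new
    cp -a /mnt/ost_old/O       /mnt/ost_new/
    cp -a /mnt/ost_old/CONFIGS /mnt/ost_new/

    # afterwards, verify that the object xattrs survived the copy
    # (see the follow-up below)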
Andreas Dilger
2010-Aug-03 07:46 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On 2010-08-02, at 23:06, Sebastian Gutierrez <gutseb at cs.stanford.edu> wrote:

> I have found some mention on lustre-discuss that using a tool that does a backup of the xattrs is preferable. I am assuming that the cp -a should be sufficient since it is supposed to preserve everything. In the lustre-discuss articles I only saw a mention of the patched tar and rsync. Is there any reason not to trust cp?

If your cp copies xattrs then great. I would verify this is correct by listing the xattrs on both the old and new objects and comparing them.

In newer Lustre RPMs there is ll_decode_filter_fid, which will format the OST object xattrs nicely.

Cheers, Andreas
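[An example of the kind of comparison Andreas suggests, run as root against the same object on the old and new copies; the mount points and object path here are made up for illustration, and ll_decode_filter_fid is only present on installs new enough to ship it:]

    # dump the trusted.* xattrs of one object on each copy and compare
    getfattr -d -m trusted -e hex /mnt/ost_old/O/0/d5/1605
    getfattr -d -m trusted -e hex /mnt/ost_new/O/0/d5/1605

    # or pretty-print the fid xattr on the copied object
    ll_decode_filter_fid /mnt/ost_new/O/0/d5/1605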
Sebastian Gutierrez
2010-Aug-04 18:04 UTC
[Lustre-discuss] clarification on mkfs.lustre options
>> I have found some mention on lustre-discuss that using a tool that does a backup of the xattrs is preferable. I am assuming that the cp -a should be sufficient since it is supposed to preserve everything. In the lustre-discuss articles I only saw a mention of the patched tar and rsync. Is there any reason not to trust cp?
>
> If your cp copies xattrs then great. I would verify this is correct by listing the xattrs on both the old and new objects and comparing them.
>
> In newer Lustre RPMs there is ll_decode_filter_fid, which will format the OST object xattrs nicely.
>
> Cheers, Andreas

The cp -ar worked fine.

Thanks again,
Sebastian