Sebastian Gutierrez
2010-Jul-30 06:42 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

This will probably be a pretty basic question, however I wanted to verify that I was understanding the documentation correctly.

The enclosure I am using supports 15 disks. We have initially ordered 6 disks + a hot spare. I think my recommendation is going to be a 6-disk RAID-6, which will give us a RAID set with 4 data disks plus 2 parity disks. Later we will have the option to create another 6-disk RAID-6, or to expand the current RAID set to a 10-disk RAID-6, then move the journal to a 4-disk RAID-1/0 and keep 1 disk as a hot spare.

The current RAID-6 will have a 128k chunksize (Lustre terminology). This gives us:

<stripe_width> = <chunksize> * (<disks> - <parity_disks>) <= 1MB
512K <= 128k*4

<chunksize> <= 1024kB/4; either 256k, 128k, 64k
256k = < = 1024k/4k

<chunk_blocks> = <chunksize (decided above)> / 4k
32 = 256/4

128k = 512k / 4k
<stripe_width_blocks> = <stripe_width> / 4k

Therefore my mkfs options should be:

--mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb

Upgrade options (the filesystem chunk_block options would still be valid, since the chunk_blocks would stay the same):

Purchase 6 more disks: this will allow for a new RAID-6 that is aligned the same way, with 3 hot spares.

or (I need clarification if this understanding is correct)

Purchase 10 more disks: expand the current RAID-6 to a larger 10-disk RAID-6, with a 4-disk RAID-1/0 for an external journal plus a hot spare.

Is my understanding of the documentation accurate?
Do both of these options seem like potential upgrade options?

Cheers,
Sebastian
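[As a point of reference, a minimal sketch of how that --mkfsoptions string would sit in a full OST format command; the fsname and MGS NID below are placeholders rather than values from this setup, and the handling of the "-E" options themselves is taken up in the replies that follow.]

    # hypothetical OST format command; fsname and MGS NID are placeholders
    mkfs.lustre --ost --fsname=lfs01 --mgsnode=10.0.0.1@tcp \
        --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb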
Sebastian Gutierrez
2010-Jul-30 17:33 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

After reading my post I realized I had some typos.

> <chunksize> <= 1024kB/4; either 256k, 128k, 64k
> 256k = < = 1024k/4k

This was supposed to be 128k, per the RAID-6 configuration.

> <chunk_blocks> = <chunksize (decided above)> / 4k
> 32 = 256/4

This was supposed to be 32 = 128k/4k.

> 128k = 512k / 4k
> <stripe_width_blocks> = <stripe_width> / 4k
>
> Therefore my mkfs options should be:
>
> --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb

I should have waited until I was awake to send this out.

Sebastian
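[Restating the corrected arithmetic in one place, in the same notation as the original post (128k RAID chunk, 4 data disks, 4k filesystem blocks):]

    <chunk_blocks>        = 128k / 4k  = 32    (the mke2fs "stride")
    <stripe_width>        = 128k * 4   = 512k
    <stripe_width_blocks> = 512k / 4k  = 128   (the mke2fs stripe width)

which is consistent with the stride=32 / stripe=128 values in the mkfs options quoted above.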
Andreas Dilger
2010-Jul-30 17:35 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On 2010-07-30, at 00:42, Sebastian Gutierrez wrote:

> The enclosure I am using supports 15 disks. We have initially ordered 6 disks + a hot spare. I think my recommendation is going to be a 6-disk RAID-6, which will give us a RAID set with 4 data disks plus 2 parity disks. Later we will have the option to create another 6-disk RAID-6, or to expand the current RAID set to a 10-disk RAID-6, then move the journal to a 4-disk RAID-1/0 and keep 1 disk as a hot spare.
>
> The current RAID-6 will have a 128k chunksize (Lustre terminology). This gives us:
>
> <stripe_width> = <chunksize> * (<disks> - <parity_disks>) <= 1MB
> 512K <= 128k*4
>
> <chunksize> <= 1024kB/4; either 256k, 128k, 64k
> 256k = < = 1024k/4k
>
> <chunk_blocks> = <chunksize (decided above)> / 4k
> 32 = 256/4
>
> 128k = 512k / 4k
> <stripe_width_blocks> = <stripe_width> / 4k
>
> Therefore my mkfs options should be:
>
> --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb

If you are planning on expanding this at the RAID-6 level to be an 8+2 configuration, you should specify "-E stripe=256,stride=64".

Note that you cannot specify multiple separate "-E" options to mke2fs; it would only use the last one specified.

> Purchase 6 more disks: this will allow for a new RAID-6 that is aligned the same way, with 3 hot spares.
>
> or (I need clarification if this understanding is correct)
>
> Purchase 10 more disks: expand the current RAID-6 to a larger 10-disk RAID-6, with a 4-disk RAID-1/0 for an external journal plus a hot spare.

Using a 4-disk RAID-10 external journal is unlikely to give you any extra performance, since journal IO is nearly sequential (though sometimes small block writes if there are few clients and you are not using async journal).

Also, 16TB LUN support is only available with ext4, so if you have 2TB drives you need to make sure to download the right ldiskfs package.

Depending on the hardware options on your RAID, it may be that you need a separate hot spare for each LUN, in which case having 2 hot spares makes sense. Otherwise, you can probably use only 13 or 14 drives.

> Is my understanding of the documentation accurate?
> Do both of these options seem like potential upgrade options?

Either of them seems reasonable.

If the hardware allows in-place RAID reshaping then it is possible. I'd always recommend making a backup before doing this, because one never knows what might happen if this operation is interrupted for some reason.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
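[A sketch of the difference Andreas is pointing out, using the values he gives for the 8+2 case; only the --mkfsoptions fragment is shown:]

    # single -E with comma-separated values: both settings take effect
    --mkfsoptions="-E stripe=256,stride=64"

    # two separate -E options: mke2fs keeps only the last one (stride here),
    # so the stripe width setting would be silently dropped
    --mkfsoptions="-E stripe=256 -E stride=64"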
Sebastian Gutierrez
2010-Jul-30 20:14 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello Andreas,

Thank you very much for your help. I do have a few more questions.

>> Therefore my mkfs options should be:
>>
>> --mkfsoptions="-E stripe=128 -E stride=32" /dev/sdb
>
> If you are planning on expanding this at the RAID-6 level to be an 8+2 configuration, you should specify "-E stripe=256,stride=64".

Are there any potential negatives here? I initially used a 6-disk RAID-10, but I ended up with wasted space on the filesystem since I could not fit the 1MB Lustre I/O into the number of active disks cleanly. Would there be a way to minimize the amount of wasted space if I wanted to stick to RAID-1/0? I assume that aligned I/O is always preferred.

> Note that you cannot specify multiple separate "-E" options to mke2fs; it would only use the last one specified.

I want to say I tried mkfs.lustre with "-E stripe=xx,stride=xx" but I received some error. I will see if this happens again.

>> Purchase 6 more disks: this will allow for a new RAID-6 that is aligned the same way, with 3 hot spares.
>>
>> or (I need clarification if this understanding is correct)
>>
>> Purchase 10 more disks: expand the current RAID-6 to a larger 10-disk RAID-6, with a 4-disk RAID-1/0 for an external journal plus a hot spare.
>
> Using a 4-disk RAID-10 external journal is unlikely to give you any extra performance, since journal IO is nearly sequential (though sometimes small block writes if there are few clients and you are not using async journal).

I see.

> Also, 16TB LUN support is only available with ext4, so if you have 2TB drives you need to make sure to download the right ldiskfs package.

I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?

I guess this brings up another question: should I also use the flexible block group options on the OSTs?
-O flex_bg and -G nr_merged_groups

> Depending on the hardware options on your RAID, it may be that you need a separate hot spare for each LUN, in which case having 2 hot spares makes sense. Otherwise, you can probably use only 13 or 14 drives.
>
>> Is my understanding of the documentation accurate?
>> Do both of these options seem like potential upgrade options?
>
> Either of them seems reasonable.
>
> If the hardware allows in-place RAID reshaping then it is possible. I'd always recommend making a backup before doing this, because one never knows what might happen if this operation is interrupted for some reason.

Understood.

Thanks again,
Sebastian
Sebastian Gutierrez
2010-Jul-31 03:55 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

I did not know that the full ext4 support was a different build. I will download and install the correct build before I move forward. I assume that the ext4 version will work fine with older OSTs. I will test in my virtual environment.

Thanks,
Sebastian

> I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?
>
> I guess this brings up another question: should I also use the flexible block group options on the OSTs?
> -O flex_bg and -G nr_merged_groups
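[One way to see which ldiskfs flavour and Lustre packages are currently installed on a server, assuming an RPM-based install; exact package names vary between releases and vendor kernels, so treat this only as a quick check:]

    rpm -qa | grep -Ei 'ldiskfs|lustre'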
Andreas Dilger
2010-Jul-31 06:31 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On 2010-07-30, at 13:14, Sebastian Gutierrez <gutseb at cs.stanford.edu> wrote:

>> If you are planning on expanding this at the RAID-6 level to be an 8+2 configuration, you should specify "-E stripe=256,stride=64".
>
> Are there any potential negatives here? I initially used a 6-disk RAID-10, but I ended up with wasted space on the filesystem since I could not fit the 1MB Lustre I/O into the number of active disks cleanly. Would there be a way to minimize the amount of wasted space if I wanted to stick to RAID-1/0?
> I assume that aligned I/O is always preferred.

For RAID-1+0 the alignment is much less important. While there is still some negative effect if the 1MB read or write is not aligned (because it will make an extra pair of disks active to fill the RPC), this is not nearly as bad as RAID-5/6, where it will also cause the parity chunk to be rewritten.

If you are using a 6-disk RAID-1+0 then it would be OK, for example, to configure the RAID chunksize to be 128kB, even though this means that a 1MB IO would handle 3*128kB from two pairs of disks and 4*128kB from the third pair of disks (each IO would be sequential though).

It means a given pair of disks would do a bit more work than the others for a given RPC, but since the IO is sequential (assuming the request itself is sequential) it will not need an extra seek for the last disk, and the extra IO is a minimal effort.

>> Also, 16TB LUN support is only available with ext4, so if you have 2TB drives you need to make sure to download the right ldiskfs package.
>
> I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?

There should be an ldiskfs-ext4 RPM available for download with 1.8.3 and later (for specific vendor kernels). The ext3 code has had a lot more testing than ext4, so we recommend using ext3 unless there is a reason to use ext4 (e.g. > 8TB LUN size).

> Should I also use the flexible block group options on the OSTs?
> -O flex_bg and -G nr_merged_groups

The flex_bg feature is only available with ext4. We haven't done any testing with this feature yet, but in theory it can help.

Cheers, Andreas
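[A quick recap of the arithmetic behind the RAID-1+0 point, using the 6-disk (3 mirrored pair) layout and 128kB chunk from the example:]

    1MB RPC / 128kB chunk = 8 chunks per RPC
    8 chunks over 3 mirrored pairs -> does not divide evenly, so one pair
    services more chunks of each RPC than the others; because those extra
    chunks are contiguous on that pair, they cost extra transfer but no
    extra seek.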
Johann Lombardi
2010-Aug-02 21:27 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On Fri, Jul 30, 2010 at 11:31:44PM -0700, Andreas Dilger wrote:

>> I am creating these filesystems with Lustre 1.8.3 from the prebuilt RPMs. Do you mean that there is a different ldiskfs package I should use?
>
> There should be an ldiskfs-ext4 RPM available for download with 1.8.3 and later (for specific vendor kernels).

Unfortunately, the lustre-modules package has to be changed too, since the fsfilt_ldiskfs.ko module is packaged in this RPM and depends on ldiskfs. We should consider packaging fsfilt_ldiskfs.ko into the ldiskfs RPM.

Johann
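[One way to see where the installed fsfilt module lives and what it depends on, assuming a standard 1.8 server install with the module present for the running kernel; the output shown by "depends:" is what ties it to a particular ldiskfs build:]

    modinfo fsfilt_ldiskfs | grep -E '^(filename|depends)'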
Sebastian Gutierrez
2010-Aug-03 05:06 UTC
[Lustre-discuss] clarification on mkfs.lustre options
Hello,

During this upgrade my plan was to add an OSS and new OSTs to my filesystem, deactivate my old OSTs, and migrate the data by copying the old data into /lustre/tmp. Once most of the data was moved over, I was going to schedule an outage to rsync the deltas. I was then going to empty the old OSTs, upgrade the disks in those OSTs, copy the /CONFIG* data over to the newly created disks, and rebalance the data across the OSTs. However, I ran into an issue caused by a typo, detailed below.

> For RAID-1+0 the alignment is much less important. While there is still some negative effect if the 1MB read or write is not aligned (because it will make an extra pair of disks active to fill the RPC), this is not nearly as bad as RAID-5/6, where it will also cause the parity chunk to be rewritten.
>
> If you are using a 6-disk RAID-1+0 then it would be OK, for example, to configure the RAID chunksize to be 128kB, even though this means that a 1MB IO would handle 3*128kB from two pairs of disks and 4*128kB from the third pair of disks (each IO would be sequential though).
>
> It means a given pair of disks would do a bit more work than the others for a given RPC, but since the IO is sequential (assuming the request itself is sequential) it will not need an extra seek for the last disk, and the extra IO is a minimal effort.

It looks like I had a typo in my config the first time I created the FS. I had planned ahead and have extra disks to shuffle things around. While migrating data off of the old OSTs, these settings seem to leave me missing about 5TB out of 20TB. The first time I created the FS, I created my 6-disk RAID-1/0 filesystem with the following options:

--mkfsoptions="-E stripe=256 -E stride=32"

To recover from this I am performing the following: I am recreating the FS with the settings below, then cp'ing the contents of OST.old to OST.new, and then remounting OST.new as OST.old.

<stripe-width> = (<chunk> * <data disks>) / <4k>
96 = (128*3)/4k

--mkfsoptions="-E stripe-width=96,stride=32"

I have a couple of sanity-check questions. If I have the old OST and new OST side by side, would it be enough to do a cp -ar of the /ost/O dir, or should I use a different migration procedure?

I have found some mention on lustre-discuss that using a tool that does a backup of the xattrs is preferable. I am assuming that the cp -a should be sufficient since it is supposed to preserve everything. In the lustre-discuss articles I only saw a mention of the patched tar and rsync. Is there any reason not to trust cp?

On a related tangent, I also found that the documentation in the manual is a bit out of date. The manual refers to <stripe-width> as <stripe>; the current version of mkfs.lustre only takes <stripe-width> as a valid option. I will submit a documentation bug for this tomorrow.

Thank you,
Sebastian
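[A rough sketch of the recreate-and-copy step described above, with placeholder fsname, MGS NID, device names and mount points. It assumes the old and new OST devices can be mounted as ldiskfs on the same server at the same time; it is not a complete or validated OST backup/restore recipe, just the shape of the operation:]

    # format the new LUN with the corrected geometry (placeholders throughout)
    mkfs.lustre --ost --fsname=lfs01 --mgsnode=10.0.0.1@tcp \
        --mkfsoptions="-E stripe-width=96,stride=32" /dev/sdY

    # mount both OSTs as plain ldiskfs and copy the object tree and configs
    mount -t ldiskfs /dev/sdX /mnt/ost_old
    mount -t ldiskfs /dev/sdY /mnt/ost_new
    cp -a /mnt/ost_old/O       /mnt/ost_new/
    cp -a /mnt/ost_old/CONFIGS /mnt/ost_new/

    # afterwards, verify that the object xattrs survived the copy
    # (see the follow-up below)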
Andreas Dilger
2010-Aug-03 07:46 UTC
[Lustre-discuss] clarification on mkfs.lustre options
On 2010-08-02, at 23:06, Sebastian Gutierrez <gutseb at cs.stanford.edu> wrote:

> I have found some mention on lustre-discuss that using a tool that does a backup of the xattrs is preferable. I am assuming that the cp -a should be sufficient since it is supposed to preserve everything. In the lustre-discuss articles I only saw a mention of the patched tar and rsync. Is there any reason not to trust cp?

If your cp copies xattrs then great. I would verify this is correct by listing the xattrs on both the old and new objects and comparing them.

In newer Lustre RPMs there is ll_decode_filter_fid, which will format the OST object xattrs nicely.

Cheers, Andreas
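[An example of the kind of comparison Andreas suggests, run as root against the same object on the old and new copies; the mount points and object path here are made up for illustration, and ll_decode_filter_fid is only present on installs new enough to ship it:]

    # dump the trusted.* xattrs of one object on each copy and compare
    getfattr -d -m trusted -e hex /mnt/ost_old/O/0/d5/1605
    getfattr -d -m trusted -e hex /mnt/ost_new/O/0/d5/1605

    # or pretty-print the fid xattr on the copied object
    ll_decode_filter_fid /mnt/ost_new/O/0/d5/1605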
Sebastian Gutierrez
2010-Aug-04 18:04 UTC
[Lustre-discuss] clarification on mkfs.lustre options
>> I have found some mention on lustre-discuss that using a tool that does a backup of the xattrs is preferable. I am assuming that the cp -a should be sufficient since it is supposed to preserve everything. In the lustre-discuss articles I only saw a mention of the patched tar and rsync. Is there any reason not to trust cp?
>
> If your cp copies xattrs then great. I would verify this is correct by listing the xattrs on both the old and new objects and comparing them.
>
> In newer Lustre RPMs there is ll_decode_filter_fid, which will format the OST object xattrs nicely.
>
> Cheers, Andreas

The cp -ar worked fine.

Thanks again,
Sebastian