Configuring a new MDS/MGS machine for our full-blown Lustre setup. The OSTs are x4500s, following the layout of the Sun paper on the Lustre wiki. This paper does not talk about the MDS though. We have a Sun 2540 with 12 15K 300 GB drives.

We plan to use 4 drives in a 1+0 with the rest being spares. What I am curious about are the following options:

Stripe size,
Readahead on the MDS RAID

I did not find anything in the manual about this, other than disabling readahead on DDN hardware, but that sounded like it was about OSTs, not the MDS.

Also (not Lustre related), if someone at Sun is listening: CAM is awful. But I will bring that up with our sales guy.

Thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Jun 16, 2008 14:00 -0400, Brock Palen wrote:
> This paper does not talk about the MDS though. We have a sun 2540
> with 12 15K 300 GB drives.
>
> We plan to use 4 drives in a 1+0 with the rest being spares. What
> I am curious about are the following options
>
> Stripe size,
> Readahead on the MDS Raid

There is a discussion about MDS + RAID in the Lustre Manual, section 10.

When formatting a filesystem on a RAID device, it is beneficial to specify additional parameters at the time of formatting. This ensures that the filesystem is optimized for the underlying disk geometry. Use the --mkfsoptions parameter to specify these options in the Lustre configuration.

For RAID5, RAID6, RAID1+0 storage, specifying the -E stride={stride_size} option improves the layout of the filesystem metadata, ensuring that no single disk contains all of the allocation bitmaps. The stride_size parameter is in units of 4096-byte blocks and represents the amount of contiguous data written to a single disk before moving to the next disk. This is applicable to both MDS and OST filesystems.

Note - It is better to have the MDS on RAID1+0 than on RAID5 or RAID6.

RAID1 with an internal journal and two disks from different controllers. If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and then make a RAID0 array of the RAID1 devices. This ensures maximum reliability because multiple disk failures only have a small chance of hitting both disks in the same RAID1 device.

Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even two disk failures can cause the loss of the whole MDT device. The first failure will disable an entire half of the mirror and the second failure has a 50% chance of disabling the remaining mirror.

> I did not find anything in the manual about this, other than disable
> readahead on DDN hardware but that sounded like OST's not MDS.

Readahead will not have much benefit for the MDT, because most of the IO is random. The chunksize for RAID1 is mostly meaningless.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
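The stride arithmetic in the manual excerpt above can be sketched as follows. This is only an illustration: the 64 KiB chunk size is a hypothetical value (the thread never states the 2540's chunk size), and the helper names are made up for this example. The only facts taken from the thread are that stride is counted in 4096-byte blocks and is passed as -E stride=... through --mkfsoptions.

```python
def stride_blocks(chunk_kib, block_bytes=4096):
    """Convert a RAID chunk size (KiB written to one disk before moving
    to the next) into a stride value counted in filesystem blocks."""
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes <= 0 or chunk_bytes % block_bytes != 0:
        raise ValueError("chunk size must be a positive multiple of the block size")
    return chunk_bytes // block_bytes


def mkfs_stride_option(chunk_kib):
    """Build the '-E stride=N' string that would go inside --mkfsoptions."""
    return f"-E stride={stride_blocks(chunk_kib)}"


if __name__ == "__main__":
    # Hypothetical 64 KiB chunk size; substitute the value your array uses.
    print(mkfs_stride_option(64))
```

With a 64 KiB chunk this yields a stride of 16 blocks. Check the chunk (segment) size actually configured on the array before using any value.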
On Jun 16, 2008, at 2:31 PM, Andreas Dilger wrote:
> On Jun 16, 2008 14:00 -0400, Brock Palen wrote:
>> This paper does not talk about the MDS though. We have a sun 2540
>> with 12 15K 300 GB drives.
>>
>> We plan to use 4 drives in a 1+0 with the rest being spares. What
>> I am curious about are the following options
>>
>> Stripe size,
>> Readahead on the MDS Raid
>
> There is a discussion about MDS + RAID in the Lustre Manual,
> section 10.

I read too fast and make mistakes, thank you!

> When formatting a filesystem on a RAID device, it is beneficial to
> specify additional parameters at the time of formatting. This ensures
> that the filesystem is optimized for the underlying disk geometry.
> Use the --mkfsoptions parameter to specify these options in the
> Lustre configuration.
>
> For RAID5, RAID6, RAID1+0 storage, specifying the
> -E stride={stride_size} option improves the layout of the filesystem
> metadata, ensuring that no single disk contains all of the allocation
> bitmaps. The stride_size parameter is in units of 4096-byte blocks
> and represents the amount of contiguous data written to a single disk
> before moving to the next disk. This is applicable to both MDS and
> OST filesystems.

Good to point that out.

> Note - It is better to have the MDS on RAID1+0 than on RAID5 or RAID6.
>
> RAID1 with an internal journal and two disks from different
> controllers. If you need a larger MDT, create multiple RAID1 devices
> from pairs of disks, and then make a RAID0 array of the RAID1 devices.
> This ensures maximum reliability because multiple disk failures only
> have a small chance of hitting both disks in the same RAID1 device.

We are going to have several unused disks in the MGS/MDS array. Would it be helpful for the MDS (less so the MGS, I would think) to use an external journal on a pair of disks in RAID 1? I am not sure it would help that much, but I could be wrong.

> Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance
> that even two disk failures can cause the loss of the whole MDT
> device. The first failure will disable an entire half of the mirror
> and the second failure has a 50% chance of disabling the remaining
> mirror.
>
>> I did not find anything in the manual about this, other than disable
>> readahead on DDN hardware but that sounded like OST's not MDS.
>
> Readahead will not have much benefit for the MDT, because most of the
> IO is random. The chunksize for RAID1 is mostly meaningless.

OK, I didn't know that. I will look into why.

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
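The two-disk-failure comparison quoted above (RAID0 over RAID1 pairs versus RAID1 over two RAID0 stripes) can be checked by brute-force enumeration. The sketch below assumes a deliberately simple model that is not from the thread: exactly two distinct disks fail, every pair is equally likely, and no rebuild onto a spare happens in between. Under that model the quoted 50% is a round figure; the exact value for RAID1-over-RAID0 is n/(2n-1) for n disks per stripe, which is 2/3 for four disks and approaches 1/2 as the array grows.

```python
from fractions import Fraction
from itertools import combinations


def loss_probability(num_pairs, layout):
    """Probability that two simultaneous disk failures destroy the array.

    num_pairs: number of mirrored pairs (total disks = 2 * num_pairs).
    layout:    "raid10" = RAID0 across RAID1 pairs,
               "raid01" = RAID1 across two RAID0 stripes.
    """
    fatal = 0
    total = 0
    for a, b in combinations(range(2 * num_pairs), 2):
        total += 1
        if layout == "raid10":
            # Disks 2k and 2k+1 mirror each other; data is lost only
            # when both members of the same pair fail.
            lost = a // 2 == b // 2
        elif layout == "raid01":
            # Disks 0..n-1 are stripe A, n..2n-1 are stripe B; data is
            # lost when each stripe has lost at least one disk.
            lost = (a < num_pairs) != (b < num_pairs)
        else:
            raise ValueError(f"unknown layout: {layout}")
        fatal += lost
    return Fraction(fatal, total)


if __name__ == "__main__":
    for pairs in (2, 6):
        print(2 * pairs, "disks:",
              "raid10", loss_probability(pairs, "raid10"),
              "raid01", loss_probability(pairs, "raid01"))
```

For the four-disk 1+0 planned in this thread, the model gives a 1/3 chance that a second failure is fatal, against 2/3 for the opposite layout, which supports the recommendation even though the exact numbers differ from the rounded 50%.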
On Jun 16, 2008 14:50 -0400, Brock Palen wrote:
> On Jun 16, 2008, at 2:31 PM, Andreas Dilger wrote:
> > Note - It is better to have the MDS on RAID1+0 than on RAID5 or RAID6.
> >
> > RAID1 with an internal journal and two disks from different controllers.
> > If you need a larger MDT, create multiple RAID1 devices from pairs of
> > disks, and then make a RAID0 array of the RAID1 devices. This ensures
> > maximum reliability because multiple disk failures only have a small
> > chance of hitting both disks in the same RAID1 device.
>
> We are going to have several unused disks in the MGS/MDS array.
> Would it be helpful for the MDS (less the MGS I would think) to use
> an external journal on a pair of disks in raid 1? I am not sure it
> would help that much, but I could be wrong.

Yes, having an external RAID1 journal helps spread the load over more spindles and avoids seeking the "filesystem" heads to write into the journal area. The journal itself is mostly written linearly.

Note also that "journal size ~= RAM usage", so don't go making it too huge unless you have enough RAM to back it up. Some customers use journals up to 2GB (on a 32GB RAM machine) for maximum performance.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
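The "journal size ~= RAM usage" rule of thumb above can be turned into a quick sanity check. This is a rough sketch, not a Lustre tool: the function names are invented for illustration, and the budget fraction is a hypothetical knob you would pick yourself. The only data point from the thread is the 2GB journal on a 32GB machine.

```python
def journal_fraction_of_ram(journal_gib, ram_gib):
    """Share of RAM the journal could occupy if, as the rule of thumb
    says, the journal size is roughly equal to its RAM usage."""
    if journal_gib <= 0 or ram_gib <= 0:
        raise ValueError("sizes must be positive")
    return journal_gib / ram_gib


def journal_fits(journal_gib, ram_gib, max_fraction):
    """True if the journal stays within a chosen (hypothetical) RAM budget."""
    return journal_fraction_of_ram(journal_gib, ram_gib) <= max_fraction


if __name__ == "__main__":
    # The example from the thread: a 2 GB journal on a 32 GB machine.
    print(journal_fraction_of_ram(2, 32))
    # Hypothetical budget of 10% of RAM for the journal.
    print(journal_fits(2, 32, 0.10))
```

The thread's example works out to 6.25% of RAM. Whether a larger share is acceptable depends on what else the MDS needs memory for, which this sketch does not model.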