I read in the Lustre Operations Manual that there is an OST size limitation of 16 TB on RHEL and 8 TB on other distributions because of the ext3 file system limitation. I have a few questions about that.

Why is the limitation 16 TB on RHEL?

I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system. What will be the OST size limitation?

What is the OST size limitation when using ext4?

Is it preferable to use ext4 instead of ext3?

If the block device has more than 8 TB or 16 TB, it must be partitioned. Is there a performance degradation when a device has multiple partitions compared to a single partition? In other words, is it better to have three 8 TB devices with one partition per device than to have one 24 TB device with three partitions?

Denis Charland
UNIX Systems Administrator
National Research Council Canada
On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
> I read in the Lustre Operations Manual that there is an OST size limitation
> of 16 TB on RHEL and 8 TB on other distributions because of the ext3 file
> system limitation. I have a few questions about that.
>
> Why is the limitation 16 TB on RHEL?

16TB is the maximum size RedHat supports. See http://www.redhat.com/rhel/compare/
Larger than that requires bigger changes.

Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see
http://jira.whamcloud.com/browse/LU-419 ). Whamcloud's Lustre 2.1 (not sure
you'd want to use it) claims support for 128TB LUNs.

> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system.
> What will be the OST size limitation?
>
> What is the OST size limitation when using ext4?

16TB with the Lustre-patched RHEL kernel.

> Is it preferable to use ext4 instead of ext3?
>
> If the block device has more than 8 TB or 16 TB, it must be partitioned.
> Is there a performance degradation when a device has multiple partitions
> compared to a single partition? In other words, is it better to have
> three 8 TB devices with one partition per device than to have one 24 TB
> device with three partitions?

Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive
heads to move to opposite parts of the disk does degrade performance (with
a single OST moving the drive heads, the block allocator tries to minimize
movement).
Any good reasons to use ext4 instead of ext3?

Denis
These two are the most important to me:
- Bigger OST size (up to 24TB)
- Faster fsck (~10x)

Wojciech

On 3 November 2011 14:38, Charland, Denis <Denis.Charland at imi.cnrc-nrc.gc.ca> wrote:
> Any good reasons to use ext4 instead of ext3?
>
> Denis
On 2011-11-02, at 2:09 PM, Kevin Van Maren wrote:
> On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
>> I read in the Lustre Operations Manual that there is an OST size
>> limitation of 16 TB on RHEL and 8 TB on other distributions because
>> of the ext3 file system limitation. I have a few questions about that.
>>
>> Why is the limitation 16 TB on RHEL?
>
> 16TB is the maximum size RedHat supports. See http://www.redhat.com/rhel/compare/
> Larger than that requires bigger changes.
>
> Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see
> http://jira.whamcloud.com/browse/LU-419 ).

That is just from not wanting to force ext4 formatting for users that do
not need it. As discussed in that bug, using --mkfsoptions="-t ext4" allows
formatting LUNs over 16TB.

This will be the default for 1.8.7-wc because all supported distros are
only using ext4-based ldiskfs.

> Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for
> 128TB LUNs.

We tested LUNs this large (filling them full and verifying all data), but I
don't expect they will be needed for some time yet.

>> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system.
>> What will be the OST size limitation?
>>
>> What is the OST size limitation when using ext4?
>
> 16TB with the Lustre-patched RHEL kernel.

You will have problems running the 1.8.5 RHEL5 kernel on FC 12 because the
init scripts are different. Also, as Kevin writes, none of the >16TB fixes
are included in 1.8.5. I would strongly recommend running 1.8.6 instead.

>> Is it preferable to use ext4 instead of ext3?
>>
>> If the block device has more than 8 TB or 16 TB, it must be partitioned.
>> Is there a performance degradation when a device has multiple partitions
>> compared to a single partition? In other words, is it better to have
>> three 8 TB devices with one partition per device than to have one 24 TB
>> device with three partitions?
>
> Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive
> heads to move to opposite parts of the disk does degrade performance (with
> a single OST moving the drive heads, the block allocator tries to minimize
> movement).

Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
flash :-), but the 512-byte sector offset added by the partition table will
cause all IO to be misaligned to the underlying device. Even with flash
storage it is much better to align the IO on power-of-two boundaries, since
the erase blocks cause extra latency if there are read-modify-write
operations.
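For example, formatting a >16TB LUN with 1.8.6-wc1 would look something like
this minimal sketch (the device name, fsname, and MGS NID here are made up
for illustration):

    # format the whole device as one OST, forcing the ext4-based ldiskfs
    mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp0 \
        --mkfsoptions="-t ext4" /dev/sdb

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.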
Andreas,

I'm not going to use the 1.8.5 RHEL5 kernel on FC12. On a test system
running FC12, I already rebuilt a patched 2.6.32.19-163 kernel using the
SLES11 patch series.

If I build Lustre 1.8.5 using --enable-ext4 with configure, is mkfs.lustre
going to create an ext4 file system by default, or will I have to use
--mkfsoptions="-t ext4"?

Denis
On Nov 3, 2011, at 12:40 PM, Andreas Dilger wrote:
> Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
> flash :-), but the 512-byte sector offset added by the partition table will
> cause all IO to be misaligned to the underlying device.

It is possible to align partition boundaries, but it is not the default.
Partitions (if used) should normally be aligned to a multiple of the RAID
stripe size, although note that some RAID controllers internally compensate
for the "expected" misalignment. See
http://wikis.sun.com/display/Performance/Aligning+Flash+Modules+for+Optimal+Performance

> Even with flash storage it is much better to align the IO on power-of-two
> boundaries, since the erase blocks cause extra latency if there are read-
> modify-write operations.

That also depends on the flash. The Fusion-io products have no alignment
issues.
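As a concrete sketch (the device name and the 1MiB boundary are examples
only; substitute a multiple of your actual RAID stripe size), GNU parted can
create an aligned partition explicitly:

    # GPT label, first partition starting on a 1MiB boundary
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB 100%

The old msdos default of starting the first partition at sector 63 is what
produces the misalignment Andreas described.

Kevin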
Looking at the config.log file in the ldiskfs directory, I noticed that
configure was called with --enable-ext4 even though I built Lustre 1.8.5
without specifying --enable-ext4 when I ran the configure command.

Is ldiskfs built based on ext4 by default when ext4 is available? If yes,
can I create the file systems (MGS, MDT and OST) using either ext3 or ext4?
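For reference, this is how I spotted it (the path assumes the top of my
Lustre 1.8.5 build tree):

    # the -- ends option parsing so the leading dashes in the pattern work
    grep -- '--enable-ext4' ldiskfs/config.log

Denis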
Picking up on an old message...

On 03/11/11 18:40, Andreas Dilger wrote:
> On 2011-11-02, at 2:09 PM, Kevin Van Maren wrote:
>> On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
>>> I read in the Lustre Operations Manual that there is an OST size
>>> limitation of 16 TB on RHEL and 8 TB on other distributions because
>>> of the ext3 file system limitation. I have a few questions about that.
>>>
>>> Why is the limitation 16 TB on RHEL?
>>
>> 16TB is the maximum size RedHat supports. See http://www.redhat.com/rhel/compare/
>> Larger than that requires bigger changes.
>>
>> Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see
>> http://jira.whamcloud.com/browse/LU-419 ).
>
> That is just from not wanting to force ext4 formatting for users that do
> not need it. As discussed in that bug, using --mkfsoptions="-t ext4" allows
> formatting LUNs over 16TB.
>
> This will be the default for 1.8.7-wc because all supported distros are
> only using ext4-based ldiskfs.
>
>> Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for
>> 128TB LUNs.
>
> We tested LUNs this large (filling them full and verifying all data), but I
> don't expect they will be needed for some time yet.

They would be useful to us with 1.8.8-wc1. We have disk servers where we
want to use 30TB OSTs - this is annoyingly just over the 24TiB limit [1].

When I try to create a filesystem, it fails with:

mkfs.lustre: Unable to mount /dev/sdb: Invalid argument
mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 22 (Invalid argument)

And I see the following in /var/log/messages [2]:

LDISKFS-fs does not support filesystems greater than 24TB and can cause
data corruption. Use "force_over_24tb" mount option to override.

Is this warning just being cautious - or are there known issues? Has
there been testing of this in the last 9 months?

>>> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system.
>>> What will be the OST size limitation?
>>>
>>> What is the OST size limitation when using ext4?
>>
>> 16TB with the Lustre-patched RHEL kernel.
>
> You will have problems running the 1.8.5 RHEL5 kernel on FC 12 because the
> init scripts are different. Also, as Kevin writes, none of the >16TB fixes
> are included in 1.8.5. I would strongly recommend running 1.8.6 instead.
>
>>> Is it preferable to use ext4 instead of ext3?
>>>
>>> If the block device has more than 8 TB or 16 TB, it must be partitioned.
>>> Is there a performance degradation when a device has multiple partitions
>>> compared to a single partition? In other words, is it better to have
>>> three 8 TB devices with one partition per device than to have one 24 TB
>>> device with three partitions?
>>
>> Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive
>> heads to move to opposite parts of the disk does degrade performance (with
>> a single OST moving the drive heads, the block allocator tries to minimize
>> movement).

The advantage of 1 partition of 30TB is we avoid losing the space taken
up by creating multiple LUNs and the performance degradation of
different partitions.

> Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
> flash :-), but the 512-byte sector offset added by the partition table will
> cause all IO to be misaligned to the underlying device.
>
> Even with flash storage it is much better to align the IO on power-of-two
> boundaries, since the erase blocks cause extra latency if there are read-
> modify-write operations.

Chris

[1] We do appreciate that with 12*3TB disks as a RAID 6 array we may not
get the performance of an 8+2 array, but we would like to keep the
capacity (and the performance of older servers with 12*2TB disks is
"good enough").

[2] It would be helpful if I saw this error on the terminal too.

PS man mkfs.lustre is somewhat out of date - it says:
    mkfs.lustre is part of the Lustre(7) filesystem package and is
    available from Sun Microsystems via http://downloads.lustre.org/
On 2012-08-02, at 8:50, "Christopher J. Walker" <C.J.Walker at qmul.ac.uk> wrote:
> Picking up on an old message...
>
> On 03/11/11 18:40, Andreas Dilger wrote:
>> [...]
>> We tested LUNs this large (filling them full and verifying all data), but
>> I don't expect they will be needed for some time yet.
>
> They would be useful to us with 1.8.8-wc1. We have disk servers where we
> want to use 30TB OSTs - this is annoyingly just over the 24TiB limit [1].
>
> When I try to create a filesystem, it fails with:
>
> mkfs.lustre: Unable to mount /dev/sdb: Invalid argument
> mkfs.lustre FATAL: failed to write local files
> mkfs.lustre: exiting with 22 (Invalid argument)
>
> And I see the following in /var/log/messages [2]:
>
> LDISKFS-fs does not support filesystems greater than 24TB and can cause
> data corruption. Use "force_over_24tb" mount option to override.
>
> Is this warning just being cautious - or are there known issues? Has
> there been testing of this in the last 9 months?

It is about being cautious and only allowing what we have tested. There
are no limits that I'm aware of that differentiate between 24TB and 32TB,
but we never tested more than this.

At a very minimum, you need to be running e2fsprogs-1.42.3-wc1, since it
fixes one critical bug for filesystems larger than 16TB (which was
proportionally more likely to be hit for larger filesystems).

It would also be useful if you report to the list about your success or
failure at this size, since I don't think many sites are using LUNs this
large yet.

> The advantage of 1 partition of 30TB is we avoid losing the space taken
> up by creating multiple LUNs and the performance degradation of
> different partitions.
>
> [...]
>
> [2] It would be helpful if I saw this error on the terminal too.

It is not possible to print messages from within the kernel to the terminal.

> PS man mkfs.lustre is somewhat out of date - it says:
>     mkfs.lustre is part of the Lustre(7) filesystem package and is
>     available from Sun Microsystems via http://downloads.lustre.org/

I believe we've fixed this for newer 2.x releases, but not for 1.8.
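If you do try it, a minimal sketch of the invocation would be something like
the following (the device name, fsname, and MGS NID are made up; note also
that --mountfsoptions replaces the defaults rather than appending to them,
so the usual 1.8 ldiskfs OST options are repeated alongside force_over_24tb):

    # confirm the fixed e2fsprogs is installed first
    rpm -q e2fsprogs    # should report 1.42.3-wc1 or newer
    # format the 30TB LUN, overriding the 24TB safety check at mount time
    mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp0 \
        --mkfsoptions="-t ext4" \
        --mountfsoptions="errors=remount-ro,extents,mballoc,force_over_24tb" \
        /dev/sdb

Cheers, Andreas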