I read in the Lustre Operations Manual that there is an OST size limitation of 16 TB on RHEL and 8 TB on other distributions because of the ext3 file system limitation. I have a few questions about that.

Why is the limitation 16 TB on RHEL?

I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system. What will be the OST size limitation?

What is the OST size limitation when using ext4?

Is it preferable to use ext4 instead of ext3?

If the block device has more than 8 TB or 16 TB, it must be partitioned. Is there a performance degradation when a device has multiple partitions compared to a single partition? In other words, is it better to have three 8 TB devices with one partition per device than to have one 24 TB device with three partitions?

Denis Charland
UNIX Systems Administrator
National Research Council Canada
On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
> I read in the Lustre Operations Manual that there is an OST size limitation
> of 16 TB on RHEL and 8 TB on other distributions because of the ext3 file
> system limitation. I have a few questions about that.
>
> Why is the limitation 16 TB on RHEL?

16TB is the maximum size RedHat supports. See http://www.redhat.com/rhel/compare/
Larger than that requires bigger changes.

Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see
http://jira.whamcloud.com/browse/LU-419 ). Whamcloud's Lustre 2.1 (not sure
you'd want to use it) claims support for 128TB LUNs.

> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system.
> What will be the OST size limitation?
>
> What is the OST size limitation when using ext4?

16TB with the Lustre-patched RHEL kernel.

> Is it preferable to use ext4 instead of ext3?
>
> If the block device has more than 8 TB or 16 TB, it must be partitioned.
> Is there a performance degradation when a device has multiple partitions
> compared to a single partition? In other words, is it better to have
> three 8 TB devices with one partition per device than to have one 24 TB
> device with three partitions?

Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive
heads to move to opposite parts of the disk does degrade performance (with
a single OST moving the drive heads, the block allocator tries to minimize
movement).
Any good reasons to use ext4 instead of ext3?

Denis
These two are the most important to me:
- Bigger OST size (up to 24TB)
- Faster fsck (~10x)

Wojciech

On 3 November 2011 14:38, Charland, Denis <Denis.Charland at imi.cnrc-nrc.gc.ca> wrote:
> Any good reasons to use ext4 instead of ext3?
>
> Denis
On 2011-11-02, at 2:09 PM, Kevin Van Maren wrote:
> On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
>> I read in the Lustre Operations Manual that there is an OST size
>> limitation of 16 TB on RHEL and 8 TB on other distributions because
>> of the ext3 file system limitation. I have a few questions about that.
>>
>> Why is the limitation 16 TB on RHEL?
>
> 16TB is the maximum size RedHat supports. See http://www.redhat.com/rhel/compare/
> Larger than that requires bigger changes.
>
> Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see
> http://jira.whamcloud.com/browse/LU-419 ).

That is just from not wanting to force ext4 formatting for users that do
not need it. As discussed in that bug, using --mkfsoptions="-t ext4" allows
formatting LUNs over 16TB.

This will be the default for 1.8.7-wc because all supported distros are
only using ext4-based ldiskfs.

> Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for
> 128TB LUNs.

We tested LUNs this large (filling them full and verifying all data), but I
don't expect they will be needed for some time yet.

>> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system.
>> What will be the OST size limitation?
>>
>> What is the OST size limitation when using ext4?
>
> 16TB with the Lustre-patched RHEL kernel.

You will have problems running the 1.8.5 RHEL5 kernel on FC 12 because the
init scripts are different. Also, as Kevin writes, none of the >16TB fixes
are included in 1.8.5. I would strongly recommend running 1.8.6 instead.

>> Is it preferable to use ext4 instead of ext3?
>>
>> If the block device has more than 8 TB or 16 TB, it must be partitioned.
>> Is there a performance degradation when a device has multiple partitions
>> compared to a single partition? In other words, is it better to have
>> three 8 TB devices with one partition per device than to have one 24 TB
>> device with three partitions?
>
> Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive
> heads to move to opposite parts of the disk does degrade performance (with
> a single OST moving the drive heads, the block allocator tries to minimize
> movement).

Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
flash :-), but the 512-byte sector offset added by the partition table will
cause all IO to be misaligned to the underlying device. Even with flash
storage it is much better to align the IO on power-of-two boundaries, since
the erase blocks cause extra latency if there are read-modify-write
operations.
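For example, formatting a >16TB LUN with 1.8.6-wc1 would look something like
this minimal sketch (the device name, fsname, and MGS NID here are made up
for illustration):

    # format the whole device as one OST, forcing the ext4-based ldiskfs
    mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp0 \
        --mkfsoptions="-t ext4" /dev/sdb

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.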
Andreas,

I'm not going to use the 1.8.5 RHEL5 kernel on FC12. On a test system
running FC12, I already rebuilt a patched 2.6.32.19-163 kernel using the
SLES11 patch series.

If I build Lustre 1.8.5 using --enable-ext4 with configure, is mkfs.lustre
going to create an ext4 file system by default, or will I have to use
--mkfsoptions="-t ext4"?

Denis
On Nov 3, 2011, at 12:40 PM, Andreas Dilger wrote:
> Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
> flash :-), but the 512-byte sector offset added by the partition table will
> cause all IO to be misaligned to the underlying device.

It is possible to align partition boundaries, but it is not the default.
Partitions (if used) should normally be aligned to a multiple of the RAID
stripe size, although note that some RAID controllers internally compensate
for the "expected" misalignment. See
http://wikis.sun.com/display/Performance/Aligning+Flash+Modules+for+Optimal+Performance

> Even with flash storage it is much better to align the IO on power-of-two
> boundaries, since the erase blocks cause extra latency if there are read-
> modify-write operations.

That also depends on the flash. The Fusion-io products have no alignment
issues.
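As a concrete sketch (the device name and the 1MiB boundary are examples
only; substitute a multiple of your actual RAID stripe size), GNU parted can
create an aligned partition explicitly:

    # GPT label, first partition starting on a 1MiB boundary
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB 100%

The old msdos default of starting the first partition at sector 63 is what
produces the misalignment Andreas described.

Kevin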
Looking at the config.log file in the ldiskfs directory, I noticed that
configure was called with --enable-ext4 even though I built Lustre 1.8.5
without specifying --enable-ext4 when I ran the configure command.

Is ldiskfs built based on ext4 by default when ext4 is available? If yes,
can I create the file systems (MGS, MDT and OST) using either ext3 or ext4?
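For reference, this is how I spotted it (the path assumes the top of my
Lustre 1.8.5 build tree):

    # the -- ends option parsing so the leading dashes in the pattern work
    grep -- '--enable-ext4' ldiskfs/config.log

Denis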
Picking up on an old message...

On 03/11/11 18:40, Andreas Dilger wrote:
> On 2011-11-02, at 2:09 PM, Kevin Van Maren wrote:
>> On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
>>> I read in the Lustre Operations Manual that there is an OST size
>>> limitation of 16 TB on RHEL and 8 TB on other distributions because
>>> of the ext3 file system limitation. I have a few questions about that.
>>>
>>> Why is the limitation 16 TB on RHEL?
>>
>> 16TB is the maximum size RedHat supports. See http://www.redhat.com/rhel/compare/
>> Larger than that requires bigger changes.
>>
>> Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see
>> http://jira.whamcloud.com/browse/LU-419 ).
>
> That is just from not wanting to force ext4 formatting for users that do
> not need it. As discussed in that bug, using --mkfsoptions="-t ext4" allows
> formatting LUNs over 16TB.
>
> This will be the default for 1.8.7-wc because all supported distros are
> only using ext4-based ldiskfs.
>
>> Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for
>> 128TB LUNs.
>
> We tested LUNs this large (filling them full and verifying all data), but I
> don't expect they will be needed for some time yet.

They would be useful to us with 1.8.8-wc1. We have disk servers where we
want to use 30TB OSTs - this is annoyingly just over the 24TiB limit [1].

When I try to create a filesystem, it fails with:

mkfs.lustre: Unable to mount /dev/sdb: Invalid argument
mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 22 (Invalid argument)

And I see the following in /var/log/messages [2]:

LDISKFS-fs does not support filesystems greater than 24TB and can cause
data corruption. Use "force_over_24tb" mount option to override.

Is this warning just being cautious - or are there known issues? Has
there been testing of this in the last 9 months?

>>> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system.
>>> What will be the OST size limitation?
>>>
>>> What is the OST size limitation when using ext4?
>>
>> 16TB with the Lustre-patched RHEL kernel.
>
> You will have problems running the 1.8.5 RHEL5 kernel on FC 12 because the
> init scripts are different. Also, as Kevin writes, none of the >16TB fixes
> are included in 1.8.5. I would strongly recommend running 1.8.6 instead.
>
>>> Is it preferable to use ext4 instead of ext3?
>>>
>>> If the block device has more than 8 TB or 16 TB, it must be partitioned.
>>> Is there a performance degradation when a device has multiple partitions
>>> compared to a single partition? In other words, is it better to have
>>> three 8 TB devices with one partition per device than to have one 24 TB
>>> device with three partitions?
>>
>> Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive
>> heads to move to opposite parts of the disk does degrade performance (with
>> a single OST moving the drive heads, the block allocator tries to minimize
>> movement).

The advantage of 1 partition of 30TB is we avoid losing the space taken
up by creating multiple LUNs and the performance degradation of
different partitions.

> Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
> flash :-), but the 512-byte sector offset added by the partition table will
> cause all IO to be misaligned to the underlying device.
>
> Even with flash storage it is much better to align the IO on power-of-two
> boundaries, since the erase blocks cause extra latency if there are read-
> modify-write operations.

Chris

[1] We do appreciate that with 12*3TB disks as a RAID 6 array we may not
get the performance of an 8+2 array, but we would like to keep the
capacity (and the performance of older servers with 12*2TB disks is
"good enough").

[2] It would be helpful if I saw this error on the terminal too.

PS man mkfs.lustre is somewhat out of date - it says:
    mkfs.lustre is part of the Lustre(7) filesystem package and is
    available from Sun Microsystems via http://downloads.lustre.org/
On 2012-08-02, at 8:50, "Christopher J. Walker" <C.J.Walker at qmul.ac.uk> wrote:
> Picking up on an old message...
>
> On 03/11/11 18:40, Andreas Dilger wrote:
>> [...]
>> We tested LUNs this large (filling them full and verifying all data), but
>> I don't expect they will be needed for some time yet.
>
> They would be useful to us with 1.8.8-wc1. We have disk servers where we
> want to use 30TB OSTs - this is annoyingly just over the 24TiB limit [1].
>
> When I try to create a filesystem, it fails with:
>
> mkfs.lustre: Unable to mount /dev/sdb: Invalid argument
> mkfs.lustre FATAL: failed to write local files
> mkfs.lustre: exiting with 22 (Invalid argument)
>
> And I see the following in /var/log/messages [2]:
>
> LDISKFS-fs does not support filesystems greater than 24TB and can cause
> data corruption. Use "force_over_24tb" mount option to override.
>
> Is this warning just being cautious - or are there known issues? Has
> there been testing of this in the last 9 months?

It is about being cautious and only allowing what we have tested. There
are no limits that I'm aware of that differentiate between 24TB and 32TB,
but we never tested more than this.

At a very minimum, you need to be running e2fsprogs-1.42.3-wc1, since it
fixes one critical bug for filesystems larger than 16TB (which was
proportionally more likely to be hit for larger filesystems).

It would also be useful if you report to the list about your success or
failure at this size, since I don't think many sites are using LUNs this
large yet.

> The advantage of 1 partition of 30TB is we avoid losing the space taken
> up by creating multiple LUNs and the performance degradation of
> different partitions.
>
> [...]
>
> [2] It would be helpful if I saw this error on the terminal too.

It is not possible to print messages from within the kernel to the terminal.

> PS man mkfs.lustre is somewhat out of date - it says:
>     mkfs.lustre is part of the Lustre(7) filesystem package and is
>     available from Sun Microsystems via http://downloads.lustre.org/

I believe we've fixed this for newer 2.x releases, but not for 1.8.
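If you do try it, a minimal sketch of the invocation would be something like
the following (the device name, fsname, and MGS NID are made up; note also
that --mountfsoptions replaces the defaults rather than appending to them,
so the usual 1.8 ldiskfs OST options are repeated alongside force_over_24tb):

    # confirm the fixed e2fsprogs is installed first
    rpm -q e2fsprogs    # should report 1.42.3-wc1 or newer
    # format the 30TB LUN, overriding the 24TB safety check at mount time
    mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp0 \
        --mkfsoptions="-t ext4" \
        --mountfsoptions="errors=remount-ro,extents,mballoc,force_over_24tb" \
        /dev/sdb

Cheers, Andreas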