Michael Kluge
2010-Oct-18 11:58 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
Hi list,

we have Lustre 1.8.3 running on a DDN 9900. One LUN (10 discs) formatted
with XFS shows 400 MB/s when driven with a single 'dd' and large block
sizes. One LUN formatted and mounted with ldiskfs (the ext3-based one that
is the default in 1.8.3) shows 110 MB/s. Is this the expected behaviour?
It looks a bit low compared to XFS.

We think that, with help from DDN, we did everything we can from a hardware
perspective. We formatted the LUN with the correct striping and stripe
size, DDN adjusted some controller parameters, and we even put the file
system journal on a RAM disk. The LUN has 16 TB capacity; I formatted only
7 TB for the moment due to the 8 TB limit.

This is what I did:

MDS_NID=IP at SOMEWHERE
RAM_DEV=/dev/ram1
dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
mke2fs -O journal_dev -b 4096 $RAM_DEV

mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
--mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096
-j -J device=$RAM_DEV" /dev/disk/by-path/...

mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1

Is there a way to push the bandwidth limit for a single data stream any
further?

Michael

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
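For reference, the single-stream test would have been of roughly this form
(block size, count, and output path were not given in the mail, so these
values are purely illustrative):

# Illustrative values only; bs/count were not specified in the mail.
dd if=/dev/zero of=/mnt/ost_1/ddtest bs=4M count=2500 conv=fdatasync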
Bernd Schubert
2010-Oct-18 12:42 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
Hello Michael,

On Monday, October 18, 2010, Michael Kluge wrote:
> Hi list,
>
> we have Lustre 1.8.3 running on a DDN 9900. One LUN (10 discs) formatted
> with XFS shows 400 MB/s when driven with a single 'dd' and large block
> sizes. One LUN formatted and mounted with ldiskfs (the ext3-based one
> that is the default in 1.8.3) shows 110 MB/s. Is this the expected
> behaviour? It looks a bit low compared to XFS.

Yes, unfortunately that is not entirely unexpected with upstream Oracle
versions. Firstly, please send a mail to support at ddn.com and ask for the
udev tuning rpm (please add [Lustre] in the subject line).

Then see this MMP issue here:
https://bugzilla.lustre.org/show_bug.cgi?id=23129

which requires
https://bugzilla.lustre.org/show_bug.cgi?id=22882

(As Lustre requires contributor agreements, and as self-signed agreements
no longer work, this presently causes some headache, and as always with
bureaucracy it takes ages to sort out - so landing our patches is currently
delayed.)

In order to prevent data corruption in case of controller failures, you
should also disable the S2A write-back cache and enable async journals on
Lustre instead (enabled by default in DDN Lustre versions).

> We think that, with help from DDN, we did everything we can from a
> hardware perspective. We formatted the LUN with the correct striping and
> stripe size, DDN adjusted some controller parameters, and we even put
> the file system journal on a RAM disk. The LUN has 16 TB capacity; I
> formatted only 7 TB for the moment due to the 8 TB limit.

You should use the ext4-based ldiskfs to get more than 8 TiB. Our releases
use that as the default.

> This is what I did:
>
> MDS_NID=IP at SOMEWHERE
> RAM_DEV=/dev/ram1
> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
> mke2fs -O journal_dev -b 4096 $RAM_DEV
>
> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096
> -j -J device=$RAM_DEV" /dev/disk/by-path/...
>
> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1
>
> Is there a way to push the bandwidth limit for a single data stream any
> further?

While it could make things difficult with support, you could use our DDN
Lustre releases: http://eu.ddn.com:8080/lustre/lustre/1.8.3/ddn3.3/

Hope it helps,
Bernd

--
Bernd Schubert
DataDirect Networks
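A minimal sketch of toggling async journals on a 1.8 OSS; this assumes the
obdfilter 'sync_journal' tunable documented for Lustre 1.8, so verify it
exists on your build before relying on it:

# Assumes the Lustre 1.8 obdfilter tunable 'sync_journal':
lctl get_param obdfilter.*.sync_journal      # 1 = synchronous journal commits
lctl set_param obdfilter.*.sync_journal=0    # 0 = asynchronous (async journals)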
Johann Lombardi
2010-Oct-18 16:40 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote:
> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
> mke2fs -O journal_dev -b 4096 $RAM_DEV
>
> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096
> -j -J device=$RAM_DEV" /dev/disk/by-path/...
>
> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1

In fact, Lustre uses additional mount options (see "Persistent mount opts"
in the tunefs.lustre output). If your ldiskfs module is based on ext3, you
should add the extents and mballoc options, which are known to improve
performance.

HTH

Cheers,
Johann
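Concretely, that could look like this (device path as in the original mail;
check the "Persistent mount opts" line for your own target before copying
the options):

# Show the mount options Lustre itself would use for this target:
tunefs.lustre --dryrun /dev/disk/by-path/... | grep "Persistent mount opts"
# Then pass the relevant ones when mounting ldiskfs by hand:
mount -t ldiskfs -o extents,mballoc /dev/disk/by-path/... /mnt/ost_1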
Johann Lombardi
2010-Oct-18 16:44 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
On Mon, Oct 18, 2010 at 02:42:47PM +0200, Bernd Schubert wrote:
> Yes, unfortunately that is not entirely unexpected with upstream Oracle
> versions. Firstly, please send a mail to support at ddn.com and ask for
> the udev tuning rpm (please add [Lustre] in the subject line).
>
> Then see this MMP issue here:
> https://bugzilla.lustre.org/show_bug.cgi?id=23129
>
> which requires
> https://bugzilla.lustre.org/show_bug.cgi?id=22882

Please note that, since Michael has not specified a failnode, MMP is
disabled.

Cheers,
Johann
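One way to confirm this on a given target is to inspect the filesystem
feature flags; 'mmp' only appears in the list when multi-mount protection
is enabled (device path is a placeholder):

# 'mmp' shows up in the feature list only when MMP is on:
dumpe2fs -h /dev/disk/by-path/... | grep -i features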
Andreas Dilger
2010-Oct-18 20:04 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
On 2010-10-18, at 10:40, Johann Lombardi wrote:
> On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote:
>> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
>> mke2fs -O journal_dev -b 4096 $RAM_DEV
>>
>> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
>> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096
>> -j -J device=$RAM_DEV" /dev/disk/by-path/...
>>
>> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1
>
> In fact, Lustre uses additional mount options (see "Persistent mount
> opts" in the tunefs.lustre output). If your ldiskfs module is based on
> ext3, you should add the extents and mballoc options, which are known to
> improve performance.

Even then, the IO submission path of ext3 from userspace is not very good,
and such a performance difference is not unexpected. When submitting IO
from userspace to ext3/ldiskfs, it is done in 4 kB blocks, and each block
is allocated separately (regardless of mballoc, unfortunately). When
Lustre is doing IO from the kernel, the client aggregates the IO into 1 MB
chunks and the entire 1 MB write is allocated in one operation.

That is why we developed the "delalloc" code for ext4 - so that userspace
could also get better IO performance and utilize the multi-block
allocation (mballoc) routines that have been in ldiskfs for ages, but were
only accessible from the kernel.

For Lustre performance testing, I would suggest looking at lustre-iokit,
in particular "sgpdd" to test the underlying block device, and then
obdfilter-survey to test the local Lustre IO submission path.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
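For reference, invocations of the two surveys look roughly like this. The
environment-variable names follow the lustre-iokit documentation, but the
device node, target name, and sizes are placeholders, and units may differ
between iokit versions, so consult the README shipped with your copy:

# sgpdd-survey: raw bandwidth of the block device via the sg driver
# (/dev/sg0 is a placeholder for the sg node backing the LUN)
size=8192 crglo=1 crghi=16 thrlo=1 thrhi=32 scsidevs=/dev/sg0 sgpdd-survey

# obdfilter-survey: local Lustre IO submission path on the OSS
# (luram-OST0000 is a placeholder for the real OST name)
size=8192 nobjlo=1 nobjhi=8 thrlo=1 thrhi=32 targets=luram-OST0000 case=disk obdfilter-survey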
Michael Kluge
2010-Oct-20 14:08 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device.
We ran into one or two bugs with obdfilter-survey, as lctl has at least
one bug in 1.8.3 when it uses multiple threads, and obdfilter-survey also
causes an LBUG when you Ctrl+C it. We see 600+ MB/s for obdfilter-survey
over a reasonable parameter space after we changed to the ext4-based
ldiskfs. So that seems to be the trick.

Michael

On Monday, 18.10.2010, at 14:04 -0600, Andreas Dilger wrote:
> On 2010-10-18, at 10:40, Johann Lombardi wrote:
>> On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote:
>>> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
>>> mke2fs -O journal_dev -b 4096 $RAM_DEV
>>>
>>> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
>>> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b 4096
>>> -j -J device=$RAM_DEV" /dev/disk/by-path/...
>>>
>>> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1
>>
>> In fact, Lustre uses additional mount options (see "Persistent mount
>> opts" in the tunefs.lustre output). If your ldiskfs module is based on
>> ext3, you should add the extents and mballoc options, which are known
>> to improve performance.
>
> Even then, the IO submission path of ext3 from userspace is not very
> good, and such a performance difference is not unexpected. When
> submitting IO from userspace to ext3/ldiskfs, it is done in 4 kB blocks,
> and each block is allocated separately (regardless of mballoc,
> unfortunately). When Lustre is doing IO from the kernel, the client
> aggregates the IO into 1 MB chunks and the entire 1 MB write is
> allocated in one operation.
>
> That is why we developed the "delalloc" code for ext4 - so that
> userspace could also get better IO performance and utilize the
> multi-block allocation (mballoc) routines that have been in ldiskfs for
> ages, but were only accessible from the kernel.
>
> For Lustre performance testing, I would suggest looking at lustre-iokit,
> in particular "sgpdd" to test the underlying block device, and then
> obdfilter-survey to test the local Lustre IO submission path.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
Bernd Schubert
2010-Oct-20 16:42 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
For your final filesystem you still probably want to enable async journals
(unless you are willing to enable the S2A unmirrored device cache).

Most obdecho/obdfilter-survey bugs are gone in 1.8.4, except for your
Ctrl+C problem, for which a patch exists:

https://bugzilla.lustre.org/show_bug.cgi?id=21745

Cheers,
Bernd

On Wednesday, October 20, 2010, Michael Kluge wrote:
> Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device.
> We ran into one or two bugs with obdfilter-survey, as lctl has at least
> one bug in 1.8.3 when it uses multiple threads, and obdfilter-survey
> also causes an LBUG when you Ctrl+C it. We see 600+ MB/s for
> obdfilter-survey over a reasonable parameter space after we changed to
> the ext4-based ldiskfs. So that seems to be the trick.
>
> Michael
>
> On Monday, 18.10.2010, at 14:04 -0600, Andreas Dilger wrote:
>> On 2010-10-18, at 10:40, Johann Lombardi wrote:
>>> On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote:
>>>> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
>>>> mke2fs -O journal_dev -b 4096 $RAM_DEV
>>>>
>>>> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
>>>> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b
>>>> 4096 -j -J device=$RAM_DEV" /dev/disk/by-path/...
>>>>
>>>> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1
>>>
>>> In fact, Lustre uses additional mount options (see "Persistent mount
>>> opts" in the tunefs.lustre output). If your ldiskfs module is based on
>>> ext3, you should add the extents and mballoc options, which are known
>>> to improve performance.
>>
>> Even then, the IO submission path of ext3 from userspace is not very
>> good, and such a performance difference is not unexpected. When
>> submitting IO from userspace to ext3/ldiskfs, it is done in 4 kB
>> blocks, and each block is allocated separately (regardless of mballoc,
>> unfortunately). When Lustre is doing IO from the kernel, the client
>> aggregates the IO into 1 MB chunks and the entire 1 MB write is
>> allocated in one operation.
>>
>> That is why we developed the "delalloc" code for ext4 - so that
>> userspace could also get better IO performance and utilize the
>> multi-block allocation (mballoc) routines that have been in ldiskfs for
>> ages, but were only accessible from the kernel.
>>
>> For Lustre performance testing, I would suggest looking at
>> lustre-iokit, in particular "sgpdd" to test the underlying block
>> device, and then obdfilter-survey to test the local Lustre IO
>> submission path.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Technical Lead
>> Oracle Corporation Canada Inc.

--
Bernd Schubert
DataDirect Networks
Michael Kluge
2010-Oct-20 16:47 UTC
[Lustre-discuss] ldiskfs performance vs. XFS performance
> For your final filesystem you still probably want to enable async
> journals (unless you are willing to enable the S2A unmirrored device
> cache).

OK, thanks. We'll give this a try.

Michael

> Most obdecho/obdfilter-survey bugs are gone in 1.8.4, except for your
> Ctrl+C problem, for which a patch exists:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=21745
>
> Cheers,
> Bernd
>
> On Wednesday, October 20, 2010, Michael Kluge wrote:
>> Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device.
>> We ran into one or two bugs with obdfilter-survey, as lctl has at least
>> one bug in 1.8.3 when it uses multiple threads, and obdfilter-survey
>> also causes an LBUG when you Ctrl+C it. We see 600+ MB/s for
>> obdfilter-survey over a reasonable parameter space after we changed to
>> the ext4-based ldiskfs. So that seems to be the trick.
>>
>> Michael
>>
>> On Monday, 18.10.2010, at 14:04 -0600, Andreas Dilger wrote:
>>> On 2010-10-18, at 10:40, Johann Lombardi wrote:
>>>> On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote:
>>>>> dd if=/dev/zero of=$RAM_DEV bs=1M count=1000
>>>>> mke2fs -O journal_dev -b 4096 $RAM_DEV
>>>>>
>>>>> mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram
>>>>> --mgsnode=$MDS_NID --mkfsoptions="-E stride=32,stripe-width=256 -b
>>>>> 4096 -j -J device=$RAM_DEV" /dev/disk/by-path/...
>>>>>
>>>>> mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1
>>>>
>>>> In fact, Lustre uses additional mount options (see "Persistent mount
>>>> opts" in the tunefs.lustre output). If your ldiskfs module is based
>>>> on ext3, you should add the extents and mballoc options, which are
>>>> known to improve performance.
>>>
>>> Even then, the IO submission path of ext3 from userspace is not very
>>> good, and such a performance difference is not unexpected. When
>>> submitting IO from userspace to ext3/ldiskfs, it is done in 4 kB
>>> blocks, and each block is allocated separately (regardless of mballoc,
>>> unfortunately). When Lustre is doing IO from the kernel, the client
>>> aggregates the IO into 1 MB chunks and the entire 1 MB write is
>>> allocated in one operation.
>>>
>>> That is why we developed the "delalloc" code for ext4 - so that
>>> userspace could also get better IO performance and utilize the
>>> multi-block allocation (mballoc) routines that have been in ldiskfs
>>> for ages, but were only accessible from the kernel.
>>>
>>> For Lustre performance testing, I would suggest looking at
>>> lustre-iokit, in particular "sgpdd" to test the underlying block
>>> device, and then obdfilter-survey to test the local Lustre IO
>>> submission path.
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Lustre Technical Lead
>>> Oracle Corporation Canada Inc.

--
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room WIL A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih