Hi,
I'm seeing what can only be described as dismal striped write
performance from lustre 1.6.3 clients :-/
1.6.2 and 1.6.1 clients are fine. 1.6.4rc3 clients (from cvs a couple
of days ago) are also terrible.
the below shows that the OS (centos4.5/5) or fabric (gigE/IB) or lustre
version on the servers doesn't matter - the problem is with the 1.6.3
and 1.6.4rc3 client kernels with striped writes (although un-striped
writes are a tad slower too).
with 1M lustre stripes:
   client              client                     dd write speed (MB/s)
     OS                kernel                     a)    b)    c)    d)
1.6.2:
  centos4.5  2.6.9-55.0.2.EL_lustre.1.6.2smp     202   270   118   117
  centos5    2.6.18-8.1.8.el5_lustre.1.6.2rjh    166   190   117   119
1.6.3+:
  centos4.5  2.6.9-55.0.9.EL_lustre.1.6.3smp      32     9    30     9
  centos5    2.6.18-53.el5-lustre1.6.4rc3rjh      36    10    27    10
                                                       ^^^^        ^^^^
                                         yes, that is really 9MB/s. sigh
with no lustre stripes:
   client              client                     dd write speed (MB/s)
     OS                kernel                     a)          c)
1.6.2:
  centos4.5  2.6.9-55.0.2.EL_lustre.1.6.2smp     102          98
  centos5    2.6.18-8.1.8.el5_lustre.1.6.2rjh     84          77
1.6.3+:
  centos4.5  2.6.9-55.0.9.EL_lustre.1.6.3smp      94          95
  centos5    2.6.18-53.el5-lustre1.6.4rc3rjh      73          67
a) servers   centos5, 2.6.18-53.el5-lustre1.6.4rc3rjh,    md raid5, fabric IB
b) servers centos4.5, 2.6.9-55.0.9.EL_lustre.1.6.3smp,       ""   , fabric IB
c) servers   centos5, 2.6.18-8.1.14.el5_lustre.1.6.3smp,     ""   , fabric gigE
d) servers centos4.5, 2.6.9-55.0.9.EL_lustre.1.6.3smp,       ""   , fabric gigE
all runs have the same setup - two OSSs, each with a 16 FC disk md raid5
OST. clients with 512m ram, servers with 8g, all x86_64. the test is
  dd if=/dev/zero of=/mnt/testfs/blah bs=1M count=5000
each test was run >=2 times. there are no errors from lustre or the kernels.
I can't see anything relevant in bugzilla.
is anyone else seeing this?
seems weird that 1.6.3 has been out there for a while and nobody else
has reported it, but I can't think of any more testing variants I can
try...
anyway, some more simple setup info:
 % lfs getstripe /mnt/testfs/
 OBDS:
 0: testfs-OST0000_UUID ACTIVE
 1: testfs-OST0001_UUID ACTIVE
 /mnt/testfs/
 default stripe_count: -1 stripe_size: 1048576 stripe_offset: -1
 /mnt/testfs/blah
         obdidx           objid          objid            group
              1               3            0x3                0
              0               2            0x2                0
 % lfs df
 UUID                 1K-blocks      Used Available  Use% Mounted on
 testfs-MDT0000_UUID    1534832    306680   1228152   19% /mnt/testfs[MDT:0]
 testfs-OST0000_UUID   15481840   3803284  11678556   24% /mnt/testfs[OST:0]
 testfs-OST0001_UUID   15481840   3803284  11678556   24% /mnt/testfs[OST:1]
 filesystem summary:   30963680   7606568  23357112   24% /mnt/testfs
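for reference, the two layouts in the tables above can be reproduced with
something along these lines - a sketch using the 1.6-era positional
"lfs setstripe <path> <stripe_size> <stripe_start> <stripe_count>" syntax;
the paths are illustrative (new files, or directories to set a default on):
  # 1M stripes across all OSTs (the "with 1M lustre stripes" case)
  lfs setstripe /mnt/testfs/striped   1048576 -1 -1
  # single-OST layout (the "no lustre stripes" case)
  lfs setstripe /mnt/testfs/unstriped 1048576 -1  1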
cheers,
robin
ps. the 'rjh' series kernels are required 'cos lustre rhel5 kernels
don't have ko2iblnd support in them.
Johann Lombardi
2007-Nov-26  13:53 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Mon, Nov 26, 2007 at 08:39:58AM -0500, Robin Humble wrote:
> I'm seeing what can only be described as dismal striped write
> performance from lustre 1.6.3 clients :-/
> 1.6.2 and 1.6.1 clients are fine. 1.6.4rc3 clients (from cvs a couple
> of days ago) are also terrible.
> [...]
> yes, that is really 9MB/s. sigh

Could you please try to disable checksums? On the client side:

  for file in /proc/fs/lustre/osc/*/checksums; do echo 0 > $file; done

Johann
On Mon, Nov 26, 2007 at 02:53:25PM +0100, Johann Lombardi wrote:
>Could you please try to disable checksums? On the client side:
>
>  for file in /proc/fs/lustre/osc/*/checksums; do echo 0 > $file; done

done. no change.

cheers,
robin
Andrei Maslennikov
2007-Nov-26  15:58 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Nov 26, 2007 3:32 PM, Robin Humble <rjh+lustre at cita.utoronto.ca> wrote:
> >> I'm seeing what can only be described as dismal striped write
> >> performance from lustre 1.6.3 clients :-/
> >> 1.6.2 and 1.6.1 clients are fine. 1.6.4rc3 clients (from cvs a couple
> >> of days ago) are also terrible.

I have 3 OSTs, each capable of delivering 300+ MB/sec for large streaming
writes with a 1M blocksize. On one client, with one OST, I see almost all of
this bandwidth over Infiniband. If I run three processes in parallel on this
same client, each writing to a separate OST, I get 520 MB/sec aggregate
(3 streams at approx 170+ MB/sec each).

If I stripe over these three OSTs on this client, the performance of a single
stream drops to 60+ MB/sec. Changing the stripesize to a smaller one (1/3 MB)
makes things worse. Writing with larger block sizes (9M, 30M) does not improve
things. Increasing the stripesize to 25 MB gets close to the speed of a single
OST, as one would expect (blocks are round-robined over all three OSTs), but
never more. Zeroing checksums on the client does not help.

Will now be downgrading the client to 1.6.2 to see if this helps.

Andrei.
Andrei Maslennikov
2007-Nov-26  17:16 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
Confirmed: 1.6.3 striped write performance sux. With 1.6.2, I see this:

  [root at srvandrei ~]$ lfs setstripe /lustre/162 0 0 3
  [root at srvandrei ~]$ lmdd.linux of=/lustre/162 bs=1024k time=180 fsync=1
  157705.8304 MB in 180.0225 secs, 876.0341 MB/sec

I.e. 1.6.2 nicely combined the aggregate bandwidth of three OSTs of
300 MB/sec each into almost 900 MB/sec.

Andrei.

On Nov 26, 2007 4:58 PM, Andrei Maslennikov <andrei.maslennikov at gmail.com> wrote:
> I have 3 OSTs, each capable of delivering 300+ MB/sec for large streaming
> writes with a 1M blocksize. [...]
>
> Will now be downgrading the client to 1.6.2 to see if this helps.
On Nov 26, 2007 18:16 +0100, Andrei Maslennikov wrote:
> Confirmed: 1.6.3 striped write performance sux. With 1.6.2, I see this:
>
>   [root at srvandrei ~]$ lfs setstripe /lustre/162 0 0 3
>   [root at srvandrei ~]$ lmdd.linux of=/lustre/162 bs=1024k time=180 fsync=1
>   157705.8304 MB in 180.0225 secs, 876.0341 MB/sec
>
> I.e. 1.6.2 nicely combined the aggregate bandwidth of three OSTs of
> 300 MB/sec each into almost 900 MB/sec.

Can you verify that you disabled data checksumming:

  echo 0 > /proc/fs/lustre/llite/*/checksum_pages

Note that there are 2 kinds of checksumming that Lustre does. The first one
is checksumming of data in client memory, and the second one is checksumming
of data over the network. Setting $LPROC/llite/*/checksum_pages turns on/off
both the in-memory and wire checksums. Setting $LPROC/osc/*/checksums turns
on/off the network checksums only.

If checksums are disabled, can you please report whether the client is
consuming all of the CPU, or possibly all of a single CPU, on 1.6.3 and on
1.6.2?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
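For convenience, both knobs can be flipped and verified in one go - a minimal
sketch using the proc paths quoted above, which are assumptions for your
particular build and may differ between versions:

  # disable both the in-memory and the wire checksums, then confirm the values
  for f in /proc/fs/lustre/llite/*/checksum_pages /proc/fs/lustre/osc/*/checksums; do
      echo 0 > $f
  done
  grep . /proc/fs/lustre/llite/*/checksum_pages /proc/fs/lustre/osc/*/checksums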
Andrei Maslennikov
2007-Nov-26  19:14 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
Hello Andreas,

I am currently reconfiguring the setup so cannot do these checks
immediately. Will come back on it, hopefully tomorrow.

Greetings - Andrei.

On Nov 26, 2007 7:59 PM, Andreas Dilger <adilger at sun.com> wrote:
> Can you verify that you disabled data checksumming:
>
>   echo 0 > /proc/fs/lustre/llite/*/checksum_pages
> [...]
> If checksums are disabled, can you please report whether the client is
> consuming all of the CPU, or possibly all of a single CPU, on 1.6.3 and
> on 1.6.2?
On Mon, Nov 26, 2007 at 11:59:32AM -0700, Andreas Dilger wrote:
>Can you verify that you disabled data checksumming:
>  echo 0 > /proc/fs/lustre/llite/*/checksum_pages

those checksums were off in my runs (they were off by default?), so I
don't think any of the checksums are making a difference.

>Note that there are 2 kinds of checksumming that Lustre does. The first one
>is checksumming of data in client memory, and the second one is checksumming
>of data over the network.

good to know. thanks. are all of those new in 1.6.3?

>If checksums are disabled, can you please report whether the client is
>consuming all of the CPU, or possibly all of a single CPU, on 1.6.3 and on
>1.6.2?

with checksums disabled, a 1.6.3+ client looks like:

    PID USER    PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
   7437 root    15  0    0   0   0 R   57  0.0  7:31.77 ldlm_poold
  18547 rjh900  15  0 5820 504 412 S    3  0.1  0:34.52 dd

which is interesting - ldlm_poold is using an awful lot of cpu.

a 'top' on a 1.6.2 client shows only dd using significant cpu (plus the
usual small percentages for ptlrpcd, kswapd0, pdflush, kiblnd_sd_*).

cheers,
robin
Andrei Maslennikov
2007-Nov-27  12:59 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
Andreas, here are some numbers obtained against a file striped over 3 OSTs
(there were 8 cores; only 3 of them had any visible load, so I quote the
CPU usage only for these cores):
1) 1.6.3 client, checksums enabled: 35 MB/sec
            ldlm_poold: 85-100% the whole time
            ptlrpcd: 15-23%
            dd: starts at 85%, gradually drops to 4-6%
2) Same, after zeroing /proc/fs/lustre/llite/*/checksum_pages: 65 MB/sec
           loads are very much the same as in the first case
3) 1.6.2 client, checksums enabled: 790 MB/sec
           dd: 85-95%
           kswapd: 35%
           ptlrpcd: 15-20%
Andrei.
On Nov 26, 2007 7:59 PM, Andreas Dilger <adilger at sun.com> wrote:
> Can you verify that you disabled data checksumming:
>
>         echo 0 > /proc/fs/lustre/llite/*/checksum_pages
>
> Note that there are 2 kinds of checksumming that Lustre does. The first one
> is checksumming of data in client memory, and the second one is checksumming
> of data over the network. Setting $LPROC/llite/*/checksum_pages turns on/off
> both the in-memory and wire checksums. Setting $LPROC/osc/*/checksums turns
> on/off the network checksums only.
>
> If checksums are disabled, can you please report whether the client is
> consuming all of the CPU, or possibly all of a single CPU, on 1.6.3 and
> on 1.6.2?
Andrei Maslennikov wrote:
> Andreas, here are some numbers obtained against a file striped over 3 OSTs
> (there were 8 cores; only 3 of them had any visible load, so I quote the
> CPU usage only for these cores):
>
> 1) 1.6.3 client, checksums enabled: 35 MB/sec
>             ldlm_poold: 85-100% the whole time
>             ptlrpcd: 15-23%
>             dd: starts at 85%, gradually drops to 4-6%

Any chance you can run oprofile for the 1.6.3 case? It'd (hopefully)
show where ldlm_poold is spinning.

Nic
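For reference, a typical oprofile session around the dd run would look roughly
like this - a sketch, where the vmlinux path and module directory are
assumptions that depend on how your kernel and its debuginfo are installed:

  opcontrol --init
  opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
  opcontrol --start
  dd if=/dev/zero of=/mnt/testfs/blah bs=1M count=5000
  opcontrol --dump && opcontrol --stop
  # top kernel/module symbols, including the lustre modules
  opreport --image-path=/lib/modules/$(uname -r) --symbols | head -30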
Andrei Maslennikov
2007-Nov-27  15:14 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Nov 27, 2007 3:47 PM, Nicholas Henke <nic at cray.com> wrote:
> Any chance you can run oprofile for the 1.6.3 case? It'd (hopefully)
> show where ldlm_poold is spinning.

Not at the moment, my lab is currently dismantled... Maybe I will be able
to do it before Friday...

Andrei.
chas williams - CONTRACTOR
2007-Nov-28  03:01 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
reconfigure your client with --disable-lru-resize. this appears to be a
new feature in 1.6.3. this fixed striped performance for me.

In message <515158c30711270459k5c80c142k1e179c71bfecbdab at mail.gmail.com>,
"Andrei Maslennikov" writes:
>Andreas, here are some numbers obtained against a file striped over 3 OSTs
>(there were 8 cores; only 3 of them had any visible load, so I quote the
>CPU usage only for these cores):
>
>1) 1.6.3 client, checksums enabled: 35 MB/sec
>            ldlm_poold: 85-100% the whole time
>[...]
>3) 1.6.2 client, checksums enabled: 790 MB/sec
>            dd: 85-95%
>[...]
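For reference, the client rebuild with that flag would go something like this -
a sketch, where the source and kernel tree paths are illustrative and depend
on your environment:

  # rebuild the lustre 1.6.3 client with LRU resizing compiled out
  cd lustre-1.6.3
  ./configure --with-linux=/usr/src/kernels/$(uname -r) --disable-lru-resize
  make
  make install    # or build packages and install those on the clients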
Johann Lombardi
2007-Nov-28  16:32 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Tue, Nov 27, 2007 at 10:01:05PM -0500, chas williams - CONTRACTOR wrote:
> reconfigure your client with --disable-lru-resize. this appears to be
> a new feature in 1.6.3. this fixed striped performance for me.

FYI, I've filed a new bugzilla ticket about this problem (see bug #14353).

Johann
Johann Lombardi wrote:
> On Tue, Nov 27, 2007 at 10:01:05PM -0500, chas williams - CONTRACTOR wrote:
>> reconfigure your client with --disable-lru-resize. this appears to be
>> a new feature in 1.6.3. this fixed striped performance for me.
>
> FYI, I've filed a new bugzilla ticket about this problem (see bug #14353).

hi all!

Will fix it tomorrow, but the fix will go into the current 1.6 branch, so it
will only be available in the next release.

Thanks.

--
umka
chas williams - CONTRACTOR
2007-Nov-28  17:28 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
In message <474D9D00.1080506 at sun.com>, Yuriy Umanets writes:
> Will fix it tomorrow, but the fix will go into the current 1.6 branch, so it
> will only be available in the next release.

if you have a fix, i can apply it as a point patch to my local copy.
its not a huge deal.
chas williams - CONTRACTOR wrote:
> if you have a fix, i can apply it as a point patch to my local copy.
> its not a huge deal.

hi Williams,

It turned out to be a more complex issue, which was already observed earlier
in bug 13766. Some more info on this bug is also located in bug 14353. It is
related to aggressive memory-pressure event handling in the server-side ldlm
pools code.

I have a patch (quite big, 55K) for the 1.6.4 version. It fixes this as well
as other related things, but it is completely untested on serious HW and I
would not like to make you deal with that unless you ask for it. If you
really want to give it a try and will not use it on live storage with
sensitive data, you need to update your local copy to 1.6.4 and I will send
you the patch.

But since I think you only want to make lustre behave like 1.6.2 for IO
performance, you probably do not need this and can disable the feature with
the configure key --disable-lru-resize.

Thanks.

--
umka
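A runtime alternative that is sometimes suggested - this is an assumption, not
verified on this exact 1.6.3 build, so check that the proc files exist and
behave this way on your clients: pinning the ldlm LRU to a fixed size is
supposed to switch off the dynamic pool resizing without a rebuild.

  # set a fixed LRU size on every client-side ldlm namespace
  # (the value 400 is a guess; writing 0 is supposed to restore dynamic sizing)
  for ns in /proc/fs/lustre/ldlm/namespaces/*; do
      echo 400 > $ns/lru_size
  done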