Hi,

I'm seeing what can only be described as dismal striped write
performance from lustre 1.6.3 clients :-/
1.6.2 and 1.6.1 clients are fine. 1.6.4rc3 clients (from cvs a couple
of days ago) are also terrible.

the tables below show that the OS (centos4.5/5), the fabric (gigE/IB),
and the lustre version on the servers don't matter - the problem is
with the 1.6.3 and 1.6.4rc3 client kernels and striped writes (although
un-striped writes are a tad slower too).

with 1M lustre stripes:

  client     client                               dd write speed (MB/s)
  OS         kernel                                 a)   b)   c)   d)
 1.6.2:
  centos4.5  2.6.9-55.0.2.EL_lustre.1.6.2smp       202  270  118  117
  centos5    2.6.18-8.1.8.el5_lustre.1.6.2rjh      166  190  117  119
 1.6.3+:
  centos4.5  2.6.9-55.0.9.EL_lustre.1.6.3smp        32    9   30    9
  centos5    2.6.18-53.el5-lustre1.6.4rc3rjh        36   10   27   10
                                                       ^^^^      ^^^^
yes, that is really 9MB/s. sigh

with no lustre stripes:

  client     client                               dd write speed (MB/s)
  OS         kernel                                 a)   c)
 1.6.2:
  centos4.5  2.6.9-55.0.2.EL_lustre.1.6.2smp       102   98
  centos5    2.6.18-8.1.8.el5_lustre.1.6.2rjh       84   77
 1.6.3+:
  centos4.5  2.6.9-55.0.9.EL_lustre.1.6.3smp        94   95
  centos5    2.6.18-53.el5-lustre1.6.4rc3rjh        73   67

 a) servers centos5,   2.6.18-53.el5-lustre1.6.4rc3rjh,   md raid5, fabric IB
 b) servers centos4.5, 2.6.9-55.0.9.EL_lustre.1.6.3smp,   "",       fabric IB
 c) servers centos5,   2.6.18-8.1.14.el5_lustre.1.6.3smp, "",       fabric gigE
 d) servers centos4.5, 2.6.9-55.0.9.EL_lustre.1.6.3smp,   "",       fabric gigE

all runs have the same setup - two OSS's, each with a 16 FC disk md
raid5 OST. clients have 512m ram, servers have 8g, all x86_64. the
test is

  dd if=/dev/zero of=/mnt/testfs/blah bs=1M count=5000

and each test was run >=2 times. there are no errors from lustre or
the kernels, and I can't see anything relevant in bugzilla.

is anyone else seeing this? it seems weird that 1.6.3 has been out
there for a while and nobody else has reported it, but I can't think
of any more testing variants to try...

anyway, some more simple setup info:

 % lfs getstripe /mnt/testfs/
 OBDS:
 0: testfs-OST0000_UUID ACTIVE
 1: testfs-OST0001_UUID ACTIVE
 /mnt/testfs/
 default stripe_count: -1 stripe_size: 1048576 stripe_offset: -1
 /mnt/testfs/blah
        obdidx           objid          objid            group
             1               3            0x3                0
             0               2            0x2                0

 % lfs df
 UUID                  1K-blocks      Used  Available  Use%  Mounted on
 testfs-MDT0000_UUID     1534832    306680    1228152   19%  /mnt/testfs[MDT:0]
 testfs-OST0000_UUID    15481840   3803284   11678556   24%  /mnt/testfs[OST:0]
 testfs-OST0001_UUID    15481840   3803284   11678556   24%  /mnt/testfs[OST:1]
 filesystem summary:    30963680   7606568   23357112   24%  /mnt/testfs

cheers,
robin

ps. the 'rjh' series kernels are needed 'cos the lustre rhel5 kernels
don't have ko2iblnd support in them.
Johann Lombardi
2007-Nov-26 13:53 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Mon, Nov 26, 2007 at 08:39:58AM -0500, Robin Humble wrote:
> I'm seeing what can only be described as dismal striped write
> performance from lustre 1.6.3 clients :-/
> 1.6.2 and 1.6.1 clients are fine. 1.6.4rc3 clients (from cvs a couple
> of days ago) are also terrible.
> [...]
> 1.6.3+:
>  centos4.5  2.6.9-55.0.9.EL_lustre.1.6.3smp        32    9   30    9
>  centos5    2.6.18-53.el5-lustre1.6.4rc3rjh        36   10   27   10
>                                                        ^^^^      ^^^^
> yes, that is really 9MB/s. sigh

Could you please try to disable checksums? On the client side:

  for file in /proc/fs/lustre/osc/*/checksums; do echo 0 > $file; done

Johann
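A quick way to confirm the setting actually took on every OSC, as a
sketch (the per-OSC directory names under /proc depend on the
filesystem and OST names, hence the glob):

  # print each OSC's checksums flag; all should read 0 after the loop above
  for f in /proc/fs/lustre/osc/*/checksums; do
      echo "$f = $(cat $f)"
  done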
On Mon, Nov 26, 2007 at 02:53:25PM +0100, Johann Lombardi wrote:
> Could you please try to disable checksums? On the client side:
>
>   for file in /proc/fs/lustre/osc/*/checksums; do echo 0 > $file; done

done. no change.

cheers,
robin
Andrei Maslennikov
2007-Nov-26 15:58 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Nov 26, 2007 3:32 PM, Robin Humble <rjh+lustre at cita.utoronto.ca> wrote:
> >> I'm seeing what can only be described as dismal striped write
> >> performance from lustre 1.6.3 clients :-/
> >> 1.6.2 and 1.6.1 clients are fine. 1.6.4rc3 clients (from cvs a couple
> >> of days ago) are also terrible.

I have 3 OSTs, each capable of delivering 300+ MB/sec for large
streaming writes with a 1M blocksize. From one client writing to a
single OST I can see almost all of this bandwidth over Infiniband. If I
run three processes in parallel on that same client, each writing to a
separate OST, I get 520 MB/sec aggregate (3 streams at approx 170+
MB/sec each).

If I stripe over these three OSTs from this client, the performance of
a single stream drops to 60+ MB/sec. Changing the stripe size to a
smaller one (1/3 MB) makes things worse, and writing with larger block
sizes (9M, 30M) does not improve things. Increasing the stripe size to
25 MB gets close to the speed of a single OST, as one would expect
(blocks are round-robined over all three OSTs), but never more.
Zeroing checksums on the client does not help.

Will now downgrade the client to 1.6.2 to see if that helps.

Andrei.
Andrei Maslennikov
2007-Nov-26 17:16 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
Confirmed: 1.6.3 striped write performance sux. With 1.6.2, I see this:

 [root at srvandrei ~]$ lfs setstripe /lustre/162 0 0 3
 [root at srvandrei ~]$ lmdd.linux of=/lustre/162 bs=1024k time=180 fsync=1
 157705.8304 MB in 180.0225 secs, 876.0341 MB/sec

i.e. 1.6.2 nicely joined the aggregate bandwidth of three OSTs of
300 MB/sec each into almost 900 MB/sec.

Andrei.

On Nov 26, 2007 4:58 PM, Andrei Maslennikov <andrei.maslennikov at gmail.com> wrote:
> I have 3 OSTs, each capable of delivering 300+ MB/sec for large
> streaming writes with a 1M blocksize.
> [...]
> Will now downgrade the client to 1.6.2 to see if that helps.
On Nov 26, 2007 18:16 +0100, Andrei Maslennikov wrote:
> Confirmed: 1.6.3 striped write performance sux.
>
> With 1.6.2, I see this:
>
> [root at srvandrei ~]$ lfs setstripe /lustre/162 0 0 3
> [root at srvandrei ~]$ lmdd.linux of=/lustre/162 bs=1024k time=180 fsync=1
> 157705.8304 MB in 180.0225 secs, 876.0341 MB/sec

Can you verify that you disabled data checksumming:

  echo 0 > /proc/fs/lustre/llite/*/checksum_pages

Note that there are 2 kinds of checksumming that Lustre does. The first
is checksumming of data in client memory, and the second is checksumming
of data over the network. Setting $LPROC/llite/*/checksum_pages turns
both the in-memory and wire checksums on/off. Setting
$LPROC/osc/*/checksums turns only the network checksums on/off.

If checksums are disabled, can you please report whether a process on
the client is consuming all of the CPU, or possibly all of a single
CPU, on 1.6.3 and on 1.6.2?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
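To tie the two knobs and the CPU question together, here is a minimal
sketch of running the whole check end to end. It assumes the 1.6.x
/proc paths quoted above, and top's batch mode is just one convenient
way to catch a spinning kernel thread:

  #!/bin/sh
  # turn off both the in-memory+wire checksums and the wire-only ones
  for f in /proc/fs/lustre/llite/*/checksum_pages \
           /proc/fs/lustre/osc/*/checksums; do
      echo 0 > $f
  done

  # rerun the test and sample per-thread CPU usage a few times
  dd if=/dev/zero of=/mnt/testfs/blah bs=1M count=5000 &
  top -b -d 2 -n 10 | egrep 'Cpu|ldlm_poold|ptlrpcd|kswapd| dd'
  wait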
Andrei Maslennikov
2007-Nov-26 19:14 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
Hello Andreas,

I am currently reconfiguring the setup so cannot do these checks
immediately. Will come back on it, hopefully tomorrow.

Greetings - Andrei.

On Nov 26, 2007 7:59 PM, Andreas Dilger <adilger at sun.com> wrote:
> Can you verify that you disabled data checksumming:
>
>   echo 0 > /proc/fs/lustre/llite/*/checksum_pages
> [...]
> If checksums are disabled, can you please report whether a process on
> the client is consuming all of the CPU, or possibly all of a single
> CPU, on 1.6.3 and on 1.6.2?
On Mon, Nov 26, 2007 at 11:59:32AM -0700, Andreas Dilger wrote:
> Can you verify that you disabled data checksumming:
>   echo 0 > /proc/fs/lustre/llite/*/checksum_pages

those checksums were off in my runs (they were off by default?), so I
don't think any of the checksums are making a difference.

> Note that there are 2 kinds of checksumming that Lustre does. [...]

good to know, thanks. are all of those new in 1.6.3?

> If checksums are disabled, can you please report whether a process on
> the client is consuming all of the CPU, or possibly all of a single
> CPU, on 1.6.3 and on 1.6.2?

with checksums disabled, a 1.6.3+ client looks like:

   PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
  7437 root    15   0     0    0    0 R   57  0.0  7:31.77 ldlm_poold
 18547 rjh900  15   0  5820  504  412 S    3  0.1  0:34.52 dd

which is interesting. ldlm_poold is using an awful lot of cpu.

a 'top' on a 1.6.2 client shows only dd using significant cpu (plus the
usual small percentages for ptlrpcd, kswapd0, pdflush, kiblnd_sd_*).

cheers,
robin
Andrei Maslennikov
2007-Nov-27 12:59 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
Andreas, here are some numbers obtained against a file striped over 3
OSTs (there were 8 cores; only 3 of them had any visible load, so I
quote the CPU usage only for those cores):

1) 1.6.3 client, checksums enabled: 35 MB/sec
   ldlm_poold: 85-100% the whole time
   ptlrpcd: 15-23%
   dd: from 85% down gradually to 4-6%

2) same, after zeroing /proc/fs/lustre/llite/*/checksum_pages: 65 MB/sec
   loads are very much the same as in the first case

3) 1.6.2 client, checksums enabled: 790 MB/sec
   dd: 85-95%
   kswapd: 35%
   ptlrpcd: 15-20%

Andrei.

On Nov 26, 2007 7:59 PM, Andreas Dilger <adilger at sun.com> wrote:
> If checksums are disabled, can you please report whether a process on
> the client is consuming all of the CPU, or possibly all of a single
> CPU, on 1.6.3 and on 1.6.2?
Andrei Maslennikov wrote:
> 1) 1.6.3 client, checksums enabled: 35 MB/sec
>    ldlm_poold: 85-100% the whole time
>    ptlrpcd: 15-23%
>    dd: from 85% down gradually to 4-6%

Any chance you can run oprofile for the 1.6.3 case? It'd (hopefully)
show where ldlm_poold is spinning.

Nic
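For anyone who hasn't used it, an oprofile session of that era looks
roughly like the sketch below. The vmlinux path in particular is an
assumption; it depends on where the uncompressed debug image for the
lustre kernel was installed:

  # point oprofile at the uncompressed kernel image (path is a guess)
  opcontrol --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux
  opcontrol --start

  # reproduce the slow striped write while sampling
  dd if=/dev/zero of=/mnt/testfs/blah bs=1M count=5000

  opcontrol --stop
  opreport --symbols | head -30   # top kernel symbols by sample count
  opcontrol --shutdown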
Andrei Maslennikov
2007-Nov-27 15:14 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Nov 27, 2007 3:47 PM, Nicholas Henke <nic at cray.com> wrote:
> Any chance you can run oprofile for the 1.6.3 case? It'd (hopefully)
> show where ldlm_poold is spinning.

Not at the moment, my lab is currently dismantled... Maybe I will be
able to do it before Friday...

Andrei.
chas williams - CONTRACTOR
2007-Nov-28 03:01 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
reconfigure your client with --disable-lru-resize. this appears to be
a new feature in 1.6.3. this fixed striped performance for me.

In message <515158c30711270459k5c80c142k1e179c71bfecbdab at mail.gmail.com>,
"Andrei Maslennikov" writes:
> Andreas, here are some numbers obtained against a file striped over 3
> OSTs (there were 8 cores; only 3 of them had any visible load, so I
> quote the CPU usage only for those cores):
>
> 1) 1.6.3 client, checksums enabled: 35 MB/sec
>    ldlm_poold: 85-100% the whole time
>    ptlrpcd: 15-23%
>    dd: from 85% down gradually to 4-6%
> [...]
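For anyone wanting to try the same workaround, a sketch of the rebuild;
the kernel source path is whatever your client kernel was built from.
The lru_size part is an assumption: on 1.6 builds that expose that
/proc file, writing a fixed value into it is supposed to pin the lock
LRU and turn dynamic resizing off without a rebuild:

  # rebuild the client with lru resize compiled out
  cd lustre-1.6.3
  ./configure --with-linux=/usr/src/kernels/`uname -r` --disable-lru-resize
  make && make rpms

  # assumption: a fixed per-namespace lru_size disables dynamic resizing
  for ns in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
      echo 100 > $ns
  done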
Johann Lombardi
2007-Nov-28 16:32 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
On Tue, Nov 27, 2007 at 10:01:05PM -0500, chas williams - CONTRACTOR wrote:
> reconfigure your client with --disable-lru-resize. this appears to be
> a new feature in 1.6.3. this fixed striped performance for me.

FYI, I've filed a new bugzilla ticket about this problem (see bug #14353).

Johann
Johann Lombardi wrote:
> FYI, I've filed a new bugzilla ticket about this problem (see bug #14353).

hi all!

Will fix it tomorrow, but the fix will go into the current 1.6 branch,
so it will only be available in the next release. Thanks.

--
umka
chas williams - CONTRACTOR
2007-Nov-28 17:28 UTC
[Lustre-discuss] bad 1.6.3 striped write performance
In message <474D9D00.1080506 at sun.com>, Yuriy Umanets writes:
> Will fix it tomorrow, but the fix will go into the current 1.6 branch,
> so it will only be available in the next release.

if you have a fix, i can apply it as a point patch to my local copy.
it's not a huge deal.
chas williams - CONTRACTOR wrote:
> if you have a fix, i can apply it as a point patch to my local copy.
> it's not a huge deal.

hi Williams,

It turned out to be a more complex issue, one which was already
observed earlier in bug 13766. Some more info on it is also in bug
14353. It is related to aggressive memory-pressure event handling in
the server-side ldlm pools code.

I have a patch (quite big, 55K) for the 1.6.4 version. It fixes this as
well as other related things, but it is completely untested on serious
HW and I would not like to make you deal with it unless you ask for it.
If you really want to give it a try, and will not use it on live
storage with sensitive data, update your local copy to 1.6.4 and I will
send you the patch.

But if, as I suspect, you only want to make lustre behave like 1.6.2
as far as IO performance goes, you probably do not need the patch and
can simply disable the feature with the configure key
--disable-lru-resize.

Thanks.

--
umka