Maxence Dunnewind
2010-Jun-24 06:54 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Hi,

I'm using Lustre 1.8.3 with an SSI (single system image) and I'm running some compilation benchmarks. My process is simple:

- download the 2.6.34 kernel
- extract it on a Lustre mount
- make defconfig
- time make -j X

For reference, if I use only one client with a local filesystem, it takes about 3min50. The same client alone with a mounted Lustre partition (with local lock) takes more than 10 minutes.

Using Lustre on my SSI, I have these results:

-j 4  : 9min37
-j 8  : 5min34
-j 12 : 4min42
-j 16 : 4min19

So even with 16 processes (on 4 nodes), I can't compile faster than 1 local node... I tried http://blogs.sun.com/atulvid/entry/improving_performance_of_small_files but it did not change anything.

My Lustre setup:
- 1 MGS + MDS
- 3 OSTs

Is there some other way to optimize this, or is Lustre just bad at concurrent access to small files?

Regards,
Maxence

--
Maxence DUNNEWIND
Contact : maxence at dunnewind.net
Site : http://www.dunnewind.net
06 32 39 39 93
GPG : 18AE 61E4 D0B0 1C7C AAC9 E40D 4D39 68DB 0D2E B533
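As an illustration, a minimal sketch of that benchmark procedure, assuming the Lustre filesystem is mounted at /mnt/lustre (the paths and tarball URL are illustrative, not taken from the thread):

    # Unpack the kernel source onto the Lustre mount and time a parallel build.
    cd /mnt/lustre
    wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.34.tar.bz2
    tar xjf linux-2.6.34.tar.bz2
    cd linux-2.6.34
    make defconfig
    time make -j 16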
Andreas Dilger
2010-Jun-24 21:16 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-24, at 00:54, Maxence Dunnewind wrote:

> I'm using Lustre 1.8.3 with an SSI (single system image) and I'm running some compilation benchmarks. My process is simple:
> - download the 2.6.34 kernel
> - extract it on a Lustre mount
> - make defconfig
> - time make -j X
>
> For reference, if I use only one client with a local filesystem, it takes about 3min50. The same client alone with a mounted Lustre partition (with local lock) takes more than 10 minutes.
>
> Using Lustre on my SSI, I have these results:
> -j 4  : 9min37
> -j 8  : 5min34
> -j 12 : 4min42
> -j 16 : 4min19
>
> So even with 16 processes (on 4 nodes), I can't compile faster than 1 local node...
> I tried http://blogs.sun.com/atulvid/entry/improving_performance_of_small_files but it did not change anything.
>
> My Lustre setup:
> - 1 MGS + MDS
> - 3 OSTs
>
> Is there some other way to optimize this, or is Lustre just bad at concurrent access to small files?

I don't think it is realistic to expect that a cache-coherent distributed filesystem that can scale to 10000's of clients also performs as fast as a single client on a local filesystem. That Lustre is completing in 4:19 vs. 3:50 on the local filesystem (12% slower) is a pretty good result for Lustre, I think.

We're of course working on improving Lustre performance for this kind of situation, but it isn't really a priority for most of our customers. I don't want to discourage you from using Lustre, and of course I'd also like Lustre to be faster than even the local filesystem, but you should also look at the other benefits.

A fairer test might be to do the local-node compile and then copy the kernel and all the modules to each of the client nodes, since Lustre is also making the output files available on all of the clients. It is also worthwhile (if you have the time) to determine whether your "make -j 16" is CPU bound or IO bound on the local filesystem. You might try pre-staging all of the input files on the client nodes, and have the compiler output go into a separate directory (not sure if this is possible with Linux kernel compiles) so that the output files created during the run do not invalidate the directory caches.

For comparison, on the same two systems (local fs vs. Lustre) try writing 32*10GB files from 32 clients (use rsh or NFS or whatever you want to transport data from the clients to the local filesystem) and see how performance compares. :-)

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
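As an illustration, a minimal sketch of the Lustre side of the suggested large-file comparison, assuming 32 clients reachable over password-less ssh as client01..client32 (hostnames, paths, and sizes are illustrative):

    # Each client writes one 10GB file into the filesystem under test; time the aggregate.
    time (
        for i in $(seq -w 1 32); do
            ssh "client$i" "dd if=/dev/zero of=/mnt/lustre/bigfile.$i bs=1M count=10240" &
        done
        wait
    )

For the local-filesystem case, the clients would instead have to ship their data over rsh or NFS to the single node, as suggested above.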
Maxence Dunnewind
2010-Jun-25 06:51 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Hi,

To give some context: I'm working on Kerrighed (www.kerrighed.org) and I am trying to find a system "better than NFS" for 4 to 256 clients (for now; maybe more later). The filesystem should be:
- client cache coherent
- as POSIX compatible as possible
- as efficient as possible in "general use".

Lustre indeed seems pretty good for client cache coherency and (it seems) POSIX compliance (i.e. I can't find another fs with the same properties).

> I don't think it is realistic to expect that a cache-coherent distributed filesystem that can scale to 10000's of clients also performs as fast as a single client on a local filesystem. That Lustre is completing in 4:19 vs. 3:50 on the local filesystem (12% slower) is a pretty good result for Lustre, I think.

To be more exact, Lustre on 4 nodes is completing in 4:19 vs 3:50 on one node :)

I know a filesystem can't be good at everything; my question could be understood as "is there some basic (or a bit more complex) tuning to improve compilation / I/O on small files?". Lustre seems, as I said, good for cache coherency and POSIX compliance (maybe not the best, but definitely not the worst).

> We're of course working on improving Lustre performance for this kind of situation, but it isn't really a priority for most of our customers. I don't want to discourage you from using Lustre, and of course I'd also like Lustre to be faster than even the local filesystem, but you should also look at the other benefits.

If you know some other filesystem with such properties, I can try it, but I did not find any (it also must be free software).

> A fairer test might be to do the local-node compile and then copy the kernel and all the modules to each of the client nodes, since Lustre is also making the output files available on all of the clients. It is also worthwhile (if you have the time) to determine whether your "make -j 16" is CPU bound or IO bound on the local filesystem. You might try pre-staging all of the input files on the client nodes, and have the compiler output go into a separate directory (not sure if this is possible with Linux kernel compiles) so that the output files created during the run do not invalidate the directory caches.

I will try this.

> For comparison, on the same two systems (local fs vs. Lustre) try writing 32*10GB files from 32 clients (use rsh or NFS or whatever you want to transport data from the clients to the local filesystem) and see how performance compares. :-)

I will try to find some more "practical" tests, maybe something video-related on some "big" videos, or some parallel work on big images, or...

Thanks for your answers.

Regards,
Maxence
Andreas Dilger
2010-Jun-25 18:09 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-25, at 00:51, Maxence Dunnewind wrote:

>> I don't think it is realistic to expect that a cache-coherent distributed filesystem that can scale to 10000's of clients also performs as fast as a single client on a local filesystem. That Lustre is completing in 4:19 vs. 3:50 on the local filesystem (12% slower) is a pretty good result for Lustre, I think.
>
> To be more exact, Lustre on 4 nodes is completing in 4:19 vs 3:50 on one node :)

I recognized this, but the fact that it gets better performance as nodes are added is a _good_ sign for a filesystem that should perform well when there are many clients. Of course, it would be ideal if the same performance could be had by a single client...

> I know a filesystem can't be good at everything; my question could be understood as "is there some basic (or a bit more complex) tuning to improve compilation / I/O on small files?".

If you are interested in doing a tiny bit of hacking, it would be interesting to run an experiment to see what kind of performance a single client can get in your benchmark. Currently, Lustre limits each client to a single filesystem-modifying metadata operation at a time, in order to prevent the clients from overwhelming the server and to ensure that the clients can recover the filesystem correctly in case of a server crash. However, for testing purposes it would be interesting to disable this limit to see how fast your clients run when it is removed. I haven't tested the attached patch at all, so YMMV, but I'd be interested to see the results.

I'm not sure whether it makes a difference in your case, but increasing the MDC RPCs in flight might also help performance. Also, increasing the client cache size and the number of IO RPCs may help. On the clients run:

    lctl set_param *.*.max_rpcs_in_flight=64
    lctl set_param osc.*.max_dirty_mb=512

to see if these make any difference.

You may also test running the make directly on the MDS with a local Lustre mount to determine whether the network latency is a significant factor in the performance. If you are using Ethernet instead of IB, the latency could be hurting you, since kernel compiles generally do only a tiny amount of work per file and then need to send a few RPCs to open and read the next source file and its headers. Some of this can be hidden by pre-reading all of the files into the client caches (new machines should have enough RAM, about 1GB or so), but the "open" operations still need to send an RPC to the MDS for each file open, so running on the MDS (or with a low-latency network like IB) may help compiles like this run more quickly.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

[Attachment: mdc-multiop.diff, 803 bytes]
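As an illustration, a minimal sketch of applying those tunables on every client and reading the values back, assuming password-less ssh to clients named node1..node4 (the hostnames are illustrative):

    # Raise RPC concurrency and the client write cache on each client, then verify.
    for host in node1 node2 node3 node4; do
        ssh "$host" 'lctl set_param "*.*.max_rpcs_in_flight=64" &&
                     lctl set_param "osc.*.max_dirty_mb=512" &&
                     lctl get_param "mdc.*.max_rpcs_in_flight" "osc.*.max_dirty_mb"'
    done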
Maxence Dunnewind
2010-Jun-28 16:04 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Heya,

> If you are interested in doing a tiny bit of hacking, it would be interesting to run an experiment to see what kind of performance a single client can get in your benchmark. Currently, Lustre limits each client to a single filesystem-modifying metadata operation at a time, in order to prevent the clients from overwhelming the server and to ensure that the clients can recover the filesystem correctly in case of a server crash.

I just tested this. Before that, I tried an out-of-tree build. My four clients use an nfsroot, so I put the kernel source on it, then I mount Lustre on /mnt/lustre and compile into /mnt/lustre/build (with make O=). The results (without your patch) are interesting:

7m42 against 9m37 before, with -j 4
4min51 against 5min34, with -j 8
3min27 against 4min19, with -j 16

I also use -pipe as a gcc option, to avoid temporary files.

So my first question is: would it be possible in some way to disable cache coherency on some subdirectory? If I know all the files in this directory will be accessed read-only, I do not need coherency. That would let me read the files from Lustre instead of NFS.

I then tried with your patch; not much difference:

4min43 against 4min51 without it (-j 8)
7min40 against 7min42 with -j 8

So it changes almost nothing :)

> I'm not sure whether it makes a difference in your case, but increasing the MDC RPCs in flight might also help performance. Also, increasing the client cache size and the number of IO RPCs may help. On the clients run:
>
> lctl set_param *.*.max_rpcs_in_flight=64
> lctl set_param osc.*.max_dirty_mb=512

No change.

> You may also test running the make directly on the MDS with a local Lustre mount to determine whether the network latency is a significant factor in the performance. If you are using Ethernet instead of IB, the latency could be hurting you, since kernel compiles generally do only a tiny amount of work per file and then need to send a few RPCs to open and read the next source file and its headers. Some of this can be hidden by pre-reading all of the files into the client caches (new machines should have enough RAM, about 1GB or so), but the "open" operations still need to send an RPC to the MDS for each file open, so running on the MDS (or with a low-latency network like IB) may help compiles like this run more quickly.

We don't have IB set up at the moment, so I cannot test with it. I will try directly on the MDS (so on only one node) to compare.

Regards,
Maxence
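As an illustration, a minimal sketch of the out-of-tree build being described, with the kernel sources on the nfsroot and the objects written under the Lustre mount (the source path is illustrative):

    # Out-of-tree kernel build: read sources from the nfsroot, write objects to Lustre.
    mkdir -p /mnt/lustre/build
    cd /usr/src/linux-2.6.34
    make O=/mnt/lustre/build defconfig
    time make O=/mnt/lustre/build -j 8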
Andreas Dilger
2010-Jun-28 20:00 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-28, at 10:04, Maxence Dunnewind wrote:

>> If you are interested in doing a tiny bit of hacking, it would be interesting to run an experiment to see what kind of performance a single client can get in your benchmark. Currently, Lustre limits each client to a single filesystem-modifying metadata operation at a time, in order to prevent the clients from overwhelming the server and to ensure that the clients can recover the filesystem correctly in case of a server crash.
>
> I just tested this. Before that, I tried an out-of-tree build. My four clients use an nfsroot, so I put the kernel source on it, then I mount Lustre on /mnt/lustre and compile into /mnt/lustre/build (with make O=). The results (without your patch) are interesting:
> 7m42 against 9m37 before, with -j 4
> 4min51 against 5min34, with -j 8
> 3min27 against 4min19, with -j 16
> I also use -pipe as a gcc option, to avoid temporary files.

I was actually thinking of keeping the source tree on Lustre as well, just not building the output files in the same directory as the input files. It isn't clear from this result whether the speedup was due to having the input files in a separate directory (i.e. lock contention), or because you had a second server hosting the input files (i.e. an RPC limitation of the server).

> So my first question is: would it be possible in some way to disable cache coherency on some subdirectory? If I know all the files in this directory will be accessed read-only, I do not need coherency. That would let me read the files from Lustre instead of NFS.

I don't think this would be practical to do for many years.

> I then tried with your patch; not much difference:
>
> 4min43 against 4min51 without it (-j 8)

Ah, this number is with a separate server for the input files. It might be more interesting to see whether it makes a difference with the files all hosted on the same server.

> 7min40 against 7min42 with -j 8

This should be "-j 4" to match the above numbers.

> So it changes almost nothing :)

That implies that the MDS-modifying RPCs are not necessarily the bottleneck here.

>> I'm not sure whether it makes a difference in your case, but increasing the MDC RPCs in flight might also help performance. Also, increasing the client cache size and the number of IO RPCs may help. On the clients run:
>>
>> lctl set_param *.*.max_rpcs_in_flight=64
>> lctl set_param osc.*.max_dirty_mb=512
>
> No change.

Hmm, I'd thought that allowing more of the output files to be cached on the clients might reduce the compilation time, but that doesn't seem to be the bottleneck either.

Did you try pre-reading all of the input files on the clients to see whether eliminating the small-file reads was a source of improvement?

> I will try directly on the MDS (so on only one node) to compare.

I look forward to your results.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Maxence Dunnewind
2010-Jun-29 08:09 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
>> 4min43 against 4min51 without it (-j 8)
>
> Ah, this number is with a separate server for the input files. It might be more interesting to see whether it makes a difference with the files all hosted on the same server.

With the source on the same Lustre mount:
- 5min34 without the patch, source and build in the same dir
- 5min32 with the patch, source and build in the same dir
- 5min59 with the patch, source and build in two Lustre dirs (on the same mount) /o\ I ran it 3 times, same result.

For reference:
- 4min43 with the patch and source on another mount
- 4min51 without the patch and source on another mount

>> 7min40 against 7min42 with -j 8
>
> This should be "-j 4" to match the above numbers.

It was; a typo.

> Hmm, I'd thought that allowing more of the output files to be cached on the clients might reduce the compilation time, but that doesn't seem to be the bottleneck either.
>
> Did you try pre-reading all of the input files on the clients to see whether eliminating the small-file reads was a source of improvement?

Do you have any idea how to do that? (It's a kernel compile ;)

>> I will try directly on the MDS (so on only one node) to compare.

I did, with an unpatched module version, and so on only one node:
- 10m03 on a remote node, same dir, -j 4
- 6m30 on the MDS, source and compile in the same dir, -j 4
- 6m30 on the MDS, source and compile in the same dir, -j 8
- 5min54 on the MDS, source and compile in different dirs, -j 8

I will also try with some other software (with some big C++ files, so that the ratio of compilation time to access time is better).

Maxence
Andreas Dilger
2010-Jul-10 04:21 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-28, at 10:04, Maxence Dunnewind wrote:

> I then tried with your patch; not much difference:
>
> 4min43 against 4min51 without it (-j 8)
> 7min40 against 7min42 with -j 8
> So it changes almost nothing :)

I just realized that the patch I sent you defaults to "no change in behaviour" unless FOR_TESTING_ONLY is defined at compile time. I didn't want someone using the patch and then complaining that their filesystem didn't work afterward.

I've attached an updated patch that has "#define FOR_TESTING_ONLY", and hopefully this one will make more of a difference in performance.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

[Attachment: mdc-multiop.diff, 829 bytes]
Maxence Dunnewind
2010-Jul-10 06:47 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
> I just realized that the patch I sent you defaults to "no change in behaviour" unless FOR_TESTING_ONLY is defined at compile time. I didn't want someone using the patch and then complaining that their filesystem didn't work afterward.
>
> I've attached an updated patch that has "#define FOR_TESTING_ONLY", and hopefully this one will make more of a difference in performance.

Ahah :) I'll retry on Monday then.

Maxence
Maxence Dunnewind
2010-Jul-15 15:02 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Heya,

I just tried your patch, which is not working very well :D

If I am on one node, the results are a bit better (8min30 against 9min07), but as soon as I compile in parallel on more than one node, I get many compilation errors.

Regards,
Maxence
Maxence Dunnewind
2010-Jul-16 06:27 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
> I just tried your patch, which is not working very well :D
>
> If I am on one node, the results are a bit better (8min30 against 9min07), but as soon as I compile in parallel on more than one node, I get many compilation errors.

Those results were for a linux-2.6.34 defconfig compile. I just tried with Qt4, and it compiles correctly; the results are:

-j 16 : 30min35 against 32min
-j 8 : same time (34min25 vs 34min36)

Regards,
Maxence
Andreas Dilger
2010-Jul-16 16:44 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-07-16, at 0:27, Maxence Dunnewind <maxence at dunnewind.net> wrote:

> I just tried with Qt4, and it compiles correctly; the results are:
> -j 16 : 30min35 against 32min
> -j 8 : same time (34min25 vs 34min36)

Thanks for testing this. What it means is that there is very little contention on the client's single metadata write RPC until there are a lot (16) of concurrent threads. I'm assuming the two numbers are for compiles on Lustre with and without the patch? How does this performance compare to local filesystem performance?

If you are reading all of the input files into cache before the start of the run (e.g. find . | xargs cat > /dev/null), then the slowdown isn't from reading the small input files from the OSTs. If you are writing the output to a separate directory, then the namespace cache of the input files is not being invalidated. Even so, the fact that the patch didn't improve performance means the metadata changes are not blocking the threads much. You also increased the write cache size, so it _shouldn't_ be that writing the output is taking too long.

I'm a bit at a loss for further suggestions for you to test, unless you want to start into things like profiling on the client.

Cheers, Andreas
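As an illustration, a minimal sketch of the pre-read being described, run on each client before the timed build (the source-tree path and hostnames are illustrative, not from the thread):

    # Warm each client's cache by reading every input file once.
    for host in node1 node2 node3 node4; do
        ssh "$host" 'cd /mnt/lustre/linux-2.6.34 && find . -type f | xargs cat > /dev/null'
    done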