Maxence Dunnewind
2010-Jun-24 06:54 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Hi,

I'm using Lustre 1.8.3 with an SSI (single system image) and I'm running some compilation benchmarks. My process is simple:

- download the 2.6.34 kernel
- extract it on a Lustre mount
- make defconfig
- time make -j X

For reference, if I use only one client with a local filesystem, it takes about 3min50. The same client alone with a mounted Lustre partition (with local lock) takes more than 10 minutes.

Using Lustre on my SSI, I have these results:

-j 4  : 9min37
-j 8  : 5min34
-j 12 : 4min42
-j 16 : 4min19

So even with 16 processes (on 4 nodes), I can't compile faster than 1 local node... I tried http://blogs.sun.com/atulvid/entry/improving_performance_of_small_files but it did not change anything.

My Lustre setup:
- 1 MGS + MDS
- 3 OSTs

Is there some other way to optimize this, or is Lustre just bad at concurrent access to small files?

Regards,
Maxence

--
Maxence DUNNEWIND
Contact : maxence at dunnewind.net
Site : http://www.dunnewind.net
06 32 39 39 93
GPG : 18AE 61E4 D0B0 1C7C AAC9 E40D 4D39 68DB 0D2E B533
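As an illustration, a minimal sketch of that benchmark procedure, assuming the Lustre filesystem is mounted at /mnt/lustre (the paths and tarball URL are illustrative, not taken from the thread):

    # Unpack the kernel source onto the Lustre mount and time a parallel build.
    cd /mnt/lustre
    wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.34.tar.bz2
    tar xjf linux-2.6.34.tar.bz2
    cd linux-2.6.34
    make defconfig
    time make -j 16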
Andreas Dilger
2010-Jun-24 21:16 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-24, at 00:54, Maxence Dunnewind wrote:

> I'm using Lustre 1.8.3 with an SSI (single system image) and I'm running some compilation benchmarks. My process is simple:
> - download the 2.6.34 kernel
> - extract it on a Lustre mount
> - make defconfig
> - time make -j X
>
> For reference, if I use only one client with a local filesystem, it takes about 3min50. The same client alone with a mounted Lustre partition (with local lock) takes more than 10 minutes.
>
> Using Lustre on my SSI, I have these results:
> -j 4  : 9min37
> -j 8  : 5min34
> -j 12 : 4min42
> -j 16 : 4min19
>
> So even with 16 processes (on 4 nodes), I can't compile faster than 1 local node...
> I tried http://blogs.sun.com/atulvid/entry/improving_performance_of_small_files but it did not change anything.
>
> My Lustre setup:
> - 1 MGS + MDS
> - 3 OSTs
>
> Is there some other way to optimize this, or is Lustre just bad at concurrent access to small files?

I don't think it is realistic to expect that a cache-coherent distributed filesystem that can scale to 10000's of clients also performs as fast as a single client on a local filesystem. That Lustre is completing in 4:19 vs. 3:50 on the local filesystem (12% slower) is a pretty good result for Lustre, I think.

We're of course working on improving Lustre performance for this kind of situation, but it isn't really a priority for most of our customers. I don't want to discourage you from using Lustre, and of course I'd also like Lustre to be faster than even the local filesystem, but you should also look at the other benefits.

A fairer test might be to do the local-node compile and then copy the kernel and all the modules to each of the client nodes, since Lustre is also making the output files available on all of the clients. It is also worthwhile (if you have the time) to determine whether your "make -j 16" is CPU bound or IO bound on the local filesystem. You might try pre-staging all of the input files on the client nodes, and have the compiler output go into a separate directory (not sure if this is possible with Linux kernel compiles) so that the output files created during the run do not invalidate the directory caches.

For comparison, on the same two systems (local fs vs. Lustre) try writing 32*10GB files from 32 clients (use rsh or NFS or whatever you want to transport data from the clients to the local filesystem) and see how performance compares. :-)

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
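As an illustration, a minimal sketch of the Lustre side of the suggested large-file comparison, assuming 32 clients reachable over password-less ssh as client01..client32 (hostnames, paths, and sizes are illustrative):

    # Each client writes one 10GB file into the filesystem under test; time the aggregate.
    time (
        for i in $(seq -w 1 32); do
            ssh "client$i" "dd if=/dev/zero of=/mnt/lustre/bigfile.$i bs=1M count=10240" &
        done
        wait
    )

For the local-filesystem case, the clients would instead have to ship their data over rsh or NFS to the single node, as suggested above.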
Maxence Dunnewind
2010-Jun-25 06:51 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Hi,

To give some context: I'm working on Kerrighed (www.kerrighed.org) and I am trying to find a system "better than NFS" for 4 to 256 clients (for now; maybe more later). The filesystem should be:
- client cache coherent
- as POSIX compatible as possible
- as efficient as possible in "general use".

Lustre indeed seems pretty good for client cache coherency and (it seems) POSIX compliance (i.e. I can't find another fs with the same properties).

> I don't think it is realistic to expect that a cache-coherent distributed filesystem that can scale to 10000's of clients also performs as fast as a single client on a local filesystem. That Lustre is completing in 4:19 vs. 3:50 on the local filesystem (12% slower) is a pretty good result for Lustre, I think.

To be more exact, Lustre on 4 nodes is completing in 4:19 vs 3:50 on one node :)

I know a filesystem can't be good at everything; my question could be understood as "is there some basic (or a bit more complex) tuning to improve compilation / I/O on small files?". Lustre seems, as I said, good for cache coherency and POSIX compliance (maybe not the best, but definitely not the worst).

> We're of course working on improving Lustre performance for this kind of situation, but it isn't really a priority for most of our customers. I don't want to discourage you from using Lustre, and of course I'd also like Lustre to be faster than even the local filesystem, but you should also look at the other benefits.

If you know some other filesystem with such properties, I can try it, but I did not find any (it also must be free software).

> A fairer test might be to do the local-node compile and then copy the kernel and all the modules to each of the client nodes, since Lustre is also making the output files available on all of the clients. It is also worthwhile (if you have the time) to determine whether your "make -j 16" is CPU bound or IO bound on the local filesystem. You might try pre-staging all of the input files on the client nodes, and have the compiler output go into a separate directory (not sure if this is possible with Linux kernel compiles) so that the output files created during the run do not invalidate the directory caches.

I will try this.

> For comparison, on the same two systems (local fs vs. Lustre) try writing 32*10GB files from 32 clients (use rsh or NFS or whatever you want to transport data from the clients to the local filesystem) and see how performance compares. :-)

I will try to find some more "practical" tests, maybe something video-related on some "big" videos, or some parallel work on big images, or...

Thanks for your answers.

Regards,
Maxence
Andreas Dilger
2010-Jun-25 18:09 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-25, at 00:51, Maxence Dunnewind wrote:

>> I don't think it is realistic to expect that a cache-coherent distributed filesystem that can scale to 10000's of clients also performs as fast as a single client on a local filesystem. That Lustre is completing in 4:19 vs. 3:50 on the local filesystem (12% slower) is a pretty good result for Lustre, I think.
>
> To be more exact, Lustre on 4 nodes is completing in 4:19 vs 3:50 on one node :)

I recognized this, but the fact that it gets better performance as nodes are added is a _good_ sign for a filesystem that should perform well when there are many clients. Of course, it would be ideal if the same performance could be had by a single client...

> I know a filesystem can't be good at everything; my question could be understood as "is there some basic (or a bit more complex) tuning to improve compilation / I/O on small files?".

If you are interested in doing a tiny bit of hacking, it would be interesting to run an experiment to see what kind of performance a single client can get in your benchmark. Currently, Lustre limits each client to a single filesystem-modifying metadata operation at a time, in order to prevent the clients from overwhelming the server and to ensure that the clients can recover the filesystem correctly in case of a server crash. However, for testing purposes it would be interesting to disable this limit to see how fast your clients run when it is removed. I haven't tested the attached patch at all, so YMMV, but I'd be interested to see the results.

I'm not sure whether it makes a difference in your case, but increasing the MDC RPCs in flight might also help performance. Also, increasing the client cache size and the number of IO RPCs may help. On the clients run:

    lctl set_param *.*.max_rpcs_in_flight=64
    lctl set_param osc.*.max_dirty_mb=512

to see if these make any difference.

You may also test running the make directly on the MDS with a local Lustre mount to determine whether the network latency is a significant factor in the performance. If you are using Ethernet instead of IB, the latency could be hurting you, since kernel compiles generally do only a tiny amount of work per file and then need to send a few RPCs to open and read the next source file and its headers. Some of this can be hidden by pre-reading all of the files into the client caches (new machines should have enough RAM, about 1GB or so), but the "open" operations still need to send an RPC to the MDS for each file open, so running on the MDS (or with a low-latency network like IB) may help compiles like this run more quickly.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

[Attachment: mdc-multiop.diff, 803 bytes]
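As an illustration, a minimal sketch of applying those tunables on every client and reading the values back, assuming password-less ssh to clients named node1..node4 (the hostnames are illustrative):

    # Raise RPC concurrency and the client write cache on each client, then verify.
    for host in node1 node2 node3 node4; do
        ssh "$host" 'lctl set_param "*.*.max_rpcs_in_flight=64" &&
                     lctl set_param "osc.*.max_dirty_mb=512" &&
                     lctl get_param "mdc.*.max_rpcs_in_flight" "osc.*.max_dirty_mb"'
    done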
Maxence Dunnewind
2010-Jun-28 16:04 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Heya,

> If you are interested in doing a tiny bit of hacking, it would be interesting to run an experiment to see what kind of performance a single client can get in your benchmark. Currently, Lustre limits each client to a single filesystem-modifying metadata operation at a time, in order to prevent the clients from overwhelming the server and to ensure that the clients can recover the filesystem correctly in case of a server crash.

I just tested this. Before that, I tried an out-of-tree build. My four clients use an nfsroot, so I put the kernel source on it, then I mount Lustre on /mnt/lustre and compile into /mnt/lustre/build (with make O=). The results (without your patch) are interesting:

7m42 against 9m37 before, with -j 4
4min51 against 5min34, with -j 8
3min27 against 4min19, with -j 16

I also use -pipe as a gcc option, to avoid temporary files.

So my first question is: would it be possible in some way to disable cache coherency on some subdirectory? If I know all the files in this directory will be accessed read-only, I do not need coherency. That would let me read the files from Lustre instead of NFS.

I then tried with your patch; not much difference:

4min43 against 4min51 without it (-j 8)
7min40 against 7min42 with -j 8

So it changes almost nothing :)

> I'm not sure whether it makes a difference in your case, but increasing the MDC RPCs in flight might also help performance. Also, increasing the client cache size and the number of IO RPCs may help. On the clients run:
>
> lctl set_param *.*.max_rpcs_in_flight=64
> lctl set_param osc.*.max_dirty_mb=512

No change.

> You may also test running the make directly on the MDS with a local Lustre mount to determine whether the network latency is a significant factor in the performance. If you are using Ethernet instead of IB, the latency could be hurting you, since kernel compiles generally do only a tiny amount of work per file and then need to send a few RPCs to open and read the next source file and its headers. Some of this can be hidden by pre-reading all of the files into the client caches (new machines should have enough RAM, about 1GB or so), but the "open" operations still need to send an RPC to the MDS for each file open, so running on the MDS (or with a low-latency network like IB) may help compiles like this run more quickly.

We don't have IB set up at the moment, so I cannot test with it. I will try directly on the MDS (so on only one node) to compare.

Regards,
Maxence
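As an illustration, a minimal sketch of the out-of-tree build being described, with the kernel sources on the nfsroot and the objects written under the Lustre mount (the source path is illustrative):

    # Out-of-tree kernel build: read sources from the nfsroot, write objects to Lustre.
    mkdir -p /mnt/lustre/build
    cd /usr/src/linux-2.6.34
    make O=/mnt/lustre/build defconfig
    time make O=/mnt/lustre/build -j 8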
Andreas Dilger
2010-Jun-28 20:00 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-28, at 10:04, Maxence Dunnewind wrote:

>> If you are interested in doing a tiny bit of hacking, it would be interesting to run an experiment to see what kind of performance a single client can get in your benchmark. Currently, Lustre limits each client to a single filesystem-modifying metadata operation at a time, in order to prevent the clients from overwhelming the server and to ensure that the clients can recover the filesystem correctly in case of a server crash.
>
> I just tested this. Before that, I tried an out-of-tree build. My four clients use an nfsroot, so I put the kernel source on it, then I mount Lustre on /mnt/lustre and compile into /mnt/lustre/build (with make O=). The results (without your patch) are interesting:
> 7m42 against 9m37 before, with -j 4
> 4min51 against 5min34, with -j 8
> 3min27 against 4min19, with -j 16
> I also use -pipe as a gcc option, to avoid temporary files.

I was actually thinking of keeping the source tree on Lustre as well, just not building the output files in the same directory as the input files. It isn't clear from this result whether the speedup was due to having the input files in a separate directory (i.e. lock contention), or because you had a second server hosting the input files (i.e. an RPC limitation of the server).

> So my first question is: would it be possible in some way to disable cache coherency on some subdirectory? If I know all the files in this directory will be accessed read-only, I do not need coherency. That would let me read the files from Lustre instead of NFS.

I don't think this would be practical to do for many years.

> I then tried with your patch; not much difference:
>
> 4min43 against 4min51 without it (-j 8)

Ah, this number is with a separate server for the input files. It might be more interesting to see whether it makes a difference with the files all hosted on the same server.

> 7min40 against 7min42 with -j 8

This should be "-j 4" to match the above numbers.

> So it changes almost nothing :)

That implies that the MDS-modifying RPCs are not necessarily the bottleneck here.

>> I'm not sure whether it makes a difference in your case, but increasing the MDC RPCs in flight might also help performance. Also, increasing the client cache size and the number of IO RPCs may help. On the clients run:
>>
>> lctl set_param *.*.max_rpcs_in_flight=64
>> lctl set_param osc.*.max_dirty_mb=512
>
> No change.

Hmm, I'd thought that allowing more of the output files to be cached on the clients might reduce the compilation time, but that doesn't seem to be the bottleneck either.

Did you try pre-reading all of the input files on the clients to see whether eliminating the small-file reads was a source of improvement?

> I will try directly on the MDS (so on only one node) to compare.

I look forward to your results.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Maxence Dunnewind
2010-Jun-29 08:09 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
>> 4min43 against 4min51 without it (-j 8)
>
> Ah, this number is with a separate server for the input files. It might be more interesting to see whether it makes a difference with the files all hosted on the same server.

With the source on the same Lustre mount:
- 5min34 without the patch, source and build in the same dir
- 5min32 with the patch, source and build in the same dir
- 5min59 with the patch, source and build in two Lustre dirs (on the same mount) /o\ I ran it 3 times, same result.

For reference:
- 4min43 with the patch and source on another mount
- 4min51 without the patch and source on another mount

>> 7min40 against 7min42 with -j 8
>
> This should be "-j 4" to match the above numbers.

It was; a typo.

> Hmm, I'd thought that allowing more of the output files to be cached on the clients might reduce the compilation time, but that doesn't seem to be the bottleneck either.
>
> Did you try pre-reading all of the input files on the clients to see whether eliminating the small-file reads was a source of improvement?

Do you have any idea how to do that? (It's a kernel compile ;)

>> I will try directly on the MDS (so on only one node) to compare.

I did, with an unpatched module version, and so on only one node:
- 10m03 on a remote node, same dir, -j 4
- 6m30 on the MDS, source and compile in the same dir, -j 4
- 6m30 on the MDS, source and compile in the same dir, -j 8
- 5min54 on the MDS, source and compile in different dirs, -j 8

I will also try with some other software (with some big C++ files, so that the ratio of compilation time to access time is better).

Maxence
Andreas Dilger
2010-Jul-10 04:21 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-06-28, at 10:04, Maxence Dunnewind wrote:

> I then tried with your patch; not much difference:
>
> 4min43 against 4min51 without it (-j 8)
> 7min40 against 7min42 with -j 8
> So it changes almost nothing :)

I just realized that the patch I sent you defaults to "no change in behaviour" unless FOR_TESTING_ONLY is defined at compile time. I didn't want someone using the patch and then complaining that their filesystem didn't work afterward.

I've attached an updated patch that has "#define FOR_TESTING_ONLY", and hopefully this one will make more of a difference in performance.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

[Attachment: mdc-multiop.diff, 829 bytes]
Maxence Dunnewind
2010-Jul-10 06:47 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
> I just realized that the patch I sent you defaults to "no change in behaviour" unless FOR_TESTING_ONLY is defined at compile time. I didn't want someone using the patch and then complaining that their filesystem didn't work afterward.
>
> I've attached an updated patch that has "#define FOR_TESTING_ONLY", and hopefully this one will make more of a difference in performance.

Ahah :) I'll retry on Monday then.

Maxence
Maxence Dunnewind
2010-Jul-15 15:02 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
Heya,

I just tried your patch, which is not working very well :D

If I am on one node, the results are a bit better (8min30 against 9min07), but as soon as I compile in parallel on more than one node, I get many compilation errors.

Regards,
Maxence
Maxence Dunnewind
2010-Jul-16 06:27 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
> I just tried your patch, which is not working very well :D
>
> If I am on one node, the results are a bit better (8min30 against 9min07), but as soon as I compile in parallel on more than one node, I get many compilation errors.

Those results were for a linux-2.6.34 defconfig compile. I just tried with Qt4, and it compiles correctly; the results are:

-j 16 : 30min35 against 32min
-j 8 : same time (34min25 vs 34min36)

Regards,
Maxence
Andreas Dilger
2010-Jul-16 16:44 UTC
[Lustre-discuss] Optimize parallel compilation on lustre
On 2010-07-16, at 0:27, Maxence Dunnewind <maxence at dunnewind.net> wrote:

> I just tried with Qt4, and it compiles correctly; the results are:
> -j 16 : 30min35 against 32min
> -j 8 : same time (34min25 vs 34min36)

Thanks for testing this. What it means is that there is very little contention on the client's single metadata write RPC until there are a lot (16) of concurrent threads. I'm assuming the two numbers are for compiles on Lustre with and without the patch? How does this performance compare to local filesystem performance?

If you are reading all of the input files into cache before the start of the run (e.g. find . | xargs cat > /dev/null), then the slowdown isn't from reading the small input files from the OSTs. If you are writing the output to a separate directory, then the namespace cache of the input files is not being invalidated. Even so, the fact that the patch didn't improve performance means the metadata changes are not blocking the threads much. You also increased the write cache size, so it _shouldn't_ be that writing the output is taking too long.

I'm a bit at a loss for further suggestions for you to test, unless you want to start into things like profiling on the client.

Cheers, Andreas
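As an illustration, a minimal sketch of the pre-read being described, run on each client before the timed build (the source-tree path and hostnames are illustrative, not from the thread):

    # Warm each client's cache by reading every input file once.
    for host in node1 node2 node3 node4; do
        ssh "$host" 'cd /mnt/lustre/linux-2.6.34 && find . -type f | xargs cat > /dev/null'
    done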