I spent a week to optimize LLD performance and just wanted to share things what I found. Also if there's anyone who have a good idea on how to make it faster, I'd like to hear. My focus is mainly on Windows, but my optimizations are generally platform neutral. I aim both single-thread and multi-thread performance. r231434 <http://reviews.llvm.org/rL231454> is a change that has the largest impact. It greatly reduced linking time to link large binaries, as it changed the order of number of symbols we need to process for files within --start-group/--end-group. If you have many library files in a group, you would see a significant performance gain. It's probably not often the case on Unix. On the other hand, on Windows (and IIRC on Darwin), we move all library files to end of input file list and group them together, so it's effective. It improves single-thread performance. r231454 <http://reviews.llvm.org/rL231454> is to apply relocations in parallel in the writer using parallel_for. Because number of relocations are usually pretty large, and each application is independent, you basically get linear performance gain by using threads. Previously it took about 2 seconds to apply 10M relocations (the number in the commit message is wrong) on my machine, but it's now 300 milliseconds, for example. This technique should be applicable to other ports. r231585 <http://reviews.llvm.org/rL231585> changes the algorithm to create base relocations so that we can use parallel_sort. Unfortunately, base relocations are Windows specific, I don't think this technique is applicable to other ports. r231584 <http://reviews.llvm.org/rL231584> and r231432 <http://reviews.llvm.org/rL231432> are effective but minor improvements. At this point, the resolver is bottleneck. Readers introduce surprisingly small delay when linking large binaries, probably thanks to parallel file loading and archive member preloading. Or just that file parsing is an easy task. Preloading hit rate is >99%, so when you need a symbol from an archive file, its member is almost always parsed and ready to be used. ArchiveFile::find() returns immediately with a result. Writers and other post-resolver passes seem reasonably fast. The dominant factor is the resolver. What the resolver does is, roughly speaking, reading files until it resolves all symbols, and put all symbols received from files to a hash table. That's a kind of tight loop. In r231549 <http://reviews.llvm.org/rL231549> I cut number of hash table lookup, but looks like it's hard to optimize that more than this. An idea to make the resolver faster would be to use a concurrent hash map to insert new symbols in parallel. Assuming symbols from the same file don't conflict each other (I think it's a valid assumption), this can be parallelized. I wanted single-thread performance gain, though. (Also, concurrent hash maps are not currently available in LLVM.) Another idea is to eliminate a preprocess to create reverse edges of the atom graph. Some edges in the graph need to be treated as bi-directional, so that all connected edges become live or dead as a group regardless of direction of edges (so that depending symbols are not reclaimed by garbage collector). We probably should always add two edges for each bi-directional edge in the first place to eliminate the need of preprocessing entirely. It's definitely doable. An interesting idea that unfortunately didn't work is interning symbol names. I thought that computing a hash value of a symbol name is bottleneck in hash table, as C++ symbols can be very long. So I wrote a thread-safe string pool to intern (unique-fy) symbol strings, so that we can compare symbol equivalence by pointer comparison. String interning was done in the reader, which is parallelized, and I expected it would improve single-thread performance of the resolver. But I didn't observe meaningful difference in terms of performance. Bottlenecks were not there where I expected, as always (I learned about that again and again, but I was surprised every time). Maybe we need to revisit "optimized" code in LLD to see if it's not really premature optimization. If it is, we should rewrite with simple code. I'm currently trying two ideas above. This mail is just FYI, but if you have any recommendations or ideas or whatever, hit "reply". -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150311/d4850540/attachment.html>
On 11.03.2015 23:02, Rui Ueyama wrote:> An idea to make the resolver faster would be to use a concurrent hash > map to insert new symbols in parallel. Assuming symbols from the same > file don't conflict each other (I think it's a valid assumption), this > can be parallelized. I wanted single-thread performance gain, though. > (Also, concurrent hash maps are not currently available in LLVM.)Instead of using a concurrent hash map you could fill multiple hash maps in parallel and then merge them afterwards. If the implementation caches the hash values (which it should for string keys), merging maps doesn't require recomputing the hash values and can be a relatively fast operation when there is only a small overlap between the maps. More generally, if hash map operations make up a large share of the run time, it might be worth looking for situations where hash values are unnecessarily recomputed, e.g. when an item is conditionally inserted or when a hash map is grown or copied. (This is just a general observation, I don't know anything about LLD's implementation.) - Stephan
On Thu, Mar 12, 2015 at 4:52 AM, Stephan Tolksdorf <st at quanttec.com> wrote:> > On 11.03.2015 23:02, Rui Ueyama wrote: > >> An idea to make the resolver faster would be to use a concurrent hash >> map to insert new symbols in parallel. Assuming symbols from the same >> file don't conflict each other (I think it's a valid assumption), this >> can be parallelized. I wanted single-thread performance gain, though. >> (Also, concurrent hash maps are not currently available in LLVM.) >> > > Instead of using a concurrent hash map you could fill multiple hash maps > in parallel and then merge them afterwards. If the implementation caches > the hash values (which it should for string keys), merging maps doesn't > require recomputing the hash values and can be a relatively fast operation > when there is only a small overlap between the maps. >Yeah - I'd wonder more about the process here - each file has non-overlapping items? Are the possibly-overlapping items between files know-able? (only certain kinds of symbols) maybe you can use that to your advantage in some way? (yeah, I know - super vague)> > More generally, if hash map operations make up a large share of the run > time, it might be worth looking for situations where hash values are > unnecessarily recomputed, e.g. when an item is conditionally inserted or > when a hash map is grown or copied. (This is just a general observation, I > don't know anything about LLD's implementation.) > > - Stephan > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150312/3ed1953c/attachment.html>
I tried benchmarking it on linux by linking clang Release+asserts (but
lld itself with no asserts). The first things I noticed were:
missing options:
warning: ignoring unknown argument: --no-add-needed
warning: ignoring unknown argument: -O3
warning: ignoring unknown argument: --gc-sections
I just removed them from the command line.
Looks like --hash-style=gnu and --build-id are just ignored, so I
removed them too.
Looks like --strip-all is ignored, so I removed and ran strip manually.
Looks like .note.GNU-stack is incorrectly added, neither gold nor
bfd.ld adds it for clang.
Looks like .gnu.version and .gnu.version_r are not implemented.
Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
it is not included in .got.
Gold produces a .data.rel.ro.local. lld produces a .data.rel.local.
bfd puts everything in .data.rel. I have to research a bit to find out
what this is. For now I just added the sizes into a single entry.
.eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr
from the command line.
With all that, the sections that increased in size the most when using lld were:
.rodata: 9 449 278 bytes bigger
.eh_frame: 438 376 bytes bigger
.comment: 77 797 bytes bigger
.data.rel.ro: 48 056 bytes bigger
The comment section is bigger because it has multiple copies of
clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)
The lack of duplicate entry merging would also explain the size
difference of .rodata and .eh_frame. No idea why .data.rel.ro is
bigger.
So, with the big warning that both linkers are not doing exactly the
same thing, the performance numbers I got were:
lld:
       1961.842991      task-clock (msec)         #    0.999 CPUs
utilized            ( +-  0.04% )
             1,152      context-switches          #    0.587 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               ( +-100.00% )
           199,310      page-faults               #    0.102 M/sec
               ( +-  0.00% )
     5,893,291,145      cycles                    #    3.004 GHz
               ( +-  0.03% )
     3,329,741,079      stalled-cycles-frontend   #   56.50% frontend
cycles idle     ( +-  0.05% )
   <not supported>      stalled-cycles-backend
     6,255,727,902      instructions              #    1.06  insns per
cycle
                                                  #    0.53  stalled
cycles per insn  ( +-  0.01% )
     1,295,893,191      branches                  #  660.549 M/sec
               ( +-  0.01% )
        26,760,734      branch-misses             #    2.07% of all
branches          ( +-  0.01% )
       1.963705923 seconds time elapsed
          ( +-  0.04% )
gold:
        990.708786      task-clock (msec)         #    0.999 CPUs
utilized            ( +-  0.06% )
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               ( +-100.00% )
            77,840      page-faults               #    0.079 M/sec
     2,976,552,629      cycles                    #    3.004 GHz
               ( +-  0.02% )
     1,384,720,988      stalled-cycles-frontend   #   46.52% frontend
cycles idle     ( +-  0.04% )
   <not supported>      stalled-cycles-backend
     4,105,948,264      instructions              #    1.38  insns per
cycle
                                                  #    0.34  stalled
cycles per insn  ( +-  0.00% )
       868,894,366      branches                  #  877.043 M/sec
               ( +-  0.00% )
        15,426,051      branch-misses             #    1.78% of all
branches          ( +-  0.01% )
       0.991619294 seconds time elapsed
          ( +-  0.06% )
The biggest difference that shows up is that lld has 1,152 context
switches, but the cpu utilization is still < 1. Maybe there is just a
threading bug somewhere?
>From your description, we build a hash of symbols in an archive and
for each undefined symbol in the overall link check if it is there. It
would probably be more efficient to walk the symbols defined in an
archive and check if it is needed by the overall link status, no? That
would save building a hash table for each archive.
One big difference of how lld works is the atom model. It basically
creates one Atom per symbol. That is inherently more work than what is
done by gold. IMHO it is confusing what atoms are and one way to
specify atoms in the object files.
It would be interesting to define an atom as the smallest thing that
cannot be split. It could still have multiple symbols in it for
example, and there would be no such thing as a AbsoltueAtom, just an
AbsoluteSymbol. In this model, the MachO reader would use symbols to
create atoms, but that is just one way to do it. The elf reader would
create 1 atom per regular section and special case SHF_MERGE and
.eh_frame (but we should really fix this one in LLVM too).
The atoms created in this way (for ELF at least) would be freely
movable, further reducing the cost.
Cheers,
Rafael
Rafael, This is very good information and extremely useful. On 3/12/2015 11:49 AM, Rafael Espíndola wrote:> I tried benchmarking it on linux by linking clang Release+asserts (but > lld itself with no asserts). The first things I noticed were: > > missing options: > > warning: ignoring unknown argument: --no-add-needed > warning: ignoring unknown argument: -O3 > warning: ignoring unknown argument: --gc-sections > > I just removed them from the command line. > > Looks like --hash-style=gnu and --build-id are just ignored, so I > removed them too. > > Looks like --strip-all is ignored, so I removed and ran strip manually. > > Looks like .note.GNU-stack is incorrectly added, neither gold nor > bfd.ld adds it for clang. > > Looks like .gnu.version and .gnu.version_r are not implemented. > > Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why > it is not included in .got.I have a fix for this. Will merge it.> > Gold produces a .data.rel.ro.local. lld produces a .data.rel.local. > bfd puts everything in .data.rel. I have to research a bit to find out > what this is. For now I just added the sizes into a single entry. > > .eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr > from the command line. > > With all that, the sections that increased in size the most when using lld were: > > .rodata: 9 449 278 bytes bigger > .eh_frame: 438 376 bytes bigger > .comment: 77 797 bytes bigger > .data.rel.ro: 48 056 bytes biggerDid you try --merge-strings with lld ? --gc-sections> > The comment section is bigger because it has multiple copies of > > clang version 3.7.0 (trunk 232021) (llvm/trunk 232027) > > The lack of duplicate entry merging would also explain the size > difference of .rodata and .eh_frame. No idea why .data.rel.ro is > bigger. > > So, with the big warning that both linkers are not doing exactly the > same thing, the performance numbers I got were: > > lld: > > > 1961.842991 task-clock (msec) # 0.999 CPUs > utilized ( +- 0.04% ) > 1,152 context-switches # 0.587 K/sec > 0 cpu-migrations # 0.000 K/sec > ( +-100.00% ) > 199,310 page-faults # 0.102 M/sec > ( +- 0.00% ) > 5,893,291,145 cycles # 3.004 GHz > ( +- 0.03% ) > 3,329,741,079 stalled-cycles-frontend # 56.50% frontend > cycles idle ( +- 0.05% ) > <not supported> stalled-cycles-backend > 6,255,727,902 instructions # 1.06 insns per > cycle > # 0.53 stalled > cycles per insn ( +- 0.01% ) > 1,295,893,191 branches # 660.549 M/sec > ( +- 0.01% ) > 26,760,734 branch-misses # 2.07% of all > branches ( +- 0.01% ) > > 1.963705923 seconds time elapsed > ( +- 0.04% ) > > gold: > > 990.708786 task-clock (msec) # 0.999 CPUs > utilized ( +- 0.06% ) > 0 context-switches # 0.000 K/sec > 0 cpu-migrations # 0.000 K/sec > ( +-100.00% ) > 77,840 page-faults # 0.079 M/sec > 2,976,552,629 cycles # 3.004 GHz > ( +- 0.02% ) > 1,384,720,988 stalled-cycles-frontend # 46.52% frontend > cycles idle ( +- 0.04% ) > <not supported> stalled-cycles-backend > 4,105,948,264 instructions # 1.38 insns per > cycle > # 0.34 stalled > cycles per insn ( +- 0.00% ) > 868,894,366 branches # 877.043 M/sec > ( +- 0.00% ) > 15,426,051 branch-misses # 1.78% of all > branches ( +- 0.01% ) > > 0.991619294 seconds time elapsed > ( +- 0.06% ) > > > The biggest difference that shows up is that lld has 1,152 context > switches, but the cpu utilization is still < 1. Maybe there is just a > threading bug somewhere?lld apparently is highly multithreaded, but I see your point. May be trying to do this exercise on /dev/shm can show more cpu utilization ? Shankar Easwaran -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation
On Thu, Mar 12, 2015 at 9:49 AM, Rafael Espíndola < rafael.espindola at gmail.com> wrote:> I tried benchmarking it on linux by linking clang Release+asserts (but > lld itself with no asserts). The first things I noticed were: > > missing options: > > warning: ignoring unknown argument: --no-add-needed > warning: ignoring unknown argument: -O3 > warning: ignoring unknown argument: --gc-sections > > I just removed them from the command line. > > Looks like --hash-style=gnu and --build-id are just ignored, so I > removed them too. > > Looks like --strip-all is ignored, so I removed and ran strip manually. > > Looks like .note.GNU-stack is incorrectly added, neither gold nor > bfd.ld adds it for clang. > > Looks like .gnu.version and .gnu.version_r are not implemented. > > Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why > it is not included in .got. > > Gold produces a .data.rel.ro.local. lld produces a .data.rel.local. > bfd puts everything in .data.rel. I have to research a bit to find out > what this is. For now I just added the sizes into a single entry. > > .eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr > from the command line. > > With all that, the sections that increased in size the most when using lld > were: > > .rodata: 9 449 278 bytes bigger > .eh_frame: 438 376 bytes bigger > .comment: 77 797 bytes bigger > .data.rel.ro: 48 056 bytes bigger > > The comment section is bigger because it has multiple copies of > > clang version 3.7.0 (trunk 232021) (llvm/trunk 232027) > > The lack of duplicate entry merging would also explain the size > difference of .rodata and .eh_frame. No idea why .data.rel.ro is > bigger. > > So, with the big warning that both linkers are not doing exactly the > same thing, the performance numbers I got were: > > lld: > > > 1961.842991 task-clock (msec) # 0.999 CPUs > utilized ( +- 0.04% ) > 1,152 context-switches # 0.587 K/sec > 0 cpu-migrations # 0.000 K/sec > ( +-100.00% ) > 199,310 page-faults # 0.102 M/sec > ( +- 0.00% ) > 5,893,291,145 cycles # 3.004 GHz > ( +- 0.03% ) > 3,329,741,079 stalled-cycles-frontend # 56.50% frontend > cycles idle ( +- 0.05% ) > <not supported> stalled-cycles-backend > 6,255,727,902 instructions # 1.06 insns per > cycle > # 0.53 stalled > cycles per insn ( +- 0.01% ) > 1,295,893,191 branches # 660.549 M/sec > ( +- 0.01% ) > 26,760,734 branch-misses # 2.07% of all > branches ( +- 0.01% ) > > 1.963705923 seconds time elapsed > ( +- 0.04% ) > > gold: > > 990.708786 task-clock (msec) # 0.999 CPUs > utilized ( +- 0.06% ) > 0 context-switches # 0.000 K/sec > 0 cpu-migrations # 0.000 K/sec > ( +-100.00% ) > 77,840 page-faults # 0.079 M/sec > 2,976,552,629 cycles # 3.004 GHz > ( +- 0.02% ) > 1,384,720,988 stalled-cycles-frontend # 46.52% frontend > cycles idle ( +- 0.04% ) > <not supported> stalled-cycles-backend > 4,105,948,264 instructions # 1.38 insns per > cycle > # 0.34 stalled > cycles per insn ( +- 0.00% ) > 868,894,366 branches # 877.043 M/sec > ( +- 0.00% ) > 15,426,051 branch-misses # 1.78% of all > branches ( +- 0.01% ) > > 0.991619294 seconds time elapsed > ( +- 0.06% ) > > > The biggest difference that shows up is that lld has 1,152 context > switches, but the cpu utilization is still < 1. Maybe there is just a > threading bug somewhere? >The implementation of the threading class inside LLD is different between Windows and other platforms. On Windows, it's just a wrapper for Microsoft Concrt threading library. On other platforms, we have a simple implementation to mimic it. So, first of all, I don't know about the number measured on Unix. (I didn't test that.) But, 1,152 context switches is small number, I guess? It's unlikely that that number of context switches would make LLD two times slower than gold. I believe bottleneck is something else. I think no one really optimized ELF reader, passes and writers, there might be some bad code there, but probably I shouldn't make a guess but instead measure.>From your description, we build a hash of symbols in an archive and > for each undefined symbol in the overall link check if it is there. It > would probably be more efficient to walk the symbols defined in an > archive and check if it is needed by the overall link status, no? That > would save building a hash table for each archive. > > One big difference of how lld works is the atom model. It basically > creates one Atom per symbol. That is inherently more work than what is > done by gold. IMHO it is confusing what atoms are and one way to > specify atoms in the object files. > > It would be interesting to define an atom as the smallest thing that > cannot be split. It could still have multiple symbols in it for > example, and there would be no such thing as a AbsoltueAtom, just an > AbsoluteSymbol. In this model, the MachO reader would use symbols to > create atoms, but that is just one way to do it. The elf reader would > create 1 atom per regular section and special case SHF_MERGE and > .eh_frame (but we should really fix this one in LLVM too). > > The atoms created in this way (for ELF at least) would be freely > movable, further reducing the cost. >I think I agree. Or, at least, the term "atom" is odd because it's not atomic. What we call atom is symbol with associated data. Usually all atoms created from the same section will be linked or excluded as a group. Section is not divisible (or atomic). We don't have notion of section in the resolver. Many linker features are defined in terms of sections, so in order to handle them in the atom model, we need to do something not straightforward. (For example, we copy section attributes to atoms so that they are preserved during linking process. Which means we need to copy attributes to atoms although atoms created from the same section will have the exact same values.) -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150313/3c2dfbff/attachment.html>
Rui, Thanks for this write up and the work you are doing to improve performance. A couple of thoughts: * In the old days, linking was I/O bound so the linker run tended to be the same speed as cat’ing all the input files together. Now that spinning disks are less common, I/O may no longer be the bottleneck. * How does the link time compare to the Windows system linker? * Where exactly in the Resolving phase is the bottleneck? Is it the symbol table or the other work? (Is much time spent growing (and rehashing) the DenseMap? If so, would setting a larger initial bucket size help?) * Regarding the Rafeal’s performance comparison with gold (where lld is about half the speed). The old binutil’s ld was a cross platform linker that supported multiple formats. The goal of gold was to have better performance by only supporting ELF. So I would expect gold to be faster than the general purpose lld. -Nick On Mar 11, 2015, at 3:02 PM, Rui Ueyama <ruiu at google.com> wrote:> I spent a week to optimize LLD performance and just wanted to share things what I found. Also if there's anyone who have a good idea on how to make it faster, I'd like to hear. > > My focus is mainly on Windows, but my optimizations are generally platform neutral. I aim both single-thread and multi-thread performance. > > r231434 is a change that has the largest impact. It greatly reduced linking time to link large binaries, as it changed the order of number of symbols we need to process for files within --start-group/--end-group. If you have many library files in a group, you would see a significant performance gain. It's probably not often the case on Unix. On the other hand, on Windows (and IIRC on Darwin), we move all library files to end of input file list and group them together, so it's effective. It improves single-thread performance. > > r231454 is to apply relocations in parallel in the writer using parallel_for. Because number of relocations are usually pretty large, and each application is independent, you basically get linear performance gain by using threads. Previously it took about 2 seconds to apply 10M relocations (the number in the commit message is wrong) on my machine, but it's now 300 milliseconds, for example. This technique should be applicable to other ports. > > r231585 changes the algorithm to create base relocations so that we can use parallel_sort. Unfortunately, base relocations are Windows specific, I don't think this technique is applicable to other ports. > > r231584 and r231432 are effective but minor improvements. > > At this point, the resolver is bottleneck. Readers introduce surprisingly small delay when linking large binaries, probably thanks to parallel file loading and archive member preloading. Or just that file parsing is an easy task. Preloading hit rate is >99%, so when you need a symbol from an archive file, its member is almost always parsed and ready to be used. ArchiveFile::find() returns immediately with a result. Writers and other post-resolver passes seem reasonably fast. The dominant factor is the resolver. > > What the resolver does is, roughly speaking, reading files until it resolves all symbols, and put all symbols received from files to a hash table. That's a kind of tight loop. In r231549 I cut number of hash table lookup, but looks like it's hard to optimize that more than this. > > An idea to make the resolver faster would be to use a concurrent hash map to insert new symbols in parallel. Assuming symbols from the same file don't conflict each other (I think it's a valid assumption), this can be parallelized. I wanted single-thread performance gain, though. (Also, concurrent hash maps are not currently available in LLVM.) > > Another idea is to eliminate a preprocess to create reverse edges of the atom graph. Some edges in the graph need to be treated as bi-directional, so that all connected edges become live or dead as a group regardless of direction of edges (so that depending symbols are not reclaimed by garbage collector). We probably should always add two edges for each bi-directional edge in the first place to eliminate the need of preprocessing entirely. It's definitely doable. > > An interesting idea that unfortunately didn't work is interning symbol names. I thought that computing a hash value of a symbol name is bottleneck in hash table, as C++ symbols can be very long. So I wrote a thread-safe string pool to intern (unique-fy) symbol strings, so that we can compare symbol equivalence by pointer comparison. String interning was done in the reader, which is parallelized, and I expected it would improve single-thread performance of the resolver. But I didn't observe meaningful difference in terms of performance. > > Bottlenecks were not there where I expected, as always (I learned about that again and again, but I was surprised every time). Maybe we need to revisit "optimized" code in LLD to see if it's not really premature optimization. If it is, we should rewrite with simple code. > > I'm currently trying two ideas above. This mail is just FYI, but if you have any recommendations or ideas or whatever, hit "reply". > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150313/08d483a6/attachment.html>
On Sat, Mar 14, 2015 at 2:52 AM, Nick Kledzik <kledzik at apple.com> wrote:> Rui, > > Thanks for this write up and the work you are doing to improve performance. > A couple of thoughts: > > * In the old days, linking was I/O bound so the linker run tended to be the > same speed as cat’ing all the input files together. Now that spinning disks > are less common, I/O may no longer be the bottleneck. > > * How does the link time compare to the Windows system linker? > > * Where exactly in the Resolving phase is the bottleneck? Is it the symbol > table or the other work? (Is much time spent growing (and rehashing) the > DenseMap? If so, would setting a larger initial bucket size help?) > > * Regarding the Rafeal’s performance comparison with gold (where lld is > about half the speed). The old binutil’s ld was a cross platform linker > that supported multiple formats. The goal of gold was to have better > performance by only supporting ELF. So I would expect gold to be faster > than the general purpose lld. > > > -Nick >After my wild and wrong guess from a couple of days ago, I realized it was better spending some time analyzing code and profiling, rather than speculating. The machine where I tried to link clang using lld is an 8-core Xeon, but I hardly was able to use more than two cores. I ran my tests on UFS + 7200rpm disk ( I can provide details, if needed). The OS is FreeBSD-11 from a month ago or such. Flamegraph results (now zoomable): https://people.freebsd.org/~davide/llvm/lld.svg Nick, to answer your question DenseMap ops are a generally an hotspot for this kind of workload, although grow() operations aren't particularly a big problem (at least in this case). I see them showing up only for addReferencetoSymbol() and they account for large part of the overall samples. In order to convince myself of my theory I started tinkering and modified the initial size of the map but I wasn't able to notice any non-negligible improvement. What instead seems to be a bigger problem is symbolTable::replacement, DenseMap operations are about 10% of the overall samples. It seems that in order to find the replacement for a given atom we need to traverse a chain of atoms until the last one is found. This results in O(n) calls to find() where n is the length of the chain. I will try to investigate and see if we can make something differently. Please let me know if there's something else that shows up from the graph, and/or you have any idea on how to improve. Sorry for the output of symbols being a little bit verbose, the collector wasn't written with C++ in mind. -- Davide
> * Regarding the Rafeal’s performance comparison with gold (where lld is > about half the speed). The old binutil’s ld was a cross platform linker > that supported multiple formats. The goal of gold was to have better > performance by only supporting ELF. So I would expect gold to be faster > than the general purpose lld.Well, knowing that a linker supports other formats is not consolation for a user waiting for a link to finish :-) Hopefully we can figure out what makes lld slower without impacting its design too much, but if an abstraction has such a high cost we should revisit it. Cheers, Rafael
> The biggest difference that shows up is that lld has 1,152 context > switches, but the cpu utilization is still < 1. Maybe there is just a > threading bug somewhere?It was a bug, but in my script: I was pining the process to a single core, a leftover from benchmarking single threaded code. sorry about that. With that fixed (new script attached) what I got when linking clang with a today's build of clang, lld and gold (so not directly comparable to the old numbers) was: lld: 2544.422504 task-clock (msec) # 1.837 CPUs utilized ( +- 0.13% ) 69,815 context-switches # 0.027 M/sec ( +- 0.10% ) 515 cpu-migrations # 0.203 K/sec ( +- 2.38% ) 188,589 page-faults # 0.074 M/sec ( +- 0.00% ) 7,518,975,892 cycles # 2.955 GHz ( +- 0.12% ) 4,848,133,388 stalled-cycles-frontend # 64.48% frontend cycles idle ( +- 0.18% ) <not supported> stalled-cycles-backend 6,383,065,952 instructions # 0.85 insns per cycle # 0.76 stalled cycles per insn ( +- 0.06% ) 1,331,202,027 branches # 523.184 M/sec ( +- 0.07% ) 30,053,442 branch-misses # 2.26% of all branches ( +- 0.04% ) 1.385020712 seconds time elapsed ( +- 0.15% ) gold: 918.859273 task-clock (msec) # 0.999 CPUs utilized ( +- 0.01% ) 0 context-switches # 0.000 K/sec ( +- 41.02% ) 0 cpu-migrations # 0.000 K/sec 76,393 page-faults # 0.083 M/sec 2,758,439,113 cycles # 3.002 GHz ( +- 0.01% ) 1,357,937,367 stalled-cycles-frontend # 49.23% frontend cycles idle ( +- 0.02% ) <not supported> stalled-cycles-backend 3,881,565,068 instructions # 1.41 insns per cycle # 0.35 stalled cycles per insn ( +- 0.00% ) 784,078,474 branches # 853.317 M/sec ( +- 0.00% ) 13,984,077 branch-misses # 1.78% of all branches ( +- 0.01% ) 0.919717669 seconds time elapsed ( +- 0.01% ) gold --threads 1300.210314 task-clock (msec) # 1.523 CPUs utilized ( +- 0.15% ) 26,913 context-switches # 0.021 M/sec ( +- 0.34% ) 7,910 cpu-migrations # 0.006 M/sec ( +- 0.63% ) 83,507 page-faults # 0.064 M/sec ( +- 0.02% ) 3,842,183,459 cycles # 2.955 GHz ( +- 0.14% ) 2,273,634,375 stalled-cycles-frontend # 59.18% frontend cycles idle ( +- 0.20% ) <not supported> stalled-cycles-backend 4,058,907,107 instructions # 1.06 insns per cycle # 0.56 stalled cycles per insn ( +- 0.07% ) 838,791,061 branches # 645.120 M/sec ( +- 0.06% ) 14,220,563 branch-misses # 1.70% of all branches ( +- 0.02% ) 0.853835099 seconds time elapsed ( +- 0.25% ) So much better! The machine has 12 cores, each with 2 hardware threads. -------------- next part -------------- A non-text attachment was scrubbed... Name: run.sh Type: application/x-sh Size: 5352 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150319/5f2c7cae/attachment.sh>
Possibly Parallel Threads
- [LLVMdev] [cfe-dev] Code generation for noexcept functions
- [LLVMdev] ADT/Hashing.h on 32-bit platforms
- [LLVMdev] Exploiting 'unreachable' for optimization purposes
- [LLVMdev] Exploiting 'unreachable' for optimization purposes
- [LLVMdev] ADT/Hashing.h on 32-bit platforms