I spent a week optimizing LLD performance and just wanted to share what I found. If anyone has a good idea on how to make it faster, I'd like to hear it.

My focus is mainly on Windows, but my optimizations are generally platform neutral. I'm aiming at both single-thread and multi-thread performance.

r231434 <http://reviews.llvm.org/rL231434> is the change with the largest impact. It greatly reduced the time to link large binaries, as it changed the order of the number of symbols we need to process for files within --start-group/--end-group. If you have many library files in a group, you should see a significant performance gain. That's probably not often the case on Unix. On the other hand, on Windows (and IIRC on Darwin), we move all library files to the end of the input file list and group them together, so it's effective there. It improves single-thread performance.

r231454 <http://reviews.llvm.org/rL231454> applies relocations in parallel in the writer using parallel_for. Because the number of relocations is usually pretty large, and each application is independent, you get essentially linear speedup from threads. Previously it took about 2 seconds to apply 10M relocations (the number in the commit message is wrong) on my machine; now it's 300 milliseconds, for example. This technique should be applicable to other ports.

r231585 <http://reviews.llvm.org/rL231585> changes the algorithm that creates base relocations so that we can use parallel_sort. Unfortunately, base relocations are Windows specific, so I don't think this technique is applicable to other ports.

r231584 <http://reviews.llvm.org/rL231584> and r231432 <http://reviews.llvm.org/rL231432> are effective but minor improvements.

At this point, the resolver is the bottleneck. The readers introduce surprisingly little delay when linking large binaries, probably thanks to parallel file loading and archive member preloading, or simply because file parsing is an easy task. The preloading hit rate is >99%, so when you need a symbol from an archive file, its member is almost always parsed and ready to be used, and ArchiveFile::find() returns immediately with a result. The writers and other post-resolver passes seem reasonably fast. The dominant factor is the resolver.

What the resolver does is, roughly speaking, read files until all symbols are resolved, putting all symbols received from files into a hash table. It's a kind of tight loop. In r231549 <http://reviews.llvm.org/rL231549> I cut the number of hash table lookups, but it looks hard to optimize it much further than this.

One idea to make the resolver faster would be to use a concurrent hash map to insert new symbols in parallel. Assuming symbols from the same file don't conflict with each other (I think that's a valid assumption), this can be parallelized. I wanted a single-thread performance gain, though. (Also, concurrent hash maps are not currently available in LLVM.)

Another idea is to eliminate the preprocessing step that creates reverse edges of the atom graph. Some edges in the graph need to be treated as bi-directional, so that all connected atoms become live or dead as a group regardless of the direction of the edges (so that symbols something depends on are not reclaimed by the garbage collector). We should probably add two edges for each bi-directional edge in the first place, which would eliminate the need for the preprocessing entirely. It's definitely doable.
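To make that last idea a bit more concrete, here is a minimal, hypothetical sketch (the Atom type and helpers are made up, not lld's actual classes): every bi-directional "group" reference records its back-edge at insertion time, so dead-stripping can run a plain forward traversal with no separate pass that builds reverse edges.

#include <vector>

struct Atom {
  std::vector<Atom *> successors;  // edges followed by the mark phase
  bool live = false;
};

// A normal (directed) reference.
static void addReference(Atom *from, Atom *to) {
  from->successors.push_back(to);
}

// A group (bi-directional) reference: both atoms live or die together.
static void addGroupReference(Atom *a, Atom *b) {
  addReference(a, b);
  addReference(b, a);  // back-edge added immediately instead of in a prepass
}

// Standard mark phase: everything reachable from a root stays live.
static void mark(Atom *root) {
  if (root->live)
    return;
  root->live = true;
  for (Atom *s : root->successors)
    mark(s);
}

The cost is a little extra memory per group reference, in exchange for dropping the preprocessing pass entirely.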
An interesting idea that unfortunately didn't work out is interning symbol names. I thought that computing the hash of a symbol name was a bottleneck in the hash table, as C++ symbols can be very long. So I wrote a thread-safe string pool to intern (uniquify) symbol strings, so that we can test symbol equivalence by pointer comparison. The string interning was done in the reader, which is parallelized, and I expected it to improve the single-thread performance of the resolver. But I didn't observe a meaningful difference in performance.

The bottlenecks were not where I expected, as always (I've learned that again and again, but I'm surprised every time). Maybe we need to revisit "optimized" code in LLD to see whether it's actually premature optimization. If it is, we should rewrite it with simple code.

I'm currently trying the two ideas above. This mail is just FYI, but if you have any recommendations or ideas or whatever, hit "reply".
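For reference, the parallel relocation application from r231454 boils down to something like the following sketch. This is not lld's code (lld has its own parallel_for utility); it only illustrates why the speedup is nearly linear: relocations touch disjoint locations, so workers never have to synchronize. The Reloc type and the 32-bit add are placeholders.

#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

struct Reloc {
  uint32_t offset;  // offset of the fixup within the output buffer
  int32_t addend;   // value to add at that location (placeholder relocation kind)
};

static void applyRelocsInParallel(uint8_t *buf, const std::vector<Reloc> &relocs) {
  unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
  size_t chunk = (relocs.size() + nthreads - 1) / nthreads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nthreads; ++t) {
    size_t begin = t * chunk, end = std::min(relocs.size(), begin + chunk);
    if (begin >= end)
      break;
    workers.emplace_back([=, &relocs] {
      for (size_t i = begin; i < end; ++i) {
        // Each relocation writes a distinct location, so no locking is needed.
        uint32_t *p = reinterpret_cast<uint32_t *>(buf + relocs[i].offset);
        *p += relocs[i].addend;
      }
    });
  }
  for (std::thread &w : workers)
    w.join();
}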
On 11.03.2015 23:02, Rui Ueyama wrote:
> An idea to make the resolver faster would be to use a concurrent hash
> map to insert new symbols in parallel. Assuming symbols from the same
> file don't conflict each other (I think it's a valid assumption), this
> can be parallelized. I wanted single-thread performance gain, though.
> (Also, concurrent hash maps are not currently available in LLVM.)

Instead of using a concurrent hash map you could fill multiple hash maps in parallel and then merge them afterwards. If the implementation caches the hash values (which it should for string keys), merging maps doesn't require recomputing the hash values and can be a relatively fast operation when there is only a small overlap between the maps.

More generally, if hash map operations make up a large share of the run time, it might be worth looking for situations where hash values are unnecessarily recomputed, e.g. when an item is conditionally inserted or when a hash map is grown or copied. (This is just a general observation, I don't know anything about LLD's implementation.)

- Stephan
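A hypothetical sketch of what Stephan is suggesting: each reader builds its own symbol map in parallel and caches the hash of every key, so the final sequential merge never re-hashes the (possibly very long) C++ names. The names and the use of std::unordered_map are illustrative, not lld's actual structures.

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct HashedName {
  std::string name;
  size_t hash;  // computed once, in the parallel per-file phase
  bool operator==(const HashedName &o) const { return name == o.name; }
};

struct HashedNameHasher {
  size_t operator()(const HashedName &k) const { return k.hash; }  // reuse cached hash
};

struct Atom;  // stand-in for whatever a symbol resolves to

using SymbolMap = std::unordered_map<HashedName, Atom *, HashedNameHasher>;

HashedName makeKey(std::string name) {
  size_t h = std::hash<std::string>()(name);
  return HashedName{std::move(name), h};
}

// Merge the per-file maps built in parallel; duplicate names are exactly the
// cases that would go through normal symbol resolution anyway.
SymbolMap mergeMaps(std::vector<SymbolMap> &perFile) {
  SymbolMap global;
  for (SymbolMap &m : perFile)
    for (auto &kv : m)
      global.emplace(kv.first, kv.second);  // hash taken from kv.first.hash
  return global;
}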
On Thu, Mar 12, 2015 at 4:52 AM, Stephan Tolksdorf <st at quanttec.com> wrote:
> On 11.03.2015 23:02, Rui Ueyama wrote:
>> An idea to make the resolver faster would be to use a concurrent hash
>> map to insert new symbols in parallel. ...
>
> Instead of using a concurrent hash map you could fill multiple hash maps
> in parallel and then merge them afterwards. If the implementation caches
> the hash values (which it should for string keys), merging maps doesn't
> require recomputing the hash values and can be a relatively fast operation
> when there is only a small overlap between the maps.

Yeah - I'd wonder more about the process here - each file has non-overlapping items? Are the possibly-overlapping items between files knowable (only certain kinds of symbols)? Maybe you can use that to your advantage in some way? (Yeah, I know - super vague.)

> More generally, if hash map operations make up a large share of the run
> time, it might be worth looking for situations where hash values are
> unnecessarily recomputed, e.g. when an item is conditionally inserted or
> when a hash map is grown or copied. (This is just a general observation, I
> don't know anything about LLD's implementation.)
>
> - Stephan
I tried benchmarking it on linux by linking clang Release+asserts (but lld itself with no asserts). The first things I noticed were:

missing options:

warning: ignoring unknown argument: --no-add-needed
warning: ignoring unknown argument: -O3
warning: ignoring unknown argument: --gc-sections

I just removed them from the command line.

Looks like --hash-style=gnu and --build-id are just ignored, so I removed them too.

Looks like --strip-all is ignored, so I removed it and ran strip manually.

Looks like .note.GNU-stack is incorrectly added; neither gold nor bfd.ld adds it for clang.

Looks like .gnu.version and .gnu.version_r are not implemented.

Curiously, lld produces a tiny got.dyn (0x0000a0 bytes); not sure why it is not included in .got.

Gold produces a .data.rel.ro.local. lld produces a .data.rel.local. bfd puts everything in .data.rel. I have to research a bit to find out what this is. For now I just added the sizes into a single entry.

.eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr from the command line.

With all that, the sections that increased in size the most when using lld were:

.rodata:      9 449 278 bytes bigger
.eh_frame:      438 376 bytes bigger
.comment:        77 797 bytes bigger
.data.rel.ro:    48 056 bytes bigger

The .comment section is bigger because it has multiple copies of

clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)

The lack of duplicate entry merging would also explain the size difference of .rodata and .eh_frame. No idea why .data.rel.ro is bigger.

So, with the big warning that both linkers are not doing exactly the same thing, the performance numbers I got were:

lld:

   1961.842991 task-clock (msec)        # 0.999 CPUs utilized            ( +- 0.04% )
         1,152 context-switches         # 0.587 K/sec
             0 cpu-migrations           # 0.000 K/sec                    ( +-100.00% )
       199,310 page-faults              # 0.102 M/sec                    ( +- 0.00% )
 5,893,291,145 cycles                   # 3.004 GHz                      ( +- 0.03% )
 3,329,741,079 stalled-cycles-frontend  # 56.50% frontend cycles idle    ( +- 0.05% )
 <not supported> stalled-cycles-backend
 6,255,727,902 instructions             # 1.06 insns per cycle
                                        # 0.53 stalled cycles per insn   ( +- 0.01% )
 1,295,893,191 branches                 # 660.549 M/sec                  ( +- 0.01% )
    26,760,734 branch-misses            # 2.07% of all branches          ( +- 0.01% )

   1.963705923 seconds time elapsed                                      ( +- 0.04% )

gold:

    990.708786 task-clock (msec)        # 0.999 CPUs utilized            ( +- 0.06% )
             0 context-switches         # 0.000 K/sec
             0 cpu-migrations           # 0.000 K/sec                    ( +-100.00% )
        77,840 page-faults              # 0.079 M/sec
 2,976,552,629 cycles                   # 3.004 GHz                      ( +- 0.02% )
 1,384,720,988 stalled-cycles-frontend  # 46.52% frontend cycles idle    ( +- 0.04% )
 <not supported> stalled-cycles-backend
 4,105,948,264 instructions             # 1.38 insns per cycle
                                        # 0.34 stalled cycles per insn   ( +- 0.00% )
   868,894,366 branches                 # 877.043 M/sec                  ( +- 0.00% )
    15,426,051 branch-misses            # 1.78% of all branches          ( +- 0.01% )

   0.991619294 seconds time elapsed                                      ( +- 0.06% )

The biggest difference that shows up is that lld has 1,152 context switches, but the cpu utilization is still < 1. Maybe there is just a threading bug somewhere?

From your description, we build a hash of symbols in an archive and, for each undefined symbol in the overall link, check if it is there. It would probably be more efficient to walk the symbols defined in an archive and check if they are needed by the overall link status, no? That would save building a hash table for each archive.

One big difference in how lld works is the atom model. It basically creates one Atom per symbol. That is inherently more work than what is done by gold.
IMHO it is confusing what atoms are, and that there is only one way to specify atoms in the object files. It would be interesting to define an atom as the smallest thing that cannot be split. It could still have multiple symbols in it, for example, and there would be no such thing as an AbsoluteAtom, just an AbsoluteSymbol. In this model, the MachO reader would use symbols to create atoms, but that is just one way to do it. The ELF reader would create one atom per regular section and special-case SHF_MERGE and .eh_frame (but we should really fix that one in LLVM too).

The atoms created in this way (for ELF at least) would be freely movable, further reducing the cost.

Cheers,
Rafael
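To illustrate the archive-handling alternative described above (walking the symbols an archive member defines and probing the link's undefined-symbol set, instead of probing a per-archive table with every undefined symbol), here is a hypothetical sketch; the types are made up and this is not how lld is actually structured.

#include <string>
#include <unordered_set>
#include <vector>

struct ArchiveMember {
  std::vector<std::string> definedSymbols;
  bool loaded = false;
};

// Returns true if the member should be pulled into the link: one lookup in
// the link's undefined-symbol set per defined symbol, and no per-archive
// hash table needs to be built at all.
static bool shouldLoad(const ArchiveMember &m,
                       const std::unordered_set<std::string> &undefined) {
  for (const std::string &sym : m.definedSymbols)
    if (undefined.count(sym))
      return true;
  return false;
}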
Rafael,

This is very good information and extremely useful.

On 3/12/2015 11:49 AM, Rafael Espíndola wrote:
> I tried benchmarking it on linux by linking clang Release+asserts (but lld itself with no asserts). The first things I noticed were:
>
> missing options:
>
> warning: ignoring unknown argument: --no-add-needed
> warning: ignoring unknown argument: -O3
> warning: ignoring unknown argument: --gc-sections
>
> I just removed them from the command line.
>
> Looks like --hash-style=gnu and --build-id are just ignored, so I removed them too.
>
> Looks like --strip-all is ignored, so I removed it and ran strip manually.
>
> Looks like .note.GNU-stack is incorrectly added; neither gold nor bfd.ld adds it for clang.
>
> Looks like .gnu.version and .gnu.version_r are not implemented.
>
> Curiously, lld produces a tiny got.dyn (0x0000a0 bytes); not sure why it is not included in .got.

I have a fix for this. Will merge it.

> Gold produces a .data.rel.ro.local. lld produces a .data.rel.local. bfd puts everything in .data.rel. I have to research a bit to find out what this is. For now I just added the sizes into a single entry.
>
> .eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr from the command line.
>
> With all that, the sections that increased in size the most when using lld were:
>
> .rodata:      9 449 278 bytes bigger
> .eh_frame:      438 376 bytes bigger
> .comment:        77 797 bytes bigger
> .data.rel.ro:    48 056 bytes bigger

Did you try --merge-strings with lld? --gc-sections?

> The comment section is bigger because it has multiple copies of
>
> clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)
>
> The lack of duplicate entry merging would also explain the size difference of .rodata and .eh_frame. No idea why .data.rel.ro is bigger.
>
> So, with the big warning that both linkers are not doing exactly the same thing, the performance numbers I got were:
>
> [...]
>
> The biggest difference that shows up is that lld has 1,152 context switches, but the cpu utilization is still < 1. Maybe there is just a threading bug somewhere?

lld apparently is highly multithreaded, but I see your point. Maybe trying to do this exercise on /dev/shm can show more cpu utilization?
Shankar Easwaran

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation
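For context on the --merge-strings question and the .rodata size difference: mergeable string sections (SHF_MERGE|SHF_STRINGS, e.g. .rodata.str1.1) can be deduplicated roughly as in the following hypothetical sketch. This is a simplification, not lld's or gold's implementation; relocation fix-up is reduced to an offset map.

#include <cstddef>
#include <string>
#include <unordered_map>

struct MergedStrings {
  std::string outputSection;                        // concatenated unique strings
  std::unordered_map<std::string, size_t> offsets;  // string -> offset in output

  // Add one NUL-terminated input string; return its offset in the merged
  // output section so references into the old section can be redirected.
  size_t add(const std::string &s) {
    auto it = offsets.find(s);
    if (it != offsets.end())
      return it->second;                 // duplicate: reuse the existing copy
    size_t off = outputSection.size();
    outputSection.append(s);
    outputSection.push_back('\0');
    offsets.emplace(s, off);
    return off;
  }
};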
On Thu, Mar 12, 2015 at 9:49 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
> [...]
>
> The biggest difference that shows up is that lld has 1,152 context switches, but the cpu utilization is still < 1. Maybe there is just a threading bug somewhere?

The implementation of the threading class inside LLD is different between Windows and other platforms. On Windows, it's just a wrapper for Microsoft's ConcRT threading library. On other platforms, we have a simple implementation that mimics it.
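As a rough illustration of what such a "simple implementation that mimics it" can look like on top of std::thread, here is a hypothetical task-group sketch; it is not lld's actual threading code, just the shape of the abstraction.

#include <functional>
#include <thread>
#include <utility>
#include <vector>

class TaskGroup {
  std::vector<std::thread> threads;

public:
  // Run a task asynchronously; sync() waits for all spawned tasks.
  void spawn(std::function<void()> f) { threads.emplace_back(std::move(f)); }

  void sync() {
    for (std::thread &t : threads)
      t.join();
    threads.clear();
  }

  ~TaskGroup() { sync(); }
};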
So, first of all, I don't know about the numbers measured on Unix (I didn't test that). But 1,152 context switches is a small number, I guess? It's unlikely that that number of context switches would make LLD two times slower than gold. I believe the bottleneck is something else. I don't think anyone has really optimized the ELF reader, passes and writers, so there might be some bad code there, but I probably shouldn't guess; I should measure instead.

> [...]
>
> One big difference in how lld works is the atom model. It basically
> creates one Atom per symbol. That is inherently more work than what is
> done by gold. IMHO it is confusing what atoms are, and that there is
> only one way to specify atoms in the object files.
>
> It would be interesting to define an atom as the smallest thing that
> cannot be split. It could still have multiple symbols in it, for
> example, and there would be no such thing as an AbsoluteAtom, just an
> AbsoluteSymbol. In this model, the MachO reader would use symbols to
> create atoms, but that is just one way to do it. The ELF reader would
> create one atom per regular section and special-case SHF_MERGE and
> .eh_frame (but we should really fix that one in LLVM too).
>
> The atoms created in this way (for ELF at least) would be freely
> movable, further reducing the cost.

I think I agree. Or, at least, the term "atom" is odd because it's not atomic. What we call an atom is a symbol with associated data. Usually all atoms created from the same section will be linked or excluded as a group; a section is not divisible (or atomic). We don't have a notion of sections in the resolver. Many linker features are defined in terms of sections, so in order to handle them in the atom model, we need to do something not straightforward. (For example, we copy section attributes to atoms so that they are preserved during the linking process, which means we copy attributes to atoms even though atoms created from the same section will all have the exact same values.)
Rui,

Thanks for this write up and the work you are doing to improve performance. A couple of thoughts:

* In the old days, linking was I/O bound, so the linker run tended to be the same speed as cat’ing all the input files together. Now that spinning disks are less common, I/O may no longer be the bottleneck.

* How does the link time compare to the Windows system linker?

* Where exactly in the Resolving phase is the bottleneck? Is it the symbol table or the other work? (Is much time spent growing (and rehashing) the DenseMap? If so, would setting a larger initial bucket size help?)

* Regarding Rafael's performance comparison with gold (where lld is about half the speed): the old binutils ld was a cross-platform linker that supported multiple formats. The goal of gold was to have better performance by only supporting ELF. So I would expect gold to be faster than the general-purpose lld.

-Nick

On Mar 11, 2015, at 3:02 PM, Rui Ueyama <ruiu at google.com> wrote:
> [...]
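On Nick's question about growing and rehashing: the generic mitigation is to pre-size the table to roughly the expected number of symbols. A hypothetical sketch of the idea, shown with std::unordered_map for simplicity (the thread is discussing LLVM's DenseMap, and the size estimate here is made up):

#include <string>
#include <unordered_map>

struct Atom;  // stand-in for lld's symbol/atom type

// Reserve up front; a rough guess (e.g. the total number of symbols reported
// by the readers) avoids repeated grow-and-rehash cycles during resolution.
std::unordered_map<std::string, Atom *> makeSymbolTable(size_t expectedSymbols) {
  std::unordered_map<std::string, Atom *> table;
  table.reserve(expectedSymbols);
  return table;
}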
On Sat, Mar 14, 2015 at 2:52 AM, Nick Kledzik <kledzik at apple.com> wrote:
> Rui,
>
> Thanks for this write up and the work you are doing to improve performance. A couple of thoughts:
>
> [...]
>
> * Where exactly in the Resolving phase is the bottleneck? Is it the symbol table or the other work? (Is much time spent growing (and rehashing) the DenseMap? If so, would setting a larger initial bucket size help?)
>
> [...]
>
> -Nick

After my wild and wrong guess from a couple of days ago, I realized it was better to spend some time analyzing code and profiling rather than speculating. The machine where I tried to link clang using lld is an 8-core Xeon, but I was hardly able to use more than two cores. I ran my tests on UFS + a 7200rpm disk (I can provide details if needed). The OS is FreeBSD-11 from a month ago or so.

Flamegraph results (now zoomable): https://people.freebsd.org/~davide/llvm/lld.svg

Nick, to answer your question: DenseMap ops are generally a hotspot for this kind of workload, although grow() operations aren't a particularly big problem (at least in this case). I see them showing up only for addReferencetoSymbol(), and they account for a large part of the overall samples. In order to convince myself of my theory I started tinkering and modified the initial size of the map, but I wasn't able to notice any non-negligible improvement.

What instead seems to be a bigger problem is symbolTable::replacement; DenseMap operations there are about 10% of the overall samples. It seems that in order to find the replacement for a given atom we need to traverse a chain of atoms until the last one is found. This results in O(n) calls to find(), where n is the length of the chain. I will try to investigate and see if we can do something differently.

Please let me know if there's something else that shows up from the graph, and/or if you have any idea on how to improve. Sorry for the output of symbols being a little bit verbose; the collector wasn't written with C++ in mind.

-- Davide
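One hypothetical way to attack the replacement-chain traversal Davide describes is union-find-style path compression, so that repeated lookups flatten the chain as they go. The map and Atom type below are illustrative, not lld's actual SymbolTable.

#include <unordered_map>

struct Atom;

struct ReplacementTable {
  std::unordered_map<const Atom *, const Atom *> replacement;

  // Returns the final replacement of 'a' (or 'a' itself if it was never
  // replaced), and rewrites every link on the way to point directly at the
  // end of the chain, so later queries are (almost) O(1).
  const Atom *resolve(const Atom *a) {
    auto it = replacement.find(a);
    if (it == replacement.end())
      return a;
    const Atom *root = resolve(it->second);  // follow the chain to its end
    it->second = root;                       // path compression
    return root;
  }
};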
> * Regarding Rafael's performance comparison with gold (where lld is
> about half the speed): the old binutils ld was a cross-platform linker
> that supported multiple formats. The goal of gold was to have better
> performance by only supporting ELF. So I would expect gold to be faster
> than the general-purpose lld.

Well, knowing that a linker supports other formats is no consolation for a user waiting for a link to finish :-) Hopefully we can figure out what makes lld slower without impacting its design too much, but if an abstraction has such a high cost we should revisit it.

Cheers,
Rafael
> The biggest difference that shows up is that lld has 1,152 context
> switches, but the cpu utilization is still < 1. Maybe there is just a
> threading bug somewhere?

It was a bug, but in my script: I was pinning the process to a single core, a leftover from benchmarking single-threaded code. Sorry about that.

With that fixed (new script attached), what I got when linking clang with today's build of clang, lld and gold (so not directly comparable to the old numbers) was:

lld:

   2544.422504 task-clock (msec)        # 1.837 CPUs utilized            ( +- 0.13% )
        69,815 context-switches         # 0.027 M/sec                    ( +- 0.10% )
           515 cpu-migrations           # 0.203 K/sec                    ( +- 2.38% )
       188,589 page-faults              # 0.074 M/sec                    ( +- 0.00% )
 7,518,975,892 cycles                   # 2.955 GHz                      ( +- 0.12% )
 4,848,133,388 stalled-cycles-frontend  # 64.48% frontend cycles idle    ( +- 0.18% )
 <not supported> stalled-cycles-backend
 6,383,065,952 instructions             # 0.85 insns per cycle
                                        # 0.76 stalled cycles per insn   ( +- 0.06% )
 1,331,202,027 branches                 # 523.184 M/sec                  ( +- 0.07% )
    30,053,442 branch-misses            # 2.26% of all branches          ( +- 0.04% )

   1.385020712 seconds time elapsed                                      ( +- 0.15% )

gold:

    918.859273 task-clock (msec)        # 0.999 CPUs utilized            ( +- 0.01% )
             0 context-switches         # 0.000 K/sec                    ( +- 41.02% )
             0 cpu-migrations           # 0.000 K/sec
        76,393 page-faults              # 0.083 M/sec
 2,758,439,113 cycles                   # 3.002 GHz                      ( +- 0.01% )
 1,357,937,367 stalled-cycles-frontend  # 49.23% frontend cycles idle    ( +- 0.02% )
 <not supported> stalled-cycles-backend
 3,881,565,068 instructions             # 1.41 insns per cycle
                                        # 0.35 stalled cycles per insn   ( +- 0.00% )
   784,078,474 branches                 # 853.317 M/sec                  ( +- 0.00% )
    13,984,077 branch-misses            # 1.78% of all branches          ( +- 0.01% )

   0.919717669 seconds time elapsed                                      ( +- 0.01% )

gold --threads:

   1300.210314 task-clock (msec)        # 1.523 CPUs utilized            ( +- 0.15% )
        26,913 context-switches         # 0.021 M/sec                    ( +- 0.34% )
         7,910 cpu-migrations           # 0.006 M/sec                    ( +- 0.63% )
        83,507 page-faults              # 0.064 M/sec                    ( +- 0.02% )
 3,842,183,459 cycles                   # 2.955 GHz                      ( +- 0.14% )
 2,273,634,375 stalled-cycles-frontend  # 59.18% frontend cycles idle    ( +- 0.20% )
 <not supported> stalled-cycles-backend
 4,058,907,107 instructions             # 1.06 insns per cycle
                                        # 0.56 stalled cycles per insn   ( +- 0.07% )
   838,791,061 branches                 # 645.120 M/sec                  ( +- 0.06% )
    14,220,563 branch-misses            # 1.70% of all branches          ( +- 0.02% )

   0.853835099 seconds time elapsed                                      ( +- 0.25% )

So much better! The machine has 12 cores, each with 2 hardware threads.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: run.sh
Type: application/x-sh
Size: 5352 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150319/5f2c7cae/attachment.sh>