On Sun, Mar 15, 2015 at 12:36 AM, Davide Italiano <davide at freebsd.org> wrote:
> On Sat, Mar 14, 2015 at 2:52 AM, Nick Kledzik <kledzik at apple.com> wrote:
>> Rui,
>>
>> Thanks for this write-up and the work you are doing to improve performance.
>> A couple of thoughts:
>>
>> * In the old days, linking was I/O bound, so the linker run tended to be the
>>   same speed as cat’ing all the input files together. Now that spinning disks
>>   are less common, I/O may no longer be the bottleneck.
>>
>> * How does the link time compare to the Windows system linker?
>>
>> * Where exactly in the Resolving phase is the bottleneck? Is it the symbol
>>   table or the other work? (Is much time spent growing (and rehashing) the
>>   DenseMap? If so, would setting a larger initial bucket size help?)
>>
>> * Regarding Rafael’s performance comparison with gold (where lld is about
>>   half the speed): the old binutils ld was a cross-platform linker that
>>   supported multiple formats. The goal of gold was to have better performance
>>   by only supporting ELF. So I would expect gold to be faster than the
>>   general-purpose lld.
>>
>>
>> -Nick
>>
>
> After my wild and wrong guess from a couple of days ago, I realized it
> was better to spend some time analyzing code and profiling rather than
> speculating.
> The machine where I tried to link clang using lld is an 8-core Xeon,
> but I was hardly able to use more than two cores.
> I ran my tests on UFS + a 7200rpm disk (I can provide details if
> needed). The OS is FreeBSD 11 from about a month ago.
>
> Flamegraph results (now zoomable):
> https://people.freebsd.org/~davide/llvm/lld.svg
>
> Nick, to answer your question: DenseMap operations are generally a
> hotspot for this kind of workload, although grow() operations aren't a
> particularly big problem (at least in this case). I see them showing
> up only under addReferencetoSymbol(), and they account for a large
> part of the overall samples. To convince myself of my theory I started
> tinkering and modified the initial size of the map, but I wasn't able
> to notice any non-negligible improvement.
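To expand on what I mean by "modified the initial size": the idea is just to
pre-size the map so that the expected number of insertions never triggers
grow()/rehash. A minimal sketch, with made-up key/value types and a made-up
size estimate rather than the actual lld symbol table code:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/StringRef.h"

void buildTable(llvm::ArrayRef<llvm::StringRef> names) {
  // Request more capacity than the expected element count up front; DenseMap
  // keeps its table less than 3/4 full, so this should avoid rehashing while
  // the names are inserted.
  llvm::DenseMap<llvm::StringRef, unsigned> table(
      static_cast<unsigned>(names.size() * 2));
  unsigned index = 0;
  for (llvm::StringRef name : names)
    table[name] = index++;
}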
>
> What instead seems to be a bigger problem is SymbolTable::replacement,
> where DenseMap operations are about 10% of the overall samples. It
> seems that in order to find the replacement for a given atom we need
> to traverse a chain of atoms until the last one is found. This results
> in O(n) calls to find(), where n is the length of the chain. I will
> try to investigate and see whether we can do something differently.
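To make the chain walking concrete, here is a sketch of the pattern I'm
describing (an illustration, not the actual SymbolTable code), together with
one possible direction, union-find-style path compression, that I haven't
measured at all:

#include "llvm/ADT/DenseMap.h"

struct Atom;  // opaque here, only pointers are used

// Maps a replaced atom to the atom that replaced it; replacements can chain.
static llvm::DenseMap<const Atom *, const Atom *> replacedAtoms;

// Follows the chain a -> b -> c ... until an atom with no further replacement
// is found. Each hop is one find(), so a chain of length n costs O(n) lookups.
const Atom *replacement(const Atom *atom) {
  auto it = replacedAtoms.find(atom);
  while (it != replacedAtoms.end()) {
    atom = it->second;
    it = replacedAtoms.find(atom);
  }
  return atom;
}

// Untested idea: after resolving a chain once, point every atom on it directly
// at the final atom, so subsequent queries become a single lookup.
const Atom *replacementCompressed(const Atom *atom) {
  const Atom *last = replacement(atom);
  while (atom != last) {
    auto it = replacedAtoms.find(atom);
    const Atom *next = it->second;
    it->second = last;
    atom = next;
  }
  return last;
}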
>
> Please let me know if there's anything else that shows up from the
> graph, and/or if you have any ideas on how to improve things. Sorry
> the symbol output is a little verbose; the collector wasn't written
> with C++ in mind.
>
> --
> Davide
The following patch:
Index: lib/Core/Resolver.cpp
===================================================================
--- lib/Core/Resolver.cpp	(revision 232330)
+++ lib/Core/Resolver.cpp	(working copy)
@@ -341,7 +341,8 @@
 // to the new defined atom
 void Resolver::updateReferences() {
   ScopedTask task(getDefaultDomain(), "updateReferences");
-  for (const Atom *atom : _atoms) {
+  parallel_for_each(_atoms.begin(), _atoms.end(),
+                    [&](const Atom *atom) {
     if (const DefinedAtom *defAtom = dyn_cast<DefinedAtom>(atom)) {
       for (const Reference *ref : *defAtom) {
         // A reference of type kindAssociate shouldn't be updated.
@@ -358,7 +359,7 @@
          const_cast<Reference *>(ref)->setTarget(newTarget);
       }
     }
-  }
+  });
 }
built on top of Shankar's parallel for, shaves ~20 seconds off linking
clang on my machine, making the overall link 20% faster.
Results (unpatched):
https://people.freebsd.org/~davide/llvm/lld_nonparallel_updaterel.txt
Results (patched):
https://people.freebsd.org/~davide/llvm/lld_parallel_updaterel.txt
Shankar's parallel for by itself didn't introduce any performance
benefit (or regression).
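For reference, the interface I'm relying on is the usual (begin, end,
callable) shape visible in the diff above. A throwaway stand-in built on
std::async, just to show the call shape I'm assuming, not Shankar's actual
implementation, would look roughly like this:

#include <algorithm>
#include <future>
#include <iterator>
#include <thread>
#include <vector>

// Splits [begin, end) into one chunk per hardware thread, runs each chunk on
// a std::async task, and waits for all of them before returning.
template <class Iterator, class Func>
void parallel_for_each(Iterator begin, Iterator end, Func fn) {
  size_t workers = std::max(1u, std::thread::hardware_concurrency());
  size_t total = std::distance(begin, end);
  size_t chunk = (total + workers - 1) / workers;
  std::vector<std::future<void>> tasks;
  while (begin != end) {
    size_t left = static_cast<size_t>(std::distance(begin, end));
    Iterator next = std::next(begin, std::min(chunk, left));
    tasks.push_back(std::async(std::launch::async,
                               [=] { std::for_each(begin, next, fn); }));
    begin = next;
  }
  for (auto &task : tasks)
    task.get();  // wait for completion and propagate any exception
}

The reason I think the loop in updateReferences() is a candidate for this is
that each iteration only writes through references owned by its own atom;
whether everything it reads is safe to touch concurrently is what I'd like a
second opinion on.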
If the change I propose is safe, I would like to see Shankar's change
go in (and this one on top of it).
I have other related changes coming next, but I would like to tackle
them one at a time.
Thanks,
--
Davide
"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare