thr3ads.net - llvm dev - [LLVMdev] On LLD performance [Mar 2015]


Davide Italiano

2015-Mar-17 05:52 UTC

[LLVMdev] On LLD performance

On Mon, Mar 16, 2015 at 1:54 AM, Davide Italiano <davide at freebsd.org> wrote:
> Shankar's parallel for per se didn't introduce any performance benefit
> (or regression).
> If the change I propose is safe, I would like to see Shankar's change
> in (and this on top of it).
> I have other related changes coming next, but I would like to tackle
> them one at a time.

Here's an update.

After http://reviews.llvm.org/D8372 , I updated the profiling data.

https://people.freebsd.org/~davide/llvm/lld-03162015.svg
It seems now 85% of CPU time is spent inside
FileArchive::buildTableOfContents().
In particular, 35% of the samples are spent inserting into
unordered_map, so there's maybe something we can do differently there
(e.g., Rui's proposal of a concurrent map doesn't seem that bad).
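
For readers unfamiliar with the idea, here is a minimal sketch of one common way to build a concurrent map: lock striping over several ordinary unordered_maps. This is only an illustration under my own assumptions. The class name ShardedMap, its methods, and the shard count are hypothetical, and this is not necessarily what Rui proposed or what LLD uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical lock-striped map: each key hashes to one of NumShards shards,
// and each shard has its own mutex, so threads that insert different keys
// usually take different locks.
template <typename V, std::size_t NumShards = 16>
class ShardedMap {
public:
  // Inserts (key, value) if the key is absent. Returns true on insertion.
  bool insert(const std::string &key, V value) {
    Shard &s = shardFor(key);
    std::lock_guard<std::mutex> lock(s.mutex);
    return s.map.emplace(key, std::move(value)).second;
  }

  // Copies the value for key into out and returns true if the key exists.
  bool lookup(const std::string &key, V &out) {
    Shard &s = shardFor(key);
    std::lock_guard<std::mutex> lock(s.mutex);
    auto it = s.map.find(key);
    if (it == s.map.end())
      return false;
    out = it->second;
    return true;
  }

  // Total number of entries; locks each shard in turn.
  std::size_t size() {
    std::size_t n = 0;
    for (Shard &s : shards_) {
      std::lock_guard<std::mutex> lock(s.mutex);
      n += s.map.size();
    }
    return n;
  }

private:
  struct Shard {
    std::mutex mutex;
    std::unordered_map<std::string, V> map;
  };

  Shard &shardFor(const std::string &key) {
    return shards_[std::hash<std::string>{}(key) % NumShards];
  }

  std::array<Shard, NumShards> shards_;
};
```

Whether this helps depends on how much of the 35% is hashing and allocation rather than contention; striping only removes the need for a single global lock.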

Thanks,

-- 
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

David Blaikie

2015-Mar-17 06:00 UTC

On Mon, Mar 16, 2015 at 10:52 PM, Davide Italiano <davide at freebsd.org> wrote:
> Here's an update.
>
> After http://reviews.llvm.org/D8372 , I updated the profiling data.
>
> https://people.freebsd.org/~davide/llvm/lld-03162015.svg
> It seems now 85% of CPU time is spent inside
> FileArchive::buildTableOfContents().
> In particular, 35% of the samples are spent inserting into
> unordered_map, so there's maybe something we can do differently there
> (e.g., Rui's proposal of a concurrent map doesn't seem that bad).

Anyone tried a DenseMap instead of an unordered_map? If you need pointer
validity to the elements, a DenseMap with unique_ptrs rather than direct
values could be an option. Chandler's usual argument here is that walking
the map is cheap with high locality (as in a DenseMap) even if the nodes
themselves involve indirection. Could be worth an experiment.
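
To make the suggestion concrete, here is a small self-contained sketch of the underlying idea: a flat, open-addressing table whose values are boxed in unique_ptr, so pointers to values stay valid when the table grows. This is a stand-in written against the standard library only. FlatBoxedMap and its methods are hypothetical names, it is not llvm::DenseMap, and it skips deletion entirely. With LLVM available, the equivalent experiment would presumably use a DenseMap whose mapped type is a std::unique_ptr.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the idea only (NOT llvm::DenseMap): slots live in one
// contiguous vector, and each value is boxed so its address never changes.
template <typename V>
class FlatBoxedMap {
public:
  FlatBoxedMap() : slots_(8) {}

  // Returns a stable pointer to the value for key, inserting a
  // default-constructed value if the key is new.
  V *getOrCreate(const std::string &key) {
    if ((size_ + 1) * 4 > slots_.size() * 3) // keep load factor <= 3/4
      grow();
    Slot &s = slots_[probe(slots_, key)];
    if (!s.value) {
      s.key = key;
      s.value = std::make_unique<V>();
      ++size_;
    }
    return s.value.get();
  }

  // Returns the value for key, or nullptr if the key is absent.
  V *find(const std::string &key) {
    return slots_[probe(slots_, key)].value.get();
  }

  std::size_t size() const { return size_; }

private:
  struct Slot {
    std::string key;
    std::unique_ptr<V> value; // null means the slot is empty
  };

  // Linear probing: index of the slot holding key, or of the first empty
  // slot on its probe sequence.
  static std::size_t probe(const std::vector<Slot> &slots,
                           const std::string &key) {
    std::size_t i = std::hash<std::string>{}(key) % slots.size();
    while (slots[i].value && slots[i].key != key)
      i = (i + 1) % slots.size();
    return i;
  }

  // Doubles the table. Only the unique_ptrs move; the boxed values stay put.
  void grow() {
    std::vector<Slot> bigger(slots_.size() * 2);
    for (Slot &s : slots_)
      if (s.value)
        bigger[probe(bigger, s.key)] = std::move(s);
    slots_ = std::move(bigger);
  }

  std::vector<Slot> slots_;
  std::size_t size_ = 0;
};
```

The trade-off is one extra allocation per element in exchange for cheap probing over a dense array, which is the locality argument made above.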

Sean Silva

2015-Mar-17 06:17 UTC

On Mon, Mar 16, 2015 at 10:52 PM, Davide Italiano <davide at freebsd.org> wrote:
> Here's an update.
>
> After http://reviews.llvm.org/D8372 , I updated the profiling data.
>
> https://people.freebsd.org/~davide/llvm/lld-03162015.svg
> It seems now 85% of CPU time is spent inside
> FileArchive::buildTableOfContents().

I'm rather amazed that that patch changed the total CPU time. Just doing
the work in parallel shouldn't reduce the total CPU time spent on the task.
A reduction in CPU time would happen though if parallelizing it increased
the single-threaded performance of the tasks being done in parallel.
Perhaps using multiple cores means we are using multiple caches, so each
thread is getting much better single-threaded performance due to reduced
memory bottlenecking?

-- Sean Silva

Davide Italiano

2015-Mar-17 06:39 UTC

On Tue, Mar 17, 2015 at 7:17 AM, Sean Silva <chisophugis at gmail.com> wrote:
> I'm rather amazed that that patch changed the total CPU time. Just doing the
> work in parallel shouldn't reduce the total CPU time spent on the task. A
> reduction in CPU time would happen though if parallelizing it increased the
> single-threaded performance of the tasks being done in parallel. Perhaps
> using multiple cores means we are using multiple caches, so each thread is
> getting much better single-threaded performance due to reduced memory
> bottlenecking?

David, thanks for the input. I'll try DenseMap tomorrow and report results.
Sean, I personally was amazed by that too. I cannot rule out sampling
errors in hwpmc, so I'll try to repeat the profiling and/or use another
profiler to see if I can confirm the results.
About your other answer, I guess that would require a more
fine-grained analysis that includes memory bandwidth, cache misses,
etc. I'll try to get to it later this week or over the weekend. For
now, I'm just focusing on CPU profiling.

Thanks,

-- 
Davide

Davide Italiano

2015-Mar-18 04:36 UTC

On Mon, Mar 16, 2015 at 11:00 PM, David Blaikie <dblaikie at gmail.com> wrote:
> Anyone tried a DenseMap instead of an unordered_map? If you need pointer
> validity to the elements, a DenseMap with unique_ptrs rather than direct
> values could be an option. Chandler's usual argument here is that walking
> the map is cheap with high locality (as in a DenseMap) even if the nodes
> themselves involve indirection. Could be worth an experiment.

>
I did now. It actually makes things slower for the aforementioned
workload (linking clang). It was worth trying though.

Patch, in case somebody wants to try at home:
https://people.freebsd.org/~davide/llvm/densemap_membermap.diff

Patched:
real    1m27.849s  user    2m47.373s  sys     0m16.370s
real    1m29.583s  user    2m47.771s  sys     0m16.816s
real    1m25.956s  user    2m43.397s  sys     0m15.254s
real    1m29.380s  user    2m47.618s  sys     0m15.386s
real    1m25.426s  user    2m43.388s  sys     0m16.961s

Unpatched:
real    1m26.872s  user    2m46.999s  sys     0m16.540s
real    1m28.187s  user    2m47.084s  sys     0m17.149s
real    1m24.814s  user    2m43.311s  sys     0m16.979s
real    1m25.011s  user    2m43.184s  sys     0m16.975s
real    1m25.536s  user    2m44.577s  sys     0m16.784s
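
To put a number on "slower", the sketch below averages the "real" times from the two sets of runs above, converted to seconds. The helper names are made up for illustration; only the sample values come from the measurements.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a list of samples (in seconds).
inline double meanSeconds(const std::vector<double> &samples) {
  return std::accumulate(samples.begin(), samples.end(), 0.0) /
         samples.size();
}

// "real" wall-clock times of the five runs of each build, in seconds.
inline std::vector<double> patchedReal() {
  return {87.849, 89.583, 85.956, 89.380, 85.426};
}
inline std::vector<double> unpatchedReal() {
  return {86.872, 88.187, 84.814, 85.011, 85.536};
}

// Slowdown of the patched build as a fraction of the unpatched mean.
inline double relativeSlowdown() {
  double patched = meanSeconds(patchedReal());
  double unpatched = meanSeconds(unpatchedReal());
  return (patched - unpatched) / unpatched;
}
```

The means come out at roughly 87.64s patched versus 86.08s unpatched, i.e. about 1.8% slower in wall-clock time. The two ranges overlap (85.4s to 89.6s versus 84.8s to 88.2s), so with only five runs each the difference should be read with some caution.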

-- 
Davide

Rafael Espíndola

2015-Mar-19 17:13 UTC

> Here's an update.
>
> After http://reviews.llvm.org/D8372 , I updated the profiling data.
>
> https://people.freebsd.org/~davide/llvm/lld-03162015.svg
> It seems now 85% of CPU time is spent inside
> FileArchive::buildTableOfContents().
> In particular, 35% of the samples are spent inserting into
> unordered_map, so there's maybe something we can do differently there
> (e.g., Rui's proposal of a concurrent map doesn't seem that bad).
>
Why do we even need to build the table from name to member?

Can't we just walk "archive->symbols()" and check for each symbol if
it is needed by the current link status?

Cheers,
Rafael

Rui Ueyama

2015-Mar-20 21:42 UTC

Rafael,

Your latest benchmark results look great. LLD took 1.38 seconds, whereas
gold --threads takes 0.85 seconds. It needs to be faster, but that's not too bad.

On Thu, Mar 19, 2015 at 10:13 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
> Why do we even need to build the table from name to member?
>
> Can't we just walk "archive->symbols()" and check for each symbol if
> it is needed by the current link status?

Are you suggesting we do linear search instead of hash table lookup each
time ArchiveFile::find(StringRef symbolName) is called?
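
To make the two alternatives under discussion concrete, here is a rough sketch using simplified, hypothetical types (ArchiveSymbol, IndexedArchive, and membersNeeded are not LLD's real classes). Strategy A builds a per-archive name-to-member table and answers each find() with a hash lookup. Strategy B is one possible reading of Rafael's suggestion: skip the per-archive table, walk the archive's symbol list once, and probe the set of symbols the link currently needs.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified archive symbol-table entry.
struct ArchiveSymbol {
  std::string name;
  int member; // index of the archive member that defines the symbol
};

// Strategy A: build a name -> member table up front, then answer each
// query with one hash lookup. find() returns -1 for unknown symbols.
struct IndexedArchive {
  explicit IndexedArchive(const std::vector<ArchiveSymbol> &symbols) {
    table.reserve(symbols.size());
    for (const ArchiveSymbol &s : symbols)
      table.emplace(s.name, s.member);
  }
  int find(const std::string &name) const {
    auto it = table.find(name);
    return it == table.end() ? -1 : it->second;
  }
  std::unordered_map<std::string, int> table;
};

// Strategy B: no per-archive table. Walk the symbol list once and collect
// the members that define a currently undefined symbol.
inline std::vector<int>
membersNeeded(const std::vector<ArchiveSymbol> &symbols,
              const std::unordered_set<std::string> &undefined) {
  std::vector<int> members;
  for (const ArchiveSymbol &s : symbols)
    if (undefined.count(s.name))
      members.push_back(s.member);
  return members;
}
```

Under this reading, strategy B still does a hash lookup per archive symbol, but into a set the linker already maintains, so no insertions are paid per archive. The cost is that the walk may have to be repeated when loading a member introduces new undefined symbols, whereas a linear scan on every individual find() call would be a different and likely slower design.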