So I started doing performance analysis again, and we've slowed down
quite a bit. My current test is statically linking clang for Linux on
Windows. I currently care mostly about Windows performance, as that's
where we run it.

Here's a rough breakdown of time usage (it doesn't add up to 100%
because of rounding):

8.8% - fs::get_magic from the driver.
0.8% - Reading the files on the command line.
29% - Resolver. ~90% of this is reading objects out of archives. This
can be parallelized, and I have an outdated patch which does this.
51% - Passes. Mostly the layout pass, and in the layout pass it's
mostly due to cache misses. I've already tried parallelizing the
sort; it doesn't help much.
9% - Writer. Most of this is in prep work; the actual writing to disk
and applying of relocations is very small.
1% - Unaccounted for.

I'm going to do some work to solve the get_magic and resolver issues
with threads. I think we really need to look into how the layout pass
is handled. If the cache effects are bad enough, we may actually need
to change to a non-virtual, POD-based interface for atoms, meaning
that readers fill in atom data at the start instead of figuring it
out at runtime.
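Roughly what I have in mind (a sketch only; the names and fields are
illustrative, not a worked-out design):

#include <cstdint>
#include <vector>

enum class Scope : uint8_t { translationUnit, linkageUnit, global };

struct PODAtom {
  const char *name;   // interned string, filled in by the reader
  uint64_t ordinal;   // assigned up front
  uint64_t size;
  uint32_t alignment;
  Scope scope;        // a plain load instead of a virtual call
};

// The layout pass would then walk a dense array of these, which is
// far friendlier to the cache than chasing a vtable per atom:
static void assignOrdinals(std::vector<PODAtom> &atoms) {
  uint64_t next = 0;
  for (PODAtom &a : atoms)
    a.ordinal = next++;
}

int main() {
  std::vector<PODAtom> atoms(4); // value-initialized PODs
  assignOrdinals(atoms);
  return 0;
}

- Michael Spencer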
I happened to start making performance improvements for LLD this week,
and have already submitted the following patches. My test is to link
LLD itself on Windows, and its link time has improved from roughly 10
seconds to 6 seconds this week.

r196504: [PECOFF] Handle .lib files as if they are grouped by
--{start,end}-group.
r196628: Re-submit r195852 with GroupedSectionsPass change.

I'm now looking into the layout pass, as it takes ~20% of the link
time. It looks to me that the main reason the pass is so slow is that
we do too much work in the _compare() function. We might be able to
cache _compare()'s results, or a simple parallel_sort might work; a
rough sketch of the caching idea is at the end of this mail.

On Sat, Dec 7, 2013 at 10:30 AM, Michael Spencer <bigcheesegs at gmail.com> wrote:
> 29% - Resolver. ~90% of this is reading objects out of archives. This
> can be parallelized, and I have an outdated patch which does this.

I've also noticed that reading from archive files is slow because, as
you wrote, it's not multi-threaded. How would you parallelize it?
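Here's the kind of caching I mean: compute a sort key per atom once,
then sort on the keys, so the comparator no longer re-derives
everything on every call. This is a hypothetical sketch, not the
actual LayoutPass code; all names are made up.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Atom {
  uint64_t fileOrdinal;
  uint64_t atomOrdinal;
};

// Everything _compare() would otherwise re-derive on each call is
// computed once per atom and stored here.
struct SortKey {
  uint64_t primary;   // e.g. the follow-on chain root's ordinal
  uint64_t secondary; // e.g. the position within the chain
  const Atom *atom;
};

static SortKey makeKey(const Atom &a) {
  // The real pass would chase follow-on chains here, once per atom.
  return SortKey{a.fileOrdinal, a.atomOrdinal, &a};
}

static void sortAtoms(std::vector<Atom> &atoms) {
  std::vector<SortKey> keys;
  keys.reserve(atoms.size());
  for (const Atom &a : atoms)
    keys.push_back(makeKey(a));
  // Comparisons are now two integer compares, so std::sort (or a
  // parallel sort) spends no time re-deriving layout information.
  std::sort(keys.begin(), keys.end(),
            [](const SortKey &l, const SortKey &r) {
              return l.primary != r.primary ? l.primary < r.primary
                                            : l.secondary < r.secondary;
            });
  std::vector<Atom> sorted;
  sorted.reserve(atoms.size());
  for (const SortKey &k : keys)
    sorted.push_back(*k.atom);
  atoms = std::move(sorted);
}

int main() {
  std::vector<Atom> atoms = {{2, 0}, {1, 1}, {1, 0}};
  sortAtoms(atoms);
  return atoms[0].fileOrdinal == 1 ? 0 : 1;
}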
On 12/6/2013 7:30 PM, Michael Spencer wrote:
> So I started doing performance analysis again, and we've slowed down
> quite a bit. My current test is statically linking clang for Linux
> on Windows.

While we are at this, I think memory usage also needs to be measured.

> Here's a rough breakdown of time usage (it doesn't add up to 100%
> because of rounding):

In addition to this, if we had a way to run non-dependent passes
concurrently (a concurrent pass manager), it would be awesome.

> 8.8% - fs::get_magic from the driver.

Wow!

> 29% - Resolver. ~90% of this is reading objects out of archives. This
> can be parallelized, and I have an outdated patch which does this.

How do you plan to read them in parallel? An archive member is needed
only when a symbol is undefined and is defined in that archive. If you
read objects in advance, this might increase the memory footprint.

> 51% - Passes. Mostly the layout pass, and in the layout pass it's
> mostly due to cache misses. I've already tried parallelizing the
> sort; it doesn't help much.

This is because the ordering pass runs serially: it goes over the
follow-on references and builds the preceded-by table and the in-group
reference table. I was thinking about this for a while and thought
that if we could build the three tables (follow-on, preceded-by,
in-group) in parallel and merge them serially, it might be faster; see
the sketch at the end of this mail. Thoughts?

> 9% - Writer. Most of this is in prep work.

With linker scripts I think this might be even more, depending on how
we implement all of their complex semantics.

> If the cache effects are bad enough, we may actually need to change
> to a non-virtual, POD-based interface for atoms, meaning that readers
> fill in atom data at the start instead of figuring it out at runtime.

I couldn't follow the part about the non-virtual POD interface; can
you give more info?
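The parallel table-building sketch mentioned above. This is
hypothetical code, assuming the three walks really are independent
(today preceded-by is derived from follow-on, so that would need
untangling first); std::async is used for brevity, and a toy
successor map stands in for the real reference walks.

#include <cstdint>
#include <future>
#include <unordered_map>
#include <vector>

struct Atom {
  uint64_t id;
};

using Table = std::unordered_map<const Atom *, const Atom *>;

static Table buildFollowOn(const std::vector<Atom> &atoms) {
  Table t;
  // Stand-in for walking layout-after references: map each atom to
  // its successor.
  for (size_t i = 0; i + 1 < atoms.size(); ++i)
    t[&atoms[i]] = &atoms[i + 1];
  return t;
}

// The real versions would walk their own reference kinds; elided.
static Table buildPrecededBy(const std::vector<Atom> &) { return Table(); }
static Table buildInGroup(const std::vector<Atom> &) { return Table(); }

int main() {
  std::vector<Atom> atoms(1000);
  // Each build only reads the atom list, so the three can run
  // concurrently...
  auto followOn = std::async(std::launch::async, buildFollowOn, std::cref(atoms));
  auto precededBy = std::async(std::launch::async, buildPrecededBy, std::cref(atoms));
  auto inGroup = std::async(std::launch::async, buildInGroup, std::cref(atoms));
  // ...and the merge back into the ordering pass stays serial.
  Table f = followOn.get();
  Table p = precededBy.get();
  Table g = inGroup.get();
  return f.size() + p.size() + g.size() > 0 ? 0 : 1;
}

Thanks,

Shankar Easwaran

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation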
On Dec 6, 2013, at 5:30 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:
> If the cache effects are bad enough, we may actually need to change
> to a non-virtual, POD-based interface for atoms, meaning that readers
> fill in atom data at the start instead of figuring it out at runtime.

Taking scope() as an example, there are two layers of cost:

1) The virtual call to scope() itself (as opposed to an inlined fetch
of an ivar).
2) The work the implementation of scope() does.

Which are you referring to? The sketch at the end of this mail shows
the distinction I mean.

Changing the Atom model to have a base POD would hurt the native file
format. It is based around bulk-instantiated, cheap atom objects that
just have a pointer into the atom info in the mapped file.

If accessing atom attributes (like scope()) really is called a lot,
perhaps we can redo the algorithms to call them less?
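To make the two layers concrete (a boiled-down caricature of the atom
model, not lld's real classes; the decode is made up):

#include <cstdint>

enum Scope : uint8_t { scopeTranslationUnit, scopeLinkageUnit, scopeGlobal };

// Layer 1: the virtual dispatch itself, which also blocks inlining.
struct DefinedAtom {
  virtual ~DefinedAtom() {}
  virtual Scope scope() const = 0;
};

// Layer 2: whatever the override does, e.g. decoding a raw symbol
// table field on every single call.
struct ReaderAtom : DefinedAtom {
  uint8_t rawBinding;
  Scope scope() const override {
    return rawBinding == 0 ? scopeGlobal : scopeLinkageUnit;
  }
};

// A POD base would replace both layers with one inlinable load,
// because the reader decodes the field once at file-read time.
struct AtomPOD {
  Scope scope;
};

int main() {
  ReaderAtom a;
  a.rawBinding = 0;
  DefinedAtom *p = &a;
  Scope s1 = p->scope(); // virtual call + per-call decode
  AtomPOD pod = {scopeGlobal};
  Scope s2 = pod.scope;  // single load
  return s1 == s2 ? 0 : 1;
}

-Nick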