So I started doing performance analysis again, and we've slowed down quite a bit. My current test is statically linking clang for Linux on Windows. I currently care mostly about Windows performance as that's where we run it. Here's a rough breakdown of time usage (doesn't add up to %100 because of rounding): %8.8 - fs::get_magic from the driver. %0.8 - Reading the files on the command line. %29 - Resolver. This ~%90 of this is reading objects out of archives. This can be parallelized, and I have an outdated patch which does this. %51 - Passes. Mostly the layout pass. And in the layout pass it's mostly due to cache misses. I've already tried parallelizing the sort, it doesn't help much. %9 - Writer. Most of this is in prep work. The actual writing to disk part and applying relocations is very small. %1 - Unaccounted for. I'm going to do some work to solve the get_magic and resolver issue with threads. I think we really need to look into how the layout pass is handled. If the cache effects are bad enough, we may actually need to change to a non-virtual POD based interface for atoms. Meaning that readers fill in atom data at the start, instead of figuring it out at runtime. - Michael Spencer
I happened to start making performance improvements for LLD this week, and have already submitted the following patches for it. My test is to link LLD on Windows, and it has been improved from roughly 10 seconds to 6 seconds this week. r196504: [PECOFF] Handle .lib files as if they are grouped by --{start,end}-group. r196628: Re-submit r195852 with GroupedSectionsPass change. I'm now looking into the layout pass as it takes ~20% link time. It looks to me that the main reason why the pass is so slow is because we do too much things in _compare() function. We might be able to cache _compare()'s result. Or simple parallel_sort might work. I've also noticed that reading from archive files is slow because, as you wrote, it's not multi-threaded. How would you parallelize it? On Sat, Dec 7, 2013 at 10:30 AM, Michael Spencer <bigcheesegs at gmail.com>wrote:> So I started doing performance analysis again, and we've slowed down > quite a bit. My current test is statically linking clang for Linux on > Windows. I currently care mostly about Windows performance as that's > where we run it. > > Here's a rough breakdown of time usage (doesn't add up to %100 because > of rounding): > > %8.8 - fs::get_magic from the driver. > %0.8 - Reading the files on the command line. > %29 - Resolver. This ~%90 of this is reading objects out of archives. > This can be parallelized, and I have an outdated patch which does > this. > %51 - Passes. Mostly the layout pass. And in the layout pass it's > mostly due to cache misses. I've already tried parallelizing the sort, > it doesn't help much. > %9 - Writer. Most of this is in prep work. The actual writing to > disk part and applying relocations is very small. > %1 - Unaccounted for. > > I'm going to do some work to solve the get_magic and resolver issue > with threads. I think we really need to look into how the layout pass > is handled. If the cache effects are bad enough, we may actually need > to change to a non-virtual POD based interface for atoms. Meaning that > readers fill in atom data at the start, instead of figuring it out at > runtime. > > - Michael Spencer >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131207/a5f29483/attachment.html>
On 12/6/2013 7:30 PM, Michael Spencer wrote:> So I started doing performance analysis again, and we've slowed down > quite a bit. My current test is statically linking clang for Linux on > Windows. I currently care mostly about Windows performance as that's > where we run it.While we are at this, I think Memory usage also needs to be measured.> Here's a rough breakdown of time usage (doesn't add up to %100 because > of rounding):I think in addition to this if we have a way to run non-dependent passes to run concurrently (Concurrent PassManager) it would be awesome.> %8.8 - fs::get_magic from the driver.Wow!> %0.8 - Reading the files on the command line. > %29 - Resolver. This ~%90 of this is reading objects out of archives. > This can be parallelized, and I have an outdated patch which does > this.How do you plan to read them in parallel ? The archive member is needed only when a symbol is undefined and that is defined in the archive. If you read objects in advance, this might increase the memory footprint.> %51 - Passes. Mostly the layout pass. And in the layout pass it's > mostly due to cache misses. I've already tried parallelizing the sort, > it doesn't help much.This is because the Ordering pass is done serially. Goes over the follow-on, and builds the preceded by table, and the ingroup reference table. I was thinking on this for a while and thought if we could build the tables in parallel (follow-on,preceded-by,in-group) and merge them serially, it might be faster. Thoughts ?> %9 - Writer. Most of this is in prep work.With Linker scripts I think this might be even more, depending on how we really implement all the complex semantics.> The actual writing to > disk part and applying relocations is very small. > %1 - Unaccounted for. > > I'm going to do some work to solve the get_magic and resolver issue > with threads. I think we really need to look into how the layout pass > is handled. If the cache effects are bad enough, we may actually need > to change to a non-virtual POD based interface for atoms. Meaning that > readers fill in atom data at the start, instead of figuring it out at > runtime.Couldnt follow about the non-virtual POD interface, can you give more info ? Thanks Shankar Easwaran -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation
On Dec 6, 2013, at 5:30 PM, Michael Spencer <bigcheesegs at gmail.com> wrote:> If the cache effects are bad enough, we may actually need > to change to a non-virtual POD based interface for atoms. Meaning that > readers fill in atom data at the start, instead of figuring it out at > runtime.Taking scope() as an example, there are two layers of cost: 1) The virtual call to scope() (as opposed to an inlined fetch of an ivar) 2) The work the implementation of scope() does. Which are you referring to? Changing the Atom model to have a base POD, would hurt the native file format. It is based around bulk instantiated cheap atoms objects that just have a pointer into the atom info in the mapped file. If accessing atom attributes (like scope) really is called alot, perhaps we can redo algorithms to call them less? -Nick