Sean Silva via llvm-dev
2017-Mar-01 22:31 UTC
[llvm-dev] [lld] We call SymbolBody::getVA redundantly a lot...
On Tue, Feb 28, 2017 at 11:39 PM, Rui Ueyama <ruiu at google.com> wrote:

> I also did a quick profiling a few months ago and noticed, just like you,
> that scanRelocations consumes a fairly large percentage of overall
> execution time. That caught my attention because at the time I was
> looking for a place that I could parallelize.
>
> scanRelocations is not parallelizable as-is because it does some global
> operations as side effects, such as adding new symbols to GOT/PLT. My
> idea is to split the function into two halves. In the first half, we scan
> relocations and mark some symbols in parallel. In the second half, we add
> them to GOT/PLT.

Won't that require scanning all symbols in the second pass? I think that
we want to do O(number of GOT/PLT entries) work (and cache footprint) in
the second pass instead of O(symbols).

> I think it's doable, although I haven't actually tried it yet. Relocation
> processing is one of the most complex things in LLD, at least partly due
> to the nature of the ELF format, and it is not easy to hack that code to
> try some new ideas. But it should be worth a try.

Yes, I noticed too that this code has a lot of essential complexity.
However, it looks like it has grown organically, so there are likely
opportunities to simplify it.

> I believe it has the potential to make LLD 10% or more faster, as
> scanRelocations currently takes more than 20% of total execution time.

Wasn't there also the idea of merging this scan into the final relocation
processing?

In the code, it says:

```
// The reason we have to do this early scan is as follows
// * To mmap the output file, we need to know the size
// * For that, we need to know how many dynamic relocs we will have.
```

But I think you can resize a memory mapping (e.g. with ftruncate) to add
data at the end. Or conservatively overallocate and then truncate at the
end (pages we don't touch won't be allocated, so the overallocation
shouldn't cost too much).

-- Sean Silva
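(For concreteness, a minimal sketch of the overallocate-then-truncate idea,
assuming a POSIX system; the output name and the size constant are made-up
illustrations, not lld's actual output-buffer code.)

```
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

int main() {
  // Hypothetical pessimistic upper bound on the output size.
  const uint64_t UpperBound = uint64_t(1) << 30;

  int FD = open("a.out", O_RDWR | O_CREAT | O_TRUNC, 0755);
  if (FD < 0)
    return 1;
  // Grow the file to the worst case up front. On most filesystems this
  // creates a sparse file, so unwritten blocks are never allocated.
  if (ftruncate(FD, UpperBound) != 0)
    return 1;
  void *M = mmap(nullptr, UpperBound, PROT_READ | PROT_WRITE, MAP_SHARED,
                 FD, 0);
  if (M == MAP_FAILED)
    return 1;
  auto *Buf = static_cast<unsigned char *>(M);

  // ... write sections into Buf; pages that are never touched are never
  // materialized, so the overallocation costs little ...
  Buf[0] = 0x7f;             // e.g. first byte of the ELF magic
  uint64_t BytesUsed = 4096; // in reality, computed while writing

  munmap(M, UpperBound);
  ftruncate(FD, BytesUsed); // shrink to the size actually used
  close(FD);
  return 0;
}
```

(Rui's reply below explains why this is harder in practice when there are
several variable-sized regions.)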
> On Tue, Feb 28, 2017 at 11:15 PM, Sean Silva <chisophugis at gmail.com>
> wrote:
>
>> On Tue, Feb 28, 2017 at 12:10 PM, Rui Ueyama <ruiu at google.com> wrote:
>>
>>> I don't think getVA is particularly expensive, and if it is not
>>> expensive I wouldn't cache its result. Did you experiment with caching
>>> getVA results? I think you can do that fairly easily by adding a
>>> std::atomic_uint64_t to SymbolBody and using it as a cache for getVA.
>>
>> You're right, caching it didn't have any significant effect (though I
>> wasn't measuring super precisely). I think I was remembering the profile
>> wrong. I remember measuring that we had some very bad cache/TLB misses
>> here, but I guess those aren't too important on the current profile (at
>> least, not on this test case; the locality of these accesses depends a
>> lot on the test case).
>>
>> Also, it seems like our performance is a lot more stable w.r.t.
>> InputSectionBase::relocate than it used to be (or maybe my current CPU
>> is just less affected; it's a desktop-class processor instead of a
>> Xeon).
>>
>> I took a quick profile of this workload and it looks like it is:
>>
>> 65% in the writer ("backend")
>> 30% in the "frontend" (everything called by SymbolTable::addFile)
>>
>> The frontend work seems to be largely dominated by ObjectFile::parse
>> (as you would expect), though there is about 10% of total runtime
>> slipping through the cracks here in various other "frontend" tasks.
>>
>> The backend work is split about evenly between scanRelocations and
>> OutputSection::writeTo. InputSectionBase::relocate is only about 10% of
>> the total runtime (part of OutputSection::writeTo).
>>
>> Some slightly cleaned up `perf report` output with some more details:
>> https://reviews.llvm.org/P7972
>>
>> So it seems like overall, the profile is basically split three ways
>> (about 30% each):
>> - frontend (reading input files and building the symbol table and
>>   associated data structures)
>> - scanRelocations (initial pass over relocations)
>> - writeTo (mostly IO and InputSectionBase::relocate)
>>
>> -- Sean Silva
>>
>>> On Tue, Feb 28, 2017 at 4:19 AM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>> tl;dr: it looks like we call SymbolBody::getVA about 5x more times
>>>> than we need to.
>>>>
>>>> Should we cache it or something? (Careful with threads.)
>>>>
>>>> Here is a link to a PDF of my Mathematica notebook which has all the
>>>> details of my investigation:
>>>> https://drive.google.com/open?id=0B8v10qJ6EXRxVDQ3YnZtUlFtZ1k
>>>>
>>>> There seem to be two main regimes in which we redundantly call
>>>> SymbolBody::getVA:
>>>>
>>>> 1. Most redundant calls on the same symbol (about 80%) happen in
>>>> quick succession with few intervening calls for other symbols. Most
>>>> likely we are processing a bunch of relocations right next to each
>>>> other that all refer to the same symbol (or a small set of symbols),
>>>> e.g. within a TU.
>>>>
>>>> 2. There is a long-ish tail (about 20% of calls to SymbolBody::getVA)
>>>> of calls that happen at a long temporal distance from any previous
>>>> call to SymbolBody::getVA on the same symbol. I don't know off the
>>>> top of my head where these are coming from, but it doesn't sound like
>>>> relocations. A quick grepping shows a bunch of source locations that
>>>> match getVA, so it's hard to see at a glance. Any ideas where these
>>>> other calls are coming from?
>>>>
>>>> The particular link I was looking at was a release-without-debug-info
>>>> link, using `-O0 --no-gc-sections --no-threads`. The particular test
>>>> case is LLD itself.
>>>>
>>>> -- Sean Silva
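(Since the std::atomic_uint64_t cache comes up a couple of times in this
thread, here is a minimal sketch of that idea; SymbolBody and computeVA
below are simplified stand-ins rather than lld's real definitions, and a
VA of 0 is assumed to be unused.)

```
#include <atomic>
#include <cstdint>

struct SymbolBody {
  // 0 doubles as "not computed yet"; a legitimate VA of 0 would need a
  // separate flag.
  std::atomic<uint64_t> CachedVA{0};

  uint64_t computeVA() const {
    return 0x400000; // placeholder for the real base + offset computation
  }

  uint64_t getVA() {
    uint64_t VA = CachedVA.load(std::memory_order_relaxed);
    if (VA != 0)
      return VA;
    VA = computeVA();
    // If two threads race here, both compute the same value, so the
    // duplicated store is benign.
    CachedVA.store(VA, std::memory_order_relaxed);
    return VA;
  }
};
```

(As noted above, an experiment along these lines showed no significant
effect, so the redundant calls are apparently not where the time goes.)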
Rui Ueyama via llvm-dev
2017-Mar-02 01:08 UTC
[llvm-dev] [lld] We call SymbolBody::getVA redundantly a lot...
On Wed, Mar 1, 2017 at 2:31 PM, Sean Silva <chisophugis at gmail.com> wrote:

> Won't that require scanning all symbols in the second pass? I think that
> we want to do O(number of GOT/PLT entries) work (and cache footprint) in
> the second pass instead of O(symbols).

It will, but it might not be as bad as you imagine, because the number of
global symbols is much smaller than the number of relocations. For
example, at SymbolTable.cpp#673
<https://github.com/llvm-project/llvm-project/blob/0ad2ea48e8fa9b6243c2ec20b57b4268c279f9bf/lld/ELF/SymbolTable.cpp#L673>,
we iterate over all non-local symbols, but that is pretty cheap (which
surprised me at first).

> But I think you can resize a memory mapping (e.g. with ftruncate) to add
> data at the end. Or conservatively overallocate and then truncate at the
> end (pages we don't touch won't be allocated, so the overallocation
> shouldn't cost too much).

Putting a variable-length thing at the end of a file works only if you
have only one variable-sized thing. There can be multiple variable-sized
things in a linker, such as the string table, GOT, PLT, thunks, etc., so
it is probably hard to do that. Also, it doesn't work if a linker script
enforces some specific section layout.
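(To make the proposed two-pass split concrete, here is a rough sketch;
Symbol, Reloc, and needsGot are hypothetical stand-ins for lld's real
types and target-specific relocation classification.)

```
#include <atomic>
#include <cstdint>
#include <vector>

struct Symbol {
  std::atomic<bool> NeedsGot{false};
  uint32_t GotIndex = UINT32_MAX;
};

struct Reloc {
  Symbol *Sym;
  uint32_t Type;
};

// Placeholder for the target-specific "does this reloc need a GOT entry?"
// logic.
static bool needsGot(uint32_t Type) { return Type == 1; }

// Pass 1: can run in parallel, since its only side effect is setting
// per-symbol atomic flags.
static void markSymbols(std::vector<Reloc> &Relocs) {
  // In real code this loop would be a parallel for-each over Relocs.
  for (Reloc &R : Relocs)
    if (needsGot(R.Type))
      R.Sym->NeedsGot.store(true, std::memory_order_relaxed);
}

// Pass 2: a serial walk over non-local symbols that assigns GOT slots.
// This is O(symbols), but as discussed above, global symbols are far
// fewer than relocations, so the walk is cheap in practice.
static void assignGotEntries(std::vector<Symbol *> &Symbols,
                             uint32_t &NumGotEntries) {
  for (Symbol *S : Symbols)
    if (S->NeedsGot.load(std::memory_order_relaxed))
      S->GotIndex = NumGotEntries++;
}
```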
Sean Silva via llvm-dev
2017-Mar-03 00:48 UTC
[llvm-dev] [lld] We call SymbolBody::getVA redundantly a lot...
On Wed, Mar 1, 2017 at 5:08 PM, Rui Ueyama <ruiu at google.com> wrote:

> It will, but it might not be as bad as you imagine, because the number
> of global symbols is much smaller than the number of relocations. For
> example, at SymbolTable.cpp#673
> <https://github.com/llvm-project/llvm-project/blob/0ad2ea48e8fa9b6243c2ec20b57b4268c279f9bf/lld/ELF/SymbolTable.cpp#L673>,
> we iterate over all non-local symbols, but that is pretty cheap (which
> surprised me at first).

Interesting. I guess it makes sense though.

> Putting a variable-length thing at the end of a file works only if you
> have only one variable-sized thing. There can be multiple variable-sized
> things in a linker, such as the string table, GOT, PLT, thunks, etc., so
> it is probably hard to do that. Also, it doesn't work if a linker script
> enforces some specific section layout.

Gah, that's annoying :(

-- Sean Silva