Sriraman Tallam via llvm-dev
2019-Oct-18  18:20 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hello Maksim, On Fri, Oct 18, 2019 at 10:57 AM Maksim Panchenko <maks at fb.com> wrote:> Cool. The new numbers look good. If you run BOLT with jemalloc library > > preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed > that > > compared to the default malloc, it improves the multithreaded > > performance and brings down memory usage significantly. >Great, thanks for confirming! Would you be willing to share specific numbers, how significant is the reduction in memory with jemalloc for clang? We double-checked our numbers with the larger benchmarks and we can confirm they were *not built with labels*. One of our large benchmarks, search1, is about 5 times the size of clang in terms of text size as reported by size command, and we are seeing a 70G memory overhead on this. Do you have data on the memory consumption of BOLT with larger benchmarks with jemalloc. We are trying to build Chrome with latest BOLT so that we can share the memory overheads and the binaries with you for transparency but we are struggling with the disassembly errors. If you have data on large benchmarks we would appreciate it if you could share it. Further, if you have a recipe to use jemalloc with BOLT, please point it at us. We could try it out too and share our findings. Thanks much, Sri> > > Thanks, > > Maksim > > > > On 10/17/19, 2:59 PM, "Sriraman Tallam" <tmsriram at google.com> wrote: > > > > > > On Wed, Oct 16, 2019 at 3:52 PM Maksim Panchenko <maks at fb.com> wrote: > > Hi Sri, > > > > I want to clarify one thing before sending a detailed reply: did you > evaluate > > BOLT on Clang built with basic block sections? > > In the makefile you reference, > > there are two versions: a “vanilla” and a default built with function > sections. > > High overheads you see with BOLT on Clang do not match our experience. > > > > Thanks for pointing that out in the Makefile. We double-checked and > noticed a bug in our Makefile. For clang, we noticed that we are BOLTING > with basic block symbols which seems to affect the memory consumption of > BOLT. So, we have re-measured with recent bolt and for *full > transparency* we have uploaded the binaries, BOLT's yaml files and > perf.data files and the commands so that you can independently verify our > results and check the binaries. We have gzipped all the files to keep it > under 2G limit for git lfs, everything is here : > https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment > We have run our experiments on a 192G machine with Intel 18 core. > > > > We built llvm-bolt with most recent sources and is *pristine* with none of > our patches and uploading the binary we used here, > https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt > That's a very recent BOLT binary, git > hash: 988a7e8819b882fd14e18d149f8d3f702b134680 > > > > The > https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2} contains > two sets of binaries. The first binary is pristine recent clang-10 built > from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary > also only additionally built with -q. There are no labels or sections or > any other Propeller flags used to build these clang binaries. Here is the > command we are using to optimize with BOLT, all the commands have been > checked in too. > > > > You should be able to run llvm-bolt now on these binaries as all the files > are provided. We have also provided the raw perf data files in case you > want to independently convert. > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 --update-debug-sections > > > > For version 2, this is the number: > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40 > > Maximum resident set size (kbytes): 18742688 > > > > That is 125 seconds and ~18G of RAM. > > > > For version 1, this hangs and we stopped it after several minutes and the > maximum RSS size crossing 50G. This is most likely just a bug and you > should be able to reproduce this. The binary seems to be running and > printing update messages. > > > > We also measured without update-debug-sections too with the command : > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 > > > > For version1 : > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74 > > Maximum resident set size (kbytes): 14824444 > > > > 93 seconds and ~14G of RAM > > > > version 2 : > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90 > > Maximum resident set size (kbytes): 14511912 > > > > similar 91 secs and ~14G > > > > Now, coming back to the bug in the Makefile, we originally reported ~35G. > That is *wrong* since the clang binary used to measure bolt overheads was > built with basic block labels. Our *sincere apologies* for this, this > showed BOLT as consuming more memory than is actual for clang. We > double-checked BOLT numbers with the internal benchmark search2 for sanity > and that is built *without any labels* and only with "-Wl,-q". We are > checking the other large internal benchmarks too. We cannot disclose > internal benchmarks. So, we will get more large open-source benchmarks like > Chrome or gcc built with bolt and share the binaries and results so you can > independently verify. > > > > Thanks > > Sri > > > > > > Thanks, > > Maksim > > > > On 10/14/19, 11:44 AM, "llvm-dev on behalf of Sriraman Tallam via > llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of > llvm-dev at lists.llvm.org> wrote: > > > > Hello, > > > > I wanted to consolidate all the discussions and our final thoughts on the > concerns raised. I have attached a document consolidating it. > > > > BOLT’s performance gains inspired this work and we believe BOLT > > is a great piece of engineering. However, there are build environments > where > scalability is critical and memory limits per process are tight : > > * Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily > invented to achieve scalability and better incremental build times while > building large binaries with debug information. > > * ThinLTO, > http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html > <https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2016_06_thinlto-2Dscalable-2Dand-2Dincremental-2Dlto.html&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=rW9yHyu5DPla9M38HolcW_w_Md8TLqe53BTWIClBxO4&e=> > was > primarily invented to make LLVM’s full LTO scalable and keep the memory > and > time overheads low. ThinLTO has enabled much broader adoption of whole > program optimization, by making it non-monolithic. > > * For Chromium builds, > > https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha > <https://urldefense.proofpoint.com/v2/url?u=https-3A__chromium-2Dreview.googlesource.com_c_chromium_src_-2B_695714_3_build_toolcha&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=8EBzmSqxfeVJXXXFKkx4Mzkf5d6cucxPc9pXkF36v_o&e=> > in/concurrent_links.gni, the linker process memory is set to 10GB with > ThinLTO. > It was 26GB with Full LTO before that and individual processes will run of > out > of memory beyond that. > > * Here, > > https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib > <https://urldefense.proofpoint.com/v2/url?u=https-3A__gotocon.com_dl_goto-2Dchicago-2D2016_slides_AysyluGreenberg-5FBuildingADistrib&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=lH-bp7s0QTtCyJkcSTOL8B_wOsRw-SLGrFsbZLmjaxQ&e=> > utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google > scale > is shown where 5 million binary and test builds are performed every day on > several thousands of machines, each with a limitation of 12G of memory > per > process and 15 minute time-out on tests. Memory overheads of 35G (clang) > are > well above these thresholds. > > We have developed Propeller like ThinLTO that can be used to obtain > similar > performance gains like BOLT in such environments. > > > > Thanks > > Sri > > > > > > On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > > > > > On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Is there large value from deferring the block ordering to link time? That > is, does the block layout algorithm need to consider global layout issues > when deciding which blocks to put together and which to relegate to the > far-away part of the code? > > > > Or, could the propellor-optimized compile step instead split each function > into only 2 pieces -- one containing an "optimally-ordered" set of hot > blocks from the function, and another containing the cold blocks? The > linker would have less flexibility in placement, but maybe it doesn't > actually need that flexibility? > > > > Apologies if this is obvious for those who actually know what they're > talking about here. :) > > > > It is a fair question. > > > > We believe the flexibility to do fine grained layout in whole program > context is important. PostLinkOptimization is aimed at getting as much > performance improvement as possible (usually applied on top of > ThinLTO+PGO), so the framework is designed to enable it. > > > > In particular, it allows the linker to stitch hot bb traces from different > functions to be stitched together. It also allows hot trace duplication > across procedure boundaries (kind of interprocedural tailDup). Besides, > code alignment decisions to minimize branch mispredictions may require > global context (e.g, too conflicting branches residing in two different > functions). Other micro-arch specific optimizations to improve processor > front-end throughput may also require global context. > > > > It is conceivable to have an option to control the level of granularity at > the possible cost of performance. > > > > thanks, > > > > David > > > > > > > > On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at fb.com> wrote: > > You’re correct, except that, in Propeller, CFI duplication happens for > every basic block as it operates with the conservative assumption that a > block can be put anywhere by the linker. That’s a significant bloat that is > not cleaned up later. So, during link time, if N blocks from the same > function are contiguous in the final layout, as it should happen most of > the time for any sane BB order, we would have several FDEs for a region > that only needs one. The bloat goes to the final binary (a lot more FDEs, > specifically, one FDE per basic block). > > BOLT will only split a function in two parts, and only if it has profile. > Most of the time, a function is not split. It also has an option not to > split at all. For internally reordered basic blocks of a given function, it > has CFI deduplication logic (it will interpret and build the CFI states for > each block and rewrite the CFIs in a way that uses the minimum number of > instructions to encode the states for each block). > > > > *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of James Y > Knight via llvm-dev <llvm-dev at lists.llvm.org> > *Reply-To: *James Y Knight <jyknight at google.com> > *Date: *Wednesday, October 2, 2019 at 1:59 PM > *To: *Maksim Panchenko <maks at fb.com> > *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org> > *Subject: *Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link > Optimizations > > > > I'm a bit confused by this subthread -- doesn't BOLT have the exact same > CFI bloat issue? From my cursory reading of the propellor doc, the CFI > duplication is _necessary_ to represent discontiguous functions, not > anything particular to the way Propellor happens to generate those > discontiguous functions. > > > > And emitting discontiguous functions is a fundamental goal of this, right? > > > > On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Thanks for clarifying. This means once you move to the next basic block > (or any other basic > > block in the function) you have to execute an entirely new set of CFI > instructions > > except for the common CIE part. While indeed this is not as bad, on > average, the overall > > active memory footprint will increase. > > > > Creating one FDE per basic block means that .eh_frame_hdr, an allocatable > section, > > will be bloated too. This will increase the FDE lookup time. I don’t see > .eh_frame_hdr > > being mentioned in the proposal. > > > > Maksim > > > > On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com> > wrote: > > > > > > > > On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > *Pessimization/overhead for stack unwinding used by system-wide profilers > and > for exception handling* > > Larger CFI programs put an extra burden on unwinding at runtime as more CFI > (and thus native) instructions have to be executed. This will cause more > overhead for any profiler that records stack traces, and, as you correctly > note > in the proposal, for any program that heavily uses exceptions. > > > > The number of CFI instructions that have to be executed when unwinding any > given stack stays the same. The CFI instructions for a function have to be > duplicated in every basic block section, but when performing unwinding only > one such a set is executed -- the copy for the current basic block. > However, this copy contains precisely the same CFI instructions as the ones > that would have to be executed if there were no basic block sections. > > > > -- > > Krzysztof Pszeniczny > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191018/aaae55c7/attachment-0001.html>
Maksim Panchenko via llvm-dev
2019-Oct-22  05:07 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hi Sri,
Thank you for replying to our feedback. 7 out 12 high-level concerns have been
answered; 2 of them are fully addressed. The rest are being tracked at the
following Google doc:
https://docs.google.com/document/d/1jqjUZc8sEoNl6_lrVD6ZkITyFBFqhUOZ6ZaDm_XVbb8/
To keep the discussion at a high level, I have to reference the LTO vs ThinLTO
comparison since that appears to be the central theme in your response to the
feedback.
Unlike LTO, BOLT does not have to keep the entire program in memory.
Furthermore, as we have previously mentioned, most of the passes are run in
parallel, and the performance scales well with the number of CPUs.
To demonstrate that running BOLT on a hot subset of functions is not just a
speculation, we have prototyped a "Thin" version that optimizes
Clang-7 in under
15 seconds using less than 4GB of memory. No modifications to the linker or
compiler were required. And by the way, that appears to be faster than just the
re-linking phase of the Propeller. Larger loads show similar improvements
providing 2x-5x savings over the original processing specs.
Let me reiterate that current BOLT requires large amounts of memory not because
it's a fundamental limitation, unlike LTO. For us, system memory was never a
constraint. The runtime of the application, not BOLT, was the primary goal
during the development.
ThinLTO design solves a real problem and dramatically improves compilation time
even when building on a single node. ThinLTO results provide "end-to-end
build
time" comparison to LTO. I've asked you to show a similar comparison
for
Propeller vs. BOLT. I haven't seen the results, and I suspect the total
overhead
will exceed that of even the oldest non-parallel version of BOLT.
One argument I've heard is that BOLT is not taking advantage of the
distributed
build system. That's correct. It does not have to since it does not require
to
rebuild the application. In "Thin" mode, the overhead is similar to a
regular
linker running with a linker script.
You are right that we do not support debug fission packages. It is unimplemented
for a simple reason: no one asked for it previously. And as we like to say in
the open-source community: "patches are welcome."
Maksim
P.S. We have updated https://github.com/facebookincubator/BOLT with instructions
on running BOLT with jemalloc or tcmalloc.
On 10/18/19, 11:21 AM, "Sriraman Tallam" <tmsriram at
google.com<mailto:tmsriram at google.com>> wrote:
Hello Maksim,
On Fri, Oct 18, 2019 at 10:57 AM Maksim Panchenko <maks at
fb.com<mailto:maks at fb.com>> wrote:
Cool. The new numbers look good. If you run BOLT with jemalloc library
preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed that
compared to the default malloc, it improves the multithreaded
performance and brings down memory usage significantly.
Great, thanks for confirming!  Would you be willing to share specific numbers,
how significant is the reduction in memory with jemalloc for clang?    We
double-checked our numbers with the larger benchmarks and we can confirm they
were *not built with labels*.  One of our large benchmarks, search1, is about 5
times the size of clang in terms of text size as reported by size command, and
we are seeing a 70G memory overhead on this. Do you have  data on the memory
consumption of BOLT with larger benchmarks with jemalloc.   We are trying to
build Chrome with latest BOLT so that we can share the memory overheads and the
binaries with you for transparency but we are struggling with the disassembly
errors. If you have data on large benchmarks we would appreciate it if you could
share it.
Further, if you have a recipe to use jemalloc with BOLT, please point it at us.
We could try it out too and share our findings.
Thanks much,
Sri
Thanks,
Maksim
On 10/17/19, 2:59 PM, "Sriraman Tallam" <tmsriram at
google.com<mailto:tmsriram at google.com>> wrote:
On Wed, Oct 16, 2019 at 3:52 PM Maksim Panchenko <maks at
fb.com<mailto:maks at fb.com>> wrote:
Hi Sri,
I want to clarify one thing before sending a detailed reply: did you evaluate
BOLT on Clang built with basic block sections?
In the makefile you reference,
there are two versions: a “vanilla” and a default built with function sections.
High overheads you see with BOLT on Clang do not match our experience.
Thanks for pointing that out in the Makefile. We double-checked and noticed a
bug in our Makefile.  For clang, we noticed that we are BOLTING with basic block
symbols which seems to affect the memory consumption of BOLT.  So, we  have
re-measured with recent bolt and for *full transparency* we have uploaded the
binaries,  BOLT's yaml files and perf.data files  and the commands so that
you can independently verify our results and check the binaries.  We have
gzipped all the files to keep it under 2G limit for git lfs, everything is here
:   https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment 
We have run our experiments on a 192G machine with Intel 18 core.
We built llvm-bolt with most recent sources and is *pristine* with none of our
patches and uploading the binary we used here,
https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt
That's a very recent BOLT binary, git hash:
988a7e8819b882fd14e18d149f8d3f702b134680
The 
https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2}
contains two sets of binaries.  The first binary is pristine recent clang-10
built from 2 weeks ago with additionally only -Wl,-q.  v2 is another clang
binary also only additionally built with -q.  There are no labels or sections or
any other Propeller flags used to build these clang binaries.  Here is the
command we are using to optimize with BOLT, all the commands have been checked
in too.
You should be able to run llvm-bolt now on these binaries as all the files are
provided.  We have also provided the raw perf data files in case you want to
independently convert.
$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b
pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+
-reorder-functions=hfsort -relocs=1 --update-debug-sections
For version 2, this is the number:
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40
Maximum resident set size (kbytes): 18742688
That is 125 seconds and ~18G of RAM.
For version 1, this hangs and we stopped it after several minutes and the
maximum RSS size crossing 50G.  This is most likely just a bug and you should be
able to reproduce this.  The binary seems to be running and printing update
messages.
We also measured without update-debug-sections too with the command :
$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b
pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+
-reorder-functions=hfsort -relocs=1
For version1 :
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74
Maximum resident set size (kbytes): 14824444
93 seconds and ~14G of RAM
version 2 :
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90
Maximum resident set size (kbytes): 14511912
similar 91 secs and ~14G
Now, coming back to the bug in the Makefile, we originally reported ~35G.  That
is *wrong* since the clang binary used to measure bolt overheads was built with
basic block labels.  Our  *sincere apologies* for this, this showed BOLT as
consuming more memory than is actual for clang.  We double-checked BOLT numbers
with the internal benchmark search2 for sanity and that is built *without any
labels* and only with "-Wl,-q".  We are checking the other large
internal benchmarks too.  We cannot disclose internal benchmarks. So, we will
get more large open-source benchmarks like Chrome or gcc built with bolt and
share the binaries and results so you can independently verify.
Thanks
Sri
Thanks,
Maksim
On 10/14/19, 11:44 AM, "llvm-dev on behalf of Sriraman Tallam via
llvm-dev" <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces
at lists.llvm.org> on behalf of llvm-dev at lists.llvm.org<mailto:llvm-dev
at lists.llvm.org>> wrote:
Hello,
I wanted to consolidate all the discussions and our final thoughts on the
concerns raised.  I have attached a document consolidating it.
BOLT’s performance gains inspired this work and we believe BOLT
is a great piece of engineering.  However, there are build environments where
scalability is critical and memory limits per process are tight :
* Debug Fission,  https://gcc.gnu.org/wiki/DebugFission was primarily
invented to achieve scalability and better incremental build times while
building large binaries with debug information.
* ThinLTO,
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html<https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2016_06_thinlto-2Dscalable-2Dand-2Dincremental-2Dlto.html&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=rW9yHyu5DPla9M38HolcW_w_Md8TLqe53BTWIClBxO4&e=>
was
primarily invented to make LLVM’s full LTO scalable and keep the memory and
time overheads low.  ThinLTO has enabled much broader adoption of whole
program optimization, by making it non-monolithic.
* For Chromium builds,
https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha<https://urldefense.proofpoint.com/v2/url?u=https-3A__chromium-2Dreview.googlesource.com_c_chromium_src_-2B_695714_3_build_toolcha&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=8EBzmSqxfeVJXXXFKkx4Mzkf5d6cucxPc9pXkF36v_o&e=>
in/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO.
It was 26GB with Full LTO before that and individual processes will run of out
of memory beyond that.
* Here,
https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib<https://urldefense.proofpoint.com/v2/url?u=https-3A__gotocon.com_dl_goto-2Dchicago-2D2016_slides_AysyluGreenberg-5FBuildingADistrib&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=lH-bp7s0QTtCyJkcSTOL8B_wOsRw-SLGrFsbZLmjaxQ&e=>
utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale
is shown where 5 million binary and test builds are performed every day on
several thousands of machines, each  with a limitation of 12G of memory per
process and 15 minute time-out on tests. Memory overheads of 35G (clang) are
well above these thresholds.
We have developed Propeller like ThinLTO that can be used to obtain similar
performance gains like BOLT in such environments.
Thanks
Sri
On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Is there large value from deferring the block ordering to link time? That is,
does the block layout algorithm need to consider global layout issues when
deciding which blocks to put together and which to relegate to the far-away part
of the code?
Or, could the propellor-optimized compile step instead split each function into
only 2 pieces -- one containing an "optimally-ordered" set of hot
blocks from the function, and another containing the cold blocks? The linker
would have less flexibility in placement, but maybe it doesn't actually need
that flexibility?
Apologies if this is obvious for those who actually know what they're
talking about here. :)
It is a fair question.
We believe the flexibility to do fine grained layout in whole program context is
important. PostLinkOptimization is aimed at getting as much performance
improvement as possible (usually applied on top of ThinLTO+PGO), so the
framework is designed to enable it.
In particular, it allows the linker to stitch hot bb traces from different
functions to be stitched together. It also allows hot trace duplication across
procedure boundaries (kind of interprocedural tailDup). Besides, code alignment
decisions to minimize branch mispredictions  may require global context (e.g,
too conflicting branches residing in two different functions).  Other micro-arch
specific optimizations to improve processor front-end throughput may also
require global context.
It is conceivable to have an option to control the level of granularity at the
possible cost of performance.
thanks,
David
On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at
fb.com<mailto:rafaelauler at fb.com>> wrote:
You’re correct, except that, in Propeller, CFI duplication happens for every
basic block as it operates with the conservative assumption that a block can be
put anywhere by the linker. That’s a significant bloat that is not cleaned up
later. So, during link time, if N blocks from the same function are contiguous
in the final layout, as it should happen most of the time for any sane BB order,
we would have several FDEs for a region that only needs one. The bloat goes to
the final binary (a lot more FDEs, specifically, one FDE per basic block).
BOLT will only split a function in two parts, and only if it has profile. Most
of the time, a function is not split. It also has an option not to split at all.
For internally reordered basic blocks of a given function, it has CFI
deduplication logic (it will interpret and build the CFI states for each block
and rewrite the CFIs in a way that uses the minimum number of instructions to
encode the states for each block).
From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces
at lists.llvm.org>> on behalf of James Y Knight via llvm-dev <llvm-dev
at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
Reply-To: James Y Knight <jyknight at google.com<mailto:jyknight at
google.com>>
Date: Wednesday, October 2, 2019 at 1:59 PM
To: Maksim Panchenko <maks at fb.com<mailto:maks at fb.com>>
Cc: "llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>" <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>
Subject: Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
Optimizations
I'm a bit confused by this subthread -- doesn't BOLT have the exact same
CFI bloat issue? From my cursory reading of the propellor doc, the CFI
duplication is _necessary_ to represent discontiguous functions, not anything
particular to the way Propellor happens to generate those discontiguous
functions.
And emitting discontiguous functions is a fundamental goal of this, right?
On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Thanks for clarifying. This means once you move to the next basic block (or any
other basic
block in the function) you have to execute an entirely new set of CFI
instructions
except for the common CIE part. While indeed this is not as bad, on average, the
overall
active memory footprint will increase.
Creating one FDE per basic block means that .eh_frame_hdr, an allocatable
section,
will be bloated too. This will increase the FDE lookup time. I don’t see
.eh_frame_hdr
being mentioned in the proposal.
Maksim
On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at
google.com<mailto:kpszeniczny at google.com>> wrote:
On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
*Pessimization/overhead for stack unwinding used by system-wide profilers and
for exception handling*
Larger CFI programs put an extra burden on unwinding at runtime as more CFI
(and thus native) instructions have to be executed. This will cause more
overhead for any profiler that records stack traces, and, as you correctly note
in the proposal, for any program that heavily uses exceptions.
The number of CFI instructions that have to be executed when unwinding any given
stack stays the same. The CFI instructions for a function have to be duplicated
in every basic block section, but when performing unwinding only one such a set
is executed -- the copy for the current basic block. However, this copy contains
precisely the same CFI instructions as the ones that would have to be executed
if there were no basic block sections.
--
Krzysztof Pszeniczny
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=>
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=>
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191022/f59c2627/attachment-0001.html>
Sriraman Tallam via llvm-dev
2019-Oct-22  05:38 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
We are going to be at the llvm-dev meeting the next two days. We will get back to you after that. Sri On Mon, Oct 21, 2019 at 10:07 PM Maksim Panchenko <maks at fb.com> wrote:> Hi Sri, > > > > Thank you for replying to our feedback. 7 out 12 high-level concerns have > been > > answered; 2 of them are fully addressed. The rest are being tracked at the > > following Google doc: > > > > > https://docs.google.com/document/d/1jqjUZc8sEoNl6_lrVD6ZkITyFBFqhUOZ6ZaDm_XVbb8/ > > > > To keep the discussion at a high level, I have to reference the LTO vs > ThinLTO > > comparison since that appears to be the central theme in your response to > the > > feedback. > > > > Unlike LTO, BOLT does not have to keep the entire program in memory. > > Furthermore, as we have previously mentioned, most of the passes are run in > > parallel, and the performance scales well with the number of CPUs. > > > > To demonstrate that running BOLT on a hot subset of functions is not just a > > speculation, we have prototyped a "Thin" version that optimizes Clang-7 in > under > > 15 seconds using less than 4GB of memory. No modifications to the linker or > > compiler were required. And by the way, that appears to be faster than > just the > > re-linking phase of the Propeller. Larger loads show similar improvements > > providing 2x-5x savings over the original processing specs. > > > > Let me reiterate that current BOLT requires large amounts of memory not > because > > it's a fundamental limitation, unlike LTO. For us, system memory was never > a > > constraint. The runtime of the application, not BOLT, was the primary goal > > during the development. > > > > ThinLTO design solves a real problem and dramatically improves compilation > time > > even when building on a single node. ThinLTO results provide "end-to-end > build > > time" comparison to LTO. I've asked you to show a similar comparison for > > Propeller vs. BOLT. I haven't seen the results, and I suspect the total > overhead > > will exceed that of even the oldest non-parallel version of BOLT. > > > > One argument I've heard is that BOLT is not taking advantage of the > distributed > > build system. That's correct. It does not have to since it does not > require to > > rebuild the application. In "Thin" mode, the overhead is similar to a > regular > > linker running with a linker script. > > > > You are right that we do not support debug fission packages. It is > unimplemented > > for a simple reason: no one asked for it previously. And as we like to say > in > > the open-source community: "patches are welcome." > > > > Maksim > > > > P.S. We have updated https://github.com/facebookincubator/BOLT with > instructions on running BOLT with jemalloc or tcmalloc. > > > > On 10/18/19, 11:21 AM, "Sriraman Tallam" <tmsriram at google.com> wrote: > > > > Hello Maksim, > > > > On Fri, Oct 18, 2019 at 10:57 AM Maksim Panchenko <maks at fb.com> wrote: > > Cool. The new numbers look good. If you run BOLT with jemalloc library > > preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed > that > > compared to the default malloc, it improves the multithreaded > > performance and brings down memory usage significantly. > > > > Great, thanks for confirming! Would you be willing to share specific > numbers, how significant is the reduction in memory with jemalloc for > clang? We double-checked our numbers with the larger benchmarks and we > can confirm they were *not built with labels*. One of our > large benchmarks, search1, is about 5 times the size of clang in terms of > text size as reported by size command, and we are seeing a 70G memory > overhead on this. Do you have data on the memory consumption of BOLT with > larger benchmarks with jemalloc. We are trying to build Chrome with > latest BOLT so that we can share the memory overheads and the binaries with > you for transparency but we are struggling with the disassembly errors. If > you have data on large benchmarks we would appreciate it if you could share > it. > > > > Further, if you have a recipe to use jemalloc with BOLT, please point it > at us. We could try it out too and share our findings. > > > > Thanks much, > > Sri > > > > > > Thanks, > > Maksim > > > > On 10/17/19, 2:59 PM, "Sriraman Tallam" <tmsriram at google.com> wrote: > > > > > > On Wed, Oct 16, 2019 at 3:52 PM Maksim Panchenko <maks at fb.com> wrote: > > Hi Sri, > > > > I want to clarify one thing before sending a detailed reply: did you > evaluate > > BOLT on Clang built with basic block sections? > > In the makefile you reference, > > there are two versions: a “vanilla” and a default built with function > sections. > > High overheads you see with BOLT on Clang do not match our experience. > > > > Thanks for pointing that out in the Makefile. We double-checked and > noticed a bug in our Makefile. For clang, we noticed that we are BOLTING > with basic block symbols which seems to affect the memory consumption of > BOLT. So, we have re-measured with recent bolt and for *full > transparency* we have uploaded the binaries, BOLT's yaml files and > perf.data files and the commands so that you can independently verify our > results and check the binaries. We have gzipped all the files to keep it > under 2G limit for git lfs, everything is here : > https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment > We have run our experiments on a 192G machine with Intel 18 core. > > > > We built llvm-bolt with most recent sources and is *pristine* with none of > our patches and uploading the binary we used here, > https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt > That's a very recent BOLT binary, git > hash: 988a7e8819b882fd14e18d149f8d3f702b134680 > > > > The > https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2} contains > two sets of binaries. The first binary is pristine recent clang-10 built > from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary > also only additionally built with -q. There are no labels or sections or > any other Propeller flags used to build these clang binaries. Here is the > command we are using to optimize with BOLT, all the commands have been > checked in too. > > > > You should be able to run llvm-bolt now on these binaries as all the files > are provided. We have also provided the raw perf data files in case you > want to independently convert. > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 --update-debug-sections > > > > For version 2, this is the number: > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40 > > Maximum resident set size (kbytes): 18742688 > > > > That is 125 seconds and ~18G of RAM. > > > > For version 1, this hangs and we stopped it after several minutes and the > maximum RSS size crossing 50G. This is most likely just a bug and you > should be able to reproduce this. The binary seems to be running and > printing update messages. > > > > We also measured without update-debug-sections too with the command : > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 > > > > For version1 : > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74 > > Maximum resident set size (kbytes): 14824444 > > > > 93 seconds and ~14G of RAM > > > > version 2 : > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90 > > Maximum resident set size (kbytes): 14511912 > > > > similar 91 secs and ~14G > > > > Now, coming back to the bug in the Makefile, we originally reported ~35G. > That is *wrong* since the clang binary used to measure bolt overheads was > built with basic block labels. Our *sincere apologies* for this, this > showed BOLT as consuming more memory than is actual for clang. We > double-checked BOLT numbers with the internal benchmark search2 for sanity > and that is built *without any labels* and only with "-Wl,-q". We are > checking the other large internal benchmarks too. We cannot disclose > internal benchmarks. So, we will get more large open-source benchmarks like > Chrome or gcc built with bolt and share the binaries and results so you can > independently verify. > > > > Thanks > > Sri > > > > > > Thanks, > > Maksim > > > > On 10/14/19, 11:44 AM, "llvm-dev on behalf of Sriraman Tallam via > llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of > llvm-dev at lists.llvm.org> wrote: > > > > Hello, > > > > I wanted to consolidate all the discussions and our final thoughts on the > concerns raised. I have attached a document consolidating it. > > > > BOLT’s performance gains inspired this work and we believe BOLT > > is a great piece of engineering. However, there are build environments > where > scalability is critical and memory limits per process are tight : > > * Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily > invented to achieve scalability and better incremental build times while > building large binaries with debug information. > > * ThinLTO, > http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html > <https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2016_06_thinlto-2Dscalable-2Dand-2Dincremental-2Dlto.html&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=rW9yHyu5DPla9M38HolcW_w_Md8TLqe53BTWIClBxO4&e=> > was > primarily invented to make LLVM’s full LTO scalable and keep the memory > and > time overheads low. ThinLTO has enabled much broader adoption of whole > program optimization, by making it non-monolithic. > > * For Chromium builds, > > https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha > <https://urldefense.proofpoint.com/v2/url?u=https-3A__chromium-2Dreview.googlesource.com_c_chromium_src_-2B_695714_3_build_toolcha&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=8EBzmSqxfeVJXXXFKkx4Mzkf5d6cucxPc9pXkF36v_o&e=> > in/concurrent_links.gni, the linker process memory is set to 10GB with > ThinLTO. > It was 26GB with Full LTO before that and individual processes will run of > out > of memory beyond that. > > * Here, > > https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib > <https://urldefense.proofpoint.com/v2/url?u=https-3A__gotocon.com_dl_goto-2Dchicago-2D2016_slides_AysyluGreenberg-5FBuildingADistrib&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=lH-bp7s0QTtCyJkcSTOL8B_wOsRw-SLGrFsbZLmjaxQ&e=> > utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google > scale > is shown where 5 million binary and test builds are performed every day on > several thousands of machines, each with a limitation of 12G of memory > per > process and 15 minute time-out on tests. Memory overheads of 35G (clang) > are > well above these thresholds. > > We have developed Propeller like ThinLTO that can be used to obtain > similar > performance gains like BOLT in such environments. > > > > Thanks > > Sri > > > > > > On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > > > > > On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Is there large value from deferring the block ordering to link time? That > is, does the block layout algorithm need to consider global layout issues > when deciding which blocks to put together and which to relegate to the > far-away part of the code? > > > > Or, could the propellor-optimized compile step instead split each function > into only 2 pieces -- one containing an "optimally-ordered" set of hot > blocks from the function, and another containing the cold blocks? The > linker would have less flexibility in placement, but maybe it doesn't > actually need that flexibility? > > > > Apologies if this is obvious for those who actually know what they're > talking about here. :) > > > > It is a fair question. > > > > We believe the flexibility to do fine grained layout in whole program > context is important. PostLinkOptimization is aimed at getting as much > performance improvement as possible (usually applied on top of > ThinLTO+PGO), so the framework is designed to enable it. > > > > In particular, it allows the linker to stitch hot bb traces from different > functions to be stitched together. It also allows hot trace duplication > across procedure boundaries (kind of interprocedural tailDup). Besides, > code alignment decisions to minimize branch mispredictions may require > global context (e.g, too conflicting branches residing in two different > functions). Other micro-arch specific optimizations to improve processor > front-end throughput may also require global context. > > > > It is conceivable to have an option to control the level of granularity at > the possible cost of performance. > > > > thanks, > > > > David > > > > > > > > On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at fb.com> wrote: > > You’re correct, except that, in Propeller, CFI duplication happens for > every basic block as it operates with the conservative assumption that a > block can be put anywhere by the linker. That’s a significant bloat that is > not cleaned up later. So, during link time, if N blocks from the same > function are contiguous in the final layout, as it should happen most of > the time for any sane BB order, we would have several FDEs for a region > that only needs one. The bloat goes to the final binary (a lot more FDEs, > specifically, one FDE per basic block). > > BOLT will only split a function in two parts, and only if it has profile. > Most of the time, a function is not split. It also has an option not to > split at all. For internally reordered basic blocks of a given function, it > has CFI deduplication logic (it will interpret and build the CFI states for > each block and rewrite the CFIs in a way that uses the minimum number of > instructions to encode the states for each block). > > > > *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of James Y > Knight via llvm-dev <llvm-dev at lists.llvm.org> > *Reply-To: *James Y Knight <jyknight at google.com> > *Date: *Wednesday, October 2, 2019 at 1:59 PM > *To: *Maksim Panchenko <maks at fb.com> > *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org> > *Subject: *Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link > Optimizations > > > > I'm a bit confused by this subthread -- doesn't BOLT have the exact same > CFI bloat issue? From my cursory reading of the propellor doc, the CFI > duplication is _necessary_ to represent discontiguous functions, not > anything particular to the way Propellor happens to generate those > discontiguous functions. > > > > And emitting discontiguous functions is a fundamental goal of this, right? > > > > On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Thanks for clarifying. This means once you move to the next basic block > (or any other basic > > block in the function) you have to execute an entirely new set of CFI > instructions > > except for the common CIE part. While indeed this is not as bad, on > average, the overall > > active memory footprint will increase. > > > > Creating one FDE per basic block means that .eh_frame_hdr, an allocatable > section, > > will be bloated too. This will increase the FDE lookup time. I don’t see > .eh_frame_hdr > > being mentioned in the proposal. > > > > Maksim > > > > On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com> > wrote: > > > > > > > > On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > *Pessimization/overhead for stack unwinding used by system-wide profilers > and > for exception handling* > > Larger CFI programs put an extra burden on unwinding at runtime as more CFI > (and thus native) instructions have to be executed. This will cause more > overhead for any profiler that records stack traces, and, as you correctly > note > in the proposal, for any program that heavily uses exceptions. > > > > The number of CFI instructions that have to be executed when unwinding any > given stack stays the same. The CFI instructions for a function have to be > duplicated in every basic block section, but when performing unwinding only > one such a set is executed -- the copy for the current basic block. > However, this copy contains precisely the same CFI instructions as the ones > that would have to be executed if there were no basic block sections. > > > > -- > > Krzysztof Pszeniczny > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191021/bc9bcfb5/attachment.html>
Xinliang David Li via llvm-dev
2019-Oct-22  07:03 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hi Maksim, see my reply inline below. On Mon, Oct 21, 2019 at 10:07 PM Maksim Panchenko <maks at fb.com> wrote:> Hi Sri, > > > > Thank you for replying to our feedback. 7 out 12 high-level concerns have > been > > answered; 2 of them are fully addressed. The rest are being tracked at the > > following Google doc: > > > > > https://docs.google.com/document/d/1jqjUZc8sEoNl6_lrVD6ZkITyFBFqhUOZ6ZaDm_XVbb8/ > > > > To keep the discussion at a high level, I have to reference the LTO vs > ThinLTO > > comparison since that appears to be the central theme in your response to > the > > feedback. > > > > Unlike LTO, BOLT does not have to keep the entire program in memory. >IMO, there is no fundamental difference there. Partial (non Whole Program) LTO does not need to put everything in memory either, but it is still MonoLTO, and does not make it ThinLTO. Like ThinLTO, Propeller is designed with the following principles in mind: 1) The serial step of Propeller (either an integrated component of linker or a plugin to it) is made as thin as possible -- it works on lightweight summary like data 2) The serial step makes global decisions and do not make transformations 3) Code transformations are jobs of compiler backend and linker (reordering etc) and there will be clear interface between Propeller and the transformers (longer term with persistent McInst support) 4) Propeller is designed to be turned on for tens of thousands of build targets, in other words, it should be turned on for *everyone* by default. This means the memory footprint for the optimization much be reasonably small. It should integrate easily with distributed build systems. 5) Related to goal 4) there, disassembling should be avoid as much as possible. We have experienced all kinds of problems with it, and it can be a constant source of pain (not to mention the memory overhead).> Furthermore, as we have previously mentioned, most of the passes are run in > > parallel, and the performance scales well with the number of CPUs. >If a pass can be run in parallel (e.g, function level), then PLO may be the wrong place to do such optimizations.> > > To demonstrate that running BOLT on a hot subset of functions is not just a > > speculation, we have prototyped a "Thin" version that optimizes Clang-7 in > under > > 15 seconds using less than 4GB of memory. No modifications to the linker or > > compiler were required. And by the way, that appears to be faster than > just the > > re-linking phase of the Propeller. Larger loads show similar improvements > > providing 2x-5x savings over the original processing specs. > >If you can provide a patch so that we can try it with large benchmarks, that will be great. On the other hand, Clang's text is relatively small (<100M?), so using 4GB memory is still quite large IMO. Sri and team tried to bring up BOLT for Chrome (which is about 2 or 3x larger than Clang) for the last two days, but have not been successful. They encountered all sorts of problems --- after fixing many sources that led to disassembly errors, they got repeatedly hit by 'jump into the middle of insn' errors. See also 5) above. (Similarly, a few months time were spent on bringing up BOLT for an internal benchmark which is 8x larger than Clang)> > > Let me reiterate that current BOLT requires large amounts of memory not > because > > it's a fundamental limitation, unlike LTO. For us, system memory was never > a > > constraint. The runtime of the application, not BOLT, was the primary goal > > during the development. > >BOLT is an amazing engineering pierce that inspired Propeller. See the objectives above, system memory is one of the primary design goals for it , and it is architected to enforce the policy.> > > ThinLTO design solves a real problem and dramatically improves compilation > time > > even when building on a single node. ThinLTO results provide "end-to-end > build > > time" comparison to LTO. I've asked you to show a similar comparison for > > Propeller vs. BOLT. I haven't seen the results, and I suspect the total > overhead > > will exceed that of even the oldest non-parallel version of BOLT. >I have explained this above and we are confident that Propeller can be turned on by default (out of box) for any targets just like ThinLTO today (and this is indeed the goal). We will continue to iterate on later versions to make it happen.> > > One argument I've heard is that BOLT is not taking advantage of the > distributed > > build system. That's correct. It does not have to since it does not > require to > > rebuild the application. In "Thin" mode, the overhead is similar to a > regular > > linker running with a linker script. > > >4GB in Thin Mode for a Clang sized program is not the scalability goal to shoot for wider adoption of PLO. thanks, David> You are right that we do not support debug fission packages. It is > unimplemented > > for a simple reason: no one asked for it previously. And as we like to say > in > > the open-source community: "patches are welcome." > > > > Maksim > > > > P.S. We have updated https://github.com/facebookincubator/BOLT with > instructions on running BOLT with jemalloc or tcmalloc. > > > > On 10/18/19, 11:21 AM, "Sriraman Tallam" <tmsriram at google.com> wrote: > > > > Hello Maksim, > > > > On Fri, Oct 18, 2019 at 10:57 AM Maksim Panchenko <maks at fb.com> wrote: > > Cool. The new numbers look good. If you run BOLT with jemalloc library > > preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed > that > > compared to the default malloc, it improves the multithreaded > > performance and brings down memory usage significantly. > > > > Great, thanks for confirming! Would you be willing to share specific > numbers, how significant is the reduction in memory with jemalloc for > clang? We double-checked our numbers with the larger benchmarks and we > can confirm they were *not built with labels*. One of our > large benchmarks, search1, is about 5 times the size of clang in terms of > text size as reported by size command, and we are seeing a 70G memory > overhead on this. Do you have data on the memory consumption of BOLT with > larger benchmarks with jemalloc. We are trying to build Chrome with > latest BOLT so that we can share the memory overheads and the binaries with > you for transparency but we are struggling with the disassembly errors. If > you have data on large benchmarks we would appreciate it if you could share > it. > > > > Further, if you have a recipe to use jemalloc with BOLT, please point it > at us. We could try it out too and share our findings. > > > > Thanks much, > > Sri > > > > > > Thanks, > > Maksim > > > > On 10/17/19, 2:59 PM, "Sriraman Tallam" <tmsriram at google.com> wrote: > > > > > > On Wed, Oct 16, 2019 at 3:52 PM Maksim Panchenko <maks at fb.com> wrote: > > Hi Sri, > > > > I want to clarify one thing before sending a detailed reply: did you > evaluate > > BOLT on Clang built with basic block sections? > > In the makefile you reference, > > there are two versions: a “vanilla” and a default built with function > sections. > > High overheads you see with BOLT on Clang do not match our experience. > > > > Thanks for pointing that out in the Makefile. We double-checked and > noticed a bug in our Makefile. For clang, we noticed that we are BOLTING > with basic block symbols which seems to affect the memory consumption of > BOLT. So, we have re-measured with recent bolt and for *full > transparency* we have uploaded the binaries, BOLT's yaml files and > perf.data files and the commands so that you can independently verify our > results and check the binaries. We have gzipped all the files to keep it > under 2G limit for git lfs, everything is here : > https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment > We have run our experiments on a 192G machine with Intel 18 core. > > > > We built llvm-bolt with most recent sources and is *pristine* with none of > our patches and uploading the binary we used here, > https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt > That's a very recent BOLT binary, git > hash: 988a7e8819b882fd14e18d149f8d3f702b134680 > > > > The > https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2} contains > two sets of binaries. The first binary is pristine recent clang-10 built > from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary > also only additionally built with -q. There are no labels or sections or > any other Propeller flags used to build these clang binaries. Here is the > command we are using to optimize with BOLT, all the commands have been > checked in too. > > > > You should be able to run llvm-bolt now on these binaries as all the files > are provided. We have also provided the raw perf data files in case you > want to independently convert. > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 --update-debug-sections > > > > For version 2, this is the number: > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40 > > Maximum resident set size (kbytes): 18742688 > > > > That is 125 seconds and ~18G of RAM. > > > > For version 1, this hangs and we stopped it after several minutes and the > maximum RSS size crossing 50G. This is most likely just a bug and you > should be able to reproduce this. The binary seems to be running and > printing update messages. > > > > We also measured without update-debug-sections too with the command : > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 > > > > For version1 : > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74 > > Maximum resident set size (kbytes): 14824444 > > > > 93 seconds and ~14G of RAM > > > > version 2 : > > Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90 > > Maximum resident set size (kbytes): 14511912 > > > > similar 91 secs and ~14G > > > > Now, coming back to the bug in the Makefile, we originally reported ~35G. > That is *wrong* since the clang binary used to measure bolt overheads was > built with basic block labels. Our *sincere apologies* for this, this > showed BOLT as consuming more memory than is actual for clang. We > double-checked BOLT numbers with the internal benchmark search2 for sanity > and that is built *without any labels* and only with "-Wl,-q". We are > checking the other large internal benchmarks too. We cannot disclose > internal benchmarks. So, we will get more large open-source benchmarks like > Chrome or gcc built with bolt and share the binaries and results so you can > independently verify. > > > > Thanks > > Sri > > > > > > Thanks, > > Maksim > > > > On 10/14/19, 11:44 AM, "llvm-dev on behalf of Sriraman Tallam via > llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of > llvm-dev at lists.llvm.org> wrote: > > > > Hello, > > > > I wanted to consolidate all the discussions and our final thoughts on the > concerns raised. I have attached a document consolidating it. > > > > BOLT’s performance gains inspired this work and we believe BOLT > > is a great piece of engineering. However, there are build environments > where > scalability is critical and memory limits per process are tight : > > * Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily > invented to achieve scalability and better incremental build times while > building large binaries with debug information. > > * ThinLTO, > http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html > <https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2016_06_thinlto-2Dscalable-2Dand-2Dincremental-2Dlto.html&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=rW9yHyu5DPla9M38HolcW_w_Md8TLqe53BTWIClBxO4&e=> > was > primarily invented to make LLVM’s full LTO scalable and keep the memory > and > time overheads low. ThinLTO has enabled much broader adoption of whole > program optimization, by making it non-monolithic. > > * For Chromium builds, > > https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha > <https://urldefense.proofpoint.com/v2/url?u=https-3A__chromium-2Dreview.googlesource.com_c_chromium_src_-2B_695714_3_build_toolcha&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=8EBzmSqxfeVJXXXFKkx4Mzkf5d6cucxPc9pXkF36v_o&e=> > in/concurrent_links.gni, the linker process memory is set to 10GB with > ThinLTO. > It was 26GB with Full LTO before that and individual processes will run of > out > of memory beyond that. > > * Here, > > https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib > <https://urldefense.proofpoint.com/v2/url?u=https-3A__gotocon.com_dl_goto-2Dchicago-2D2016_slides_AysyluGreenberg-5FBuildingADistrib&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=lH-bp7s0QTtCyJkcSTOL8B_wOsRw-SLGrFsbZLmjaxQ&e=> > utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google > scale > is shown where 5 million binary and test builds are performed every day on > several thousands of machines, each with a limitation of 12G of memory > per > process and 15 minute time-out on tests. Memory overheads of 35G (clang) > are > well above these thresholds. > > We have developed Propeller like ThinLTO that can be used to obtain > similar > performance gains like BOLT in such environments. > > > > Thanks > > Sri > > > > > > On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > > > > > On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Is there large value from deferring the block ordering to link time? That > is, does the block layout algorithm need to consider global layout issues > when deciding which blocks to put together and which to relegate to the > far-away part of the code? > > > > Or, could the propellor-optimized compile step instead split each function > into only 2 pieces -- one containing an "optimally-ordered" set of hot > blocks from the function, and another containing the cold blocks? The > linker would have less flexibility in placement, but maybe it doesn't > actually need that flexibility? > > > > Apologies if this is obvious for those who actually know what they're > talking about here. :) > > > > It is a fair question. > > > > We believe the flexibility to do fine grained layout in whole program > context is important. PostLinkOptimization is aimed at getting as much > performance improvement as possible (usually applied on top of > ThinLTO+PGO), so the framework is designed to enable it. > > > > In particular, it allows the linker to stitch hot bb traces from different > functions to be stitched together. It also allows hot trace duplication > across procedure boundaries (kind of interprocedural tailDup). Besides, > code alignment decisions to minimize branch mispredictions may require > global context (e.g, too conflicting branches residing in two different > functions). Other micro-arch specific optimizations to improve processor > front-end throughput may also require global context. > > > > It is conceivable to have an option to control the level of granularity at > the possible cost of performance. > > > > thanks, > > > > David > > > > > > > > On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at fb.com> wrote: > > You’re correct, except that, in Propeller, CFI duplication happens for > every basic block as it operates with the conservative assumption that a > block can be put anywhere by the linker. That’s a significant bloat that is > not cleaned up later. So, during link time, if N blocks from the same > function are contiguous in the final layout, as it should happen most of > the time for any sane BB order, we would have several FDEs for a region > that only needs one. The bloat goes to the final binary (a lot more FDEs, > specifically, one FDE per basic block). > > BOLT will only split a function in two parts, and only if it has profile. > Most of the time, a function is not split. It also has an option not to > split at all. For internally reordered basic blocks of a given function, it > has CFI deduplication logic (it will interpret and build the CFI states for > each block and rewrite the CFIs in a way that uses the minimum number of > instructions to encode the states for each block). > > > > *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of James Y > Knight via llvm-dev <llvm-dev at lists.llvm.org> > *Reply-To: *James Y Knight <jyknight at google.com> > *Date: *Wednesday, October 2, 2019 at 1:59 PM > *To: *Maksim Panchenko <maks at fb.com> > *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org> > *Subject: *Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link > Optimizations > > > > I'm a bit confused by this subthread -- doesn't BOLT have the exact same > CFI bloat issue? From my cursory reading of the propellor doc, the CFI > duplication is _necessary_ to represent discontiguous functions, not > anything particular to the way Propellor happens to generate those > discontiguous functions. > > > > And emitting discontiguous functions is a fundamental goal of this, right? > > > > On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Thanks for clarifying. This means once you move to the next basic block > (or any other basic > > block in the function) you have to execute an entirely new set of CFI > instructions > > except for the common CIE part. While indeed this is not as bad, on > average, the overall > > active memory footprint will increase. > > > > Creating one FDE per basic block means that .eh_frame_hdr, an allocatable > section, > > will be bloated too. This will increase the FDE lookup time. I don’t see > .eh_frame_hdr > > being mentioned in the proposal. > > > > Maksim > > > > On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com> > wrote: > > > > > > > > On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > *Pessimization/overhead for stack unwinding used by system-wide profilers > and > for exception handling* > > Larger CFI programs put an extra burden on unwinding at runtime as more CFI > (and thus native) instructions have to be executed. This will cause more > overhead for any profiler that records stack traces, and, as you correctly > note > in the proposal, for any program that heavily uses exceptions. > > > > The number of CFI instructions that have to be executed when unwinding any > given stack stays the same. The CFI instructions for a function have to be > duplicated in every basic block section, but when performing unwinding only > one such a set is executed -- the copy for the current basic block. > However, this copy contains precisely the same CFI instructions as the ones > that would have to be executed if there were no basic block sections. > > > > -- > > Krzysztof Pszeniczny > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191022/9b343ddd/attachment.html>
Maybe Matching Threads
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations