Sriraman Tallam via llvm-dev
2019-Oct-14  18:43 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hello, I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document consolidating it. BOLT’s performance gains inspired this work and we believe BOLT is a great piece of engineering. However, there are build environments where scalability is critical and memory limits per process are tight : * Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily invented to achieve scalability and better incremental build times while building large binaries with debug information. * ThinLTO, http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html was primarily invented to make LLVM’s full LTO scalable and keep the memory and time overheads low. ThinLTO has enabled much broader adoption of whole program optimization, by making it non-monolithic. * For Chromium builds, https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha in/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO. It was 26GB with Full LTO before that and individual processes will run of out of memory beyond that. * Here, https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale is shown where 5 million binary and test builds are performed every day on several thousands of machines, each with a limitation of 12G of memory per process and 15 minute time-out on tests. Memory overheads of 35G (clang) are well above these thresholds. We have developed Propeller like ThinLTO that can be used to obtain similar performance gains like BOLT in such environments. Thanks Sri On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > > On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> Is there large value from deferring the block ordering to link time? That >> is, does the block layout algorithm need to consider global layout issues >> when deciding which blocks to put together and which to relegate to the >> far-away part of the code? >> >> Or, could the propellor-optimized compile step instead split each >> function into only 2 pieces -- one containing an "optimally-ordered" set of >> hot blocks from the function, and another containing the cold blocks? The >> linker would have less flexibility in placement, but maybe it doesn't >> actually need that flexibility? >> >> Apologies if this is obvious for those who actually know what they're >> talking about here. :) >> > > It is a fair question. > > We believe the flexibility to do fine grained layout in whole program > context is important. PostLinkOptimization is aimed at getting as much > performance improvement as possible (usually applied on top of > ThinLTO+PGO), so the framework is designed to enable it. > > In particular, it allows the linker to stitch hot bb traces from different > functions to be stitched together. It also allows hot trace duplication > across procedure boundaries (kind of interprocedural tailDup). Besides, > code alignment decisions to minimize branch mispredictions may require > global context (e.g, too conflicting branches residing in two different > functions). Other micro-arch specific optimizations to improve processor > front-end throughput may also require global context. > > It is conceivable to have an option to control the level of granularity at > the possible cost of performance. > > thanks, > > David > > > >> >> On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at fb.com> wrote: >> >>> You’re correct, except that, in Propeller, CFI duplication happens for >>> every basic block as it operates with the conservative assumption that a >>> block can be put anywhere by the linker. That’s a significant bloat that is >>> not cleaned up later. So, during link time, if N blocks from the same >>> function are contiguous in the final layout, as it should happen most of >>> the time for any sane BB order, we would have several FDEs for a region >>> that only needs one. The bloat goes to the final binary (a lot more FDEs, >>> specifically, one FDE per basic block). >>> >>> BOLT will only split a function in two parts, and only if it has >>> profile. Most of the time, a function is not split. It also has an option >>> not to split at all. For internally reordered basic blocks of a given >>> function, it has CFI deduplication logic (it will interpret and build the >>> CFI states for each block and rewrite the CFIs in a way that uses the >>> minimum number of instructions to encode the states for each block). >>> >>> >>> >>> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of James Y >>> Knight via llvm-dev <llvm-dev at lists.llvm.org> >>> *Reply-To: *James Y Knight <jyknight at google.com> >>> *Date: *Wednesday, October 2, 2019 at 1:59 PM >>> *To: *Maksim Panchenko <maks at fb.com> >>> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org> >>> *Subject: *Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link >>> Optimizations >>> >>> >>> >>> I'm a bit confused by this subthread -- doesn't BOLT have the exact same >>> CFI bloat issue? From my cursory reading of the propellor doc, the CFI >>> duplication is _necessary_ to represent discontiguous functions, not >>> anything particular to the way Propellor happens to generate those >>> discontiguous functions. >>> >>> >>> >>> And emitting discontiguous functions is a fundamental goal of this, >>> right? >>> >>> >>> >>> On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>> Thanks for clarifying. This means once you move to the next basic block >>> (or any other basic >>> >>> block in the function) you have to execute an entirely new set of CFI >>> instructions >>> >>> except for the common CIE part. While indeed this is not as bad, on >>> average, the overall >>> >>> active memory footprint will increase. >>> >>> >>> >>> Creating one FDE per basic block means that .eh_frame_hdr, an >>> allocatable section, >>> >>> will be bloated too. This will increase the FDE lookup time. I don’t see >>> .eh_frame_hdr >>> >>> being mentioned in the proposal. >>> >>> >>> >>> Maksim >>> >>> >>> >>> On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com> >>> wrote: >>> >>> >>> >>> >>> >>> >>> >>> On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>> *Pessimization/overhead for stack unwinding used by system-wide >>> profilers and >>> for exception handling* >>> >>> Larger CFI programs put an extra burden on unwinding at runtime as more >>> CFI >>> (and thus native) instructions have to be executed. This will cause more >>> overhead for any profiler that records stack traces, and, as you >>> correctly note >>> in the proposal, for any program that heavily uses exceptions. >>> >>> >>> >>> The number of CFI instructions that have to be executed when unwinding >>> any given stack stays the same. The CFI instructions for a function have to >>> be duplicated in every basic block section, but when performing unwinding >>> only one such a set is executed -- the copy for the current basic block. >>> However, this copy contains precisely the same CFI instructions as the ones >>> that would have to be executed if there were no basic block sections. >>> >>> >>> >>> -- >>> >>> Krzysztof Pszeniczny >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=> >>> >>> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191014/ccb6d0c1/attachment-0001.html> -------------- next part -------------- In summary, BOLT’s performance gains inspired this work and we believe BOLT is a great piece of engineering. However, there are build environments where scalability is critical and memory limits per process are tight : * Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily invented to achieve scalability and better incremental build times while building large binaries with debug information. * ThinLTO, http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html was primarily invented to make LLVM’s full LTO scalable and keep the memory and time overheads low. ThinLTO has enabled much broader adoption of whole program optimization, by making it non-monolithic. * For Chromium builds, https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolchain/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO. It was 26GB with Full LTO before that and individual processes will run of out of memory beyond that. * Here, https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistributedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale is shown where 5 million binary and test builds are performed every day on several thousands of machines, each with a limitation of 12G of memory per process and 15 minute time-out on tests. Memory overheads of 35G (clang) are well above these thresholds. We have developed Propeller like ThinLTO that can be used to obtain similar performance gains like BOLT in such environments. + Evaluation with the most recent BOLT For the results, we based our customized bolt on upstream revision 7e80b300 which is committed on 8/28/2018. Evaluating with the most recent version of BOLT (Oct 4th 2019) on clang binary with ~50M of text requires ~35G of RAM, similar to the older version of BOLT. Memory consumption increases rapidly with binary sizes and on a large binary with 350M of text consumes ~70G of RAM (older BOLT version). Even when running several times changing the thread count from 1 to 70, the running time and memory overhead respectively remain over 198 seconds and 34GB (Multithreading reduces BOLT’s time overhead by at most 15% but also increases memory overhead by 10% to upto 38 GB). + Overheads of CFI TLDR; clang is pathological and the actual CFI bloat will go down from 7x to 2x. Let us present the CFI Bloats for each benchmark with the default option, which is creating basic block sections only for functions with samples. For clang, it is 7x and *not 17x* (the 17 x number is for all sections), and for the large benchmarks it is less than 30% or 1.3x. For large benchmarks, Storage is the highest, going from 18M to 23M, ~30%. Clang is almost pathological here. This is because 10% of all functions in clang have samples (touched by the profile information), and all of these functions get full basic block sections. Whereas, for the large benchmarks, only 0.5% of functions have samples. Now, for clang, 10% of functions that have samples also happen to contain 35% of all basic blocks. This means, we are creating sections for 35% of all basic blocks and the CFI bloats are clearly showing. Now, the data for clang also shows that only 7% of the basic blocks have samples. We are working on restricting basic block sections to only those basic blocks that have samples. The rest of the basic blocks (cold) in that function would share the same section. With this, we would reduce the bloat of CFI from 7x to 2x. This is not hard to do and we will follow up with this patch. Also for object size bloats with regards to eh_frame, the reasoning is similar. Restricting the section creation to only basic blocks that have profiles will reduce this a lot more. Importantly, if CFI support were available for discontiguous ranges we wouldn't have to duplicate CFI FDEs and the bloats would be near minimal. BOLT parses CFI and DWARF and generates compact information by rewriting it. Whereas, Propeller uses lld which uses relocations and sections to fixup but does not rewrite it. This is by design and that lld is not DWARF and CFI aware. We designed basic block sections just like function sections. The compiler produces a bunch of sections and relocations. The linker patches the relocations and the debug info and CFI are right, that's it. For CFI, since there is no support for discontiguous ranges we have to duplicate and dedup FDEs only for blocks with sections. We are asking that CFI support discontiguous ranges and this would look even simpler. Alternately, if lld were made DWARF and CFI aware we could rewrite it compactly like BOLT. These would help with object size bloats and binary size bloats. We would also like to clarify on the misconceptions around CFI Instructions: There are two things that need to be clarified here: 1) Extra CFI FDE entries for basic blocks does not mean more dynamic instructions are executed. In fact, they do not increase at all. Krys talked about this earlier. 2) We do deduplication of common static CFI instructions in the FDE and move it to the CIE . Hence, moving to a new basic block does not mean a completely new set of CFI instructions is executed. + C++ Exceptions and basic block sections for these: TLDR; This can be implemented in propeller easily. It was considered low priority because exception handling code is assumed in general not performance critical . Currently, we do not create basic block sections for any block that is involved in exception handling. There is nothing fundamentally difficult about exception handling basic blocks. The exception table uses address offsets which currently does not use relocations, and we must modify it to use relocations in order to enable sections for exception handling basic blocks. We will re-prioritize and send out a patch to handle exception basic blocks. + Relaxation Pass Efficiency TLDR; Our relaxation pass is similar to what LLVM’s MCAssembler does but with a caveat for efficiency. Our experimental results show it is efficient and convergence is guaranteed. Our relaxation pass is very similar to what MCAssembler does as it needs to solve the same problem, which is laying out basic blocks and optimizing branch instruction sizes based on jump offset. The optimal algorithm to do this is quadratic and it involves computating of section offsets multiple times. Our relaxation pass has a caveat for efficiency. Our relaxation pass recomputes section offsets only after all the sections have been processed. This is repeated until convergence, whereas, the MCAssembler re-computes section offsets every time a section is modified. Our pass works well in the linker as it minimizes the section offset recomputation, which is more expensive to do in the linker (whole program) than doing it in the assembler(per function). We shrink jump instructions aggressively and then fix the wrongly shrunk branches by running a pass that grows them back. We guarantee convergence because we always shrink in the first round and always grow in the second. Data shows that our relaxation is efficient. For example, for clang, the aggressive shrinking affects ~995000 branches in 6 iterations and only 4 branches had to be grown back. + On inconsistencies in performance data TLDR; We have provided a Makefile in the git repo, under llvm-propeller/splo, to compare BOLT and Propeller performance. “make check-performance” with pointers to the BOLT binaries should do the comparisons. In terms of performance optimization for code layout, fundamentally, BOLT and Propeller are similar. In fact, in one experiment, we extracted the basic block order from BOLT and just used basic block sections and linker symbol ordering to perform the same order during linking. The performance was identical and the ordering was exactly as expected. Some recent changes in Propeller has caused it to regress by a small amount and is sometimes getting hidden by the small noise. We are debugging the performance issue and it looks like an alignment issue or a second order effect. Currently, because of this bug, BOLT is faster than Propeller by about 0.5% to 1% on Clang benchmark. + Debug Info Overhead Large binaries are built using debug fission to as it significantly improves the scalability of debug builds, https://gcc.gnu.org/wiki/DebugFission. Quoting from the page which again reiterates the importance of scalability and incremental builds, Fission was invented to mitigate the following issues for debug builds of large applications: * Out of memory errors * Slow link times * Slow gdb startup times Debug Fission essentially splits the dwarf info into a separate object file and the link action memory overheads are smaller as the input object files and the resulting binary are smaller. Propeller supports both fission and non-fission builds. Even with fission, the dwarf objects are not relocatable, so the extra relocations needed by Propeller would still go into the main object file that contains code. Also, Propeller uses extra relocations for location lists too. We looked at BOLT and it does not seem to have any support for binaries built with fission, something we missed mentioning in the original RFC. Fission supports a mode which allows the individual dwarf objects, dwo files, to be linked into a single dwp file and BOLT would have to also rewrite the fission object file and likely in the same process. Another concern raised was regarding debugging overhead. The specific concern here is that having more debug ranges with basic block sections might increase overheads of debugging tools. We have not evaluated this, any benchmarks that could measure these overheads? Also, this can be mitigated by just exposing debug_ranges to lld. The various ranges created for each basic block are contiguous as most basic blocks end up together and these ranges can be collapsed by the linker into a larger range. + Plan for adding future optimizations TLDR: The model being proposed in this RFC is a first and necessary step for enabling future optimizations in a scalable way. Propeller is designed to be like ThinLTO. ThinLTO breaks down whole-program optimizations into two steps, a thin-link step generates summaries which are used to guide the distributed/parallel backend actions step on how to transform the individual modules. Similarly, Propeller would use a thin-link like analysis step to determine how to modify individual IR objects and guide distributed/parallel back-end actions on transformations. The goal is to keep the whole-program step light-weight and avoid expensive code analysis and transformations which would bloat memory. New optimizations are being developed on Propeller and there will be more scrutiny and discussions on how they are implemented to keep it scalable which is a good thing. + Overhead of backend code generation We have collected data for backend actions for the clang binary. Running in parallel (standalone machine) increases the link time overhead by 40% from 38secs to 54secs and the memory overheads by ~50% from 3GB to 4.5GB, still way smaller than monolithic approaches. Distributing these backend actions in a distributed build system would not increase memory overheads, each action is a separate process. We will include the data in the camera ready version.
Maksim Panchenko via llvm-dev
2019-Oct-16  22:52 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hi Sri, I want to clarify one thing before sending a detailed reply: did you evaluate BOLT on Clang built with basic block sections? In the makefile you reference, there are two versions: a “vanilla” and a default built with function sections. High overheads you see with BOLT on Clang do not match our experience. Thanks, Maksim On 10/14/19, 11:44 AM, "llvm-dev on behalf of Sriraman Tallam via llvm-dev" <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces at lists.llvm.org> on behalf of llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: Hello, I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document consolidating it. BOLT’s performance gains inspired this work and we believe BOLT is a great piece of engineering. However, there are build environments where scalability is critical and memory limits per process are tight : * Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily invented to achieve scalability and better incremental build times while building large binaries with debug information. * ThinLTO, http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html<https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2016_06_thinlto-2Dscalable-2Dand-2Dincremental-2Dlto.html&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=rW9yHyu5DPla9M38HolcW_w_Md8TLqe53BTWIClBxO4&e=> was primarily invented to make LLVM’s full LTO scalable and keep the memory and time overheads low. ThinLTO has enabled much broader adoption of whole program optimization, by making it non-monolithic. * For Chromium builds, https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha<https://urldefense.proofpoint.com/v2/url?u=https-3A__chromium-2Dreview.googlesource.com_c_chromium_src_-2B_695714_3_build_toolcha&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=8EBzmSqxfeVJXXXFKkx4Mzkf5d6cucxPc9pXkF36v_o&e=> in/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO. It was 26GB with Full LTO before that and individual processes will run of out of memory beyond that. * Here, https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib<https://urldefense.proofpoint.com/v2/url?u=https-3A__gotocon.com_dl_goto-2Dchicago-2D2016_slides_AysyluGreenberg-5FBuildingADistrib&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=lH-bp7s0QTtCyJkcSTOL8B_wOsRw-SLGrFsbZLmjaxQ&e=> utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale is shown where 5 million binary and test builds are performed every day on several thousands of machines, each with a limitation of 12G of memory per process and 15 minute time-out on tests. Memory overheads of 35G (clang) are well above these thresholds. We have developed Propeller like ThinLTO that can be used to obtain similar performance gains like BOLT in such environments. Thanks Sri On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code? Or, could the propellor-optimized compile step instead split each function into only 2 pieces -- one containing an "optimally-ordered" set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn't actually need that flexibility? Apologies if this is obvious for those who actually know what they're talking about here. :) It is a fair question. We believe the flexibility to do fine grained layout in whole program context is important. PostLinkOptimization is aimed at getting as much performance improvement as possible (usually applied on top of ThinLTO+PGO), so the framework is designed to enable it. In particular, it allows the linker to stitch hot bb traces from different functions to be stitched together. It also allows hot trace duplication across procedure boundaries (kind of interprocedural tailDup). Besides, code alignment decisions to minimize branch mispredictions may require global context (e.g, too conflicting branches residing in two different functions). Other micro-arch specific optimizations to improve processor front-end throughput may also require global context. It is conceivable to have an option to control the level of granularity at the possible cost of performance. thanks, David On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at fb.com<mailto:rafaelauler at fb.com>> wrote: You’re correct, except that, in Propeller, CFI duplication happens for every basic block as it operates with the conservative assumption that a block can be put anywhere by the linker. That’s a significant bloat that is not cleaned up later. So, during link time, if N blocks from the same function are contiguous in the final layout, as it should happen most of the time for any sane BB order, we would have several FDEs for a region that only needs one. The bloat goes to the final binary (a lot more FDEs, specifically, one FDE per basic block). BOLT will only split a function in two parts, and only if it has profile. Most of the time, a function is not split. It also has an option not to split at all. For internally reordered basic blocks of a given function, it has CFI deduplication logic (it will interpret and build the CFI states for each block and rewrite the CFIs in a way that uses the minimum number of instructions to encode the states for each block). From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces at lists.llvm.org>> on behalf of James Y Knight via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Reply-To: James Y Knight <jyknight at google.com<mailto:jyknight at google.com>> Date: Wednesday, October 2, 2019 at 1:59 PM To: Maksim Panchenko <maks at fb.com<mailto:maks at fb.com>> Cc: "llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>" <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations I'm a bit confused by this subthread -- doesn't BOLT have the exact same CFI bloat issue? From my cursory reading of the propellor doc, the CFI duplication is _necessary_ to represent discontiguous functions, not anything particular to the way Propellor happens to generate those discontiguous functions. And emitting discontiguous functions is a fundamental goal of this, right? On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: Thanks for clarifying. This means once you move to the next basic block (or any other basic block in the function) you have to execute an entirely new set of CFI instructions except for the common CIE part. While indeed this is not as bad, on average, the overall active memory footprint will increase. Creating one FDE per basic block means that .eh_frame_hdr, an allocatable section, will be bloated too. This will increase the FDE lookup time. I don’t see .eh_frame_hdr being mentioned in the proposal. Maksim On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com<mailto:kpszeniczny at google.com>> wrote: On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: *Pessimization/overhead for stack unwinding used by system-wide profilers and for exception handling* Larger CFI programs put an extra burden on unwinding at runtime as more CFI (and thus native) instructions have to be executed. This will cause more overhead for any profiler that records stack traces, and, as you correctly note in the proposal, for any program that heavily uses exceptions. The number of CFI instructions that have to be executed when unwinding any given stack stays the same. The CFI instructions for a function have to be duplicated in every basic block section, but when performing unwinding only one such a set is executed -- the copy for the current basic block. However, this copy contains precisely the same CFI instructions as the ones that would have to be executed if there were no basic block sections. -- Krzysztof Pszeniczny _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=> _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191016/64fdafc0/attachment.html>
Sriraman Tallam via llvm-dev
2019-Oct-17  21:58 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hello Maksim, On Wed, Oct 16, 2019 at 3:52 PM Maksim Panchenko <maks at fb.com> wrote:> Hi Sri, > > > > I want to clarify one thing before sending a detailed reply: did you > evaluate > > BOLT on Clang built with basic block sections? >In the makefile you reference,> > there are two versions: a “vanilla” and a default built with function > sections. > > High overheads you see with BOLT on Clang do not match our experience. >Thanks for pointing that out in the Makefile. We double-checked and noticed a bug in our Makefile. For clang, we noticed that we are BOLTING with basic block symbols which seems to affect the memory consumption of BOLT. So, we have re-measured with recent bolt and for *full transparency* we have uploaded the binaries, BOLT's yaml files and perf.data files and the commands so that you can independently verify our results and check the binaries. We have gzipped all the files to keep it under 2G limit for git lfs, everything is here : https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment We have run our experiments on a 192G machine with Intel 18 core. We built llvm-bolt with most recent sources and is *pristine* with none of our patches and uploading the binary we used here, https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt That's a very recent BOLT binary, git hash: 988a7e8819b882fd14e18d149f8d3f702b134680 The https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2} contains two sets of binaries. The first binary is pristine recent clang-10 built from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary also only additionally built with -q. There are no labels or sections or any other Propeller flags used to build these clang binaries. Here is the command we are using to optimize with BOLT, all the commands have been checked in too. You should be able to run llvm-bolt now on these binaries as all the files are provided. We have also provided the raw perf data files in case you want to independently convert. $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1 --update-debug-sections For version 2, this is the number: Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40 Maximum resident set size (kbytes): 18742688 That is 125 seconds and ~18G of RAM. For version 1, this hangs and we stopped it after several minutes and the maximum RSS size crossing 50G. This is most likely just a bug and you should be able to reproduce this. The binary seems to be running and printing update messages. We also measured without update-debug-sections too with the command : $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1 For version1 : Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74 Maximum resident set size (kbytes): 14824444 93 seconds and ~14G of RAM version 2 : Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90 Maximum resident set size (kbytes): 14511912 similar 91 secs and ~14G Now, coming back to the bug in the Makefile, we originally reported ~35G. That is *wrong* since the clang binary used to measure bolt overheads was built with basic block labels. Our *sincere apologies* for this, this showed BOLT as consuming more memory than is actual for clang. We double-checked BOLT numbers with the internal benchmark search2 for sanity and that is built *without any labels* and only with "-Wl,-q". We are checking the other large internal benchmarks too. We cannot disclose internal benchmarks. So, we will get more large open-source benchmarks like Chrome or gcc built with bolt and share the binaries and results so you can independently verify. Thanks Sri> > > Thanks, > > Maksim > > > > On 10/14/19, 11:44 AM, "llvm-dev on behalf of Sriraman Tallam via > llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of > llvm-dev at lists.llvm.org> wrote: > > > > Hello, > > > > I wanted to consolidate all the discussions and our final thoughts on the > concerns raised. I have attached a document consolidating it. > > > > BOLT’s performance gains inspired this work and we believe BOLT > > is a great piece of engineering. However, there are build environments > where > scalability is critical and memory limits per process are tight : > > * Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily > invented to achieve scalability and better incremental build times while > building large binaries with debug information. > > * ThinLTO, > http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html > <https://urldefense.proofpoint.com/v2/url?u=http-3A__blog.llvm.org_2016_06_thinlto-2Dscalable-2Dand-2Dincremental-2Dlto.html&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=rW9yHyu5DPla9M38HolcW_w_Md8TLqe53BTWIClBxO4&e=> > was > primarily invented to make LLVM’s full LTO scalable and keep the memory > and > time overheads low. ThinLTO has enabled much broader adoption of whole > program optimization, by making it non-monolithic. > > * For Chromium builds, > > https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha > <https://urldefense.proofpoint.com/v2/url?u=https-3A__chromium-2Dreview.googlesource.com_c_chromium_src_-2B_695714_3_build_toolcha&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=8EBzmSqxfeVJXXXFKkx4Mzkf5d6cucxPc9pXkF36v_o&e=> > in/concurrent_links.gni, the linker process memory is set to 10GB with > ThinLTO. > It was 26GB with Full LTO before that and individual processes will run of > out > of memory beyond that. > > * Here, > > https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib > <https://urldefense.proofpoint.com/v2/url?u=https-3A__gotocon.com_dl_goto-2Dchicago-2D2016_slides_AysyluGreenberg-5FBuildingADistrib&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=lH-bp7s0QTtCyJkcSTOL8B_wOsRw-SLGrFsbZLmjaxQ&e=> > utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google > scale > is shown where 5 million binary and test builds are performed every day on > several thousands of machines, each with a limitation of 12G of memory > per > process and 15 minute time-out on tests. Memory overheads of 35G (clang) > are > well above these thresholds. > > We have developed Propeller like ThinLTO that can be used to obtain > similar > performance gains like BOLT in such environments. > > > > Thanks > > Sri > > > > > > On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > > > > > On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Is there large value from deferring the block ordering to link time? That > is, does the block layout algorithm need to consider global layout issues > when deciding which blocks to put together and which to relegate to the > far-away part of the code? > > > > Or, could the propellor-optimized compile step instead split each function > into only 2 pieces -- one containing an "optimally-ordered" set of hot > blocks from the function, and another containing the cold blocks? The > linker would have less flexibility in placement, but maybe it doesn't > actually need that flexibility? > > > > Apologies if this is obvious for those who actually know what they're > talking about here. :) > > > > It is a fair question. > > > > We believe the flexibility to do fine grained layout in whole program > context is important. PostLinkOptimization is aimed at getting as much > performance improvement as possible (usually applied on top of > ThinLTO+PGO), so the framework is designed to enable it. > > > > In particular, it allows the linker to stitch hot bb traces from different > functions to be stitched together. It also allows hot trace duplication > across procedure boundaries (kind of interprocedural tailDup). Besides, > code alignment decisions to minimize branch mispredictions may require > global context (e.g, too conflicting branches residing in two different > functions). Other micro-arch specific optimizations to improve processor > front-end throughput may also require global context. > > > > It is conceivable to have an option to control the level of granularity at > the possible cost of performance. > > > > thanks, > > > > David > > > > > > > > On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <rafaelauler at fb.com> wrote: > > You’re correct, except that, in Propeller, CFI duplication happens for > every basic block as it operates with the conservative assumption that a > block can be put anywhere by the linker. That’s a significant bloat that is > not cleaned up later. So, during link time, if N blocks from the same > function are contiguous in the final layout, as it should happen most of > the time for any sane BB order, we would have several FDEs for a region > that only needs one. The bloat goes to the final binary (a lot more FDEs, > specifically, one FDE per basic block). > > BOLT will only split a function in two parts, and only if it has profile. > Most of the time, a function is not split. It also has an option not to > split at all. For internally reordered basic blocks of a given function, it > has CFI deduplication logic (it will interpret and build the CFI states for > each block and rewrite the CFIs in a way that uses the minimum number of > instructions to encode the states for each block). > > > > *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of James Y > Knight via llvm-dev <llvm-dev at lists.llvm.org> > *Reply-To: *James Y Knight <jyknight at google.com> > *Date: *Wednesday, October 2, 2019 at 1:59 PM > *To: *Maksim Panchenko <maks at fb.com> > *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org> > *Subject: *Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link > Optimizations > > > > I'm a bit confused by this subthread -- doesn't BOLT have the exact same > CFI bloat issue? From my cursory reading of the propellor doc, the CFI > duplication is _necessary_ to represent discontiguous functions, not > anything particular to the way Propellor happens to generate those > discontiguous functions. > > > > And emitting discontiguous functions is a fundamental goal of this, right? > > > > On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Thanks for clarifying. This means once you move to the next basic block > (or any other basic > > block in the function) you have to execute an entirely new set of CFI > instructions > > except for the common CIE part. While indeed this is not as bad, on > average, the overall > > active memory footprint will increase. > > > > Creating one FDE per basic block means that .eh_frame_hdr, an allocatable > section, > > will be bloated too. This will increase the FDE lookup time. I don’t see > .eh_frame_hdr > > being mentioned in the proposal. > > > > Maksim > > > > On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com> > wrote: > > > > > > > > On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > *Pessimization/overhead for stack unwinding used by system-wide profilers > and > for exception handling* > > Larger CFI programs put an extra burden on unwinding at runtime as more CFI > (and thus native) instructions have to be executed. This will cause more > overhead for any profiler that records stack traces, and, as you correctly > note > in the proposal, for any program that heavily uses exceptions. > > > > The number of CFI instructions that have to be executed when unwinding any > given stack stays the same. The CFI instructions for a function have to be > duplicated in every basic block section, but when performing unwinding only > one such a set is executed -- the copy for the current basic block. > However, this copy contains precisely the same CFI instructions as the ones > that would have to be executed if there were no basic block sections. > > > > -- > > Krzysztof Pszeniczny > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > <https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=BOTyGbKXpK1kdAvdQF0QoVsl4A5BCIQJMEEXJRVW6To&s=nI5BAwXY8FSk9XDrnUJfuretr1UXOXiiqKNvmEDjDXo&e=> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191017/704cfea7/attachment-0001.html>
Seemingly Similar Threads
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations