Sriraman Tallam via llvm-dev
2019-Sep-24 23:51 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Greetings, We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer framework, on large google benchmarks and noticed that it improves key performance metrics of these benchmarks by 2% to 6%, which is pretty impressive as this is over and above a baseline binaryalready heavily optimized with ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is also profile guided and does very aggressive performance optimizations, there is more room for performance improvements due to profile approximations while applying the transformations. BOLT uses exact profiles from the final binary and is able to fill the gaps left by ThinLTO + PGO. The performance improvements due to BOLT come from basic block layout, function reordering and function splitting. While BOLT does an excellent job of squeezing extra performance from highly optimized binaries with optimizations such as code layout, it has these major issues: * It does not take advantage of distributed build systems. * It has scalability issues and to rewrite a binary with a ~300M text segment size: * Memory foot-print is 70G. * It takes more than 10 minutes to rewrite the binary. Similar to Full LTO, BOLT’s design is monolithic as it disassembles the original binary, optimizes and rewrites the final binary in one process. This limits the scalability of BOLT and the memory and time overhead shoots up quickly for large binaries. Inspired by the performance gains and to address the scalability issue of BOLT, we went about designing a scalable infrastructure that can perform BOLT-like post-link optimizations. In this RFC, we discuss our system, “Propeller”, which can perform profile guided link time binary optimizations in a scalable way and is friendly to distributed build systems. Our system leverages the existing capabilities of the compiler tool-chain and is not a stand alone tool. Like BOLT, our system boosts the performance of optimized binaries via link-time optimizations using accurate profiles of the binary. We discuss the Propeller system and show how to do the whole program basic block layout using Propeller. Propeller does whole program basic block layout at link time via basic block sections. We have added support for having each basic block in its own section which allows the linker to do arbitrary reorderings of basic blocks to achieve any desired fine-grain code layout which includes block layout, function splitting and function reordering. Our experiments on large real-world applications and SPEC with code layout show that Propeller can optimize as effectively as BOLT, with just 20% of its memory footprint and time overhead. An LLVM branch with propeller patches is available in the git repository here: https://github.com/google/llvm-propeller/ We will upload individual patches of the various elements for review. We have attached a google doc describing the Propeller system with Experiments in detail, https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf Quick Start - How to optimize further with Propeller? # git clone and build repo $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git $ mkdir $BUILD_DIR && cd $BUILD_DIR $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ $LLVM_DIR/llvm-propeller/llvm $ ninja -j25 $ export PATH=$BUILD_DIR/bin:$PATH # Let’s Propeller-optimize the following program: # Step 1: Build the peak optimized binary with an additional flag. $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld # Step 2: Profile the binary, only one side of the branch is executed. $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& /dev/null # Step 3: Convert the profiles using the tool provided $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ --binary=./a.out.labels --profile=perf.data --out=perf.propeller # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed. $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld In Step 4, the optimized bit code can be used if it is saved in Step1 as Propeller is active only during compile backend and link. The optimized binary has a different layout of the basic blocks in main to keep the executed blocks together and split the cold blocks. Some of the key points: + Added support for basic block sections, similar to function sections, where each basic block can reside in its own section. + Jump instructions need 32-bit relocations and subsequent linker relaxations after basic block layout. We would like to add a new relocation type for jump instructions to make it easier for relaxations and guarantee correctness. + Added support in the linker to read profiles (PMU LBR) and discover optimal basic block layout using the Ex-TSP algorithm described here: https://arxiv.org/abs/1809.04676 + Added support in the linker to re-order basic block sections in any specified sequence (can be done with symbol ordering file). This requires linker relaxations to delete and shrink branches across basic blocks. + Compared our system to BOLT and have shown that our system can produce similar performance improvements as BOLT but with much less memory and time overheads. Our experiments are on very large warehouse-scale benchmarks and SPEC 2017. + Listed why this cannot be done as part of PGO itself. Post Link optimizations are able to transform the binary using accurate profiles and PGO passes suffer from profile imprecision. + Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks via basic block sections. + Discussed CFI handling and is sub-optimal and bloats object file sizes and binary sizes much more than DebugInfo due to lack of support for discontiguous address ranges. We have talked about this and would like to make a case to support discontiguous ranges with CFI which will make basic block sections much more cheaper. Detailed RFC document here : https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf Please let us know what you think, Thanks Sri on behalf of the team.
Mehdi AMINI via llvm-dev
2019-Sep-25 08:36 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hi Sriraman, This is an impressive piece of work! The results look really good, and the document you provided is very thorough. Looking forward to the patches :) Best, -- Mehdi On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Greetings, > > We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer > framework, on large google benchmarks and noticed that it improves key > performance metrics of these benchmarks by 2% to 6%, which is pretty > impressive > as this is over and above a baseline binaryalready heavily optimized with > ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance > of > binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is > also > profile guided and does very aggressive performance optimizations, there is > more room for performance improvements due to profile approximations while > applying the transformations. BOLT uses exact profiles from the final > binary > and is able to fill the gaps left by ThinLTO + PGO. The performance > improvements due to BOLT come from basic block layout, function reordering > and > function splitting. > > While BOLT does an excellent job of squeezing extra performance from highly > optimized binaries with optimizations such as code layout, it has these > major > issues: > > * It does not take advantage of distributed build systems. > * It has scalability issues and to rewrite a binary with a ~300M text > segment > size: > * Memory foot-print is 70G. > * It takes more than 10 minutes to rewrite the binary. > > Similar to Full LTO, BOLT’s design is monolithic as it disassembles the > original binary, optimizes and rewrites the final binary in one process. > This > limits the scalability of BOLT and the memory and time overhead shoots up > quickly for large binaries. > > Inspired by the performance gains and to address the scalability issue of > BOLT, > we went about designing a scalable infrastructure that can perform > BOLT-like > post-link optimizations. In this RFC, we discuss our system, “Propeller”, > which can perform profile guided link time binary optimizations in a > scalable > way and is friendly to distributed build systems. Our system leverages the > existing capabilities of the compiler tool-chain and is not a stand alone > tool. > Like BOLT, our system boosts the performance of optimized binaries via > link-time optimizations using accurate profiles of the binary. We discuss > the > Propeller system and show how to do the whole program basic block layout > using > Propeller. > > Propeller does whole program basic block layout at link time via basic > block > sections. We have added support for having each basic block in its own > section > which allows the linker to do arbitrary reorderings of basic blocks to > achieve > any desired fine-grain code layout which includes block layout, function > splitting and function reordering. Our experiments on large real-world > applications and SPEC with code layout show that Propeller can optimize as > effectively as BOLT, with just 20% of its memory footprint and time > overhead. > > An LLVM branch with propeller patches is available in the git repository > here: > https://github.com/google/llvm-propeller/ We will upload individual > patches of > the various elements for review. We have attached a google doc describing > the > Propeller system with Experiments in detail, > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > > Quick Start - How to optimize further with Propeller? > > # git clone and build repo > > $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git > > $ mkdir $BUILD_DIR && cd $BUILD_DIR > > $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ > $LLVM_DIR/llvm-propeller/llvm > > $ ninja -j25 > > $ export PATH=$BUILD_DIR/bin:$PATH > > > # Let’s Propeller-optimize the following program: > > > # Step 1: Build the peak optimized binary with an additional flag. > > $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels > -fuse-ld=lld > > # Step 2: Profile the binary, only one side of the branch is executed. > $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& > /dev/null > > > # Step 3: Convert the profiles using the tool provided > $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ > --binary=./a.out.labels --profile=perf.data --out=perf.propeller > > > # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag > changed. > $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller > -fuse-ld=lld > > In Step 4, the optimized bit code can be used if it is saved in Step1 as > Propeller is active only during compile backend and link. The optimized > binary > has a different layout of the basic blocks in main to keep the executed > blocks > together and split the cold blocks. > > Some of the key points: > > + Added support for basic block sections, similar to function sections, > where > each basic block can reside in its own section. > > + Jump instructions need 32-bit relocations and subsequent linker > relaxations > after basic block layout. We would like to add a new relocation type for > jump > instructions to make it easier for relaxations and guarantee correctness. > > + Added support in the linker to read profiles (PMU LBR) and discover > optimal > basic block layout using the Ex-TSP algorithm described here: > https://arxiv.org/abs/1809.04676 > > + Added support in the linker to re-order basic block sections in any > specified sequence (can be done with symbol ordering file). This requires > linker relaxations to delete and shrink branches across basic blocks. > > + Compared our system to BOLT and have shown that our system can produce > similar performance improvements as BOLT but with much less memory and time > overheads. Our experiments are on very large warehouse-scale benchmarks > and > SPEC 2017. > > + Listed why this cannot be done as part of PGO itself. Post Link > optimizations are able to transform the binary using accurate profiles and > PGO > passes suffer from profile imprecision. > > + Updated DebugInfo and CFI to account for arbitrary ordering of basic > blocks > via basic block sections. > > + Discussed CFI handling and is sub-optimal and bloats object file sizes > and > binary sizes much more than DebugInfo due to lack of support for > discontiguous > address ranges. We have talked about this and would like to make a case to > support discontiguous ranges with CFI which will make basic block sections > much > more cheaper. > > Detailed RFC document here : > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > Please let us know what you think, > Thanks > Sri on behalf of the team. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190925/48752c98/attachment.html>
Michael Spencer via llvm-dev
2019-Sep-25 22:44 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Greetings, > > We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer > framework, on large google benchmarks and noticed that it improves key > performance metrics of these benchmarks by 2% to 6%, which is pretty > impressive > as this is over and above a baseline binaryalready heavily optimized with > ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance > of > binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is > also > profile guided and does very aggressive performance optimizations, there is > more room for performance improvements due to profile approximations while > applying the transformations. BOLT uses exact profiles from the final > binary > and is able to fill the gaps left by ThinLTO + PGO. The performance > improvements due to BOLT come from basic block layout, function reordering > and > function splitting. > > While BOLT does an excellent job of squeezing extra performance from highly > optimized binaries with optimizations such as code layout, it has these > major > issues: > > * It does not take advantage of distributed build systems. > * It has scalability issues and to rewrite a binary with a ~300M text > segment > size: > * Memory foot-print is 70G. > * It takes more than 10 minutes to rewrite the binary. > > Similar to Full LTO, BOLT’s design is monolithic as it disassembles the > original binary, optimizes and rewrites the final binary in one process. > This > limits the scalability of BOLT and the memory and time overhead shoots up > quickly for large binaries. > > Inspired by the performance gains and to address the scalability issue of > BOLT, > we went about designing a scalable infrastructure that can perform > BOLT-like > post-link optimizations. In this RFC, we discuss our system, “Propeller”, > which can perform profile guided link time binary optimizations in a > scalable > way and is friendly to distributed build systems. Our system leverages the > existing capabilities of the compiler tool-chain and is not a stand alone > tool. > Like BOLT, our system boosts the performance of optimized binaries via > link-time optimizations using accurate profiles of the binary. We discuss > the > Propeller system and show how to do the whole program basic block layout > using > Propeller. > > Propeller does whole program basic block layout at link time via basic > block > sections. We have added support for having each basic block in its own > section > which allows the linker to do arbitrary reorderings of basic blocks to > achieve > any desired fine-grain code layout which includes block layout, function > splitting and function reordering. Our experiments on large real-world > applications and SPEC with code layout show that Propeller can optimize as > effectively as BOLT, with just 20% of its memory footprint and time > overhead. > > An LLVM branch with propeller patches is available in the git repository > here: > https://github.com/google/llvm-propeller/ We will upload individual > patches of > the various elements for review. We have attached a google doc describing > the > Propeller system with Experiments in detail, > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > > Quick Start - How to optimize further with Propeller? > > # git clone and build repo > > $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git > > $ mkdir $BUILD_DIR && cd $BUILD_DIR > > $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ > $LLVM_DIR/llvm-propeller/llvm > > $ ninja -j25 > > $ export PATH=$BUILD_DIR/bin:$PATH > > > # Let’s Propeller-optimize the following program: > > > # Step 1: Build the peak optimized binary with an additional flag. > > $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels > -fuse-ld=lld > > # Step 2: Profile the binary, only one side of the branch is executed. > $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& > /dev/null > > > # Step 3: Convert the profiles using the tool provided > $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ > --binary=./a.out.labels --profile=perf.data --out=perf.propeller > > > # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag > changed. > $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller > -fuse-ld=lld > > In Step 4, the optimized bit code can be used if it is saved in Step1 as > Propeller is active only during compile backend and link. The optimized > binary > has a different layout of the basic blocks in main to keep the executed > blocks > together and split the cold blocks. > > Some of the key points: > > + Added support for basic block sections, similar to function sections, > where > each basic block can reside in its own section. > > + Jump instructions need 32-bit relocations and subsequent linker > relaxations > after basic block layout. We would like to add a new relocation type for > jump > instructions to make it easier for relaxations and guarantee correctness. > > + Added support in the linker to read profiles (PMU LBR) and discover > optimal > basic block layout using the Ex-TSP algorithm described here: > https://arxiv.org/abs/1809.04676 > > + Added support in the linker to re-order basic block sections in any > specified sequence (can be done with symbol ordering file). This requires > linker relaxations to delete and shrink branches across basic blocks. > > + Compared our system to BOLT and have shown that our system can produce > similar performance improvements as BOLT but with much less memory and time > overheads. Our experiments are on very large warehouse-scale benchmarks > and > SPEC 2017. > > + Listed why this cannot be done as part of PGO itself. Post Link > optimizations are able to transform the binary using accurate profiles and > PGO > passes suffer from profile imprecision. > > + Updated DebugInfo and CFI to account for arbitrary ordering of basic > blocks > via basic block sections. > > + Discussed CFI handling and is sub-optimal and bloats object file sizes > and > binary sizes much more than DebugInfo due to lack of support for > discontiguous > address ranges. We have talked about this and would like to make a case to > support discontiguous ranges with CFI which will make basic block sections > much > more cheaper. > > Detailed RFC document here : > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > Please let us know what you think, > Thanks > Sri on behalf of the team. >The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have ( https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp )? I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one. - Michael Spencer -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190925/b8c8fad6/attachment-0001.html>
Eli Friedman via llvm-dev
2019-Sep-26 00:02 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
My biggest question about this architecture is about when propeller runs basic block reordering within a function. It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm not sure that's the right approach, long-term. Splitting every basic block into its own section introduces overhead, like you described. And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches. And the reordering itself involves a bunch of x86 and ELF-specific code. I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer. MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions. And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards. This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows. There are some minor downsides to this approach I can think of. You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description? It might take more CPU time to re-do LTO code generation, but that's probably not a big deal for applications where the user is willing to collect two different kinds PGO of profiles for a program. I understand you can't use a regular PGO profile for this, but it should be possible to pass in a Propeller-PGO profile separately to the compiler, and apply that to the MIR as a late pass in the compiler, separate from any regular PGO profile. There are probably multiple ways you could do this; you could use basic block symbols, like your current implementation. -Eli -----Original Message----- From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Sriraman Tallam via llvm-dev Sent: Tuesday, September 24, 2019 4:51 PM To: llvm-dev at lists.llvm.org Subject: [EXT] [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations Greetings, We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer framework, on large google benchmarks and noticed that it improves key performance metrics of these benchmarks by 2% to 6%, which is pretty impressive as this is over and above a baseline binaryalready heavily optimized with ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is also profile guided and does very aggressive performance optimizations, there is more room for performance improvements due to profile approximations while applying the transformations. BOLT uses exact profiles from the final binary and is able to fill the gaps left by ThinLTO + PGO. The performance improvements due to BOLT come from basic block layout, function reordering and function splitting. While BOLT does an excellent job of squeezing extra performance from highly optimized binaries with optimizations such as code layout, it has these major issues: * It does not take advantage of distributed build systems. * It has scalability issues and to rewrite a binary with a ~300M text segment size: * Memory foot-print is 70G. * It takes more than 10 minutes to rewrite the binary. Similar to Full LTO, BOLT’s design is monolithic as it disassembles the original binary, optimizes and rewrites the final binary in one process. This limits the scalability of BOLT and the memory and time overhead shoots up quickly for large binaries. Inspired by the performance gains and to address the scalability issue of BOLT, we went about designing a scalable infrastructure that can perform BOLT-like post-link optimizations. In this RFC, we discuss our system, “Propeller”, which can perform profile guided link time binary optimizations in a scalable way and is friendly to distributed build systems. Our system leverages the existing capabilities of the compiler tool-chain and is not a stand alone tool. Like BOLT, our system boosts the performance of optimized binaries via link-time optimizations using accurate profiles of the binary. We discuss the Propeller system and show how to do the whole program basic block layout using Propeller. Propeller does whole program basic block layout at link time via basic block sections. We have added support for having each basic block in its own section which allows the linker to do arbitrary reorderings of basic blocks to achieve any desired fine-grain code layout which includes block layout, function splitting and function reordering. Our experiments on large real-world applications and SPEC with code layout show that Propeller can optimize as effectively as BOLT, with just 20% of its memory footprint and time overhead. An LLVM branch with propeller patches is available in the git repository here: https://github.com/google/llvm-propeller/ We will upload individual patches of the various elements for review. We have attached a google doc describing the Propeller system with Experiments in detail, https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf Quick Start - How to optimize further with Propeller? # git clone and build repo $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git $ mkdir $BUILD_DIR && cd $BUILD_DIR $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ $LLVM_DIR/llvm-propeller/llvm $ ninja -j25 $ export PATH=$BUILD_DIR/bin:$PATH # Let’s Propeller-optimize the following program: # Step 1: Build the peak optimized binary with an additional flag. $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld # Step 2: Profile the binary, only one side of the branch is executed. $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& /dev/null # Step 3: Convert the profiles using the tool provided $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ --binary=./a.out.labels --profile=perf.data --out=perf.propeller # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed. $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld In Step 4, the optimized bit code can be used if it is saved in Step1 as Propeller is active only during compile backend and link. The optimized binary has a different layout of the basic blocks in main to keep the executed blocks together and split the cold blocks. Some of the key points: + Added support for basic block sections, similar to function sections, where each basic block can reside in its own section. + Jump instructions need 32-bit relocations and subsequent linker relaxations after basic block layout. We would like to add a new relocation type for jump instructions to make it easier for relaxations and guarantee correctness. + Added support in the linker to read profiles (PMU LBR) and discover optimal basic block layout using the Ex-TSP algorithm described here: https://arxiv.org/abs/1809.04676 + Added support in the linker to re-order basic block sections in any specified sequence (can be done with symbol ordering file). This requires linker relaxations to delete and shrink branches across basic blocks. + Compared our system to BOLT and have shown that our system can produce similar performance improvements as BOLT but with much less memory and time overheads. Our experiments are on very large warehouse-scale benchmarks and SPEC 2017. + Listed why this cannot be done as part of PGO itself. Post Link optimizations are able to transform the binary using accurate profiles and PGO passes suffer from profile imprecision. + Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks via basic block sections. + Discussed CFI handling and is sub-optimal and bloats object file sizes and binary sizes much more than DebugInfo due to lack of support for discontiguous address ranges. We have talked about this and would like to make a case to support discontiguous ranges with CFI which will make basic block sections much more cheaper. Detailed RFC document here : https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf Please let us know what you think, Thanks Sri on behalf of the team. _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Sriraman Tallam via llvm-dev
2019-Sep-26 00:18 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman <efriedma at quicinc.com> wrote:> > My biggest question about this architecture is about when propeller runs basic block reordering within a function. It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm not sure that's the right approach, long-term. > > Splitting every basic block into its own section introduces overhead, like you described. And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches. And the reordering itself involves a bunch of x86 and ELF-specific code. > > I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer. MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions. And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards. This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows. > > There are some minor downsides to this approach I can think of. You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?I have mentioned this in the RFC under alternate design in the Experments. We did consider this approach. This is our first attempt at doing this end to end. We plan to do much more. We are looking at much more aggressive inter-procedural code layout optimizations based on path profiles and machine learning, which requires having support for basic block sections. Being able to do inter-procedural block layout is already showing more performance improvements over what we have right now. We don't think basic block sections is expensive if done on demand and there are opportunities for performance exploration with this support. It might take more CPU time to re-do LTO code generation, but that's probably not a big deal for applications where the user is willing to collect two different kinds PGO of profiles for a program.> > I understand you can't use a regular PGO profile for this, but it should be possible to pass in a Propeller-PGO profile separately to the compiler, and apply that to the MIR as a late pass in the compiler, separate from any regular PGO profile. There are probably multiple ways you could do this; you could use basic block symbols, like your current implementation. > > -Eli > > -----Original Message----- > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Sriraman Tallam via llvm-dev > Sent: Tuesday, September 24, 2019 4:51 PM > To: llvm-dev at lists.llvm.org > Subject: [EXT] [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations > > Greetings, > > We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer > framework, on large google benchmarks and noticed that it improves key > performance metrics of these benchmarks by 2% to 6%, which is pretty impressive > as this is over and above a baseline binaryalready heavily optimized with > ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of > binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is also > profile guided and does very aggressive performance optimizations, there is > more room for performance improvements due to profile approximations while > applying the transformations. BOLT uses exact profiles from the final binary > and is able to fill the gaps left by ThinLTO + PGO. The performance > improvements due to BOLT come from basic block layout, function reordering and > function splitting. > > While BOLT does an excellent job of squeezing extra performance from highly > optimized binaries with optimizations such as code layout, it has these major > issues: > > * It does not take advantage of distributed build systems. > * It has scalability issues and to rewrite a binary with a ~300M text segment > size: > * Memory foot-print is 70G. > * It takes more than 10 minutes to rewrite the binary. > > Similar to Full LTO, BOLT’s design is monolithic as it disassembles the > original binary, optimizes and rewrites the final binary in one process. This > limits the scalability of BOLT and the memory and time overhead shoots up > quickly for large binaries. > > Inspired by the performance gains and to address the scalability issue of BOLT, > we went about designing a scalable infrastructure that can perform BOLT-like > post-link optimizations. In this RFC, we discuss our system, “Propeller”, > which can perform profile guided link time binary optimizations in a scalable > way and is friendly to distributed build systems. Our system leverages the > existing capabilities of the compiler tool-chain and is not a stand alone tool. > Like BOLT, our system boosts the performance of optimized binaries via > link-time optimizations using accurate profiles of the binary. We discuss the > Propeller system and show how to do the whole program basic block layout using > Propeller. > > Propeller does whole program basic block layout at link time via basic block > sections. We have added support for having each basic block in its own section > which allows the linker to do arbitrary reorderings of basic blocks to achieve > any desired fine-grain code layout which includes block layout, function > splitting and function reordering. Our experiments on large real-world > applications and SPEC with code layout show that Propeller can optimize as > effectively as BOLT, with just 20% of its memory footprint and time overhead. > > An LLVM branch with propeller patches is available in the git repository here: > https://github.com/google/llvm-propeller/ We will upload individual patches of > the various elements for review. We have attached a google doc describing the > Propeller system with Experiments in detail, > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > > Quick Start - How to optimize further with Propeller? > > # git clone and build repo > > $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git > > $ mkdir $BUILD_DIR && cd $BUILD_DIR > > $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ > $LLVM_DIR/llvm-propeller/llvm > > $ ninja -j25 > > $ export PATH=$BUILD_DIR/bin:$PATH > > > # Let’s Propeller-optimize the following program: > > > # Step 1: Build the peak optimized binary with an additional flag. > > $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld > > # Step 2: Profile the binary, only one side of the branch is executed. > $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& /dev/null > > > # Step 3: Convert the profiles using the tool provided > $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ > --binary=./a.out.labels --profile=perf.data --out=perf.propeller > > > # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed. > $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld > > In Step 4, the optimized bit code can be used if it is saved in Step1 as > Propeller is active only during compile backend and link. The optimized binary > has a different layout of the basic blocks in main to keep the executed blocks > together and split the cold blocks. > > Some of the key points: > > + Added support for basic block sections, similar to function sections, where > each basic block can reside in its own section. > > + Jump instructions need 32-bit relocations and subsequent linker relaxations > after basic block layout. We would like to add a new relocation type for jump > instructions to make it easier for relaxations and guarantee correctness. > > + Added support in the linker to read profiles (PMU LBR) and discover optimal > basic block layout using the Ex-TSP algorithm described here: > https://arxiv.org/abs/1809.04676 > > + Added support in the linker to re-order basic block sections in any > specified sequence (can be done with symbol ordering file). This requires > linker relaxations to delete and shrink branches across basic blocks. > > + Compared our system to BOLT and have shown that our system can produce > similar performance improvements as BOLT but with much less memory and time > overheads. Our experiments are on very large warehouse-scale benchmarks and > SPEC 2017. > > + Listed why this cannot be done as part of PGO itself. Post Link > optimizations are able to transform the binary using accurate profiles and PGO > passes suffer from profile imprecision. > > + Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks > via basic block sections. > > + Discussed CFI handling and is sub-optimal and bloats object file sizes and > binary sizes much more than DebugInfo due to lack of support for discontiguous > address ranges. We have talked about this and would like to make a case to > support discontiguous ranges with CFI which will make basic block sections much > more cheaper. > > Detailed RFC document here : > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > Please let us know what you think, > Thanks > Sri on behalf of the team. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Xinliang David Li via llvm-dev
2019-Sep-26 00:58 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev < llvm-dev at lists.llvm.org> wrote:> My biggest question about this architecture is about when propeller runs > basic block reordering within a function. It seems like a lot of the > complexity comes from using the proposed -fbasicblock-sections to generated > mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm > not sure that's the right approach, long-term. > > Splitting every basic block into its own section introduces overhead, like > you described. And it's likely more complex on non-x86 targets, which have > a greater variety of conditional branches. And the reordering itself > involves a bunch of x86 and ELF-specific code. > > I'd like to suggest an alternative: instead of perform basic block > reordering and function splitting by manipulating the ELF files, you could > perform reordering and splitting as part of link-time code generation, as > an MIR pass that runs just before the assembly printer. MIR is almost > exactly the form you want for this sort of manipulation: it has basic > blocks which correspond closely to the final binary, and a high-level > representation of branch instructions.This was considered for Propeller. This is currently being explored in a similar way as an alternative of CSFDO which uses PMU samples.> And it's before the DWARF/CFI emission, so you don't need to worry about > fixing them afterwards. This should take less code overall, and much less > target-specific code. And infrastructure for function splitting would be > useful for non-Propeller workflows. > > There are some minor downsides to this approach I can think of. You lose > a little flexibility, in that you can't mix blocks from different functions > together, but you aren't doing that anyway, from your description?One of the main design objectives of Propeller is to have the capability to do interprocedural code transformations (reordering, cloning, dedupping etc), so this won't be a minor downside. Function/block alignment (for branch misprediction reduction etc) will also need to be done as a global optimization in the future. David> It might take more CPU time to re-do LTO code generation, but that's > probably not a big deal for applications where the user is willing to > collect two different kinds PGO of profiles for a program. > > I understand you can't use a regular PGO profile for this, but it should > be possible to pass in a Propeller-PGO profile separately to the compiler, > and apply that to the MIR as a late pass in the compiler, separate from any > regular PGO profile. There are probably multiple ways you could do this; > you could use basic block symbols, like your current implementation. > > -Eli > > -----Original Message----- > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Sriraman > Tallam via llvm-dev > Sent: Tuesday, September 24, 2019 4:51 PM > To: llvm-dev at lists.llvm.org > Subject: [EXT] [llvm-dev] [RFC] Propeller: A frame work for Post Link > Optimizations > > Greetings, > > We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer > framework, on large google benchmarks and noticed that it improves key > performance metrics of these benchmarks by 2% to 6%, which is pretty > impressive > as this is over and above a baseline binaryalready heavily optimized with > ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance > of > binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is > also > profile guided and does very aggressive performance optimizations, there is > more room for performance improvements due to profile approximations while > applying the transformations. BOLT uses exact profiles from the final > binary > and is able to fill the gaps left by ThinLTO + PGO. The performance > improvements due to BOLT come from basic block layout, function reordering > and > function splitting. > > While BOLT does an excellent job of squeezing extra performance from highly > optimized binaries with optimizations such as code layout, it has these > major > issues: > > * It does not take advantage of distributed build systems. > * It has scalability issues and to rewrite a binary with a ~300M text > segment > size: > * Memory foot-print is 70G. > * It takes more than 10 minutes to rewrite the binary. > > Similar to Full LTO, BOLT’s design is monolithic as it disassembles the > original binary, optimizes and rewrites the final binary in one process. > This > limits the scalability of BOLT and the memory and time overhead shoots up > quickly for large binaries. > > Inspired by the performance gains and to address the scalability issue of > BOLT, > we went about designing a scalable infrastructure that can perform > BOLT-like > post-link optimizations. In this RFC, we discuss our system, “Propeller”, > which can perform profile guided link time binary optimizations in a > scalable > way and is friendly to distributed build systems. Our system leverages the > existing capabilities of the compiler tool-chain and is not a stand alone > tool. > Like BOLT, our system boosts the performance of optimized binaries via > link-time optimizations using accurate profiles of the binary. We discuss > the > Propeller system and show how to do the whole program basic block layout > using > Propeller. > > Propeller does whole program basic block layout at link time via basic > block > sections. We have added support for having each basic block in its own > section > which allows the linker to do arbitrary reorderings of basic blocks to > achieve > any desired fine-grain code layout which includes block layout, function > splitting and function reordering. Our experiments on large real-world > applications and SPEC with code layout show that Propeller can optimize as > effectively as BOLT, with just 20% of its memory footprint and time > overhead. > > An LLVM branch with propeller patches is available in the git repository > here: > https://github.com/google/llvm-propeller/ We will upload individual > patches of > the various elements for review. We have attached a google doc describing > the > Propeller system with Experiments in detail, > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > > Quick Start - How to optimize further with Propeller? > > # git clone and build repo > > $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git > > $ mkdir $BUILD_DIR && cd $BUILD_DIR > > $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ > $LLVM_DIR/llvm-propeller/llvm > > $ ninja -j25 > > $ export PATH=$BUILD_DIR/bin:$PATH > > > # Let’s Propeller-optimize the following program: > > > # Step 1: Build the peak optimized binary with an additional flag. > > $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels > -fuse-ld=lld > > # Step 2: Profile the binary, only one side of the branch is executed. > $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& > /dev/null > > > # Step 3: Convert the profiles using the tool provided > $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ > --binary=./a.out.labels --profile=perf.data --out=perf.propeller > > > # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag > changed. > $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller > -fuse-ld=lld > > In Step 4, the optimized bit code can be used if it is saved in Step1 as > Propeller is active only during compile backend and link. The optimized > binary > has a different layout of the basic blocks in main to keep the executed > blocks > together and split the cold blocks. > > Some of the key points: > > + Added support for basic block sections, similar to function sections, > where > each basic block can reside in its own section. > > + Jump instructions need 32-bit relocations and subsequent linker > relaxations > after basic block layout. We would like to add a new relocation type for > jump > instructions to make it easier for relaxations and guarantee correctness. > > + Added support in the linker to read profiles (PMU LBR) and discover > optimal > basic block layout using the Ex-TSP algorithm described here: > https://arxiv.org/abs/1809.04676 > > + Added support in the linker to re-order basic block sections in any > specified sequence (can be done with symbol ordering file). This requires > linker relaxations to delete and shrink branches across basic blocks. > > + Compared our system to BOLT and have shown that our system can produce > similar performance improvements as BOLT but with much less memory and time > overheads. Our experiments are on very large warehouse-scale benchmarks > and > SPEC 2017. > > + Listed why this cannot be done as part of PGO itself. Post Link > optimizations are able to transform the binary using accurate profiles and > PGO > passes suffer from profile imprecision. > > + Updated DebugInfo and CFI to account for arbitrary ordering of basic > blocks > via basic block sections. > > + Discussed CFI handling and is sub-optimal and bloats object file sizes > and > binary sizes much more than DebugInfo due to lack of support for > discontiguous > address ranges. We have talked about this and would like to make a case to > support discontiguous ranges with CFI which will make basic block sections > much > more cheaper. > > Detailed RFC document here : > https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf > > Please let us know what you think, > Thanks > Sri on behalf of the team. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190925/b1661614/attachment.html>
Sriraman Tallam via llvm-dev
2019-Sep-26 01:13 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
On Wed, Sep 25, 2019 at 3:44 PM Michael Spencer <bigcheesegs at gmail.com> wrote:> > On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> Greetings, >> >> We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer >> framework, on large google benchmarks and noticed that it improves key >> performance metrics of these benchmarks by 2% to 6%, which is pretty impressive >> as this is over and above a baseline binaryalready heavily optimized with >> ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of >> binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is also >> profile guided and does very aggressive performance optimizations, there is >> more room for performance improvements due to profile approximations while >> applying the transformations. BOLT uses exact profiles from the final binary >> and is able to fill the gaps left by ThinLTO + PGO. The performance >> improvements due to BOLT come from basic block layout, function reordering and >> function splitting. >> >> While BOLT does an excellent job of squeezing extra performance from highly >> optimized binaries with optimizations such as code layout, it has these major >> issues: >> >> * It does not take advantage of distributed build systems. >> * It has scalability issues and to rewrite a binary with a ~300M text segment >> size: >> * Memory foot-print is 70G. >> * It takes more than 10 minutes to rewrite the binary. >> >> Similar to Full LTO, BOLT’s design is monolithic as it disassembles the >> original binary, optimizes and rewrites the final binary in one process. This >> limits the scalability of BOLT and the memory and time overhead shoots up >> quickly for large binaries. >> >> Inspired by the performance gains and to address the scalability issue of BOLT, >> we went about designing a scalable infrastructure that can perform BOLT-like >> post-link optimizations. In this RFC, we discuss our system, “Propeller”, >> which can perform profile guided link time binary optimizations in a scalable >> way and is friendly to distributed build systems. Our system leverages the >> existing capabilities of the compiler tool-chain and is not a stand alone tool. >> Like BOLT, our system boosts the performance of optimized binaries via >> link-time optimizations using accurate profiles of the binary. We discuss the >> Propeller system and show how to do the whole program basic block layout using >> Propeller. >> >> Propeller does whole program basic block layout at link time via basic block >> sections. We have added support for having each basic block in its own section >> which allows the linker to do arbitrary reorderings of basic blocks to achieve >> any desired fine-grain code layout which includes block layout, function >> splitting and function reordering. Our experiments on large real-world >> applications and SPEC with code layout show that Propeller can optimize as >> effectively as BOLT, with just 20% of its memory footprint and time overhead. >> >> An LLVM branch with propeller patches is available in the git repository here: >> https://github.com/google/llvm-propeller/ We will upload individual patches of >> the various elements for review. We have attached a google doc describing the >> Propeller system with Experiments in detail, >> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf >> >> >> Quick Start - How to optimize further with Propeller? >> >> # git clone and build repo >> >> $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git >> >> $ mkdir $BUILD_DIR && cd $BUILD_DIR >> >> $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ >> $LLVM_DIR/llvm-propeller/llvm >> >> $ ninja -j25 >> >> $ export PATH=$BUILD_DIR/bin:$PATH >> >> >> # Let’s Propeller-optimize the following program: >> >> >> # Step 1: Build the peak optimized binary with an additional flag. >> >> $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld >> >> # Step 2: Profile the binary, only one side of the branch is executed. >> $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& /dev/null >> >> >> # Step 3: Convert the profiles using the tool provided >> $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ >> --binary=./a.out.labels --profile=perf.data --out=perf.propeller >> >> >> # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed. >> $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld >> >> In Step 4, the optimized bit code can be used if it is saved in Step1 as >> Propeller is active only during compile backend and link. The optimized binary >> has a different layout of the basic blocks in main to keep the executed blocks >> together and split the cold blocks. >> >> Some of the key points: >> >> + Added support for basic block sections, similar to function sections, where >> each basic block can reside in its own section. >> >> + Jump instructions need 32-bit relocations and subsequent linker relaxations >> after basic block layout. We would like to add a new relocation type for jump >> instructions to make it easier for relaxations and guarantee correctness. >> >> + Added support in the linker to read profiles (PMU LBR) and discover optimal >> basic block layout using the Ex-TSP algorithm described here: >> https://arxiv.org/abs/1809.04676 >> >> + Added support in the linker to re-order basic block sections in any >> specified sequence (can be done with symbol ordering file). This requires >> linker relaxations to delete and shrink branches across basic blocks. >> >> + Compared our system to BOLT and have shown that our system can produce >> similar performance improvements as BOLT but with much less memory and time >> overheads. Our experiments are on very large warehouse-scale benchmarks and >> SPEC 2017. >> >> + Listed why this cannot be done as part of PGO itself. Post Link >> optimizations are able to transform the binary using accurate profiles and PGO >> passes suffer from profile imprecision. >> >> + Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks >> via basic block sections. >> >> + Discussed CFI handling and is sub-optimal and bloats object file sizes and >> binary sizes much more than DebugInfo due to lack of support for discontiguous >> address ranges. We have talked about this and would like to make a case to >> support discontiguous ranges with CFI which will make basic block sections much >> more cheaper. >> >> Detailed RFC document here : >> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf >> >> Please let us know what you think, >> Thanks >> Sri on behalf of the team. > > > The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have (https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp)? > > I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one.Long term, we are looking at inter-procedural basic block layout, basic block cloning etc. and this encapsulates block layout, function splitting and function reordering. We will look at how we could integrate this with existing callgraph layout algorithm.> > - Michael Spencer
Sriraman Tallam via llvm-dev
2019-Sep-26 19:03 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
On Wed, Sep 25, 2019 at 3:44 PM Michael Spencer <bigcheesegs at gmail.com> wrote:> > On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> Greetings, >> >> We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer >> framework, on large google benchmarks and noticed that it improves key >> performance metrics of these benchmarks by 2% to 6%, which is pretty impressive >> as this is over and above a baseline binaryalready heavily optimized with >> ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of >> binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is also >> profile guided and does very aggressive performance optimizations, there is >> more room for performance improvements due to profile approximations while >> applying the transformations. BOLT uses exact profiles from the final binary >> and is able to fill the gaps left by ThinLTO + PGO. The performance >> improvements due to BOLT come from basic block layout, function reordering and >> function splitting. >> >> While BOLT does an excellent job of squeezing extra performance from highly >> optimized binaries with optimizations such as code layout, it has these major >> issues: >> >> * It does not take advantage of distributed build systems. >> * It has scalability issues and to rewrite a binary with a ~300M text segment >> size: >> * Memory foot-print is 70G. >> * It takes more than 10 minutes to rewrite the binary. >> >> Similar to Full LTO, BOLT’s design is monolithic as it disassembles the >> original binary, optimizes and rewrites the final binary in one process. This >> limits the scalability of BOLT and the memory and time overhead shoots up >> quickly for large binaries. >> >> Inspired by the performance gains and to address the scalability issue of BOLT, >> we went about designing a scalable infrastructure that can perform BOLT-like >> post-link optimizations. In this RFC, we discuss our system, “Propeller”, >> which can perform profile guided link time binary optimizations in a scalable >> way and is friendly to distributed build systems. Our system leverages the >> existing capabilities of the compiler tool-chain and is not a stand alone tool. >> Like BOLT, our system boosts the performance of optimized binaries via >> link-time optimizations using accurate profiles of the binary. We discuss the >> Propeller system and show how to do the whole program basic block layout using >> Propeller. >> >> Propeller does whole program basic block layout at link time via basic block >> sections. We have added support for having each basic block in its own section >> which allows the linker to do arbitrary reorderings of basic blocks to achieve >> any desired fine-grain code layout which includes block layout, function >> splitting and function reordering. Our experiments on large real-world >> applications and SPEC with code layout show that Propeller can optimize as >> effectively as BOLT, with just 20% of its memory footprint and time overhead. >> >> An LLVM branch with propeller patches is available in the git repository here: >> https://github.com/google/llvm-propeller/ We will upload individual patches of >> the various elements for review. We have attached a google doc describing the >> Propeller system with Experiments in detail, >> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf >> >> >> Quick Start - How to optimize further with Propeller? >> >> # git clone and build repo >> >> $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git >> >> $ mkdir $BUILD_DIR && cd $BUILD_DIR >> >> $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ >> $LLVM_DIR/llvm-propeller/llvm >> >> $ ninja -j25 >> >> $ export PATH=$BUILD_DIR/bin:$PATH >> >> >> # Let’s Propeller-optimize the following program: >> >> >> # Step 1: Build the peak optimized binary with an additional flag. >> >> $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld >> >> # Step 2: Profile the binary, only one side of the branch is executed. >> $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& /dev/null >> >> >> # Step 3: Convert the profiles using the tool provided >> $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ >> --binary=./a.out.labels --profile=perf.data --out=perf.propeller >> >> >> # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed. >> $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld >> >> In Step 4, the optimized bit code can be used if it is saved in Step1 as >> Propeller is active only during compile backend and link. The optimized binary >> has a different layout of the basic blocks in main to keep the executed blocks >> together and split the cold blocks. >> >> Some of the key points: >> >> + Added support for basic block sections, similar to function sections, where >> each basic block can reside in its own section. >> >> + Jump instructions need 32-bit relocations and subsequent linker relaxations >> after basic block layout. We would like to add a new relocation type for jump >> instructions to make it easier for relaxations and guarantee correctness. >> >> + Added support in the linker to read profiles (PMU LBR) and discover optimal >> basic block layout using the Ex-TSP algorithm described here: >> https://arxiv.org/abs/1809.04676 >> >> + Added support in the linker to re-order basic block sections in any >> specified sequence (can be done with symbol ordering file). This requires >> linker relaxations to delete and shrink branches across basic blocks. >> >> + Compared our system to BOLT and have shown that our system can produce >> similar performance improvements as BOLT but with much less memory and time >> overheads. Our experiments are on very large warehouse-scale benchmarks and >> SPEC 2017. >> >> + Listed why this cannot be done as part of PGO itself. Post Link >> optimizations are able to transform the binary using accurate profiles and PGO >> passes suffer from profile imprecision. >> >> + Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks >> via basic block sections. >> >> + Discussed CFI handling and is sub-optimal and bloats object file sizes and >> binary sizes much more than DebugInfo due to lack of support for discontiguous >> address ranges. We have talked about this and would like to make a case to >> support discontiguous ranges with CFI which will make basic block sections much >> more cheaper. >> >> Detailed RFC document here : >> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf >> >> Please let us know what you think, >> Thanks >> Sri on behalf of the team. > > > The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have (https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp)?We looked at the code and it looks like the function reordering parts are using the hfsort algorithm, both CallgraphSort and Propeller. So, we can merge these. When callgraphsort option is passed, Propeller will invoke the hfsort code to do function layout. We prefer doing this as a follow-up patch if that is alright as this requires more plumbing.> > I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one. > > - Michael Spencer
Maksim Panchenko via llvm-dev
2019-Oct-02 18:40 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
Hi Sri and the team, Thank you for sharing your proposal. It helps bring awareness to the importance of context-sensitive and link-time optimizations in the compiler toolchain for boosting the performance of large applications. The proposal also shows that different approaches are possible to achieve a similar goal. Putting basic blocks into separate sections is an ambitious approach with expected challenges. For the first time, the real overheads of this approach were measured at scale, and the results look quite staggering. I can imagine how it might work for Google where the overhead of the 2-stage re-compilation could be masked out by a distributed build system, and dealing with C++ exceptions is not an issue due to software development practices. At the same time, it should be clear that for users without access to a distributed build system, the actual build-time overhead will far exceed that of BOLT. Considering the proposal in its current form, I'm not convinced it's the best approach even for Google in the long run. Since the proposal references BOLT in multiple places, and since I’m directly involved with BOLT and hence potentially biased, I decided to break my feedback into two parts, with the first part being unrelated to BOLT and hopefully being as objective as possible. Here’s a quick summary of my concerns, followed by more detailed feedback. PERFORMANCE OF TARGET APPLICATIONS * Poor optimization of C++ code bases with exceptions * Pessimization/overhead for stack unwinding used by system-wide profilers and for exception handling * Increased memory footprint at application runtime * No clear plan for adding optimizations in the future without the need for more RFCs * Effectiveness of the relaxation pass * Inconsistencies in performance data * Debugging overhead BUILD SYSTEM * Requirement for two different builds and hidden overhead of Propeller * Step 1 overheads * Enormous increase in object file size * Lack of scalability for actual code reordering during the optimization phase PROFILING * Propeller-optimized binary for continuous deployment PERFORMANCE OF TARGET APPLICATIONS *Poor optimization for C++ code bases with exceptions* From what I understand, your list of large real-world applications (you exclude SPEC from the list, which I totally agree with) does not include a single one that uses C++ exceptions. This is based on the assumption that exceptions are not used at Google, and Clang is compiled without the exceptions support by default. I consider this is a major omission from the evaluation and a drawback of the proposal. A lot of exception-handling code is generated by compiler “behind the scenes” and is invisible to the user, but is exposed to Propeller. Even if such code is never executed, i.e. exceptions are not thrown, it is important to be able to reorder the code in the function including these blocks. The RFC says: “basic blocks that are involved in exception handling, that is, they either throw or catch an exception, are grouped together and placed in a single section. Unique sections are not created for such basic blocks.” Further down the paragraph, you say: “benchmarks which spend a significant amount of time handling exceptions might not get the optimization benefits of Propeller.“ The proposal, intentionally or not, makes it look like applications that are compiled with exceptions support and that only use them rarely, i.e. for error-handling, are not affected. Based on the limitations you describe above, I believe the code reordering and thus optimization benefits will be limited for such applications. If you cannot find any large real-world benchmark that uses C++ exceptions, at the very minimum, I would like to see Clang itself being compiled with exceptions support and optimized by Propeller. Granted, it will not exercise any exception-handling code, but at least the results will give an idea of how well Propeller handles optimizing code with the mechanism enabled. *Pessimization/overhead for stack unwinding used by system-wide profilers and for exception handling* Larger CFI programs put an extra burden on unwinding at runtime as more CFI (and thus native) instructions have to be executed. This will cause more overhead for any profiler that records stack traces, and, as you correctly note in the proposal, for any program that heavily uses exceptions. *Increased application memory footprint* The increase of .eh_frame section is up to 17x, which is quite significant. Since it’s part of the runtime image of the application, it is important to know how much larger the application memory footprint gets. I.e., in addition to the “Total” size of the binary in the comparison table, I would like to see “Total Allocatable” size data. *No clear plan for adding optimizations in the future without the need for additional RFCs* Some optimizations other than basic block reordering are best executed at link time. One example would be an optimization for macro-op fusion on x86-64 CPUs. We’ve seen a real-world application, bzip2, regressed by 5% in the case of “unlucky” code placement. The optimization requires an analysis of instructions (in this case, instruction pairs) and modification of the instruction stream such as insertion of nops. For the optimization to be performed in the compiler, it would require functions to be aligned at 64 bytes which is generally not practical, hence an opportunity for link-time/binary optimization. What is the plan for Propeller to handle the above and similar cases? *Effectiveness of the relaxation pass* When you describe the relaxation pass, you do not mention if you run it till it converges, or you run it conservatively leaving larger than necessary versions of jump instructions. *Inconsistencies in performance data* Performance data in bar charts indicate that Propeller is not achieving all possible improvements for Clang, the only open-source benchmark in the evaluation where code-layout optimizations matter. Cycles, I-Cache, and branches-not-taken indicate that BOLT is doing better. At the same time, the summary table shows 9% improvement for all. Which data set is correct? *Debugging overhead* As Propeller has to support debug info generation on a per-basic-block basis, it increases DWARF sections corresponding to address ranges, location lists, and line number info. How much larger are these sections with Propeller flags? We are currently pushing the limits of 32-bit DWARF, and I wonder if you encounter any related issues considering the Propeller bloat? With larger debug info, the overhead flows to debuggers and tools using debug information for annotation purposes, e.g. profilers reporting line numbers corresponding to binary addresses. Depending on their usage of the debug info, these tools will likely use more memory and spend more time parsing DWARF. Have you considered/measured this overhead? By the way, I didn’t see DWARF location lists specifically mentioned in the proposal. Making sure you use extra entries for those too. BUILD SYSTEM *Requirement for two different builds and hidden overhead of Propeller* While discussing Memory Overhead and Runtime Overhead in the RFC, you chose to include the overhead incurred only during link time. However, compared to a regular highly-optimized AutoFDO+(Thin)LTO build, you require a second full build of the application with (Thin)LTO. Even with the caching of IR, there’s an extra cost of generating code not reflected in your evaluation. With tight integration with the build system, it is possible to hide these costs partially. In either case, since “all the benchmarks were built for the X86_64 platform on a 18 core Intel machine with 192 G of RAM“, you should be able to measure extra time for compilation spent on this machine. *Step 1 overheads* Furthermore, wouldn’t the LLVM compiler incur extra overhead when compiling with Propeller flags because of larger CFIs, debug info, etc.? I think it would be fair to measure and report the resulting compile-time and memory overhead for both builds (steps 1 and 4) versus regular build. Additionally, what are runtime and memory overheads for linking in Step 1? Do they correspond to the “All” column in the Experiments section? If you chose to compare just the link-time overhead versus the baseline, it should come from the combination of link times of steps 1 and 4 since you have to link twice for Propeller. *Enormous increase in object file size* Even with the optimized on-demand approach, the size of object files almost doubles. Since in a distributed build system object files are cached and shared, I’m not sure it is fair to describe this approach as being “friendly to distributed build systems” since it increases the existing load. What is the breakdown of the increase in object file sizes? As you mention in the proposal, DWARF natively supports address ranges. However, specifying a separate range for every basic block will introduce an overhead. Is this increase included in the total overhead for the “Object File Sizes Bloat”? *Lack of scalability for actual code reordering during the optimization phase* When you talk about Propeller being a distributed and scalable system, you mention that it has “a distributed part and a whole-program part”. Does this mean that the part where you need to recompile the application is distributed (if you are using a distributed build system), but the actual link and the code reordering optimization are happening serially? PROFILING Real-world systems built to benefit from profile feedback optimizations, such as AutoFDO, are designed to benefit from a feedback collected on a previous revision of the highly optimized binary running in prod. Does Propeller support profiling binaries optimized by Propeller with the purpose of future optimizations? BOLT Now to the BOLT part. As you know, BOLT was originally designed to handle applications built using arbitrary toolchain without any modification to the toolchain itself or the build process (other than “--emit-relocs” linker flag for maximum gains). Integration with any particular toolchain, such as LLVM+lld, obviously brings immediate advantages to the design and thus, at least in theory, comparing BOLT to such a system would not be apples-to-apples. However, I think BOLT stands fairly competitive even in its current form. With a few enhancements and a little help from the linker, issues raised in the Propeller proposal could be addressed without the need for the radical section-per-basic-block approach. Recently (July 2019), we have parallelized many passes in BOLT and improved its runtime. If you run experiments using an old version, you should try a more recent one. For example, optimizing Clang 7.0, with “.text” of ~50MB, takes less than one minute. Hardly a use case requiring any further optimization and justifying a need for a distributed system. One of your main arguments is that Propeller has 20% memory footprint of that of BOLT. This is achieved by limiting optimization to profiled functions, i.e. typically less than 10% of all functions. BOLT’s main memory overhead comes from the intermediate representation of all functions in the binary in the form of MCPlus. If we decide to keep IR only for functions with profile information, i.e. less than 10% of all functions, there’s fundamentally no reason why BOLT’s memory usage wouldn’t get decimated and brought on-par with Propeller’s link time, perhaps even less. We never found that to be a restriction in our environment and haven’t heard it been an issue other than from you. The following statement about BOLT in your proposal does not sound right: “with BOLT, even small source changes invalidate the profile information, an error if the binary’s build id does not match the profile [sic]. Hence, the binary corresponding to the source change must be re-built and re-profiled.“ Quite the opposite, BOLT is built to tolerate binary differences, including those resulting from LTO, and even supports profiling previously BOLTed binaries. The only place we check for the build ID is during the profile conversion phase corresponding to Step 3 in your proposal. This is done to avoid any possible user error, as the collection (Step 2) and conversion often happen on different machines. It might be a good idea to implement this check in Propeller too. Comparison on “Search1” includes data gathered with huge pages enabled for Propeller and disabled for BOLT. For a direct comparison, you can include a line with and without huge pages. If you cannot enable huge pages for BOLT, it is fair to exclude the result. I should also note that BOLT performs a set of optimizations beyond the code layout that can benefit from context-sensitive profile information, such as indirect call promotion, inlining, shrink-wrapping etc. For the full set of optimizations implemented in BOLT, please check our CGO’19 paper (https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/). BOLT’s additional benefits include a support for code instrumentation and code protection/hardening. E.g. spectre/meltdown mitigation including that for assembly-written code. Obviously, this is not the right place for an alternative proposal, but I strongly believe that integrating BOLT with a linker, such as lld, would address the majority of the issues and provide a seamless solution to the users of the LLVM toolchain without the need for radical changes to the emitted code and LLVM codegen. If the scalability is indeed a concern, which so far is not what we’ve seen, then we might consider changing LLVM itself to emit more data or add integration to a distributed build system. But again, IMHO this would be a solution to a nonproblem. On 9/24/19, 4:52 PM, "Sriraman Tallam" <tmsriram at google.com> wrote: Greetings, We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer framework, on large google benchmarks and noticed that it improves key performance metrics of these benchmarks by 2% to 6%, which is pretty impressive as this is over and above a baseline binaryalready heavily optimized with ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is also profile guided and does very aggressive performance optimizations, there is more room for performance improvements due to profile approximations while applying the transformations. BOLT uses exact profiles from the final binary and is able to fill the gaps left by ThinLTO + PGO. The performance improvements due to BOLT come from basic block layout, function reordering and function splitting. While BOLT does an excellent job of squeezing extra performance from highly optimized binaries with optimizations such as code layout, it has these major issues: * It does not take advantage of distributed build systems. * It has scalability issues and to rewrite a binary with a ~300M text segment size: * Memory foot-print is 70G. * It takes more than 10 minutes to rewrite the binary. Similar to Full LTO, BOLT’s design is monolithic as it disassembles the original binary, optimizes and rewrites the final binary in one process. This limits the scalability of BOLT and the memory and time overhead shoots up quickly for large binaries. Inspired by the performance gains and to address the scalability issue of BOLT, we went about designing a scalable infrastructure that can perform BOLT-like post-link optimizations. In this RFC, we discuss our system, “Propeller”, which can perform profile guided link time binary optimizations in a scalable way and is friendly to distributed build systems. Our system leverages the existing capabilities of the compiler tool-chain and is not a stand alone tool. Like BOLT, our system boosts the performance of optimized binaries via link-time optimizations using accurate profiles of the binary. We discuss the Propeller system and show how to do the whole program basic block layout using Propeller. Propeller does whole program basic block layout at link time via basic block sections. We have added support for having each basic block in its own section which allows the linker to do arbitrary reorderings of basic blocks to achieve any desired fine-grain code layout which includes block layout, function splitting and function reordering. Our experiments on large real-world applications and SPEC with code layout show that Propeller can optimize as effectively as BOLT, with just 20% of its memory footprint and time overhead. An LLVM branch with propeller patches is available in the git repository here: https://github.com/google/llvm-propeller/ We will upload individual patches of the various elements for review. We have attached a google doc describing the Propeller system with Experiments in detail, https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf Quick Start - How to optimize further with Propeller? # git clone and build repo $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git $ mkdir $BUILD_DIR && cd $BUILD_DIR $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \ $LLVM_DIR/llvm-propeller/llvm $ ninja -j25 $ export PATH=$BUILD_DIR/bin:$PATH # Let’s Propeller-optimize the following program: # Step 1: Build the peak optimized binary with an additional flag. $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld # Step 2: Profile the binary, only one side of the branch is executed. $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& /dev/null # Step 3: Convert the profiles using the tool provided $ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \ --binary=./a.out.labels --profile=perf.data --out=perf.propeller # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed. $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld In Step 4, the optimized bit code can be used if it is saved in Step1 as Propeller is active only during compile backend and link. The optimized binary has a different layout of the basic blocks in main to keep the executed blocks together and split the cold blocks. Some of the key points: + Added support for basic block sections, similar to function sections, where each basic block can reside in its own section. + Jump instructions need 32-bit relocations and subsequent linker relaxations after basic block layout. We would like to add a new relocation type for jump instructions to make it easier for relaxations and guarantee correctness. + Added support in the linker to read profiles (PMU LBR) and discover optimal basic block layout using the Ex-TSP algorithm described here: https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1809.04676&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=_GpIFlYp6xXw55TW32G_Z4B6vhpncSis7Vdf3sutEf8&s=2jV6DGC8za8IdgV-438b6TzUOsRienUX5hcyLbsHYag&e= + Added support in the linker to re-order basic block sections in any specified sequence (can be done with symbol ordering file). This requires linker relaxations to delete and shrink branches across basic blocks. + Compared our system to BOLT and have shown that our system can produce similar performance improvements as BOLT but with much less memory and time overheads. Our experiments are on very large warehouse-scale benchmarks and SPEC 2017. + Listed why this cannot be done as part of PGO itself. Post Link optimizations are able to transform the binary using accurate profiles and PGO passes suffer from profile imprecision. + Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks via basic block sections. + Discussed CFI handling and is sub-optimal and bloats object file sizes and binary sizes much more than DebugInfo due to lack of support for discontiguous address ranges. We have talked about this and would like to make a case to support discontiguous ranges with CFI which will make basic block sections much more cheaper. Detailed RFC document here : https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf Please let us know what you think, Thanks Sri on behalf of the team.
Krzysztof Pszeniczny via llvm-dev
2019-Oct-02 19:19 UTC
[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev < llvm-dev at lists.llvm.org> wrote:> *Pessimization/overhead for stack unwinding used by system-wide profilers > and > for exception handling* > > Larger CFI programs put an extra burden on unwinding at runtime as more CFI > (and thus native) instructions have to be executed. This will cause more > overhead for any profiler that records stack traces, and, as you correctly > note > in the proposal, for any program that heavily uses exceptions. >The number of CFI instructions that have to be executed when unwinding any given stack stays the same. The CFI instructions for a function have to be duplicated in every basic block section, but when performing unwinding only one such a set is executed -- the copy for the current basic block. However, this copy contains precisely the same CFI instructions as the ones that would have to be executed if there were no basic block sections. -- Krzysztof Pszeniczny -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191002/af57c080/attachment.html>
Reasonably Related Threads
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- [RFC] Propeller: A frame work for Post Link Optimizations
- A Propeller link (similar to a Thin Link as used by ThinLTO)?
- [RFC] Propeller: A frame work for Post Link Optimizations