Maksim Panchenko via llvm-dev
2020-Oct-20 18:39 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
Hi All,

Please find below an RFC for adding a binary optimization framework to LLVM.

Thanks for the feedback,
Maksim & BOLT Team

[RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization

1. Background and Motivation

BOLT is a static post-link binary optimizer successfully used inside and outside of Facebook for code optimizations that complement traditional compiler whole-program and link-time optimizations [1]. Last year Google reported that BOLT accelerates their key workloads by 2% to 6% [2]. Additionally, BOLT is used in academia as a framework for program instrumentation, transformation, and binary analysis [3].

We believe that including BOLT in the LLVM project will benefit the community in several ways [4]. First, BOLT is useful as a binary optimizer. It has been used at Facebook to accelerate its top services, and we would love to see more people benefit from the performance boost that BOLT brings. We would also love to see our partners adopt BOLT for new use cases, such as optimizing mobile and embedded applications. Beyond the binary optimizer, BOLT is an impressive infrastructure that enables research in the following areas:

* Advanced disassembly
* Low-level program instrumentation
* Static analysis

2. Overview

While BOLT uses several LLVM libraries [5], it is currently a separate project based on an out-of-tree version of LLVM [6]. Most of the code lives under a separate tools/llvm-bolt directory, and the required changes to LLVM are included in the supplied patch [7]. The patch mainly consists of backported fixes and minor extensions of existing interfaces to update debug info, frame information, and ORC.

BOLT has no external build dependencies outside of LLVM. For profile collection, we recommend sampling with the Linux perf tool [8]. The LBR (last branch record) feature [9] is recommended, as it improves profile quality dramatically.
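As a concrete illustration of this collection flow, the transcript below records an LBR profile with perf, converts it with the supplied perf2bolt utility, and optimizes the binary. The exact perf event names and sampling flags vary by system, and the workload arguments and perf.fdata file name are illustrative:

```shell
# Sample user-space cycles with branch (LBR) records; flags are illustrative.
$ perf record -e cycles:u -j any,u -o perf.data -- ./a.out <workload args>

# Convert the raw perf profile into BOLT's format.
$ perf2bolt -p perf.data -o perf.fdata ./a.out

# Optimize using the converted profile.
$ llvm-bolt ./a.out -data perf.fdata -o a.out.bolt
```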
BOLT can be supplied a perf.data profile directly, but in general, we advise converting the profile first using the supplied perf2bolt utility. If sampling-based profiling is unavailable, e.g., while running under a hypervisor locally or in the cloud, BOLT provides built-in instrumentation.

BOLT supports x86-64 ELF as the primary platform. We have also implemented initial support for AArch64, and support for Mach-O is in the works.

3. USE CASES

3.1 Link-time and binary transformations and optimizations

Static profile-driven post-link optimization of an application was our primary goal when creating BOLT. The convenience and expandability that this model offers could perhaps only be exceeded by a dynamic binary optimization approach. E.g., to optimize a binary using a perf-generated profile, the user has to execute a single command:

$ llvm-bolt a.out -data perf.data -o a.out.bolt <optimization opts>

No recompilation of a.out is needed (*), but we recommend linking with "--emit-relocs" for maximum performance gains. However, the latter is not a strict requirement, and the command works even on stripped binaries.

We have worked on reducing BOLT processing time and memory consumption. Overall, BOLT efficiency has improved by over 3X during that process. It takes less than a minute to optimize the HHVM [10] production binary with over 100 MB of code, and less than 3 minutes to optimize another multi-gigabyte production binary with 500 MB of code. BOLT memory consumption is under 5 GB for HHVM and under 13 GB for the large binary (**).

Fast turnaround times for optimizing an application with BOLT, without the need for source code, allow us to design a system that automatically manages binary profiling and optimization. This is one of the most exciting directions in application optimization we are currently pursuing.

We thought it would be interesting to perform a fresh comparison of BOLT overhead to Propeller [11].
Sadly, Propeller relies on a custom version of an external create-llvm-prof tool that we could not build in our setup, and using a GitHub-hosted binary version of that tool in the virtual environment produced a profile that caused a performance regression in the optimized application. Using the "-fbasic-block-sections=all" Propeller option without a profile resulted in fast linking times but also caused a performance regression.

Although code reordering is the primary optimization in BOLT and the source of most performance gains, it is not the only one. The full list of optimizations includes late inlining, indirect call promotion, frame optimizations, and others. The convenience of adding new optimizations in whole-program post-link mode is one of the main advantages that the BOLT framework offers, be it for research or practical purposes.

Additionally, BOLT can add new code to a linked ELF binary. We have recently used that capability to automatically allocate frequently executed code on huge pages. Even legacy applications can now use 2MB pages for code on x86-64 Linux systems to reduce the number of iTLB misses.

BOLT's ability to process and optimize binary code without access to source code, such as legacy or distributed binaries, third-party libraries, and hand-written assembly, provides another advantage over a pure compiler-based approach. This advantage may or may not be important for optimizations, depending on the user scenario. However, the visibility into such code could be critical for other code-rewriting cases such as vulnerability mitigations, instrumentation, and static analysis.

*) Support for input with split functions is in the works. Until it is completed, we advise against using the "-freorder-blocks-and-partition" compiler option during the build. A similar or better result will be achieved by running the BOLT function splitting pass.
**) While running BOLT with the "-reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=1 -split-eh" optimization options. The DWARF update takes extra time and memory.

3.2 Advanced Disassembly

Internally, BOLT identifies code in the binary, breaks it into functions, disassembles it, and uses static analysis to build a control flow graph. This functionality could complement a traditional disassembler, as it adds the ability to display possible targets for indirect jumps/calls, providing a better understanding of the control flow in a function.

Additionally, for performance analysis, the disassembly is annotated with execution counts, including those for indirect branch targets within a function (jump tables) and across functions (indirect and virtual function calls). E.g.:

<Basic Block> .LFT35985
  Exec Count : 42
  Predecessors: .Ltmp935657
    00003c8b: leaq "JUMP_TABLE/foo/1.14"(%rip), %rcx
    00003c92: movslq (%rcx,%rax,4), %rax
    00003c96: addq %rcx, %rax
    00003c99: jmpq *%rax # JUMPTABLE @0x6e0f94
  Successors: .Ltmp935702 (count: 0, mispreds: 0),
              .Ltmp935705 (count: 41, mispreds: 26),
              .Ltmp935703 (count: 0, mispreds: 0),
              .Ltmp935704 (count: 1, mispreds: 0)
....
<Basic Block> .LFT43 (9 instructions, align : 1)
  Exec Count : 8
  Predecessors: .LBB01191
    00000028: movq %rdx, %rbx
    0000002b: leaq 0x20(%rsp), %rdi
    00000030: movl $0x2, %edx
    00000035: movq %rbx, %rsi
    00000038: callq *%rax # CallProfile: 8 (8 mispreds) :
                  { f1: 3 (3 mispreds) },
                  { f2: 1 (1 mispreds) },
                  { f3: 4 (4 mispreds) }
    0000003a: movdqu 0x10(%rbx), %xmm0
    0000003f: movdqu %xmm0, 0x30(%rsp)
    00000045: movq %xmm0, %rcx
    0000004a: jmp .Ltmp26968
  Successors: .Ltmp26968 (count: 8, mispreds: 0)

With LLVM integration, the advanced disassembler can be added as a new standalone tool or as an extra feature to existing tools such as llvm-objdump.

3.3 Low-Level Program Instrumentation

Tools such as memory sanitizers rely on compiler-generated memory instrumentation to detect application errors at runtime.
If part of the application is written in assembly or comes from a library with no sources, that code can become a source of false positives and false negatives, depending on the memory access types missed by the instrumentation. Since BOLT can instrument arbitrary code, independent of its source, it can complement compiler-based instrumentation and fill in the missing parts, leading to a higher-quality signal from the tool.

3.4 Static Analysis

BOLT offers an intuitive API to perform compiler-level analyses directly on machine code. Because BOLT reconstructs the control-flow graph of the program, it allows pass writers to use data-flow analyses to extract arbitrary information beyond the scope of a single basic block. As an example, we can perform shrink wrapping by checking how stack-accessing instructions use the frame in a given function and validating opportunities to move memory accesses in hot basic blocks to colder areas.

While this framework gives the optimization writer greater reach into previously opaque third-party binary code linked into the binary, perhaps its greatest value is that this code becomes visible at all. With static analysis, users can also write passes to draw conclusions about safety concerns, such as checking whether system or library calls are being used in a suspicious way, without the need to execute the binary.

4. Plans

BOLT is a mature project that has been used in production for years, but we continue to develop BOLT and invest in new use cases and optimizations. Below is a short list of areas currently under development:

1. Automatic profile collection and optimization
2. Mach-O support
3. LLD integration
4. Optimizing the Linux kernel

--
BOLT Team

References

[1] Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: a practical binary optimizer for data centers and beyond.
In "Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization" (CGO 2019). IEEE Press, 2–14. https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/

[2] https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html

[3] Joe Savage and Timothy M. Jones. 2020. HALO: post-link heap-layout optimisation. In "Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization" (CGO 2020). Association for Computing Machinery, New York, NY, USA, 94–106. DOI: https://doi.org/10.1145/3368826.3377914

[4] https://github.com/facebookincubator/BOLT/issues/46

[5] Maksim Panchenko. 2016. Building Binary Optimizer with LLVM. 2016 European LLVM Developers' Meeting. https://llvm.org/devmtg/2016-03/#presentation7

[6] https://github.com/facebookincubator/BOLT

[7] https://github.com/facebookincubator/BOLT/blob/master/llvm.patch

[8] perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page

[9] Runtime Performance Optimization Blueprint: Intel® Architecture Optimizations with LBR. https://software.intel.com/content/www/us/en/develop/articles/runtime-optimization-blueprint-IA-optimization-with-last-branch-record.html

[10] The HipHop Virtual Machine. https://hhvm.com/

[11] Propeller RFC. https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf. Updated performance results: https://lists.llvm.org/pipermail/llvm-dev/2019-October/136030.html
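As an illustration of the kind of data-flow pass that section 3.4 describes (this is not BOLT's actual API; the CFG encoding, block names, and registers are all hypothetical), here is a minimal backward liveness analysis over a toy control-flow graph, computing which registers are live on entry to each block:

```python
# A sketch of a classic backward worklist data-flow analysis, of the kind a
# post-link framework can run over a reconstructed CFG. Not BOLT code: the
# CFG encoding and register names below are invented for illustration.
from collections import deque

# Toy CFG: block name -> (defs, uses, successors).
CFG = {
    "entry": (set(), set(), ["hot"]),
    "hot":   ({"rax"}, {"rbx"}, ["cold", "exit"]),
    "cold":  (set(), {"rax"}, ["exit"]),
    "exit":  (set(), set(), []),
}

def liveness(cfg):
    """Compute live-in sets: live_in[b] = use(b) | (live_out(b) - def(b))."""
    live_in = {b: set() for b in cfg}
    work = deque(cfg)
    while work:
        b = work.popleft()
        defs, uses, succs = cfg[b]
        # live_out(b) is the union of live-in sets of b's successors.
        live_out = set().union(*(live_in[s] for s in succs)) if succs else set()
        new_in = uses | (live_out - defs)
        if new_in != live_in[b]:
            live_in[b] = new_in
            # Re-enqueue predecessors whose live-out just changed.
            for pred, (_, _, pred_succs) in cfg.items():
                if b in pred_succs:
                    work.append(pred)
    return live_in

print(liveness(CFG))
```

With this toy CFG, "rax" is live into "cold" but dead into "hot" (which redefines it), the kind of fact a shrink-wrapping or frame-optimization pass would query.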
raghesh via llvm-dev
2020-Oct-21 06:22 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
Hi All,

I believe BOLT will be a very useful addition to the LLVM tool set. We have already started using it and are thinking about adding more binary-level analyses and optimizations. Most importantly, the control flow graph and the disassembled code, together with debugging info, turned out to be quite handy. Additionally, the associated tool llvm-boltdiff (to diff performance numbers of two different binaries) was also very useful.

Regards,
Raghesh
------------------------------
Raghesh Aloor
AMD India Pvt. Ltd. Bengaluru.
------------------------------

On Wed, Oct 21, 2020 at 12:09 AM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [RFC quoted in full; see the original message above.]
Nick Desaulniers via llvm-dev
2020-Oct-23 18:53 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
On Tue, Oct 20, 2020 at 11:39 AM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org> wrote:> > Hi All, > > > > Please find below an RFC for adding a binary optimization framework to LLVM. > > > > Thanks for the feedback, > > Maksim & BOLT Team > > > > [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization > > > > 1. Background and Motivation > > > > BOLT is a static post-link binary optimizer successfully used inside and outside of Facebook for code optimizations that complement traditional compiler whole-program and link-time optimizations [1]. Last year Google reported that BOLT accelerates their key workloads by 2% to 6% [2]. Additionally, BOLT is used in academia as a framework for program instrumentation, transformation, and binary analysis [3]. > > > > We believe that including BOLT in the LLVM project will benefit the community in several ways [4]. First, BOLT is useful as a binary optimizer. It has been used at Facebook to accelerate the top services, and we would love to see more people benefit from the performance boost that BOLT brings. We would also love to see our partners adopt BOLT's new use-cases, such as optimizing mobile and embedded applications. Beyond the binary optimizer, BOLT is an impressive infrastructure that enables research in the following areas: > > Advanced disassembly > Low-level program instrumentation > Static analysis > > > > 2. Overview > > > > While BOLT uses several LLVM libraries [5], it is currently a separate project based on an out-of-tree version of LLVM [6]. Most of the code lives under separate tools/llvm-bolt directory, and required changes to LLVM are included in the supplied patch [7]. The patch mainly consists of backported fixes and minor extensions of the existing interfaces to update debug info, frame information, and ORC. > > > > BOLT has no external build dependencies outside of LLVM. For profile collection, we recommend using sampling with a Linux perf tool [8]. 
LBR (last branch records) feature [9] is recommended as it improves profile quality dramatically. BOLT can be supplied perf.data profile directly, but in general, we advise converting the profile first using the supplied perf2bolt utility. In case the sampling profiling is unavailable, e.g., while running under a hypervisor locally or in the cloud, BOLT provides a builtin instrumentation. > > > > BOLT supports x86-64 ELF as the primary platform. We have also implemented the initial support for Aarch64, and the support for MachO is in the works. > > > > 3. USE CASES > > > > 3.1 Link-time and binary transformations and optimizations > > > > Static profile-driven post-link optimization of an application was our primary goal when creating BOLT. The convenience and expandability that the model offers perhaps could only be exceeded by a dynamic binary optimization approach. E.g., to optimize a binary using a perf-generated profile, the user has to execute a single command: > > > > $ llvm-bolt a.out -data perf.data -o a.out.bolt <optimization opts> > > > > No recompilation of a.out is needed (*), but we ask to link with "--emit-relocs" for maximum performance gains. However, the latter is not a strict requirement, and the command works even on stripped binaries. > > > > We have worked on reducing BOLT processing time and memory consumption. Overall, BOLT efficiency has been improved by over 3X during that process. It takes less than a minute to optimize HHVM [10] production binary with over 100 MB of code and less than 3 minutes to optimize another multi-gigabyte production binary with 500 MB of code. BOLT memory consumption is under 5 GB for HHVM and under 13 GB for the large binary (**). > > > > Fast turn-around times for optimizing an application with BOLT without the need for source code allow us to design a system that automatically manages binary profiling and optimization. 
This is one of the most exciting directions in application optimization we are currently pursuing. > > > > We thought it would be interesting to perform a fresh comparison of BOLT overhead to Propeller [11]. Sadly, Propeller relies on a custom version of an external create-llvm-prof tool that we could not build in our setup, and using a GitHub-hosted binary version of that tool in the virtual environment produced a profile that caused a performance regression of the optimized application. Using "-fbasic-block-sections=all" Propeller option without a profile resulted in fast linking times but also caused a performance regression. > > > > Although code reordering is the primary optimization in BOLT and the source of most performance gains, it is not the only optimization in BOLT. The full list of optimizations includes late inlining, indirect call promotion, frame optimizations, and others. The convenience of adding new optimizations in whole-program post-link mode is one of the main advantages that the BOLT framework offers, be it for research or practical purposes. > > > > Additionally, BOLT can add new code to a linked ELF binary. We have recently used that capability to allocate frequently-executed code on huge pages automatically. Even legacy applications can now use 2MB pages for code on x86-64 Linux systems to reduce the number of iTLB misses. > > > > BOLT's ability to process and optimize binary code without source code, such as legacy/distributed binaries, or code from third-party and assembly-written code, provides another advantage over a pure compiler-based approach. This advantage could or could not be important for optimizations depending on the user scenario. However, the visibility into the code mentioned above could be critical for other code re-writing cases such as vulnerability mitigations, instrumentation, and static analysis. > > > > *) Support for input with split functions is in the works. 
Before it is completed, we ask not to use "-freorder-blocks-and-partition" compiler option during the build. A similar or better result will be achieved by running the BOLT function splitting pass. > > > > **) while running BOLT with "-reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=1 -split-eh" optimization options. DWARF update takes extra time and memory. > > > > 3.2 Advanced Disassembly > > > > Internally, BOLT identifies code in the binary, breaks it into functions, disassembles, and uses static analysis to build a control flow graph. This functionality could complement a traditional disassembler, as it adds the ability to display possible targets for indirect jumps/calls, providing a better understanding of the control flow in a function. > > > > Additionally, for performance analysis, the disassembly is annotated with execution counts, including those for indirect branch targets within a function (jump tables) and across functions (indirect and virtual function calls). E.g.: > > > > <Basic Block> .LFT35985 > > Exec Count : 42 > > Predecessors: .Ltmp935657 > > 00003c8b: leaq "JUMP_TABLE/foo/1.14"(%rip), %rcx > > 00003c92: movslq (%rcx,%rax,4), %rax > > 00003c96: addq %rcx, %rax > > 00003c99: jmpq *%rax # JUMPTABLE @0x6e0f94 > > Successors: .Ltmp935702 (count: 0, mispreds: 0), > > .Ltmp935705 (count: 41, mispreds: 26), > > .Ltmp935703 (count: 0, mispreds: 0), > > .Ltmp935704 (count: 1, mispreds: 0) > > .... 
> > <BasicBlock>.LFT43 (9 instructions, align : 1) > > Exec Count : 8 > > Predecessors: .LBB01191 > > 00000028: movq %rdx, %rbx > > 0000002b: leaq 0x20(%rsp), %rdi > > 00000030: movl $0x2, %edx > > 00000035: movq %rbx, %rsi > > 00000038: callq *%rax # CallProfile: 8 (8 mispreds) : > > { f1: 3 (3 mispreds) }, > > { f2: 1 (1 mispreds) }, > > { f3: 4 (4 mispreds) } > > 0000003a: movdqu 0x10(%rbx), %xmm0 > > 0000003f: movdqu %xmm0, 0x30(%rsp) > > 00000045: movq %xmm0, %rcx > > 0000004a: jmp .Ltmp26968 > > Successors: .Ltmp26968 (count: 8, mispreds: 0) > > > > With LLVM integration, the advanced disassembler can be added as a new standalone tool or as an extra feature to existing tools such as llvm-objdump. > > > > 3.3 Low-Level Program Instrumentation > > > > Tools, such as memory sanitizers, rely on compiler-generated memory instrumentation to detect application errors at runtime. Suppose part of the application is written in assembly or comes from a library with no sources. In that case, such code could become a source of false positives and false negatives depending on the memory access types missed by the instrumentation. Since BOLT can instrument arbitrary code, independent of the source, it can complement compiler-based instrumentation and fill in the missing parts leading to a higher quality signal from the tool. > > > > 3.4 Static Analysis > > > > BOLT offers an intuitive API to perform compiler-level analyses directly on machine code. Because BOLT reconstructs the control-flow graph of the program, it allows pass writers to extract arbitrary information beyond the scope of a single basic block with data-flow analyses. As an example, we can perform shrink wrapping by checking how stack-accessing instructions are using the frame in a given function and validating opportunities to move memory accesses in hot basic blocks to colder areas. 
> While this framework gives the optimization writer reach into previously opaque third-party binary code linked into the binary, perhaps its greatest value is that this code becomes visible at all. With static analysis, users can also write passes to draw conclusions about safety concerns, such as checking whether system or library calls are being used in a suspicious way, without the need to execute the binary.
>
> 4. Plans
>
> BOLT is a mature project that has been used in production for years, but we continue to develop BOLT and invest in new use-cases and optimizations. Below is a shortlist of areas currently under development:
>
> 1. Automatic profile collection and optimization
> 2. MachO support
> 3. LLD integration
> 4. Optimizing the Linux kernel

Hi Maksim, you already know I'm a big fan of BOLT. Specifically for the area under development of optimizing the Linux kernel, I think BOLT could be a huge competitive advantage for LLVM. Any way we could discuss/collaborate more on that? I look forward to seeing patches upstreamed.

> --
> BOLT Team
>
> References
>
> [1] Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: a practical binary optimizer for data centers and beyond. In "Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization" (CGO 2019). IEEE Press, 2–14. https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/
> [2] https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html
> [3] Joe Savage and Timothy M. Jones. 2020. HALO: post-link heap-layout optimisation. In "Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization" (CGO 2020). Association for Computing Machinery, New York, NY, USA, 94–106. https://doi.org/10.1145/3368826.3377914
> [4] https://github.com/facebookincubator/BOLT/issues/46
> [5] Maksim Panchenko. 2016. Building Binary Optimizer with LLVM. 2016 European LLVM Developers' Meeting. https://llvm.org/devmtg/2016-03/#presentation7
> [6] https://github.com/facebookincubator/BOLT
> [7] https://github.com/facebookincubator/BOLT/blob/master/llvm.patch
> [8] perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page
> [9] Runtime Performance Optimization Blueprint: Intel® Architecture Optimizations with LBR. https://software.intel.com/content/www/us/en/develop/articles/runtime-optimization-blueprint-IA-optimization-with-last-branch-record.html
> [10] The HipHop Virtual Machine. https://hhvm.com/
> [11] Propeller RFC. https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf. Updated performance results: https://lists.llvm.org/pipermail/llvm-dev/2019-October/136030.html

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Thanks,
~Nick Desaulniers
Tobias Hieta via llvm-dev
2020-Oct-24 14:59 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
Hello,

Thanks for your work on BOLT. We tried to use it to speed up clang itself, following the instructions in your GitHub repo. The main problem we ran into was perf not being able to run on our AMD workstations nor in our VMs in the cloud. I know it's not exactly BOLT-related, but are there other ways to generate this profile? Ideally, we would collect data while building our app as part of our toolchain builder in CI (as we do with PGO), but I was not able to get this working with the perf tool.

Thanks for your help,
Tobias

On Tue, Oct 20, 2020, 20:39 Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> [full RFC quote snipped]
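For readers with the same question as Tobias: on machines where LBR sampling does work, the profile-collection flow suggested by the BOLT README is roughly the following. This is a hedged sketch, not authoritative: the perf event name, the `any,u` branch filter, and the perf2bolt/llvm-bolt flag spellings should be checked against the current BOLT documentation, and `./a.out` stands in for your own binary.

```shell
# 1. Sample the running application with LBRs (Intel CPUs).
perf record -e cycles:u -j any,u -o perf.data -- ./a.out

# 2. Convert the perf profile into BOLT's profile format.
perf2bolt -p perf.data -o perf.fdata ./a.out

# 3. Optimize using the converted profile (options taken from the RFC).
llvm-bolt ./a.out -o a.out.bolt -data=perf.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort \
    -split-functions=1 -split-eh
```

Where LBRs are unavailable (as on the AMD machines and VMs mentioned above), BOLT's documentation describes a no-LBR conversion mode for plain cycles samples, and the builtin instrumentation mentioned in the RFC avoids perf entirely.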
Maksim Panchenko via llvm-dev
2020-Oct-25 00:12 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
Hi Nick,

Re-writing the kernel binary proved to be quite a challenge, considering the number of techniques used for code patching/modification at startup and runtime. We think we have identified most of the special ELF sections involved and have started prototyping the support. It's going to be a fun journey. I'd be happy to discuss the collaboration.

Maksim

On 10/23/20, 11:53 AM, "Nick Desaulniers" <ndesaulniers at google.com> wrote:

> [quoted message snipped]
Maksim Panchenko via llvm-dev
2020-Oct-25 00:21 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
Hi Tobias, If you have access to simple counters, such as CPU cycles, you can use those instead of LBRs. We have instructions on our GitHub page on how to use this mode. If you cannot run perf at all, we have implemented low-level instrumentation. It is not yet at production-quality level, but you can give it a try and report any issues on GitHub: $ llvm-bolt a.out -instrument -instrumentation-file=instr.data -o a.out.bolt Maksim From: Tobias Hieta <tobias at plexapp.com> Date: Saturday, October 24, 2020 at 7:59 AM To: Maksim Panchenko <maks at fb.com> Cc: llvm-dev <llvm-dev at lists.llvm.org> Subject: Re: [llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization Hello, Thanks for your work on bolt. We tried to use it to speedup clang itself following the instructions in your github repo. The main problem we ran into with pref not being able to run on our AMD workstations nor in our VMs in the cloud. I know it's not exactly bolt related but are there other ways to generate this profile? Ideally we would collect data while building our app as part of our toolchain builder in the CI (as we do with PGO) - but I was not able to get this working with the perf tool. Thanks for your help. Tobias On Tue, Oct 20, 2020, 20:39 Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: Hi All, Please find below an RFC for adding a binary optimization framework to LLVM. Thanks for the feedback, Maksim & BOLT Team [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization 1. Background and Motivation BOLT is a static post-link binary optimizer successfully used inside and outside of Facebook for code optimizations that complement traditional compiler whole-program and link-time optimizations [1]. Last year Google reported that BOLT accelerates their key workloads by 2% to 6% [2]. 
Additionally, BOLT is used in academia as a framework for program instrumentation, transformation, and binary analysis [3]. We believe that including BOLT in the LLVM project will benefit the community in several ways [4]. First, BOLT is useful as a binary optimizer. It has been used at Facebook to accelerate the top services, and we would love to see more people benefit from the performance boost that BOLT brings. We would also love to see our partners adopt BOLT's new use-cases, such as optimizing mobile and embedded applications. Beyond the binary optimizer, BOLT is an impressive infrastructure that enables research in the following areas: * Advanced disassembly * Low-level program instrumentation * Static analysis 2. Overview While BOLT uses several LLVM libraries [5], it is currently a separate project based on an out-of-tree version of LLVM [6]. Most of the code lives under separate tools/llvm-bolt directory, and required changes to LLVM are included in the supplied patch [7]. The patch mainly consists of backported fixes and minor extensions of the existing interfaces to update debug info, frame information, and ORC. BOLT has no external build dependencies outside of LLVM. For profile collection, we recommend using sampling with a Linux perf tool [8]. LBR (last branch records) feature [9] is recommended as it improves profile quality dramatically. BOLT can be supplied perf.data profile directly, but in general, we advise converting the profile first using the supplied perf2bolt utility. In case the sampling profiling is unavailable, e.g., while running under a hypervisor locally or in the cloud, BOLT provides a builtin instrumentation. BOLT supports x86-64 ELF as the primary platform. We have also implemented the initial support for Aarch64, and the support for MachO is in the works. 3. USE CASES 3.1 Link-time and binary transformations and optimizations Static profile-driven post-link optimization of an application was our primary goal when creating BOLT. 
The convenience and expandability that the model offers could perhaps only be exceeded by a dynamic binary optimization approach. E.g., to optimize a binary using a perf-generated profile, the user has to execute a single command:

$ llvm-bolt a.out -data perf.data -o a.out.bolt <optimization opts>

No recompilation of a.out is needed (*), but we ask that you link with "--emit-relocs" for maximum performance gains. However, the latter is not a strict requirement, and the command works even on stripped binaries.

We have worked on reducing BOLT processing time and memory consumption. Overall, BOLT efficiency has been improved by over 3X during that process. It takes less than a minute to optimize the HHVM [10] production binary with over 100 MB of code and less than 3 minutes to optimize another multi-gigabyte production binary with 500 MB of code. BOLT memory consumption is under 5 GB for HHVM and under 13 GB for the larger binary (**).

Fast turn-around times for optimizing an application with BOLT, without the need for source code, allow us to design a system that automatically manages binary profiling and optimization. This is one of the most exciting directions in application optimization we are currently pursuing.

We thought it would be interesting to perform a fresh comparison of BOLT overhead to Propeller [11]. Sadly, Propeller relies on a custom version of an external create-llvm-prof tool that we could not build in our setup, and using a GitHub-hosted binary version of that tool in the virtual environment produced a profile that caused a performance regression of the optimized application. Using the "-fbasic-block-sections=all" Propeller option without a profile resulted in fast linking times but also caused a performance regression.

Although code reordering is the primary optimization in BOLT and the source of most performance gains, it is not the only optimization in BOLT.
The full list of optimizations includes late inlining, indirect call promotion, frame optimizations, and others. The convenience of adding new optimizations in whole-program post-link mode is one of the main advantages that the BOLT framework offers, be it for research or practical purposes.

Additionally, BOLT can add new code to a linked ELF binary. We have recently used that capability to allocate frequently-executed code on huge pages automatically. Even legacy applications can now use 2MB pages for code on x86-64 Linux systems to reduce the number of iTLB misses.

BOLT's ability to process and optimize binary code without source code, such as legacy or distributed binaries, third-party libraries, and assembly-written code, provides another advantage over a pure compiler-based approach. This advantage may or may not be important for optimizations, depending on the user scenario. However, the visibility into the code mentioned above could be critical for other code-rewriting cases such as vulnerability mitigations, instrumentation, and static analysis.

*) Support for input with split functions is in the works. Before it is completed, we ask that you not use the "-freorder-blocks-and-partition" compiler option during the build. A similar or better result will be achieved by running the BOLT function splitting pass.

**) While running BOLT with "-reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=1 -split-eh" optimization options. DWARF update takes extra time and memory.

3.2 Advanced Disassembly

Internally, BOLT identifies code in the binary, breaks it into functions, disassembles it, and uses static analysis to build a control flow graph. This functionality could complement a traditional disassembler, as it adds the ability to display possible targets for indirect jumps/calls, providing a better understanding of the control flow in a function.
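In miniature, the idea is that once jump-table targets have been recovered, the successor list of an indirect jump can be populated in the CFG instead of being left opaque. The class and function names below are hypothetical, invented for this sketch, and do not reflect BOLT's API.

```python
# Sketch: resolving decoded jump-table entries into CFG successor edges.
class JumpTable:
    def __init__(self, targets):
        self.targets = targets  # addresses the indirect jump may reach

class BasicBlock:
    def __init__(self, label):
        self.label = label
        self.successors = []

def attach_jump_table(block, table, blocks_by_addr):
    """Resolve jump-table entries to successor basic blocks, skipping
    addresses that do not map to a known block."""
    for addr in table.targets:
        target = blocks_by_addr.get(addr)
        if target is not None and target not in block.successors:
            block.successors.append(target)

# A function whose entry block ends in an indirect jump with two cases.
entry = BasicBlock("entry")
case0 = BasicBlock("case0")
case1 = BasicBlock("case1")
blocks = {0x100: case0, 0x140: case1}

# The decoded table lists both case addresses plus one address outside
# the known blocks, which is ignored.
attach_jump_table(entry, JumpTable([0x100, 0x140, 0x900]), blocks)
print([b.label for b in entry.successors])  # both cases become successors
```

A traditional linear disassembler would show only `jmpq *%rax`; with the recovered table, both targets appear as explicit successor edges.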
Additionally, for performance analysis, the disassembly is annotated with execution counts, including those for indirect branch targets within a function (jump tables) and across functions (indirect and virtual function calls). E.g.:

<Basic Block> .LFT35985
  Exec Count  : 42
  Predecessors: .Ltmp935657
    00003c8b: leaq "JUMP_TABLE/foo/1.14"(%rip), %rcx
    00003c92: movslq (%rcx,%rax,4), %rax
    00003c96: addq %rcx, %rax
    00003c99: jmpq *%rax # JUMPTABLE @0x6e0f94
  Successors: .Ltmp935702 (count: 0, mispreds: 0),
              .Ltmp935705 (count: 41, mispreds: 26),
              .Ltmp935703 (count: 0, mispreds: 0),
              .Ltmp935704 (count: 1, mispreds: 0)
....
<Basic Block> .LFT43 (9 instructions, align : 1)
  Exec Count  : 8
  Predecessors: .LBB01191
    00000028: movq %rdx, %rbx
    0000002b: leaq 0x20(%rsp), %rdi
    00000030: movl $0x2, %edx
    00000035: movq %rbx, %rsi
    00000038: callq *%rax # CallProfile: 8 (8 mispreds) :
                { f1: 3 (3 mispreds) },
                { f2: 1 (1 mispreds) },
                { f3: 4 (4 mispreds) }
    0000003a: movdqu 0x10(%rbx), %xmm0
    0000003f: movdqu %xmm0, 0x30(%rsp)
    00000045: movq %xmm0, %rcx
    0000004a: jmp .Ltmp26968
  Successors: .Ltmp26968 (count: 8, mispreds: 0)

With LLVM integration, the advanced disassembler can be added as a new standalone tool or as an extra feature to existing tools such as llvm-objdump.

3.3 Low-Level Program Instrumentation

Tools such as memory sanitizers rely on compiler-generated memory instrumentation to detect application errors at runtime. Suppose part of the application is written in assembly or comes from a library with no sources. In that case, such code could become a source of false positives and false negatives, depending on the memory access types missed by the instrumentation. Since BOLT can instrument arbitrary code, independent of the source, it can complement compiler-based instrumentation and fill in the missing parts, leading to a higher-quality signal from the tool.

3.4 Static Analysis

BOLT offers an intuitive API to perform compiler-level analyses directly on machine code.
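For a taste of what such compiler-level analyses involve, here is a generic backward-liveness computation over a small CFG whose blocks are summarized by register use/def sets. The CFG and register names are invented for illustration and do not reflect BOLT's interfaces.

```python
# Backward liveness analysis to a fixpoint over a tiny CFG, a sketch of
# the kind of whole-function data-flow a binary-level framework enables.
def liveness(cfg, uses, defs):
    """cfg: {block: [successor blocks]}. Returns live-in sets per block."""
    live_in = {b: set() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            # live-out is the union of successors' live-in sets.
            live_out = set()
            for s in cfg[b]:
                live_out |= live_in[s]
            new_in = uses[b] | (live_out - defs[b])
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    return live_in

cfg = {"entry": ["body"], "body": ["body", "exit"], "exit": []}
uses = {"entry": {"rdi"}, "body": {"rax"}, "exit": {"rbx"}}
defs = {"entry": {"rax"}, "body": set(), "exit": set()}

# %rbx is live all the way into the entry block, since exit reads it
# and no block writes it.
print(liveness(cfg, uses, defs)["entry"])
```

The same worklist skeleton underlies analyses like the frame-usage checks mentioned below for shrink wrapping.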
Because BOLT reconstructs the control-flow graph of the program, it allows pass writers to extract arbitrary information beyond the scope of a single basic block with data-flow analyses. As an example, we can perform shrink wrapping by checking how stack-accessing instructions are using the frame in a given function and validating opportunities to move memory accesses in hot basic blocks to colder areas.

While this framework provides the optimization writer with greater reach over previously opaque third-party binary code linked into the binary, perhaps the greatest value lies in this code being visible at all. With static analysis, users can write passes to draw conclusions on safety concerns as well, such as checking whether system or library calls are being used in a suspicious way, without the need to execute the binary.

4. Plans

BOLT is a mature project that has been used in production for years, but we continue to develop BOLT and invest in new use-cases and optimizations. Below is a short list of areas that are currently under development:

1. Automatic profile collection and optimization
2. MachO support
3. LLD integration
4. Optimizing the Linux kernel

--
BOLT Team

References

[1] Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: a practical binary optimizer for data centers and beyond. In "Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization" (CGO 2019). IEEE Press, 2–14. https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/

[2] https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html

[3] Joe Savage and Timothy M. Jones. 2020. HALO: post-link heap-layout optimisation. In "Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization" (CGO 2020). Association for Computing Machinery, New York, NY, USA, 94–106.
DOI: https://doi.org/10.1145/3368826.3377914

[4] https://github.com/facebookincubator/BOLT/issues/46

[5] Maksim Panchenko. 2016. Building Binary Optimizer with LLVM. 2016 European LLVM Developers' Meeting. https://llvm.org/devmtg/2016-03/#presentation7

[6] https://github.com/facebookincubator/BOLT

[7] https://github.com/facebookincubator/BOLT/blob/master/llvm.patch

[8] perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page

[9] Runtime Performance Optimization Blueprint: Intel® Architecture Optimizations with LBR. https://software.intel.com/content/www/us/en/develop/articles/runtime-optimization-blueprint-IA-optimization-with-last-branch-record.html

[10] The HipHop Virtual Machine. https://hhvm.com/

[11] Propeller RFC. https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf. Updated performance results: https://lists.llvm.org/pipermail/llvm-dev/2019-October/136030.html

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Sriraman Tallam via llvm-dev
2020-Oct-29 20:23 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
Hello BOLT team,

tl;dr BOLT has shown that it can further optimize already heavily optimized binaries, and we strongly welcome the efforts towards its inclusion into LLVM.

Thank you for writing a detailed proposal for upstreaming BOLT to LLVM. We at Google have used BOLT to optimize our internal benchmarks, and we found that BOLT can further improve heavily optimized binaries by 2% to 6%. This result inspired us to look closely at post-link optimizations and led to the development of Propeller.

Regarding the upstreaming of BOLT, a suggestion we have is to split BOLT into two pieces: the disassembler and the optimizer. The disassembler would live as a separate tool or library which would disassemble the binary, lift it to a suitable IR (MCInst or MachineIR or other), and either dump it into a file or put it in a memory buffer. Dumping to a file needs serialization support, but efforts to serialize MachineIR are ongoing. The BOLT optimizer can then pick up the MCInst/MachineIR, run the post-link optimizations converting it to optimized IR, and link the final binary. The advantage of this approach is that you do not need to rely on disassembly if the MCInst IR from the original optimized build can be cached in a file, or the compiler backend could be re-run to regenerate the MCInst IR in a memory buffer, and you can kick off directly from the BOLT optimizer.

We also wanted to clarify the questions about Propeller:

1. create_llvm_prof could indeed use some love. We are considering bringing it closer to LLVM; that aside, we initially thought of leveraging it for Propeller's needs because it seemed the closest fit (it handles profile creation for AFDO). With the introduction of llvm-profgen (https://reviews.llvm.org/D89707), that is probably a more appropriate tool, and we're looking at integrating our work there instead.

2.
We will shortly post detailed instructions on how to optimize the clang benchmark and mysql with Propeller, which will clarify some of the questions you have raised here regarding performance regressions.

We look forward to the patches and are happy to help with looking at how BOLT and Propeller could share common functionality within LLVM.

Best,
The Propeller Team.

On Tue, Oct 20, 2020 at 11:39 AM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [original RFC quoted in full; snipped]
Chris Lattner via llvm-dev
2020-Oct-31 19:25 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
<just catching up on email> Maksim, it seems like there is strong consensus to move all or parts of BOLT upstream. How would you like to stage this in, and what do you think of Sriraman’s proposal below? -Chris> On Oct 29, 2020, at 1:23 PM, Sriraman Tallam via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hello BOLT team, > > tl;dr BOLT has shown that it can further optimize already heavily optimized binaries and we strongly welcome the efforts towards its inclusion into LLVM. > > Thank you for writing a detailed proposal for upstreaming BOLT to LLVM. We, at Google, have used BOLT to optimize our internal benchmarks and we found that BOLT can further improve heavily optimized binaries by 2% to 6%. This result inspired us to look closely at post-link optimizations and led to the development of Propeller. > > Regarding the upstreaming of BOLT, a suggestion we have is to split up BOLT into two pieces, the disassembler and the optimizer. The disassembler would live as a separate tool or a library which would disassemble the binary and lift it to a suitable IR (MCInst or MachineIR or other), and either dump into a file or put it in a memory buffer. Dumping to a file needs serialization support but efforts to serialize MachineIR are ongoing. The BOLT optimizer can then pick up the MCInst/MachineIR and run the post-link optimizations converting it to optimized IR and then linking the final binary. The advantage of this approach is that you do not need to rely on disassembly if MCInst IR from the original optimized build can be cached in a file, or the compiler backend could be re-run to regenerate the MCInst IR in a memory buffer, and you can directly kick off from the BOLT optimizer. > > We also wanted to clarify the questions about Propeller: > > create_llvm_prof could indeed use some love. 
We are considering bringing it closer to llvm; that aside, we initially thought of leveraging it for Propeller's needs because it seemed the closest (it handles profile creation for AFDO). With the introduction of llvm-profgen https://reviews.llvm.org/D89707 <https://reviews.llvm.org/D89707> , that is probably a more appropriate tool, and we're looking at integrating our work there instead. > We will shortly post detailed instructions on how to optimize the clang benchmark and mysql with Propeller which will clarify some of the questions you have raised here regarding performance regressions. > > We look forward to the patches and are happy to help with looking at how BOLT and Propeller could share common functionality within LLVM. > > Best, > The Propeller Team. > > > On Tue, Oct 20, 2020 at 11:39 AM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > Hi All, > > Please find below an RFC for adding a binary optimization framework to LLVM. > > Thanks for the feedback, > Maksim & BOLT Team > > [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization > > 1. Background and Motivation > > BOLT is a static post-link binary optimizer successfully used inside and outside of Facebook for code optimizations that complement traditional compiler whole-program and link-time optimizations [1]. Last year Google reported that BOLT accelerates their key workloads by 2% to 6% [2]. Additionally, BOLT is used in academia as a framework for program instrumentation, transformation, and binary analysis [3]. > > We believe that including BOLT in the LLVM project will benefit the community in several ways [4]. First, BOLT is useful as a binary optimizer. It has been used at Facebook to accelerate the top services, and we would love to see more people benefit from the performance boost that BOLT brings. 
We would also love to see our partners adopt BOLT's new use-cases, such as optimizing mobile and embedded applications. Beyond the binary optimizer, BOLT is an impressive infrastructure that enables research in the following areas: > Advanced disassembly > Low-level program instrumentation > Static analysis > > 2. Overview > > While BOLT uses several LLVM libraries [5], it is currently a separate project based on an out-of-tree version of LLVM [6]. Most of the code lives under separate tools/llvm-bolt directory, and required changes to LLVM are included in the supplied patch [7]. The patch mainly consists of backported fixes and minor extensions of the existing interfaces to update debug info, frame information, and ORC. > > BOLT has no external build dependencies outside of LLVM. For profile collection, we recommend using sampling with a Linux perf tool [8]. LBR (last branch records) feature [9] is recommended as it improves profile quality dramatically. BOLT can be supplied perf.data profile directly, but in general, we advise converting the profile first using the supplied perf2bolt utility. In case the sampling profiling is unavailable, e.g., while running under a hypervisor locally or in the cloud, BOLT provides a builtin instrumentation. > > BOLT supports x86-64 ELF as the primary platform. We have also implemented the initial support for Aarch64, and the support for MachO is in the works. > > 3. USE CASES > > 3.1 Link-time and binary transformations and optimizations > > Static profile-driven post-link optimization of an application was our primary goal when creating BOLT. The convenience and expandability that the model offers perhaps could only be exceeded by a dynamic binary optimization approach. 
E.g., to optimize a binary using a perf-generated profile, the user has to execute a single command: > > $ llvm-bolt a.out -data perf.data -o a.out.bolt <optimization opts> > > No recompilation of a.out is needed (*), but we ask to link with "--emit-relocs" for maximum performance gains. However, the latter is not a strict requirement, and the command works even on stripped binaries. > > We have worked on reducing BOLT processing time and memory consumption. Overall, BOLT efficiency has been improved by over 3X during that process. It takes less than a minute to optimize HHVM [10] production binary with over 100 MB of code and less than 3 minutes to optimize another multi-gigabyte production binary with 500 MB of code. BOLT memory consumption is under 5 GB for HHVM and under 13 GB for the large binary (**). > > Fast turn-around times for optimizing an application with BOLT without the need for source code allow us to design a system that automatically manages binary profiling and optimization. This is one of the most exciting directions in application optimization we are currently pursuing. > > We thought it would be interesting to perform a fresh comparison of BOLT overhead to Propeller [11]. Sadly, Propeller relies on a custom version of an external create-llvm-prof tool that we could not build in our setup, and using a GitHub-hosted binary version of that tool in the virtual environment produced a profile that caused a performance regression of the optimized application. Using "-fbasic-block-sections=all" Propeller option without a profile resulted in fast linking times but also caused a performance regression. > > Although code reordering is the primary optimization in BOLT and the source of most performance gains, it is not the only optimization in BOLT. The full list of optimizations includes late inlining, indirect call promotion, frame optimizations, and others. 
The convenience of adding new optimizations in whole-program post-link mode is one of the main advantages that the BOLT framework offers, be it for research or practical purposes. > > Additionally, BOLT can add new code to a linked ELF binary. We have recently used that capability to allocate frequently-executed code on huge pages automatically. Even legacy applications can now use 2MB pages for code on x86-64 Linux systems to reduce the number of iTLB misses. > > BOLT's ability to process and optimize binary code without source code, such as legacy/distributed binaries, or code from third-party and assembly-written code, provides another advantage over a pure compiler-based approach. This advantage could or could not be important for optimizations depending on the user scenario. However, the visibility into the code mentioned above could be critical for other code re-writing cases such as vulnerability mitigations, instrumentation, and static analysis. > > *) Support for input with split functions is in the works. Before it is completed, we ask not to use "-freorder-blocks-and-partition" compiler option during the build. A similar or better result will be achieved by running the BOLT function splitting pass. > > **) while running BOLT with "-reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=1 -split-eh" optimization options. DWARF update takes extra time and memory. > > 3.2 Advanced Disassembly > > Internally, BOLT identifies code in the binary, breaks it into functions, disassembles, and uses static analysis to build a control flow graph. This functionality could complement a traditional disassembler, as it adds the ability to display possible targets for indirect jumps/calls, providing a better understanding of the control flow in a function. 
Additionally, for performance analysis, the disassembly is annotated with execution counts, including those for indirect branch targets within a function (jump tables) and across functions (indirect and virtual function calls). E.g.:

<Basic Block> .LFT35985
  Exec Count : 42
  Predecessors: .Ltmp935657
    00003c8b:   leaq    "JUMP_TABLE/foo/1.14"(%rip), %rcx
    00003c92:   movslq  (%rcx,%rax,4), %rax
    00003c96:   addq    %rcx, %rax
    00003c99:   jmpq    *%rax # JUMPTABLE @0x6e0f94
  Successors: .Ltmp935702 (count: 0, mispreds: 0),
              .Ltmp935705 (count: 41, mispreds: 26),
              .Ltmp935703 (count: 0, mispreds: 0),
              .Ltmp935704 (count: 1, mispreds: 0)
....
<Basic Block> .LFT43 (9 instructions, align : 1)
  Exec Count : 8
  Predecessors: .LBB01191
    00000028:   movq    %rdx, %rbx
    0000002b:   leaq    0x20(%rsp), %rdi
    00000030:   movl    $0x2, %edx
    00000035:   movq    %rbx, %rsi
    00000038:   callq   *%rax # CallProfile: 8 (8 mispreds) :
                { f1: 3 (3 mispreds) },
                { f2: 1 (1 mispreds) },
                { f3: 4 (4 mispreds) }
    0000003a:   movdqu  0x10(%rbx), %xmm0
    0000003f:   movdqu  %xmm0, 0x30(%rsp)
    00000045:   movq    %xmm0, %rcx
    0000004a:   jmp     .Ltmp26968
  Successors: .Ltmp26968 (count: 8, mispreds: 0)

With LLVM integration, the advanced disassembler can be added as a new standalone tool or as an extra feature of existing tools such as llvm-objdump.

3.3 Low-Level Program Instrumentation

Tools such as memory sanitizers rely on compiler-generated memory instrumentation to detect application errors at runtime. If part of the application is written in assembly or comes from a library with no sources, such code can become a source of false positives and false negatives, depending on the memory access types missed by the instrumentation. Since BOLT can instrument arbitrary code, independent of the source, it can complement compiler-based instrumentation and fill in the missing parts, leading to a higher-quality signal from the tool.
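The shape of such a transformation can be illustrated with a toy sketch. Real BOLT instrumentation rewrites machine code; the Python below only mirrors the idea of splicing bookkeeping code in front of every memory access, with a made-up instruction format:

```python
# Toy sketch of source-independent instrumentation: walk a linear stream
# of (op, operand) "instructions" and splice a bookkeeping pseudo-op in
# front of every memory access. The ops here are invented for illustration.

MEMORY_OPS = {"load", "store"}

def instrument(insns):
    out = []
    for op, operand in insns:
        if op in MEMORY_OPS:
            # Record the access before the original instruction "executes".
            out.append(("count_access", operand))
        out.append((op, operand))
    return out

def count_recorded_accesses(insns):
    # Stand-in for the runtime half: tally the spliced-in bookkeeping ops.
    return sum(1 for op, _ in insns if op == "count_access")

prog = [("load", "x"), ("add", None), ("store", "y"), ("ret", None)]
instrumented = instrument(prog)
```

Here the four-instruction program grows to six instructions, and the two memory accesses are both recorded. The point is that nothing in this process needs source code: it operates purely on the (decoded) instruction stream, which is why BOLT can cover assembly and source-less libraries that compiler-based instrumentation misses.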
3.4 Static Analysis

BOLT offers an intuitive API to perform compiler-level analyses directly on machine code. Because BOLT reconstructs the control-flow graph of the program, data-flow analyses allow pass writers to extract arbitrary information beyond the scope of a single basic block. As an example, we can perform shrink wrapping by checking how stack-accessing instructions use the frame in a given function and validating opportunities to move memory accesses in hot basic blocks to colder areas.

While this framework gives the optimization writer greater reach over previously opaque third-party binary code linked into the binary, perhaps the greatest value is that such code becomes visible at all. With static analysis, users can also write passes that draw conclusions about safety concerns, such as checking whether system or library calls are being used in a suspicious way, without the need to execute the binary.

4. Plans

BOLT is a mature project that has been used in production for years, but we continue to develop BOLT and invest in new use-cases and optimizations. Below is a short list of areas currently under development:

1. Automatic profile collection and optimization
2. MachO support
3. LLD integration
4. Optimizing the Linux kernel

--
BOLT Team

References

[1] Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: a practical binary optimizer for data centers and beyond. In "Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization" (CGO 2019). IEEE Press, 2–14. https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/

[2] https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html

[3] Joe Savage and Timothy M. Jones.
2020. HALO: post-link heap-layout optimisation. In "Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization" (CGO 2020). Association for Computing Machinery, New York, NY, USA, 94–106. DOI: https://doi.org/10.1145/3368826.3377914

[4] https://github.com/facebookincubator/BOLT/issues/46

[5] Maksim Panchenko. 2016. Building Binary Optimizer with LLVM. 2016 European LLVM Developers' Meeting. https://llvm.org/devmtg/2016-03/#presentation7

[6] https://github.com/facebookincubator/BOLT

[7] https://github.com/facebookincubator/BOLT/blob/master/llvm.patch

[8] perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page

[9] Runtime Performance Optimization Blueprint: Intel® Architecture Optimizations with LBR. https://software.intel.com/content/www/us/en/develop/articles/runtime-optimization-blueprint-IA-optimization-with-last-branch-record.html

[10] The HipHop Virtual Machine. https://hhvm.com/

[11] Propeller RFC. https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
Updated performance results: https://lists.llvm.org/pipermail/llvm-dev/2019-October/136030.html

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Danila Kutenin via llvm-dev
2020-Nov-01 00:52 UTC
[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization
Hi everyone,

Sorry for being a bit late with a response. I just wanted to point out that at Yandex we used BOLT for the main search engine; it helped us gain 3-4% of additional performance. My colleagues will be very happy to see this technology merged into the LLVM toolchain.

Danila