Sean Silva via llvm-dev
2015-Aug-08 05:56 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
Accidentally sent to uiuc server. On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> wrote:> Can you compare your results with another approach: simply do not > instrument the top 1% hottest functions (by function entry count)? If this > simple approach provides most of the benefits (my measurements on one > codebase I tested show that it would eliminate over 97% of total function > counts), we may be able to use a simpler approach. > > The biggest thing I notice about this proposal is that although the focus > of the proposal is on reducing profiling overhead, it requires using a > different pass pipeline during *both* the "instr-generate" compilation and > the "instr-use" compilation. Therefore it is more than simply a reduction > in profiling overhead -- it can also change performance even ignoring > adding the profiling information in the IR. I think it would be interesting > to test the modified pass pipeline even in the absence of profile > information to better understand its effects in isolation (for example, > maybe it would be good to add these passes during -O3 regardless of whether > we are doing PGO). > > -- Sean Silva > > On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote: > >> Instrumentation based Profile Guided Optimization (PGO) is a compiler >> technique that leverages important program runtime information, such as >> precise edge counts and frequent value information, to make frequently >> executed code run faster. It's proven to be one of the most effective ways >> to improve program performance. >> >> An important design point of PGO is to decide where to place the >> instrumentation. In current LLVM PGO, the instrumentation is done in the >> Clang Front-End (will be referred as FE based instrumentation). Doing this >> early in the front-end gives better coverage information as the there are >> more precise line number information. The resulted profile also has >> relatively high tolerance to compiler changes. All the compiler change >> after the instrumentation point will not lead to mismatched IR that >> invalidates the profile. >> >> On the other hand, doing this too early gives the compiler fewer >> opportunities to optimize the code before instrumentation. It has >> significant impact on instrumentation runtime performance. In addition, it >> tends to produce larger binary and profile data file. Our internal C++ >> benchmarks shows that FE based instrumentation degrades the performance >> (compared to non-instrumented version) by 58.3%, and in some extreme cases, >> the application speed/throughput decreases by 95% (21x runtime slowdown). >> >> Running the instrumented binary too slow is not desirable in PGO for the >> following reasons: >> * This slows down already lengthy build time. In the worst case, the >> instrumented binary is so slow that it fails to run a representative >> workload, because slow execution can leads to more time-outs in many server >> programs. Other typical issues include: text size too big, failure to link >> instrumented binaries, memory usage exceeding system limits. >> * Slow runtime affects the program behavior. Real applications >> sometimes monitor the program runtime and have different execution path >> when the program run too slow. This would defeat the underlying assumption >> of PGO and make it less effective. >> >> This work proposes an option to turn on middle-end based instrumentation >> (new) that aims to speed up instrumentation runtime. The new >> instrumentation is referred to as ME based instrumentation in this >> document. Our experimental results show that ME instrumentation can speed >> up the instrumentation by 80% on average for typical C++ programs. Here are >> the two main design objectives: >> * Co-existing with FE Instrumenter: We do not propose to replace the >> FE based instrumentation because FE based instrumentation has its >> advantages and applications. User can choose which phase to do >> instrumentation via command line options. >> * PGO Runtime Support Sharing: The ME instrumenter will completely >> re-use the existing PGO’s runtime support. >> >> 1. FE Based Instrumentation Runtime Overhead Analysis >> >> Instrumented binaries are expected to run slower due to instrumentation >> code inserted. With FE based instrumentation, the overhead is especially >> high and runtime slowdown can be unacceptable in many cases. Further >> analysis shows that there are 3 important factors contributing to FE >> instrumentation slowdown : >> * [Main] Redundant counter updates of inlined functions. C++ programs >> can introduce large abstraction penalties by using lots of small inline >> functions (assignment operators, getters, setters, ctors/dtors etc). >> Overhead of instrumenting those small functions can be very large, making >> training runs too slow and in some cases to usable; >> * Non-optimal placement of the count updates; >> * A third factor is related value profiling (to be turned on in the >> future). Small and hot callee functions taking function pointer (callbacks) >> can incur overhead due to indirect call target profiling. >> >> >> 1.1 Redundant Counter Update >> >> If checking the assembly of the instrumented binary generated by current >> LLVM implementation, we can find many sequence of consecutive 'incq' >> instructions that updating difference counters in the same basic block. As >> an example that extracted from real binary: >> ... >> incq 0xa91d80(%rip) # 14df4b8 >> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8> >> incq 0xa79011(%rip) # 14c6750 >> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm> >> incq 0xa79442(%rip) # 14c6b88 >> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv> >> incq 0x9c288b(%rip) # 140ffd8 >> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl> >> ... >> >> From profile use point of view, many of these counter update are >> redundant. Considering the following example: >> void bar(){ >> sum++; >> } >> void foo() { >> bar(); >> } >> >> FE based instrumentation needs to insert counter update for the only BB >> of the bar(). >> bar: # @bar >> # BB#0: # %entry >> incq .L__llvm_profile_counters_bar(%rip) >> incl sum(%rip) >> retq >> >> It also need to insert the update the BB in function foo(). After >> inlining bar to foo(), the code is: >> foo: # @foo >> # BB#0: # %entry >> incq .L__llvm_profile_counters_foo(%rip) >> incq .L__llvm_profile_counters_bar(%rip) >> incl sum(%rip) >> retq >> >> If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is >> redundant -- the counter won't help downstream optimizations. On the other >> hand, if bar() is a large function and may not be suitable to be inlined >> for all callsites, this counter updated is necessary in order to produce >> more accurate profile data for the out-of-line instance of the callee. >> >> If foo() is a hot function, the overhead of updating two counters can be >> significant. This is especially bad for C++ program where there are tons of >> small inline functions. >> >> There is another missing opportunity in FE based instrumentation. The >> small functions’ control flow can usually be simplified when they are >> inlined into caller contexts. Once the control flow is simplified, many >> counter updates can therefore be eliminated. This is only possible for a >> middle end based late instrumenter. Defining a custom clean-up pass to >> remove redundant counter update is unrealistic and cannot be done in a sane >> way without destroying the profile integrity of neither the out-of-line nor >> inline instances of the callee. >> >> A much simpler and cleaner solution is to do a pre-inline pass to inline >> all the trivial inlines before instrumentation. In addition to removing >> the unnecessary count updates for the inline instances, another advantage >> of pre-inline is to provide context sensitive profile for these small >> inlined functions. This context senstive profile can further improve the >> PGO based optimizations. Here is a contrived example: >> void bar (int n) { >> if (n&1) >> do_sth1(); >> else >> do_sth2(); >> } >> >> void caller() { >> int s = 1; >> for (; s<100; s+=2) >> bar(s); >> >> for (s = 102; s< 200; s+=2) >> bar(s); >> } >> >> The direction of the branch inside bar will be totally opposite between >> two different callsites in ‘caller’. Without pre-inlining, the branch >> probability will be 50-50 which will be useless for later optimizations. >> With pre-inlining, the profile will have the perfect branch count for each >> callsite. The positive performance impact of context sensitive profiling >> due to pre-inlining has been observed in real world large C++ programs. >> Supporting context sensitive profiling is another way to solve this, but it >> will introduce large additional runtime/memory overhead. >> >> >> 1.2 Non-optimal placement of count update >> >> Another much smaller showdown factor is the placement of the counter >> updates. Current front-end based instrumentation applies the >> instrumentation to each front-end lexical construct. It also minimizes the >> number of static instrumentations. Note that it always instruments the >> entry count of the CFG. This may result in higher dynamic instruction >> counts. For example, >> BB0 >> | 100 >> BB1 >> 90 / \ 10 >> BB2 BB3 >> 90 \ / 10 >> BB4 >> Like the the above example, FE based instrumentation will always insert >> count update in BB0. The dynamic instrumentation count will be either 110 >> (Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A better >> instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the dynamic >> instrumentation count is 100. >> >> Our experimental shows that the optimal placement based on edge hotness >> can improve instrumented code performance by about 10%. While it’s hard to >> find the optimal placement of count update, compiler heuristics can be >> used the get the better placement. These heuristics can be based on static >> profile prediction or user annotations (like __buildin_expect) to estimate >> the relative edge hotness and put instrumentations on the less hot edges. >> The initial late instrumentation has not fully implemented this placement >> strategy yet. With that implemented, we expect even better results than >> what is reported here. For real world programs, another major source of the >> slowdown is the data racing and false sharing of the counter update for >> highly threaded programs. Pre-inlining can alleviate this problem as the >> counters in the inline instances are not longer shared. But the complete >> solution to the data racing issue is orthogonal to the problem we try to >> solve here. >> >> >> 2. High Level Design >> >> We propose to perform a pre-profile inline pass before the PGO >> instrumentation pass. Since the instrumentation pass is after inine, it has >> to be done in the middle-end. >> >> (1) The pre-inline pass >> We will invoke a pre-inline pass before the instrumentation. When PGO is >> on, the inlining will be split into two passes: >> * A pre-inline pass that is scheduled before the profile >> instrumentation/annotation >> * A post-inline pass which is the regular inline pass after >> instrumentation/annotation >> By design, all beneficial callsites without requiring profile data should >> be inlined in the pre-inline pass. It includes all callsites that will >> shrink code size after inlining. All the remaining callsites will be left >> to the regular inline pass when profile data is available. >> >> After pre-inline, a CFG based profile instrumentation/annotation will be >> done. A minimum weight spanning tree (MST) in CFG is first computed, then >> only the edges not in the MST will be instrumented. The counter update >> instructions are placed in the basic blocks. >> >> (2) Minimum Spanning Tree (MST) based instrumentation >> A native way of instrumentation is to insert a count update for every >> edge in CFG which will result in too many redundant updates that makes the >> runtime very slow. Knuth [1] proposed a minimum spanning tree based method: >> given a CFG, first compute a spanning tree. All edges that not in the MST >> will be instrumented. In the profile use compilation, the counters are >> populated (from the leaf of the spanning tree) to all the edges. Knuth >> proved this method inserts the minimum number of instrumentations. MST >> based method only guarantees the number static instrumentation are >> minimized, not the dynamic instance of instrumentation. To reduce the >> number of dynamic instrumentation, edges of potentially high counts will be >> put into MST first so that they will have less chance to be instrumented. >> >> >> 3. Experimental Results >> >> 3.1 Measurement of the efficiency of instrumentation >> Other than the runtime of the instrumented binaries, a more direct >> measurement of the instrumentation overhead is the the sum of the raw >> profile count values. Note that regardless what kind of instrumentations >> are used, the raw profile count should be able to reconstruct all the edge >> count values for the whole program. All raw profile value are obtained via >> incrementing the counter variable value by one. The sum of the raw profile >> count value is roughly the dynamic instruction count of the instrumented >> code. The lower of the value, the more efficient of the instrumentation. >> >> >> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K >> eon >> The performance speedup is computed by (FE_instrumentation_runtime / >> ME_instrumentation_runtime - 1) >> >> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon from >> SPEC2000. For C programs, except for one outlier 456.hmmer, there are small >> ups and downs across different programs. Late instrumentation improves >> hmmer a lot, but it is probably due to unrelated loop optimizations (90% of >> runtime spent in one loop nest). >> >> For C++ programs, the performance impact of late instrumentation is very >> large, which is as expected. The following table shows the result. For >> some C++ program, the time speedup is huge. For example, in 483.xalancbmk, >> late instrumentation speeds up performance by ~60%. Among all the SPEC C++ >> programs, only 444.namd is an outlier -- it uses a lot of macros and is a >> very C like program. >> >> Program Speedup >> 471.omnetpp 16.03% >> 473.astar 5.00% >> 483.xalancbmk 58.57% >> 444.namd -0.90% >> 447.dealII 60.47% >> 450.soplex 8.20% >> 453.povray 11.34% >> 252.eon 35.33% >> ------------------------- >> Geomean 21.01% >> >> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs >> We also collect some statistic of the profiles generated by FE based >> instrumentation and late instrumentation, namely, the following information: >> 1. the number of functions that being instrumented, >> 2. the result profile file size, >> 3. the sum of raw count values that was mentioned earlier -- we used >> it to measure the efficiency of the instrumentation. >> Next table shows the ratios of the each metrics by late instrumentation >> for the C++ programs, with FE based instrumentation as the base : column >> (1) shows the ratios of instrumented functions; column (2) shows the ratios >> of the profile file size; column (3) shows the ratios of the sum of raw >> count values. >> >> (1) (2) (3) >> 471.omnetpp 85.36% 110.26% 46.52% >> 473.astar 64.86% 72.72% 63.13% >> 483.xalancbmk 51.83% 56.11% 35.77% >> 444.namd 75.36% 82.82% 85.77% >> 447.dealII 43.42% 46.46% 26.75% >> 450.soplex 71.80% 87.54% 51.19% >> 453.povray 78.68% 83.57% 64.37% >> 252.eon 72.06% 91.22% 30.02% >> ---------------------------------------- >> Geomean 66.50% 76.36% 47.01% >> >> >> For FE based instrumentation, profile count variables generated for the >> dead functions will not be removed (like __llvm_prf_names, __llvm_prf_data, >> and __llvm_prf_cnts) from the data/text segment, nor in the result >> profile. There is a recent patch that removes these unused data for COMDAT >> functions, but that patch won’t touch regular functions. This is the main >> reason for the larger number of instrumented functions and larger profile >> file size for the FE based instrumentation. The reduction of the sum of raw >> count values is mainly due to the elimination of redundant profile updates >> enabled by the pre-inlining. >> >> For C programs, we observe similar improvement in the profile size >> (geomean ratio of 73.75%) and smaller improvement in the number of >> instrumented functions (geomean ratio of 87.49%) and the sum of raw count >> values (geomean of 82.76%). >> >> >> 3.4 LLVM instrumentations runtime performance for Google internal C/C++ >> benchmarks >> >> We also use Google internal benchmarks (mostly typical C++ applications) >> to measure the relative performance b/w FE based instrumentation and late >> instrumentation. The following table shows the speedup of late >> instrumentation vs FE based instrumentation. Note that C++benchmark01 is a >> very large multi-threaded C++ program. Late instrumentation sees 4x >> speedup. Larger than 3x speedups are also seen in many other programs. >> >> C++_bencharmk01 416.98% >> C++_bencharmk02 6.29% >> C++_bencharmk03 22.39% >> C++_bencharmk04 28.05% >> C++_bencharmk05 2.00% >> C++_bencharmk06 675.89% >> C++_bencharmk07 359.19% >> C++_bencharmk08 395.03% >> C_bencharmk09 15.11% >> C_bencharmk10 5.47% >> C++_bencharmk11 5.73% >> C++_bencharmk12 2.31% >> C++_bencharmk13 87.73% >> C++_bencharmk14 7.22% >> C_bencharmk15 -0.51% >> C++_bencharmk16 59.15% >> C++_bencharmk17 358.82% >> C++_bencharmk18 861.36% >> C++_bencharmk19 29.62% >> C++_bencharmk20 11.82% >> C_bencharmk21 0.53% >> C++_bencharmk22 43.10% >> --------------------------- >> Geomean 83.03% >> >> >> 3.5 Statistics of LLVM profiles for Google internal benchmarks >> >> The following shows the profile statics using Google internal benchmarks. >> (1) (2) (3) >> C++_bencharmk01 36.84% 40.29% 2.32% >> C++_bencharmk02 39.20% 40.49% 42.39% >> C++_bencharmk03 39.37% 40.65% 23.24% >> C++_bencharmk04 39.13% 40.68% 17.70% >> C++_bencharmk05 36.58% 38.27% 51.08% >> C++_bencharmk06 29.50% 27.87% 2.87% >> C++_bencharmk07 29.50% 27.87% 1.73% >> C++_bencharmk08 29.50% 27.87% 4.17% >> C_bencharmk09 53.95% 68.00% 11.18% >> C_bencharmk10 53.95% 68.00% 31.74% >> C++_bencharmk11 36.40% 37.07% 46.12% >> C++_bencharmk12 38.44% 41.90% 73.59% >> C++_bencharmk13 39.28% 42.72% 29.56% >> C++_bencharmk14 38.59% 42.20% 13.42% >> C_bencharmk15 57.45% 48.50% 66.99% >> C++_bencharmk16 36.86% 42.18% 16.53% >> C++_bencharmk17 37.82% 39.77% 13.68% >> C++_bencharmk18 37.82% 39.77% 7.96% >> C++_bencharmk19 37.52% 40.46% 1.85% >> C++_bencharmk20 32.37% 30.44% 19.69% >> C_bencharmk21 37.63% 40.42% 88.81% >> C++_bencharmk22 36.28% 36.92% 21.62% >> -------------------------------------------------- >> Geomean 38.22% 39.96% 15.58% >> >> >> 4. Implementation Details: >> >> We need to add new option(s) for the alternative PGO instrumentation pass >> in the middle end. It can in one of the following forms: >> >> (1) Complete new options that are on par with current PGO options: >> -fprofile-late-instr-generate[=<profile_file>]? for PGO Instrumentation, >> and -fprofile-late-instr-use[=<profile_file>]? for PGO USE. >> (2) Or, late instrumentation can be turned on with an additional >> option -fprofile-instr-late with current PGO options. I. e. >> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO >> instrumentation, and -fprofile-instr-late >> -fprofile-instr-use[=<profile_file>]? for PGO use. >> (3) Alternatively to (2), only keep -fprofile-instr-late option in PGO >> instrumentation. Adding a magic tag in profile so that FE based profile and >> late instrumented profile can be automatically detected by profile loader >> In PGO use compilation. This requires a slight profile format change. >> >> In our prototype implementation, two new passes are added in the >> beginning of PassManagerBuilder::populateModulePassManager(), namely >> PreProfileInlinerPass and PGOLateInstrumentationPass. >> >> >> 4.1 Pre-inline pass: >> >> It is controlled by back-end option "-preinline" and >> "-disable-preinline". If the user specifies any llvm option of >> "-fprofile-late-instr-{generate|use}, option "-mllvm -preinline" will be >> automatically inserted in the driver.. To disable the pre-inliner when late >> instrumentation is enabled, use option "-mllvm -disable-preinline". >> >> For now, only minimum tuning is done for the pre-inliner, which simply >> adjusts the inline threshold: If -Oz is specified, the threshold is set to >> 25. Otherwise, it is 75. >> >> The following clean up passes are added to PassManager, right after the >> PreProfileInline pass: >> createEarlyCSEPass() >> createJumpThreadingPass() >> createCorrelatedValuePropagationPass() >> createCFGSimplificationPass() >> createInstructionCombiningPass() >> createGVNPass(DisableGVNLoadPRE) >> createPeepholePASS() >> Some of them might not be necessary. >> >> 4.2 Late Instrumentation Pass: >> The late instrumentation is right after the pre-inline pass and it's >> cleanup passes. It is controlled by opt option "-pgo-late-instr-gen" and >> "-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver will >> provide the profile name. >> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() >> is also added to PassManager to lower the instrumentation intrinsics >> >> PGOLateInstrumeatnion is a module pass that applies the instrumentation >> to each function by class PGOLateInstrumentationFunc. For each function, >> perform the following steps: >> 1. First collect all the CFG edges. Assign an estimated weight to each >> edge. Critical edges and back-edges are assigned to high value of weights. >> One fake node and a few fake edges (from the fake node to the entry node, >> and from all the exit nodes to the fake node) are also added to the >> worklist. >> 2. Construct the MST. The edges with the higher weight will be put to >> MST first, unless it forms a cycle. >> 3. Traverse the CFG and compute the CFG hash using CRC32 of the index >> of each BB. >> The above three steps are the same for profile-generate and profile-use >> compilation. >> >> In the next step, for profile-generation compilation, all the edges that >> not in the MST are instrumented. If this is a critical edge, split the edge >> first. The actual instrumentation is to generate >> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic >> will be lowed by pass createInstrProfilingPass(). >> >> In the next step, for profile-generation compilation, all the edges that >> not in the MST are instrumented. If this is a critical edge, split the edge >> first. The actual instrumentation is to generate >> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic >> will be lowed by pass createInstrProfilingPass(). >> >> For -fprofile-use compilation, first read in the counters and the CFG >> hash from the profile file. If the CFG hash matches, populate the counters >> to all the edges in reverse topological order of the MST. Once having all >> the edge counts, set the branch weights metadata for the IR having multiple >> branches. Also apply the cold/hot function attributes based on function >> level counts. >> >> >> 4.3 Profile Format: >> >> The late instrumentation profile is mostly the same as the one from >> front-end instrument-ion. The difference is >> * Function checksums are different. >> * Function entry counts are no longer available. >> For llvm-profdata utility, options -lateinstr needs to be used to >> differentiate FE based and late instrumentation profiles, unless a magic >> tag is added to the profile. >> >> >> 5. References: >> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points >> for program frequency counts. BIT Numerical Mathematics 1973, Volume 13, >> Issue 3, pp 313-322 >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150807/0104ecd7/attachment-0001.html>
Xinliang David Li via llvm-dev
2015-Aug-08 13:31 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> wrote:> Accidentally sent to uiuc server. > > > On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> wrote: >> >> Can you compare your results with another approach: simply do not >> instrument the top 1% hottest functions (by function entry count)? If this >> simple approach provides most of the benefits (my measurements on one >> codebase I tested show that it would eliminate over 97% of total function >> counts), we may be able to use a simpler approach.AFor large C++ programs, theFor static compiler, this is not possible. It also seem to defeat the purpose of PGO -- hottest functions are those which need profile guidance the most.>> >> The biggest thing I notice about this proposal is that although the focus >> of the proposal is on reducing profiling overhead, it requires using a >> different pass pipeline during *both* the "instr-generate" compilation and >> the "instr-use" compilation. Therefore it is more than simply a reduction in >> profiling overhead -- it can also change performance even ignoring adding >> the profiling information in the IR. I think it would be interesting to test >> the modified pass pipeline even in the absence of profile information to >> better understand its effects in isolation (for example, maybe it would be >> good to add these passes during -O3 regardless of whether we are doing PGO). >>The pipeline change is very PGO specific. I expect it has very little impact on regular compilations: 1) LLVM's bottom up inliner is already iterative. 2) The performance impact (on instrumented build) can be 4x large -- which is unlikely for any nonPGO pipeline change. LLVM already supports running SCC passes iteratively, so experiment like this will be easy to do -- the data can be collected. thanks, David>> -- Sean Silva >> >> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote: >>> >>> Instrumentation based Profile Guided Optimization (PGO) is a compiler >>> technique that leverages important program runtime information, such as >>> precise edge counts and frequent value information, to make frequently >>> executed code run faster. It's proven to be one of the most effective ways >>> to improve program performance. >>> >>> An important design point of PGO is to decide where to place the >>> instrumentation. In current LLVM PGO, the instrumentation is done in the >>> Clang Front-End (will be referred as FE based instrumentation). Doing this >>> early in the front-end gives better coverage information as the there are >>> more precise line number information. The resulted profile also has >>> relatively high tolerance to compiler changes. All the compiler change after >>> the instrumentation point will not lead to mismatched IR that invalidates >>> the profile. >>> >>> On the other hand, doing this too early gives the compiler fewer >>> opportunities to optimize the code before instrumentation. It has >>> significant impact on instrumentation runtime performance. In addition, it >>> tends to produce larger binary and profile data file. Our internal C++ >>> benchmarks shows that FE based instrumentation degrades the performance >>> (compared to non-instrumented version) by 58.3%, and in some extreme cases, >>> the application speed/throughput decreases by 95% (21x runtime slowdown). >>> >>> Running the instrumented binary too slow is not desirable in PGO for the >>> following reasons: >>> * This slows down already lengthy build time. In the worst case, the >>> instrumented binary is so slow that it fails to run a representative >>> workload, because slow execution can leads to more time-outs in many server >>> programs. Other typical issues include: text size too big, failure to link >>> instrumented binaries, memory usage exceeding system limits. >>> * Slow runtime affects the program behavior. Real applications >>> sometimes monitor the program runtime and have different execution path when >>> the program run too slow. This would defeat the underlying assumption of PGO >>> and make it less effective. >>> >>> This work proposes an option to turn on middle-end based instrumentation >>> (new) that aims to speed up instrumentation runtime. The new instrumentation >>> is referred to as ME based instrumentation in this document. Our >>> experimental results show that ME instrumentation can speed up the >>> instrumentation by 80% on average for typical C++ programs. Here are the two >>> main design objectives: >>> * Co-existing with FE Instrumenter: We do not propose to replace the >>> FE based instrumentation because FE based instrumentation has its advantages >>> and applications. User can choose which phase to do instrumentation via >>> command line options. >>> * PGO Runtime Support Sharing: The ME instrumenter will completely >>> re-use the existing PGO’s runtime support. >>> >>> 1. FE Based Instrumentation Runtime Overhead Analysis >>> >>> Instrumented binaries are expected to run slower due to instrumentation >>> code inserted. With FE based instrumentation, the overhead is especially >>> high and runtime slowdown can be unacceptable in many cases. Further >>> analysis shows that there are 3 important factors contributing to FE >>> instrumentation slowdown : >>> * [Main] Redundant counter updates of inlined functions. C++ programs >>> can introduce large abstraction penalties by using lots of small inline >>> functions (assignment operators, getters, setters, ctors/dtors etc). >>> Overhead of instrumenting those small functions can be very large, making >>> training runs too slow and in some cases to usable; >>> * Non-optimal placement of the count updates; >>> * A third factor is related value profiling (to be turned on in the >>> future). Small and hot callee functions taking function pointer (callbacks) >>> can incur overhead due to indirect call target profiling. >>> >>> >>> 1.1 Redundant Counter Update >>> >>> If checking the assembly of the instrumented binary generated by current >>> LLVM implementation, we can find many sequence of consecutive 'incq' >>> instructions that updating difference counters in the same basic block. As >>> an example that extracted from real binary: >>> ... >>> incq 0xa91d80(%rip) # 14df4b8 >>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8> >>> incq 0xa79011(%rip) # 14c6750 >>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm> >>> incq 0xa79442(%rip) # 14c6b88 >>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv> >>> incq 0x9c288b(%rip) # 140ffd8 >>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl> >>> ... >>> >>> From profile use point of view, many of these counter update are >>> redundant. Considering the following example: >>> void bar(){ >>> sum++; >>> } >>> void foo() { >>> bar(); >>> } >>> >>> FE based instrumentation needs to insert counter update for the only BB >>> of the bar(). >>> bar: # @bar >>> # BB#0: # %entry >>> incq .L__llvm_profile_counters_bar(%rip) >>> incl sum(%rip) >>> retq >>> >>> It also need to insert the update the BB in function foo(). After >>> inlining bar to foo(), the code is: >>> foo: # @foo >>> # BB#0: # %entry >>> incq .L__llvm_profile_counters_foo(%rip) >>> incq .L__llvm_profile_counters_bar(%rip) >>> incl sum(%rip) >>> retq >>> >>> If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is >>> redundant -- the counter won't help downstream optimizations. On the other >>> hand, if bar() is a large function and may not be suitable to be inlined for >>> all callsites, this counter updated is necessary in order to produce more >>> accurate profile data for the out-of-line instance of the callee. >>> >>> If foo() is a hot function, the overhead of updating two counters can be >>> significant. This is especially bad for C++ program where there are tons of >>> small inline functions. >>> >>> There is another missing opportunity in FE based instrumentation. The >>> small functions’ control flow can usually be simplified when they are >>> inlined into caller contexts. Once the control flow is simplified, many >>> counter updates can therefore be eliminated. This is only possible for a >>> middle end based late instrumenter. Defining a custom clean-up pass to >>> remove redundant counter update is unrealistic and cannot be done in a sane >>> way without destroying the profile integrity of neither the out-of-line nor >>> inline instances of the callee. >>> >>> A much simpler and cleaner solution is to do a pre-inline pass to inline >>> all the trivial inlines before instrumentation. In addition to removing the >>> unnecessary count updates for the inline instances, another advantage of >>> pre-inline is to provide context sensitive profile for these small inlined >>> functions. This context senstive profile can further improve the PGO based >>> optimizations. Here is a contrived example: >>> void bar (int n) { >>> if (n&1) >>> do_sth1(); >>> else >>> do_sth2(); >>> } >>> >>> void caller() { >>> int s = 1; >>> for (; s<100; s+=2) >>> bar(s); >>> >>> for (s = 102; s< 200; s+=2) >>> bar(s); >>> } >>> >>> The direction of the branch inside bar will be totally opposite between >>> two different callsites in ‘caller’. Without pre-inlining, the branch >>> probability will be 50-50 which will be useless for later optimizations. >>> With pre-inlining, the profile will have the perfect branch count for each >>> callsite. The positive performance impact of context sensitive profiling due >>> to pre-inlining has been observed in real world large C++ programs. >>> Supporting context sensitive profiling is another way to solve this, but it >>> will introduce large additional runtime/memory overhead. >>> >>> >>> 1.2 Non-optimal placement of count update >>> >>> Another much smaller showdown factor is the placement of the counter >>> updates. Current front-end based instrumentation applies the instrumentation >>> to each front-end lexical construct. It also minimizes the number of static >>> instrumentations. Note that it always instruments the entry count of the >>> CFG. This may result in higher dynamic instruction counts. For example, >>> BB0 >>> | 100 >>> BB1 >>> 90 / \ 10 >>> BB2 BB3 >>> 90 \ / 10 >>> BB4 >>> Like the the above example, FE based instrumentation will always insert >>> count update in BB0. The dynamic instrumentation count will be either 110 >>> (Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A better >>> instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the dynamic >>> instrumentation count is 100. >>> >>> Our experimental shows that the optimal placement based on edge hotness >>> can improve instrumented code performance by about 10%. While it’s hard to >>> find the optimal placement of count update, compiler heuristics can be used >>> the get the better placement. These heuristics can be based on static >>> profile prediction or user annotations (like __buildin_expect) to estimate >>> the relative edge hotness and put instrumentations on the less hot edges. >>> The initial late instrumentation has not fully implemented this placement >>> strategy yet. With that implemented, we expect even better results than >>> what is reported here. For real world programs, another major source of the >>> slowdown is the data racing and false sharing of the counter update for >>> highly threaded programs. Pre-inlining can alleviate this problem as the >>> counters in the inline instances are not longer shared. But the complete >>> solution to the data racing issue is orthogonal to the problem we try to >>> solve here. >>> >>> >>> 2. High Level Design >>> >>> We propose to perform a pre-profile inline pass before the PGO >>> instrumentation pass. Since the instrumentation pass is after inine, it has >>> to be done in the middle-end. >>> >>> (1) The pre-inline pass >>> We will invoke a pre-inline pass before the instrumentation. When PGO is >>> on, the inlining will be split into two passes: >>> * A pre-inline pass that is scheduled before the profile >>> instrumentation/annotation >>> * A post-inline pass which is the regular inline pass after >>> instrumentation/annotation >>> By design, all beneficial callsites without requiring profile data should >>> be inlined in the pre-inline pass. It includes all callsites that will >>> shrink code size after inlining. All the remaining callsites will be left to >>> the regular inline pass when profile data is available. >>> >>> After pre-inline, a CFG based profile instrumentation/annotation will be >>> done. A minimum weight spanning tree (MST) in CFG is first computed, then >>> only the edges not in the MST will be instrumented. The counter update >>> instructions are placed in the basic blocks. >>> >>> (2) Minimum Spanning Tree (MST) based instrumentation >>> A native way of instrumentation is to insert a count update for every >>> edge in CFG which will result in too many redundant updates that makes the >>> runtime very slow. Knuth [1] proposed a minimum spanning tree based method: >>> given a CFG, first compute a spanning tree. All edges that not in the MST >>> will be instrumented. In the profile use compilation, the counters are >>> populated (from the leaf of the spanning tree) to all the edges. Knuth >>> proved this method inserts the minimum number of instrumentations. MST based >>> method only guarantees the number static instrumentation are minimized, not >>> the dynamic instance of instrumentation. To reduce the number of dynamic >>> instrumentation, edges of potentially high counts will be put into MST first >>> so that they will have less chance to be instrumented. >>> >>> >>> 3. Experimental Results >>> >>> 3.1 Measurement of the efficiency of instrumentation >>> Other than the runtime of the instrumented binaries, a more direct >>> measurement of the instrumentation overhead is the the sum of the raw >>> profile count values. Note that regardless what kind of instrumentations are >>> used, the raw profile count should be able to reconstruct all the edge count >>> values for the whole program. All raw profile value are obtained via >>> incrementing the counter variable value by one. The sum of the raw profile >>> count value is roughly the dynamic instruction count of the instrumented >>> code. The lower of the value, the more efficient of the instrumentation. >>> >>> >>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K >>> eon >>> The performance speedup is computed by (FE_instrumentation_runtime / >>> ME_instrumentation_runtime - 1) >>> >>> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon from >>> SPEC2000. For C programs, except for one outlier 456.hmmer, there are small >>> ups and downs across different programs. Late instrumentation improves hmmer >>> a lot, but it is probably due to unrelated loop optimizations (90% of >>> runtime spent in one loop nest). >>> >>> For C++ programs, the performance impact of late instrumentation is very >>> large, which is as expected. The following table shows the result. For some >>> C++ program, the time speedup is huge. For example, in 483.xalancbmk, late >>> instrumentation speeds up performance by ~60%. Among all the SPEC C++ >>> programs, only 444.namd is an outlier -- it uses a lot of macros and is a >>> very C like program. >>> >>> Program Speedup >>> 471.omnetpp 16.03% >>> 473.astar 5.00% >>> 483.xalancbmk 58.57% >>> 444.namd -0.90% >>> 447.dealII 60.47% >>> 450.soplex 8.20% >>> 453.povray 11.34% >>> 252.eon 35.33% >>> ------------------------- >>> Geomean 21.01% >>> >>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs >>> We also collect some statistic of the profiles generated by FE based >>> instrumentation and late instrumentation, namely, the following information: >>> 1. the number of functions that being instrumented, >>> 2. the result profile file size, >>> 3. the sum of raw count values that was mentioned earlier -- we used >>> it to measure the efficiency of the instrumentation. >>> Next table shows the ratios of the each metrics by late instrumentation >>> for the C++ programs, with FE based instrumentation as the base : column (1) >>> shows the ratios of instrumented functions; column (2) shows the ratios of >>> the profile file size; column (3) shows the ratios of the sum of raw count >>> values. >>> >>> (1) (2) (3) >>> 471.omnetpp 85.36% 110.26% 46.52% >>> 473.astar 64.86% 72.72% 63.13% >>> 483.xalancbmk 51.83% 56.11% 35.77% >>> 444.namd 75.36% 82.82% 85.77% >>> 447.dealII 43.42% 46.46% 26.75% >>> 450.soplex 71.80% 87.54% 51.19% >>> 453.povray 78.68% 83.57% 64.37% >>> 252.eon 72.06% 91.22% 30.02% >>> ---------------------------------------- >>> Geomean 66.50% 76.36% 47.01% >>> >>> >>> For FE based instrumentation, profile count variables generated for the >>> dead functions will not be removed (like __llvm_prf_names, __llvm_prf_data, >>> and __llvm_prf_cnts) from the data/text segment, nor in the result profile. >>> There is a recent patch that removes these unused data for COMDAT functions, >>> but that patch won’t touch regular functions. This is the main reason for >>> the larger number of instrumented functions and larger profile file size for >>> the FE based instrumentation. The reduction of the sum of raw count values >>> is mainly due to the elimination of redundant profile updates enabled by the >>> pre-inlining. >>> >>> For C programs, we observe similar improvement in the profile size >>> (geomean ratio of 73.75%) and smaller improvement in the number of >>> instrumented functions (geomean ratio of 87.49%) and the sum of raw count >>> values (geomean of 82.76%). >>> >>> >>> 3.4 LLVM instrumentations runtime performance for Google internal C/C++ >>> benchmarks >>> >>> We also use Google internal benchmarks (mostly typical C++ applications) >>> to measure the relative performance b/w FE based instrumentation and late >>> instrumentation. The following table shows the speedup of late >>> instrumentation vs FE based instrumentation. Note that C++benchmark01 is a >>> very large multi-threaded C++ program. Late instrumentation sees 4x speedup. >>> Larger than 3x speedups are also seen in many other programs. >>> >>> C++_bencharmk01 416.98% >>> C++_bencharmk02 6.29% >>> C++_bencharmk03 22.39% >>> C++_bencharmk04 28.05% >>> C++_bencharmk05 2.00% >>> C++_bencharmk06 675.89% >>> C++_bencharmk07 359.19% >>> C++_bencharmk08 395.03% >>> C_bencharmk09 15.11% >>> C_bencharmk10 5.47% >>> C++_bencharmk11 5.73% >>> C++_bencharmk12 2.31% >>> C++_bencharmk13 87.73% >>> C++_bencharmk14 7.22% >>> C_bencharmk15 -0.51% >>> C++_bencharmk16 59.15% >>> C++_bencharmk17 358.82% >>> C++_bencharmk18 861.36% >>> C++_bencharmk19 29.62% >>> C++_bencharmk20 11.82% >>> C_bencharmk21 0.53% >>> C++_bencharmk22 43.10% >>> --------------------------- >>> Geomean 83.03% >>> >>> >>> 3.5 Statistics of LLVM profiles for Google internal benchmarks >>> >>> The following shows the profile statics using Google internal benchmarks. >>> (1) (2) (3) >>> C++_bencharmk01 36.84% 40.29% 2.32% >>> C++_bencharmk02 39.20% 40.49% 42.39% >>> C++_bencharmk03 39.37% 40.65% 23.24% >>> C++_bencharmk04 39.13% 40.68% 17.70% >>> C++_bencharmk05 36.58% 38.27% 51.08% >>> C++_bencharmk06 29.50% 27.87% 2.87% >>> C++_bencharmk07 29.50% 27.87% 1.73% >>> C++_bencharmk08 29.50% 27.87% 4.17% >>> C_bencharmk09 53.95% 68.00% 11.18% >>> C_bencharmk10 53.95% 68.00% 31.74% >>> C++_bencharmk11 36.40% 37.07% 46.12% >>> C++_bencharmk12 38.44% 41.90% 73.59% >>> C++_bencharmk13 39.28% 42.72% 29.56% >>> C++_bencharmk14 38.59% 42.20% 13.42% >>> C_bencharmk15 57.45% 48.50% 66.99% >>> C++_bencharmk16 36.86% 42.18% 16.53% >>> C++_bencharmk17 37.82% 39.77% 13.68% >>> C++_bencharmk18 37.82% 39.77% 7.96% >>> C++_bencharmk19 37.52% 40.46% 1.85% >>> C++_bencharmk20 32.37% 30.44% 19.69% >>> C_bencharmk21 37.63% 40.42% 88.81% >>> C++_bencharmk22 36.28% 36.92% 21.62% >>> -------------------------------------------------- >>> Geomean 38.22% 39.96% 15.58% >>> >>> >>> 4. Implementation Details: >>> >>> We need to add new option(s) for the alternative PGO instrumentation pass >>> in the middle end. It can in one of the following forms: >>> >>> (1) Complete new options that are on par with current PGO options: >>> -fprofile-late-instr-generate[=<profile_file>]? for PGO Instrumentation, and >>> -fprofile-late-instr-use[=<profile_file>]? for PGO USE. >>> (2) Or, late instrumentation can be turned on with an additional >>> option -fprofile-instr-late with current PGO options. I. e. >>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO >>> instrumentation, and -fprofile-instr-late >>> -fprofile-instr-use[=<profile_file>]? for PGO use. >>> (3) Alternatively to (2), only keep -fprofile-instr-late option in PGO >>> instrumentation. Adding a magic tag in profile so that FE based profile and >>> late instrumented profile can be automatically detected by profile loader In >>> PGO use compilation. This requires a slight profile format change. >>> >>> In our prototype implementation, two new passes are added in the >>> beginning of PassManagerBuilder::populateModulePassManager(), namely >>> PreProfileInlinerPass and PGOLateInstrumentationPass. >>> >>> >>> 4.1 Pre-inline pass: >>> >>> It is controlled by back-end option "-preinline" and >>> "-disable-preinline". If the user specifies any llvm option of >>> "-fprofile-late-instr-{generate|use}, option "-mllvm -preinline" will be >>> automatically inserted in the driver.. To disable the pre-inliner when late >>> instrumentation is enabled, use option "-mllvm -disable-preinline". >>> >>> For now, only minimum tuning is done for the pre-inliner, which simply >>> adjusts the inline threshold: If -Oz is specified, the threshold is set to >>> 25. Otherwise, it is 75. >>> >>> The following clean up passes are added to PassManager, right after the >>> PreProfileInline pass: >>> createEarlyCSEPass() >>> createJumpThreadingPass() >>> createCorrelatedValuePropagationPass() >>> createCFGSimplificationPass() >>> createInstructionCombiningPass() >>> createGVNPass(DisableGVNLoadPRE) >>> createPeepholePASS() >>> Some of them might not be necessary. >>> >>> 4.2 Late Instrumentation Pass: >>> The late instrumentation is right after the pre-inline pass and it's >>> cleanup passes. It is controlled by opt option "-pgo-late-instr-gen" and >>> "-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver will >>> provide the profile name. >>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() >>> is also added to PassManager to lower the instrumentation intrinsics >>> >>> PGOLateInstrumeatnion is a module pass that applies the instrumentation >>> to each function by class PGOLateInstrumentationFunc. For each function, >>> perform the following steps: >>> 1. First collect all the CFG edges. Assign an estimated weight to each >>> edge. Critical edges and back-edges are assigned to high value of weights. >>> One fake node and a few fake edges (from the fake node to the entry node, >>> and from all the exit nodes to the fake node) are also added to the >>> worklist. >>> 2. Construct the MST. The edges with the higher weight will be put to >>> MST first, unless it forms a cycle. >>> 3. Traverse the CFG and compute the CFG hash using CRC32 of the index >>> of each BB. >>> The above three steps are the same for profile-generate and profile-use >>> compilation. >>> >>> In the next step, for profile-generation compilation, all the edges that >>> not in the MST are instrumented. If this is a critical edge, split the edge >>> first. The actual instrumentation is to generate >>> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic will >>> be lowed by pass createInstrProfilingPass(). >>> >>> In the next step, for profile-generation compilation, all the edges that >>> not in the MST are instrumented. If this is a critical edge, split the edge >>> first. The actual instrumentation is to generate >>> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic will >>> be lowed by pass createInstrProfilingPass(). >>> >>> For -fprofile-use compilation, first read in the counters and the CFG >>> hash from the profile file. If the CFG hash matches, populate the counters >>> to all the edges in reverse topological order of the MST. Once having all >>> the edge counts, set the branch weights metadata for the IR having multiple >>> branches. Also apply the cold/hot function attributes based on function >>> level counts. >>> >>> >>> 4.3 Profile Format: >>> >>> The late instrumentation profile is mostly the same as the one from >>> front-end instrument-ion. The difference is >>> * Function checksums are different. >>> * Function entry counts are no longer available. >>> For llvm-profdata utility, options -lateinstr needs to be used to >>> differentiate FE based and late instrumentation profiles, unless a magic tag >>> is added to the profile. >>> >>> >>> 5. References: >>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points >>> for program frequency counts. BIT Numerical Mathematics 1973, Volume 13, >>> Issue 3, pp 313-322 >>> >> >
Sean Silva via llvm-dev
2015-Aug-10 01:23 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Sat, Aug 8, 2015 at 6:31 AM, Xinliang David Li <davidxl at google.com> wrote:> On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> wrote: > > Accidentally sent to uiuc server. > > > > > > On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> > wrote: > >> > >> Can you compare your results with another approach: simply do not > >> instrument the top 1% hottest functions (by function entry count)? If > this > >> simple approach provides most of the benefits (my measurements on one > >> codebase I tested show that it would eliminate over 97% of total > function > >> counts), we may be able to use a simpler approach.AFor large C++ > programs, the > > For static compiler, this is not possible. It also seem to defeat the > purpose of PGO -- hottest functions are those which need profile > guidance the most. >In the program I looked at the top 1% were just trivial getters and constructors or similar. We should already be "getting these right". Stuff like: class Foo { .... Foo(int bar) : m_bar(bar) {} int getBar() { return m_bar; } ... }; Are the results different for your codebases? Have you tried something like simply not instrumenting the hottest 1% or 0.5% of functions? (maybe restrict the instrumentation skipping to functions of just a single BB with less than, say, 10 instructions). Rong's approach is quite sophisticated; I'm just interested on getting a sanity check against a "naive" approach to see how much the sophisticated approach is buying us.> > >> > >> The biggest thing I notice about this proposal is that although the > focus > >> of the proposal is on reducing profiling overhead, it requires using a > >> different pass pipeline during *both* the "instr-generate" compilation > and > >> the "instr-use" compilation. Therefore it is more than simply a > reduction in > >> profiling overhead -- it can also change performance even ignoring > adding > >> the profiling information in the IR. I think it would be interesting to > test > >> the modified pass pipeline even in the absence of profile information to > >> better understand its effects in isolation (for example, maybe it would > be > >> good to add these passes during -O3 regardless of whether we are doing > PGO). > >> > > The pipeline change is very PGO specific. I expect it has very little > impact on regular compilations: > 1) LLVM's bottom up inliner is already iterative. > 2) The performance impact (on instrumented build) can be 4x large -- > which is unlikely for any nonPGO pipeline change. >With respect to adding extra passes, I'm actually more concerned about the non-instrumented build, for which Rong did not show any data. For example, will users find their program is X% faster than no PGO with ("ME") PGO, but really (X/2)% of that is due simply to the extra passes, and not any profile guidance? We should then prefer to use the extra passes during regular -O3 builds. Conversely, if they find that their program is X% faster with ("ME") PGO, but the extra passes are making the program (X/2)% slower, then the users could be getting (3X/2)% faster instead. I am only concerned about having two variables change simultaneously; I think that instrumenting after some amount of cleanup has been done makes a lot of sense. Could Rong's proposal be made to work within the existing pipeline, but doing the instrumentation after a subset of the existing pass pipeline has been run? -- Sean Silva> > LLVM already supports running SCC passes iteratively, so experiment > like this will be easy to do -- the data can be collected. > > thanks, > > David > > >> -- Sean Silva > >> > >> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote: > >>> > >>> Instrumentation based Profile Guided Optimization (PGO) is a compiler > >>> technique that leverages important program runtime information, such as > >>> precise edge counts and frequent value information, to make frequently > >>> executed code run faster. It's proven to be one of the most effective > ways > >>> to improve program performance. > >>> > >>> An important design point of PGO is to decide where to place the > >>> instrumentation. In current LLVM PGO, the instrumentation is done in > the > >>> Clang Front-End (will be referred as FE based instrumentation). Doing > this > >>> early in the front-end gives better coverage information as the there > are > >>> more precise line number information. The resulted profile also has > >>> relatively high tolerance to compiler changes. All the compiler change > after > >>> the instrumentation point will not lead to mismatched IR that > invalidates > >>> the profile. > >>> > >>> On the other hand, doing this too early gives the compiler fewer > >>> opportunities to optimize the code before instrumentation. It has > >>> significant impact on instrumentation runtime performance. In > addition, it > >>> tends to produce larger binary and profile data file. Our internal > C++ > >>> benchmarks shows that FE based instrumentation degrades the performance > >>> (compared to non-instrumented version) by 58.3%, and in some extreme > cases, > >>> the application speed/throughput decreases by 95% (21x runtime > slowdown). > >>> > >>> Running the instrumented binary too slow is not desirable in PGO for > the > >>> following reasons: > >>> * This slows down already lengthy build time. In the worst case, the > >>> instrumented binary is so slow that it fails to run a representative > >>> workload, because slow execution can leads to more time-outs in many > server > >>> programs. Other typical issues include: text size too big, failure to > link > >>> instrumented binaries, memory usage exceeding system limits. > >>> * Slow runtime affects the program behavior. Real applications > >>> sometimes monitor the program runtime and have different execution > path when > >>> the program run too slow. This would defeat the underlying assumption > of PGO > >>> and make it less effective. > >>> > >>> This work proposes an option to turn on middle-end based > instrumentation > >>> (new) that aims to speed up instrumentation runtime. The new > instrumentation > >>> is referred to as ME based instrumentation in this document. Our > >>> experimental results show that ME instrumentation can speed up the > >>> instrumentation by 80% on average for typical C++ programs. Here are > the two > >>> main design objectives: > >>> * Co-existing with FE Instrumenter: We do not propose to replace the > >>> FE based instrumentation because FE based instrumentation has its > advantages > >>> and applications. User can choose which phase to do instrumentation via > >>> command line options. > >>> * PGO Runtime Support Sharing: The ME instrumenter will completely > >>> re-use the existing PGO’s runtime support. > >>> > >>> 1. FE Based Instrumentation Runtime Overhead Analysis > >>> > >>> Instrumented binaries are expected to run slower due to instrumentation > >>> code inserted. With FE based instrumentation, the overhead is > especially > >>> high and runtime slowdown can be unacceptable in many cases. Further > >>> analysis shows that there are 3 important factors contributing to FE > >>> instrumentation slowdown : > >>> * [Main] Redundant counter updates of inlined functions. C++ > programs > >>> can introduce large abstraction penalties by using lots of small inline > >>> functions (assignment operators, getters, setters, ctors/dtors etc). > >>> Overhead of instrumenting those small functions can be very large, > making > >>> training runs too slow and in some cases to usable; > >>> * Non-optimal placement of the count updates; > >>> * A third factor is related value profiling (to be turned on in the > >>> future). Small and hot callee functions taking function pointer > (callbacks) > >>> can incur overhead due to indirect call target profiling. > >>> > >>> > >>> 1.1 Redundant Counter Update > >>> > >>> If checking the assembly of the instrumented binary generated by > current > >>> LLVM implementation, we can find many sequence of consecutive 'incq' > >>> instructions that updating difference counters in the same basic > block. As > >>> an example that extracted from real binary: > >>> ... > >>> incq 0xa91d80(%rip) # 14df4b8 > >>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8> > >>> incq 0xa79011(%rip) # 14c6750 > >>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm> > >>> incq 0xa79442(%rip) # 14c6b88 > >>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv> > >>> incq 0x9c288b(%rip) # 140ffd8 > >>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl> > >>> ... > >>> > >>> From profile use point of view, many of these counter update are > >>> redundant. Considering the following example: > >>> void bar(){ > >>> sum++; > >>> } > >>> void foo() { > >>> bar(); > >>> } > >>> > >>> FE based instrumentation needs to insert counter update for the only BB > >>> of the bar(). > >>> bar: # @bar > >>> # BB#0: # %entry > >>> incq .L__llvm_profile_counters_bar(%rip) > >>> incl sum(%rip) > >>> retq > >>> > >>> It also need to insert the update the BB in function foo(). After > >>> inlining bar to foo(), the code is: > >>> foo: # @foo > >>> # BB#0: # %entry > >>> incq .L__llvm_profile_counters_foo(%rip) > >>> incq .L__llvm_profile_counters_bar(%rip) > >>> incl sum(%rip) > >>> retq > >>> > >>> If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) > is > >>> redundant -- the counter won't help downstream optimizations. On the > other > >>> hand, if bar() is a large function and may not be suitable to be > inlined for > >>> all callsites, this counter updated is necessary in order to produce > more > >>> accurate profile data for the out-of-line instance of the callee. > >>> > >>> If foo() is a hot function, the overhead of updating two counters can > be > >>> significant. This is especially bad for C++ program where there are > tons of > >>> small inline functions. > >>> > >>> There is another missing opportunity in FE based instrumentation. The > >>> small functions’ control flow can usually be simplified when they are > >>> inlined into caller contexts. Once the control flow is simplified, many > >>> counter updates can therefore be eliminated. This is only possible for > a > >>> middle end based late instrumenter. Defining a custom clean-up pass to > >>> remove redundant counter update is unrealistic and cannot be done in a > sane > >>> way without destroying the profile integrity of neither the > out-of-line nor > >>> inline instances of the callee. > >>> > >>> A much simpler and cleaner solution is to do a pre-inline pass to > inline > >>> all the trivial inlines before instrumentation. In addition to > removing the > >>> unnecessary count updates for the inline instances, another advantage > of > >>> pre-inline is to provide context sensitive profile for these small > inlined > >>> functions. This context senstive profile can further improve the PGO > based > >>> optimizations. Here is a contrived example: > >>> void bar (int n) { > >>> if (n&1) > >>> do_sth1(); > >>> else > >>> do_sth2(); > >>> } > >>> > >>> void caller() { > >>> int s = 1; > >>> for (; s<100; s+=2) > >>> bar(s); > >>> > >>> for (s = 102; s< 200; s+=2) > >>> bar(s); > >>> } > >>> > >>> The direction of the branch inside bar will be totally opposite between > >>> two different callsites in ‘caller’. Without pre-inlining, the branch > >>> probability will be 50-50 which will be useless for later > optimizations. > >>> With pre-inlining, the profile will have the perfect branch count for > each > >>> callsite. The positive performance impact of context sensitive > profiling due > >>> to pre-inlining has been observed in real world large C++ programs. > >>> Supporting context sensitive profiling is another way to solve this, > but it > >>> will introduce large additional runtime/memory overhead. > >>> > >>> > >>> 1.2 Non-optimal placement of count update > >>> > >>> Another much smaller showdown factor is the placement of the counter > >>> updates. Current front-end based instrumentation applies the > instrumentation > >>> to each front-end lexical construct. It also minimizes the number of > static > >>> instrumentations. Note that it always instruments the entry count of > the > >>> CFG. This may result in higher dynamic instruction counts. For example, > >>> BB0 > >>> | 100 > >>> BB1 > >>> 90 / \ 10 > >>> BB2 BB3 > >>> 90 \ / 10 > >>> BB4 > >>> Like the the above example, FE based instrumentation will always insert > >>> count update in BB0. The dynamic instrumentation count will be either > 110 > >>> (Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A > better > >>> instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the > dynamic > >>> instrumentation count is 100. > >>> > >>> Our experimental shows that the optimal placement based on edge hotness > >>> can improve instrumented code performance by about 10%. While it’s > hard to > >>> find the optimal placement of count update, compiler heuristics can > be used > >>> the get the better placement. These heuristics can be based on static > >>> profile prediction or user annotations (like __buildin_expect) to > estimate > >>> the relative edge hotness and put instrumentations on the less hot > edges. > >>> The initial late instrumentation has not fully implemented this > placement > >>> strategy yet. With that implemented, we expect even better results > than > >>> what is reported here. For real world programs, another major source > of the > >>> slowdown is the data racing and false sharing of the counter update for > >>> highly threaded programs. Pre-inlining can alleviate this problem as > the > >>> counters in the inline instances are not longer shared. But the > complete > >>> solution to the data racing issue is orthogonal to the problem we try > to > >>> solve here. > >>> > >>> > >>> 2. High Level Design > >>> > >>> We propose to perform a pre-profile inline pass before the PGO > >>> instrumentation pass. Since the instrumentation pass is after inine, > it has > >>> to be done in the middle-end. > >>> > >>> (1) The pre-inline pass > >>> We will invoke a pre-inline pass before the instrumentation. When PGO > is > >>> on, the inlining will be split into two passes: > >>> * A pre-inline pass that is scheduled before the profile > >>> instrumentation/annotation > >>> * A post-inline pass which is the regular inline pass after > >>> instrumentation/annotation > >>> By design, all beneficial callsites without requiring profile data > should > >>> be inlined in the pre-inline pass. It includes all callsites that will > >>> shrink code size after inlining. All the remaining callsites will be > left to > >>> the regular inline pass when profile data is available. > >>> > >>> After pre-inline, a CFG based profile instrumentation/annotation will > be > >>> done. A minimum weight spanning tree (MST) in CFG is first computed, > then > >>> only the edges not in the MST will be instrumented. The counter update > >>> instructions are placed in the basic blocks. > >>> > >>> (2) Minimum Spanning Tree (MST) based instrumentation > >>> A native way of instrumentation is to insert a count update for every > >>> edge in CFG which will result in too many redundant updates that > makes the > >>> runtime very slow. Knuth [1] proposed a minimum spanning tree based > method: > >>> given a CFG, first compute a spanning tree. All edges that not in the > MST > >>> will be instrumented. In the profile use compilation, the counters are > >>> populated (from the leaf of the spanning tree) to all the edges. Knuth > >>> proved this method inserts the minimum number of instrumentations. MST > based > >>> method only guarantees the number static instrumentation are > minimized, not > >>> the dynamic instance of instrumentation. To reduce the number of > dynamic > >>> instrumentation, edges of potentially high counts will be put into MST > first > >>> so that they will have less chance to be instrumented. > >>> > >>> > >>> 3. Experimental Results > >>> > >>> 3.1 Measurement of the efficiency of instrumentation > >>> Other than the runtime of the instrumented binaries, a more direct > >>> measurement of the instrumentation overhead is the the sum of the raw > >>> profile count values. Note that regardless what kind of > instrumentations are > >>> used, the raw profile count should be able to reconstruct all the edge > count > >>> values for the whole program. All raw profile value are obtained via > >>> incrementing the counter variable value by one. The sum of the raw > profile > >>> count value is roughly the dynamic instruction count of the > instrumented > >>> code. The lower of the value, the more efficient of the > instrumentation. > >>> > >>> > >>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and > SPEC2K > >>> eon > >>> The performance speedup is computed by (FE_instrumentation_runtime / > >>> ME_instrumentation_runtime - 1) > >>> > >>> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon > from > >>> SPEC2000. For C programs, except for one outlier 456.hmmer, there are > small > >>> ups and downs across different programs. Late instrumentation improves > hmmer > >>> a lot, but it is probably due to unrelated loop optimizations (90% of > >>> runtime spent in one loop nest). > >>> > >>> For C++ programs, the performance impact of late instrumentation is > very > >>> large, which is as expected. The following table shows the result. > For some > >>> C++ program, the time speedup is huge. For example, in 483.xalancbmk, > late > >>> instrumentation speeds up performance by ~60%. Among all the SPEC C++ > >>> programs, only 444.namd is an outlier -- it uses a lot of macros and > is a > >>> very C like program. > >>> > >>> Program Speedup > >>> 471.omnetpp 16.03% > >>> 473.astar 5.00% > >>> 483.xalancbmk 58.57% > >>> 444.namd -0.90% > >>> 447.dealII 60.47% > >>> 450.soplex 8.20% > >>> 453.povray 11.34% > >>> 252.eon 35.33% > >>> ------------------------- > >>> Geomean 21.01% > >>> > >>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs > >>> We also collect some statistic of the profiles generated by FE based > >>> instrumentation and late instrumentation, namely, the following > information: > >>> 1. the number of functions that being instrumented, > >>> 2. the result profile file size, > >>> 3. the sum of raw count values that was mentioned earlier -- we used > >>> it to measure the efficiency of the instrumentation. > >>> Next table shows the ratios of the each metrics by late instrumentation > >>> for the C++ programs, with FE based instrumentation as the base : > column (1) > >>> shows the ratios of instrumented functions; column (2) shows the > ratios of > >>> the profile file size; column (3) shows the ratios of the sum of raw > count > >>> values. > >>> > >>> (1) (2) (3) > >>> 471.omnetpp 85.36% 110.26% 46.52% > >>> 473.astar 64.86% 72.72% 63.13% > >>> 483.xalancbmk 51.83% 56.11% 35.77% > >>> 444.namd 75.36% 82.82% 85.77% > >>> 447.dealII 43.42% 46.46% 26.75% > >>> 450.soplex 71.80% 87.54% 51.19% > >>> 453.povray 78.68% 83.57% 64.37% > >>> 252.eon 72.06% 91.22% 30.02% > >>> ---------------------------------------- > >>> Geomean 66.50% 76.36% 47.01% > >>> > >>> > >>> For FE based instrumentation, profile count variables generated for the > >>> dead functions will not be removed (like __llvm_prf_names, > __llvm_prf_data, > >>> and __llvm_prf_cnts) from the data/text segment, nor in the result > profile. > >>> There is a recent patch that removes these unused data for COMDAT > functions, > >>> but that patch won’t touch regular functions. This is the main reason > for > >>> the larger number of instrumented functions and larger profile file > size for > >>> the FE based instrumentation. The reduction of the sum of raw count > values > >>> is mainly due to the elimination of redundant profile updates enabled > by the > >>> pre-inlining. > >>> > >>> For C programs, we observe similar improvement in the profile size > >>> (geomean ratio of 73.75%) and smaller improvement in the number of > >>> instrumented functions (geomean ratio of 87.49%) and the sum of raw > count > >>> values (geomean of 82.76%). > >>> > >>> > >>> 3.4 LLVM instrumentations runtime performance for Google internal C/C++ > >>> benchmarks > >>> > >>> We also use Google internal benchmarks (mostly typical C++ > applications) > >>> to measure the relative performance b/w FE based instrumentation and > late > >>> instrumentation. The following table shows the speedup of late > >>> instrumentation vs FE based instrumentation. Note that C++benchmark01 > is a > >>> very large multi-threaded C++ program. Late instrumentation sees 4x > speedup. > >>> Larger than 3x speedups are also seen in many other programs. > >>> > >>> C++_bencharmk01 416.98% > >>> C++_bencharmk02 6.29% > >>> C++_bencharmk03 22.39% > >>> C++_bencharmk04 28.05% > >>> C++_bencharmk05 2.00% > >>> C++_bencharmk06 675.89% > >>> C++_bencharmk07 359.19% > >>> C++_bencharmk08 395.03% > >>> C_bencharmk09 15.11% > >>> C_bencharmk10 5.47% > >>> C++_bencharmk11 5.73% > >>> C++_bencharmk12 2.31% > >>> C++_bencharmk13 87.73% > >>> C++_bencharmk14 7.22% > >>> C_bencharmk15 -0.51% > >>> C++_bencharmk16 59.15% > >>> C++_bencharmk17 358.82% > >>> C++_bencharmk18 861.36% > >>> C++_bencharmk19 29.62% > >>> C++_bencharmk20 11.82% > >>> C_bencharmk21 0.53% > >>> C++_bencharmk22 43.10% > >>> --------------------------- > >>> Geomean 83.03% > >>> > >>> > >>> 3.5 Statistics of LLVM profiles for Google internal benchmarks > >>> > >>> The following shows the profile statics using Google internal > benchmarks. > >>> (1) (2) (3) > >>> C++_bencharmk01 36.84% 40.29% 2.32% > >>> C++_bencharmk02 39.20% 40.49% 42.39% > >>> C++_bencharmk03 39.37% 40.65% 23.24% > >>> C++_bencharmk04 39.13% 40.68% 17.70% > >>> C++_bencharmk05 36.58% 38.27% 51.08% > >>> C++_bencharmk06 29.50% 27.87% 2.87% > >>> C++_bencharmk07 29.50% 27.87% 1.73% > >>> C++_bencharmk08 29.50% 27.87% 4.17% > >>> C_bencharmk09 53.95% 68.00% 11.18% > >>> C_bencharmk10 53.95% 68.00% 31.74% > >>> C++_bencharmk11 36.40% 37.07% 46.12% > >>> C++_bencharmk12 38.44% 41.90% 73.59% > >>> C++_bencharmk13 39.28% 42.72% 29.56% > >>> C++_bencharmk14 38.59% 42.20% 13.42% > >>> C_bencharmk15 57.45% 48.50% 66.99% > >>> C++_bencharmk16 36.86% 42.18% 16.53% > >>> C++_bencharmk17 37.82% 39.77% 13.68% > >>> C++_bencharmk18 37.82% 39.77% 7.96% > >>> C++_bencharmk19 37.52% 40.46% 1.85% > >>> C++_bencharmk20 32.37% 30.44% 19.69% > >>> C_bencharmk21 37.63% 40.42% 88.81% > >>> C++_bencharmk22 36.28% 36.92% 21.62% > >>> -------------------------------------------------- > >>> Geomean 38.22% 39.96% 15.58% > >>> > >>> > >>> 4. Implementation Details: > >>> > >>> We need to add new option(s) for the alternative PGO instrumentation > pass > >>> in the middle end. It can in one of the following forms: > >>> > >>> (1) Complete new options that are on par with current PGO options: > >>> -fprofile-late-instr-generate[=<profile_file>]? for PGO > Instrumentation, and > >>> -fprofile-late-instr-use[=<profile_file>]? for PGO USE. > >>> (2) Or, late instrumentation can be turned on with an additional > >>> option -fprofile-instr-late with current PGO options. I. e. > >>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO > >>> instrumentation, and -fprofile-instr-late > >>> -fprofile-instr-use[=<profile_file>]? for PGO use. > >>> (3) Alternatively to (2), only keep -fprofile-instr-late option in > PGO > >>> instrumentation. Adding a magic tag in profile so that FE based > profile and > >>> late instrumented profile can be automatically detected by profile > loader In > >>> PGO use compilation. This requires a slight profile format change. > >>> > >>> In our prototype implementation, two new passes are added in the > >>> beginning of PassManagerBuilder::populateModulePassManager(), namely > >>> PreProfileInlinerPass and PGOLateInstrumentationPass. > >>> > >>> > >>> 4.1 Pre-inline pass: > >>> > >>> It is controlled by back-end option "-preinline" and > >>> "-disable-preinline". If the user specifies any llvm option of > >>> "-fprofile-late-instr-{generate|use}, option "-mllvm -preinline" will > be > >>> automatically inserted in the driver.. To disable the pre-inliner when > late > >>> instrumentation is enabled, use option "-mllvm -disable-preinline". > >>> > >>> For now, only minimum tuning is done for the pre-inliner, which simply > >>> adjusts the inline threshold: If -Oz is specified, the threshold is > set to > >>> 25. Otherwise, it is 75. > >>> > >>> The following clean up passes are added to PassManager, right after the > >>> PreProfileInline pass: > >>> createEarlyCSEPass() > >>> createJumpThreadingPass() > >>> createCorrelatedValuePropagationPass() > >>> createCFGSimplificationPass() > >>> createInstructionCombiningPass() > >>> createGVNPass(DisableGVNLoadPRE) > >>> createPeepholePASS() > >>> Some of them might not be necessary. > >>> > >>> 4.2 Late Instrumentation Pass: > >>> The late instrumentation is right after the pre-inline pass and it's > >>> cleanup passes. It is controlled by opt option "-pgo-late-instr-gen" > and > >>> "-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver > will > >>> provide the profile name. > >>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() > >>> is also added to PassManager to lower the instrumentation intrinsics > >>> > >>> PGOLateInstrumeatnion is a module pass that applies the instrumentation > >>> to each function by class PGOLateInstrumentationFunc. For each > function, > >>> perform the following steps: > >>> 1. First collect all the CFG edges. Assign an estimated weight to > each > >>> edge. Critical edges and back-edges are assigned to high value of > weights. > >>> One fake node and a few fake edges (from the fake node to the entry > node, > >>> and from all the exit nodes to the fake node) are also added to the > >>> worklist. > >>> 2. Construct the MST. The edges with the higher weight will be put > to > >>> MST first, unless it forms a cycle. > >>> 3. Traverse the CFG and compute the CFG hash using CRC32 of the > index > >>> of each BB. > >>> The above three steps are the same for profile-generate and profile-use > >>> compilation. > >>> > >>> In the next step, for profile-generation compilation, all the edges > that > >>> not in the MST are instrumented. If this is a critical edge, split the > edge > >>> first. The actual instrumentation is to generate > >>> Intrinsic::instrprof_increment() in the instrumented BB. This > intrinsic will > >>> be lowed by pass createInstrProfilingPass(). > >>> > >>> In the next step, for profile-generation compilation, all the edges > that > >>> not in the MST are instrumented. If this is a critical edge, split the > edge > >>> first. The actual instrumentation is to generate > >>> Intrinsic::instrprof_increment() in the instrumented BB. This > intrinsic will > >>> be lowed by pass createInstrProfilingPass(). > >>> > >>> For -fprofile-use compilation, first read in the counters and the CFG > >>> hash from the profile file. If the CFG hash matches, populate the > counters > >>> to all the edges in reverse topological order of the MST. Once having > all > >>> the edge counts, set the branch weights metadata for the IR having > multiple > >>> branches. Also apply the cold/hot function attributes based on function > >>> level counts. > >>> > >>> > >>> 4.3 Profile Format: > >>> > >>> The late instrumentation profile is mostly the same as the one from > >>> front-end instrument-ion. The difference is > >>> * Function checksums are different. > >>> * Function entry counts are no longer available. > >>> For llvm-profdata utility, options -lateinstr needs to be used to > >>> differentiate FE based and late instrumentation profiles, unless a > magic tag > >>> is added to the profile. > >>> > >>> > >>> 5. References: > >>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of > points > >>> for program frequency counts. BIT Numerical Mathematics 1973, Volume > 13, > >>> Issue 3, pp 313-322 > >>> > >> > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150809/cc486e38/attachment.html>