Sean Silva via llvm-dev
2015-Aug-10 01:23 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Sat, Aug 8, 2015 at 6:31 AM, Xinliang David Li <davidxl at google.com> wrote:
> On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> wrote:
> > Accidentally sent to uiuc server.
> >
> > On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> wrote:
> >> Can you compare your results with another approach: simply do not instrument the top 1% hottest functions (by function entry count)? If this simple approach provides most of the benefits (my measurements on one codebase I tested show that it would eliminate over 97% of total function counts), we may be able to use a simpler approach.
>
> For a static compiler, this is not possible. It also seems to defeat the purpose of PGO -- the hottest functions are those which need profile guidance the most.

In the program I looked at, the top 1% were just trivial getters and constructors or similar. We should already be "getting these right". Stuff like:

  class Foo {
    ....
    Foo(int bar) : m_bar(bar) {}
    int getBar() { return m_bar; }
    ...
  };

Are the results different for your codebases? Have you tried something like simply not instrumenting the hottest 1% or 0.5% of functions? (Maybe restrict the instrumentation skipping to functions of just a single BB with fewer than, say, 10 instructions.)

Rong's approach is quite sophisticated; I'm just interested in getting a sanity check against a "naive" approach to see how much the sophisticated approach is buying us.

> >> The biggest thing I notice about this proposal is that although the focus of the proposal is on reducing profiling overhead, it requires using a different pass pipeline during *both* the "instr-generate" compilation and the "instr-use" compilation. Therefore it is more than simply a reduction in profiling overhead -- it can also change performance even ignoring adding the profiling information in the IR. I think it would be interesting to test the modified pass pipeline even in the absence of profile information to better understand its effects in isolation (for example, maybe it would be good to add these passes during -O3 regardless of whether we are doing PGO).
>
> The pipeline change is very PGO specific. I expect it has very little impact on regular compilations:
> 1) LLVM's bottom-up inliner is already iterative.
> 2) The performance impact (on the instrumented build) can be as large as 4x -- which is unlikely for any non-PGO pipeline change.

With respect to adding extra passes, I'm actually more concerned about the non-instrumented build, for which Rong did not show any data. For example, will users find their program is X% faster than no PGO with ("ME") PGO, but really (X/2)% of that is due simply to the extra passes, and not any profile guidance? We should then prefer to use the extra passes during regular -O3 builds. Conversely, if they find that their program is X% faster with ("ME") PGO, but the extra passes are making the program (X/2)% slower, then the users could be getting (3X/2)% faster instead. I am only concerned about having two variables change simultaneously; I think that instrumenting after some amount of cleanup has been done makes a lot of sense.

Could Rong's proposal be made to work within the existing pipeline, but doing the instrumentation after a subset of the existing pass pipeline has been run?
-- Sean Silva

> LLVM already supports running SCC passes iteratively, so an experiment like this will be easy to do -- the data can be collected.
>
> thanks,
>
> David
>
> >> -- Sean Silva
> >>
> >> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote:
> >>> Instrumentation based Profile Guided Optimization (PGO) is a compiler technique that leverages important program runtime information, such as precise edge counts and frequent value information, to make frequently executed code run faster. It's proven to be one of the most effective ways to improve program performance.
> >>>
> >>> An important design point of PGO is to decide where to place the instrumentation. In current LLVM PGO, the instrumentation is done in the Clang front end (referred to below as FE based instrumentation). Doing this early in the front end gives better coverage information, as there is more precise line number information. The resulting profile also has relatively high tolerance to compiler changes: compiler changes after the instrumentation point will not lead to mismatched IR that invalidates the profile.
> >>>
> >>> On the other hand, doing this too early gives the compiler fewer opportunities to optimize the code before instrumentation. It has a significant impact on instrumentation runtime performance. In addition, it tends to produce a larger binary and profile data file. Our internal C++ benchmarks show that FE based instrumentation degrades performance (compared to the non-instrumented version) by 58.3%, and in some extreme cases the application speed/throughput decreases by 95% (21x runtime slowdown).
> >>>
> >>> Running the instrumented binary too slowly is not desirable in PGO for the following reasons:
> >>>   * It slows down an already lengthy build. In the worst case, the instrumented binary is so slow that it fails to run a representative workload, because slow execution can lead to more time-outs in many server programs. Other typical issues include: text size too big, failure to link instrumented binaries, and memory usage exceeding system limits.
> >>>   * A slow runtime affects program behavior. Real applications sometimes monitor the program runtime and take different execution paths when the program runs too slowly. This would defeat the underlying assumption of PGO and make it less effective.
> >>>
> >>> This work proposes an option to turn on (new) middle-end based instrumentation that aims to speed up instrumentation runtime. The new instrumentation is referred to as ME based instrumentation in this document. Our experimental results show that ME instrumentation can speed up the instrumentation by 80% on average for typical C++ programs. Here are the two main design objectives:
> >>>   * Co-existing with the FE instrumenter: We do not propose to replace FE based instrumentation, because FE based instrumentation has its advantages and applications. Users can choose which phase does the instrumentation via command line options.
> >>>   * PGO runtime support sharing: The ME instrumenter will completely re-use the existing PGO runtime support.
> >>> 1. FE Based Instrumentation Runtime Overhead Analysis
> >>>
> >>> Instrumented binaries are expected to run slower due to the inserted instrumentation code. With FE based instrumentation, the overhead is especially high and the runtime slowdown can be unacceptable in many cases. Further analysis shows that there are 3 important factors contributing to the FE instrumentation slowdown:
> >>>   * [Main] Redundant counter updates of inlined functions. C++ programs can introduce large abstraction penalties by using lots of small inline functions (assignment operators, getters, setters, ctors/dtors etc). The overhead of instrumenting those small functions can be very large, making training runs too slow and in some cases unusable;
> >>>   * Non-optimal placement of the count updates;
> >>>   * A third factor is related to value profiling (to be turned on in the future). Small and hot callee functions taking function pointers (callbacks) can incur overhead due to indirect call target profiling.
> >>>
> >>> 1.1 Redundant Counter Updates
> >>>
> >>> Checking the assembly of an instrumented binary generated by the current LLVM implementation, we can find many sequences of consecutive 'incq' instructions updating different counters in the same basic block. As an example extracted from a real binary:
> >>>   ...
> >>>   incq 0xa91d80(%rip)  # 14df4b8 <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
> >>>   incq 0xa79011(%rip)  # 14c6750 <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
> >>>   incq 0xa79442(%rip)  # 14c6b88 <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
> >>>   incq 0x9c288b(%rip)  # 140ffd8 <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
> >>>   ...
> >>>
> >>> From the profile use point of view, many of these counter updates are redundant. Consider the following example:
> >>>   void bar(){
> >>>     sum++;
> >>>   }
> >>>   void foo() {
> >>>     bar();
> >>>   }
> >>>
> >>> FE based instrumentation needs to insert a counter update for the only BB of bar():
> >>>   bar:                       # @bar
> >>>   # BB#0:                    # %entry
> >>>     incq .L__llvm_profile_counters_bar(%rip)
> >>>     incl sum(%rip)
> >>>     retq
> >>>
> >>> It also needs to insert an update for the BB in function foo(). After inlining bar() into foo(), the code is:
> >>>   foo:                       # @foo
> >>>   # BB#0:                    # %entry
> >>>     incq .L__llvm_profile_counters_foo(%rip)
> >>>     incq .L__llvm_profile_counters_bar(%rip)
> >>>     incl sum(%rip)
> >>>     retq
> >>>
> >>> If bar() is always inlined, .L__llvm_profile_counters_bar(%rip) is redundant -- the counter won't help downstream optimizations. On the other hand, if bar() is a large function and may not be suitable for inlining at all callsites, this counter update is necessary in order to produce more accurate profile data for the out-of-line instance of the callee.
> >>>
> >>> If foo() is a hot function, the overhead of updating two counters can be significant. This is especially bad for C++ programs, where there are tons of small inline functions.
> >>>
> >>> There is another missed opportunity in FE based instrumentation. The small functions' control flow can usually be simplified when they are inlined into caller contexts. Once the control flow is simplified, many counter updates can be eliminated. This is only possible for a middle-end based late instrumenter.
> >>> Defining a custom clean-up pass to remove redundant counter updates is unrealistic and cannot be done in a sane way without destroying the profile integrity of either the out-of-line or the inline instances of the callee.
> >>>
> >>> A much simpler and cleaner solution is to do a pre-inline pass that inlines all the trivial inlines before instrumentation. In addition to removing the unnecessary count updates for the inline instances, another advantage of pre-inlining is that it provides a context sensitive profile for these small inlined functions. This context sensitive profile can further improve the PGO based optimizations. Here is a contrived example:
> >>>   void bar (int n) {
> >>>     if (n&1)
> >>>       do_sth1();
> >>>     else
> >>>       do_sth2();
> >>>   }
> >>>
> >>>   void caller() {
> >>>     int s = 1;
> >>>     for (; s<100; s+=2)
> >>>       bar(s);
> >>>
> >>>     for (s = 102; s< 200; s+=2)
> >>>       bar(s);
> >>>   }
> >>>
> >>> The direction of the branch inside bar will be exactly opposite between the two different callsites in 'caller'. Without pre-inlining, the branch probability will be 50-50, which is useless for later optimizations. With pre-inlining, the profile will have the perfect branch count for each callsite. The positive performance impact of context sensitive profiling due to pre-inlining has been observed in real world large C++ programs. Supporting context sensitive profiling is another way to solve this, but it would introduce large additional runtime/memory overhead.
> >>>
> >>> 1.2 Non-optimal placement of count updates
> >>>
> >>> Another, much smaller, slowdown factor is the placement of the counter updates. Current front-end based instrumentation applies the instrumentation to each front-end lexical construct. It also minimizes the number of static instrumentations. Note that it always instruments the entry count of the CFG. This may result in higher dynamic instruction counts. For example:
> >>>          BB0
> >>>           | 100
> >>>          BB1
> >>>      90 /    \ 10
> >>>       BB2    BB3
> >>>      90 \    / 10
> >>>          BB4
> >>> In the above example, FE based instrumentation will always insert a count update in BB0. The dynamic instrumentation count will be either 110 (instrument bb0->bb1 and bb1->bb2) or 190 (instrument bb0->bb1 and bb1->bb3). A better instrumentation is to instrument (bb1->bb2 and bb1->bb3), where the dynamic instrumentation count is 100.
> >>>
> >>> Our experiments show that optimal placement based on edge hotness can improve instrumented code performance by about 10%. While it's hard to find the optimal placement of count updates, compiler heuristics can be used to get a better placement. These heuristics can be based on static profile prediction or user annotations (like __builtin_expect) to estimate the relative edge hotness and put instrumentation on the less hot edges. The initial late instrumentation has not fully implemented this placement strategy yet. With that implemented, we expect even better results than what is reported here.
> >>>
> >>> For real world programs, another major source of slowdown is data races and false sharing of the counter updates in highly threaded programs. Pre-inlining can alleviate this problem as the counters in the inline instances are no longer shared.
> >>> But the complete solution to the data race issue is orthogonal to the problem we are trying to solve here.
> >>>
> >>> 2. High Level Design
> >>>
> >>> We propose to perform a pre-profile inline pass before the PGO instrumentation pass. Since the instrumentation pass runs after inlining, it has to be done in the middle end.
> >>>
> >>> (1) The pre-inline pass
> >>> We will invoke a pre-inline pass before the instrumentation. When PGO is on, inlining will be split into two passes:
> >>>   * A pre-inline pass that is scheduled before the profile instrumentation/annotation
> >>>   * A post-inline pass, which is the regular inline pass after instrumentation/annotation
> >>> By design, all callsites that are beneficial to inline without requiring profile data should be inlined in the pre-inline pass. This includes all callsites that will shrink code size after inlining. All the remaining callsites will be left to the regular inline pass when profile data is available.
> >>>
> >>> After pre-inlining, a CFG based profile instrumentation/annotation will be done. A minimum weight spanning tree (MST) of the CFG is first computed, and then only the edges not in the MST will be instrumented. The counter update instructions are placed in the basic blocks.
> >>>
> >>> (2) Minimum Spanning Tree (MST) based instrumentation
> >>> A naive way of instrumenting is to insert a count update for every edge in the CFG, which results in many redundant updates that make the runtime very slow. Knuth [1] proposed a minimum spanning tree based method: given a CFG, first compute a spanning tree; all edges that are not in the MST will be instrumented. In the profile use compilation, the counters are propagated (from the leaves of the spanning tree) to all the edges. Knuth proved this method inserts the minimum number of instrumentations. The MST based method only guarantees that the number of static instrumentations is minimized, not the dynamic instrumentation count. To reduce the dynamic instrumentation count, edges with potentially high counts are put into the MST first so that they have less chance of being instrumented.
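(A minimal standalone sketch of the spanning-tree edge selection described above -- illustrative only, not the proposed LLVM pass. It treats estimated hotness as the edge weight, greedily builds the spanning tree from the hottest edges, and marks only the leftover edges for instrumentation; the CFG is the one from section 1.2, plus one simplified fake exit->entry edge.)

  // Knuth-style edge selection: hot edges go into the spanning tree first,
  // so only the remaining (colder) edges receive counters.
  #include <algorithm>
  #include <cstdio>
  #include <numeric>
  #include <vector>

  struct Edge { int Src, Dst; unsigned Weight; bool Instrument = false; };

  struct UnionFind {
    std::vector<int> Parent;
    explicit UnionFind(int N) : Parent(N) { std::iota(Parent.begin(), Parent.end(), 0); }
    int find(int X) { return Parent[X] == X ? X : Parent[X] = find(Parent[X]); }
    bool unite(int A, int B) {
      A = find(A); B = find(B);
      if (A == B) return false;   // would form a cycle
      Parent[A] = B;
      return true;
    }
  };

  // Mark every edge that does not make it into the spanning tree.
  void selectInstrumentedEdges(std::vector<Edge> &Edges, int NumBlocks) {
    std::vector<Edge *> Order;
    for (Edge &E : Edges) Order.push_back(&E);
    // Consider hotter edges first so they are more likely to be in the tree.
    std::sort(Order.begin(), Order.end(),
              [](const Edge *A, const Edge *B) { return A->Weight > B->Weight; });
    UnionFind UF(NumBlocks);
    for (Edge *E : Order)
      E->Instrument = !UF.unite(E->Src, E->Dst);  // not in tree -> counter
  }

  int main() {
    // CFG from section 1.2: BB0->BB1 (100), BB1->BB2 (90), BB1->BB3 (10),
    // BB2->BB4 (90), BB3->BB4 (10), plus a fake BB4->BB0 edge.
    std::vector<Edge> Edges = {{0, 1, 100}, {1, 2, 90}, {1, 3, 10},
                               {2, 4, 90},  {3, 4, 10}, {4, 0, 100}};
    selectInstrumentedEdges(Edges, 5);
    for (const Edge &E : Edges)
      if (E.Instrument)
        std::printf("instrument edge BB%d->BB%d\n", E.Src, E.Dst);
  }

On this toy CFG the sketch ends up instrumenting bb2->bb4 and bb3->bb4, whose combined dynamic count (100) matches the better placement discussed in section 1.2.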
> >>> 3. Experimental Results
> >>>
> >>> 3.1 Measurement of the efficiency of instrumentation
> >>> Other than the runtime of the instrumented binaries, a more direct measurement of the instrumentation overhead is the sum of the raw profile count values. Note that regardless of what kind of instrumentation is used, the raw profile counts should be able to reconstruct all the edge count values for the whole program. All raw profile values are obtained by incrementing the counter variable value by one. The sum of the raw profile count values is therefore roughly the number of dynamically executed counter updates in the instrumented code. The lower the value, the more efficient the instrumentation.
> >>>
> >>> 3.2 LLVM instrumentation runtime for SPEC2006 C/C++ programs and SPEC2K eon
> >>> The performance speedup is computed as (FE_instrumentation_runtime / ME_instrumentation_runtime - 1).
> >>>
> >>> We ran the experiments on all C/C++ programs in SPEC2006 and on 252.eon from SPEC2000. For C programs, except for one outlier, 456.hmmer, there are small ups and downs across different programs. Late instrumentation improves hmmer a lot, but that is probably due to unrelated loop optimizations (90% of the runtime is spent in one loop nest).
> >>>
> >>> For C++ programs, the performance impact of late instrumentation is very large, which is as expected. The following table shows the result. For some C++ programs, the speedup is huge. For example, in 483.xalancbmk, late instrumentation speeds up performance by ~60%. Among all the SPEC C++ programs, only 444.namd is an outlier -- it uses a lot of macros and is a very C-like program.
> >>>
> >>>   Program          Speedup
> >>>   471.omnetpp       16.03%
> >>>   473.astar          5.00%
> >>>   483.xalancbmk     58.57%
> >>>   444.namd          -0.90%
> >>>   447.dealII        60.47%
> >>>   450.soplex         8.20%
> >>>   453.povray        11.34%
> >>>   252.eon           35.33%
> >>>   -------------------------
> >>>   Geomean           21.01%
> >>>
> >>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
> >>> We also collected some statistics of the profiles generated by FE based instrumentation and late instrumentation, namely:
> >>>   1. the number of functions that are instrumented,
> >>>   2. the resulting profile file size,
> >>>   3. the sum of raw count values mentioned earlier -- we use it to measure the efficiency of the instrumentation.
> >>> The next table shows the ratio of each metric for late instrumentation relative to FE based instrumentation for the C++ programs: column (1) shows the ratio of instrumented functions; column (2) shows the ratio of the profile file size; column (3) shows the ratio of the sum of raw count values.
> >>>
> >>>                      (1)        (2)        (3)
> >>>   471.omnetpp       85.36%    110.26%     46.52%
> >>>   473.astar         64.86%     72.72%     63.13%
> >>>   483.xalancbmk     51.83%     56.11%     35.77%
> >>>   444.namd          75.36%     82.82%     85.77%
> >>>   447.dealII        43.42%     46.46%     26.75%
> >>>   450.soplex        71.80%     87.54%     51.19%
> >>>   453.povray        78.68%     83.57%     64.37%
> >>>   252.eon           72.06%     91.22%     30.02%
> >>>   ----------------------------------------
> >>>   Geomean           66.50%     76.36%     47.01%
> >>>
> >>> For FE based instrumentation, the profile count variables generated for dead functions (such as __llvm_prf_names, __llvm_prf_data, and __llvm_prf_cnts) are not removed from the data/text segments, nor from the resulting profile. There is a recent patch that removes these unused data for COMDAT functions, but that patch won't touch regular functions. This is the main reason for the larger number of instrumented functions and the larger profile file size of FE based instrumentation. The reduction in the sum of raw count values is mainly due to the elimination of redundant profile updates enabled by the pre-inlining.
> >>>
> >>> For C programs, we observe a similar improvement in the profile size (geomean ratio of 73.75%) and smaller improvements in the number of instrumented functions (geomean ratio of 87.49%) and the sum of raw count values (geomean of 82.76%).
> >>>
> >>> 3.4 LLVM instrumentation runtime performance for Google internal C/C++ benchmarks
> >>>
> >>> We also used Google internal benchmarks (mostly typical C++ applications) to measure the relative performance between FE based instrumentation and late instrumentation. The following table shows the speedup of late instrumentation vs FE based instrumentation. Note that C++_benchmark01 is a very large multi-threaded C++ program; late instrumentation sees a 4x speedup there.
> >>> Larger than 3x speedups are also seen in many other programs.
> >>>
> >>>   C++_benchmark01   416.98%
> >>>   C++_benchmark02     6.29%
> >>>   C++_benchmark03    22.39%
> >>>   C++_benchmark04    28.05%
> >>>   C++_benchmark05     2.00%
> >>>   C++_benchmark06   675.89%
> >>>   C++_benchmark07   359.19%
> >>>   C++_benchmark08   395.03%
> >>>   C_benchmark09      15.11%
> >>>   C_benchmark10       5.47%
> >>>   C++_benchmark11     5.73%
> >>>   C++_benchmark12     2.31%
> >>>   C++_benchmark13    87.73%
> >>>   C++_benchmark14     7.22%
> >>>   C_benchmark15      -0.51%
> >>>   C++_benchmark16    59.15%
> >>>   C++_benchmark17   358.82%
> >>>   C++_benchmark18   861.36%
> >>>   C++_benchmark19    29.62%
> >>>   C++_benchmark20    11.82%
> >>>   C_benchmark21       0.53%
> >>>   C++_benchmark22    43.10%
> >>>   ---------------------------
> >>>   Geomean            83.03%
> >>>
> >>> 3.5 Statistics of LLVM profiles for Google internal benchmarks
> >>>
> >>> The following shows the profile statistics for the Google internal benchmarks (columns as in 3.3).
> >>>                      (1)       (2)       (3)
> >>>   C++_benchmark01   36.84%    40.29%     2.32%
> >>>   C++_benchmark02   39.20%    40.49%    42.39%
> >>>   C++_benchmark03   39.37%    40.65%    23.24%
> >>>   C++_benchmark04   39.13%    40.68%    17.70%
> >>>   C++_benchmark05   36.58%    38.27%    51.08%
> >>>   C++_benchmark06   29.50%    27.87%     2.87%
> >>>   C++_benchmark07   29.50%    27.87%     1.73%
> >>>   C++_benchmark08   29.50%    27.87%     4.17%
> >>>   C_benchmark09     53.95%    68.00%    11.18%
> >>>   C_benchmark10     53.95%    68.00%    31.74%
> >>>   C++_benchmark11   36.40%    37.07%    46.12%
> >>>   C++_benchmark12   38.44%    41.90%    73.59%
> >>>   C++_benchmark13   39.28%    42.72%    29.56%
> >>>   C++_benchmark14   38.59%    42.20%    13.42%
> >>>   C_benchmark15     57.45%    48.50%    66.99%
> >>>   C++_benchmark16   36.86%    42.18%    16.53%
> >>>   C++_benchmark17   37.82%    39.77%    13.68%
> >>>   C++_benchmark18   37.82%    39.77%     7.96%
> >>>   C++_benchmark19   37.52%    40.46%     1.85%
> >>>   C++_benchmark20   32.37%    30.44%    19.69%
> >>>   C_benchmark21     37.63%    40.42%    88.81%
> >>>   C++_benchmark22   36.28%    36.92%    21.62%
> >>>   --------------------------------------------------
> >>>   Geomean           38.22%    39.96%    15.58%
> >>>
> >>> 4. Implementation Details
> >>>
> >>> We need to add new option(s) for the alternative PGO instrumentation pass in the middle end. They can take one of the following forms:
> >>>   (1) Completely new options that are on par with the current PGO options: -fprofile-late-instr-generate[=<profile_file>]? for PGO instrumentation, and -fprofile-late-instr-use[=<profile_file>]? for PGO use.
> >>>   (2) Or, late instrumentation can be turned on with an additional option -fprofile-instr-late alongside the current PGO options, i.e. -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO instrumentation, and -fprofile-instr-late -fprofile-instr-use[=<profile_file>]? for PGO use.
> >>>   (3) Alternatively to (2), only keep the -fprofile-instr-late option for PGO instrumentation, and add a magic tag to the profile so that FE based profiles and late instrumented profiles can be automatically detected by the profile loader in the PGO use compilation. This requires a slight profile format change.
> >>>
> >>> In our prototype implementation, two new passes are added at the beginning of PassManagerBuilder::populateModulePassManager(), namely PreProfileInlinerPass and PGOLateInstrumentationPass.
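(The following is a rough sketch of what wiring these passes into the legacy pass manager setup might look like -- it is not the actual patch. createPGOLateInstrumentationPass() is a hypothetical factory for the proposed pass and the flag plumbing is invented; the inline threshold and the cleanup/lowering passes follow sections 4.1 and 4.2 below.)

  #include "llvm/IR/LegacyPassManager.h"
  #include "llvm/Pass.h"
  #include "llvm/Transforms/IPO.h"
  #include "llvm/Transforms/Instrumentation.h"
  #include "llvm/Transforms/Scalar.h"

  using namespace llvm;

  // Hypothetical factory for the proposed MST-based instrumentation pass.
  ModulePass *createPGOLateInstrumentationPass();

  // Sketch of the passes the proposal would add at the start of
  // PassManagerBuilder::populateModulePassManager().
  static void addPreInlineAndLateInstrPasses(legacy::PassManagerBase &MPM,
                                             bool OptForSize /* -Oz */,
                                             bool LateInstrGen) {
    // Pre-inline trivial call sites before instrumentation
    // (threshold 25 at -Oz, 75 otherwise; see section 4.1).
    MPM.add(createFunctionInliningPass(OptForSize ? 25 : 75));
    // Cleanup passes scheduled right after the pre-inliner (section 4.1).
    MPM.add(createEarlyCSEPass());
    MPM.add(createJumpThreadingPass());
    MPM.add(createCorrelatedValuePropagationPass());
    MPM.add(createCFGSimplificationPass());
    MPM.add(createInstructionCombiningPass());
    MPM.add(createGVNPass(/*NoLoadPRE=*/false));
    if (LateInstrGen) {
      // MST-based counter insertion (section 4.2), then lowering of the
      // llvm.instrprof.* intrinsics it emits.
      MPM.add(createPGOLateInstrumentationPass());
      MPM.add(createInstrProfilingPass());
    }
  }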
> >>> 4.1 Pre-inline pass
> >>>
> >>> It is controlled by the back-end options "-preinline" and "-disable-preinline". If the user specifies any of the "-fprofile-late-instr-{generate|use}" options, "-mllvm -preinline" will be automatically inserted by the driver. To disable the pre-inliner when late instrumentation is enabled, use "-mllvm -disable-preinline".
> >>>
> >>> For now, only minimal tuning is done for the pre-inliner; it simply adjusts the inline threshold: if -Oz is specified, the threshold is set to 25; otherwise, it is 75.
> >>>
> >>> The following cleanup passes are added to the PassManager right after the PreProfileInline pass:
> >>>   createEarlyCSEPass()
> >>>   createJumpThreadingPass()
> >>>   createCorrelatedValuePropagationPass()
> >>>   createCFGSimplificationPass()
> >>>   createInstructionCombiningPass()
> >>>   createGVNPass(DisableGVNLoadPRE)
> >>>   createPeepholePass()
> >>> Some of them might not be necessary.
> >>>
> >>> 4.2 Late Instrumentation Pass
> >>>
> >>> The late instrumentation runs right after the pre-inline pass and its cleanup passes. It is controlled by the opt options "-pgo-late-instr-gen" and "-pgo-late-instr-use". For the "-pgo-late-instr-use" option, the driver will provide the profile name. For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() is also added to the PassManager to lower the instrumentation intrinsics.
> >>>
> >>> PGOLateInstrumentation is a module pass that applies the instrumentation to each function via the class PGOLateInstrumentationFunc. For each function, it performs the following steps:
> >>>   1. Collect all the CFG edges and assign an estimated weight to each edge. Critical edges and back edges are assigned high weight values. One fake node and a few fake edges (from the fake node to the entry node, and from all the exit nodes to the fake node) are also added to the worklist.
> >>>   2. Construct the MST. The edges with higher weights are put into the MST first, unless doing so forms a cycle.
> >>>   3. Traverse the CFG and compute the CFG hash using a CRC32 of the index of each BB.
> >>> The above three steps are the same for the profile-generate and profile-use compilations.
> >>>
> >>> Next, for the profile-generation compilation, all the edges that are not in the MST are instrumented. If an edge is a critical edge, it is split first. The actual instrumentation generates an Intrinsic::instrprof_increment() call in the instrumented BB. This intrinsic will be lowered by the pass created by createInstrProfilingPass().
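(For illustration, emitting that intrinsic for one instrumented block might look roughly like the sketch below -- this is not the proposal's code. The counter-name global, function hash, counter count, and counter index are assumed to be computed elsewhere by the pass.)

  #include <cstdint>
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/GlobalVariable.h"
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  // Insert one llvm.instrprof.increment at the start of BB.  NamePtr is a
  // global holding the function name, FuncHash is the CRC32 CFG hash from
  // step 3, NumCounters is the number of instrumented (non-MST) edges, and
  // Index is the counter slot for this edge.  Critical edges are assumed to
  // have been split already so that BB is the right place for the counter.
  static void emitCounterIncrement(BasicBlock &BB, GlobalVariable *NamePtr,
                                   uint64_t FuncHash, uint32_t NumCounters,
                                   uint32_t Index) {
    Module *M = BB.getModule();
    IRBuilder<> Builder(&*BB.getFirstInsertionPt());
    Function *IncIntrinsic =
        Intrinsic::getDeclaration(M, Intrinsic::instrprof_increment);
    Builder.CreateCall(
        IncIntrinsic,
        {ConstantExpr::getBitCast(NamePtr, Builder.getInt8PtrTy()),
         Builder.getInt64(FuncHash), Builder.getInt32(NumCounters),
         Builder.getInt32(Index)});
  }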
> >>> For the -fprofile-use compilation, first read the counters and the CFG hash from the profile file. If the CFG hash matches, populate the counters for all the edges in reverse topological order of the MST. Once all the edge counts are available, set the branch weight metadata for the IR with multiple branches. Also apply the cold/hot function attributes based on the function-level counts.
> >>>
> >>> 4.3 Profile Format
> >>>
> >>> The late instrumentation profile is mostly the same as the one from front-end instrumentation. The differences are:
> >>>   * Function checksums are different.
> >>>   * Function entry counts are no longer available.
> >>> For the llvm-profdata utility, an option -lateinstr needs to be used to differentiate FE based and late instrumentation profiles, unless a magic tag is added to the profile.
> >>>
> >>> 5. References
> >>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points for program frequency counts. BIT Numerical Mathematics, 1973, Volume 13, Issue 3, pp. 313-322.
Rong Xu via llvm-dev
2015-Aug-10 22:34 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Sun, Aug 9, 2015 at 6:23 PM, Sean Silva <chisophugis at gmail.com> wrote:
> On Sat, Aug 8, 2015 at 6:31 AM, Xinliang David Li <davidxl at google.com> wrote:
>> On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> > Accidentally sent to uiuc server.
>> >
>> > On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> >> Can you compare your results with another approach: simply do not instrument the top 1% hottest functions (by function entry count)? If this simple approach provides most of the benefits (my measurements on one codebase I tested show that it would eliminate over 97% of total function counts), we may be able to use a simpler approach.
>>
>> For a static compiler, this is not possible. It also seems to defeat the purpose of PGO -- the hottest functions are those which need profile guidance the most.
>
> In the program I looked at, the top 1% were just trivial getters and constructors or similar. We should already be "getting these right". Stuff like:
>
>   class Foo {
>     ....
>     Foo(int bar) : m_bar(bar) {}
>     int getBar() { return m_bar; }
>     ...
>   };
>
> Are the results different for your codebases? Have you tried something like simply not instrumenting the hottest 1% or 0.5% of functions? (Maybe restrict the instrumentation skipping to functions of just a single BB with fewer than, say, 10 instructions.)
>
> Rong's approach is quite sophisticated; I'm just interested in getting a sanity check against a "naive" approach to see how much the sophisticated approach is buying us.

I've seen plenty of these kinds of code patterns in the benchmarks. I did not measure whether they are in the top 1% of hottest functions, but I would not be surprised if they were.

Your suggestion of not instrumenting these small single-BB functions is interesting. I'm pretty sure it would cut a lot of the overhead, but it would compromise the integrity of the profile (for example, it would be harder to use for coverage testing).

I'm curious why you think my approach is sophisticated. It's pretty simple at a high level: do a pre-inline pass that inlines all the trivial inlines and then do the instrumentation. This way the instrumented binary is faster and the resulting profile is smaller, and complete.
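(For concreteness, the static part of the "naive" filter being discussed could be as simple as the sketch below -- illustrative only, using the 10-instruction cutoff suggested in this thread; identifying the hottest 1% would additionally need some profile or heuristic, and skipping these functions leaves the corresponding holes in the profile.)

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Function.h"

  using namespace llvm;

  // Skip instrumentation for tiny single-block functions (trivial getters,
  // setters, ctors), which dominate the counter-update overhead in the
  // measurements discussed above.  The cutoff of 10 instructions is the
  // number suggested in the thread, not a tuned value.
  static bool shouldSkipInstrumentation(const Function &F) {
    return F.size() == 1 && F.getEntryBlock().size() < 10;
  }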
>> >> The biggest thing I notice about this proposal is that although the focus of the proposal is on reducing profiling overhead, it requires using a different pass pipeline during *both* the "instr-generate" compilation and the "instr-use" compilation. Therefore it is more than simply a reduction in profiling overhead -- it can also change performance even ignoring adding the profiling information in the IR. I think it would be interesting to test the modified pass pipeline even in the absence of profile information to better understand its effects in isolation (for example, maybe it would be good to add these passes during -O3 regardless of whether we are doing PGO).
>>
>> The pipeline change is very PGO specific. I expect it has very little impact on regular compilations:
>> 1) LLVM's bottom-up inliner is already iterative.
>> 2) The performance impact (on the instrumented build) can be as large as 4x -- which is unlikely for any non-PGO pipeline change.
>
> With respect to adding extra passes, I'm actually more concerned about the non-instrumented build, for which Rong did not show any data. For example, will users find their program is X% faster than no PGO with ("ME") PGO, but really (X/2)% of that is due simply to the extra passes, and not any profile guidance? We should then prefer to use the extra passes during regular -O3 builds. Conversely, if they find that their program is X% faster with ("ME") PGO, but the extra passes are making the program (X/2)% slower, then the users could be getting (3X/2)% faster instead. I am only concerned about having two variables change simultaneously; I think that instrumenting after some amount of cleanup has been done makes a lot of sense.

I agree with David that LLVM's bottom-up, iterative inliner should make the pre-inline pass have little performance impact on a regular build, at least in theory. The performance swing caused by the pre-inline pass is likely due to inliner implementation/heuristic issues, and should be handled in the inliner.

Here is the performance data for enabling the pre-inliner in a regular compilation (i.e. O2).

SPEC2006 and SPEC2000 C++ programs (train input):
  program          O2_+_preinline_time / O2_time - 1
  471.omnetpp        0.40%
  473.astar         11.62%
  483.xalancbmk     -3.58%
  444.namd           0.00%
  447.dealII         1.02%
  450.soplex         0.83%
  453.povray        -1.81%
  252.eon            0.33%
Note that this is time, so 473.astar runs 11.62% longer with the pre-inline pass.

For Google internal benchmarks, we use the performance number (O2_+_pre_inline speedup vs O2):
  C++benchmark01    -6.88%
  C++benchmark02    +4.42%
  C++benchmark03    +0.74%
  C++benchmark04    +0.56%
  C++benchmark05    -1.02%
  C++benchmark06   -14.98%
  C++benchmark07    +4.21%
  C++benchmark08    -1.00%
  C_benchmark09     +0.92%
  C_benchmark10     -0.43%
  C++benchmark11   -12.23%
  C++benchmark12    +0.65%
  C++benchmark13    -0.96%
  C++benchmark14   -12.89%
  C_benchmark15     -0.12%
  C++benchmark16    -0.07%
  C++benchmark17    -0.18%
  C++benchmark18    +0.24%
  C++benchmark19    -1.86%
  C++benchmark20    +3.63%
  C_benchmark21     +0.40%
  C++benchmark22    -0.08%
  --------------------------------
  geometric mean    -1.82%
For many benchmarks, we get a negative speedup (i.e. a slowdown).

> Could Rong's proposal be made to work within the existing pipeline, but doing the instrumentation after a subset of the existing pass pipeline has been run?
>
> -- Sean Silva
>
>> LLVM already supports running SCC passes iteratively, so an experiment like this will be easy to do -- the data can be collected.
>>
>> thanks,
>>
>> David
>>
>> >> -- Sean Silva
>> >>
>> >> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote:
>> >>> [snip -- the full RFC text, quoted in the first message of this thread]
>> >>> >> >>> On the other hand, doing this too early gives the compiler fewer >> >>> opportunities to optimize the code before instrumentation. It has >> >>> significant impact on instrumentation runtime performance. In >> addition, it >> >>> tends to produce larger binary and profile data file. Our internal >> C++ >> >>> benchmarks shows that FE based instrumentation degrades the >> performance >> >>> (compared to non-instrumented version) by 58.3%, and in some extreme >> cases, >> >>> the application speed/throughput decreases by 95% (21x runtime >> slowdown). >> >>> >> >>> Running the instrumented binary too slow is not desirable in PGO for >> the >> >>> following reasons: >> >>> * This slows down already lengthy build time. In the worst case, >> the >> >>> instrumented binary is so slow that it fails to run a representative >> >>> workload, because slow execution can leads to more time-outs in many >> server >> >>> programs. Other typical issues include: text size too big, failure to >> link >> >>> instrumented binaries, memory usage exceeding system limits. >> >>> * Slow runtime affects the program behavior. Real applications >> >>> sometimes monitor the program runtime and have different execution >> path when >> >>> the program run too slow. This would defeat the underlying assumption >> of PGO >> >>> and make it less effective. >> >>> >> >>> This work proposes an option to turn on middle-end based >> instrumentation >> >>> (new) that aims to speed up instrumentation runtime. The new >> instrumentation >> >>> is referred to as ME based instrumentation in this document. Our >> >>> experimental results show that ME instrumentation can speed up the >> >>> instrumentation by 80% on average for typical C++ programs. Here are >> the two >> >>> main design objectives: >> >>> * Co-existing with FE Instrumenter: We do not propose to replace >> the >> >>> FE based instrumentation because FE based instrumentation has its >> advantages >> >>> and applications. User can choose which phase to do instrumentation >> via >> >>> command line options. >> >>> * PGO Runtime Support Sharing: The ME instrumenter will completely >> >>> re-use the existing PGO’s runtime support. >> >>> >> >>> 1. FE Based Instrumentation Runtime Overhead Analysis >> >>> >> >>> Instrumented binaries are expected to run slower due to >> instrumentation >> >>> code inserted. With FE based instrumentation, the overhead is >> especially >> >>> high and runtime slowdown can be unacceptable in many cases. Further >> >>> analysis shows that there are 3 important factors contributing to FE >> >>> instrumentation slowdown : >> >>> * [Main] Redundant counter updates of inlined functions. C++ >> programs >> >>> can introduce large abstraction penalties by using lots of small >> inline >> >>> functions (assignment operators, getters, setters, ctors/dtors etc). >> >>> Overhead of instrumenting those small functions can be very large, >> making >> >>> training runs too slow and in some cases to usable; >> >>> * Non-optimal placement of the count updates; >> >>> * A third factor is related value profiling (to be turned on in the >> >>> future). Small and hot callee functions taking function pointer >> (callbacks) >> >>> can incur overhead due to indirect call target profiling. 
>> >>> >> >>> >> >>> 1.1 Redundant Counter Update >> >>> >> >>> If checking the assembly of the instrumented binary generated by >> current >> >>> LLVM implementation, we can find many sequence of consecutive 'incq' >> >>> instructions that updating difference counters in the same basic >> block. As >> >>> an example that extracted from real binary: >> >>> ... >> >>> incq 0xa91d80(%rip) # 14df4b8 >> >>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8> >> >>> incq 0xa79011(%rip) # 14c6750 >> >>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm> >> >>> incq 0xa79442(%rip) # 14c6b88 >> >>> >> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv> >> >>> incq 0x9c288b(%rip) # 140ffd8 >> >>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl> >> >>> ... >> >>> >> >>> From profile use point of view, many of these counter update are >> >>> redundant. Considering the following example: >> >>> void bar(){ >> >>> sum++; >> >>> } >> >>> void foo() { >> >>> bar(); >> >>> } >> >>> >> >>> FE based instrumentation needs to insert counter update for the only >> BB >> >>> of the bar(). >> >>> bar: # @bar >> >>> # BB#0: # %entry >> >>> incq .L__llvm_profile_counters_bar(%rip) >> >>> incl sum(%rip) >> >>> retq >> >>> >> >>> It also need to insert the update the BB in function foo(). After >> >>> inlining bar to foo(), the code is: >> >>> foo: # @foo >> >>> # BB#0: # %entry >> >>> incq .L__llvm_profile_counters_foo(%rip) >> >>> incq .L__llvm_profile_counters_bar(%rip) >> >>> incl sum(%rip) >> >>> retq >> >>> >> >>> If bar() should be always inlined, >> .L__llvm_profile_counters_bar(%rip) is >> >>> redundant -- the counter won't help downstream optimizations. On the >> other >> >>> hand, if bar() is a large function and may not be suitable to be >> inlined for >> >>> all callsites, this counter updated is necessary in order to produce >> more >> >>> accurate profile data for the out-of-line instance of the callee. >> >>> >> >>> If foo() is a hot function, the overhead of updating two counters can >> be >> >>> significant. This is especially bad for C++ program where there are >> tons of >> >>> small inline functions. >> >>> >> >>> There is another missing opportunity in FE based instrumentation. The >> >>> small functions’ control flow can usually be simplified when they are >> >>> inlined into caller contexts. Once the control flow is simplified, >> many >> >>> counter updates can therefore be eliminated. This is only possible >> for a >> >>> middle end based late instrumenter. Defining a custom clean-up pass to >> >>> remove redundant counter update is unrealistic and cannot be done in >> a sane >> >>> way without destroying the profile integrity of neither the >> out-of-line nor >> >>> inline instances of the callee. >> >>> >> >>> A much simpler and cleaner solution is to do a pre-inline pass to >> inline >> >>> all the trivial inlines before instrumentation. In addition to >> removing the >> >>> unnecessary count updates for the inline instances, another >> advantage of >> >>> pre-inline is to provide context sensitive profile for these small >> inlined >> >>> functions. This context senstive profile can further improve the PGO >> based >> >>> optimizations. 
Here is a contrived example: >> >>> void bar (int n) { >> >>> if (n&1) >> >>> do_sth1(); >> >>> else >> >>> do_sth2(); >> >>> } >> >>> >> >>> void caller() { >> >>> int s = 1; >> >>> for (; s<100; s+=2) >> >>> bar(s); >> >>> >> >>> for (s = 102; s< 200; s+=2) >> >>> bar(s); >> >>> } >> >>> >> >>> The direction of the branch inside bar will be totally opposite >> between >> >>> two different callsites in ‘caller’. Without pre-inlining, the branch >> >>> probability will be 50-50 which will be useless for later >> optimizations. >> >>> With pre-inlining, the profile will have the perfect branch count for >> each >> >>> callsite. The positive performance impact of context sensitive >> profiling due >> >>> to pre-inlining has been observed in real world large C++ programs. >> >>> Supporting context sensitive profiling is another way to solve this, >> but it >> >>> will introduce large additional runtime/memory overhead. >> >>> >> >>> >> >>> 1.2 Non-optimal placement of count update >> >>> >> >>> Another much smaller showdown factor is the placement of the counter >> >>> updates. Current front-end based instrumentation applies the >> instrumentation >> >>> to each front-end lexical construct. It also minimizes the number of >> static >> >>> instrumentations. Note that it always instruments the entry count of >> the >> >>> CFG. This may result in higher dynamic instruction counts. For >> example, >> >>> BB0 >> >>> | 100 >> >>> BB1 >> >>> 90 / \ 10 >> >>> BB2 BB3 >> >>> 90 \ / 10 >> >>> BB4 >> >>> Like the the above example, FE based instrumentation will always >> insert >> >>> count update in BB0. The dynamic instrumentation count will be >> either 110 >> >>> (Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A >> better >> >>> instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the >> dynamic >> >>> instrumentation count is 100. >> >>> >> >>> Our experimental shows that the optimal placement based on edge >> hotness >> >>> can improve instrumented code performance by about 10%. While it’s >> hard to >> >>> find the optimal placement of count update, compiler heuristics can >> be used >> >>> the get the better placement. These heuristics can be based on static >> >>> profile prediction or user annotations (like __buildin_expect) to >> estimate >> >>> the relative edge hotness and put instrumentations on the less hot >> edges. >> >>> The initial late instrumentation has not fully implemented this >> placement >> >>> strategy yet. With that implemented, we expect even better results >> than >> >>> what is reported here. For real world programs, another major source >> of the >> >>> slowdown is the data racing and false sharing of the counter update >> for >> >>> highly threaded programs. Pre-inlining can alleviate this problem as >> the >> >>> counters in the inline instances are not longer shared. But the >> complete >> >>> solution to the data racing issue is orthogonal to the problem we try >> to >> >>> solve here. >> >>> >> >>> >> >>> 2. High Level Design >> >>> >> >>> We propose to perform a pre-profile inline pass before the PGO >> >>> instrumentation pass. Since the instrumentation pass is after inine, >> it has >> >>> to be done in the middle-end. >> >>> >> >>> (1) The pre-inline pass >> >>> We will invoke a pre-inline pass before the instrumentation. 
When PGO >> is >> >>> on, the inlining will be split into two passes: >> >>> * A pre-inline pass that is scheduled before the profile >> >>> instrumentation/annotation >> >>> * A post-inline pass which is the regular inline pass after >> >>> instrumentation/annotation >> >>> By design, all beneficial callsites without requiring profile data >> should >> >>> be inlined in the pre-inline pass. It includes all callsites that will >> >>> shrink code size after inlining. All the remaining callsites will be >> left to >> >>> the regular inline pass when profile data is available. >> >>> >> >>> After pre-inline, a CFG based profile instrumentation/annotation will >> be >> >>> done. A minimum weight spanning tree (MST) in CFG is first computed, >> then >> >>> only the edges not in the MST will be instrumented. The counter update >> >>> instructions are placed in the basic blocks. >> >>> >> >>> (2) Minimum Spanning Tree (MST) based instrumentation >> >>> A native way of instrumentation is to insert a count update for every >> >>> edge in CFG which will result in too many redundant updates that >> makes the >> >>> runtime very slow. Knuth [1] proposed a minimum spanning tree based >> method: >> >>> given a CFG, first compute a spanning tree. All edges that not in the >> MST >> >>> will be instrumented. In the profile use compilation, the counters are >> >>> populated (from the leaf of the spanning tree) to all the edges. Knuth >> >>> proved this method inserts the minimum number of instrumentations. >> MST based >> >>> method only guarantees the number static instrumentation are >> minimized, not >> >>> the dynamic instance of instrumentation. To reduce the number of >> dynamic >> >>> instrumentation, edges of potentially high counts will be put into >> MST first >> >>> so that they will have less chance to be instrumented. >> >>> >> >>> >> >>> 3. Experimental Results >> >>> >> >>> 3.1 Measurement of the efficiency of instrumentation >> >>> Other than the runtime of the instrumented binaries, a more direct >> >>> measurement of the instrumentation overhead is the the sum of the raw >> >>> profile count values. Note that regardless what kind of >> instrumentations are >> >>> used, the raw profile count should be able to reconstruct all the >> edge count >> >>> values for the whole program. All raw profile value are obtained via >> >>> incrementing the counter variable value by one. The sum of the raw >> profile >> >>> count value is roughly the dynamic instruction count of the >> instrumented >> >>> code. The lower of the value, the more efficient of the >> instrumentation. >> >>> >> >>> >> >>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and >> SPEC2K >> >>> eon >> >>> The performance speedup is computed by (FE_instrumentation_runtime / >> >>> ME_instrumentation_runtime - 1) >> >>> >> >>> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon >> from >> >>> SPEC2000. For C programs, except for one outlier 456.hmmer, there are >> small >> >>> ups and downs across different programs. Late instrumentation >> improves hmmer >> >>> a lot, but it is probably due to unrelated loop optimizations (90% of >> >>> runtime spent in one loop nest). >> >>> >> >>> For C++ programs, the performance impact of late instrumentation is >> very >> >>> large, which is as expected. The following table shows the result. >> For some >> >>> C++ program, the time speedup is huge. For example, in >> 483.xalancbmk, late >> >>> instrumentation speeds up performance by ~60%. 
Among all the SPEC C++ >> >>> programs, only 444.namd is an outlier -- it uses a lot of macros and >> is a >> >>> very C like program. >> >>> >> >>> Program Speedup >> >>> 471.omnetpp 16.03% >> >>> 473.astar 5.00% >> >>> 483.xalancbmk 58.57% >> >>> 444.namd -0.90% >> >>> 447.dealII 60.47% >> >>> 450.soplex 8.20% >> >>> 453.povray 11.34% >> >>> 252.eon 35.33% >> >>> ------------------------- >> >>> Geomean 21.01% >> >>> >> >>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs >> >>> We also collect some statistic of the profiles generated by FE based >> >>> instrumentation and late instrumentation, namely, the following >> information: >> >>> 1. the number of functions that being instrumented, >> >>> 2. the result profile file size, >> >>> 3. the sum of raw count values that was mentioned earlier -- we >> used >> >>> it to measure the efficiency of the instrumentation. >> >>> Next table shows the ratios of the each metrics by late >> instrumentation >> >>> for the C++ programs, with FE based instrumentation as the base : >> column (1) >> >>> shows the ratios of instrumented functions; column (2) shows the >> ratios of >> >>> the profile file size; column (3) shows the ratios of the sum of raw >> count >> >>> values. >> >>> >> >>> (1) (2) (3) >> >>> 471.omnetpp 85.36% 110.26% 46.52% >> >>> 473.astar 64.86% 72.72% 63.13% >> >>> 483.xalancbmk 51.83% 56.11% 35.77% >> >>> 444.namd 75.36% 82.82% 85.77% >> >>> 447.dealII 43.42% 46.46% 26.75% >> >>> 450.soplex 71.80% 87.54% 51.19% >> >>> 453.povray 78.68% 83.57% 64.37% >> >>> 252.eon 72.06% 91.22% 30.02% >> >>> ---------------------------------------- >> >>> Geomean 66.50% 76.36% 47.01% >> >>> >> >>> >> >>> For FE based instrumentation, profile count variables generated for >> the >> >>> dead functions will not be removed (like __llvm_prf_names, >> __llvm_prf_data, >> >>> and __llvm_prf_cnts) from the data/text segment, nor in the result >> profile. >> >>> There is a recent patch that removes these unused data for COMDAT >> functions, >> >>> but that patch won’t touch regular functions. This is the main reason >> for >> >>> the larger number of instrumented functions and larger profile file >> size for >> >>> the FE based instrumentation. The reduction of the sum of raw count >> values >> >>> is mainly due to the elimination of redundant profile updates enabled >> by the >> >>> pre-inlining. >> >>> >> >>> For C programs, we observe similar improvement in the profile size >> >>> (geomean ratio of 73.75%) and smaller improvement in the number of >> >>> instrumented functions (geomean ratio of 87.49%) and the sum of raw >> count >> >>> values (geomean of 82.76%). >> >>> >> >>> >> >>> 3.4 LLVM instrumentations runtime performance for Google internal >> C/C++ >> >>> benchmarks >> >>> >> >>> We also use Google internal benchmarks (mostly typical C++ >> applications) >> >>> to measure the relative performance b/w FE based instrumentation and >> late >> >>> instrumentation. The following table shows the speedup of late >> >>> instrumentation vs FE based instrumentation. Note that C++benchmark01 >> is a >> >>> very large multi-threaded C++ program. Late instrumentation sees 4x >> speedup. >> >>> Larger than 3x speedups are also seen in many other programs. 
>> >>> >> >>> C++_bencharmk01 416.98% >> >>> C++_bencharmk02 6.29% >> >>> C++_bencharmk03 22.39% >> >>> C++_bencharmk04 28.05% >> >>> C++_bencharmk05 2.00% >> >>> C++_bencharmk06 675.89% >> >>> C++_bencharmk07 359.19% >> >>> C++_bencharmk08 395.03% >> >>> C_bencharmk09 15.11% >> >>> C_bencharmk10 5.47% >> >>> C++_bencharmk11 5.73% >> >>> C++_bencharmk12 2.31% >> >>> C++_bencharmk13 87.73% >> >>> C++_bencharmk14 7.22% >> >>> C_bencharmk15 -0.51% >> >>> C++_bencharmk16 59.15% >> >>> C++_bencharmk17 358.82% >> >>> C++_bencharmk18 861.36% >> >>> C++_bencharmk19 29.62% >> >>> C++_bencharmk20 11.82% >> >>> C_bencharmk21 0.53% >> >>> C++_bencharmk22 43.10% >> >>> --------------------------- >> >>> Geomean 83.03% >> >>> >> >>> >> >>> 3.5 Statistics of LLVM profiles for Google internal benchmarks >> >>> >> >>> The following shows the profile statics using Google internal >> benchmarks. >> >>> (1) (2) (3) >> >>> C++_bencharmk01 36.84% 40.29% 2.32% >> >>> C++_bencharmk02 39.20% 40.49% 42.39% >> >>> C++_bencharmk03 39.37% 40.65% 23.24% >> >>> C++_bencharmk04 39.13% 40.68% 17.70% >> >>> C++_bencharmk05 36.58% 38.27% 51.08% >> >>> C++_bencharmk06 29.50% 27.87% 2.87% >> >>> C++_bencharmk07 29.50% 27.87% 1.73% >> >>> C++_bencharmk08 29.50% 27.87% 4.17% >> >>> C_bencharmk09 53.95% 68.00% 11.18% >> >>> C_bencharmk10 53.95% 68.00% 31.74% >> >>> C++_bencharmk11 36.40% 37.07% 46.12% >> >>> C++_bencharmk12 38.44% 41.90% 73.59% >> >>> C++_bencharmk13 39.28% 42.72% 29.56% >> >>> C++_bencharmk14 38.59% 42.20% 13.42% >> >>> C_bencharmk15 57.45% 48.50% 66.99% >> >>> C++_bencharmk16 36.86% 42.18% 16.53% >> >>> C++_bencharmk17 37.82% 39.77% 13.68% >> >>> C++_bencharmk18 37.82% 39.77% 7.96% >> >>> C++_bencharmk19 37.52% 40.46% 1.85% >> >>> C++_bencharmk20 32.37% 30.44% 19.69% >> >>> C_bencharmk21 37.63% 40.42% 88.81% >> >>> C++_bencharmk22 36.28% 36.92% 21.62% >> >>> -------------------------------------------------- >> >>> Geomean 38.22% 39.96% 15.58% >> >>> >> >>> >> >>> 4. Implementation Details: >> >>> >> >>> We need to add new option(s) for the alternative PGO instrumentation >> pass >> >>> in the middle end. It can in one of the following forms: >> >>> >> >>> (1) Complete new options that are on par with current PGO options: >> >>> -fprofile-late-instr-generate[=<profile_file>]? for PGO >> Instrumentation, and >> >>> -fprofile-late-instr-use[=<profile_file>]? for PGO USE. >> >>> (2) Or, late instrumentation can be turned on with an additional >> >>> option -fprofile-instr-late with current PGO options. I. e. >> >>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for >> PGO >> >>> instrumentation, and -fprofile-instr-late >> >>> -fprofile-instr-use[=<profile_file>]? for PGO use. >> >>> (3) Alternatively to (2), only keep -fprofile-instr-late option in >> PGO >> >>> instrumentation. Adding a magic tag in profile so that FE based >> profile and >> >>> late instrumented profile can be automatically detected by profile >> loader In >> >>> PGO use compilation. This requires a slight profile format change. >> >>> >> >>> In our prototype implementation, two new passes are added in the >> >>> beginning of PassManagerBuilder::populateModulePassManager(), namely >> >>> PreProfileInlinerPass and PGOLateInstrumentationPass. >> >>> >> >>> >> >>> 4.1 Pre-inline pass: >> >>> >> >>> It is controlled by back-end option "-preinline" and >> >>> "-disable-preinline". 
If the user specifies any llvm option of >> >>> "-fprofile-late-instr-{generate|use}, option "-mllvm -preinline" will >> be >> >>> automatically inserted in the driver.. To disable the pre-inliner >> when late >> >>> instrumentation is enabled, use option "-mllvm -disable-preinline". >> >>> >> >>> For now, only minimum tuning is done for the pre-inliner, which simply >> >>> adjusts the inline threshold: If -Oz is specified, the threshold is >> set to >> >>> 25. Otherwise, it is 75. >> >>> >> >>> The following clean up passes are added to PassManager, right after >> the >> >>> PreProfileInline pass: >> >>> createEarlyCSEPass() >> >>> createJumpThreadingPass() >> >>> createCorrelatedValuePropagationPass() >> >>> createCFGSimplificationPass() >> >>> createInstructionCombiningPass() >> >>> createGVNPass(DisableGVNLoadPRE) >> >>> createPeepholePASS() >> >>> Some of them might not be necessary. >> >>> >> >>> 4.2 Late Instrumentation Pass: >> >>> The late instrumentation is right after the pre-inline pass and it's >> >>> cleanup passes. It is controlled by opt option "-pgo-late-instr-gen" >> and >> >>> "-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver >> will >> >>> provide the profile name. >> >>> For "-pgo-late-instr-gen", a pass that calls >> createInstrProfilingPass() >> >>> is also added to PassManager to lower the instrumentation intrinsics >> >>> >> >>> PGOLateInstrumeatnion is a module pass that applies the >> instrumentation >> >>> to each function by class PGOLateInstrumentationFunc. For each >> function, >> >>> perform the following steps: >> >>> 1. First collect all the CFG edges. Assign an estimated weight to >> each >> >>> edge. Critical edges and back-edges are assigned to high value of >> weights. >> >>> One fake node and a few fake edges (from the fake node to the entry >> node, >> >>> and from all the exit nodes to the fake node) are also added to the >> >>> worklist. >> >>> 2. Construct the MST. The edges with the higher weight will be put >> to >> >>> MST first, unless it forms a cycle. >> >>> 3. Traverse the CFG and compute the CFG hash using CRC32 of the >> index >> >>> of each BB. >> >>> The above three steps are the same for profile-generate and >> profile-use >> >>> compilation. >> >>> >> >>> In the next step, for profile-generation compilation, all the edges >> that >> >>> not in the MST are instrumented. If this is a critical edge, split >> the edge >> >>> first. The actual instrumentation is to generate >> >>> Intrinsic::instrprof_increment() in the instrumented BB. This >> intrinsic will >> >>> be lowed by pass createInstrProfilingPass(). >> >>> >> >>> In the next step, for profile-generation compilation, all the edges >> that >> >>> not in the MST are instrumented. If this is a critical edge, split >> the edge >> >>> first. The actual instrumentation is to generate >> >>> Intrinsic::instrprof_increment() in the instrumented BB. This >> intrinsic will >> >>> be lowed by pass createInstrProfilingPass(). >> >>> >> >>> For -fprofile-use compilation, first read in the counters and the CFG >> >>> hash from the profile file. If the CFG hash matches, populate the >> counters >> >>> to all the edges in reverse topological order of the MST. Once having >> all >> >>> the edge counts, set the branch weights metadata for the IR having >> multiple >> >>> branches. Also apply the cold/hot function attributes based on >> function >> >>> level counts. 
>> >>>
>> >>> 4.3 Profile Format:
>> >>>
>> >>> The late instrumentation profile is mostly the same as the one from
>> >>> front-end instrumentation. The differences are:
>> >>> * Function checksums are different.
>> >>> * Function entry counts are no longer available.
>> >>> For the llvm-profdata utility, the option -lateinstr needs to be used to
>> >>> differentiate FE based and late instrumentation profiles, unless a magic
>> >>> tag is added to the profile.
>> >>>
>> >>> 5. References:
>> >>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points
>> >>> for program frequency counts. BIT Numerical Mathematics, 1973, Volume 13,
>> >>> Issue 3, pp. 313-322.
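As a usage illustration of option form (2) from section 4 and the -lateinstr
flag from 4.3: none of these spellings exist in clang or llvm-profdata today,
they are only the forms proposed in this RFC, and the file names are made up.
A build cycle might look like:

  # instrumented (late/ME) build and training run
  clang++ -O2 -fprofile-instr-late -fprofile-instr-generate=foo.profraw foo.cc -o foo.instr
  ./foo.instr < training.input
  # index the raw profile, telling llvm-profdata it came from late instrumentation
  llvm-profdata merge -lateinstr -output=foo.profdata foo.profraw
  # PGO-optimized build using the late-instrumentation profile
  clang++ -O2 -fprofile-instr-late -fprofile-instr-use=foo.profdata foo.cc -o foo.opt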
Sean Silva via llvm-dev
2015-Aug-11 01:13 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
On Mon, Aug 10, 2015 at 3:34 PM, Rong Xu <xur at google.com> wrote:> > > On Sun, Aug 9, 2015 at 6:23 PM, Sean Silva <chisophugis at gmail.com> wrote: > >> >> >> On Sat, Aug 8, 2015 at 6:31 AM, Xinliang David Li <davidxl at google.com> >> wrote: >> >>> On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> >>> wrote: >>> > Accidentally sent to uiuc server. >>> > >>> > >>> > On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> >>> wrote: >>> >> >>> >> Can you compare your results with another approach: simply do not >>> >> instrument the top 1% hottest functions (by function entry count)? If >>> this >>> >> simple approach provides most of the benefits (my measurements on one >>> >> codebase I tested show that it would eliminate over 97% of total >>> function >>> >> counts), we may be able to use a simpler approach.AFor large C++ >>> programs, the >>> >>> For static compiler, this is not possible. It also seem to defeat the >>> purpose of PGO -- hottest functions are those which need profile >>> guidance the most. >>> >> >> In the program I looked at the top 1% were just trivial getters and >> constructors or similar. We should already be "getting these right". >> >> Stuff like: >> >> class Foo { >> .... >> Foo(int bar) : m_bar(bar) {} >> int getBar() { return m_bar; } >> ... >> }; >> >> Are the results different for your codebases? Have you tried something >> like simply not instrumenting the hottest 1% or 0.5% of functions? (maybe >> restrict the instrumentation skipping to functions of just a single BB with >> less than, say, 10 instructions). >> >> Rong's approach is quite sophisticated; I'm just interested on getting a >> sanity check against a "naive" approach to see how much the sophisticated >> approach is buying us. >> > > I've seen plenty of this kind of code patterns in the benchmarks. I did > not measure if they are the top 1% of hottest functions, but I'm not > surprised to see so. > Your suggesting of not instrumenting this small single BB function is > interesting. I'm pretty sure it will cut a lot of the overhead. > But will not destroy the integrity of the profile, (for example, it will > harder to be used in the coverage test). > > I'm curious why you think my approach is sophisticated. >It's a relative term. Compared to a naive approach like simply not instrumenting the hottest 1% of function if they are single-BB, it is "sophisticated" (both in terms of lines of code to implement and in the amount of thought that needs to be put into understanding it). My interest is putting your instrumented-binary speedups in perspective. Currently, you are comparing your approach with the existing approach which clearly leaves a lot of low-hanging fruit -- it is not surprising that your approach can do better. To put in perspective the improvements of your approach, I think it would be useful to compare it to a "naive" approach for minimizing the instrumentation overhead. There are two outcomes which I think would be interesting and make me reconsider my current (mostly positive) perspective on your approach: - You find that your approach is not a noticeable improvement over the "naive" approach. - You find that your approach is worse than the "naive" approach in a non-trivial way.> It's pretty simple in high level: do a preinline pass to inline all the > trivial inlines and then do the instrumentation. This way the > instrumentation binary is faster and result profile is smaller, and > complete. 
>>>> The biggest thing I notice about this proposal is that although the focus
>>>> of the proposal is on reducing profiling overhead, it requires using a
>>>> different pass pipeline during *both* the "instr-generate" compilation and
>>>> the "instr-use" compilation. Therefore it is more than simply a reduction
>>>> in profiling overhead -- it can also change performance even ignoring
>>>> adding the profiling information in the IR. I think it would be
>>>> interesting to test the modified pass pipeline even in the absence of
>>>> profile information to better understand its effects in isolation (for
>>>> example, maybe it would be good to add these passes during -O3 regardless
>>>> of whether we are doing PGO).
>>>
>>> The pipeline change is very PGO specific. I expect it has very little
>>> impact on regular compilations:
>>> 1) LLVM's bottom up inliner is already iterative.
>>> 2) The performance impact (on instrumented build) can be 4x large --
>>> which is unlikely for any nonPGO pipeline change.
>>
>> With respect to adding extra passes, I'm actually more concerned about the
>> non-instrumented build, for which Rong did not show any data. For example,
>> will users find their program is X% faster than no PGO with ("ME") PGO, but
>> really (X/2)% of that is due simply to the extra passes, and not any
>> profile guidance? We should then prefer to use the extra passes during
>> regular -O3 builds. Conversely, if they find that their program is X%
>> faster with ("ME") PGO, but the extra passes are making the program (X/2)%
>> slower, then the users could be getting (3X/2)% faster instead. I am only
>> concerned about having two variables change simultaneously; I think that
>> instrumenting after some amount of cleanup has been done makes a lot of
>> sense.
>
> I agree with David that LLVM's bottom-up, iterative inliner should make the
> pre-inline pass have little performance impact in a regular build, at least
> in theory.
> The performance swing caused by the pre-inline pass is likely due to inliner
> implementation/heuristic issues, and should be handled in the inliner.
>
> Here is the performance data for enabling the pre-inliner in a regular
> compilation (i.e. O2).
>
> SPEC2006 and SPEC2000 C++ programs (train input)
>
> program O2_+_preinline_time / O2_time - 1
> 471.omnetpp 0.40%
> 473.astar 11.62%
> 483.xalancbmk -3.58%
> 444.namd 0.00%
> 447.dealII 1.02%
> 450.soplex 0.83%
> 453.povray -1.81%
> 252.eon 0.33%
>
> Note this table shows run time, so 473.astar runs 11.62% longer with the
> preinline pass.
>
> For Google internal benchmarks, we use the performance number
> O2_+_pre_inline speedup vs O2
> C++benchmark01 -6.88%
> C++benchmark02 +4.42%
> C++benchmark03 +0.74%
> C++benchmark04 +0.56%
> C++benchmark05 -1.02%
> C++benchmark06 -14.98%
> C++benchmark07 +4.21%
> C++benchmark08 -1.00%
> C_benchmark09 +0.92%
> C_benchmark10 -0.43%
> C++benchmark11 -12.23%
> C++benchmark12 +0.65%
> C++benchmark13 -0.96%
> C++benchmark14 -12.89%
> C_benchmark15 -0.12%
> C++benchmark16 -0.07%
> C++benchmark17 -0.18%
> C++benchmark18 +0.24%
> C++benchmark19 -1.86%
> C++benchmark20 +3.63%
> C_benchmark21 +0.40%
> C++benchmark22 -0.08%
> --------------------------------
> geometric mean -1.82%
>
> For many benchmarks, we get a negative speedup (i.e., a slowdown).

These swings seem quite large in some cases, contrary to what was
theoretically expected. Do you know which part of the inliner is causing the
problem?
Some of my users would be happy with these speedups even without PGO; at the
same time, for some of my users this could negate much of the advantage of
using PGO. Is there a way for your approach to simply do the instrumentation
after a subset of the existing pass pipeline has been run?

Also, in your OP you discussed the advantages of the context-sensitive
profile enabled by your approach. Do you have performance numbers to
quantify how much this improves things?

-- Sean Silva
Xinliang David Li via llvm-dev
2015-Aug-11 19:30 UTC
[llvm-dev] RFC: PGO Late instrumentation for LLVM
>> The pipeline change is very PGO specific. I expect it has very little
>> impact on regular compilations:
>> 1) LLVM's bottom up inliner is already iterative.
>> 2) The performance impact (on instrumented build) can be 4x large --
>> which is unlikely for any nonPGO pipeline change.
>
> With respect to adding extra passes, I'm actually more concerned about the
> non-instrumented build, for which Rong did not show any data. For example,
> will users find their program is X% faster than no PGO with ("ME") PGO, but
> really (X/2)% of that is due simply to the extra passes, and not any profile
> guidance? We should then prefer to use the extra passes during regular -O3
> builds.

My reply above actually predicts that the pipeline change is unlikely to have
a large positive performance impact on regular non-PGO builds.

It would be useful to see that effect (extra passes) -- but instead of using
Rong's untuned new passes, it can be done by forcing at least 2 iterations of
CallGraphSCCPass (currently it iterates more than once only when
devirtualization happens).

> Conversely, if they find that their program is X% faster with ("ME") PGO,
> but the extra passes are making the program (X/2)% slower, then the users
> could be getting (3X/2)% faster instead.

Pre-cleanup is required for ME-PGO -- but the simple math you mentioned does
not apply. In your case, the (X/2)% slowdown is required to enable the X%
improvement over the original baseline (without extra passes) -- otherwise
only 0% can be materialized (analogy: retraction before punching).

> I am only concerned about having two variables change simultaneously; I
> think that instrumenting after some amount of cleanup has been done makes a
> lot of sense.
>
> Could Rong's proposal be made to work within the existing pipeline, but
> doing the instrumentation after a subset of the existing pass pipeline has
> been run?

Conceptually it is doable using the iterative pass manager (the support
already exists). With PGO, the inliner needs to be greatly tuned down to
enable only small function inlining, though.

David
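To spell out the arithmetic being debated, with purely hypothetical numbers
(not measurements from Rong's runs): suppose plain -O2 takes 100s, -O2 plus
the pre-inline/cleanup passes but without profile feedback takes 102s, and
the full ME PGO build takes 96s. Sean's reading is that the profile guidance
itself is "worth" roughly 6% and the user is losing about 2% of it to the
pipeline change; David's point is that the 96s result only exists on top of
the changed pipeline, so the 4% win over the original 100s baseline is all
that can actually be materialized -- unless the extra passes turn out to be
profitable for plain -O2 as well, in which case they should simply be enabled
there too.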
It has >> >>> significant impact on instrumentation runtime performance. In >> >>> addition, it >> >>> tends to produce larger binary and profile data file. Our internal >> >>> C++ >> >>> benchmarks shows that FE based instrumentation degrades the >> >>> performance >> >>> (compared to non-instrumented version) by 58.3%, and in some extreme >> >>> cases, >> >>> the application speed/throughput decreases by 95% (21x runtime >> >>> slowdown). >> >>> >> >>> Running the instrumented binary too slow is not desirable in PGO for >> >>> the >> >>> following reasons: >> >>> * This slows down already lengthy build time. In the worst case, >> >>> the >> >>> instrumented binary is so slow that it fails to run a representative >> >>> workload, because slow execution can leads to more time-outs in many >> >>> server >> >>> programs. Other typical issues include: text size too big, failure to >> >>> link >> >>> instrumented binaries, memory usage exceeding system limits. >> >>> * Slow runtime affects the program behavior. Real applications >> >>> sometimes monitor the program runtime and have different execution >> >>> path when >> >>> the program run too slow. This would defeat the underlying assumption >> >>> of PGO >> >>> and make it less effective. >> >>> >> >>> This work proposes an option to turn on middle-end based >> >>> instrumentation >> >>> (new) that aims to speed up instrumentation runtime. The new >> >>> instrumentation >> >>> is referred to as ME based instrumentation in this document. Our >> >>> experimental results show that ME instrumentation can speed up the >> >>> instrumentation by 80% on average for typical C++ programs. Here are >> >>> the two >> >>> main design objectives: >> >>> * Co-existing with FE Instrumenter: We do not propose to replace >> >>> the >> >>> FE based instrumentation because FE based instrumentation has its >> >>> advantages >> >>> and applications. User can choose which phase to do instrumentation >> >>> via >> >>> command line options. >> >>> * PGO Runtime Support Sharing: The ME instrumenter will completely >> >>> re-use the existing PGO’s runtime support. >> >>> >> >>> 1. FE Based Instrumentation Runtime Overhead Analysis >> >>> >> >>> Instrumented binaries are expected to run slower due to >> >>> instrumentation >> >>> code inserted. With FE based instrumentation, the overhead is >> >>> especially >> >>> high and runtime slowdown can be unacceptable in many cases. Further >> >>> analysis shows that there are 3 important factors contributing to FE >> >>> instrumentation slowdown : >> >>> * [Main] Redundant counter updates of inlined functions. C++ >> >>> programs >> >>> can introduce large abstraction penalties by using lots of small >> >>> inline >> >>> functions (assignment operators, getters, setters, ctors/dtors etc). >> >>> Overhead of instrumenting those small functions can be very large, >> >>> making >> >>> training runs too slow and in some cases to usable; >> >>> * Non-optimal placement of the count updates; >> >>> * A third factor is related value profiling (to be turned on in the >> >>> future). Small and hot callee functions taking function pointer >> >>> (callbacks) >> >>> can incur overhead due to indirect call target profiling. >> >>> >> >>> >> >>> 1.1 Redundant Counter Update >> >>> >> >>> If checking the assembly of the instrumented binary generated by >> >>> current >> >>> LLVM implementation, we can find many sequence of consecutive 'incq' >> >>> instructions that updating difference counters in the same basic >> >>> block. 
As >> >>> an example that extracted from real binary: >> >>> ... >> >>> incq 0xa91d80(%rip) # 14df4b8 >> >>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8> >> >>> incq 0xa79011(%rip) # 14c6750 >> >>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm> >> >>> incq 0xa79442(%rip) # 14c6b88 >> >>> >> >>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv> >> >>> incq 0x9c288b(%rip) # 140ffd8 >> >>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl> >> >>> ... >> >>> >> >>> From profile use point of view, many of these counter update are >> >>> redundant. Considering the following example: >> >>> void bar(){ >> >>> sum++; >> >>> } >> >>> void foo() { >> >>> bar(); >> >>> } >> >>> >> >>> FE based instrumentation needs to insert counter update for the only >> >>> BB >> >>> of the bar(). >> >>> bar: # @bar >> >>> # BB#0: # %entry >> >>> incq .L__llvm_profile_counters_bar(%rip) >> >>> incl sum(%rip) >> >>> retq >> >>> >> >>> It also need to insert the update the BB in function foo(). After >> >>> inlining bar to foo(), the code is: >> >>> foo: # @foo >> >>> # BB#0: # %entry >> >>> incq .L__llvm_profile_counters_foo(%rip) >> >>> incq .L__llvm_profile_counters_bar(%rip) >> >>> incl sum(%rip) >> >>> retq >> >>> >> >>> If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) >> >>> is >> >>> redundant -- the counter won't help downstream optimizations. On the >> >>> other >> >>> hand, if bar() is a large function and may not be suitable to be >> >>> inlined for >> >>> all callsites, this counter updated is necessary in order to produce >> >>> more >> >>> accurate profile data for the out-of-line instance of the callee. >> >>> >> >>> If foo() is a hot function, the overhead of updating two counters can >> >>> be >> >>> significant. This is especially bad for C++ program where there are >> >>> tons of >> >>> small inline functions. >> >>> >> >>> There is another missing opportunity in FE based instrumentation. The >> >>> small functions’ control flow can usually be simplified when they are >> >>> inlined into caller contexts. Once the control flow is simplified, >> >>> many >> >>> counter updates can therefore be eliminated. This is only possible for >> >>> a >> >>> middle end based late instrumenter. Defining a custom clean-up pass to >> >>> remove redundant counter update is unrealistic and cannot be done in a >> >>> sane >> >>> way without destroying the profile integrity of neither the >> >>> out-of-line nor >> >>> inline instances of the callee. >> >>> >> >>> A much simpler and cleaner solution is to do a pre-inline pass to >> >>> inline >> >>> all the trivial inlines before instrumentation. In addition to >> >>> removing the >> >>> unnecessary count updates for the inline instances, another advantage >> >>> of >> >>> pre-inline is to provide context sensitive profile for these small >> >>> inlined >> >>> functions. This context senstive profile can further improve the PGO >> >>> based >> >>> optimizations. Here is a contrived example: >> >>> void bar (int n) { >> >>> if (n&1) >> >>> do_sth1(); >> >>> else >> >>> do_sth2(); >> >>> } >> >>> >> >>> void caller() { >> >>> int s = 1; >> >>> for (; s<100; s+=2) >> >>> bar(s); >> >>> >> >>> for (s = 102; s< 200; s+=2) >> >>> bar(s); >> >>> } >> >>> >> >>> The direction of the branch inside bar will be totally opposite >> >>> between >> >>> two different callsites in ‘caller’. 
Without pre-inlining, the branch >> >>> probability will be 50-50 which will be useless for later >> >>> optimizations. >> >>> With pre-inlining, the profile will have the perfect branch count for >> >>> each >> >>> callsite. The positive performance impact of context sensitive >> >>> profiling due >> >>> to pre-inlining has been observed in real world large C++ programs. >> >>> Supporting context sensitive profiling is another way to solve this, >> >>> but it >> >>> will introduce large additional runtime/memory overhead. >> >>> >> >>> >> >>> 1.2 Non-optimal placement of count update >> >>> >> >>> Another much smaller showdown factor is the placement of the counter >> >>> updates. Current front-end based instrumentation applies the >> >>> instrumentation >> >>> to each front-end lexical construct. It also minimizes the number of >> >>> static >> >>> instrumentations. Note that it always instruments the entry count of >> >>> the >> >>> CFG. This may result in higher dynamic instruction counts. For >> >>> example, >> >>> BB0 >> >>> | 100 >> >>> BB1 >> >>> 90 / \ 10 >> >>> BB2 BB3 >> >>> 90 \ / 10 >> >>> BB4 >> >>> Like the the above example, FE based instrumentation will always >> >>> insert >> >>> count update in BB0. The dynamic instrumentation count will be either >> >>> 110 >> >>> (Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A >> >>> better >> >>> instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the >> >>> dynamic >> >>> instrumentation count is 100. >> >>> >> >>> Our experimental shows that the optimal placement based on edge >> >>> hotness >> >>> can improve instrumented code performance by about 10%. While it’s >> >>> hard to >> >>> find the optimal placement of count update, compiler heuristics can >> >>> be used >> >>> the get the better placement. These heuristics can be based on static >> >>> profile prediction or user annotations (like __buildin_expect) to >> >>> estimate >> >>> the relative edge hotness and put instrumentations on the less hot >> >>> edges. >> >>> The initial late instrumentation has not fully implemented this >> >>> placement >> >>> strategy yet. With that implemented, we expect even better results >> >>> than >> >>> what is reported here. For real world programs, another major source >> >>> of the >> >>> slowdown is the data racing and false sharing of the counter update >> >>> for >> >>> highly threaded programs. Pre-inlining can alleviate this problem as >> >>> the >> >>> counters in the inline instances are not longer shared. But the >> >>> complete >> >>> solution to the data racing issue is orthogonal to the problem we try >> >>> to >> >>> solve here. >> >>> >> >>> >> >>> 2. High Level Design >> >>> >> >>> We propose to perform a pre-profile inline pass before the PGO >> >>> instrumentation pass. Since the instrumentation pass is after inine, >> >>> it has >> >>> to be done in the middle-end. >> >>> >> >>> (1) The pre-inline pass >> >>> We will invoke a pre-inline pass before the instrumentation. When PGO >> >>> is >> >>> on, the inlining will be split into two passes: >> >>> * A pre-inline pass that is scheduled before the profile >> >>> instrumentation/annotation >> >>> * A post-inline pass which is the regular inline pass after >> >>> instrumentation/annotation >> >>> By design, all beneficial callsites without requiring profile data >> >>> should >> >>> be inlined in the pre-inline pass. It includes all callsites that will >> >>> shrink code size after inlining. 
All the remaining callsites will be >> >>> left to >> >>> the regular inline pass when profile data is available. >> >>> >> >>> After pre-inline, a CFG based profile instrumentation/annotation will >> >>> be >> >>> done. A minimum weight spanning tree (MST) in CFG is first computed, >> >>> then >> >>> only the edges not in the MST will be instrumented. The counter update >> >>> instructions are placed in the basic blocks. >> >>> >> >>> (2) Minimum Spanning Tree (MST) based instrumentation >> >>> A native way of instrumentation is to insert a count update for every >> >>> edge in CFG which will result in too many redundant updates that >> >>> makes the >> >>> runtime very slow. Knuth [1] proposed a minimum spanning tree based >> >>> method: >> >>> given a CFG, first compute a spanning tree. All edges that not in the >> >>> MST >> >>> will be instrumented. In the profile use compilation, the counters are >> >>> populated (from the leaf of the spanning tree) to all the edges. Knuth >> >>> proved this method inserts the minimum number of instrumentations. MST >> >>> based >> >>> method only guarantees the number static instrumentation are >> >>> minimized, not >> >>> the dynamic instance of instrumentation. To reduce the number of >> >>> dynamic >> >>> instrumentation, edges of potentially high counts will be put into MST >> >>> first >> >>> so that they will have less chance to be instrumented. >> >>> >> >>> >> >>> 3. Experimental Results >> >>> >> >>> 3.1 Measurement of the efficiency of instrumentation >> >>> Other than the runtime of the instrumented binaries, a more direct >> >>> measurement of the instrumentation overhead is the the sum of the raw >> >>> profile count values. Note that regardless what kind of >> >>> instrumentations are >> >>> used, the raw profile count should be able to reconstruct all the edge >> >>> count >> >>> values for the whole program. All raw profile value are obtained via >> >>> incrementing the counter variable value by one. The sum of the raw >> >>> profile >> >>> count value is roughly the dynamic instruction count of the >> >>> instrumented >> >>> code. The lower of the value, the more efficient of the >> >>> instrumentation. >> >>> >> >>> >> >>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and >> >>> SPEC2K >> >>> eon >> >>> The performance speedup is computed by (FE_instrumentation_runtime / >> >>> ME_instrumentation_runtime - 1) >> >>> >> >>> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon >> >>> from >> >>> SPEC2000. For C programs, except for one outlier 456.hmmer, there are >> >>> small >> >>> ups and downs across different programs. Late instrumentation improves >> >>> hmmer >> >>> a lot, but it is probably due to unrelated loop optimizations (90% of >> >>> runtime spent in one loop nest). >> >>> >> >>> For C++ programs, the performance impact of late instrumentation is >> >>> very >> >>> large, which is as expected. The following table shows the result. >> >>> For some >> >>> C++ program, the time speedup is huge. For example, in 483.xalancbmk, >> >>> late >> >>> instrumentation speeds up performance by ~60%. Among all the SPEC C++ >> >>> programs, only 444.namd is an outlier -- it uses a lot of macros and >> >>> is a >> >>> very C like program. 
>> >>> >> >>> Program Speedup >> >>> 471.omnetpp 16.03% >> >>> 473.astar 5.00% >> >>> 483.xalancbmk 58.57% >> >>> 444.namd -0.90% >> >>> 447.dealII 60.47% >> >>> 450.soplex 8.20% >> >>> 453.povray 11.34% >> >>> 252.eon 35.33% >> >>> ------------------------- >> >>> Geomean 21.01% >> >>> >> >>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs >> >>> We also collect some statistic of the profiles generated by FE based >> >>> instrumentation and late instrumentation, namely, the following >> >>> information: >> >>> 1. the number of functions that being instrumented, >> >>> 2. the result profile file size, >> >>> 3. the sum of raw count values that was mentioned earlier -- we >> >>> used >> >>> it to measure the efficiency of the instrumentation. >> >>> Next table shows the ratios of the each metrics by late >> >>> instrumentation >> >>> for the C++ programs, with FE based instrumentation as the base : >> >>> column (1) >> >>> shows the ratios of instrumented functions; column (2) shows the >> >>> ratios of >> >>> the profile file size; column (3) shows the ratios of the sum of raw >> >>> count >> >>> values. >> >>> >> >>> (1) (2) (3) >> >>> 471.omnetpp 85.36% 110.26% 46.52% >> >>> 473.astar 64.86% 72.72% 63.13% >> >>> 483.xalancbmk 51.83% 56.11% 35.77% >> >>> 444.namd 75.36% 82.82% 85.77% >> >>> 447.dealII 43.42% 46.46% 26.75% >> >>> 450.soplex 71.80% 87.54% 51.19% >> >>> 453.povray 78.68% 83.57% 64.37% >> >>> 252.eon 72.06% 91.22% 30.02% >> >>> ---------------------------------------- >> >>> Geomean 66.50% 76.36% 47.01% >> >>> >> >>> >> >>> For FE based instrumentation, profile count variables generated for >> >>> the >> >>> dead functions will not be removed (like __llvm_prf_names, >> >>> __llvm_prf_data, >> >>> and __llvm_prf_cnts) from the data/text segment, nor in the result >> >>> profile. >> >>> There is a recent patch that removes these unused data for COMDAT >> >>> functions, >> >>> but that patch won’t touch regular functions. This is the main reason >> >>> for >> >>> the larger number of instrumented functions and larger profile file >> >>> size for >> >>> the FE based instrumentation. The reduction of the sum of raw count >> >>> values >> >>> is mainly due to the elimination of redundant profile updates enabled >> >>> by the >> >>> pre-inlining. >> >>> >> >>> For C programs, we observe similar improvement in the profile size >> >>> (geomean ratio of 73.75%) and smaller improvement in the number of >> >>> instrumented functions (geomean ratio of 87.49%) and the sum of raw >> >>> count >> >>> values (geomean of 82.76%). >> >>> >> >>> >> >>> 3.4 LLVM instrumentations runtime performance for Google internal >> >>> C/C++ >> >>> benchmarks >> >>> >> >>> We also use Google internal benchmarks (mostly typical C++ >> >>> applications) >> >>> to measure the relative performance b/w FE based instrumentation and >> >>> late >> >>> instrumentation. The following table shows the speedup of late >> >>> instrumentation vs FE based instrumentation. Note that C++benchmark01 >> >>> is a >> >>> very large multi-threaded C++ program. Late instrumentation sees 4x >> >>> speedup. >> >>> Larger than 3x speedups are also seen in many other programs. 
>> >>> >> >>> C++_bencharmk01 416.98% >> >>> C++_bencharmk02 6.29% >> >>> C++_bencharmk03 22.39% >> >>> C++_bencharmk04 28.05% >> >>> C++_bencharmk05 2.00% >> >>> C++_bencharmk06 675.89% >> >>> C++_bencharmk07 359.19% >> >>> C++_bencharmk08 395.03% >> >>> C_bencharmk09 15.11% >> >>> C_bencharmk10 5.47% >> >>> C++_bencharmk11 5.73% >> >>> C++_bencharmk12 2.31% >> >>> C++_bencharmk13 87.73% >> >>> C++_bencharmk14 7.22% >> >>> C_bencharmk15 -0.51% >> >>> C++_bencharmk16 59.15% >> >>> C++_bencharmk17 358.82% >> >>> C++_bencharmk18 861.36% >> >>> C++_bencharmk19 29.62% >> >>> C++_bencharmk20 11.82% >> >>> C_bencharmk21 0.53% >> >>> C++_bencharmk22 43.10% >> >>> --------------------------- >> >>> Geomean 83.03% >> >>> >> >>> >> >>> 3.5 Statistics of LLVM profiles for Google internal benchmarks >> >>> >> >>> The following shows the profile statics using Google internal >> >>> benchmarks. >> >>> (1) (2) (3) >> >>> C++_bencharmk01 36.84% 40.29% 2.32% >> >>> C++_bencharmk02 39.20% 40.49% 42.39% >> >>> C++_bencharmk03 39.37% 40.65% 23.24% >> >>> C++_bencharmk04 39.13% 40.68% 17.70% >> >>> C++_bencharmk05 36.58% 38.27% 51.08% >> >>> C++_bencharmk06 29.50% 27.87% 2.87% >> >>> C++_bencharmk07 29.50% 27.87% 1.73% >> >>> C++_bencharmk08 29.50% 27.87% 4.17% >> >>> C_bencharmk09 53.95% 68.00% 11.18% >> >>> C_bencharmk10 53.95% 68.00% 31.74% >> >>> C++_bencharmk11 36.40% 37.07% 46.12% >> >>> C++_bencharmk12 38.44% 41.90% 73.59% >> >>> C++_bencharmk13 39.28% 42.72% 29.56% >> >>> C++_bencharmk14 38.59% 42.20% 13.42% >> >>> C_bencharmk15 57.45% 48.50% 66.99% >> >>> C++_bencharmk16 36.86% 42.18% 16.53% >> >>> C++_bencharmk17 37.82% 39.77% 13.68% >> >>> C++_bencharmk18 37.82% 39.77% 7.96% >> >>> C++_bencharmk19 37.52% 40.46% 1.85% >> >>> C++_bencharmk20 32.37% 30.44% 19.69% >> >>> C_bencharmk21 37.63% 40.42% 88.81% >> >>> C++_bencharmk22 36.28% 36.92% 21.62% >> >>> -------------------------------------------------- >> >>> Geomean 38.22% 39.96% 15.58% >> >>> >> >>> >> >>> 4. Implementation Details: >> >>> >> >>> We need to add new option(s) for the alternative PGO instrumentation >> >>> pass >> >>> in the middle end. It can in one of the following forms: >> >>> >> >>> (1) Complete new options that are on par with current PGO options: >> >>> -fprofile-late-instr-generate[=<profile_file>]? for PGO >> >>> Instrumentation, and >> >>> -fprofile-late-instr-use[=<profile_file>]? for PGO USE. >> >>> (2) Or, late instrumentation can be turned on with an additional >> >>> option -fprofile-instr-late with current PGO options. I. e. >> >>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for >> >>> PGO >> >>> instrumentation, and -fprofile-instr-late >> >>> -fprofile-instr-use[=<profile_file>]? for PGO use. >> >>> (3) Alternatively to (2), only keep -fprofile-instr-late option in >> >>> PGO >> >>> instrumentation. Adding a magic tag in profile so that FE based >> >>> profile and >> >>> late instrumented profile can be automatically detected by profile >> >>> loader In >> >>> PGO use compilation. This requires a slight profile format change. >> >>> >> >>> In our prototype implementation, two new passes are added in the >> >>> beginning of PassManagerBuilder::populateModulePassManager(), namely >> >>> PreProfileInlinerPass and PGOLateInstrumentationPass. >> >>> >> >>> >> >>> 4.1 Pre-inline pass: >> >>> >> >>> It is controlled by back-end option "-preinline" and >> >>> "-disable-preinline". 
>> >>>
>> >>> 4.2 Late Instrumentation Pass:
>> >>>
>> >>> The late instrumentation runs right after the pre-inline pass and its
>> >>> cleanup passes. It is controlled by the opt options "-pgo-late-instr-gen"
>> >>> and "-pgo-late-instr-use". For the "-pgo-late-instr-use" option, the
>> >>> driver will provide the profile name. For "-pgo-late-instr-gen", a pass
>> >>> that calls createInstrProfilingPass() is also added to the PassManager
>> >>> to lower the instrumentation intrinsics.
>> >>>
>> >>> PGOLateInstrumentation is a module pass that applies the instrumentation
>> >>> to each function via the class PGOLateInstrumentationFunc. For each
>> >>> function, it performs the following steps:
>> >>>   1. Collect all the CFG edges and assign an estimated weight to each
>> >>>      edge. Critical edges and back edges are assigned high weights. One
>> >>>      fake node and a few fake edges (from the fake node to the entry
>> >>>      node, and from every exit node to the fake node) are also added to
>> >>>      the worklist.
>> >>>   2. Construct the MST. Edges with higher weights are added to the MST
>> >>>      first, unless adding an edge would form a cycle.
>> >>>   3. Traverse the CFG and compute the CFG hash using a CRC32 of the
>> >>>      index of each BB.
>> >>> The above three steps are the same for profile-generate and profile-use
>> >>> compilations.
>> >>>
>> >>> Next, for profile-generate compilation, all the edges that are not in
>> >>> the MST are instrumented; a critical edge is split first. The actual
>> >>> instrumentation generates an Intrinsic::instrprof_increment() call in
>> >>> the instrumented BB, and this intrinsic is later lowered by the pass
>> >>> created by createInstrProfilingPass().
>> >>>
>> >>> For -fprofile-use compilation, first read the counters and the CFG hash
>> >>> from the profile file. If the CFG hash matches, populate the counters to
>> >>> all the edges in reverse topological order of the MST. Once all the edge
>> >>> counts are available, set the branch weight metadata on the IR with
>> >>> multiple branches, and apply cold/hot function attributes based on the
>> >>> function-level counts.
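>> >>>
>> >>> To make steps 1-2 above concrete, here is a self-contained sketch of the
>> >>> edge selection (a Kruskal-style maximum-weight spanning tree built with
>> >>> a union-find). The names and data structures below are invented for the
>> >>> example and are not the actual classes in the pass:
>> >>>
>> >>>   #include <algorithm>
>> >>>   #include <cstdint>
>> >>>   #include <numeric>
>> >>>   #include <vector>
>> >>>
>> >>>   // Sketch: keep the heaviest edges in a spanning tree and instrument
>> >>>   // only the edges left out of the tree.
>> >>>   struct CFGEdge {
>> >>>     unsigned Src, Dst;  // BB indices; 0 is the fake entry/exit node
>> >>>     uint64_t Weight;    // estimate; critical/back edges get high values
>> >>>     bool InMST = false;
>> >>>   };
>> >>>
>> >>>   struct UnionFind {
>> >>>     std::vector<unsigned> Parent;
>> >>>     explicit UnionFind(unsigned N) : Parent(N) {
>> >>>       std::iota(Parent.begin(), Parent.end(), 0u);
>> >>>     }
>> >>>     unsigned find(unsigned X) {
>> >>>       if (Parent[X] != X)
>> >>>         Parent[X] = find(Parent[X]);  // path compression
>> >>>       return Parent[X];
>> >>>     }
>> >>>     bool unite(unsigned A, unsigned B) {  // false if it would form a cycle
>> >>>       A = find(A); B = find(B);
>> >>>       if (A == B) return false;
>> >>>       Parent[A] = B;
>> >>>       return true;
>> >>>     }
>> >>>   };
>> >>>
>> >>>   // Returns the edges that need a counter (the non-MST edges).
>> >>>   std::vector<CFGEdge *> selectInstrumentedEdges(std::vector<CFGEdge> &Edges,
>> >>>                                                  unsigned NumNodes) {
>> >>>     // Higher-weight edges enter the MST first, so hot edges (back edges,
>> >>>     // critical edges) tend to escape instrumentation.
>> >>>     std::stable_sort(Edges.begin(), Edges.end(),
>> >>>                      [](const CFGEdge &A, const CFGEdge &B) {
>> >>>                        return A.Weight > B.Weight;
>> >>>                      });
>> >>>     UnionFind UF(NumNodes);
>> >>>     std::vector<CFGEdge *> ToInstrument;
>> >>>     for (CFGEdge &E : Edges) {
>> >>>       if (UF.unite(E.Src, E.Dst))
>> >>>         E.InMST = true;              // in the tree: no counter needed
>> >>>       else
>> >>>         ToInstrument.push_back(&E);  // non-MST edge: gets a counter
>> >>>     }
>> >>>     return ToInstrument;
>> >>>   }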
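>> >>>
>> >>> The counter update itself, as a minimal sketch only (it mirrors how the
>> >>> existing FE instrumentation intrinsic is emitted; the insertion point,
>> >>> the assumed NamePtr name global, and the helper itself are simplified
>> >>> placeholders, not the prototype code):
>> >>>
>> >>>   #include "llvm/IR/IRBuilder.h"
>> >>>   #include "llvm/IR/Intrinsics.h"
>> >>>   #include "llvm/IR/Module.h"
>> >>>
>> >>>   using namespace llvm;
>> >>>
>> >>>   // Sketch: emit llvm.instrprof.increment(<name>, <cfg-hash>,
>> >>>   // <num-counters>, <index>) into the instrumented block; the pass from
>> >>>   // createInstrProfilingPass() later lowers it to a real counter update.
>> >>>   static void emitCounterIncrement(Module &M, BasicBlock &InstrBB,
>> >>>                                    Constant *NamePtr, uint64_t FuncHash,
>> >>>                                    uint32_t NumCounters, uint32_t Index) {
>> >>>     IRBuilder<> Builder(InstrBB.getTerminator());
>> >>>     Builder.CreateCall(
>> >>>         Intrinsic::getDeclaration(&M, Intrinsic::instrprof_increment),
>> >>>         {Builder.CreateBitCast(NamePtr, Builder.getInt8PtrTy()),
>> >>>          Builder.getInt64(FuncHash), Builder.getInt32(NumCounters),
>> >>>          Builder.getInt32(Index)});
>> >>>   }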
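>> >>>
>> >>> On the profile-use side, attaching the computed counts uses the standard
>> >>> !prof metadata mechanism. A minimal sketch (the surrounding pass, the
>> >>> edge-count propagation, and the scaling of 64-bit counts down to 32-bit
>> >>> weights are all omitted; the helper is illustrative only):
>> >>>
>> >>>   #include "llvm/IR/Instructions.h"
>> >>>   #include "llvm/IR/LLVMContext.h"
>> >>>   #include "llvm/IR/MDBuilder.h"
>> >>>   #include <cstdint>
>> >>>   #include <vector>
>> >>>
>> >>>   using namespace llvm;
>> >>>
>> >>>   // Sketch: set branch_weights metadata on a terminator with several
>> >>>   // successors once per-successor weights are known.
>> >>>   static void annotateBranchWeights(TerminatorInst *TI,
>> >>>                                     const std::vector<uint32_t> &Weights) {
>> >>>     MDBuilder MDB(TI->getContext());
>> >>>     TI->setMetadata(LLVMContext::MD_prof, MDB.createBranchWeights(Weights));
>> >>>   }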
>> >>>
>> >>> 4.3 Profile Format:
>> >>>
>> >>> The late instrumentation profile is mostly the same as the one from
>> >>> front-end instrumentation. The differences are:
>> >>>   * Function checksums are different.
>> >>>   * Function entry counts are no longer available.
>> >>> For the llvm-profdata utility, the option -lateinstr needs to be used to
>> >>> differentiate FE based and late instrumentation profiles, unless a magic
>> >>> tag is added to the profile.
>> >>>
>> >>>
>> >>> 5. References:
>> >>> [1] Donald E. Knuth and Francis R. Stevenson. Optimal measurement points
>> >>> for program frequency counts. BIT Numerical Mathematics, Volume 13,
>> >>> Issue 3, 1973, pp. 313-322.