Hi, we've implemented an indirect call promotion llvm pass. The design notes including examples are shown below. This pass complements the indirect call profile infrastructure http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html Your feedback and comments will be highly appreciated. Thanks, Ivan ============================================================================RFC: Indirect Call Promotion LLVM Pass Betul Buyukkurt and Ivan Baev 1. Introduction Indirect call promotion (ICP) replaces an indirect call instruction to a set of target addresses with a sequence of tests guarding direct calls to selected targets, plus a fall through branch containing the original indirect call. The ICP optimization is found to be the second most profitable (after inlining) profile-based optimization in a recent study [2]. We've implemented an ICP LLVM pass that iterates over all indirect call sites in the module and selectively (under heuristics) performs the promotion. Here is one example of the transformation. --------------------ico.ll-------------------------------------------------- define void @foo(i32 %a) { entry: %a1 = add i32 %a, 1 ret void } define void @bar(i32 %a) { entry: %a2 = add i32 %a, 2 ret void } define void @main(void (i32)* %fun) { entry: call void %fun(i32 10), !prof !1 ret void } !1 = !{!"indirect_call_targets", i64 6000, !"foo", i64 5000, !"bar", i64 100} ----------------------------------------------------------------------------> opt -ic-opt ico.ll -o ico-post.bc--------------------ico-post.ll---------------------------------------------define void @foo(i32 %a) { entry: %a1 = add i32 %a, 1 ret void } define void @bar(i32 %a) { entry: %a2 = add i32 %a, 2 ret void } define void @main(void (i32)* %fun) { entry: %0 = bitcast void (i32)* %fun to i8* %1 = bitcast void (i32)* @foo to i8* %2 = icmp eq i8* %0, %1 br i1 %2, label %if.true, label %if.false, !prof !0 if.merge: ; preds = %if.false, %if.true ret void if.true: ; preds = %entry call void @foo(i32 10) #0 br label %if.merge if.false: ; preds = %entry call void %fun(i32 10), !prof !1 br label %if.merge } attributes #0 = { inlinehint } !0 = !{!"branch_weights", i32 5000, i32 1000} !1 = !{!"indirect_call_targets", i64 1000, !"bar", i64 100} ---------------------------------------------------------------------------- The ICP pass handles indirect call and indirect invoke LLVM IR instructions. It depends on the availability of indirect call metadata provided by the indirect call profile infrastructure briefly described at http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html Here is the new indirect call (IC) metadata type used in the example above: !1 = !{!"indirect_call_targets", i64 6000, !"foo", i64 5000, !"bar", i64 100} The 6000 represents the number of times the indirect call was executed during the profiling runs, of which function foo was the receiver 5000 times, and function bar was the receiver 100 times. The input for ICP pass: LLVM IR including IC metadata, (future) function entry count metadata The output: Modified LLVM IR with modified indirect call sites. Selected indirect call targets are promoted and inline hint attributes are added subject to heuristics. 2. Aigner & Hölzle-based heuristics ------------------------------------ We've implemented two heuristics from this paper [1]. a. Hot call site heuristic: only consider for promotion a call site which contribution (profile count) is more than 0.1% of the total count of all indirect calls executed during the profile runs. b. Hot target heuristic: promote the most frequent target at a call site if it gets at least 40% of all receivers at the call site. Only the hottest target from a call site is possibly promoted, similarly to the approach taken in the paper. In addition to Aigner & Hölzle-based heuristics, we add an inline hint to the promoted direct call/invoke instruction if it is the single receiver at the call site according to the profile data or the number of times the target is executed at the call site is at least 4% of the total count of all indirect calls. Once the function entry profile counts become available we will use them to tune the above inline-related heuristic. 3. Handling virtual function calls ----------------------------------- Consider the following C++ example involving virtual functions and calls. ----------------------------------------------------------------------------class Operation { public: virtual int test_add(int a, int b)=0; virtual int test_sub(int a, int b)=0; }; class A : public Operation { public: int test_add(int a, int b) { return a + b; } int test_sub(int a, int b) { return a - b; } }; A myA; int __attribute__ ((noinline)) testmain(int (A::*fptr)(int, int)) { return (myA.*fptr)(1, 2); } ---------------------------------------------------------------------------- The debugging output of the ICP pass for this example is as follows: ----------------------------------------------------------------------------**** INDIRECT CALL OPTIMIZATION **** IC target hotness threshold = 40 Total IC execution count = 1.000000e+00 Attempting IC_opt on: %call = call i32 %5(%class.A* %this.adjusted, i32 1, i32 2), !prof !8 with: !{!"indirect_call_targets", i64 1, !"_ZN1A8test_subEii", i64 1} in function: _Z8testmainM1AFiiiE CS hotness% = 1.000000e+02 Target hotness% = 100 == Basic Block Before =memptr.end: ; preds %memptr.nonvirtual, %memptr.virtual %5 = phi i32 (%class.A*, i32, i32)* [ %memptr.virtualfn, %memptr.virtual ], [ %memptr.nonvirtualfn, %memptr.nonvirtual ] %call = call i32 %5(%class.A* %this.adjusted, i32 1, i32 2), !prof !8 ret i32 %call ... !8 = !{!"indirect_call_targets", i64 1, !"_ZN1A8test_subEii", i64 1} == Basic Blocks After == // code after the ICP pass memptr.end: ; preds %memptr.nonvirtual, %memptr.virtual %5 = phi i32 (%class.A*, i32, i32)* [ %memptr.virtualfn, %memptr.virtual ], [ %memptr.nonvirtualfn, %memptr.nonvirtual ] %6 = bitcast i32 (%class.A*, i32, i32)* %5 to i8* %7 = bitcast i32 (%class.A*, i32, i32)* @_ZN1A8test_subEii to i8* %8 icmp eq i8* %6, %7 br i1 %8, label %if.true, label %if.false, !prof !8 if.true: ; preds = %memptr.end %10 = call i32 @_ZN1A8test_subEii(%class.A* %this.adjusted, i32 1, i32 2) #3 br label %if.merge if.false: ; preds = %memptr.end %call = call i32 %5(%class.A* %this.adjusted, i32 1, i32 2), !prof !9 br label %if.merge if.merge: ; preds = %if.false, %if.true %9 = phi i32 [ %10, %if.true ], [ %call, %if.false ] ret i32 %9 ... attributes #3 = { inlinehint } !8 = !{!"branch_weights", i32 1, i32 0} !9 = !{!"indirect_call_targets", i64 0} ---------------------------------------------------------------------------- At LLVM IR level the ICP pass sees virtual function calls as normal indirect calls, and proceeds as in the first example. Currently ICP is oblivious to vtables and virtial function support. On a more complicated example found in eon benchmark, one future enhancement is to identify that an indirect call is virtual and change the comparison (shown at a higher IR level) from if (ptr->foo == A::foo) to if (ptr->_vptr == A::_vtable) This will sink one load from the original block into the less frequently executed if.false block. This opportunity was found by Balaram Makam. 4. New enhancement patch ------------------------- Currently our implementation has the following shortcomings: a. Our heuristics do not depend on the global information on function counts. It could be that none of the indirect call sites are contributing highly to the overall calls. Because our current implementation is deciding what to inline based on the indirect call site sum only, it could be inlining functions that are in essence cold when all functions in the source base are considered. This situation will be improved when the function entry profile counts become available in llvm IR. b. Current implementation only transforms the first hot target, the rest of the targets are never considered even if they are relatively hot. We are evaluating a new solution which depends on the presence/availability of functions counts in clang. We form a sorted multiset of all functions counts. A given indirect target is considered for inlining if the targets count at the call site falls within one of the ranges that form the top 0-10%, 10-20% or 20-30% of the sorted multiset. Weve added checks which become stricter as the target count falls farther away from the top most called 10%, 20% or 30% of all functions respectively. Targets that are classified as making calls to one of the top most called 30% of the functions receive inline hints. Inline hints are communicated from clang down to LLVM in metadata. Then, on the LLVM side the transformation pass uses the metadata field for the hint to add an inline hint at the transformed call site. ------------------------- [1] G. Aigner and U. Hölzle. Eliminating virtual function calls in C++ programs. ECOOP, 1996. [2] X. Li, R. Ashok, R. Hundt. Lightweight Feedback-Directed Cross-Module Optimization. CGO, 2010.
On Fri, Apr 17, 2015 at 2:13 PM, Ivan Baev <ibaev at codeaurora.org> wrote:> Hi, we've implemented an indirect call promotion llvm pass. The design > notes including examples are shown below. This pass complements the > indirect call profile infrastructure > http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.htmlThere are issues with the profiling infrastructure proposal which will addressed separately. This part looks sane in general to me. See my rely inline.> We've implemented two heuristics from this paper [1]. > > a. Hot call site heuristic: only consider for promotion a call site which > contribution (profile count) is more than 0.1% of the total count of all > indirect calls executed during the profile runs. > > b. Hot target heuristic: promote the most frequent target at a call site > if it gets at least 40% of all receivers at the call site.Is the heuristics a || b, or a && b ?> > Only the hottest target from a call site is possibly promoted, similarly > to the approach taken in the paper. > > In addition to Aigner & Hölzle-based heuristics, we add an inline hint to > the promoted direct call/invoke instruction if it is the single receiver > at the call site according to the profile data or the number of times the > target is executed at the call site is at least 4% of the total count of > all indirect calls. Once the function entry profile counts become > available we will use them to tune the above inline-related heuristic. > >I don't think indirectly promoted callsites should be treated any differently from original direct callsites -- after promotion, the direct callsites have call count info and the same inline heuristics should apply.> > if (ptr->foo == A::foo) > to > if (ptr->_vptr == A::_vtable)You can do that if you know class A is final. In general, you need type or vtable profiling to get it.> > This will sink one load from the original block into the less frequently > executed if.false block. This opportunity was found by Balaram Makam. > > > 4. New enhancement patch > ------------------------- > Currently our implementation has the following shortcomings: > a. Our heuristics do not depend on the global information on function > counts. It could be that none of the indirect call sites are contributing > highly to the overall calls. Because our current implementation is > deciding what to inline based on the indirect call site sum only, it could > be inlining functions that are in essence cold when all functions in the > source base are considered. This situation will be improved when the > function entry profile counts become available in llvm IR.Our plan is to add program level summary data for PGO. Any global decisions need to made based on that because only relative global hotness matters.> > b. Current implementation only transforms the first hot target, the rest > of the targets are never considered even if they are relatively hot.This is probably a good thing. Going beyond 2 can have negative effect.> > We are evaluating a new solution which depends on the > presence/availability of functions counts in clang. We form a sorted > multiset of all functions counts. A given indirect target is considered > for inlining if the target’s count at the call site falls within one of > the ranges that form the top 0-10%, 10-20% or 20-30% of the sorted > multiset. We’ve added checks which become stricter as the target count > falls farther away from the top most called 10%, 20% or 30% of all > functions respectively. > > Targets that are classified as making calls to one of the top most called > 30% of the functions receive inline hints. Inline hints are communicated > from clang down to LLVM in metadata. Then, on the LLVM side the > transformation pass uses the metadata field for the hint to add an inline > hint at the transformed call site.Again, there is no need to invent indirect call (promoted) specific inline heuristics. thanks, David> > ------------------------- > [1] G. Aigner and U. Hölzle. Eliminating virtual function calls in C++ > programs. ECOOP, 1996. > [2] X. Li, R. Ashok, R. Hundt. Lightweight Feedback-Directed Cross-Module > Optimization. CGO, 2010. > > > > > > > > > > > >
> On Fri, Apr 17, 2015 at 2:13 PM, Ivan Baev <ibaev at codeaurora.org> wrote: >> Hi, we've implemented an indirect call promotion llvm pass. The design >> notes including examples are shown below. This pass complements the >> indirect call profile infrastructure >> http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html > > There are issues with the profiling infrastructure proposal which will > addressed separately.-- Please send these to Betul and me.> > This part looks sane in general to me. See my rely inline. > > >> We've implemented two heuristics from this paper [1]. >> >> a. Hot call site heuristic: only consider for promotion a call site >> which >> contribution (profile count) is more than 0.1% of the total count of all >> indirect calls executed during the profile runs. >> >> b. Hot target heuristic: promote the most frequent target at a call site >> if it gets at least 40% of all receivers at the call site. > > Is the heuristics a || b, or a && b ?-- a && b> >> >> Only the hottest target from a call site is possibly promoted, similarly >> to the approach taken in the paper. >> >> In addition to Aigner & Hölzle-based heuristics, we add an inline hint >> to >> the promoted direct call/invoke instruction if it is the single receiver >> at the call site according to the profile data or the number of times >> the >> target is executed at the call site is at least 4% of the total count of >> all indirect calls. Once the function entry profile counts become >> available we will use them to tune the above inline-related heuristic. >> >> > > I don't think indirectly promoted callsites should be treated any > differently from original direct callsites -- after promotion, the > direct callsites have call count info and the same inline heuristics > should apply.--- Agree in general, we should live this decision to inliner, especially when it becomes a profile-based inliner. At ICP pass we currenty have the profile counts for indirect call sites and their receivers (targets) and it is tempting to pass some of this information to the inliner.>> >> if (ptr->foo == A::foo) >> to >> if (ptr->_vptr == A::_vtable) > > You can do that if you know class A is final. In general, you need > type or vtable profiling to get it.-- It is a future enhancement. Could you please provide some more details, in particular is it valid for C++ programs?> >> >> This will sink one load from the original block into the less frequently >> executed if.false block. This opportunity was found by Balaram Makam. >> >> >> 4. New enhancement patch >> ------------------------- >> Currently our implementation has the following shortcomings: >> a. Our heuristics do not depend on the global information on function >> counts. It could be that none of the indirect call sites are >> contributing >> highly to the overall calls. Because our current implementation is >> deciding what to inline based on the indirect call site sum only, it >> could >> be inlining functions that are in essence cold when all functions in the >> source base are considered. This situation will be improved when the >> function entry profile counts become available in llvm IR. > > Our plan is to add program level summary data for PGO. Any global > decisions need to made based on that because only relative global > hotness matters. > >> >> b. Current implementation only transforms the first hot target, the rest >> of the targets are never considered even if they are relatively hot. > > This is probably a good thing. Going beyond 2 can have negative effect. >-- With 2 we're getting incremental improvements, and we plan to further tune it. Thanks for the feedback, David. Ivan>> >> We are evaluating a new solution which depends on the >> presence/availability of functions counts in clang. We form a sorted >> multiset of all functions counts. A given indirect target is considered >> for inlining if the targetâs count at the call site falls within one >> of >> the ranges that form the top 0-10%, 10-20% or 20-30% of the sorted >> multiset. Weâve added checks which become stricter as the target >> count >> falls farther away from the top most called 10%, 20% or 30% of all >> functions respectively. >> >> Targets that are classified as making calls to one of the top most >> called >> 30% of the functions receive inline hints. Inline hints are >> communicated >> from clang down to LLVM in metadata. Then, on the LLVM side the >> transformation pass uses the metadata field for the hint to add an >> inline >> hint at the transformed call site. > > Again, there is no need to invent indirect call (promoted) specific > inline heuristics. > > > thanks, > > David > > >> >> ------------------------- >> [1] G. Aigner and U. Hölzle. Eliminating virtual function calls in C++ >> programs. ECOOP, 1996. >> [2] X. Li, R. Ashok, R. Hundt. Lightweight Feedback-Directed >> Cross-Module >> Optimization. CGO, 2010. >>
betulb at codeaurora.org
2015-Apr-17 23:26 UTC
[LLVMdev] RFC: Indirect Call Promotion LLVM Pass
> On Fri, Apr 17, 2015 at 2:13 PM, Ivan Baev <ibaev at codeaurora.org> wrote: >> Hi, we've implemented an indirect call promotion llvm pass. The design >> notes including examples are shown below. This pass complements the >> indirect call profile infrastructure >> http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-April/084271.html > > There are issues with the profiling infrastructure proposal which will > addressed separately. > > This part looks sane in general to me. See my rely inline. > > >> We've implemented two heuristics from this paper [1]. >> >> a. Hot call site heuristic: only consider for promotion a call site >> which >> contribution (profile count) is more than 0.1% of the total count of all >> indirect calls executed during the profile runs. >> >> b. Hot target heuristic: promote the most frequent target at a call site >> if it gets at least 40% of all receivers at the call site. > > Is the heuristics a || b, or a && b ? > >> >> Only the hottest target from a call site is possibly promoted, similarly >> to the approach taken in the paper. >> >> In addition to Aigner & Hölzle-based heuristics, we add an inline hint >> to >> the promoted direct call/invoke instruction if it is the single receiver >> at the call site according to the profile data or the number of times >> the >> target is executed at the call site is at least 4% of the total count of >> all indirect calls. Once the function entry profile counts become >> available we will use them to tune the above inline-related heuristic. >> >> > > I don't think indirectly promoted callsites should be treated any > differently from original direct callsites -- after promotion, the > direct callsites have call count info and the same inline heuristics > should apply. > > >> >> if (ptr->foo == A::foo) >> to >> if (ptr->_vptr == A::_vtable) > > You can do that if you know class A is final. In general, you need > type or vtable profiling to get it. > >> >> This will sink one load from the original block into the less frequently >> executed if.false block. This opportunity was found by Balaram Makam. >> >> >> 4. New enhancement patch >> ------------------------- >> Currently our implementation has the following shortcomings: >> a. Our heuristics do not depend on the global information on function >> counts. It could be that none of the indirect call sites are >> contributing >> highly to the overall calls. Because our current implementation is >> deciding what to inline based on the indirect call site sum only, it >> could >> be inlining functions that are in essence cold when all functions in the >> source base are considered. This situation will be improved when the >> function entry profile counts become available in llvm IR. > > Our plan is to add program level summary data for PGO. Any global > decisions need to made based on that because only relative global > hotness matters. > >> >> b. Current implementation only transforms the first hot target, the rest >> of the targets are never considered even if they are relatively hot. > > This is probably a good thing. Going beyond 2 can have negative effect. > >> >> We are evaluating a new solution which depends on the >> presence/availability of functions counts in clang. We form a sorted >> multiset of all functions counts. A given indirect target is considered >> for inlining if the targetâs count at the call site falls within one >> of >> the ranges that form the top 0-10%, 10-20% or 20-30% of the sorted >> multiset. Weâve added checks which become stricter as the target >> count >> falls farther away from the top most called 10%, 20% or 30% of all >> functions respectively. >> >> Targets that are classified as making calls to one of the top most >> called >> 30% of the functions receive inline hints. Inline hints are >> communicated >> from clang down to LLVM in metadata. Then, on the LLVM side the >> transformation pass uses the metadata field for the hint to add an >> inline >> hint at the transformed call site. > > Again, there is no need to invent indirect call (promoted) specific > inline heuristics.These heuristics are there to decide which indirect call targets to promote at an indirect call site. Not necessarily an additional inline heuristic for the promoted targets. However, we do add an inline hint currently at the transformed call sites to help trigger the inlines at the new call site.> > thanks, > > David > > >> >> ------------------------- >> [1] G. Aigner and U. Hölzle. Eliminating virtual function calls in C++ >> programs. ECOOP, 1996. >> [2] X. Li, R. Ashok, R. Hundt. Lightweight Feedback-Directed >> Cross-Module >> Optimization. CGO, 2010. >> >> >> >> >> >> >> >> >> >> >> >> >