Peter Collingbourne via llvm-dev
2016-Jan-28 05:32 UTC
[llvm-dev] Proposal: virtual constant propagation
Hi Dibyendu, (re-sending to correct list) I have not, and I could try to measure that once I have a full implementation up and running, but I should mention that Chromium's workloads are very different from the SPEC CPU2006 workloads -- for example, we measured a much higher overhead for CFI on these Chromium benchmarks than anything in SPEC. Peter On Thu, Jan 28, 2016 at 04:58:08AM +0000, Das, Dibyendu wrote:> Peter- > > Do you know the impact on the C++ benchmarks in SPEC CPU2006 ? > > -Thx > Dibyendu > > -----Original Message----- > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Peter Collingbourne via llvm-dev > Sent: Thursday, January 28, 2016 9:27 AM > To: llvm-dev at lists.llvm.org > Cc: dannyb at google.com > Subject: [llvm-dev] Proposal: virtual constant propagation > > Hi all, > > I'd like to make the following proposal to implement an optimization called virtual constant propagation. > > ==Introduction=> After enabling control flow integrity protection in Chromium, we have observed an unacceptable performance regression in certain critical layout microbenchmarks. Profiling [0] revealed that the cause of the regression was a large number of virtual calls, each requiring the overhead of a control flow integrity check. > > We took a closer look at the hottest virtual call sites, and found that they were mostly type checking functions of the following form: > > struct Base { > enum Type { kBase, kDerived }; > virtual bool isDerived() const { return false; } // or: > virtual bool isOfType(Type t) const { return t == kBase; } }; struct Derived : Base { > virtual bool isDerived() const { return true; } > virtual bool isOfType(Type t) const { return t == kDerived; } }; > > bool f(Base *b) { > return b->isDerived(); > } > bool g(Base *b) { > return b->isOfType(Base::kDerived); > } > > We can make the observation that if the type hierarchy is known to be closed and the definition of each virtual function and each virtual call is visible to the compiler, calls to either of these functions can be implemented more efficiently by loading directly from the vtable (we refer to this as virtual constant propagation): > > struct Base_VTable { > bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1; }; Base_VTable base_vtable{false, true, false}; Derived_VTable derived_vtable{true, false, true}; bool f(Base *b) { > return b->vtable->isDerived; > } > bool g(Base *b) { > return b->vtable->isOfTypeDerived; > } > > Another advantage of implementing the calls this way is that because there is no indirect call taking place, no control flow integrity check is necessary. > > I implemented a prototype of this idea that specifically targeted the hottest virtual function in one of Chromium’s layout benchmarks, and observed a performance improvement of median 1.62%, min -4.72%, max 204.52% for the layout benchmark suite. I’d like to emphasise that these numbers are from a regular LTO build without CFI, and that we expect to see better performance from a general implementation that targets all virtual functions. > > ==User interface=> To instruct the compiler to assume that all type hierarchies are closed, a user can pass the -fwhole-program-vtables flag. The -fwhole-program-vtables flag requires the -flto flag to also be specified. > > Of course, there may be some type hierarchies that are not entirely closed, but the underlying assumption is that most hierarchies will not be. To support open hierarchies the user can also specify the path to a blacklist that lists the names of the open class hierarchies using a -fwhole-program-vtables-blacklist> flag. By default, the compiler will assume that hierarchies in the std namespace are open. > > ==High level design=> Devirtualization will take place at LTO time. Candidates for virtual constant propagation will be found by searching for call sites with all-constant arguments to virtual functions which in all derived classes are readnone, which means that it is a pure function of its arguments. For each (virtual function, arguments) candidate pair, the compiler will compute a offset in bits (positive or negative) from the class’s vtable address point in which to store the function result, extend the size of any vtable objects to cover any such offsets, evaluate the function to compute a result and store it at the correct offsets. To avoid breaking ABI, the vtable layout will not be changed in any way. This includes pointers to optimized functions in the vtable, and (for Itanium ABI) the distance between address points within a single vtable symbol. > > For example, the vtable for Base and Derived will look like this: > > struct Base_VTable { > bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1; > size_t padding : 61; > size_t offset_to_top; > size_t rtti; > bool (*isDerived)(const Base *this); > bool (*isOfType)(const Base *this, Type t); }; Base_VTable base_vtable{ > false, true, false, 0, 0, 0, &Base::isDerived, &Base::IsOfType }; Base_VTable derived_vtable{ > True, false, true, 0, 0, 0, &Derived::isDerived, &Derived::IsOfType }; > > The symbol _ZTV4Base, representing the start of Base’s vtable set, will be defined as base_vtable+8; similarly for _ZTV7Derived and derived_vtable+8. > > To avoid bloating virtual tables by too much in degenerate cases, there will be a limit for each virtual function on the number of bits that the virtual function may contribute to the size of the vtable, which we may fix at (say) > 2 machine words. > > To give an idea of how the generated code would look, here is an asm snippet from the hottest virtual call from Chromium’s layout benchmarks before the transformation: > > 255c9ae: 48 8b 03 mov (%rbx),%rax > 255c9b1: be 30 00 00 00 mov $0x30,%esi > 255c9b6: 48 89 df mov %rbx,%rdi > 255c9b9: ff 90 e0 02 00 00 callq *0x2e0(%rax) > 255c9bf: 84 c0 test %al,%al > 255c9c1: 74 04 je 255c9c7 > 255c9c3: 31 c0 xor %eax,%eax > 255c9c5: 5b pop %rbx > 255c9c6: c3 retq > > And after: > > 255a8ee: 48 8b 03 mov (%rbx),%rax > 255a8f1: f6 40 e6 01 testb $0x1,-0x1a(%rax) > 255a8f5: 74 04 je 255a8fb > 255a8f7: 31 c0 xor %eax,%eax > 255a8f9: 5b pop %rbx > 255a8fa: c3 retq > > ==LLVM IR-level design=> Given a class name, how can we determine the closed set of derived classes and possible function pointers for a call site? As it turns out the IR already has a way of expressing this information: bitsets [1], which are already being used to implement CFI. We can encode the information for devirtualization by combining the @llvm.bitset.test and @llvm.assume intrinsics. > > %vtable = load i8**, i8*** %obj > %p = call i1 @llvm.bitset.test(%vtable, “Base”) call void @llvm.assume(i1 %p) ; %vtable is assumed to point to > ; _ZTV4Base+16 or _ZTV7Derived+16. > %fptrptr = getelementptr i8* %vtable, 1 > %fptr = load i8*, i8** %fptrptr ; %fptr must point to Base::isOfType > ; or Derived::isOfType. > %fptr_casted = bitcast i8% %fptr to i1 (i8***, i32) %result = call i1 %fptr_casted(i8*** %obj, i32 0) > > This gives the optimizer all the information it needs to implement the > optimization: the addresses of the virtual tables, the addresses of the function pointers (within the virtual table initializers) and the constant arguments. Note that the compiler does not need to do any of the work described in the above link to lay out globals or create bitsets if all calls to llvm.bitset.test are passed to llvm.assume. > > If virtual constant propagation succeeds, the transformed IR will look like this: > > %vtable = load i8*, i8** %obj > %vtable_bit_addr = getelementptr i8* %vtable, i64 -17 %vtable_bit = load i8 %vtable_bit_addr %vtable_bit_and = and i8 %vtable_bit, 1 %result = icmp ne %vtable_bit_and, 0 > > The pass that applies this transformation will be placed early in the LTO pipeline, before most of the regular optimization passes. The pass could later be extended to do general devirtualization based on the bitset information: > > - The pass can implement single-deriver (i.e. only one derived class with a > non-pure virtual function definition) and single-implementation > (i.e. multiple derived classes all sharing a virtual function definition > from a base class) devirtualization by checking that each of the possible > values loaded from the vtables in the bitset are either identical or > __cxa_pure_virtual (calls to pure virtuals are UB, so we’d be fine > disregarding them), and propagating the identical address into the > function callee. > - The pass could also potentially implement speculative devirtualization > (i.e. testing the vtable entry and guarding direct calls on it) by > generating a comparison against the known address point. The pass will > know exactly how many possible classes there will be for each call site, > so we could have logic to speculatively devirtualize if that number is > sufficiently small. > > The design when CFI is enabled is slightly different, as we need to give the compiler the ability to eliminate the unnecessary llvm.bitset.test call. Recall that the IR looks like this: > > %vtable = load i8**, i8*** %obj > %p = call i1 @llvm.bitset.test(%vtable, “Base”) br i1 %p, label %call, label %abort > > call: > %fptr = load i8*, i8** %vtable > %fptr_casted = bitcast i8% %fptr to i1 (i8***, i32) call i1 %fptr_casted(i8*** %obj, i32 0) > > abort: > call void @llvm.trap() > > We cannot simply have our optimization pass transform this IR to eliminate the llvm.bitset.test call, as the transformation would not be semantics-preserving (we already have a CFI setting that emits diagnostics in the abort branch, which would break under such a transformation). Instead, we can introduce a new intrinsic, @llvm.bitset.check, to be used only in production mode (i.e. with diagnostics disabled): > > {i8*, i1} @llvm.bitset.check(i8* %ptr, metadata %name) > > If %ptr is in the bitset identified by %name, the function returns {%ptr, true}. But if %ptr is not in %name, the behaviour is undefined, modulo that if the second element of the result is non-zero, the program may load and call a function pointer from an address derived from the first element of the result without causing an indirect function call to any function other than one potentially loaded from one of the constant globals of which %name is a member (i.e. one of the valid vtables for %name). A CFI-protected virtual call would then look like this (eliding bitcasts for brevity): > > %vtable = load i8**, i8*** %obj > %pair = call {i8*, i1} @llvm.bitset.check(i8* %vtable, “Base”) %checked_vtable = extractelement %pair, 0 %p = extractelement %pair, 1 br i1 %p, label %call, label %abort > > call: > %fptr = load i8*, i8** %checked_vtable > %result = call i1 %fptr(i8*** %obj, i32 0) > > abort: > call void @llvm.trap() > > Applying virtual constant propagation to @llvm.bitset.check would be very similar to applying it to @llvm.bitset.test, but if the constant propagation succeeds, the function will return true as the second element: > > %vtable = load i8**, i8*** %obj > br i1 true, label %call, label %abort > > call: > %vtable_bit_addr = getelementptr i8* %vtable, i64 -17 %vtable_bit = load i8 %vtable_bit_addr %vtable_bit_and = and i8 %vtable_bit, 1 %result = icmp ne %vtable_bit_and, 0 > > abort: > call void @llvm.trap() > > Now the existing SimplifyCFG pass can easily eliminate the unnecessary branching. > > In order to allow for regular constant folding through the llvm.bitset.check intrinsic, the code in lib/Analysis/ConstantFolding.cpp would be taught to look through the intrinsic’s argument. > > -- > Peter > > [0] https://code.google.com/p/chromium/issues/detail?id=580389 > [1] http://llvm.org/docs/BitSets.html > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Peter
Das, Dibyendu via llvm-dev
2016-Jan-28 06:36 UTC
[llvm-dev] Proposal: virtual constant propagation
Thx Peter. -----Original Message----- From: Peter Collingbourne [mailto:peter at pcc.me.uk] Sent: Thursday, January 28, 2016 11:02 AM To: Das, Dibyendu <Dibyendu.Das at amd.com> Cc: llvm-dev at lists.llvm.org Subject: Re: [llvm-dev] Proposal: virtual constant propagation Hi Dibyendu, (re-sending to correct list) I have not, and I could try to measure that once I have a full implementation up and running, but I should mention that Chromium's workloads are very different from the SPEC CPU2006 workloads -- for example, we measured a much higher overhead for CFI on these Chromium benchmarks than anything in SPEC. Peter On Thu, Jan 28, 2016 at 04:58:08AM +0000, Das, Dibyendu wrote:> Peter- > > Do you know the impact on the C++ benchmarks in SPEC CPU2006 ? > > -Thx > Dibyendu > > -----Original Message----- > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of > Peter Collingbourne via llvm-dev > Sent: Thursday, January 28, 2016 9:27 AM > To: llvm-dev at lists.llvm.org > Cc: dannyb at google.com > Subject: [llvm-dev] Proposal: virtual constant propagation > > Hi all, > > I'd like to make the following proposal to implement an optimization called virtual constant propagation. > > ==Introduction=> After enabling control flow integrity protection in Chromium, we have observed an unacceptable performance regression in certain critical layout microbenchmarks. Profiling [0] revealed that the cause of the regression was a large number of virtual calls, each requiring the overhead of a control flow integrity check. > > We took a closer look at the hottest virtual call sites, and found that they were mostly type checking functions of the following form: > > struct Base { > enum Type { kBase, kDerived }; > virtual bool isDerived() const { return false; } // or: > virtual bool isOfType(Type t) const { return t == kBase; } }; struct Derived : Base { > virtual bool isDerived() const { return true; } > virtual bool isOfType(Type t) const { return t == kDerived; } }; > > bool f(Base *b) { > return b->isDerived(); > } > bool g(Base *b) { > return b->isOfType(Base::kDerived); > } > > We can make the observation that if the type hierarchy is known to be closed and the definition of each virtual function and each virtual call is visible to the compiler, calls to either of these functions can be implemented more efficiently by loading directly from the vtable (we refer to this as virtual constant propagation): > > struct Base_VTable { > bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1; }; Base_VTable base_vtable{false, true, false}; Derived_VTable derived_vtable{true, false, true}; bool f(Base *b) { > return b->vtable->isDerived; > } > bool g(Base *b) { > return b->vtable->isOfTypeDerived; > } > > Another advantage of implementing the calls this way is that because there is no indirect call taking place, no control flow integrity check is necessary. > > I implemented a prototype of this idea that specifically targeted the hottest virtual function in one of Chromium’s layout benchmarks, and observed a performance improvement of median 1.62%, min -4.72%, max 204.52% for the layout benchmark suite. I’d like to emphasise that these numbers are from a regular LTO build without CFI, and that we expect to see better performance from a general implementation that targets all virtual functions. > > ==User interface=> To instruct the compiler to assume that all type hierarchies are closed, a user can pass the -fwhole-program-vtables flag. The -fwhole-program-vtables flag requires the -flto flag to also be specified. > > Of course, there may be some type hierarchies that are not entirely > closed, but the underlying assumption is that most hierarchies will > not be. To support open hierarchies the user can also specify the path > to a blacklist that lists the names of the open class hierarchies > using a -fwhole-program-vtables-blacklist> flag. By default, the compiler will assume that hierarchies in the std namespace are open. > > ==High level design=> Devirtualization will take place at LTO time. Candidates for virtual constant propagation will be found by searching for call sites with all-constant arguments to virtual functions which in all derived classes are readnone, which means that it is a pure function of its arguments. For each (virtual function, arguments) candidate pair, the compiler will compute a offset in bits (positive or negative) from the class’s vtable address point in which to store the function result, extend the size of any vtable objects to cover any such offsets, evaluate the function to compute a result and store it at the correct offsets. To avoid breaking ABI, the vtable layout will not be changed in any way. This includes pointers to optimized functions in the vtable, and (for Itanium ABI) the distance between address points within a single vtable symbol. > > For example, the vtable for Base and Derived will look like this: > > struct Base_VTable { > bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1; > size_t padding : 61; > size_t offset_to_top; > size_t rtti; > bool (*isDerived)(const Base *this); > bool (*isOfType)(const Base *this, Type t); }; Base_VTable base_vtable{ > false, true, false, 0, 0, 0, &Base::isDerived, &Base::IsOfType }; Base_VTable derived_vtable{ > True, false, true, 0, 0, 0, &Derived::isDerived, &Derived::IsOfType > }; > > The symbol _ZTV4Base, representing the start of Base’s vtable set, will be defined as base_vtable+8; similarly for _ZTV7Derived and derived_vtable+8. > > To avoid bloating virtual tables by too much in degenerate cases, > there will be a limit for each virtual function on the number of bits > that the virtual function may contribute to the size of the vtable, > which we may fix at (say) > 2 machine words. > > To give an idea of how the generated code would look, here is an asm snippet from the hottest virtual call from Chromium’s layout benchmarks before the transformation: > > 255c9ae: 48 8b 03 mov (%rbx),%rax > 255c9b1: be 30 00 00 00 mov $0x30,%esi > 255c9b6: 48 89 df mov %rbx,%rdi > 255c9b9: ff 90 e0 02 00 00 callq *0x2e0(%rax) > 255c9bf: 84 c0 test %al,%al > 255c9c1: 74 04 je 255c9c7 > 255c9c3: 31 c0 xor %eax,%eax > 255c9c5: 5b pop %rbx > 255c9c6: c3 retq > > And after: > > 255a8ee: 48 8b 03 mov (%rbx),%rax > 255a8f1: f6 40 e6 01 testb $0x1,-0x1a(%rax) > 255a8f5: 74 04 je 255a8fb > 255a8f7: 31 c0 xor %eax,%eax > 255a8f9: 5b pop %rbx > 255a8fa: c3 retq > > ==LLVM IR-level design=> Given a class name, how can we determine the closed set of derived classes and possible function pointers for a call site? As it turns out the IR already has a way of expressing this information: bitsets [1], which are already being used to implement CFI. We can encode the information for devirtualization by combining the @llvm.bitset.test and @llvm.assume intrinsics. > > %vtable = load i8**, i8*** %obj > %p = call i1 @llvm.bitset.test(%vtable, “Base”) call void @llvm.assume(i1 %p) ; %vtable is assumed to point to > ; _ZTV4Base+16 or _ZTV7Derived+16. > %fptrptr = getelementptr i8* %vtable, 1 %fptr = load i8*, i8** > %fptrptr ; %fptr must point to Base::isOfType > ; or Derived::isOfType. > %fptr_casted = bitcast i8% %fptr to i1 (i8***, i32) %result = call i1 > %fptr_casted(i8*** %obj, i32 0) > > This gives the optimizer all the information it needs to implement the > optimization: the addresses of the virtual tables, the addresses of the function pointers (within the virtual table initializers) and the constant arguments. Note that the compiler does not need to do any of the work described in the above link to lay out globals or create bitsets if all calls to llvm.bitset.test are passed to llvm.assume. > > If virtual constant propagation succeeds, the transformed IR will look like this: > > %vtable = load i8*, i8** %obj > %vtable_bit_addr = getelementptr i8* %vtable, i64 -17 %vtable_bit = > load i8 %vtable_bit_addr %vtable_bit_and = and i8 %vtable_bit, 1 > %result = icmp ne %vtable_bit_and, 0 > > The pass that applies this transformation will be placed early in the LTO pipeline, before most of the regular optimization passes. The pass could later be extended to do general devirtualization based on the bitset information: > > - The pass can implement single-deriver (i.e. only one derived class with a > non-pure virtual function definition) and single-implementation > (i.e. multiple derived classes all sharing a virtual function definition > from a base class) devirtualization by checking that each of the possible > values loaded from the vtables in the bitset are either identical or > __cxa_pure_virtual (calls to pure virtuals are UB, so we’d be fine > disregarding them), and propagating the identical address into the > function callee. > - The pass could also potentially implement speculative devirtualization > (i.e. testing the vtable entry and guarding direct calls on it) by > generating a comparison against the known address point. The pass will > know exactly how many possible classes there will be for each call site, > so we could have logic to speculatively devirtualize if that number is > sufficiently small. > > The design when CFI is enabled is slightly different, as we need to give the compiler the ability to eliminate the unnecessary llvm.bitset.test call. Recall that the IR looks like this: > > %vtable = load i8**, i8*** %obj > %p = call i1 @llvm.bitset.test(%vtable, “Base”) br i1 %p, label %call, > label %abort > > call: > %fptr = load i8*, i8** %vtable > %fptr_casted = bitcast i8% %fptr to i1 (i8***, i32) call i1 > %fptr_casted(i8*** %obj, i32 0) > > abort: > call void @llvm.trap() > > We cannot simply have our optimization pass transform this IR to eliminate the llvm.bitset.test call, as the transformation would not be semantics-preserving (we already have a CFI setting that emits diagnostics in the abort branch, which would break under such a transformation). Instead, we can introduce a new intrinsic, @llvm.bitset.check, to be used only in production mode (i.e. with diagnostics disabled): > > {i8*, i1} @llvm.bitset.check(i8* %ptr, metadata %name) > > If %ptr is in the bitset identified by %name, the function returns {%ptr, true}. But if %ptr is not in %name, the behaviour is undefined, modulo that if the second element of the result is non-zero, the program may load and call a function pointer from an address derived from the first element of the result without causing an indirect function call to any function other than one potentially loaded from one of the constant globals of which %name is a member (i.e. one of the valid vtables for %name). A CFI-protected virtual call would then look like this (eliding bitcasts for brevity): > > %vtable = load i8**, i8*** %obj > %pair = call {i8*, i1} @llvm.bitset.check(i8* %vtable, “Base”) > %checked_vtable = extractelement %pair, 0 %p = extractelement %pair, 1 > br i1 %p, label %call, label %abort > > call: > %fptr = load i8*, i8** %checked_vtable %result = call i1 %fptr(i8*** > %obj, i32 0) > > abort: > call void @llvm.trap() > > Applying virtual constant propagation to @llvm.bitset.check would be very similar to applying it to @llvm.bitset.test, but if the constant propagation succeeds, the function will return true as the second element: > > %vtable = load i8**, i8*** %obj > br i1 true, label %call, label %abort > > call: > %vtable_bit_addr = getelementptr i8* %vtable, i64 -17 %vtable_bit = > load i8 %vtable_bit_addr %vtable_bit_and = and i8 %vtable_bit, 1 > %result = icmp ne %vtable_bit_and, 0 > > abort: > call void @llvm.trap() > > Now the existing SimplifyCFG pass can easily eliminate the unnecessary branching. > > In order to allow for regular constant folding through the llvm.bitset.check intrinsic, the code in lib/Analysis/ConstantFolding.cpp would be taught to look through the intrinsic’s argument. > > -- > Peter > > [0] https://code.google.com/p/chromium/issues/detail?id=580389 > [1] http://llvm.org/docs/BitSets.html > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Peter