Peter Collingbourne via llvm-dev
2016-Jan-28 03:57 UTC
[llvm-dev] Proposal: virtual constant propagation
Hi all,

I'd like to make the following proposal to implement an optimization called
virtual constant propagation.

==Introduction==
After enabling control flow integrity protection in Chromium, we have
observed an unacceptable performance regression in certain critical layout
microbenchmarks. Profiling [0] revealed that the cause of the regression was
a large number of virtual calls, each requiring the overhead of a control
flow integrity check.

We took a closer look at the hottest virtual call sites, and found that they
were mostly type checking functions of the following form:

  struct Base {
    enum Type { kBase, kDerived };
    virtual bool isDerived() const { return false; } // or:
    virtual bool isOfType(Type t) const { return t == kBase; }
  };
  struct Derived : Base {
    virtual bool isDerived() const { return true; }
    virtual bool isOfType(Type t) const { return t == kDerived; }
  };

  bool f(Base *b) {
    return b->isDerived();
  }
  bool g(Base *b) {
    return b->isOfType(Base::kDerived);
  }

We can make the observation that if the type hierarchy is known to be closed
and the definition of each virtual function and each virtual call is visible
to the compiler, calls to either of these functions can be implemented more
efficiently by loading directly from the vtable (we refer to this as virtual
constant propagation):

  struct Base_VTable {
    bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1;
  };
  Base_VTable base_vtable{false, true, false};
  Base_VTable derived_vtable{true, false, true};
  bool f(Base *b) {
    return b->vtable->isDerived;
  }
  bool g(Base *b) {
    return b->vtable->isOfTypeDerived;
  }

Another advantage of implementing the calls this way is that because there
is no indirect call taking place, no control flow integrity check is
necessary.

I implemented a prototype of this idea that specifically targeted the
hottest virtual function in one of Chromium’s layout benchmarks, and
observed a performance improvement of median 1.62%, min -4.72%, max 204.52%
for the layout benchmark suite.
I’d like to emphasise that these numbers are from a regular LTO build
without CFI, and that we expect to see better performance from a general
implementation that targets all virtual functions.

==User interface==
To instruct the compiler to assume that all type hierarchies are closed, a
user can pass the -fwhole-program-vtables flag. The -fwhole-program-vtables
flag requires the -flto flag to also be specified.

Of course, there may be some type hierarchies that are not entirely closed,
but the underlying assumption is that most hierarchies will be. To support
open hierarchies the user can also specify the path to a blacklist that
lists the names of the open class hierarchies using a
-fwhole-program-vtables-blacklist flag. By default, the compiler will assume
that hierarchies in the std namespace are open.

==High level design==
Devirtualization will take place at LTO time. Candidates for virtual
constant propagation will be found by searching for call sites with
all-constant arguments to virtual functions whose implementations in all
derived classes are readnone, i.e. each is a pure function of its arguments.
For each (virtual function, arguments) candidate pair, the compiler will
compute an offset in bits (positive or negative) from the class’s vtable
address point at which to store the function result, extend the size of any
vtable objects to cover any such offsets, evaluate the function to compute a
result, and store it at the correct offsets. To avoid breaking ABI, the
vtable layout will not be changed in any way. This includes pointers to
optimized functions in the vtable, and (for the Itanium ABI) the distance
between address points within a single vtable symbol.
For example, the vtables for Base and Derived will look like this:

  struct Base_VTable {
    bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1;
    size_t padding : 61;
    size_t offset_to_top;
    size_t rtti;
    bool (*isDerived)(const Base *this);
    bool (*isOfType)(const Base *this, Type t);
  };
  Base_VTable base_vtable{
    false, true, false, 0, 0, 0, &Base::isDerived, &Base::isOfType
  };
  Base_VTable derived_vtable{
    true, false, true, 0, 0, 0, &Derived::isDerived, &Derived::isOfType
  };

The symbol _ZTV4Base, representing the start of Base’s vtable set, will be
defined as base_vtable+8; similarly for _ZTV7Derived and derived_vtable+8.

To avoid bloating virtual tables by too much in degenerate cases, there will
be a limit for each virtual function on the number of bits that the virtual
function may contribute to the size of the vtable, which we may fix at (say)
2 machine words.

To give an idea of how the generated code would look, here is an asm snippet
from the hottest virtual call in Chromium’s layout benchmarks before the
transformation:

  255c9ae: 48 8b 03              mov    (%rbx),%rax
  255c9b1: be 30 00 00 00        mov    $0x30,%esi
  255c9b6: 48 89 df              mov    %rbx,%rdi
  255c9b9: ff 90 e0 02 00 00     callq  *0x2e0(%rax)
  255c9bf: 84 c0                 test   %al,%al
  255c9c1: 74 04                 je     255c9c7
  255c9c3: 31 c0                 xor    %eax,%eax
  255c9c5: 5b                    pop    %rbx
  255c9c6: c3                    retq

And after:

  255a8ee: 48 8b 03              mov    (%rbx),%rax
  255a8f1: f6 40 e6 01           testb  $0x1,-0x1a(%rax)
  255a8f5: 74 04                 je     255a8fb
  255a8f7: 31 c0                 xor    %eax,%eax
  255a8f9: 5b                    pop    %rbx
  255a8fa: c3                    retq

==LLVM IR-level design==
Given a class name, how can we determine the closed set of derived classes
and possible function pointers for a call site? As it turns out, the IR
already has a way of expressing this information: bitsets [1], which are
already being used to implement CFI. We can encode the information for
devirtualization by combining the @llvm.bitset.test and @llvm.assume
intrinsics.
  %vtable = load i8**, i8*** %obj
  %p = call i1 @llvm.bitset.test(%vtable, “Base”)
  call void @llvm.assume(i1 %p)    ; %vtable is assumed to point to
                                   ; _ZTV4Base+16 or _ZTV7Derived+16.
  %fptrptr = getelementptr i8* %vtable, 1
  %fptr = load i8*, i8** %fptrptr  ; %fptr must point to Base::isOfType
                                   ; or Derived::isOfType.
  %fptr_casted = bitcast i8* %fptr to i1 (i8***, i32)
  %result = call i1 %fptr_casted(i8*** %obj, i32 0)

This gives the optimizer all the information it needs to implement the
optimization: the addresses of the virtual tables, the addresses of the
function pointers (within the virtual table initializers) and the constant
arguments. Note that the compiler does not need to do any of the work
described in the above link to lay out globals or create bitsets if all
calls to llvm.bitset.test are passed to llvm.assume.

If virtual constant propagation succeeds, the transformed IR will look like
this:

  %vtable = load i8*, i8** %obj
  %vtable_bit_addr = getelementptr i8* %vtable, i64 -17
  %vtable_bit = load i8, i8* %vtable_bit_addr
  %vtable_bit_and = and i8 %vtable_bit, 1
  %result = icmp ne i8 %vtable_bit_and, 0

The pass that applies this transformation will be placed early in the LTO
pipeline, before most of the regular optimization passes. The pass could
later be extended to do general devirtualization based on the bitset
information:

- The pass can implement single-deriver (i.e. only one derived class with a
  non-pure virtual function definition) and single-implementation (i.e.
  multiple derived classes all sharing a virtual function definition from a
  base class) devirtualization by checking that each of the possible values
  loaded from the vtables in the bitset is either identical or
  __cxa_pure_virtual (calls to pure virtuals are UB, so we’d be fine
  disregarding them), and propagating the identical address into the
  function callee.
- The pass could also potentially implement speculative devirtualization
  (i.e.
  testing the vtable entry and guarding direct calls on it) by generating
  a comparison against the known address point. The pass will know exactly
  how many possible classes there are for each call site, so we could have
  logic to speculatively devirtualize if that number is sufficiently small.

The design when CFI is enabled is slightly different, as we need to give the
compiler the ability to eliminate the unnecessary llvm.bitset.test call.
Recall that the IR looks like this:

  %vtable = load i8**, i8*** %obj
  %p = call i1 @llvm.bitset.test(%vtable, “Base”)
  br i1 %p, label %call, label %abort

  call:
  %fptr = load i8*, i8** %vtable
  %fptr_casted = bitcast i8* %fptr to i1 (i8***, i32)
  call i1 %fptr_casted(i8*** %obj, i32 0)

  abort:
  call void @llvm.trap()

We cannot simply have our optimization pass transform this IR to eliminate
the llvm.bitset.test call, as the transformation would not be
semantics-preserving (we already have a CFI setting that emits diagnostics
in the abort branch, which would break under such a transformation).
Instead, we can introduce a new intrinsic, @llvm.bitset.check, to be used
only in production mode (i.e. with diagnostics disabled):

  {i8*, i1} @llvm.bitset.check(i8* %ptr, metadata %name)

If %ptr is in the bitset identified by %name, the function returns {%ptr,
true}. But if %ptr is not in %name, the behaviour is undefined, except that
if the second element of the result is non-zero, the program may load and
call a function pointer from an address derived from the first element of
the result without causing an indirect function call to any function other
than one potentially loaded from one of the constant globals of which %name
is a member (i.e. one of the valid vtables for %name).
A CFI-protected virtual call would then look like this (eliding bitcasts
for brevity):

  %vtable = load i8**, i8*** %obj
  %pair = call {i8*, i1} @llvm.bitset.check(i8* %vtable, “Base”)
  %checked_vtable = extractvalue %pair, 0
  %p = extractvalue %pair, 1
  br i1 %p, label %call, label %abort

  call:
  %fptr = load i8*, i8** %checked_vtable
  %result = call i1 %fptr(i8*** %obj, i32 0)

  abort:
  call void @llvm.trap()

Applying virtual constant propagation to @llvm.bitset.check would be very
similar to applying it to @llvm.bitset.test, but if the constant propagation
succeeds, the intrinsic will return true as the second element:

  %vtable = load i8**, i8*** %obj
  br i1 true, label %call, label %abort

  call:
  %vtable_bit_addr = getelementptr i8* %vtable, i64 -17
  %vtable_bit = load i8, i8* %vtable_bit_addr
  %vtable_bit_and = and i8 %vtable_bit, 1
  %result = icmp ne i8 %vtable_bit_and, 0

  abort:
  call void @llvm.trap()

Now the existing SimplifyCFG pass can easily eliminate the unnecessary
branching.

In order to allow for regular constant folding through the llvm.bitset.check
intrinsic, the code in lib/Analysis/ConstantFolding.cpp would be taught to
look through the intrinsic’s argument.

--
Peter

[0] https://code.google.com/p/chromium/issues/detail?id=580389
[1] http://llvm.org/docs/BitSets.html
Mehdi Amini via llvm-dev
2016-Jan-28 06:29 UTC
[llvm-dev] Proposal: virtual constant propagation
Hi Peter,

Pete (Cooper, CC'ed) had a similar idea a few months ago about
devirtualization and LTO: we can know the set of possible overrides based on
the knowledge of a closed type hierarchy. He worked on a prototype, and I
continued to evolve it. I planned to submit something to the mailing list
but couldn’t fit it into my planning before April :)
Our plan was to remove virtual calls in general, or in some cases to turn
the indirect call into a switch table.

In both use cases (your constant propagation and devirtualization), I think
the interesting part is “how to figure out the set of possible callees for a
call site” and how we encode this in the IR.

I haven’t worked on this for a few weeks, but I’ll try to give a rough
description of where I stopped with my prototype: I was storing the full
inheritance tree in metadata. There was one metadata node per entry of the
vtable, and any method that overrides one from a base class has a metadata
attached pointing to the slot in the base class. The front-end is modified
to emit any load from a vtable through a new intrinsic, llvm.read.vtable,
that takes a pointer to the vtable, the index of the slot in the vtable, and
a pointer to the metadata that describes the slot for the current type.
Using the metadata, you can reach every override directly and so construct
the set of possible callees naturally and very easily. The pass that
transforms the IR is quite easy to implement, since you don’t need to walk
the IR but can directly visit all the uses of the llvm.read.vtable
intrinsic, with all the needed information in the operands.

I haven’t given much thought to your representation yet to know how it
compares (and I’m not very familiar with how CFI works), but I’m interested
in your feedback!

I was limited during my experiment by the lack of a black/white list to
define which hierarchies are/aren’t closed, so I’d be glad to see such a
flag added!
Best,

—
Mehdi

> On Jan 27, 2016, at 7:57 PM, Peter Collingbourne via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> [...]
Hans Wennborg via llvm-dev
2016-Jan-28 16:47 UTC
[llvm-dev] Proposal: virtual constant propagation
On Wed, Jan 27, 2016 at 7:57 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
> The pass that applies this transformation will be placed early in the LTO
> pipeline, before most of the regular optimization passes. The pass could
> later be extended to do general devirtualization based on the bitset
> information:
>
> - The pass can implement single-deriver (i.e. only one derived class with a
>   non-pure virtual function definition) and single-implementation
>   (i.e. multiple derived classes all sharing a virtual function definition
>   from a base class) devirtualization by checking that each of the possible
>   values loaded from the vtables in the bitset are either identical or
>   __cxa_pure_virtual (calls to pure virtuals are UB, so we’d be fine
>   disregarding them), and propagating the identical address into the
>   function callee.

In addition to the constant value propagation, which looks fantastic, I'm
also very interested in this aspect. For example, the Blink/WebKit public
interface is full of classes like [0], which have a single implementation
(sometimes there's also a mock implementation used in tests). In many
classes (like [1]), the methods are not pure-virtual but have dummy
implementations, but we should be able to figure out that the base class
constructor is never called except when constructing the single
implementation class.

Most of this interface might not be very hot, but optimizing the virtual
barrier away looks like a juicy opportunity for binary size and inlining
(and start-up time, if we can drop the vtables and avoid the dynamic
relocations).

Cheers,
Hans

[0] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/public/platform/WebScrollbar.h
[1] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/public/platform/WebScrollbarBehavior.h
Mehdi Amini via llvm-dev
2016-Jan-28 17:15 UTC
[llvm-dev] Proposal: virtual constant propagation
Hi, I just thought about another use case: VTable compression. If you know that an entry in the Vtable is never used, just remove it! I’d hope we could even eliminate some unused virtual functions from the final binary. — Mehdi> On Jan 27, 2016, at 10:29 PM, Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hi Peter, > > Pete (Cooper, CC'ed) had a similar idea a few months ago about devirtualization and LTO: we can know the set of possible overload based on the knowledge of a closed type hierarchy. He worked on a prototype, and I continued to evolve it. I planned to submit something on the mailing list but couldn’t fit it in my planning before April :) > Our plan was about removing virtual call in general, or maybe in some cases turning the indirect call into a switch table. > > In both use case (your constant propagation and devirtualization), I think the interesting part is “how to figure the set of possible callees for a call site” and how we encode this in the IR. > > I haven’t worked on this for a few weeks, but I’ll try to give a rough description of where I stopped on my prototype: > I was storing the full inheritance tree in metadata. There was one metadata per entry of the VTable, and any method that override one of the base class will have a metadata attached pointing to the slot in the base class. > The front-end is modified to emit any load of a VTable through a new intrinsic llvm.read.vtable that takes a pointer to the VTable, the index of the slot in the VTable, and a pointer to the metadata that describe the slot for the current type. Using the metadata you access directly to any override and construct naturally and very easily the set of possible overloads. The pass that transform the IR is quite easy to implement since you don’t need to walk the IR but access directly all the uses of the llvm.read.vtable intrinsic and have all the needed information in the operands. 
> I haven’t give much thought about your representation yet to know how it compares (and I’m not very familiar with how CFI works), but I’m interested in your feedback! > > I was limited during my experiment by the lack of black/white list to define what hierarchy is/isn’t closed, I’m glad if such flag/control would be added! > > Best, > > — > Mehdi > > > >> On Jan 27, 2016, at 7:57 PM, Peter Collingbourne via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> Hi all, >> >> I'd like to make the following proposal to implement an optimization called >> virtual constant propagation. >> >> ==Introduction=>> After enabling control flow integrity protection in Chromium, we have >> observed an unacceptable performance regression in certain critical layout >> microbenchmarks. Profiling [0] revealed that the cause of the regression was >> a large number of virtual calls, each requiring the overhead of a control >> flow integrity check. >> >> We took a closer look at the hottest virtual call sites, and found that they >> were mostly type checking functions of the following form: >> >> struct Base { >> enum Type { kBase, kDerived }; >> virtual bool isDerived() const { return false; } // or: >> virtual bool isOfType(Type t) const { return t == kBase; } >> }; >> struct Derived : Base { >> virtual bool isDerived() const { return true; } >> virtual bool isOfType(Type t) const { return t == kDerived; } >> }; >> >> bool f(Base *b) { >> return b->isDerived(); >> } >> bool g(Base *b) { >> return b->isOfType(Base::kDerived); >> } >> >> We can make the observation that if the type hierarchy is known to be closed >> and the definition of each virtual function and each virtual call is visible >> to the compiler, calls to either of these functions can be implemented more >> efficiently by loading directly from the vtable (we refer to this as virtual >> constant propagation): >> >> struct Base_VTable { >> bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1; >> }; >> Base_VTable 
base_vtable{false, true, false}; >> Derived_VTable derived_vtable{true, false, true}; >> bool f(Base *b) { >> return b->vtable->isDerived; >> } >> bool g(Base *b) { >> return b->vtable->isOfTypeDerived; >> } >> >> Another advantage of implementing the calls this way is that because there >> is no indirect call taking place, no control flow integrity check is necessary. >> >> I implemented a prototype of this idea that specifically targeted the hottest >> virtual function in one of Chromium’s layout benchmarks, and observed a >> performance improvement of median 1.62%, min -4.72%, max 204.52% for the >> layout benchmark suite. I’d like to emphasise that these numbers are from >> a regular LTO build without CFI, and that we expect to see better performance >> from a general implementation that targets all virtual functions. >> >> ==User interface=>> To instruct the compiler to assume that all type hierarchies are closed, a >> user can pass the -fwhole-program-vtables flag. The -fwhole-program-vtables >> flag requires the -flto flag to also be specified. >> >> Of course, there may be some type hierarchies that are not entirely closed, but >> the underlying assumption is that most hierarchies will not be. To support open >> hierarchies the user can also specify the path to a blacklist that lists the >> names of the open class hierarchies using a -fwhole-program-vtables-blacklist>> flag. By default, the compiler will assume that hierarchies in the std >> namespace are open. >> >> ==High level design=>> Devirtualization will take place at LTO time. Candidates for virtual constant >> propagation will be found by searching for call sites with all-constant >> arguments to virtual functions which in all derived classes are readnone, >> which means that it is a pure function of its arguments. 
For each (virtual >> function, arguments) candidate pair, the compiler will compute a offset in >> bits (positive or negative) from the class’s vtable address point in which >> to store the function result, extend the size of any vtable objects to cover >> any such offsets, evaluate the function to compute a result and store it >> at the correct offsets. To avoid breaking ABI, the vtable layout will not >> be changed in any way. This includes pointers to optimized functions in the >> vtable, and (for Itanium ABI) the distance between address points within a >> single vtable symbol. >> >> For example, the vtable for Base and Derived will look like this: >> >> struct Base_VTable { >> bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1; >> size_t padding : 61; >> size_t offset_to_top; >> size_t rtti; >> bool (*isDerived)(const Base *this); >> bool (*isOfType)(const Base *this, Type t); >> }; >> Base_VTable base_vtable{ >> false, true, false, 0, 0, 0, &Base::isDerived, &Base::IsOfType >> }; >> Base_VTable derived_vtable{ >> True, false, true, 0, 0, 0, &Derived::isDerived, &Derived::IsOfType >> }; >> >> The symbol _ZTV4Base, representing the start of Base’s vtable set, will >> be defined as base_vtable+8; similarly for _ZTV7Derived and derived_vtable+8. >> >> To avoid bloating virtual tables by too much in degenerate cases, there will >> be a limit for each virtual function on the number of bits that the virtual >> function may contribute to the size of the vtable, which we may fix at (say) >> 2 machine words. 
To give an idea of how the generated code would look, here is an asm snippet from the hottest virtual call in Chromium's layout benchmarks before the transformation:

255c9ae: 48 8b 03                mov    (%rbx),%rax
255c9b1: be 30 00 00 00          mov    $0x30,%esi
255c9b6: 48 89 df                mov    %rbx,%rdi
255c9b9: ff 90 e0 02 00 00       callq  *0x2e0(%rax)
255c9bf: 84 c0                   test   %al,%al
255c9c1: 74 04                   je     255c9c7
255c9c3: 31 c0                   xor    %eax,%eax
255c9c5: 5b                      pop    %rbx
255c9c6: c3                      retq

And after:

255a8ee: 48 8b 03                mov    (%rbx),%rax
255a8f1: f6 40 e6 01             testb  $0x1,-0x1a(%rax)
255a8f5: 74 04                   je     255a8fb
255a8f7: 31 c0                   xor    %eax,%eax
255a8f9: 5b                      pop    %rbx
255a8fa: c3                      retq

==LLVM IR-level design==
Given a class name, how can we determine the closed set of derived classes and possible function pointers for a call site? As it turns out, the IR already has a way of expressing this information: bitsets [1], which are already being used to implement CFI. We can encode the information for devirtualization by combining the @llvm.bitset.test and @llvm.assume intrinsics.

%vtable = load i8**, i8*** %obj
%p = call i1 @llvm.bitset.test(%vtable, "Base")
call void @llvm.assume(i1 %p)   ; %vtable is assumed to point to
                                ; _ZTV4Base+16 or _ZTV7Derived+16.
%fptrptr = getelementptr i8* %vtable, 1
%fptr = load i8*, i8** %fptrptr ; %fptr must point to Base::isOfType
                                ; or Derived::isOfType.
%fptr_casted = bitcast i8* %fptr to i1 (i8***, i32)*
%result = call i1 %fptr_casted(i8*** %obj, i32 0)

This gives the optimizer all the information it needs to implement the optimization: the addresses of the virtual tables, the addresses of the function pointers (within the virtual table initializers) and the constant arguments. Note that the compiler does not need to do any of the work described in the above link to lay out globals or create bitsets if all calls to llvm.bitset.test are passed to llvm.assume.
If virtual constant propagation succeeds, the transformed IR will look like this:

%vtable = load i8*, i8** %obj
%vtable_bit_addr = getelementptr i8* %vtable, i64 -17
%vtable_bit = load i8 %vtable_bit_addr
%vtable_bit_and = and i8 %vtable_bit, 1
%result = icmp ne %vtable_bit_and, 0

The pass that applies this transformation will be placed early in the LTO pipeline, before most of the regular optimization passes. The pass could later be extended to do general devirtualization based on the bitset information:

- The pass can implement single-deriver (i.e. only one derived class with a non-pure virtual function definition) and single-implementation (i.e. multiple derived classes all sharing a virtual function definition from a base class) devirtualization by checking that the possible values loaded from the vtables in the bitset are either identical or __cxa_pure_virtual (calls to pure virtuals are UB, so we'd be fine disregarding them), and propagating the identical address into the function callee.

- The pass could also potentially implement speculative devirtualization (i.e. testing the vtable entry and guarding direct calls on it) by generating a comparison against the known address point. The pass will know exactly how many possible classes there are for each call site, so we could have logic to speculatively devirtualize when that number is sufficiently small.

The design when CFI is enabled is slightly different, as we need to give the compiler the ability to eliminate the unnecessary llvm.bitset.test call.
Recall that the IR looks like this:

%vtable = load i8**, i8*** %obj
%p = call i1 @llvm.bitset.test(%vtable, "Base")
br i1 %p, label %call, label %abort

call:
%fptr = load i8*, i8** %vtable
%fptr_casted = bitcast i8* %fptr to i1 (i8***, i32)*
call i1 %fptr_casted(i8*** %obj, i32 0)

abort:
call void @llvm.trap()

We cannot simply have our optimization pass transform this IR to eliminate the llvm.bitset.test call, as the transformation would not be semantics-preserving (we already have a CFI setting that emits diagnostics in the abort branch, which would break under such a transformation). Instead, we can introduce a new intrinsic, @llvm.bitset.check, to be used only in production mode (i.e. with diagnostics disabled):

{i8*, i1} @llvm.bitset.check(i8* %ptr, metadata %name)

If %ptr is in the bitset identified by %name, the function returns {%ptr, true}. But if %ptr is not in %name, the behaviour is undefined, modulo that if the second element of the result is non-zero, the program may load and call a function pointer from an address derived from the first element of the result without causing an indirect function call to any function other than one potentially loaded from one of the constant globals of which %name is a member (i.e. one of the valid vtables for %name).
A CFI-protected virtual call would then look like this (eliding bitcasts for brevity):

%vtable = load i8**, i8*** %obj
%pair = call {i8*, i1} @llvm.bitset.check(i8* %vtable, "Base")
%checked_vtable = extractvalue %pair, 0
%p = extractvalue %pair, 1
br i1 %p, label %call, label %abort

call:
%fptr = load i8*, i8** %checked_vtable
%result = call i1 %fptr(i8*** %obj, i32 0)

abort:
call void @llvm.trap()

Applying virtual constant propagation to @llvm.bitset.check would be very similar to applying it to @llvm.bitset.test, but if the constant propagation succeeds, the intrinsic will return true as the second element:

%vtable = load i8**, i8*** %obj
br i1 true, label %call, label %abort

call:
%vtable_bit_addr = getelementptr i8* %vtable, i64 -17
%vtable_bit = load i8 %vtable_bit_addr
%vtable_bit_and = and i8 %vtable_bit, 1
%result = icmp ne %vtable_bit_and, 0

abort:
call void @llvm.trap()

Now the existing SimplifyCFG pass can easily eliminate the unnecessary branching.

In order to allow regular constant folding through the llvm.bitset.check intrinsic, the code in lib/Analysis/ConstantFolding.cpp would be taught to look through the intrinsic's argument.

-- 
Peter

[0] https://code.google.com/p/chromium/issues/detail?id=580389
[1] http://llvm.org/docs/BitSets.html
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Xinliang David Li via llvm-dev
2016-Jan-28 17:48 UTC
[llvm-dev] Proposal: virtual constant propagation
Peter,

Interesting idea. One small extension that can be done is to allow the optimization to work on a value range. For instance, if the argument type has a known limited range (an enum, for instance), you can still optimize when the argument is not a constant, via proper bit mapping/shifting. In your example, the following is also devirtualizable:

bool g(Base *b, Type t) {
  return b->isOfType(t);
}

David
Pete Cooper via llvm-dev
2016-Jan-28 18:03 UTC
[llvm-dev] Proposal: virtual constant propagation
Hi Hans

The prototype Mehdi and I worked on (which I hope we can contribute on its own, or as patches to Peter's implementation) is able to handle your use case. I'd be interested to know whether Peter's is able to handle this too, but I confess I don't yet know how his code is implemented.

Our code uses metadata to encode the most derived class you could be at a point in time. Paired with the complete class hierarchy in metadata, this allows us to work out which classes we could possibly be at a given call site. For example, and I hope this is similar to what you are describing:

class Base {
  virtual void foo();
};
class A : public Base {
  void foo() override;
};
class B : public A {
  // no override of foo()
};

A *p;
...
p->foo();

The call to foo here is known to be on an 'A' or any class derived from it. As we have the full hierarchy, we know it can only be one of [A, B]. Now we can look at the vtable entries of A and B, and see that both contain the same &foo in the slot we are reading from, and so we can devirtualize, as only a single foo could possibly be called from here (otherwise UB).

We are also able to look at the users of the vtable globals themselves and see which ones are ever used. If one isn't used, then no object can be an instance of that class, and we can remove it from the hierarchy. This could reduce the set of possible classes (or possible called functions at a given site) to 1, and we can devirtualize. That sounds like it might be the case for the WebScrollbar you mention, where the only possible implementation in the hierarchy is a single derived class.

Peter, I'd be interested in discussing this further, but I don't want to digress the conversation any further from what you are trying to discuss here. FWIW, I really like the proposal you have here and think we should get it in tree.
Cheers,
Pete

> On Jan 28, 2016, at 8:47 AM, Hans Wennborg via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Wed, Jan 27, 2016 at 7:57 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>> The pass that applies this transformation will be placed early in the LTO
>> pipeline, before most of the regular optimization passes. The pass could later
>> be extended to do general devirtualization based on the bitset information:
>>
>> - The pass can implement single-deriver (i.e. only one derived class with a
>>   non-pure virtual function definition) and single-implementation
>>   (i.e. multiple derived classes all sharing a virtual function definition
>>   from a base class) devirtualization by checking that each of the possible
>>   values loaded from the vtables in the bitset are either identical or
>>   __cxa_pure_virtual (calls to pure virtuals are UB, so we'd be fine
>>   disregarding them), and propagating the identical address into the
>>   function callee.
>
> In addition to the constant value propagation, which looks fantastic,
> I'm also very interested in this aspect.
>
> For example, the Blink/WebKit public interface is full of classes like
> [0], which have a single implementation (sometimes there's also a mock
> implementation used in tests). In many classes (like [1]), the methods
> are not pure-virtual but have dummy implementations, but we should be
> able to figure out that the base class constructor is never called
> except when constructing the single implementation class.
>
> Most of this interface might not be very hot, but optimizing the
> virtual barrier away looks like a juicy opportunity for binary
> size and inlining (and start-up time if we can drop the vtables and
> avoid the dynamic relocations).
>
> Cheers,
> Hans
>
> [0] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/public/platform/WebScrollbar.h
> [1] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/public/platform/WebScrollbarBehavior.h
Pete Cooper via llvm-dev
2016-Jan-28 18:11 UTC
[llvm-dev] Proposal: virtual constant propagation
Hi Peter

> On Jan 27, 2016, at 7:57 PM, Peter Collingbourne via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> For example, the vtable for Base and Derived will look like this:
>
> struct Base_VTable {
>   bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1;
>   size_t padding : 61;
>   size_t offset_to_top;
>   size_t rtti;
>   bool (*isDerived)(const Base *this);
>   bool (*isOfType)(const Base *this, Type t);
> };
> Base_VTable base_vtable{
>   false, true, false, 0, 0, 0, &Base::isDerived, &Base::isOfType
> };
> Base_VTable derived_vtable{
>   true, false, true, 0, 0, 0, &Derived::isDerived, &Derived::isOfType
> };

Given the above vtable, you still need to carry around the new booleans as well as the old pointers. Have you considered cases where the set of possible candidates is small and comparing the vtable pointer itself is sufficient?

For example:

class Base { virtual bool foo() = 0; };
class A : public Base { virtual bool foo() { return false; } };
class B : public Base { virtual bool foo() { return true; } };
class C : public Base { virtual bool foo() { return true; } };

Base *p;
...
p->foo();

Here, p->foo() is the same as 'p->vtable != vtable_A'. This would reduce both the number of elements you'd need to add to the vtable and the code size, as now you only need to load the value of the vtable pointer and not dereference it to get the boolean from inside.

Thanks,
Pete
Peter Collingbourne via llvm-dev
2016-Jan-28 18:25 UTC
[llvm-dev] Proposal: virtual constant propagation
On Thu, Jan 28, 2016 at 10:11:41AM -0800, Pete Cooper wrote:
> Given the above vtable, you still need to carry around the new booleans as well as the old pointers. Have you considered cases where the set of possible candidates is small and comparing the vtable pointer itself is sufficient?

Yes, this is one possible optimization I have considered.

Thanks,
-- 
Peter
Peter Collingbourne via llvm-dev
2016-Jan-28 18:41 UTC
[llvm-dev] Proposal: virtual constant propagation
On Thu, Jan 28, 2016 at 09:48:05AM -0800, Xinliang David Li wrote:
> Interesting idea. One small extension that can be done is to allow the
> optimization to work on a value range. For instance, if the argument type
> has a known limited range (enum for instance), you can still optimize it
> when the argument is not a constant via proper bit mapping/shifting.

Yes, but we'd need to be careful about this, as the C++ standard rules for enum ranges are rather complicated. See:

http://stackoverflow.com/questions/18195312/what-happens-if-you-static-cast-invalid-value-to-enum-class

Thanks,
-- 
Peter