Peter Collingbourne via llvm-dev
2016-Mar-16 23:23 UTC
[llvm-dev] RFC: A new ABI for virtual calls, and a change to the virtual call representation in the IR
On Fri, Mar 4, 2016 at 2:48 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:

> On Mon, Feb 29, 2016 at 1:53 PM, <> wrote:
>>
>> @A_vtable = {i8*, i8*, i32, i32} {0, @A::rtti, @A::f - (@A_vtable + 16),
>> @A::g - (@A_vtable + 16)}
>>
>
> There's a subtlety about this aspect of the ABI that I should call
> attention to. The virtual function references can only be resolved directly
> by the static linker if they are defined in the same executable/DSO as the
> virtual table. I expect this to be the overwhelmingly common case, as
> classes are normally wholly defined within a single executable or DSO, so
> our implementation should be optimized around that case.
>
> If we expected cross-DSO references to be relatively common, we could make
> vtable entries be relative to GOT entries, but that would introduce an
> additional level of indirection and additional relocations, probably
> costing us more in binary size and memory bandwidth than the current ABI.
>
> However, it is technically possible to split the implementation of a
> class's virtual functions between DSOs, and there are more practical cases
> where we might expect to see cross-DSO references:
>
> - one DSO could derive from a class defined in another DSO, and only
>   override some of its virtual functions
> - the vtable could contain a reference to __cxa_pure_virtual which would
>   be defined by the standard library
>
> We can handle these cases by having the vtable refer to a PLT entry for
> each function that is not defined within the module. This can be done by
> using a specific type of relative relocation that refers directly to the
> symbol if defined within the current module, or to a PLT entry if not. This
> is the same type of relocation that is needed to implement relative
> branches on x86, so I'd expect it to be generally available on that
> architecture (ELF has R_{386,X86_64}_PLT32, Mach-O has X86_64_RELOC_BRANCH,
> COFF has IMAGE_REL_{AMD64,I386}_REL32, which may resolve to a thunk [1],
> which is essentially the same thing as a PLT entry). It is also present on
> ARM (R_ARM_PREL31, which was apparently added to support unwind tables).
>
> We still need some way to create PLT relocations in the vtable's
> initializer without breaking the semantics of a load from the vtable.
> Rafael and I discussed this and we believe that if the target function is
> unnamed_addr, this indicates that the function's address isn't observable
> (this is true for virtual functions, as it isn't possible to take their
> address), and so it could be substituted with the address of a PLT entry.

I've discovered a problem with this idea. Since we are using 32-bit
displacements, the offset from the vtable to the function must fit within
32 bits. This is assumed to be true in the medium code model, so long as
the displacement points to a real function address or a PLT entry. However,
if a vtable load is folded into a virtual call site, the folded code will
compute the function's actual address via the GOT. That address may lie in
another module, which could push the displacement outside the signed 32-bit
range and make the computed function address wrong.

To solve this problem, I reckon that the @llvm.vtable.load.relative
intrinsic I mentioned earlier will be required for correctness, and we
would have to lower it very late, e.g. in the pre-backend passes.

Peter
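For readers who want to see the 32-bit constraint concretely, here is a rough Python model of the quoted @A_vtable layout. It is only a sketch: the helper names and all addresses are made up for illustration, and struct's range check merely stands in for the relocation overflow a linker would report.

```python
import struct

# Two 8-byte slots (offset-to-top, RTTI pointer) precede the i32 entries,
# which is where the "+ 16" in the quoted initializer comes from.
ADDRESS_POINT_OFFSET = 16


def build_relative_vtable(vtable_addr, rtti_addr, function_addrs):
    """Pack {i8*, i8*, i32...} where each i32 is f - (vtable + 16)."""
    address_point = vtable_addr + ADDRESS_POINT_OFFSET
    blob = struct.pack("<qQ", 0, rtti_addr)
    for f in function_addrs:
        # struct.pack raises struct.error when the displacement does not fit
        # in a signed 32-bit field.
        blob += struct.pack("<i", f - address_point)
    return blob


def load_virtual_function(blob, vtable_addr, index):
    """Read entry `index` and add it back to the address point."""
    address_point = vtable_addr + ADDRESS_POINT_OFFSET
    (displacement,) = struct.unpack_from(
        "<i", blob, ADDRESS_POINT_OFFSET + 4 * index
    )
    return address_point + displacement


# Hypothetical addresses: both functions sit close to the vtable.
VTABLE = 0x40_0000
blob = build_relative_vtable(VTABLE, 0x50_0000, [0x40_1000, 0x3F_F000])
print(hex(load_virtual_function(blob, VTABLE, 1)))
```

Entries that point at a function or PLT entry in the same module always pack successfully under this model; a target more than 2 GiB away does not, which is the case the PLT indirection is meant to avoid.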
Peter Collingbourne via llvm-dev
2016-Mar-17 01:34 UTC
[llvm-dev] RFC: A new ABI for virtual calls, and a change to the virtual call representation in the IR
On Wed, Mar 16, 2016 at 4:23 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:

> I've discovered a problem with this idea. Since we are using 32-bit
> displacements, the offset from the vtable to the function must fit within
> 32 bits. This is assumed to be true in the medium code model, so long as
> the displacement points to a real function address or a PLT entry. However,
> if we combine a vtable load at a virtual call site, the code will evaluate
> the function address to the actual address of the function via the GOT, and
> that could push the displacement outside of the 32-bit boundary and cause
> an error in the evaluation of the function address.
>
> To solve this problem, I reckon that the @llvm.vtable.load.relative
> intrinsic I mentioned earlier will be required for correctness, and we
> would have to lower it very late, e.g. in the pre-backend passes.

Just to go into a little more detail so that everyone's on the same page.
The combining I'm talking about is the trivial devirtualization we've
always done in instcombine where a component of a vtable's initializer is
folded into a virtual call.
Here's an example from Chromium:

  call void bitcast (
      i8* getelementptr (i8,
          i8* bitcast (i32* getelementptr inbounds (
              { i8*, i8*, i32, i32, i32, i32, i32, i32 },
              { i8*, i8*, i32, i32, i32, i32, i32, i32 }* @_ZTVN4base12_GLOBAL__N_119CallDoStuffOnThreadE,
              i64 0, i32 2) to i8*),
          i64 sext (i32 trunc (i64 sub (
              i64 ptrtoint (void (%"class.base::SimpleThread"*)* @_ZN4base12SimpleThread5StartEv to i64),
              i64 ptrtoint (i32* getelementptr inbounds (
                  { i8*, i8*, i32, i32, i32, i32, i32, i32 },
                  { i8*, i8*, i32, i32, i32, i32, i32, i32 }* @_ZTVN4base12_GLOBAL__N_119CallDoStuffOnThreadE,
                  i64 0, i32 2) to i64)) to i32) to i64))
      to void (%"class.base::SimpleThread"*)*)(%"class.base::SimpleThread"* %3)

which can be simplified/renamed to:

  call void (VT + sext(trunc(f - VT)))()

The problem is that f is evaluated using its canonical address via the
GOT, rather than using f's PLT entry in the module VT was defined in (which
is what would normally happen when actually loading from the vtable), and
if f and VT were defined in different modules, that calculation may
overflow.

One possible solution would be to introduce a new constexpr kind for PLT
references, and use that for references to virtual functions in vtables.
But there's still at least a couple of problems I can think of.

1) Suppose that a module M1 contains strong definitions for f and VT, and a
   translation unit in a module M2 has enough information to produce an
   available_externally definition of VT. A virtual call in M2 would
   evaluate (VT-in-M1 + sext(trunc(f@plt-in-M2 - VT-in-M1))), which could
   overflow. Rafael suggested that we could use non-plt references in
   available_externally definitions to solve this problem.
2) Suppose that, in a separate scenario, VT has a linkonce_odr definition
   in modules M1 and M2, f has a strong definition in M1 and the dynamic
   loader picks the definition of VT in M2. A virtual call in M1 would
   evaluate (VT-in-M2 + sext(trunc(f-in-M1 - VT-in-M2))), which, again,
   could overflow.
I can't see a good solution to this problem.

We could most likely consider other constexpr extensions to cope with 2,
but this all gets rather complicated and hard to reason about. We're also
solving a problem which doesn't need to exist; we know exactly which
function we want, it's right there in the expression we're evaluating.
Which is why I think a better solution would be to use the intrinsic I
mentioned to load from relative vtables, and teach instcombine to recognize
this intrinsic and the specific form of the vtables so that we'd continue
to be able to do trivial devirtualization. We'll need the intrinsic later
to do whole-program devirtualization on relative vtables anyway, so I think
now's a good time to add it.

Thanks,
--
Peter
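The wrap-around in VT + sext(trunc(f - VT)) is easy to reproduce outside the compiler. The following Python sketch models only the integer arithmetic of the folded expression; the function names and every address are hypothetical, and 64-bit pointers are assumed.

```python
MASK64 = (1 << 64) - 1


def sext_trunc32(value: int) -> int:
    """Model `sext (trunc i64 to i32) to i64`: keep the low 32 bits, sign-extend."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


def folded_target(vt: int, f: int) -> int:
    """Model the folded call target VT + sext(trunc(f - VT)) with 64-bit pointers."""
    return (vt + sext_trunc32(f - vt)) & MASK64


# All addresses below are made up for illustration.
VT_IN_M1 = 0x0000_5555_0000_1000       # vtable address point in module M1
F_PLT_IN_M1 = VT_IN_M1 + 0x0010_0000   # f's PLT entry in M1, close to VT
F_IN_M2 = VT_IN_M1 + 0x1_2000_0000     # f's canonical address in M2, > 4 GiB away

# A nearby PLT entry survives the trunc/sext round trip unchanged.
print(hex(folded_target(VT_IN_M1, F_PLT_IN_M1)))
# The far canonical address loses its upper bits and names a different address.
print(hex(folded_target(VT_IN_M1, F_IN_M2)))
```

In this model the second result is off by exactly 2^32 from the intended callee, so the folded call would silently jump to the wrong place rather than fail at link time.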
John McCall via llvm-dev
2016-Mar-17 18:25 UTC
[llvm-dev] RFC: A new ABI for virtual calls, and a change to the virtual call representation in the IR
> On Mar 16, 2016, at 6:34 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
> We could most likely consider other constexpr extensions to cope with 2,
> but this all gets rather complicated and hard to reason about. We're also
> solving a problem which doesn't need to exist; we know exactly which
> function we want, it's right there in the expression we're evaluating.
> Which is why I think a better solution would be to use the intrinsic I
> mentioned to load from relative vtables, and teach instcombine to recognize
> this intrinsic and the specific form of the vtables so that we'd continue
> to be able to do trivial devirtualization. We'll need the intrinsic later
> to do whole-program devirtualization on relative vtables anyway, so I think
> now's a good time to add it.

I agree with your reasoning here, although I will repeat that I think an
address-of-PLT ConstantExpr would be useful in other situations. For
example, we could potentially use it in the ordinary v-table ABI to get a
cheaper relocation to a function defined outside of the module. There are
also interesting places we would use this in Swift, which does not make
unnecessarily strong guarantees about function-pointer equality.

I do not think this intrinsic is specific to v-tables, though, and it could
reasonably just be named "llvm.load.relative". You can still reasonably tie
devirtualization metadata to that, since I assume the "devirtualization" is
not actually specific to calls; it's really about folding loads based on
global information that there's (e.g.) only one possible value in a set.

John.
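As a rough illustration of why such an intrinsic need not be tied to v-tables, here is a Python sketch of a generic relative load and of folding it when every candidate table agrees. The semantics are an assumption made for this sketch (read a signed 32-bit value at ptr + offset, then add ptr), and the helper names, image layout, and addresses are all hypothetical.

```python
import struct


def load_relative(memory: bytes, base: int, ptr: int, offset: int) -> int:
    """Assumed semantics: read a signed 32-bit value at ptr + offset, then add ptr.

    `memory` is a flat image whose first byte lives at address `base`.
    """
    (displacement,) = struct.unpack_from("<i", memory, ptr + offset - base)
    return ptr + displacement


def fold_if_unique(memory: bytes, base: int, candidate_ptrs, offset: int):
    """Return the one address every candidate table yields at `offset`, else None."""
    targets = {load_relative(memory, base, p, offset) for p in candidate_ptrs}
    return targets.pop() if len(targets) == 1 else None


# Hypothetical image: two tables of two entries each, stored back to back.
BASE = 0x1000
TABLE_A, TABLE_B = BASE, BASE + 8
F, G, H = 0x2000, 0x2100, 0x2200
IMAGE = struct.pack("<iiii", F - TABLE_A, G - TABLE_A, F - TABLE_B, H - TABLE_B)

# Slot 0 resolves to F in both tables, so the load can be folded.
print(fold_if_unique(IMAGE, BASE, [TABLE_A, TABLE_B], 0))
# Slot 4 resolves to G in one table and H in the other, so it cannot.
print(fold_if_unique(IMAGE, BASE, [TABLE_A, TABLE_B], 4))
```

Nothing in the sketch knows that the tables are v-tables or that the result is called; the fold depends only on the set of possible tables, which is the point being made about tying the metadata to the load.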