Peter Collingbourne via llvm-dev
2016-Oct-26 00:27 UTC
[llvm-dev] RFC: a more detailed design for ThinLTO + vcall CFI
Hi all, As promised, here is a brain dump on how I see CFI for vcalls working under ThinLTO. Most of this has been prototyped, so the design does appear to be sound. For context on how CFI currently works under regular LTO, please read: http://llvm.org/docs/TypeMetadata.html http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html http://clang.llvm.org/docs/LTOVisibility.html ==== Summary extensions === The combined summary index would be extended to include a mapping from type identifiers to "resolutions". The resolution would control what type of code we generate to create a CFI check for that type identifier. Here are the resolutions that we would support: Inline32, Inline64, SingleBit: these would cause us to generate code as described in "Short Inline Bit Vectors" in the design document: http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#short-inline-bit-vectors AllOnes: this would cause us to generate code as described in "Eliminating Bit Vector Checks for All-Ones Bit Vectors" in the design document: http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#eliminating-bit-vector-checks-for-all-ones-bit-vectors Unsat: no vtable is a member of that type identifier, so we can simply replace type checks for that type identifier with "false" ByteArray: we emit the general form of the type check, similar to the one shown at the end of http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-virtual-calls just before "Optimizations" Armed with that information, we have a general idea of what the code to implement the type check would look like. In fact, given one of these values, the code will be identical to any other check that uses that resolution, modulo the constant values embedded in the code. To expand on what I mean by "constant values", let's look at a typical CFI check in the ByteArray case. Consider this module (based on test/Transforms/LowerTypeTests/simple.ll): @a = constant i32 1, !type !0, !type !2 @b = constant [63 x i32] zeroinitializer, !type !0, !type !1 @c = constant i32 3, !type !1, !type !2 @d = constant [2 x i32] [i32 4, i32 5], !type !3 !0 = !{i32 0, !"typeid1"} !3 = !{i32 4, !"typeid1"} !1 = !{i32 0, !"typeid2"} !2 = !{i32 0, !"typeid3"} define i1 @baz(i32* %p) { %x = call i1 @llvm.type.test(i8* %pi8, metadata !"typeid3") ret i1 %x } Here is the IR after the LowerTypeTests pass has run. I've marked the places where we use a constant value: define i1 @baz(i32* %p) { %pi8 = bitcast i32* %p to i8* %1 = ptrtoint i8* %pi8 to i32 %2 = sub i32 %1, ptrtoint ({ i32, [0 x i8], [63 x i32], [4 x i8], i32, [0 x i8], [2 x i32] }* @0 to i32) %3 = lshr i32 %2, 2 ; CONSTANT: rotate count %4 = shl i32 %2, 30 ; CONSTANT: 32-rotate count %5 = or i32 %3, %4 %6 = icmp ult i32 %5, 66 ; CONSTANT: size of byte array br i1 %6, label %7, label %12 ; <label>:7: ; preds = %0 %8 = getelementptr i8, i8* @bits_use, i32 %5 %9 = load i8, i8* %8 %10 = and i8 %9, 2 ; CONSTANT: bit mask %11 = icmp ne i8 %10, 0 br label %12 ; <label>:12: ; preds = %0, %7 %13 = phi i1 [ false, %0 ], [ %11, %7 ] ret i1 %13 } Here is what the asm for the above looks like: baz: leaq .L__unnamed_1(%rip), %rax subl %eax, %edi roll $30, %edi ; CONSTANT: 32-rotate count cmpl $65, %edi ; CONSTANT: size of byte array ja .LBB2_1 movslq %edi, %rax leaq .Lbits_use(%rip), %rcx movb (%rax,%rcx), %al andb $2, %al ; CONSTANT: bit mask shrb %al retq .LBB2_1: xorl %eax, %eax retq A naive summary encoding would map a type identifier to a tuple of (resolution, rotate count, size of byte array, bit mask), and pull the latter three out of the summary as constants. However, the disadvantage of hard coding the constants in the IR like this is that any change to one of the constants will invalidate any cache entry that depends on a constant value. For example, if I add a class to a hierarchy, that would most likely increase the size of the byte array, which may invalidate every cache entry containing a check for a class in that hierarchy. To avoid this, we only include resolutions in the summary and obtain the constants using absolute symbol references. Here is what the asm code may look like: baz: leaq __typeid_typeid3_global_addr(%rip), %rax subl %eax, %edi rorl $__typeid_typeid3_rotate_count, %edi cmpl $__typeid_typeid3_size, %edi ja .LBB2_1 movslq %edi, %rax leaq __typeid_typeid3_byte_array(%rip), %rcx movb (%rax,%rcx), %al andb $__typeid_typeid3_bitmask, %al shrb %al retq .LBB2_1: xorl %eax, %eax retq The appropriate representation for this at the IR level is the subject of the RFC entitled 'Absolute or "fixed address" symbols as immediate operands'. We can imagine, though, that it could look something like this: @__typeid_typeid3_rotate_count = external globalconst i8 @__typeid_typeid3_size = external globalconst i32 @__typeid_typeid3_bit_mask = external globalconst i8 @__typeid_typeid3_byte_array = external global i8 @__typeid_typeid3_global_addr = external global i8 define i1 @baz(i32* %p) { %pi8 = bitcast i32* %p to i8* %1 = ptrtoint i8* %pi8 to i32 %2 = sub i32 %1, ptrtoint (i8* @__typeid_typeid3_global_addr to i32) %3 = lshr i32 %2, i8 @__typeid_typeid3_rotate_count %4 = shl i32 %2, sub i8 (i8 32, i8 @__typeid_typeid3_rotate_count) %5 = or i32 %3, %4 %6 = icmp ult i32 %5, i32 @__typeid_typeid3_size br i1 %6, label %7, label %12 ; <label>:7: ; preds = %0 %8 = getelementptr i8, i8* @__typeid_typeid3_byte_array, i32 %5 %9 = load i8, i8* %8 %10 = and i8 %9, @__typeid_typeid3_bitmask %11 = icmp ne i8 %10, 0 br label %12 ; <label>:12: ; preds = %0, %7 %13 = phi i1 [ false, %0 ], [ %11, %7 ] ret i1 %13 } ==== Getting resolutions into the summary === Now that we've looked at how code is generated for the individual modules, let's look at how the resolutions are computed and put into the summary. The first step happens when we compile the translation unit with clang. I propose to change the bitcode format for any module that defines virtual tables with hidden LTO visibility. The bitcode would contain two modules: one to be compiled with regular LTO and the other to be compiled with ThinLTO. The regular LTO module would contain only the following: - the definitions of those virtual tables that have hidden LTO visibility - the !type metadata for those virtual tables - a list of type tests required by the corresponding ThinLTO module, in a GlobalMDNode named "llvm.export.type.tests" The ThinLTO module would contain the rest of the original module. For example, here is what the regular LTO module for our example above would look like: @a = constant i32 1, !type !0, !type !2 @b = constant [63 x i32] zeroinitializer, !type !0, !type !1 @c = constant i32 3, !type !1, !type !2 @d = constant [2 x i32] [i32 4, i32 5], !type !3 !0 = !{i32 0, !"typeid1"} !3 = !{i32 4, !"typeid1"} !1 = !{i32 0, !"typeid2"} !2 = !{i32 0, !"typeid3"} !8 = !{!"typeid3"} !llvm.export.type.tests = !{!8} and here is the ThinLTO module: define i1 @baz(i32* %p) { %x = call i1 @llvm.type.test(i8* %pi8, metadata !"typeid3") ret i1 %x } The regular LTO modules are merged and compiled in the usual way, so we would end up merging the type definitions together with the list of required data. When the LowerTypeTests pass is given a module with the "llvm.export.type.tests" global MDNode, it "exports" each of the type identifiers mentioned in the MDNode by storing a resolution in the combined summary, and by creating definitions of each of the symbols that shall be required to satisfy the link time dependencies in the individual ThinLTO modules. Here is an example of what the combined module would look like: @0 = [...] ; combined global for a, b, c and d @bits = [...] ; combined global byte array @__typeid_typeid3_global_addr = hidden alias i8, bitcast ({...}* @0 to i8*) @__typeid_typeid3_rotate_count = hidden globalconst i8 2 @__typeid_typeid3_size = hidden globalconst i32 65 @__typeid_typeid3_byte_array = hidden alias i8, i8* @bits @__typeid_typeid3_bit_mask = hidden globalconst i8 2 The ThinLTO backend processes would run after regular LTO, so they would have access to the resolutions in the summary. If the module summary is present, LowerTypeTests will "import" the appropriate type identifier by generating code containing external references to these symbols, as described previously. Thanks, -- -- Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161025/8404fc51/attachment.html>
Teresa Johnson via llvm-dev
2016-Oct-28 14:55 UTC
[llvm-dev] RFC: a more detailed design for ThinLTO + vcall CFI
Hi Peter, Thanks for sending this and sorry for the slow response. Some questions below. Teresa On Tue, Oct 25, 2016 at 5:27 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:> Hi all, > > As promised, here is a brain dump on how I see CFI for vcalls working > under ThinLTO. Most of this has been prototyped, so the design does appear > to be sound. For context on how CFI currently works under regular LTO, > please read: > > http://llvm.org/docs/TypeMetadata.html > http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html > http://clang.llvm.org/docs/LTOVisibility.html > > ==== Summary extensions ===> > The combined summary index would be extended to include a mapping from > type identifiers to "resolutions". The resolution would control what type > of code we generate to create a CFI check for that type identifier. Here > are the resolutions that we would support: > > Inline32, Inline64, SingleBit: these would cause us to generate code as > described in "Short Inline Bit Vectors" in the design document: > http://clang.llvm.org/docs/ControlFlowIntegrityDesign. > html#short-inline-bit-vectors > AllOnes: this would cause us to generate code as described in "Eliminating > Bit Vector Checks for All-Ones Bit Vectors" in the design document: > http://clang.llvm.org/docs/ControlFlowIntegrityDesign. > html#eliminating-bit-vector-checks-for-all-ones-bit-vectors > Unsat: no vtable is a member of that type identifier, so we can simply > replace type checks for that type identifier with "false" > ByteArray: we emit the general form of the type check, similar to the one > shown at the end of http://clang.llvm.org/docs/ControlFlowIntegrityDesign. > html#forward-edge-cfi-for-virtual-calls just before "Optimizations" > > Armed with that information, we have a general idea of what the code to > implement the type check would look like. In fact, given one of these > values, the code will be identical to any other check that uses that > resolution, modulo the constant values embedded in the code. > > To expand on what I mean by "constant values", let's look at a typical CFI > check in the ByteArray case. Consider this module (based on test/Transforms/ > LowerTypeTests/simple.ll): > > @a = constant i32 1, !type !0, !type !2 > @b = constant [63 x i32] zeroinitializer, !type !0, !type !1 > @c = constant i32 3, !type !1, !type !2 > @d = constant [2 x i32] [i32 4, i32 5], !type !3 > > !0 = !{i32 0, !"typeid1"} > !3 = !{i32 4, !"typeid1"} > !1 = !{i32 0, !"typeid2"} > !2 = !{i32 0, !"typeid3"} > > define i1 @baz(i32* %p) { > %x = call i1 @llvm.type.test(i8* %pi8, metadata !"typeid3") > ret i1 %x > } > > Here is the IR after the LowerTypeTests pass has run. I've marked the > places where we use a constant value: > > define i1 @baz(i32* %p) { > %pi8 = bitcast i32* %p to i8* > %1 = ptrtoint i8* %pi8 to i32 > %2 = sub i32 %1, ptrtoint ({ i32, [0 x i8], [63 x i32], [4 x i8], i32, > [0 x i8], [2 x i32] }* @0 to i32) > %3 = lshr i32 %2, 2 ; CONSTANT: rotate count > %4 = shl i32 %2, 30 ; CONSTANT: 32-rotate count > %5 = or i32 %3, %4 > %6 = icmp ult i32 %5, 66 ; CONSTANT: size of byte array > br i1 %6, label %7, label %12 > > ; <label>:7: ; preds = %0 > %8 = getelementptr i8, i8* @bits_use, i32 %5 > %9 = load i8, i8* %8 > %10 = and i8 %9, 2 ; CONSTANT: bit mask > %11 = icmp ne i8 %10, 0 > br label %12 > > ; <label>:12: ; preds = %0, %7 > %13 = phi i1 [ false, %0 ], [ %11, %7 ] > ret i1 %13 > } > > Here is what the asm for the above looks like: > > baz: > leaq .L__unnamed_1(%rip), %rax > subl %eax, %edi > roll $30, %edi ; CONSTANT: 32-rotate count > cmpl $65, %edi ; CONSTANT: size of byte array > ja .LBB2_1 > > movslq %edi, %rax > leaq .Lbits_use(%rip), %rcx > movb (%rax,%rcx), %al > andb $2, %al ; CONSTANT: bit mask > shrb %al > retq > .LBB2_1: > xorl %eax, %eax > retq > > A naive summary encoding would map a type identifier to a tuple of > (resolution, rotate count, size of byte array, bit mask), and pull the > latter three out of the summary as constants. However, the disadvantage of > hard coding the constants in the IR like this is that any change to one of > the constants will invalidate any cache entry that depends on a constant > value. For example, if I add a class to a hierarchy, that would most likely > increase the size of the byte array, which may invalidate every cache entry > containing a check for a class in that hierarchy. To avoid this, we only > include resolutions in the summary and obtain the constants using absolute > symbol references. Here is what the asm code may look like: > > baz: > leaq __typeid_typeid3_global_addr(%rip), %rax > subl %eax, %edi > rorl $__typeid_typeid3_rotate_count, %edi > cmpl $__typeid_typeid3_size, %edi > ja .LBB2_1 > > movslq %edi, %rax > leaq __typeid_typeid3_byte_array(%rip), %rcx > movb (%rax,%rcx), %al > andb $__typeid_typeid3_bitmask, %al > shrb %al > retq > .LBB2_1: > xorl %eax, %eax > retq > > The appropriate representation for this at the IR level is the subject of > the RFC entitled 'Absolute or "fixed address" symbols as immediate > operands'. We can imagine, though, that it could look something like this: > > @__typeid_typeid3_rotate_count = external globalconst i8 > @__typeid_typeid3_size = external globalconst i32 > @__typeid_typeid3_bit_mask = external globalconst i8 > @__typeid_typeid3_byte_array = external global i8 > @__typeid_typeid3_global_addr = external global i8 >Naive question: These will be defined in the object file containing the definition of the key method and therefore the vtable definition, right? So only that object would need to be recompiled on a change like the one you mentioned earlier when talking about caching (adding a class to a hierarchy), right?> > define i1 @baz(i32* %p) { > %pi8 = bitcast i32* %p to i8* > %1 = ptrtoint i8* %pi8 to i32 > %2 = sub i32 %1, ptrtoint (i8* @__typeid_typeid3_global_addr to i32) > %3 = lshr i32 %2, i8 @__typeid_typeid3_rotate_count > %4 = shl i32 %2, sub i8 (i8 32, i8 @__typeid_typeid3_rotate_count) > %5 = or i32 %3, %4 > %6 = icmp ult i32 %5, i32 @__typeid_typeid3_size > br i1 %6, label %7, label %12 > > ; <label>:7: ; preds = %0 > %8 = getelementptr i8, i8* @__typeid_typeid3_byte_array, i32 %5 > %9 = load i8, i8* %8 > %10 = and i8 %9, @__typeid_typeid3_bitmask > %11 = icmp ne i8 %10, 0 > br label %12 > > ; <label>:12: ; preds = %0, %7 > %13 = phi i1 [ false, %0 ], [ %11, %7 ] > ret i1 %13 > } > > ==== Getting resolutions into the summary ===> > Now that we've looked at how code is generated for the individual modules, > let's look at how the resolutions are computed and put into the summary. > > The first step happens when we compile the translation unit with clang. I > propose to change the bitcode format for any module that defines virtual > tables with hidden LTO visibility. The bitcode would contain two modules: > one to be compiled with regular LTO and the other to be compiled with > ThinLTO. The regular LTO module would contain only the following: > > - the definitions of those virtual tables that have hidden LTO > visibility > >Another naive question: What happens in the non-hidden case and why is it different here?> > - the !type metadata for those virtual tables > - a list of type tests required by the corresponding ThinLTO module, > in a GlobalMDNode named "llvm.export.type.tests" > > The ThinLTO module would contain the rest of the original module. >The two modules would presumably be distinguished because the regular LTO module would not have a summary block, because the summary for it would be generated on the fly during LTO processing of the merged regular LTO modules, right?> > For example, here is what the regular LTO module for our example above > would look like: > > @a = constant i32 1, !type !0, !type !2 > @b = constant [63 x i32] zeroinitializer, !type !0, !type !1 > @c = constant i32 3, !type !1, !type !2 > @d = constant [2 x i32] [i32 4, i32 5], !type !3 > > !0 = !{i32 0, !"typeid1"} > !3 = !{i32 4, !"typeid1"} > !1 = !{i32 0, !"typeid2"} > !2 = !{i32 0, !"typeid3"} > > !8 = !{!"typeid3"} > !llvm.export.type.tests = !{!8} > > and here is the ThinLTO module: > > define i1 @baz(i32* %p) { > %x = call i1 @llvm.type.test(i8* %pi8, metadata !"typeid3") > ret i1 %x > } > > The regular LTO modules are merged and compiled in the usual way, so we > would end up merging the type definitions together with the list of > required data. > > When the LowerTypeTests pass is given a module with the > "llvm.export.type.tests" global MDNode, it "exports" each of the type > identifiers mentioned in the MDNode by storing a resolution in the combined > summary, and by creating >Assuming I understand this correctly, that means that there would be some kind of type table section in the combined summary index bitcode file, that would look something like the following for the above case: <TYPEID_SUMMARY_BLOCK ...> <TYPE abbrevid=... resolutionid, "typeid3"> </TYPEID_SUMMARY_BLOCK> and included in the on-disk summary as a StringMap, right?> definitions of each of the symbols that shall be required to satisfy the > link time dependencies in the individual ThinLTO modules. Here is an > example of what the combined module would look like: > > @0 = [...] ; combined global for a, b, c and d > @bits = [...] ; combined global byte array > > @__typeid_typeid3_global_addr = hidden alias i8, bitcast ({...}* @0 to i8*) > @__typeid_typeid3_rotate_count = hidden globalconst i8 2 > @__typeid_typeid3_size = hidden globalconst i32 65 > @__typeid_typeid3_byte_array = hidden alias i8, i8* @bits > @__typeid_typeid3_bit_mask = hidden globalconst i8 2 > > The ThinLTO backend processes would run after regular LTO, so they would > have access to the resolutions in the summary. If the module summary is > present, LowerTypeTests will "import" the appropriate type identifier by > generating code containing external references to these symbols, as > described previously. >So in the above case where the summary says "typeid3" -> ByteArray, ThinLTO would on the ThinLTO module would see the @llvm.type.test of typeid3 in the IR, consult the type resolution table in the combined summary, and know to generate decls for those @__typeid_typeid3_* variables and the associated code, right? One thing that needs some thought is for the distributed case, where the thin link serializes out individual index files for each ThinLTO module. Presumably that step would first create the merged regular LTO module and emit a regular object file for it during the thin link - we'll just need to anticipate that under the appropriate options. But the bigger issue that needs thought now is how to know which subset of the type resolution summary table to emit into each individual index file. That needs to be known when processing the ThinLTO modules through the WriteIndexesThinBackend - won't the ThinLTO summaries need to include info about which typeids are referenced from that module? Otherwise you'd have to conservatively emit the entire type resolution summary table in each individual index file - doable but presumably a fair amount of overhead. Thanks, Teresa> Thanks, > -- > -- > Peter >-- Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413 -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161028/f24688a9/attachment.html>
Peter Collingbourne via llvm-dev
2016-Oct-28 23:34 UTC
[llvm-dev] RFC: a more detailed design for ThinLTO + vcall CFI
On Fri, Oct 28, 2016 at 7:55 AM, Teresa Johnson <tejohnson at google.com> wrote:> Hi Peter, > > Thanks for sending this and sorry for the slow response. Some questions > below. > > Teresa > > On Tue, Oct 25, 2016 at 5:27 PM, Peter Collingbourne <peter at pcc.me.uk> > wrote: > >> Hi all, >> >> As promised, here is a brain dump on how I see CFI for vcalls working >> under ThinLTO. Most of this has been prototyped, so the design does appear >> to be sound. For context on how CFI currently works under regular LTO, >> please read: >> >> http://llvm.org/docs/TypeMetadata.html >> http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html >> http://clang.llvm.org/docs/LTOVisibility.html >> >> ==== Summary extensions ===>> >> The combined summary index would be extended to include a mapping from >> type identifiers to "resolutions". The resolution would control what type >> of code we generate to create a CFI check for that type identifier. Here >> are the resolutions that we would support: >> >> Inline32, Inline64, SingleBit: these would cause us to generate code as >> described in "Short Inline Bit Vectors" in the design document: >> http://clang.llvm.org/docs/ControlFlowIntegrityDes >> ign.html#short-inline-bit-vectors >> AllOnes: this would cause us to generate code as described >> in "Eliminating Bit Vector Checks for All-Ones Bit Vectors" in the design >> document: http://clang.llvm.org/docs/ControlFlowIntegrityDes >> ign.html#eliminating-bit-vector-checks-for-all-ones-bit-vectors >> Unsat: no vtable is a member of that type identifier, so we can simply >> replace type checks for that type identifier with "false" >> ByteArray: we emit the general form of the type check, similar to the one >> shown at the end of http://clang.llvm.org/docs/ >> ControlFlowIntegrityDesign.html#forward-edge-cfi-for-virtual-calls just >> before "Optimizations" >> >> Armed with that information, we have a general idea of what the code to >> implement the type check would look like. In fact, given one of these >> values, the code will be identical to any other check that uses that >> resolution, modulo the constant values embedded in the code. >> >> To expand on what I mean by "constant values", let's look at a typical >> CFI check in the ByteArray case. Consider this module (based on >> test/Transforms/LowerTypeTests/simple.ll): >> >> @a = constant i32 1, !type !0, !type !2 >> @b = constant [63 x i32] zeroinitializer, !type !0, !type !1 >> @c = constant i32 3, !type !1, !type !2 >> @d = constant [2 x i32] [i32 4, i32 5], !type !3 >> >> !0 = !{i32 0, !"typeid1"} >> !3 = !{i32 4, !"typeid1"} >> !1 = !{i32 0, !"typeid2"} >> !2 = !{i32 0, !"typeid3"} >> >> define i1 @baz(i32* %p) { >> %x = call i1 @llvm.type.test(i8* %pi8, metadata !"typeid3") >> ret i1 %x >> } >> >> Here is the IR after the LowerTypeTests pass has run. I've marked the >> places where we use a constant value: >> >> define i1 @baz(i32* %p) { >> %pi8 = bitcast i32* %p to i8* >> %1 = ptrtoint i8* %pi8 to i32 >> %2 = sub i32 %1, ptrtoint ({ i32, [0 x i8], [63 x i32], [4 x i8], i32, >> [0 x i8], [2 x i32] }* @0 to i32) >> %3 = lshr i32 %2, 2 ; CONSTANT: rotate count >> %4 = shl i32 %2, 30 ; CONSTANT: 32-rotate count >> %5 = or i32 %3, %4 >> %6 = icmp ult i32 %5, 66 ; CONSTANT: size of byte array >> br i1 %6, label %7, label %12 >> >> ; <label>:7: ; preds = %0 >> %8 = getelementptr i8, i8* @bits_use, i32 %5 >> %9 = load i8, i8* %8 >> %10 = and i8 %9, 2 ; CONSTANT: bit mask >> %11 = icmp ne i8 %10, 0 >> br label %12 >> >> ; <label>:12: ; preds = %0, %7 >> %13 = phi i1 [ false, %0 ], [ %11, %7 ] >> ret i1 %13 >> } >> >> Here is what the asm for the above looks like: >> >> baz: >> leaq .L__unnamed_1(%rip), %rax >> subl %eax, %edi >> roll $30, %edi ; CONSTANT: 32-rotate count >> cmpl $65, %edi ; CONSTANT: size of byte array >> ja .LBB2_1 >> >> movslq %edi, %rax >> leaq .Lbits_use(%rip), %rcx >> movb (%rax,%rcx), %al >> andb $2, %al ; CONSTANT: bit mask >> shrb %al >> retq >> .LBB2_1: >> xorl %eax, %eax >> retq >> >> A naive summary encoding would map a type identifier to a tuple of >> (resolution, rotate count, size of byte array, bit mask), and pull the >> latter three out of the summary as constants. However, the disadvantage of >> hard coding the constants in the IR like this is that any change to one of >> the constants will invalidate any cache entry that depends on a constant >> value. For example, if I add a class to a hierarchy, that would most likely >> increase the size of the byte array, which may invalidate every cache entry >> containing a check for a class in that hierarchy. To avoid this, we only >> include resolutions in the summary and obtain the constants using absolute >> symbol references. Here is what the asm code may look like: >> >> baz: >> leaq __typeid_typeid3_global_addr(%rip), %rax >> subl %eax, %edi >> rorl $__typeid_typeid3_rotate_count, %edi >> cmpl $__typeid_typeid3_size, %edi >> ja .LBB2_1 >> >> movslq %edi, %rax >> leaq __typeid_typeid3_byte_array(%rip), %rcx >> movb (%rax,%rcx), %al >> andb $__typeid_typeid3_bitmask, %al >> shrb %al >> retq >> .LBB2_1: >> xorl %eax, %eax >> retq >> >> The appropriate representation for this at the IR level is the subject of >> the RFC entitled 'Absolute or "fixed address" symbols as immediate >> operands'. We can imagine, though, that it could look something like this: >> >> @__typeid_typeid3_rotate_count = external globalconst i8 >> @__typeid_typeid3_size = external globalconst i32 >> @__typeid_typeid3_bit_mask = external globalconst i8 >> @__typeid_typeid3_byte_array = external global i8 >> @__typeid_typeid3_global_addr = external global i8 >> > > Naive question: These will be defined in the object file containing the > definition of the key method and therefore the vtable definition, right? So > only that object would need to be recompiled on a change like the one you > mentioned earlier when talking about caching (adding a class to a > hierarchy), right? >These will all be defined in the object file produced during regular LTO. In principle we could support caching that object, but I was thinking more about the objects produced by the thin backends.> >> >> define i1 @baz(i32* %p) { >> %pi8 = bitcast i32* %p to i8* >> %1 = ptrtoint i8* %pi8 to i32 >> %2 = sub i32 %1, ptrtoint (i8* @__typeid_typeid3_global_addr to i32) >> %3 = lshr i32 %2, i8 @__typeid_typeid3_rotate_count >> %4 = shl i32 %2, sub i8 (i8 32, i8 @__typeid_typeid3_rotate_count) >> %5 = or i32 %3, %4 >> %6 = icmp ult i32 %5, i32 @__typeid_typeid3_size >> br i1 %6, label %7, label %12 >> >> ; <label>:7: ; preds = %0 >> %8 = getelementptr i8, i8* @__typeid_typeid3_byte_array, i32 %5 >> %9 = load i8, i8* %8 >> %10 = and i8 %9, @__typeid_typeid3_bitmask >> %11 = icmp ne i8 %10, 0 >> br label %12 >> >> ; <label>:12: ; preds = %0, %7 >> %13 = phi i1 [ false, %0 ], [ %11, %7 ] >> ret i1 %13 >> } >> >> ==== Getting resolutions into the summary ===>> >> Now that we've looked at how code is generated for the individual >> modules, let's look at how the resolutions are computed and put into the >> summary. >> >> The first step happens when we compile the translation unit with clang. I >> propose to change the bitcode format for any module that defines virtual >> tables with hidden LTO visibility. The bitcode would contain two modules: >> one to be compiled with regular LTO and the other to be compiled with >> ThinLTO. The regular LTO module would contain only the following: >> >> - the definitions of those virtual tables that have hidden LTO >> visibility >> >> > Another naive question: What happens in the non-hidden case and why is it > different here? >We do not normally enable CFI checks for virtual calls to non-hidden classes, so there's nothing to do about that case.>> - the !type metadata for those virtual tables >> - a list of type tests required by the corresponding ThinLTO module, >> in a GlobalMDNode named "llvm.export.type.tests" >> >> The ThinLTO module would contain the rest of the original module. >> > > The two modules would presumably be distinguished because the regular LTO > module would not have a summary block, because the summary for it would be > generated on the fly during LTO processing of the merged regular LTO > modules, right? >Right.> >> For example, here is what the regular LTO module for our example above >> would look like: >> >> @a = constant i32 1, !type !0, !type !2 >> @b = constant [63 x i32] zeroinitializer, !type !0, !type !1 >> @c = constant i32 3, !type !1, !type !2 >> @d = constant [2 x i32] [i32 4, i32 5], !type !3 >> >> !0 = !{i32 0, !"typeid1"} >> !3 = !{i32 4, !"typeid1"} >> !1 = !{i32 0, !"typeid2"} >> !2 = !{i32 0, !"typeid3"} >> >> !8 = !{!"typeid3"} >> !llvm.export.type.tests = !{!8} >> >> and here is the ThinLTO module: >> >> define i1 @baz(i32* %p) { >> %x = call i1 @llvm.type.test(i8* %pi8, metadata !"typeid3") >> ret i1 %x >> } >> >> The regular LTO modules are merged and compiled in the usual way, so we >> would end up merging the type definitions together with the list of >> required data. >> >> When the LowerTypeTests pass is given a module with the >> "llvm.export.type.tests" global MDNode, it "exports" each of the type >> identifiers mentioned in the MDNode by storing a resolution in the combined >> summary, and by creating >> > > Assuming I understand this correctly, that means that there would be some > kind of type table section in the combined summary index bitcode file, that > would look something like the following for the above case: > > <TYPEID_SUMMARY_BLOCK ...> > <TYPE abbrevid=... resolutionid, "typeid3"> > </TYPEID_SUMMARY_BLOCK> > > and included in the on-disk summary as a StringMap, right? >Right.> > >> definitions of each of the symbols that shall be required to satisfy the >> link time dependencies in the individual ThinLTO modules. Here is an >> example of what the combined module would look like: >> >> @0 = [...] ; combined global for a, b, c and d >> @bits = [...] ; combined global byte array >> >> @__typeid_typeid3_global_addr = hidden alias i8, bitcast ({...}* @0 to >> i8*) >> @__typeid_typeid3_rotate_count = hidden globalconst i8 2 >> @__typeid_typeid3_size = hidden globalconst i32 65 >> @__typeid_typeid3_byte_array = hidden alias i8, i8* @bits >> @__typeid_typeid3_bit_mask = hidden globalconst i8 2 >> >> The ThinLTO backend processes would run after regular LTO, so they would >> have access to the resolutions in the summary. If the module summary is >> present, LowerTypeTests will "import" the appropriate type identifier by >> generating code containing external references to these symbols, as >> described previously. >> > > So in the above case where the summary says "typeid3" -> ByteArray, > ThinLTO would on the ThinLTO module would see the @llvm.type.test of > typeid3 in the IR, consult the type resolution table in the combined > summary, and know to generate decls for those @__typeid_typeid3_* variables > and the associated code, right? >Right.> One thing that needs some thought is for the distributed case, where the > thin link serializes out individual index files for each ThinLTO module. > Presumably that step would first create the merged regular LTO module and > emit a regular object file for it during the thin link - we'll just need to > anticipate that under the appropriate options. But the bigger issue that > needs thought now is how to know which subset of the type resolution > summary table to emit into each individual index file. That needs to be > known when processing the ThinLTO modules through the > WriteIndexesThinBackend - won't the ThinLTO summaries need to include info > about which typeids are referenced from that module? Otherwise you'd have > to conservatively emit the entire type resolution summary table in each > individual index file - doable but presumably a fair amount of overhead. >Yes, that's a good point. Note that this is exactly the list of type identifiers that is emitted into "llvm.export.type.tests" in the regular part of each module. If we store this list in the individual module summaries instead, we can use it to produce both "llvm.export.type.tests" for the combined regular LTO module and the type resolution summaries in the individual index files. Thanks, -- -- Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161028/9f6f211c/attachment.html>