Nick Lewycky
2013-Jul-12 00:45 UTC
[LLVMdev] design for an accurate ODR-checker with clang
Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it. For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units. Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class. The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods. Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number. Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless. In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables. In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic. Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library. Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130711/cde5bf59/attachment.html>
John McCall
2013-Jul-12 01:02 UTC
[LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <nlewycky at google.com> wrote:> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it. > > For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units. > > Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class. > > The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods. > > Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number. > > Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless. > > In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables. > > In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic. > > Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string. This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it. Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit. It would be informed by actual symbol visibility, of course. You should expect that there may be multiple hashing schemes (or versions thereof) in play and therefore build a simple prefixing scheme on your ODR symbols. John.
Nick Lewycky
2013-Jul-12 01:13 UTC
[LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang
On 11 July 2013 18:02, John McCall <rjmccall at apple.com> wrote:> On Jul 11, 2013, at 5:45 PM, Nick Lewycky <nlewycky at google.com> wrote: > > Hi! A few of us over at Google think a nice feature in clang would be > ODR violation checking, and we thought for a while about how to do this and > wrote it down, but we aren't actively working on it at the moment nor plan > to in the near future. I'm posting this to share our design and hopefully > save anyone else the design work if they're interested in it. > > > > For some background, C++'s ODR rule roughly means that two definitions > of the same symbol must come from "the same tokens with the same > interpretation". Given the same token stream, the interpretation can be > different due to different name lookup results, or different types through > typedefs or using declarations, or due to a different point of > instantiation in two translation units. > > > > Unlike existing approaches (the ODR checker in the gold linker for > example), clang lets us do this with no false positives and very few false > negatives. The basis of the idea is that we produce a hash of all the > ODR-relevant pieces, and to try to pick the largest possible granularity. > By granularity I mean that we would hash the entire definition of a class > including all methods defined lexically inline and emit a single value for > that class. > > > > The first step is to build a new visitor over the clang AST that > calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler > doesn’t work here because it includes pointers addresses which will be > different across different translation units.) Hash the outermost > declaration with external-linkage. For example, given a class with a method > defined inline, we start the visitor at the class, not at the method. The > entirety of the class must be ODR-equivalent across two translation units, > including any inline methods. > > > > Although the standard mentions that the tokens must be the same, we do > not actually include the tokens in the hash. The structure of the AST > includes everything about the code which is semantically relevant. Any > false positives that would be fixed by hashing the tokens either do not > impact the behaviour of the program or could be fixed by hashing more of > the AST. References to globals should be hashed by name, but references to > locals should be hashed by an ordinal number. > > > > Instantiated templates are also visited by the hashing visitor. If we > did not, we would have false negatives where the code is not conforming due > to different points of instantiation in two translation units. We can skip > uninstantiated templates since they don’t affect the behaviour of the > program, and we need to visit the instantiations regardless. > > > > In LLVM IR, create a new named metadata node !llvm.odr_checking which > contains a list of <mangled name, hash value> pairs. The names do not > necessarily correspond to symbols, for instance, a class will have a hash > value but does not have a corresponding symbol. For ease of implementation, > names should be mangled per the C++ Itanium ABI (demanglable with c++filt > -t). Merging modules that contain these will need to do ODR checking as > part of that link, and the resulting module will have the union of these > tables. > > > > In the .o file, emit a sorted table of <mangled name, hash value> in a > non-loadable section intended to be read by the linker. All entries in the > table must be checked if any symbol from this .o file is involved in the > link (note that there is no mapping from symbol to odr table name). If two > .o files contain different hash values for the same name, we have detected > an ODR violation and issue a diagnostic. > > > > Finally, teach the loader (RuntimeDyld) to do verification and catch ODR > violations when dlopen'ing a shared library. > > This is the right basic design, but I'm curious why you're suggesting that > the payload should just be a hash instead of an arbitrary string.What are you suggesting goes into this string?> This isn't going to be performant enough to do unconditionally at every > load no matter how much you shrink it. >Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls. I feel maybe I'm not understanding what you meant here? Also, you should have something analogous to symbol visibility as a way to> tell the static linker that something only needs to be ODR-checked within a > linkage unit. It would be informed by actual symbol visibility, of course. >Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right? You should expect that there may be multiple hashing schemes (or versions> thereof) in play and therefore build a simple prefixing scheme on your ODR > symbols.We could put the choice of hashing algorithm in the name of the llvm named metadata node, and in the name of the section in the .o files. Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130711/35c1d1fd/attachment.html>
Reasonably Related Threads
- [LLVMdev] design for an accurate ODR-checker with clang
- [LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang
- [LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang
- [LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang
- [LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang