thr3ads.net - llvm dev - [LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang [Jul 2013]

Richard Smith

2013-Jul-15 22:20 UTC

[LLVMdev] [cfe-dev] design for an accurate ODR-checker with clang

On Mon, Jul 15, 2013 at 3:12 PM, John McCall <rjmccall at apple.com> wrote:
> On Jul 11, 2013, at 6:13 PM, Nick Lewycky <nlewycky at google.com> wrote:
>
> On 11 July 2013 18:02, John McCall <rjmccall at apple.com> wrote:
>
>> On Jul 11, 2013, at 5:45 PM, Nick Lewycky <nlewycky at google.com> wrote:
>> > Hi! A few of us over at Google think a nice feature in clang would be
>> > ODR violation checking, and we thought for a while about how to do
>> > this and wrote it down, but we aren't actively working on it at the
>> > moment nor plan to in the near future. I'm posting this to share our
>> > design and hopefully save anyone else the design work if they're
>> > interested in it.
>> >
>> > For some background, C++'s ODR rule roughly means that two
>> > definitions of the same symbol must come from "the same tokens with
>> > the same interpretation". Given the same token stream, the
>> > interpretation can be different due to different name lookup results,
>> > or different types through typedefs or using declarations, or due to
>> > a different point of instantiation in two translation units.
>> >
>> > Unlike existing approaches (the ODR checker in the gold linker for
>> > example), clang lets us do this with no false positives and very few
>> > false negatives. The basis of the idea is that we produce a hash of
>> > all the ODR-relevant pieces, and try to pick the largest possible
>> > granularity. By granularity I mean that we would hash the entire
>> > definition of a class including all methods defined lexically inline
>> > and emit a single value for that class.
>> >
>> > The first step is to build a new visitor over the clang AST that
>> > calculates a hash of the ODR-relevant pieces of the code.
>> > (StmtProfiler doesn’t work here because it includes pointer addresses
>> > which will be different across different translation units.) Hash the
>> > outermost declaration with external linkage. For example, given a
>> > class with a method defined inline, we start the visitor at the
>> > class, not at the method. The entirety of the class must be
>> > ODR-equivalent across two translation units, including any inline
>> > methods.
>> >
>> > Although the standard mentions that the tokens must be the same, we
>> > do not actually include the tokens in the hash. The structure of the
>> > AST includes everything about the code which is semantically
>> > relevant. Any false positives that would be fixed by hashing the
>> > tokens either do not impact the behaviour of the program or could be
>> > fixed by hashing more of the AST. References to globals should be
>> > hashed by name, but references to locals should be hashed by an
>> > ordinal number.
>> >
>> > Instantiated templates are also visited by the hashing visitor. If we
>> > did not, we would have false negatives where the code is not
>> > conforming due to different points of instantiation in two
>> > translation units. We can skip uninstantiated templates since they
>> > don’t affect the behaviour of the program, and we need to visit the
>> > instantiations regardless.
>> >
>> > In LLVM IR, create a new named metadata node !llvm.odr_checking which
>> > contains a list of <mangled name, hash value> pairs. The names do not
>> > necessarily correspond to symbols, for instance, a class will have a
>> > hash value but does not have a corresponding symbol. For ease of
>> > implementation, names should be mangled per the C++ Itanium ABI
>> > (demanglable with c++filt -t). Merging modules that contain these
>> > will need to do ODR checking as part of that link, and the resulting
>> > module will have the union of these tables.
>> >
>> > In the .o file, emit a sorted table of <mangled name, hash value> in
>> > a non-loadable section intended to be read by the linker. All entries
>> > in the table must be checked if any symbol from this .o file is
>> > involved in the link (note that there is no mapping from symbol to
>> > odr table name). If two .o files contain different hash values for
>> > the same name, we have detected an ODR violation and issue a
>> > diagnostic.
>> >
>> > Finally, teach the loader (RuntimeDyld) to do verification and catch
>> > ODR violations when dlopen'ing a shared library.
>>
>> This is the right basic design, but I'm curious why you're suggesting
>> that the payload should just be a hash instead of an arbitrary string.
>
>
> What are you suggesting goes into this string?
>
>
> The same sorts of things that you were planning on hashing, but maybe not
> hashed.  It's up to you; having a full string would let you actually show
> a useful error message, but it definitely inflates binary sizes.  If you
> really think you can make this performant enough to do on every load, I
> can see how the latter would be important.
>
>> This isn't going to be performant enough to do unconditionally at every
>> load no matter how much you shrink it.
>>
>
> Every load of a shared object? That's not a fast operation even without
> odr checking, but the idea is to keep the total number of entries in the
> odr table small. It's less than the number of symbols, closer to the
> number of top-level decls.
>
>
> Your ABI dependencies are every declaration *that you ever rely on*.
> You've got to figure that that's going to be very large.  For a library
> of any significance, I'd be expecting this check to touch about half a
> megabyte of data, even with a 32-bit hash and some sort of clever
> prefixing scheme on the symbols.  That's a pretty major regression in
> library loading.
>
>> Also, you should have something analogous to symbol visibility as a way
>> to tell the static linker that something only needs to be ODR-checked
>> within a linkage unit.  It would be informed by actual symbol
>> visibility, of course.
>>
>
> Great point, and that needs to flow into the .o files as well. If a class
> has one visibility and its method has another, we want to skip the method
> when hashing the class, and need to emit an additional entry for the method
> alone? Is that right?
>
>
> Class hashes should probably only include virtual methods anyway, but yes,
> I think this is a good starting point.
>
> What do you want in the hash for a function anyway?  Almost everything is
> already captured by (1) the separate hashes for the nominal types
> mentioned and (2) the symbol mangling.  You're pretty much only missing
> the return type.  Oh, I guess you need the body's dependencies for inline
> functions.
>
We want to enforce the C++ ODR as much as is reasonably possible, so we
want to include the body for both classes and functions. That is, we
explicitly want to check for cases where two functions or classes happen to
have the same declaration but different definitions.

(Perhaps this also clarifies why we want a hash: an unhashed string would
contain as much entropy as the entirety of the source code...)
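
To make the hashing idea in the quoted proposal concrete, here is a minimal sketch, assuming a function body has already been flattened into a list of references. `Ref`, `hash_body` and `fnv1a` are hypothetical names invented for this illustration and are not clang APIs; FNV-1a is only a stand-in for whatever hash a real implementation would choose. The sketch shows globals hashed by name and locals hashed by first-use ordinal, with a fixed-size result however large the body is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// FNV-1a (64-bit). Chosen for the sketch because its result depends only on
// the input bytes, so two translation units agree on it.
std::uint64_t fnv1a(const std::string& bytes,
                    std::uint64_t h = 14695981039346656037ULL) {
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hypothetical, heavily simplified stand-in for a reference found in a body.
struct Ref {
    bool is_local;     // local variable, or an entity with linkage?
    std::string name;  // spelling in the source
};

// Globals contribute their name; locals contribute the order in which they
// were first seen, so the spelling of a local never reaches the hash.
std::uint64_t hash_body(const std::vector<Ref>& body) {
    std::unordered_map<std::string, std::size_t> ordinals;
    std::uint64_t h = fnv1a("");  // empty input yields the FNV offset basis
    for (const Ref& r : body) {
        assert(!r.name.empty());
        if (r.is_local) {
            // emplace leaves an existing entry alone and returns it.
            std::size_t ord =
                ordinals.emplace(r.name, ordinals.size()).first->second;
            h = fnv1a("L" + std::to_string(ord) + ";", h);
        } else {
            h = fnv1a("G" + r.name + ";", h);
        }
    }
    return h;
}
```

Under these assumptions, a body that references locals `i` and `n` and the global `printf` hashes the same after the locals are renamed to `k` and `m`, while replacing `printf` with `puts` changes the value (barring a hash collision).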

John McCall

2013-Jul-15 22:42 UTC

On Jul 15, 2013, at 3:20 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> We want to enforce the C++ ODR as much as is reasonably possible, so we
> want to include the body for both classes and functions. That is, we
> explicitly want to check for cases where two functions or classes happen
> to have the same declaration but different definitions.

Mmm.  So you want to warn the user that two libraries using different assertion
settings both use the standard library?

I think warning about actual differences in code, as opposed to differences in
type/vtable layout, is going to be pretty fraught with uninteresting positives,
but if you want to chase that rabbit, it's your time spent.

Anyway, you only need to hash in function bodies for inline functions unless
this is also an ELF abuse detector.  (*Whether* a function is inline seems like
a legitimate thing to hash for the function signature.)

John.
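
One way to read the narrower scheme suggested above is sketched below, assuming each function has been summarized into a small record. `FunctionInfo`, `function_odr_hash` and `fnv1a` are hypothetical names for this illustration only. The return type and the inline flag always feed the hash (the Itanium mangling of an ordinary function does not encode its return type), while the body contributes only when the function is inline.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1a (64-bit), used here only as a deterministic stand-in hash.
std::uint64_t fnv1a(const std::string& bytes,
                    std::uint64_t h = 14695981039346656037ULL) {
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hypothetical per-function summary; a real checker would derive it from
// the AST rather than from strings.
struct FunctionInfo {
    std::string return_type;   // not recoverable from the mangled name
    bool is_inline;            // hashed as part of the signature
    std::string body_summary;  // canonical description of the body
};

std::uint64_t function_odr_hash(const FunctionInfo& f) {
    assert(!f.return_type.empty());
    std::uint64_t h = fnv1a(f.return_type + ";");
    h = fnv1a(f.is_inline ? "inline;" : "out-of-line;", h);
    if (f.is_inline) {
        // Only inline functions may legitimately be defined in several
        // translation units, so only their bodies are compared.
        h = fnv1a(f.body_summary, h);
    }
    return h;
}
```

With this variant, two out-of-line definitions that differ only in their bodies get the same hash, which is exactly the case the other side of the discussion wants to flag; hashing `body_summary` unconditionally would give that stricter behaviour.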

JF Bastien

2013-Jul-15 23:26 UTC

> Mmm.  So you want to warn the user that two libraries using different
> assertion settings both use the standard library?
>
> I think warning about actual differences in code, as opposed to differences
> in type/vtable layout, is going to be pretty fraught with uninteresting
> positives, but if you want to chase that rabbit, it's your time spent.
It's probably desirable to choose to detect ODR violations on classes
or functions independently, although detecting calling convention
differences in functions (without looking at the rest of the code)
could also be useful. I've debugged enough debug+release mixing issues
in shared libraries to feel that pain.
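
As an illustration of the kind of debug+release mix meant here, the sketch below models a header whose class gains a member in debug builds. `Handle` and `owner_tag` are made-up names, and the `bool` template parameter is an assumption of this example: it stands in for a build macro such as `NDEBUG` so that both variants can be shown in one file.

```cpp
// The bool parameter plays the role of a build macro such as NDEBUG, so that
// both "builds" of the same header can coexist in one file for illustration.
template <bool DebugBuild>
struct Handle;

template <>
struct Handle<false> {      // what a release translation unit sees
    void* ptr;
};

template <>
struct Handle<true> {       // what a debug translation unit sees
    void* ptr;
    const char* owner_tag;  // extra bookkeeping field
};

// The two variants disagree on size, so objects passed between a debug and
// a release library would be misread. A class-level hash that covers the
// member list would differ between the two builds.
static_assert(sizeof(Handle<true>) > sizeof(Handle<false>),
              "debug and release layouts differ");
```

In a real program both variants would carry the same name, which is what makes the mismatch an ODR violation rather than two unrelated types.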

Richard Smith

2013-Jul-15 23:47 UTC

On Mon, Jul 15, 2013 at 3:42 PM, John McCall <rjmccall at apple.com> wrote:
> On Jul 15, 2013, at 3:20 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>
> We want to enforce the C++ ODR as much as is reasonably possible, so we
> want to include the body for both classes and functions. That is, we
> explicitly want to check for cases where two functions or classes happen to
> have the same declaration but different definitions.
>
>
> Mmm.  So you want to warn the user that two libraries using different
> assertion settings both use the standard library?
>
libstdc++ does not use assert. IIRC, nor does libc++ unless you use its "no
exceptions" mode.

> I think warning about actual differences in code, as opposed to
> differences in type/vtable layout, is going to be pretty fraught with
> uninteresting positives, but if you want to chase that rabbit, it's your
> time spent.
>
For code that already passes gold's --detect-odr-violations, the extra
testing for definitions of inline functions would effectively be checking
that we don't have two functions that are defined from the same token
sequence but are interpreted in different ways, so the uninteresting
positive rate should be rather low (or if not, then we've learned something
important...). For non-inline functions and classes, the checking would be
more novel, so the uninteresting positive rate is hard to be sure about.

> Anyway, you only need to hash in function bodies for inline functions
> unless this is also an ELF abuse detector.  (*Whether* a function is
> inline seems like a legitimate thing to hash for the function signature.)
>
Giving different definitions (for either functions or classes) in different
source files is one of the things we'd like to catch (although there are
probably more direct ways to do so than a full ODR checker, such as maybe
-Wmissing-prototypes).
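
The link-time check from the quoted proposal (sorted tables of <mangled name, hash value> pairs, one per .o file) can be sketched as a merge-style walk. This is a minimal sketch under stated assumptions: `OdrEntry`, `OdrTable` and `find_odr_violations` are hypothetical names, the tables are assumed to be sorted by name with unique names, and nothing here reflects an existing linker interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using OdrEntry = std::pair<std::string, std::uint64_t>;  // <mangled name, hash>
using OdrTable = std::vector<OdrEntry>;                  // sorted, names unique

// Walk both tables in lockstep and collect every name that appears in both
// with different hash values.
std::vector<std::string> find_odr_violations(const OdrTable& a,
                                             const OdrTable& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<std::string> violations;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        int order = a[i].first.compare(b[j].first);
        if (order < 0) {
            ++i;  // name only in a
        } else if (order > 0) {
            ++j;  // name only in b
        } else {
            if (a[i].second != b[j].second) {
                violations.push_back(a[i].first);
            }
            ++i;
            ++j;
        }
    }
    return violations;
}
```

Because both inputs are sorted, the comparison is linear in the combined table size and needs no mapping from symbols to table entries, which matches the note in the proposal that no such mapping exists.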

John Bytheway

2013-Jul-16 02:50 UTC

[LLVMdev] design for an accurate ODR-checker with clang

On 2013-07-15 18:20, Richard Smith wrote:
> On Mon, Jul 15, 2013 at 3:12 PM, John McCall <rjmccall at apple.com> wrote:
>
>     On Jul 11, 2013, at 6:13 PM, Nick Lewycky <nlewycky at google.com> wrote:
>>         This is the right basic design, but I'm curious why you're
>>         suggesting that the payload should just be a hash instead of
>>         an arbitrary string.
>>
>>
>>     What are you suggesting goes into this string?
>
>     The same sorts of things that you were planning on hashing, but
>     maybe not hashed.  It's up to you; having a full string would let
>     you actually show a useful error message, but it definitely inflates
>     binary sizes.  If you really think you can make this performant
>     enough to do on every load, I can see how the latter would be
>     important.
>
> (Perhaps this also clarifies why we want a hash: an unhashed string
> would contain as much entropy as the entirety of the source code...)

Maybe you can't afford to store the unhashed data for everything, but
what about an option to store it for particular function(s)/class(es)?
That way, once an ODR violation has been detected through the hash
infrastructure, the compilation/linking can be repeated with more data
stored, and yield a decent error message about what exactly changed
between the two definitions.

John Bytheway
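
The two-pass idea above could look roughly like the following sketch, assuming each table entry can optionally carry the unhashed description next to its hash. `OdrRecord`, `detail` and `diagnose` are hypothetical names for this illustration; how a user would request the extra data for a given name is left open, as it is in the message.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical table entry: the hash is always present, the unhashed
// description only when it was requested for this name.
struct OdrRecord {
    std::uint64_t hash;
    std::string detail;  // empty when no unhashed data was stored
};

// Returns an empty string when the two records agree, otherwise a message
// whose usefulness depends on whether the detail was stored on both sides.
std::string diagnose(const std::string& name,
                     const OdrRecord& a, const OdrRecord& b) {
    assert(!name.empty());
    if (a.hash == b.hash) {
        return "";
    }
    if (a.detail.empty() || b.detail.empty()) {
        // First pass: only the hashes are available.
        return name + ": definitions differ; rebuild with detail stored "
                      "for this name to see what changed";
    }
    // Second pass: both sides carry the unhashed data.
    return name + ": definitions differ:\n  " + a.detail + "\n  " + b.detail;
}
```

The default build then pays only for the fixed-size hash, and the larger payload is stored for the few names that a first pass has already flagged.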
