thr3ads.net - llvm dev - [LLVMdev] RFC: How can AddressSanitizer, ThreadSanitizer, and similar runtime libraries leverage shared library code? [Jun 2012]

Chandler Carruth

2012-Jun-20 05:59 UTC

[LLVMdev] RFC: How can AddressSanitizer, ThreadSanitizer, and similar runtime libraries leverage shared library code?

On Tue, Jun 19, 2012 at 10:46 PM, Kostya Serebryany <kcc at google.com> wrote:
>
>
> On Wed, Jun 20, 2012 at 9:39 AM, Chandler Carruth <chandlerc at google.com> wrote:
>
>> On Tue, Jun 19, 2012 at 9:07 PM, Kostya Serebryany <kcc at google.com> wrote:
>>
>>> +dvyukov
>>>
>>> On Wed, Jun 20, 2012 at 7:12 AM, Chandler Carruth <chandlerc at google.com> wrote:
>>>
>>>> Hello folks (and sorry if I've forgotten to CC anyone with particular
>>>> interest to this discussion...):
>>>>
>>>> I've been thinking a lot about how best to build advanced runtime
>>>> libraries like ASan, and scale them up. Note that this does *not* try to
>>>> address any licensing issues. For now, I'll consider those orthogonal /
>>>> solvable w/o technical contortions. =]
>>>>
>>>> My primary motivation: we really, *really* need runtime libraries to be
>>>> able to use common, shared libraries.
>>>>
>>>
>>> I am not sure you understand the problem as we do.
>>>
>>> In short, asan/tsan/msan/etc can not use any function which is also
>>> called from the instrumented binary.
>>>
>>
>> Well, I can't be sure, but this description certainly agrees with my
>> understanding -- you need *every* part of the runtime to be completely
>> separate from *every* part of the instrumented binary. I'm with you there.
>>
>> In particular, I think the current strategy for libc & system calls makes
>> perfect sense, and I'm not trying to suggest changing it.
>>
>> I think the most similar situation is this one:
>>
>>> In the previous version of ThreadSanitizer we used a private copy of
>>> STLport in a separate namespace and a custom libc (small subset).
>>>
>>
>> My proposal is very similar except without the need to modify the C++
>> standard library in use. Instead, I'm suggesting post-processing the
>> library to ensure that the standard C++ library code in the runtime is kept
>> completely distinct from that in the instrumented binary -- everything would
>> in fact be *mangled* differently.
>>
>> The goal would be to avoid the maintenance overhead of a custom C++
>> standard library, and instead use a normal one. My understanding is that
>> both GCC's libstdc++ and LLVM's libc++ are significantly higher quality
>> than STLport, and if we're doing static linking, the code bloat should be
>> greatly reduced. We could reduce it still further by doing LTO of the
>> runtime library, which should be very straightforward given the rest of my
>> proposal.
>>
>> It would still require a very small subset of libc, likely not much more
>> than you already have.
>>
>>> This worked, but had problems too (Dmitry was very angry at STLport for
>>> code bloat, stack size increase and some direct libc calls).
>>>
>>
>> I would be interested to know if the above addresses most of the problems
>> or not.
>>
>>
>>> Until recently this was not causing too much pain in asan/tsan, but
>>> our attempts to use the LLVM DWARF readers made it worse.
>>> When tsan finds a race, we need to symbolize it online to be able to
>>> match against a suppression and decide whether we want to emit the warning.
>>> Today we do it in a separate addr2line process (ugly and slow).
>>> But if we start calling the LLVM dwarf reader we end up with all
>>> possible dependency problems (Dmitry and Alexey will know the exact ones)
>>> because the LLVM code calls to malloc, memcpy, etc.
>>>
>>> Frankly, I don't have any solution other than to change the code such
>>> that it does not call libc/libc++.
>>> Some of that may be solved by a private copy of STLport + a bit of
>>> custom libc (but see above about STLport)
>>>
>>
>> I think my proposal is essentially in between these two:
>>
>> - Avoid the need for a low quality STL by using a normal C++ standard
>> library implementation, and avoid maintenance burden by doing a link-time
>> mangling of the symbols.
>>
>
> re-linking might be too platform specific.
> How about compiling the library into LLVM bitcode and adding
> namespaces/prefixes to that bitcode?
>
Re-linking is a bit platform specific...

It would definitely work on ELF platforms, and likely on Darwin, but
Windows is tricky.

On Windows we would at least need a custom tool, but such a tool would be
quite easy to write I suspect. We could even use the very LLVM libraries in
question to write it! ;] Amusingly, I think with the LLVM libraries we
could very easily write a custom tool just to mangle the symbol names in a
collection of object files and have it work on *most* platforms!

Still, the bitcode idea is interesting. Doing this entirely in bitcode has
some advantages as these types of runtimes are among the best uses for
things like LTO: they're small, performance sensitive, can enumerate the
entry points easily, and are likely to have a particular need for dead code
elimination.

One nice thing is that I suspect we could do any of these three options
(re-linking, a custom tool, or bitcode) and get equivalent output from them.
It may not matter which strategy is used long term; we can use the easiest
one to implement short term.

>
> --kcc
>
>
>
>> - Provide the minimal custom libc, and do the same to it
>> - Link the LLVM libraries against these, and munge their symbols as well
>> - LTO the whole thing if needed to get the code bloat down
>>
>> I think this is actually easier than changing the LLVM libraries to not
>> use the C++ standard libraries. I also think it is easier than
>> re-implementing the LLVM libraries in question. But that doesn't mean I
>> think it is easy. ;] I think it is quite hard, but it is the best solution
>> I can come up with.
>>
>>
>>>
>>> --kcc
>>>
>>>
>>>
>>>> This starts with libraries such as the C++ standard library -- a
>>>> runtime shouldn't need to re-implement std::vector. It includes other
>>>> primitive libraries that have had significant effort put into them in LLVM
>>>> such as the ADT and Support libraries. But, IMO, it has even more
>>>> importance as we start looking at libraries such as ELF readers, DWARF
>>>> readers, symbolizers, etc. This code should be shared, and shared easily,
>>>> with other LLVM projects.
>>>>
>>>> However, clearly the runtime must at some point be linked against a
>>>> program, and indeed programs which may be using *the same set of
>>>> libraries*. It is crucially important that the runtime uses a separate
>>>> implementation of the libraries from the ones used by the program itself:
>>>> we will often compile the program's libraries with instrumentation and
>>>> other features which we explicitly wish to avoid in the runtime. Even
>>>> simple name clashes can cause problems, leading to the current practice of
>>>> putting all of these runtime libraries into a '__sanitizer' or other
>>>> specially spelled namespace.
>>>>
>>>> A final unusual requirement is that at least *some* of the code for the
>>>> runtime libraries must be statically linked to have reasonable efficiency.
>>>> We also have several use cases where it would be very convenient to link
>>>> *all* of the runtime statically, so I prefer a solution that preserves this
>>>> option.
>>>>
>>>> So how can we effectively share code? Here is my proposal, and a few
>>>> alternate strategies.
>>>>
>>>> I suggest that we build the runtime library as-if it were not a runtime
>>>> library at all, and just a normal library. No strange namespaces, no
>>>> restrictions on what other libraries it uses with one exception: they must
>>>> *all* be statically linkable. We build this as a normal archive library,
>>>> nothing special. One nice property is that testing the runtime library
>>>> becomes the same as testing any other library.
>>>>
>>>> Then, we have a special build step to produce a final archive which is
>>>> actually *used* as the runtime library. This step works not dissimilarly to
>>>> the step to link an executable: we build the list of archive libraries
>>>> depended on, but instead of linking an executable, we run a linker script
>>>> over them. This script will re-link each '.o' file from the transitive
>>>> closure of archives, prepending a '__asan__' (or other runtime library
>>>> prefix) onto each symbol; effectively mangling each symbol. All of these
>>>> processed '.o' files would go into a single, final archive that would be
>>>> the installed runtime library. The only functions not processed in this
>>>> manner are a white list of "exported" functions from the runtime (C-library
>>>> routines provided by the runtime, and runtime entry points, etc.).
>>>>
>>>> The result should be a runtime library that is essentially hermetic,
>>>> and should have no clashes with binaries it links against. It would be free
>>>> to use standard libraries, LLVM libraries, whatever it needs. That said,
>>>> there are some clear disadvantages:
>>>> - Bizarre name mangling, especially for C++
>>>> - Potentially incompatible with C++ EH, libunwind, or other tools (I
>>>> just don't know, haven't done enough research here)
>>>> - Requires "relinking" the final runtime
>>>> - Definitely implementable on Linux & ELF-based BSDs, I *think* do-able
>>>> on Darwin, but I have no idea about Windows.
>>>> - Other downsides? I'm probably missing some big problems here... ;]
>>>>
>>>> However, if we can make this (possibly with tweaks/modifications) work,
>>>> I think the upside is quite large -- the runtime library stops having to be
>>>> written in such a strange special sub-set of the language, etc.
>>>>
>>>>
>>>> Note that this proposal is orthogonal to the issue of minimizing the
>>>> binary size and cost of the runtime library -- that is clearly still an
>>>> important concern, but that can be addressed both with or without using
>>>> other libraries. LLVM has lots of libraries specifically engineered to be
>>>> lightweight in situations like this.
>>>>
>>>>
>>>> Other alternatives that have been discussed:
>>>>
>>>> - Require isolating all shared code into a shared library (.so) that is
>>>> loaded as-needed. This helps some, but it doesn't seem to fully solve the
>>>> issues (where does the shared code go? the .so? What happens when it is
>>>> loaded into a program that already has copies of the same code? What
>>>> happens when one is instrumented and the other isn't?). It also requires us
>>>> to ship the '.so' with the binary to get full functionality, something that
>>>> would be at least somewhat undesirable. It also requires the runtime
>>>> library developers to carefully partition the code into that which can go
>>>> in the .a and that which can go in the .so.
>>>>
>>>> - The current strategy of re-implementing everything needed from
>>>> (essentially) the ground up inside the runtime library. I think that this
>>>> has serious long-term maintenance problems.... but who knows, maybe?
>>>>
>>>> - Other ideas?
>>>>
>>>
>>>
>>

Chandler Carruth

2012-Jun-20 06:05 UTC

On Tue, Jun 19, 2012 at 10:59 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Tue, Jun 19, 2012 at 10:46 PM, Kostya Serebryany <kcc at google.com> wrote:
>
>> re-linking might be too platform specific.
>> How about compiling the library into LLVM bitcode and adding
>> namespaces/prefixes to that bitcode?
>>
>
> Re-linking is a bit platform specific...
>
> It would definitely work on ELF platforms, and likely on Darwin, but
> Windows is tricky.
>
> On Windows we would at least need a custom tool, but such a tool would be
> quite easy to write I suspect. We could even use the very LLVM libraries in
> question to write it! ;] Amusingly, I think with the LLVM libraries we
> could very easily write a custom tool just to mangle the symbol names in a
> collection of object files and have it work on *most* platforms!
>
> Still, the bitcode idea is interesting. Doing this entirely in bitcode has
> some advantages as these types of runtimes are among the best uses for
> things like LTO: they're small, performance sensitive, can enumerate the
> entry points easily, and are likely to have a particular need for dead code
> elimination.
>
One reason to want to have some support for doing this w/o bitcode: we may
not have the bitcode. Specifically, the goal would be to use the "normal"
C++ standard library, provided it is available to link statically
(libstdc++ and libc++ certainly are, I don't know about MSVC). That would
be much easier if we can actually use the existing archive file, and just
"fix" the .o files inside it.

It seems likely to be the equivalent of an 'ld -r' run with a linker script
to munge the symbol names, or potentially a custom tool written with the
LLVM object file libraries.
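
To make the symbol-munging step concrete, here is a minimal Python sketch of
how a rename list might be produced. It is not part of the proposal: the
function names (build_rename_map, format_redefine_syms) and the sample symbol
list are hypothetical, and only the '__asan__' prefix and the idea of a
white list of exported entry points come from the thread. The "old new" line
format is the one GNU objcopy's --redefine-syms option reads.

```python
def build_rename_map(symbols, prefix, exported):
    """Map every symbol that is not white-listed to prefix + symbol.

    `exported` holds the runtime entry points that must keep their names.
    """
    exported = set(exported)
    return {s: prefix + s for s in symbols if s not in exported}


def format_redefine_syms(rename_map):
    """Render one 'old new' pair per line, sorted for reproducible output."""
    return "".join(f"{old} {new}\n" for old, new in sorted(rename_map.items()))


# Illustrative symbol names only; a real list would come from the object files.
symbols = [
    "malloc",
    "memcpy",
    "_ZNSt6vectorIiSaIiEE9push_backERKi",
    "__asan_init",
]
rename_map = build_rename_map(symbols, "__asan__", exported={"__asan_init"})
print(format_redefine_syms(rename_map), end="")
```

The resulting file could then, presumably, be applied to each '.o' extracted
from the archives with something like 'objcopy --redefine-syms=FILE', so that
definitions and references inside the runtime are renamed consistently while
the white-listed entry points stay visible to the instrumented program.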

Dmitry Vyukov

2012-Jun-21 07:21 UTC

Hi,

Yes, STLport was a pain to deploy and maintain, and it calls the normal
operator new/delete (there is no way to put them into a separate namespace).

Note that in some codebases we build asan/tsan runtimes from source. How
will the build process look with that object file mangling? How easy is it
to integrate into a custom build process?

Soon I will start integrating tsan into the Go language. For Go we
need very simple object files: no global ctors, no thread-local storage, no
weak symbols or other trickery. Basically what a portable C compiler could
have produced.
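
One of these constraints can be checked mechanically. The sketch below, in
Python, scans 'nm'-style output for weak symbols; it assumes the default
"value type name" layout of GNU nm, where types W, w, V and v mark weak
symbols. The helper name find_weak_symbols and the sample listing are made up
for illustration, and the check says nothing about global ctors or
thread-local storage.

```python
WEAK_TYPES = {"W", "w", "V", "v"}  # weak symbol / weak object codes in nm output


def find_weak_symbols(nm_output):
    """Return the names of weak symbols found in nm-style text output."""
    weak = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols print "value type name"; undefined ones omit the value.
        if len(fields) == 3:
            _, sym_type, name = fields
        elif len(fields) == 2:
            sym_type, name = fields
        else:
            continue
        if sym_type in WEAK_TYPES:
            weak.append(name)
    return weak


# Made-up listing; real input would be the output of running nm on a '.o' file.
sample = """\
0000000000000000 T __tsan_init
0000000000000010 W _ZdlPv
                 w __gmon_start__
                 U memcpy
"""
print(find_weak_symbols(sample))
```

A build could fail early if this list is non-empty for any object file
destined for the Go integration.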


On Wed, Jun 20, 2012 at 10:05 AM, Chandler Carruth <chandlerc at google.com> wrote:
> One reason to want to have some support for doing this w/o bitcode: we may
> not have the bitcode. Specifically, the goal would be to use the "normal"
> C++ standard library, provided it is available to link statically
> (libstdc++ and libc++ certainly are, I don't know about MSVC). That would
> be much easier if we can actually use the existing archive file, and just
> "fix" the .o files inside it.
>
> It seems likely to be the equivalent of an 'ld -r' run with a linker
> script to munge the symbol names, or potentially a custom tool written with
> the LLVM object file libraries.
