thr3ads.net - llvm dev - [LLVMdev] RFC: How can AddressSanitizer, ThreadSanitizer, and similar runtime libraries leverage shared library code? [Jun 2012]

Chandler Carruth

2012-Jun-20 03:12 UTC

[LLVMdev] RFC: How can AddressSanitizer, ThreadSanitizer, and similar runtime libraries leverage shared library code?

Hello folks (and sorry if I've forgotten to CC anyone with particular
interest to this discussion...):

I've been thinking a lot about how best to build advanced runtime libraries
like ASan, and scale them up. Note that this does *not* try to address any
licensing issues. For now, I'll consider those orthogonal / solvable w/o
technical contortions. =]

My primary motivation: we really, *really* need runtime libraries to be
able to use common, shared libraries.

This starts with libraries such as the C++ standard library -- a runtime
shouldn't need to re-implement std::vector. It includes other primitive
libraries that have had significant effort put into them in LLVM such as
the ADT and Support libraries. But, IMO, it has even more importance as we
start looking at libraries such as ELF readers, DWARF readers, symbolizers,
etc. This code should be shared, and shared easily, with other LLVM projects.

However, clearly the runtime must at some point be linked against a
program, and indeed programs which may be using *the same set of
libraries*. It is crucially important that the runtime uses a separate
implementation of the libraries from the ones used by the program itself:
we will often compile the program's libraries with instrumentation and
other features which we explicitly wish to avoid in the runtime. Even
simple name clashes can cause problems, leading to the current practice of
putting all of these runtime libraries into a '__sanitizer' or other
specially spelled namespace.

A final unusual requirement is that at least *some* of the code for the
runtime libraries must be statically linked to have reasonable efficiency.
We also have several use cases where it would be very convenient to link
*all* of the runtime statically, so I prefer a solution that preserves this
option.

So how can we effectively share code? Here is my proposal, and a few
alternate strategies.

I suggest that we build the runtime library as-if it were not a runtime
library at all, and just a normal library. No strange namespaces, no
restrictions on what other libraries it uses with one exception: they must
*all* be statically linkable. We build this as a normal archive library,
nothing special. One nice property is that testing the runtime library
becomes the same as testing any other library.

Then, we have a special build step to produce a final archive which is
actually *used* as the runtime library. This step works not dissimilarly to
the step to link an executable: we build the list of archive libraries
depended on, but instead of linking an executable, we run a linker script
over them. This script will re-link each '.o' file from the transitive
closure of archives, prepending a '__asan__' (or other runtime library
prefix) onto each symbol; effectively mangling each symbol. All of these
processed '.o' files would go into a single, final archive that would be
the installed runtime library. The only functions not processed in this
manner are a white list of "exported" functions from the runtime (C-library
routines provided by the runtime, runtime entry points, etc.).
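
The whitelist-and-prefix step described above can be sketched roughly as follows. This is only an illustration under assumptions, not part of the proposal: the helper names (`build_rename_map`, `render_rename_file`) are hypothetical, and the rendered "old new" format is chosen because GNU objcopy's `--redefine-syms` option accepts such a file, which is one possible tool for the job rather than the linker script mentioned in the text.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Pairs of (original symbol, prefixed symbol).
using RenameMap = std::vector<std::pair<std::string, std::string>>;

// Hypothetical helper: every symbol that is NOT on the export whitelist gets
// the runtime prefix; whitelisted symbols keep their public names.
RenameMap build_rename_map(const std::vector<std::string>& symbols,
                           const std::set<std::string>& exported,
                           const std::string& prefix) {
  assert(!prefix.empty() && "a runtime prefix is required");
  RenameMap renames;
  for (const std::string& sym : symbols) {
    if (exported.count(sym) != 0) continue;  // exported: leave untouched
    renames.emplace_back(sym, prefix + sym);
  }
  return renames;
}

// Renders the map as one "old new" pair per line.
std::string render_rename_file(const RenameMap& renames) {
  std::string out;
  for (const auto& entry : renames) {
    out += entry.first + " " + entry.second + "\n";
  }
  return out;
}
```

A real implementation would obtain the symbol list from the object files themselves (for example via `nm`) and would also have to decide how to treat undefined references; both are left out of this sketch.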

The result should be a runtime library that is essentially hermetic, and
should have no clashes with binaries it links against. It would be free to
use standard libraries, LLVM libraries, whatever it needs. That said, there
are some clear disadvantages:
- Bizarre name mangling, especially for C++
- Potentially incompatible with C++ EH, libunwind, or other tools (I just
don't know, haven't done enough research here)
- Requires "relinking" the final runtime
- Definitely implementable on Linux & ELF-based BSDs, I *think* do-able on
Darwin, but I have no idea about Windows.
- Other downsides? I'm probably missing some big problems here... ;]
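
As a rough illustration of the first downside, a prefixed C++ symbol no longer looks like an Itanium-ABI mangled name, so tools that demangle may stop recognizing it. The sketch below assumes a GCC- or Clang-style toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`; the symbol `foo::bar()` and the helper name `demangle_or_raw` are made up for the example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of `mangled`, or the input unchanged when the
// demangler reports a failure.
std::string demangle_or_raw(const std::string& mangled) {
  int status = 0;
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0) return mangled;  // not recognized as a mangled name
  assert(out != nullptr);           // on success the result is a malloc'd string
  std::string result(out);
  std::free(out);
  return result;
}
```

With this helper, `demangle_or_raw("_ZN3foo3barEv")` yields `foo::bar()`, while the prefixed spelling `__asan___ZN3foo3barEv` does not come back as `foo::bar()`. Debuggers and symbolizers would presumably need to strip the prefix before demangling.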

However, if we can make this (possibly with tweaks/modifications) work, I
think the upside is quite large -- the runtime library stops having to be
written in such a strange special sub-set of the language, etc.


Note that this proposal is orthogonal to the issue of minimizing the binary
size and cost of the runtime library -- that is clearly still an important
concern, but that can be addressed both with or without using other
libraries. LLVM has lots of libraries specifically engineered to be
lightweight in situations like this.


Other alternatives that have been discussed:

- Require isolating all shared code into a shared library (.so) that is
loaded as-needed. This helps some, but it doesn't seem to fully solve the
issues (where does the shared code go? the .so? What happens when it is
loaded into a program that already has copies of the same code? What
happens when one is instrumented and the other isn't?). It also requires us
to ship the '.so' with the binary to get full functionality, something that
would be at least somewhat undesirable. It also requires the runtime
library developers to carefully partition the code into that which can go
in the .a and that which can go in the .so.

- The current strategy of re-implementing everything needed from
(essentially) the ground up inside the runtime library. I think that this
has serious long-term maintenance problems.... but who knows, maybe?

- Other ideas?

Kostya Serebryany

2012-Jun-20 04:07 UTC

+dvyukov

On Wed, Jun 20, 2012 at 7:12 AM, Chandler Carruth <chandlerc at google.com> wrote:
> Hello folks (and sorry if I've forgotten to CC anyone with particular
> interest to this discussion...):
>
> I've been thinking a lot about how best to build advanced runtime
> libraries like ASan, and scale them up. Note that this does *not* try to
> address any licensing issues. For now, I'll consider those orthogonal /
> solvable w/o technical contortions. =]
>
> My primary motivation: we really, *really* need runtime libraries to be
> able to use common, shared libraries.
>
I am not sure you understand the problem as we do.

In short, asan/tsan/msan/etc can not use any function which is also called
from the instrumented binary.
E.g. it can not use malloc() for internal allocations because malloc is
intercepted/replaced. We use raw mmap.
It can not use functions like strlen, memset, etc, because those functions
generate memory access events. We use our own implementations or sometimes
steal them from libc using dlsym.
Ideally, asan/etc should not even use libc functions like read() -- on
linux we currently use raw system call for some of those.
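
A minimal sketch of what this constraint looks like in practice, assuming a POSIX system with anonymous mmap. The names in `rt_sketch` are hypothetical and are not the actual sanitizer runtime API. Note also that `mmap()` here is still the libc wrapper; a real runtime might issue the system call directly, as mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

namespace rt_sketch {  // hypothetical namespace, for illustration only

// Internal allocation that bypasses malloc(), which the tool intercepts.
void* internal_alloc(std::size_t size) {
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

void internal_free(void* p, std::size_t size) {
  if (p != nullptr) munmap(p, size);
}

// Byte-wise strlen that does not call the libc strlen. A real runtime would
// also need to keep the compiler from turning this loop back into a libc
// call, typically with flags such as -fno-builtin.
std::size_t internal_strlen(const char* s) {
  std::size_t n = 0;
  while (s[n] != '\0') ++n;
  return n;
}

}  // namespace rt_sketch

// Test-side code may use libc (here: assert) freely; only the runtime side
// has to avoid it.
void self_check() {
  const std::size_t kSize = 4096;
  char* buf = static_cast<char*>(rt_sketch::internal_alloc(kSize));
  assert(buf != nullptr);
  buf[0] = 'o';
  buf[1] = 'k';
  buf[2] = '\0';
  assert(rt_sketch::internal_strlen(buf) == 2);
  rt_sketch::internal_free(buf, kSize);
}
```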

In Valgrind, they struggled with the same problem and made 2 or 3 attempts
to reuse the system libc.
Every time it ended with a maintenance nightmare; so currently valgrind has
its own private subset of libc.
In PIN, they have a private copy of system libc/libstdc++ and, afaict, it
is constantly causing pain for the maintainers.
In the previous version of ThreadSanitizer we used a private copy of
STLport in a separate namespace and a custom libc (small subset). This
worked, but had problems too (Dmitry was very angry at STLport for code
bloat, stack size increase and some direct libc calls).

Until recently this was not causing too much pain in asan/tsan, but our
attempts to use the LLVM DWARF readers made it worse.
When tsan finds a race, we need to symbolize it online to be able to match
against a suppression and decide whether we want to emit the warning. Today
we do it in a separate addr2line process (ugly and slow).
But if we start calling the LLVM dwarf reader we end up with all possible
dependency problems (Dmitry and Alexey will know the exact ones) because
the LLVM code calls to malloc, memcpy, etc.

Frankly, I don't have any solution other than to change the code such that
it does not call libc/libc++.
Some of that may be solved by a private copy of STLport + a bit of custom
libc (but see above about STLport)

--kcc



Chandler Carruth

2012-Jun-20 05:39 UTC


On Tue, Jun 19, 2012 at 9:07 PM, Kostya Serebryany <kcc at google.com> wrote:
> +dvyukov
>
> On Wed, Jun 20, 2012 at 7:12 AM, Chandler Carruth <chandlerc at google.com> wrote:
>
>> Hello folks (and sorry if I've forgotten to CC anyone with particular
>> interest to this discussion...):
>>
>> I've been thinking a lot about how best to build advanced runtime
>> libraries like ASan, and scale them up. Note that this does *not* try to
>> address any licensing issues. For now, I'll consider those orthogonal /
>> solvable w/o technical contortions. =]
>>
>> My primary motivation: we really, *really* need runtime libraries to be
>> able to use common, shared libraries.
>>
>
> I am not sure you understand the problem as we do.
>
> In short, asan/tsan/msan/etc can not use any function which is also called
> from the instrumented binary.
>
Well, I can't be sure, but this description certainly agrees with my
understanding -- you need *every* part of the runtime to be completely
separate from *every* part of the instrumented binary. I'm with you there.

In particular, I think the current strategy for libc & system calls makes
perfect sense, and I'm not trying to suggest changing it.

I think the most similar situation is this one:

> In the previous version of ThreadSanitizer we used a private copy of
> STLport in a separate namespace and a custom libc (small subset).
>
My proposal is very similar except without the need to modify the C++
standard library in use. Instead, I'm suggesting post-processing the
library to ensure that the standard C++ library code in the runtime is kept
completely distinct from that in the instrumented binary -- everything would
in fact be *mangled* differently.

The goal would be to avoid the maintenance overhead of a custom C++
standard library, and instead use a normal one. My understanding is that
both GCC's libstdc++ and LLVM's libc++ are significantly higher quality
than STLport, and if we're doing static linking, the code bloat should be
greatly reduced. We could reduce it still further by doing LTO of the
runtime library, which should be very straightforward given the rest of my
proposal.

It would still require a very small subset of libc, likely not much more
than you already have.

> This worked, but had problems too (Dmitry was very angry at STLport for
> code bloat, stack size increase and some direct libc calls).
>
I would be interested to know if the above addresses most of the problems
or not.

> Until recently this was not causing too much pain in asan/tsan, but our
> attempts to use the LLVM DWARF readers made it worse.
> When tsan finds a race, we need to symbolize it online to be able to match
> against a suppression and decide whether we want to emit the warning. Today
> we do it in a separate addr2line process (ugly and slow).
> But if we start calling the LLVM dwarf reader we end up with all possible
> dependency problems (Dmitry and Alexey will know the exact ones) because
> the LLVM code calls to malloc, memcpy, etc.
>
> Frankly, I don't have any solution other than to change the code such that
> it does not call libc/libc++.
> Some of that may be solved by a private copy of STLport + a bit of custom
> libc (but see above about STLport)
>
I think my proposal is essentially in between these two:

- Avoid the need for a low quality STL by using a normal C++ standard
library implementation, and avoid maintenance burden by doing a link-time
mangling of the symbols.
- Provide the minimal custom libc, and do the same to it
- Link the LLVM libraries against these, and munge their symbols as well
- LTO the whole thing if needed to get the code bloat down

I think this is actually easier than changing the LLVM libraries to not use
the C++ standard libraries. I also think it is easier than re-implementing
the LLVM libraries in question. But that doesn't mean I think it is easy.
;] I think it is quite hard, but it is the best solution I can come up with.

