thr3ads.net - llvm dev - [llvm-dev] RFC: XRay -- A Function Call Tracing System [Apr 2016]

If this information is useful, please help other people find it:
Share via:

Dean Michael Berris via llvm-dev

2016-Apr-29 03:58 UTC

[llvm-dev] RFC: XRay -- A Function Call Tracing System

TL;DR: At Google we use a call tracing system called XRay which inserts
patchable instrumentation points into function entries and exits. If the
community is interested, we'd like to contribute this system to the LLVM
project. Many more details are contained in the whitepaper at:
https://storage.googleapis.com/xray-downloads/whitepaper/XRayAFunctionCallTracingSystem.pdf

Who's "we": dberris at google.com, echristo at google.com

Overview

At Google we use XRay to build binaries that have "nop sleds" at
function
call entry and exits that we can patch at runtime to insert instrumentation
calls. We use this to generate sufficiently accurate function call timings
to help track down the source of production service latency. Alternate uses
include getting function call statistics (histogram of function timings),
and generating function call traces (sample of call stacks) to analyse call
graphs (for example, analysing calls to malloc()/free()).

We have also developed analysis tools for this data, and are continuing to
build on top of them with the intent of making them available outside of
Google in the long term.

The key features of this system are:

   -

   Enable/disable the instrumentation at runtime.
   -

   Control which functions get the instrumentation points at compile time.
   We do this through some configurable heuristics (i.e. only functions with
   >N instructions or that have loop constructs) and whitelists/blacklists
   (i.e. always instrument functions X, Y, Z and never instrument functions A,
   B, C).
   -

   Source-level annotations (__attribute__(...), [[...]]) to specify
   whether or not a function should be instrumentable at a fine-grained level.


Currently this works on x86-64, but we are going to make this work on a few
other platforms as well.

Proposal

The approach involves the following:


   1.

   *Clang+LLVM:* Add support for a C++ function attribute (
   `[[clang::function-instrument=xray-{always,never}]]`) to indicate whether a
   function should either always or never be instrumentable. The intent is to
   propagate this attribute as text attributes on LLVM functions until we can
   consolidate the function instrumentation implementations to use this
   attribute too.
   2.

   *Clang:* Add flags (`-fxray-instrument=`) to enable and control this
   behaviour. Further configuration flags will enable users to specify
   functions to always/never instrument, and add thresholds for some
   heuristics.
   3.

   *LLVM:* Add new LLVM pseudo instructions for indicating where the
   instrumentation sleds should be placed, and what kind of sleds these are.
   For entry sleds, we can use the `PATCHABLE_OP` instruction, and a new
   instruction for patching exits (`PATCHABLE_RET`?) that get handled by the
   assembly printer per platform. Currently only implemented for x86_64.
   4.

   *LLVM:* Use a separate section to map the instrumentation sled
   locations. This will work even in non-debug builds and the sections will be
   merged by the linker.
   5.

   *compiler-rt:* Implement a few functions in compiler-rt under either
   profiling or under a new library (we'd like to call it "xray")
that look
   like the following:
   `__xray_patch(...)` <- takes a function pointer, installs some code at
   runtime in the instrumentation points; the provided function will be
   user-provided.
   `__xray_cleanup()` <- removes the instrumentation code and turns them
   back into the "nop sleds".
   `__xray_dump_table(...)` <- provides a means of inspecting the mapping
   between instrumentation points and function names, useful for
   post-processing.


As a proof of concept usage of this, we're contributing a tool to do
function call accounting for a binary. It will use the XRay compiler-rt
library to install function call logging to an in-memory circular buffer,
and dump out the log to a file at program completion (or triggered with a
signal handler) in a simple trace format. There’s also a command-line tool
for interpreting this data to generate useful statistics.

We're committed to bringing more of the runtime library into LLVM with
consultation and collaboration with the larger community. This includes
things like:


   -

   Support for adding context-specific annotations (e.g. for RPCs, memory
   allocation/deallocation tracing, etc.)
   -

   Accurate function call logging and heuristics around this to handle
   programs that do a lot of work.
   -

   A data analysis framework for dealing with a trace format.


As an initial step, the high level idea is to build the infrastructure
(steps 1-5) into LLVM first to allow us and the community at large to build
upon this (bring this to other platforms, make it more efficient, add
features that matter when debugging performance issues with C++ binaries,
etc.). The next step is potentially to create a project in LLVM dedicated
to the tooling around XRay -- this will contain a library for function call
logging/tracing, documentation, and libraries for reading/manipulating the
trace format that XRay can generate.

Key questions:


   -

   Is this something the LLVM community is interested in having?
   -

   Do you have feedback on the proposed approach?
   -

   Do you see any large complications/blockers to getting this supported as
   part of LLVM?


Alternatives/Related Art

We also looked at other tools that do things similar to what's being
proposed here, and we'd like to say a few words about the differences:

Pintools: This works at too low a level (binaries, machine code) which
usually doesn't provide as much information or control to the programmer
doing the debugging. There are also significant overheads involved in
making this work in a production environment.

Intel PT: Although very powerful, the level of detail it provides is very
different (and is actually complementary) to doing strictly function call
tracing. It also requires hardware and kernel support. XRay is purely in
userland.

dtrace: It requires kernel support to get going, as opposed to a purely
userland solution. Although what's being proposed is purely a subset of the
functionality available to dtrace, we find that the idea itself (function
call tracing) has merit and would like to cite the existence of dtrace as
"proof of validity" to the overall approach. :)

Valgrind: There are similarities for what Valgrind does with XRay without
needing the instrumentation points to be inserted. Dynamic code rewriting
is certainly one possible approach to solving this problem. When we measure
the costs of using Valgrind compared to XRay we've found that the targeted
nature of XRay's instrumentation (and logging) tended to perform better and
more tolerable in production.

DynamoRIO: DR does much of what Valgrind does in a very flexible and
configurable manner. We think XRay can be implemented as a purely
runtime-dynamic rewriting approach but the costs of maintaining alternate
versions of the code in memory and the effects on indirect function calls
makes this less efficient than the combination of static instrumentation
points and targeted runtime patching.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160429/97a035b9/attachment.html>

Sanjoy Das via llvm-dev

2016-Apr-29 05:37 UTC

head link

[llvm-dev] RFC: XRay -- A Function Call Tracing System

Hi Dean, Eric,

This is great!  I'll wait for others from the community to chime in,
but I'm happy to see this land upstream.

 > details are contained in the whitepaper at:
 >
https://storage.googleapis.com/xray-downloads/whitepaper/XRayAFunctionCallTracingSystem.pdf

In the code that you patch in, have you considered excising the "r10d
= <function id>" bit and just having a 5 byte call to
__xray_FunctionEntryStub?  If "<function id>" is a function of
the PC
being instrumented then you should be able to reconstruct it by
peeking into the return PC on the stack.  This means your return
sequence will have to call @__xray_FunctionExitStub (instead of
branching), so that the return PC on the stack is correct.

Using a 5 byte call will let you keep one 5 byte nop instead of the 10/11
byte nop sequence you have today (will really only help for the prologue).

 > *LLVM:* Add new LLVM pseudo instructions for indicating where the
 > instrumentation sleds should be placed, and what kind of sleds these
 > are. For entry sleds, we can use the `PATCHABLE_OP` instruction, and a
 > new instruction for patching exits (`PATCHABLE_RET`?) that get handled
 > by the assembly printer per platform. Currently only implemented for
 > x86_64.

Nitpicky comment -- we can discuss this in more detail during code
review:

I'd prefer to add more functionality to `PATCHABLE_OP` instead of
introducing yet one more pseudo instruction.  `PATCHABLE_OP` can
already wrap an arbitrary machine instruction, teaching it a few
different flavors of nop-insertion should not be a big deal.

-- Sanjoy

Dean Michael Berris via llvm-dev

2016-Apr-29 06:46 UTC

head link

[llvm-dev] RFC: XRay -- A Function Call Tracing System

On Fri, Apr 29, 2016 at 3:37 PM Sanjoy Das <sanjoy at
playingwithpointers.com>
wrote:
> Hi Dean, Eric,
>
> This is great!  I'll wait for others from the community to chime in,
> but I'm happy to see this land upstream.
>
>Thanks Sanjoy!

>  > details are contained in the whitepaper at:
>  >
>
https://storage.googleapis.com/xray-downloads/whitepaper/XRayAFunctionCallTracingSystem.pdf
>
> In the code that you patch in, have you considered excising the "r10d
> = <function id>" bit and just having a 5 byte call to
> __xray_FunctionEntryStub?  If "<function id>" is a function
of the PC
> being instrumented then you should be able to reconstruct it by
> peeking into the return PC on the stack.  This means your return
> sequence will have to call @__xray_FunctionExitStub (instead of
> branching), so that the return PC on the stack is correct.
>
>This was a tradeoff between code-size and runtime cost. Every time we need
to find the function id given the return PC on the stack, we'd have to look
it up from a table mapping PC->function id. We've made the trade-off such
that we pay some cost in code size but save on the costs at runtime.

This might turn out to be something that could be tuned depending on the
situation, but the current implementation favours lower overheads at
runtime when tracing more than trying to save on binary size. I'm open to
probably having an alternative implementation of this later that optimises
for code-size so that's definitely worth exploring.

> Using a 5 byte call will let you keep one 5 byte nop instead of the 10/11
> byte nop sequence you have today (will really only help for the prologue).
>
>  > *LLVM:* Add new LLVM pseudo instructions for indicating where the
>  > instrumentation sleds should be placed, and what kind of sleds these
>  > are. For entry sleds, we can use the `PATCHABLE_OP` instruction, and
a
>  > new instruction for patching exits (`PATCHABLE_RET`?) that get
handled
>  > by the assembly printer per platform. Currently only implemented for
>  > x86_64.
>
> Nitpicky comment -- we can discuss this in more detail during code
> review:
>
> I'd prefer to add more functionality to `PATCHABLE_OP` instead of
> introducing yet one more pseudo instruction.  `PATCHABLE_OP` can
> already wrap an arbitrary machine instruction, teaching it a few
> different flavors of nop-insertion should not be a big deal.
>
>Indeed -- although I'm more worried about the stack lowering code getting
affected by the fact that `PATCHABLE_OP` is not a return instruction nor is
it a terminator. Whether it's possible to make `PATCHABLE_OP` count as both
a normal instruction and a terminator/return at the same time is something
I haven't considered.

Note that I tried doing this with a single pseudo instruction, and in my
prototype implementation it got really messy (mostly because stack
adjustments in prologue emission is predicated on finding a return
instruction to trigger the logic). Maybe we just need to change that part
to handle `PATCHABLE_OP` differently to work appropriately, but definitely
something to discuss in code review. :D

Cheers
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160429/56f38eaf/attachment.html>

llvm dev - Apr 2016 - RFC: XRay -- A Function Call Tracing System

[llvm-dev] RFC: XRay -- A Function Call Tracing System

[llvm-dev] RFC: XRay -- A Function Call Tracing System

[llvm-dev] RFC: XRay -- A Function Call Tracing System