thr3ads.net - llvm dev - [llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds [Jul 2016]

Dean Michael Berris via llvm-dev

2016-Jul-04 05:50 UTC

[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds

Hi llvm-dev (cc google-xray),

As a follow-up to the first XRay RFC [0] introducing the technology, I've
been able to recently implement a functional prototype of the major parts
of the XRay functionality [1]. This RFC is limited to exploring potential
alternatives to the current LLVM-side changes, with the interest of getting
clear guidance for landing the changes first in LLVM.

Background / Current Implementation
============================
XRay relies on statically inserted instrumentation points (implemented as
nop-sleds) and a dynamic enable/disable mechanism implemented in a runtime
library. As of this writing the implementation of the XRay prototype
involves adding two pseudo-instructions (PATCHABLE_FUNCTION_ENTER,
PATCHABLE_RET) that serve as placeholders for where the nop-sleds are to be
emitted when lowering. PATCHABLE_FUNCTION_ENTER is an instruction that
takes no operands and serves as a pure placeholder. PATCHABLE_RET
effectively behaves as a return instruction (isReturn = true) and wraps
whatever the return instruction is, along with all operands -- this is used
to replace the return instructions, and when lowered will unpack into the
appropriate nop-sled-padded return sequence. We rely on a
MachineFunctionPass (XRayInstrumentation) to observe IR functions with
xray-specific attributes (function-instrument=xray-{always,never}, or
xray-instruction-threshold=N), that then insert the pseudo-instructions to
the machine instructions that get lowered appropriately. While lowering, we
keep track of the instrumentation points marked by the lowered
pseudo-instructions and generate a per-function COMDAT/ELF Group section,
merged into a special section (xray_instr_map). We only currently implement
the lowering for x86_64 ELF.

All these changes are implemented in http://reviews.llvm.org/D19904.
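
A minimal sketch of the attribute check and pseudo-instruction insertion described above, written in Python as a model rather than as the actual LLVM C++ pass. The `Instr` and `Function` types, the opcode names in the usage note, and the reading of `xray-instruction-threshold` as "instrument when the function has at least N instructions" are assumptions for illustration; the patch in D19904 is the authority on the real behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    opcode: str
    operands: tuple = ()
    is_return: bool = False


@dataclass
class Function:
    name: str
    body: list
    attributes: dict = field(default_factory=dict)


def should_instrument(fn):
    """Decide from the xray-specific attributes whether to add sleds."""
    mode = fn.attributes.get("function-instrument")
    if mode == "xray-never":
        return False
    if mode == "xray-always":
        return True
    threshold = fn.attributes.get("xray-instruction-threshold")
    # Assumption: a function is instrumented once it reaches the threshold.
    return threshold is not None and len(fn.body) >= int(threshold)


def insert_pseudo_instructions(fn):
    """Return a new body with the two placeholder pseudo-instructions."""
    if not should_instrument(fn):
        return list(fn.body)
    # PATCHABLE_FUNCTION_ENTER takes no operands: a pure placeholder.
    out = [Instr("PATCHABLE_FUNCTION_ENTER")]
    for ins in fn.body:
        if ins.is_return:
            # PATCHABLE_RET wraps the original return and all its operands,
            # and still behaves as a return (isReturn = true).
            wrapped = (ins.opcode,) + tuple(ins.operands)
            out.append(Instr("PATCHABLE_RET", wrapped, is_return=True))
        else:
            out.append(ins)
    return out
```

For a body of `[Instr("MOV", ("eax", "1")), Instr("RETQ", (), True)]` marked `function-instrument=xray-always`, the result starts with `PATCHABLE_FUNCTION_ENTER` and ends with a `PATCHABLE_RET` whose operands carry the original `RETQ`.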

Challenges
========
This implementation approach poses two major challenges just on the LLVM
(core) side of the implementation:

1) The pseudo-instructions need to be handled specially for each platform
to which XRay would be ported. At this time we're exploring implementing
(and accepting help from the community to complete) PPC and ARM support,
spelling the nop sleds differently for those architectures. Since the
prototype only supports ELF sections, we're thinking about a portable/clean
way of finding/coalescing the instrumentation point locations. We have made
some choices in the current implementation, and we're unsure whether they
will work or transfer cleanly to other architectures or formats/OSes
(MachO, COFF, a.out (?)).

2) We are only currently instrumenting "normal" function entry and exits.
We have a 1:1 correspondence between the type of instrumentation point and
the pseudo-instructions. This means, when we start implementing various
exit points (exception throwing, catch returns, tail calls, sibling calls)
we need to implement new pseudo-instructions and port to all other
platforms where XRay will be ported. The proliferation of
pseudo-instructions seems hardly desirable, and maybe a better approach
would scale better.

Alternatives
========
We've looked at the following alternatives, and we're looking to the
community for feedback on both the current implementation and these
alternatives.

LLVM Functions
----------------------

Instead of using pseudo-instructions, use intrinsic functions [2] that are
part of the IR. These could be emitted at a higher level by front-ends
(like Clang) and are threaded through the various IR transformations and
optimisations. There are some pros and cons to this approach; the ones I
know about are listed below:

Pro:
+ We can encode variance in the sleds as function arguments (scales better
to more kinds of instrumentation points we can insert).
+ The IR has the functions in-line, instead of being magically inserted
when lowering (could be a better aid for debugging/understanding/reasoning).
+ In case the platform doesn't yet support XRay instrumentation, we can
trivially remove the function calls when lowering.

Cons:
- We're unsure whether we can still enforce the layout of emitted code,
especially in the special case of the return sleds. Since the return sleds
(in x86_64) are spelled as `ret; <10-byte nops>`, there may be some
acrobatics needed to lower and legalize this, making this lowering
potentially inferior to the pseudo-instructions approach.
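
To make the trade-off concrete, here is a rough Python model of the intrinsic-style alternative: the kind of sled is an argument, lowerings are looked up per target, and a target without a lowering simply drops the call. Everything named here (`SledKind`, `LOWERINGS`, `lower_sled`) is hypothetical. Only the x86_64 return sled shape (`ret` followed by 10 bytes of nops) comes from this RFC; ten single-byte `0x90` NOPs stand in for whatever 10-byte nop encoding the real lowering uses, and no entry sled is registered because its byte sequence is not given in this thread.

```python
from enum import Enum


class SledKind(Enum):
    FUNCTION_ENTER = 0
    FUNCTION_EXIT = 1


RET = b"\xc3"  # x86 near return
NOP = b"\x90"  # x86 single-byte nop, used here as a stand-in


def lower_x86_64_exit():
    # Return sled as spelled in the RFC: `ret; <10-byte nops>`.
    return RET + NOP * 10


# One table entry per (target, kind) instead of one pseudo-instruction per
# kind that every target must learn about.
LOWERINGS = {
    ("x86_64", SledKind.FUNCTION_EXIT): lower_x86_64_exit,
}


def lower_sled(target, kind):
    """Return the bytes for a sled call, or b"" if the call is dropped."""
    lowering = LOWERINGS.get((target, kind))
    if lowering is None:
        # Target (or kind) not supported yet: trivially remove the call.
        return b""
    return lowering()
```

The sketch sidesteps the hard part named in the con above: in a real lowering the sled has to replace and stay adjacent to the function's actual return instruction, which a plain table lookup does not guarantee.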

More Magic
----------------

Instead of using pseudo-instructions, we rely solely on the
presence/absence of attributes then special-case the start-of-function
(prologue), end of function (epilogue), and return instruction lowering for
platforms where XRay would be supported. This entails adding special-case
function calls in strategic places in the compiler, the logic all being
embedded in the LLVM code base (in lib/CodeGen, lib/Target, etc.). There
are some pros and cons to this approach:

Pro:
+ All XRay logic can be hidden in an interface purely in LLVM code, no need
for exposing logic in IR nor in MC.
+ Sidesteps all issues with lowering instructions in platforms, inserting
the correct instrumentation points on a platform-by-platform basis.
+ Allows for iterating the implementation purely in LLVM code, testing
logic in isolation, incremental changes to internals.

Cons:
- This involves much more work touching more places where instrumentation
points might be inserted. An initial attempt involves teaching the various
stack adjustment routines, prologue/epilogue emission, return instruction
lowering, the legalizer, and late-stage optimisations how to handle
XRay-specific instrumentation.

Open Questions
============
There are some other open questions to the community at large:

* Looking at the current implementation, are there major objections to
committing to the current implementation, iterating with the knowledge that
this can evolve more later as we learn more about implementing XRay (and
other instrumentation routines) in LLVM?

* Are there other risks we haven't considered yet for having something like
XRay embedded as a supported instrumentation mechanism in LLVM?

* Given the current implementation in http://reviews.llvm.org/D19904, do
you have suggestions on how to partition it to smaller changes that could
be reviewed/landed easier than a singular patch?

Roadmap for Context
================
Note that this RFC focuses only on the LLVM-side changes. To put this in
context, the order of changes we're looking to land comes in the following
order:

- LLVM Changes (subject of this RFC)
- Changes in compiler-rt (the runtime implementing dynamic patching and
in-memory logging)
- Changes in Clang to support emitting XRay-instrumented C/C++ (and maybe
Obj-C) binaries
- Tools for analysing XRay traces generated by XRay-instrumented binaries

I have some changes in the works to get the in-memory logging implementation
(a naive implementation) and a simple function call accounting tool working
on top of the existing public patches. Hopefully as soon as we get clear
guidance on the subject of this RFC, more of the implementation described
in the white paper [2] in terms of the logging heuristics and runtime
enabling/disabling can proceed in earnest.

--- End of RFC ---

References:

[0] Original XRay RFC:
http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html

[1] There are three patches that implement the prototype XRay
implementation, updated to track trunk of LLVM, Clang, and compiler-rt:

http://reviews.llvm.org/D19904 (llvm)
http://reviews.llvm.org/D20352 (clang)
http://reviews.llvm.org/D21612 (compiler-rt)

[2] XRay: A Function Call Tracing System:
http://research.google.com/pubs/pub45287.html

Hayden Livingston via llvm-dev

2016-Jul-04 07:39 UTC

[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds

I have a few meta questions here.

Why should LLVM (and from the patch it seems Clang) favor one
instrumentation system -- in this case the XRay instrumentation system
vs. many others that may be possible to add to upstream?

It seems GCC has -finstrument-functions that call into cyg_....
functions. Poor naming choice, but I suppose one thing would be to use
those names. Or better yet, provide a way in commandline to say what
functions are for entry, and what are for exit.

How is this different from hot patching that exists in Windows? I
suppose this feature makes it more accessible?

I hope we can change the name of this thing if it were to be added to
something generic that doesn't tie us to the runtime libraries needed
for XRay specifically.



Dean Michael Berris via llvm-dev

2016-Jul-04 08:10 UTC

[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds

Thanks for the questions Hayden, please see in-line below some responses.

On Mon, Jul 4, 2016 at 5:39 PM Hayden Livingston <halivingston at gmail.com>
wrote:
> I have a few meta questions here.
>
> Why should LLVM (and from the patch it seems Clang) favor one
> instrumentation system -- in this case the XRay instrumentation system
> vs. many others that may be possible to add to upstream?

I don't think there's any intent to exclude any existing or alternative
instrumentation systems from LLVM. At least from our proposal, we're making
sure we're playing well with any existing current implementations already
in LLVM/Clang and others that might come along.

> It seems GCC has -finstrument-functions that call into cyg_....
> functions. Poor naming choice, but I suppose one thing would be to use
> those names. Or better yet, provide a way in commandline to say what
> functions are for entry, and what are for exit.

I thought Clang already supported this as an option?

> How is this different from hot patching that exists in Windows? I
> suppose this feature makes it more accessible?

The differences are multi-fold. Some of them that I can list are:

- XRay aims to not change the functionality of the application/function
being instrumented. The sole goal of the XRay instrumentation points is to
allow for dynamic enabling/disabling of the instrumentation, using only
the instrumentation points that have been inserted by the compiler. With
hot-patching in Windows, as far as I can tell, the intent is to completely
replace the implementation of a function at runtime, not just to add
instrumentation. You can say that XRay may be implemented in a similar
manner by re-writing the function being instrumented at runtime and
hot-patching the original function implementation, but we've chosen not to
do that for efficiency reasons (trade-off between cost of instrumentation
when "off" and when "on").

- XRay has a very specific goal, which is to generate function call traces
for performance debugging. Other instrumentation systems will have
different goals, and the hot-patching mechanism is just one of those
techniques useful for achieving the various goals. We certainly can allow
other uses for XRay (i.e. in the prototype implementation, we have hooks to
allow changing what function is called when an instrumentation point is
encountered at runtime) but the immediate goal is for generating traces
that can be analysed offline.
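
As a rough illustration of the two bullets above, the following Python model shows instrumentation points that can be switched on and off at runtime and a replaceable hook that is called when an enabled point is hit. The class and method names are hypothetical and are not the compiler-rt interface; real patching rewrites the nop sled in place, which a set lookup only imitates.

```python
class XRayRuntimeModel:
    """Toy model of dynamic enable/disable plus a swappable handler."""

    def __init__(self):
        self._handler = None      # function called at an enabled sled
        self._patched = set()     # function ids whose sleds are "on"

    def set_handler(self, handler):
        self._handler = handler

    def patch(self, func_id):
        self._patched.add(func_id)

    def unpatch(self, func_id):
        self._patched.discard(func_id)

    def on_sled(self, func_id, event):
        # Unpatched sled: behaves like the nops, the function is unchanged.
        if func_id not in self._patched:
            return
        handler = self._handler
        # Patched but no handler installed: nothing is logged.
        if handler is None:
            return
        handler(func_id, event)
```

Installing a handler that appends to a list and then calling `on_sled` for a patched and an unpatched function id shows that only the patched one is recorded.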

> I hope we can change the name of this thing if it were to be added to
> something generic that doesn't tie us to the runtime libraries needed
> for XRay specifically.

I agree we should be able to share common infrastructure in LLVM for adding
instrumentation points (there's an interesting RFC recently for CSI) and
I'm all for making it easier to implement things like XRay through the
common infrastructure. There's certainly been talk about consolidating the
different options for adding instrumentation into a coherent set of flags
in Clang, but I haven't quite seen talk about common instrumentation
infrastructure support in LLVM. My hope is, if this is something the
community will find useful, that we can gain consensus or at least share a
clear direction. I'd be happy to do the work if that means we can get XRay
functionality supported as one of the many possible implementations in LLVM.

I'm happy to have a conversation about being able to make alternative
instrumentation systems easier to implement with the work to support XRay
in LLVM, if that makes it at least clear that XRay isn't being proposed as
the "one true way" for instrumenting Clang/LLVM-built binaries. I'm even
willing to try and iterate on the interfaces and/or implementations in LLVM
to make XRay-like things be built on top of LLVM.

As for naming, I think being able to specify from a command-line (of Clang,
or some other llvm-* tools) the string 'xray' makes it easier to search
for, document, and "teach". The vision is, if it's possible, to have many
of these instrumentation implementations live under a single flag like
'-finstrument='. For now though any talk of that might be premature if
we're only going to have '-finstrument=profile' and '-finstrument=xray'.

Does that make sense?

Cheers

David Chisnall via llvm-dev

2016-Jul-04 08:27 UTC

[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds

On 4 Jul 2016, at 06:50, Dean Michael Berris via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> We've looked at the following alternatives, and we're looking to the
> community for feedback on both the current implementation and these
> alternatives.

I don’t think that I’ve yet seen an explanation of why you need the NOPs.
DTrace stopped using them a long time ago, for two reasons:

1) The increased code size caused a noticeable increase in i-cache misses, even
when instrumentation was not actively being used.  This caused a noticeable
probe effect (macroscopic observable performance artefacts even when no probes
are active) and caused a lot of push-back in adoption.

2) On all of the architectures where we support DTrace (currently, I believe,
x86, x86-64, AArch32, AArch64, MIPS64, and RISC-V) it’s possible to do the same
thing by moving one of the instructions in the function prolog into the
generated trampoline for the instrumentation.

I could understand wanting something more like patchpoints if you want to be
able to instrument in the middle of a function (along the lines of TESLA or
CSI), but if you’re just tracing function entry and exit then it doesn’t seem
like the best solution.

David


Dean Michael Berris via llvm-dev

2016-Jul-04 08:51 UTC

[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds

On Mon, Jul 4, 2016 at 6:27 PM David Chisnall <david.chisnall at cl.cam.ac.uk>
wrote:
> On 4 Jul 2016, at 06:50, Dean Michael Berris via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >
> > We've looked at the following alternatives, and we're looking to the
> > community for feedback on both the current implementation and these
> > alternatives.
>
> I don’t think that I’ve yet seen an explanation of why you need the NOPs.
> DTrace stopped using them a long time ago, for two reasons:
>
> 1) The increased code size caused a noticeable increase in i-cache misses,
> even when instrumentation was not actively being used.  This caused a
> noticeable probe effect (macroscopic observable performance artefacts even
> when no probes are active) and caused a lot of push-back in adoption.
>
> 2) On all of the architectures where we support DTrace (currently, I
> believe, x86, x86-64, AArch32, AArch64, MIPS64, and RISC-V) it’s possible
> to do the same thing by moving one of the instructions in the function
> prolog into the generated trampoline for the instrumentation.
>
> I could understand wanting something more like patchpoints if you want to
> be able to instrument in the middle of a function (along the lines of TESLA
> or CSI), but if you’re just tracing function entry and exit then it doesn’t
> seem like the best solution.

Thanks for the questions David -- the short version of the answer is that
DTrace (last I checked) requires some help from the kernel, while XRay is
self-contained in the application.

All of your points above are valid, and DTrace is a really powerful tool
for debugging a lot of performance issues. XRay has a few things that
differentiate it from systems like DTrace though:

1) Because we insert the instrumentation sleds only in specific functions
that fit certain criteria (i.e. more selectively) instead of instrumenting
every function, we pay the cost of the instrumentation being off only on
functions that are instrumented. The front-end changes that support
attributes/annotations in the code to force or inhibit instrumentation
give control to the application developer and allow us to limit the cost
along a spectrum -- full coverage costs more, selective coverage can be
tuned, and explicit annotations provide precise control of the
instrumentation.

2) The cost of the instrumentation at run-time is O(100) cycles for the
"null-logging" case (mov + trampoline jump, atomic load and check if not
zero). All the cost of instrumentation is within the process' address space
(in-memory log) when on -- no additional overheads external to the
application.

3) The runtime implementation for logging described in the white paper
allows us to balance the coverage (number of instrumentation events we get)
with overheads (the amount of resources used in the logging
implementation). Because we log only very specific things (function id, tsc
deltas in most cases, type of event) and have heuristics to condense the
information we keep (i.e. if entry-exit pairs are under epsilon, we can
omit the entry entirely), we don't need to be quite as complete when
logging and instead move a lot of the logic in reconstruction/analysis of
the generated traces.
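
The condensing heuristic in point 3 can be sketched as follows. This is a simplified Python model, not the runtime described in the white paper: records are `(function id, event type, tsc)` tuples with absolute TSC values rather than deltas, and a short entry/exit pair is dropped only when nothing else was logged between the two events. Both choices, and the `condense` name, are assumptions.

```python
def condense(events, epsilon):
    """Drop entry/exit pairs whose duration is under `epsilon` TSC ticks.

    `events` is an iterable of (func_id, kind, tsc) with kind being
    "enter" or "exit"; the condensed log is returned as a list.
    """
    log = []
    for func_id, kind, tsc in events:
        if (
            kind == "exit"
            and log
            and log[-1][0] == func_id
            and log[-1][1] == "enter"
            and tsc - log[-1][2] < epsilon
        ):
            # The matching entry is the most recent record and the call was
            # too short to be interesting: forget the entry, skip the exit.
            log.pop()
            continue
        log.append((func_id, kind, tsc))
    return log
```

With `epsilon=5`, a trace in which function 2 runs for 2 ticks inside a 500-tick call to function 1 keeps only the entry and exit of function 1.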

There are certainly other approaches to doing selective instrumentation,
and then externally signalling/trapping (with environment support) when
probing. XRay moves this needle towards having the instrumentation,
collection, and even signalling inside the application. This makes sense if
you're deploying the application on a system that doesn't have DTrace and
you still want to isolate the costs of instrumentation to just the
application.

I'll admit that I'll need to read a lot more about how DTrace manages to
keep the costs of probes low enough that it could be turned on dynamically
without stopping the process, and without having to intercept more events
than actually necessary (i.e. only on certain functions, and only when it's
on) to be able to provide a more complete answer.

Does this help?

Cheers
