Dean Michael Berris via llvm-dev
2016-Jul-04 05:50 UTC
[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds
Hi llvm-dev (cc google-xray),

As a follow-up to the first XRay RFC [0] introducing the technology, I've recently been able to implement a functional prototype of the major parts of the XRay functionality [1]. This RFC is limited to exploring potential alternatives to the current LLVM-side changes, with the interest of getting clear guidance for landing the changes first in LLVM.

Background / Current Implementation
===================================

XRay relies on statically inserted instrumentation points (implemented as nop-sleds) and a dynamic enable/disable mechanism implemented in a runtime library. As of this writing the implementation of the XRay prototype involves adding two pseudo-instructions (PATCHABLE_FUNCTION_ENTER, PATCHABLE_RET) that serve as placeholders for where the nop-sleds are to be emitted when lowering. PATCHABLE_FUNCTION_ENTER is an instruction that takes no operands and serves as a pure placeholder. PATCHABLE_RET effectively behaves as a return instruction (isReturn = true) and wraps whatever the return instruction is, along with all operands -- this is used to replace the return instructions, and when lowered will unpack into the appropriate nop-sled-padded return sequence.

We rely on a MachineFunctionPass (XRayInstrumentation) to observe IR functions with xray-specific attributes (function-instrument=xray-{always,never}, or xray-instruction-threshold=N), which then inserts the pseudo-instructions into the machine instructions that get lowered appropriately. While lowering, we keep track of the instrumentation points marked by the lowered pseudo-instructions and generate a per-function COMDAT/ELF Group section, merged into a special section (xray_instr_map). We currently implement the lowering only for x86_64 ELF.

All these changes are implemented in http://reviews.llvm.org/D19904.
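As a rough illustration of the attribute-driven selection described above, the per-function decision might be modelled as follows. This is a sketch in Python rather than the pass's C++. The name should_instrument, the precedence of the always/never attribute over the threshold, and the ">=" comparison are all assumptions made for illustration and are not taken from D19904.

```python
def should_instrument(attrs, instruction_count):
    """Decide whether a function gets XRay sleds (illustrative model only).

    attrs: dict of function attributes, e.g.
           {"function-instrument": "xray-always"} or
           {"xray-instruction-threshold": "200"}.
    instruction_count: number of machine instructions in the function.
    """
    mode = attrs.get("function-instrument")
    if mode == "xray-always":
        return True
    if mode == "xray-never":
        return False

    # Assumption: with no explicit always/never, only the threshold applies,
    # and a function at or above the threshold is instrumented.
    threshold = attrs.get("xray-instruction-threshold")
    if threshold is None:
        return False
    return instruction_count >= int(threshold)
```

A small function with a threshold attribute would be skipped under these assumptions, while the same function marked xray-always would still be instrumented.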
Challenges
==========

This implementation approach poses two major challenges just on the LLVM (core) side of the implementation:

1) The pseudo-instructions need to be handled specially for each platform to which XRay would be ported. At this time we're exploring implementing (and accepting help from the community to complete) PPC and ARM support, spelling the nop sleds differently for those architectures. Since the prototype only supports ELF sections, we're thinking about a portable/clean way of finding/coalescing the instrumentation point locations. We have made some choices in the current implementation that we're unsure will work or transfer cleanly to other architectures or formats/OSes (MachO, COFF, a.out (?)).

2) We are currently instrumenting only "normal" function entries and exits. We have a 1:1 correspondence between the type of instrumentation point and the pseudo-instructions. This means that when we start implementing various exit points (exception throwing, catch returns, tail calls, sibling calls) we need to implement new pseudo-instructions and port them to all other platforms where XRay will be ported. The proliferation of pseudo-instructions seems hardly desirable, and maybe a different approach would scale better.

Alternatives
============

We've looked at the following alternatives, and we're looking to the community for feedback on both the current implementation and these alternatives.

LLVM Functions
--------------

Instead of using pseudo-instructions, use intrinsic functions [2] that are part of the IR. These could be emitted at a higher level by front-ends (like Clang) and are threaded through the various IR transformations and optimisations. There are some pros and cons to this approach, and I'm attempting to list the ones I know about:

Pro:
+ We can encode variance in the sleds as function arguments (scales better to more kinds of instrumentation points we can insert).
+ The IR has the functions in-line, instead of being magically inserted when lowering (could be a better aid for debugging/understanding/reasoning).
+ In case the platform doesn't yet support XRay instrumentation, we can trivially remove the function calls when lowering.

Cons:
- We're unsure whether we can still enforce the layout of emitted code, especially in the special case of the return sleds. Since the return sleds (in x86_64) are spelled as `ret; <10-byte nops>`, there may be some acrobatics needed to lower and legalize this, making it potentially inferior to the pseudo-instructions approach.

More Magic
----------

Instead of using pseudo-instructions, we rely solely on the presence/absence of attributes, then special-case the start-of-function (prologue), end-of-function (epilogue), and return instruction lowering for platforms where XRay would be supported. This entails adding special-case function calls in strategic places in the compiler, the logic all being embedded in the LLVM code base (in lib/CodeGen, lib/Target, etc.). There are some pros and cons to this approach:

Pro:
+ All XRay logic can be hidden in an interface purely in LLVM code, with no need for exposing logic in IR or in MC.
+ Sidesteps all issues with lowering instructions in platforms, inserting the correct instrumentation points on a platform-by-platform basis.
+ Allows for iterating the implementation purely in LLVM code, testing logic in isolation, and making incremental changes to internals.

Cons:
- This involves much more work, touching more places where instrumentation points might be inserted. An initial attempt involves teaching the various stack adjustment routines, prologue/epilogue emission, return instruction lowering, the legalizer, and late-stage optimisations how to handle XRay-specific instrumentation.
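To make the layout concern around return sleds concrete, here is a small Python model of an x86_64 return sled and of patching it at runtime. Only the `ret; <10-byte nops>` shape comes from the RFC. The single-byte 0x90 padding, the use of r10d to carry a function id, the `mov` + `jmp rel32` patch bytes, and all function names are assumptions for illustration, not a description of what D19904 or the runtime emits.

```python
import struct

RET = 0xC3        # x86 near return
NOP = 0x90        # single-byte nop (assumed padding; real sleds may differ)
SLED_SIZE = 11    # `ret` plus 10 bytes of nops, as spelled in the RFC


def make_return_sled():
    """Return the unpatched sled: ret followed by 10 nop bytes."""
    return bytearray([RET] + [NOP] * 10)


def patch_return_sled(code, offset, function_id, trampoline_addr, base_addr=0):
    """Overwrite the sled at code[offset:offset+11] with
    `mov r10d, function_id; jmp trampoline` (hypothetical patch sequence)."""
    next_ip = base_addr + offset + SLED_SIZE   # address just past the jmp
    rel32 = trampoline_addr - next_ip          # jmp rel32 is relative to next_ip
    patch = (b"\x41\xba" + struct.pack("<I", function_id)   # mov r10d, imm32
             + b"\xe9" + struct.pack("<i", rel32))          # jmp rel32
    assert len(patch) == SLED_SIZE
    code[offset:offset + SLED_SIZE] = patch


def unpatch_return_sled(code, offset):
    """Restore the original `ret; nops` bytes, disabling the instrumentation."""
    code[offset:offset + SLED_SIZE] = make_return_sled()
```

The point of the model is that the patch only works if the 11 bytes stay contiguous and in place, which is the property any intrinsic-based lowering would also have to preserve.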
Open Questions
==============

There are some other open questions to the community at large:

* Looking at the current implementation, are there major objections to committing it and iterating, with the knowledge that it can evolve further as we learn more about implementing XRay (and other instrumentation routines) in LLVM?

* Are there other risks we haven't considered yet in having something like XRay embedded as a supported instrumentation mechanism in LLVM?

* Given the current implementation in http://reviews.llvm.org/D19904, do you have suggestions on how to partition it into smaller changes that could be reviewed/landed more easily than a single patch?

Roadmap for Context
===================

Note that this RFC focuses only on the LLVM-side changes. To put this in context, we're looking to land the changes in the following order:

- LLVM Changes (subject of this RFC)
- Changes in compiler-rt (the runtime implementing dynamic patching and in-memory logging)
- Changes in Clang to support emitting XRay-instrumented C/C++ (and maybe Obj-C) binaries
- Tools for analysing XRay traces generated by XRay-instrumented binaries

I have some changes in the works to get the in-memory logging implementation (a naive implementation) and a simple function call accounting tool working on top of the existing public patches. Hopefully as soon as we get clear guidance on the subject of this RFC, more of the implementation described in the white paper [2], in terms of the logging heuristics and runtime enabling/disabling, can proceed in earnest.
--- End of RFC ---

References:

[0] Original XRay RFC: http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html

[1] There are three patches that implement the prototype XRay implementation, updated to track trunk of LLVM, Clang, and compiler-rt:

http://reviews.llvm.org/D19904 (llvm)
http://reviews.llvm.org/D20352 (clang)
http://reviews.llvm.org/D21612 (compiler-rt)

[2] XRay: A Function Call Tracing System: http://research.google.com/pubs/pub45287.html
Hayden Livingston via llvm-dev
2016-Jul-04 07:39 UTC
[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds
I have a few meta questions here.

Why should LLVM (and from the patch it seems Clang) favor one instrumentation system -- in this case the XRay instrumentation system -- over the many others that could be added upstream?

It seems GCC has -finstrument-functions, which calls into cyg_.... functions. Poor naming choice, but I suppose one option would be to use those names. Or better yet, provide a way on the command line to say which functions are for entry and which are for exit.

How is this different from the hot patching that exists in Windows? I suppose this feature makes it more accessible?

I hope we can change the name of this thing, if it were to be added, to something generic that doesn't tie us to the runtime libraries needed for XRay specifically.
Dean Michael Berris via llvm-dev
2016-Jul-04 08:10 UTC
[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds
Thanks for the questions Hayden, please see some responses in-line below.

On Mon, Jul 4, 2016 at 5:39 PM Hayden Livingston <halivingston at gmail.com> wrote:

> I have a few meta questions here.
>
> Why should LLVM (and from the patch it seems Clang) favor one
> instrumentation system -- in this case the XRay instrumentation system
> vs. many others that may be possible to add to upstream?

I don't think there's any intent to exclude any existing or alternative instrumentation systems from LLVM. At least in our proposal, we're making sure we play well with the implementations already in LLVM/Clang and with others that might come along.

> It seems GCC has -finstrument-functions that call into cyg_....
> functions. Poor naming choice, but I suppose one thing would be to use
> those names. Or better yet, provide a way in commandline to say what
> functions are for entry, and what are for exit.

I thought Clang already supported this as an option?

> How is this different from hot patching that exists in Windows? I
> suppose this feature makes it more accessible?

The differences are multi-fold. Some that I can list are:

- XRay aims to not change the functionality of the application/function being instrumented. The sole goal of the XRay instrumentation points is to allow for dynamic enabling/disabling of the instrumentation, using only the instrumentation points that have been inserted by the compiler. With hot-patching in Windows, as far as I can tell, the intent is to completely update the implementation of a function at runtime, not just to instrument it. You can say that XRay may be implemented in a similar manner, by re-writing the function being instrumented at runtime and hot-patching the original function implementation, but we've chosen not to do that for efficiency reasons (a trade-off between the cost of instrumentation when "off" and when "on").
- XRay has a very specific goal, which is to generate function call traces for performance debugging. Other instrumentation systems will have different goals, and the hot-patching mechanism is just one of the techniques useful for achieving those various goals. We certainly can allow other uses for XRay (i.e. in the prototype implementation, we have hooks to allow changing what function is called when an instrumentation point is encountered at runtime) but the immediate goal is generating traces that can be analysed offline.

> I hope we can change the name of this thing if it were to be added to
> something generic that doesn't tie us to the runtime libraries needed
> for XRay specifically.

I agree we should be able to share common infrastructure in LLVM for adding instrumentation points (there's an interesting recent RFC for CSI) and I'm all for making it easier to implement things like XRay through the common infrastructure. There's certainly been talk about consolidating the different options for adding instrumentation into a coherent set of flags in Clang, but I haven't quite seen talk about common instrumentation infrastructure support in LLVM.

My hope is, if this is something the community will find useful, that we can gain consensus or at least share a clear direction. I'd be happy to do the work if that means we can get XRay functionality supported as one of the many possible implementations in LLVM.

I'm happy to have a conversation about making alternative instrumentation systems easier to implement with the work to support XRay in LLVM, if that makes it at least clear that XRay isn't being proposed as the "one true way" of instrumenting Clang/LLVM-built binaries. I'm even willing to try and iterate on the interfaces and/or implementations in LLVM so that XRay-like things can be built on top of LLVM.
As for naming, I think being able to specify the string 'xray' on a command line (of Clang, or some other llvm-* tools) makes it easier to search for, document, and "teach". The vision is, if it's possible, to have many of these instrumentation implementations live under a single flag like '-finstrument='. For now though, any talk of that might be premature if we're only going to have '-finstrument=profile' and '-finstrument=xray'.

Does that make sense?

Cheers
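The hooks mentioned in the reply above (changing which function is called when an instrumentation point is hit at runtime) can be pictured with a very small model. This is a hedged Python sketch only: set_handler, trampoline, and the (function_id, event) signature are hypothetical names chosen for illustration, and the module-level variable stands in for what would be an atomically loaded function pointer in a native runtime.

```python
# Stand-in for the handler pointer the patched sleds would consult.
# None means "no handler installed", so instrumentation does nothing.
_handler = None


def set_handler(fn):
    """Install fn as the active handler and return the previous one.

    Passing None disables handling without touching the sleds themselves.
    """
    global _handler
    previous = _handler
    _handler = fn
    return previous


def trampoline(function_id, event):
    """Called from a patched sled: load the handler once, call it if set."""
    handler = _handler          # single read, mirroring a single atomic load
    if handler is not None:
        handler(function_id, event)
```

Under this model, a different instrumentation system could reuse the same sleds simply by installing a different handler, which is one way to read the "common infrastructure" point above.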
David Chisnall via llvm-dev
2016-Jul-04 08:27 UTC
[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds
On 4 Jul 2016, at 06:50, Dean Michael Berris via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> We've looked at the following alternatives, and we're looking to the
> community for feedback on both the current implementation and these
> alternatives.

I don’t think that I’ve yet seen an explanation of why you need the NOPs. DTrace stopped using them a long time ago, for two reasons:

1) The increased code size caused a noticeable increase in i-cache misses, even when instrumentation was not actively being used. This caused a noticeable probe effect (macroscopic observable performance artefacts even when no probes are active) and caused a lot of push-back in adoption.

2) On all of the architectures where we support DTrace (currently, I believe, x86, x86-64, AArch32, AArch64, MIPS64, and RISC-V) it’s possible to do the same thing by moving one of the instructions in the function prolog into the generated trampoline for the instrumentation.

I could understand wanting something more like patchpoints if you want to be able to instrument in the middle of a function (along the lines of TESLA or CSI), but if you’re just tracing function entry and exit then it doesn’t seem like the best solution.

David
Dean Michael Berris via llvm-dev
2016-Jul-04 08:51 UTC
[llvm-dev] [XRay] RFC: LLVM-side Changes for nop-sleds
On Mon, Jul 4, 2016 at 6:27 PM David Chisnall <david.chisnall at cl.cam.ac.uk> wrote:

> On 4 Jul 2016, at 06:50, Dean Michael Berris via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> >
> > We've looked at the following alternatives, and we're looking to the
> > community for feedback on both the current implementation and these
> > alternatives.
>
> I don’t think that I’ve yet seen an explanation of why you need the NOPs.
> DTrace stopped using them a long time ago, for two reasons:
>
> 1) The increased code size caused a noticeable increase in i-cache misses,
> even when instrumentation was not actively being used. This caused a
> noticeable probe effect (macroscopic observable performance artefacts even
> when no probes are active) and caused a lot of push-back in adoption.
>
> 2) On all of the architectures where we support DTrace (currently, I
> believe, x86, x86-64, AArch32, AArch64, MIPS64, and RISC-V) it’s possible
> to do the same thing by moving one of the instructions in the function
> prolog into the generated trampoline for the instrumentation.
>
> I could understand wanting something more like patchpoints if you want to
> be able to instrument in the middle of a function (along the lines of TESLA
> or CSI), but if you’re just tracing function entry and exit then it doesn’t
> seem like the best solution.

Thanks for the questions David -- the short version of the answer is that DTrace (last I checked) requires some help from the kernel, while XRay is self-contained in the application.

All of your points above are valid, and DTrace is a really powerful tool for debugging a lot of performance issues. XRay has a few things that differentiate it from systems like DTrace though:

1) Because we insert the instrumentation sleds only in specific functions that fit certain criteria (i.e. more selectively) instead of instrumenting every function, we pay the cost of the instrumentation being off only in functions that are instrumented.
The combination of the changes in the front-end to support attributes/annotations in the code to force or inhibit instrumentation gives control to the application developer and allows us to limit the cost along a spectrum -- full coverage costs more, selective coverage can be tuned, and explicit annotations provide precise control of the instrumentation.

2) The cost of the instrumentation at run-time is O(100) cycles for the "null-logging" case (mov + trampoline jump, atomic load and check if not zero). All the cost of instrumentation is within the process' address space (in-memory log) when on -- no additional overheads external to the application.

3) The runtime implementation for logging described in the white paper allows us to balance the coverage (number of instrumentation events we get) with overheads (the amount of resources used in the logging implementation). Because we log only very specific things (function id, tsc deltas in most cases, type of event) and have heuristics to condense the information we keep (i.e. if entry-exit pairs are under epsilon, we can omit the entry entirely), we don't need to be quite as complete when logging and instead move a lot of the logic into the reconstruction/analysis of the generated traces.

There are certainly other approaches to doing selective instrumentation, and then externally signalling/trapping (with environment support) when probing. XRay moves this needle towards having the instrumentation, the collection, and even the signalling inside the application. This makes sense if you're deploying the application on a system that doesn't have DTrace and still want to be able to isolate the costs of instrumentation to just the application.

I'll admit that I'll need to read a lot more about how DTrace manages to keep the costs of probes low enough that it could be turned on dynamically without stopping the process, and without having to intercept more events than actually necessary (i.e. only on certain functions, and only when it's on) to be able to provide a more complete answer.

Does this help?

Cheers
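The condensing heuristic in point 3 above (dropping entry-exit pairs that are shorter than some epsilon) can be sketched as follows. This is an illustrative Python model, not the runtime's implementation: the record shape (function_id, kind, tsc), the use of absolute tsc values rather than deltas, and the name condense are all assumptions.

```python
def condense(events, epsilon):
    """Drop matched entry/exit pairs whose duration is below epsilon.

    events: iterable of (function_id, kind, tsc) with kind in
            {"entry", "exit"}, in the order they occurred on one thread.
    Returns the list of records that would be kept in the log.
    """
    kept = []
    pending = []   # indices into `kept` of entries still awaiting their exit
    for function_id, kind, tsc in events:
        if kind == "entry":
            pending.append(len(kept))
            kept.append((function_id, kind, tsc))
            continue

        if pending:
            idx = pending[-1]
            entry_id, _, entry_tsc = kept[idx]
            if entry_id == function_id:
                pending.pop()
                # Only drop the pair if nothing was kept in between, so the
                # surviving log still nests correctly.
                if tsc - entry_tsc < epsilon and idx == len(kept) - 1:
                    kept.pop()
                    continue
        kept.append((function_id, kind, tsc))
    return kept
```

For example, with epsilon = 5 a call that runs for 2 ticks inside a call that runs for 100 ticks disappears from the log, while the outer entry and exit are kept for offline analysis.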