Serge Rogatch via llvm-dev
2016-Jul-28 23:14 UTC
[llvm-dev] XRay: Demo on x86_64/Linux almost done; some questions.
Hello,

Can I ask why you chose to patch both function entrances and exits, rather than patching only the entrances and (in the patches) pushing the address of __xray_FunctionExit onto the stack, so that the user function returns normally (with RETQ, POP RIP, or whatever other instruction) rather than jumping into __xray_FunctionExit?

By patching only the function entrances, you avoid duplicating the function ID (which currently takes space in the entrance and in every exit) and duplicating the rest of the exit patch for each of the potentially many function exits.

This approach also avoids reporting exits for functions whose entrances were never reported because the functions were already running when patching happened.

This approach should also be faster, because smaller code fits better in the CPU cache, and patching itself should run faster (because there is less code to modify).

Or does this approach have issues, e.g. with exceptions, longjmp, debuggers, etc.?

Below is example patch code for ARM (sorry, I have no resources to translate it to x86 myself). The compile-time stub ("sled") would contain a jump as its first instruction, skipping the next 28 bytes of NOPs (on ARM each instruction takes exactly 4 bytes, unless in Thumb or a similar mode).

    ; Look at the disassembly to verify that the sled is inserted before the
    ; instrumented function pushes caller's registers to the stack
    ; (otherwise r4 may not get preserved)
    PUSH {r4, lr}
    ADR lr, #16 ; relative offset of after_entrance_traced
    ; r4 must be preserved by the instrumented function, so that
    ; __xray_FunctionExit gets function ID in r4 too
    LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
    ; call __xray_FunctionEntry (returning to after_entrance_traced)
    LDR pc, [pc, #0] ; use the address stored by the patching mechanism
    .word <32-bit function ID>
    .word <32-bit address of __xray_FunctionEntry>
    .word <32-bit address of __xray_FunctionExit>
    after_entrance_traced:
    ; Make the instrumented function think that it must return to __xray_FunctionExit
    LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
    ; __xray_FunctionExit must "POP {r4, lr}" and in the end "BX lr"
    ; the body of the instrumented function follows

    ; Before patching (i.e. in sleds) the first instruction is a jump over the
    ; whole stub to the first instruction in the body of the function. So lr
    ; register stays original, thus no call to __xray_FunctionExit occurs at the
    ; exit of the function, even if it is being patched concurrently.
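
The PC-relative offsets in the listing above can be checked with a little arithmetic. The following is a minimal C++ sketch, not part of the original message. It assumes ARM (A32) state, where an instruction that reads pc sees its own address plus 8, and it assumes that each immediate in the listing (#16, #0, #0, #-12) is the value added to that pc reading. The names `StubOffset` and `pcRelativeTarget` are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Assumption: in ARM (A32) state, reading pc yields the address of the
// current instruction plus 8 bytes.
constexpr std::int32_t kArmPcReadAhead = 8;

// Byte offsets inside the patched stub, in the order given in the listing.
enum StubOffset : std::int32_t {
  kPush = 0,            // PUSH {r4, lr}
  kAdrLr = 4,           // ADR lr, #16
  kLdrR4 = 8,           // LDR r4, [pc, #0]
  kLdrPc = 12,          // LDR pc, [pc, #0]
  kFuncIdWord = 16,     // .word <function ID>
  kEntryAddrWord = 20,  // .word <address of __xray_FunctionEntry>
  kExitAddrWord = 24,   // .word <address of __xray_FunctionExit>
  kAfterEntrance = 28,  // after_entrance_traced: LDR lr, [pc, #-12]
  kStubSize = 32        // first byte of the instrumented function body
};

// Offset within the stub that a pc-relative operand resolves to.
constexpr std::int32_t pcRelativeTarget(std::int32_t instrOffset,
                                        std::int32_t imm) {
  return instrOffset + kArmPcReadAhead + imm;
}

// Each immediate in the listing lands on the intended word or label.
static_assert(pcRelativeTarget(kAdrLr, 16) == kAfterEntrance,
              "ADR lr, #16 points at after_entrance_traced");
static_assert(pcRelativeTarget(kLdrR4, 0) == kFuncIdWord,
              "LDR r4 loads the function ID word");
static_assert(pcRelativeTarget(kLdrPc, 0) == kEntryAddrWord,
              "LDR pc loads the entry handler address");
static_assert(pcRelativeTarget(kAfterEntrance, -12) == kExitAddrWord,
              "LDR lr loads the exit handler address");
// One 4-byte jump plus 28 bytes of NOPs gives the same size as the stub.
static_assert(kStubSize == 4 + 28, "sled and patched stub have equal size");
```

Under these assumptions the stub occupies eight 4-byte words, which matches a sled made of one jump followed by 28 bytes of NOPs.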
Dean Michael Berris via llvm-dev
2016-Jul-29 07:43 UTC
[llvm-dev] XRay: Demo on x86_64/Linux almost done; some questions.
> On 29 Jul 2016, at 09:14, Serge Rogatch via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hello,
>
> Can I ask why you chose to patch both function entrances and exits, rather than patching only the entrances and (in the patches) pushing the address of __xray_FunctionExit onto the stack, so that the user function returns normally (with RETQ, POP RIP, or whatever other instruction) rather than jumping into __xray_FunctionExit?
>
> By patching only the function entrances, you avoid duplicating the function ID (which currently takes space in the entrance and in every exit) and duplicating the rest of the exit patch for each of the potentially many function exits.
>
> This approach also avoids reporting exits for functions whose entrances were never reported because the functions were already running when patching happened.
>
> This approach should also be faster, because smaller code fits better in the CPU cache, and patching itself should run faster (because there is less code to modify).
>
> Or does this approach have issues, e.g. with exceptions, longjmp, debuggers, etc.?
>

The only issues I can think of are those of potentially interfering with and invalidating the stack pointer at runtime. Because the patching and the determination of the function IDs happen at runtime and not statically, we can only provide the space for the function ID. On x86_64 this works out to just a few bytes. We also make sure XRay works even if frame pointers are omitted.

Another issue is that of tail call and sibling call optimisations. Because exits from these functions actually turn out to be jumps, we cannot be sure that the jumped-to function will clean up the stack appropriately.

As far as avoiding writing exit records without entry records, we deal with those externally (during analysis of the trace). It's important to know that when instrumentation is turned on (i.e. the log handler is not nullptr) there was a function already running and that it exited at a given point in time. Especially when unwinding a deep function call stack, we can keep track of this, as it's important information for analysis.

Consider the following case:

    A() -> B() -> C() -> D() -> E()

When instrumentation is enabled after E() has started, we can see records of the following kind:

    [timestamp, cpu] Exit E()
    [timestamp, cpu] Exit D()
    [timestamp, cpu] Exit B()
    [timestamp, cpu] Exit A()

Note that the difference between "Exit E()" and "Exit D()" may not be 0 -- and that there may very well have been work happening between the exit of E() and the exit of D(), and similarly up the stack.

Does this make sense?

> Below is example patch code for ARM (sorry, I have no resources to translate it to x86 myself). The compile-time stub ("sled") would contain a jump as its first instruction, skipping the next 28 bytes of NOPs (on ARM each instruction takes exactly 4 bytes, unless in Thumb or a similar mode).
>
> ; Look at the disassembly to verify that the sled is inserted before the
> ; instrumented function pushes caller's registers to the stack
> ; (otherwise r4 may not get preserved)
> PUSH {r4, lr}
> ADR lr, #16 ; relative offset of after_entrance_traced
> ; r4 must be preserved by the instrumented function, so that
> ; __xray_FunctionExit gets function ID in r4 too
> LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
> ; call __xray_FunctionEntry (returning to after_entrance_traced)
> LDR pc, [pc, #0] ; use the address stored by the patching mechanism
> .word <32-bit function ID>
> .word <32-bit address of __xray_FunctionEntry>
> .word <32-bit address of __xray_FunctionExit>
> after_entrance_traced:
> ; Make the instrumented function think that it must return to __xray_FunctionExit
> LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
> ; __xray_FunctionExit must "POP {r4, lr}" and in the end "BX lr"
> ; the body of the instrumented function follows
>
> ; Before patching (i.e. in sleds) the first instruction is a jump over the
> ; whole stub to the first instruction in the body of the function. So lr
> ; register stays original, thus no call to __xray_FunctionExit occurs at the
> ; exit of the function, even if it is being patched concurrently.

Cool, thanks -- we have an interim logging implementation for x86 which does naïve logging to memory and then flushes to disk regularly (I suspect you've already seen https://reviews.llvm.org/D21982).

In that patch we have the very early beginnings of a test suite, so if you'd like to contribute the ARM implementation, we can review that patch and land it, to allow you to add tests and make sure that this also works on ARM.

I have zero experience with actually doing anything with ARM assembly, and I'd appreciate all the help I can get to make XRay work on ARM too.

Cheers!
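
The point about handling exit records that have no matching entry "externally (during analysis of the trace)" can be illustrated with a small sketch. This is not XRay's analysis code; the record layout, the `analyze` function, and the single-threaded, well-nested trace are all assumptions made for the illustration. An exit that does not match the top of the reconstructed call stack is treated as belonging to a function that was already running when instrumentation was enabled.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record shape for illustration; not XRay's actual log format.
enum class RecordKind { Entry, Exit };

struct TraceRecord {
  RecordKind kind;
  std::string function;
  std::uint64_t timestamp;
};

struct AnalysisResult {
  std::vector<std::string> completedCalls;  // entry and exit both observed
  std::vector<std::string> orphanExits;     // function predates tracing
};

// Replays the trace against a reconstructed call stack (assumes a single
// thread and properly nested calls).
AnalysisResult analyze(const std::vector<TraceRecord>& trace) {
  AnalysisResult result;
  std::vector<std::string> stack;
  for (const TraceRecord& rec : trace) {
    if (rec.kind == RecordKind::Entry) {
      stack.push_back(rec.function);
    } else if (!stack.empty() && stack.back() == rec.function) {
      // Matching entry was seen: the call is fully covered by the trace.
      stack.pop_back();
      result.completedCalls.push_back(rec.function);
    } else {
      // No matching entry: the function was running before tracing began.
      result.orphanExits.push_back(rec.function);
    }
  }
  return result;
}
```

Fed the exit-only records from the example above, such an analysis would classify every one of them as an orphan exit while still keeping their timestamps available for reasoning about the work done between them.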
Serge Rogatch via llvm-dev
2016-Jul-29 13:58 UTC
[llvm-dev] XRay: Demo on x86_64/Linux almost done; some questions.
On 29 July 2016 at 10:43, Dean Michael Berris <dean.berris at gmail.com> wrote:

> The only issues I can think of are those of potentially interfering with
> and invalidating the stack pointer at runtime. Because the patching and
> the determination of the function IDs happen at runtime and not
> statically, we can only provide the space for the function ID. On x86_64
> this works out to just a few bytes. We also make sure XRay works
> even if frame pointers are omitted.
>
> Another issue is that of tail call and sibling call optimisations. Because
> exits from these functions actually turn out to be jumps, we cannot be sure
> that the jumped-to function will clean up the stack appropriately.
>
> As far as avoiding writing exit records without entry records, we deal
> with those externally (during analysis of the trace). It's important to
> know that when instrumentation is turned on (i.e. the log handler is not
> nullptr) there was a function already running and that it exited at a
> given point in time. Especially when unwinding a deep function call stack,
> we can keep track of this, as it's important information for analysis.
>
> Consider the following case:
>
> A() -> B() -> C() -> D() -> E()
>
> When instrumentation is enabled after E() has started, we can see records
> of the following kind:
>
> [timestamp, cpu] Exit E()
> [timestamp, cpu] Exit D()
> [timestamp, cpu] Exit B()
> [timestamp, cpu] Exit A()
>
> Note that the difference between "Exit E()" and "Exit D()" may not be 0 --
> and that there may very well have been work happening between the exit of
> E() and the exit of D(), and similarly up the stack.
>
> Does this make sense?
>

Yes, this makes sense, thanks for the analysis. I'm going to investigate later how to keep the stack consistent for unwinding (so as to support C++ exceptions), e.g. by pretending that the __xray_FunctionExit call is the destructor of the first object (local variable) on the stack.

> Cool, thanks -- we have an interim logging implementation for x86 which
> does naïve logging to memory and then flushes to disk regularly (I suspect
> you've already seen https://reviews.llvm.org/D21982).

No, I wasn't aware of that patch, thanks for pointing it out!

> In that patch we have the very early beginnings of a test suite, so
> if you'd like to contribute the ARM implementation, we can
> review that patch and land it, to allow you to add tests and make sure that
> this also works on ARM.
>
> I have zero experience with actually doing anything with ARM assembly, and
> I'd appreciate all the help I can get to make XRay work on ARM too.
>

Yes, I am trying to port XRay on LLVM to ARM, but I'm just starting with LLVM.

> Cheers!
Tim Northover via llvm-dev
2016-Jul-29 18:00 UTC
[llvm-dev] XRay: Demo on x86_64/Linux almost done; some questions.
On 28 July 2016 at 16:14, Serge Rogatch via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Can I ask you why you chose to patch both function entrances and exits,
> rather than just patching the entrances and (in the patches) pushing on the
> stack the address of __xray_FunctionExit , so that the user function returns
> normally (with RETQ or POP RIP or whatever else instruction) rather than
> jumping into __xray_FunctionExit?

> This approach should also be faster because smaller code better fits in CPU
> cache, and patching itself should run faster (because there is less code to
> modify).

It may well be slower. Larger CPUs tend to track the call stack in hardware, and returning to an address pushed manually is an inevitable branch mispredict in those cases.

Cheers.

Tim.
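
The misprediction argument above can be made concrete with a toy model. The sketch below is a deliberate simplification, not a description of any particular CPU: the class `ReturnStackModel`, its methods, and the addresses are hypothetical, and real return predictors differ in depth and policy. The model only captures the idea that a call pushes a predicted return address and a return is predicted correctly only if it goes back to that address.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a hardware return-address predictor (an assumption for
// illustration; real predictors differ in size and replacement policy).
class ReturnStackModel {
 public:
  // A call instruction pushes the address of the instruction after it.
  void onCall(std::uint64_t returnAddress) {
    predicted_.push_back(returnAddress);
  }

  // A return pops the prediction and compares it with the actual target.
  // Returns true when the prediction was correct.
  bool onReturn(std::uint64_t actualTarget) {
    if (predicted_.empty()) {
      ++mispredicts_;
      return false;
    }
    const std::uint64_t guess = predicted_.back();
    predicted_.pop_back();
    if (guess != actualTarget) {
      ++mispredicts_;
      return false;
    }
    return true;
  }

  std::size_t mispredicts() const { return mispredicts_; }

 private:
  std::vector<std::uint64_t> predicted_;
  std::size_t mispredicts_ = 0;
};

// The function returns to the address its caller's call instruction pushed.
std::size_t mispredictsWithOrdinaryReturn() {
  ReturnStackModel cpu;
  cpu.onCall(0x1000);    // caller calls f; 0x1000 follows the call
  cpu.onReturn(0x1000);  // f returns to its caller
  return cpu.mispredicts();
}

// The entry stub replaced the return address with an exit handler's
// address without executing a call, so the predictor still expects 0x1000.
std::size_t mispredictsWithRedirectedReturn() {
  ReturnStackModel cpu;
  cpu.onCall(0x1000);    // caller calls f
  cpu.onReturn(0x9000);  // f "returns" into the exit handler instead
  return cpu.mispredicts();
}
```

In this model the ordinary return costs no mispredictions, while the redirected return costs one per instrumented call. Whether that outweighs the code-size savings on real hardware would still have to be measured.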
Serge Rogatch via llvm-dev
2016-Jul-29 19:07 UTC
[llvm-dev] XRay: Demo on x86_64/Linux almost done; some questions.
Thanks for pointing this out, Tim. Then maybe this approach is not the best choice for x86, though ideally it should be measured. It is just that on ARM the current x86 approach is not applicable, because ARM doesn't have a single return instruction (such as RETQ on x86_64); furthermore, the return instructions on ARM can be conditional.

I have another question: what happens if the instrumented function (or one of its callees) throws an exception and doesn't catch it? I understand that currently XRay will not report an exit from this function in that case, because the function doesn't return with RETQ; rather, the stack unwinder jumps through the functions, calling the destructors of local objects. If so, why not instrument the functions by placing a tracing object as the first local variable, with its constructor calling __xray_FunctionEntry and its destructor calling __xray_FunctionExit? Perhaps this approach requires changes in the front end (the C++ compiler, before emitting IR).

Cheers.

On 29 July 2016 at 21:00, Tim Northover <t.p.northover at gmail.com> wrote:

> It may well be slower. Larger CPUs tend to track the call stack in
> hardware, and returning to an address pushed manually is an inevitable
> branch mispredict in those cases.
>
> Cheers.
>
> Tim.
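
The tracing-object idea from the message above can be sketched at the source level in C++. This is only an illustration of the proposal, not how XRay instruments code: `traceEntry`, `traceExit`, `traceLog`, and `ScopedTrace` are hypothetical stand-ins that record events in memory, and the function IDs are made up.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Stand-ins for the runtime hooks (hypothetical): they only record events.
inline std::vector<std::string>& traceLog() {
  static std::vector<std::string> log;
  return log;
}
inline void traceEntry(int funcId) {
  traceLog().push_back("enter " + std::to_string(funcId));
}
inline void traceExit(int funcId) {
  traceLog().push_back("exit " + std::to_string(funcId));
}

// Placed as the first local variable of an instrumented function: the
// constructor reports the entry and the destructor reports the exit,
// including when the function is left by stack unwinding.
class ScopedTrace {
 public:
  explicit ScopedTrace(int funcId) : funcId_(funcId) { traceEntry(funcId_); }
  ~ScopedTrace() { traceExit(funcId_); }
  ScopedTrace(const ScopedTrace&) = delete;
  ScopedTrace& operator=(const ScopedTrace&) = delete;

 private:
  int funcId_;
};

void inner() {
  ScopedTrace trace(2);
  throw std::runtime_error("not caught in inner or outer");
}

void outer() {
  ScopedTrace trace(1);
  inner();  // the exception propagates through outer as well
}
```

If a caller wraps `outer()` in a try/catch, the log ends up with both exits recorded in reverse order of entry. One caveat of this sketch: if no handler exists anywhere up the stack, the program terminates and the C++ standard leaves it implementation-defined whether the stack is unwound at all, so the exit records are not guaranteed in that case.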