thr3ads.net - llvm dev - [llvm-dev] XRay: Demo on x86_64/Linux almost done; some questions. [Jul 2016]

Serge Rogatch via llvm-dev

2016-Jul-28 23:14 UTC

[llvm-dev] XRay: Demo on x86_64/Linux almost done; some questions.

Hello,

Can I ask you why you chose to patch both function entrances and exits,
rather than just patching the entrances and (in the patches) pushing on the
stack the address of __xray_FunctionExit , so that the user function
returns normally (with RETQ or POP RIP or whatever else instruction) rather
than jumping into __xray_FunctionExit?

By patching just the function entrances, you avoid duplicating the
function ID (which currently takes space in the entrance and every
exit) and duplicating the rest of the exit patch for each of the
potentially many function exits.

This approach also avoids reporting exits for functions whose
entrances have not been reported because the functions were already running
at the time patching happened.

This approach should also be faster because smaller code better fits in CPU
cache, and patching itself should run faster (because there is less code to
modify).

Or does this approach have some issues e.g. with exceptions, longjmp,
debugger, etc.?

Below is example patch code for ARM (sorry, no resource to translate to
x86 myself). The compile-time stub ("sled") would contain a jump as the
first instruction, skipping the next 28 bytes of NOOPs (on ARM each instruction
takes exactly 4 bytes, if not in Thumb etc. mode).

; Look at the disassembly to verify that the sled is inserted before the
;   instrumented function pushes caller's registers to the stack
;   (otherwise r4 may not get preserved)
PUSH {r4, lr}
ADR lr, #16 ; relative offset of after_entrance_traced
; r4 must be preserved by the instrumented function, so that
;   __xray_FunctionExit gets function ID in r4 too
LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
; call __xray_FunctionEntry (returning to after_entrance_traced)
LDR pc, [pc, #0] ; use the address stored by the patching mechanism
.word <32-bit function ID>
.word <32-bit address of __xray_FunctionEntry>
.word <32-bit address of __xray_FunctionExit>
after_entrance_traced:
; Make the instrumented function think that it must return to
;   __xray_FunctionExit
LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
; __xray_FunctionExit must "POP {r4, lr}" and in the end "BX lr"
; the body of the instrumented function follows

; Before patching (i.e. in sleds) the first instruction is a jump over the
;   whole stub to the first instruction in the body of the function. So lr
;   register stays original, thus no call to __xray_FunctionExit occurs at
;   the exit of the function, even if it is being patched concurrently.
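
The PC-relative offsets in the stub above can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of the original message. It assumes classic ARM state, where reading pc yields the address of the current instruction plus 8, and it treats the immediate in `ADR lr, #16` as an offset added to that pc value. The `LAYOUT` list and the helper names are hypothetical.

```python
# Slots of the patched stub, 4 bytes each (ARM mode, as stated in the message).
LAYOUT = [
    "PUSH {r4, lr}",
    "ADR lr, #16",
    "LDR r4, [pc, #0]",
    "LDR pc, [pc, #0]",
    ".word function_id",
    ".word addr_FunctionEntry",
    ".word addr_FunctionExit",
    "LDR lr, [pc, #-12]",  # this slot is the label after_entrance_traced
]
WORD = 4     # bytes per ARM instruction or data word
PC_BIAS = 8  # assumption: pc reads as the instruction address + 8


def offset_of(slot):
    """Byte offset of a slot from the start of the stub."""
    return LAYOUT.index(slot) * WORD


def pc_relative_target(slot, imm):
    """Byte offset addressed by a pc-relative operand in the given slot."""
    return offset_of(slot) + PC_BIAS + imm


def check_stub():
    """Return the stub size and whether every pc-relative operand lands
    on the slot that the comments in the message say it should."""
    ok = (
        pc_relative_target("ADR lr, #16", 16) == offset_of("LDR lr, [pc, #-12]")
        and pc_relative_target("LDR r4, [pc, #0]", 0) == offset_of(".word function_id")
        and pc_relative_target("LDR pc, [pc, #0]", 0) == offset_of(".word addr_FunctionEntry")
        and pc_relative_target("LDR lr, [pc, #-12]", -12) == offset_of(".word addr_FunctionExit")
    )
    return len(LAYOUT) * WORD, ok
```

Under these assumptions the stub occupies 32 bytes, which matches a 4-byte jump followed by 28 bytes of NOOPs in the unpatched sled.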

Dean Michael Berris via llvm-dev

2016-Jul-29 07:43 UTC

> On 29 Jul 2016, at 09:14, Serge Rogatch via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hello,
>
> Can I ask you why you chose to patch both function entrances and exits,
> rather than just patching the entrances and (in the patches) pushing on the
> stack the address of __xray_FunctionExit , so that the user function returns
> normally (with RETQ or POP RIP or whatever else instruction) rather than
> jumping into __xray_FunctionExit?
>
> By patching just the function entrances, you avoid duplication of the
> function ID (which is currently taking space in the entrance and every exit)
> and duplication of the rest of the exit patch for every of the potentially
> many function exits.
>
> This approach also avoids reporting exits for functions, for which
> entrances have not been reported because the functions were already running
> at the time patching happened.
>
> This approach should also be faster because smaller code better fits in CPU
> cache, and patching itself should run faster (because there is less code to
> modify).
>
> Or does this approach have some issues e.g. with exceptions, longjmp,
> debugger, etc.?

The only issues I can think of are those of potentially interfering with and
invalidating the stack pointer at runtime. Because the patching and the
determination of what the function IDs are happen at runtime and not
statically, we can only provide the space for the function ID. On x86_64 this
works out to be just a few bytes. We also make sure XRay works even if
frame pointers are omitted.

Another issue is that of tail call and sibling call optimisations. Because
exiting these functions actually turns out to be a jump, we cannot be sure that
the jumped-to function will clean up the stack appropriately.

As far as avoiding writing exit records without entry records, we deal with
those externally (during analysis of the trace). It's important to know, when
instrumentation is turned on (i.e. the log handler is not nullptr), that
there was a function already running and that it exited at a given point in
time. Especially when unwinding a deep function call stack, we can keep track of
this as it's important information for analysis.

Consider the following case:

A() -> B() -> C() -> D() -> E()

When instrumentation is enabled after E() has started, we can see records of the
following kind:

[timestamp, cpu] Exit E()
[timestamp, cpu] Exit D()
[timestamp, cpu] Exit B()
[timestamp, cpu] Exit A()

Note that the time difference between "Exit E()" and "Exit D()" may not be
0 -- there may very well have been work happening between the exit of E() and
the exit of D(), and similarly up the stack.

Does this make sense?
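
The kind of external analysis described above can be sketched roughly as follows. This is a minimal Python illustration, not XRay's actual log format or tooling: the `Record` tuple, its field names, and `pair_records` are hypothetical, and it assumes records from a single thread in timestamp order. Exit records with no matching entry are kept as "orphans" rather than dropped, since they show that a function was already running when instrumentation was enabled.

```python
from collections import namedtuple

# Hypothetical record shape; kind is either "entry" or "exit".
Record = namedtuple("Record", "timestamp cpu kind func")


def pair_records(records):
    """Match entry/exit records and collect exits that have no entry.

    Returns (completed, orphans): completed is a list of
    (function name, duration) pairs, orphans is a list of exit records
    for functions that were already running when tracing was enabled.
    """
    stack = []      # entry records still waiting for their exit
    completed = []
    orphans = []
    for rec in records:
        if rec.kind == "entry":
            stack.append(rec)
        elif stack and stack[-1].func == rec.func:
            entry = stack.pop()
            completed.append((rec.func, rec.timestamp - entry.timestamp))
        else:
            # Exit without a recorded entry: keep it for the analysis.
            orphans.append(rec)
    return completed, orphans
```

For the example above, the exits of E(), D(), B() and A() would all land in `orphans`, while any function that both starts and finishes after instrumentation is enabled would appear in `completed` with its duration.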

> Below is an example patch code for ARM (sorry, no resource to translate to
> x86 myself). The compile-time stub ("sled") would contain a jump as the
> first instruction, skipping 28 next bytes of NOOPs (on ARM each instruction
> takes exactly 4 bytes, if not in Thumb etc. mode).
>
> ; Look at the disassembly to verify that the sled is inserted before the
> ;   instrumented function pushes caller's registers to the stack
> ;   (otherwise r4 may not get preserved)
> PUSH {r4, lr}
> ADR lr, #16 ; relative offset of after_entrance_traced
> ; r4 must be preserved by the instrumented function, so that
> ;   __xray_FunctionExit gets function ID in r4 too
> LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
> ; call __xray_FunctionEntry (returning to after_entrance_traced)
> LDR pc, [pc, #0] ; use the address stored by the patching mechanism
> .word <32-bit function ID>
> .word <32-bit address of __xray_FunctionEntry>
> .word <32-bit address of __xray_FunctionExit>
> after_entrance_traced:
> ; Make the instrumented function think that it must return to
> ;   __xray_FunctionExit
> LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
> ; __xray_FunctionExit must "POP {r4, lr}" and in the end "BX lr"
> ; the body of the instrumented function follows
>
> ; Before patching (i.e. in sleds) the first instruction is a jump over the
> ;   whole stub to the first instruction in the body of the function. So lr
> ;   register stays original, thus no call to __xray_FunctionExit occurs at
> ;   the exit of the function, even if it is being patched concurrently.

Cool, thanks -- we have an interim logging implementation for x86 which does
naïve logging to memory and then flushes to disk regularly (I suspect you've
already seen https://reviews.llvm.org/D21982). In that patch we have the very
early beginnings of a test suite, so if you'd like to contribute the ARM
implementation, we can review that patch and land it to allow you to
add tests and make sure that this also works on ARM.

I have zero experience with actually doing anything with ARM assembly and
I'd appreciate all the help I can get to make XRay work on ARM too.

Cheers!

Serge Rogatch via llvm-dev

2016-Jul-29 13:58 UTC

On 29 July 2016 at 10:43, Dean Michael Berris <dean.berris at gmail.com> wrote:
>
> > On 29 Jul 2016, at 09:14, Serge Rogatch via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >
> > Hello,
> >
> > Can I ask you why you chose to patch both function entrances and exits,
> > rather than just patching the entrances and (in the patches) pushing on the
> > stack the address of __xray_FunctionExit , so that the user function
> > returns normally (with RETQ or POP RIP or whatever else instruction) rather
> > than jumping into __xray_FunctionExit?
> >
> > By patching just the function entrances, you avoid duplication of the
> > function ID (which is currently taking space in the entrance and every
> > exit) and duplication of the rest of the exit patch for every of the
> > potentially many function exits.
> >
> > This approach also avoids reporting exits for functions, for which
> > entrances have not been reported because the functions were already running
> > at the time patching happened.
> >
> > This approach should also be faster because smaller code better fits in
> > CPU cache, and patching itself should run faster (because there is less
> > code to modify).
> >
> > Or does this approach have some issues e.g. with exceptions, longjmp,
> > debugger, etc.?
> >
>
> The only issues I can think of are those of potentially interfering with
> and invalidating the stack pointer at runtime. Because the patching and
> determination of what the function id's are happen at runtime and not
> statically, we can only provide the space for the function id. In x86_64
> this works out to only be just a few bytes. We also make sure XRay works
> even if frame pointers are omitted.
>
> Another issue is that of tail call and sibling call optimisations. Because
> exiting these functions actually turn out to be jumps, we cannot be sure
> that the jumped-to function will clean up the stack appropriately.
>
> As far as avoiding writing exit records without entry records, we deal
> with those externally (during analysis of the trace). It's important to
> know that when instrumentation is turned on (i.e. the log handler is not
> nullptr) that there was a function already running and that it exited at a
> given point in time. Especially when unwinding a deep function call stack,
> we can keep track of this as it's important information for analysis.
>
> Consider the following case:
>
> A() -> B() -> C() -> D() -> E()
>
> When instrumentation is enabled after E() has started, we can see records
> of the following kind:
>
> [timestamp, cpu] Exit E()
> [timestamp, cpu] Exit D()
> [timestamp, cpu] Exit B()
> [timestamp, cpu] Exit A()
>
> Note that the difference between "Exit E()" and "Exit D()" may not be 0 --
> and that there may have very well been work happening between the exit of
> E() and exit of D(), and similarly up the stack.
>
> Does this make sense?

Yes, this makes sense, thanks for the analysis. I'm going to investigate
later how to keep the stack consistent for unwinding (so as to support C++
exceptions), e.g. by pretending that the __xray_FunctionExit call is the
destructor of the first object (local variable) on the stack.

> > Below is an example patch code for ARM (sorry, no resource to translate
> > to x86 myself). The compile-time stub ("sled") would contain a jump as the
> > first instruction, skipping 28 next bytes of NOOPs (on ARM each instruction
> > takes exactly 4 bytes, if not in Thumb etc. mode).
> >
> > ; Look at the disassembly to verify that the sled is inserted before the
> > ;   instrumented function pushes caller's registers to the stack
> > ;   (otherwise r4 may not get preserved)
> > PUSH {r4, lr}
> > ADR lr, #16 ; relative offset of after_entrance_traced
> > ; r4 must be preserved by the instrumented function, so that
> > ;   __xray_FunctionExit gets function ID in r4 too
> > LDR r4, [pc, #0] ; offset of function ID stored by the patching mechanism
> > ; call __xray_FunctionEntry (returning to after_entrance_traced)
> > LDR pc, [pc, #0] ; use the address stored by the patching mechanism
> > .word <32-bit function ID>
> > .word <32-bit address of __xray_FunctionEntry>
> > .word <32-bit address of __xray_FunctionExit>
> > after_entrance_traced:
> > ; Make the instrumented function think that it must return to
> > ;   __xray_FunctionExit
> > LDR lr, [pc, #-12] ; offset of address of __xray_FunctionExit
> > ; __xray_FunctionExit must "POP {r4, lr}" and in the end "BX lr"
> > ; the body of the instrumented function follows
> >
> > ; Before patching (i.e. in sleds) the first instruction is a jump over the
> > ;   whole stub to the first instruction in the body of the function. So lr
> > ;   register stays original, thus no call to __xray_FunctionExit occurs at
> > ;   the exit of the function, even if it is being patched concurrently.
>
> Cool, thanks -- we have an interim logging implementation for x86 which
> does the naïve logging to memory then flushes to disk regularly (I suspect
> you've already seen https://reviews.llvm.org/D21982).

No, I wasn't aware of that patch, thanks for pointing it out!

> In that patch we have the very early beginnings of a test suite, so I
> think if you'd like to contribute the ARM implementation, that we can
> review that patch and land it to allow you to add tests and make sure that
> this also works on ARM.
>
> I have zero experience with actually doing anything with ARM assembly and
> I'd appreciate all the help I can get to make XRay work on ARM too.

Yes, I am trying to port XRay in LLVM to ARM, but I'm just starting with
LLVM.

> Cheers!

Tim Northover via llvm-dev

2016-Jul-29 18:00 UTC

On 28 July 2016 at 16:14, Serge Rogatch via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Can I ask you why you chose to patch both function entrances and exits,
> rather than just patching the entrances and (in the patches) pushing on the
> stack the address of __xray_FunctionExit , so that the user function returns
> normally (with RETQ or POP RIP or whatever else instruction) rather than
> jumping into __xray_FunctionExit?
>
> This approach should also be faster because smaller code better fits in CPU
> cache, and patching itself should run faster (because there is less code to
> modify).

It may well be slower. Larger CPUs tend to track the call stack in
hardware and returning to an address pushed manually is an inevitable
branch mispredict in those cases.
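
To illustrate the point about hardware call-stack tracking, here is a toy Python model of a return-address predictor. It is only a sketch under simplifying assumptions: real predictors differ between CPUs, and the class name, addresses, and scenario functions are all hypothetical. Each call pushes the expected return address, each return pops a prediction, and a return to any other address counts as a mispredict.

```python
class ReturnStackPredictor:
    """Toy model: predicts each return from the most recent call."""

    def __init__(self):
        self._stack = []
        self.mispredicts = 0

    def call(self, return_address):
        # A call instruction records where the callee should return to.
        self._stack.append(return_address)

    def ret(self, actual_target):
        # A return is predicted from the top of the stack, if any.
        predicted = self._stack.pop() if self._stack else None
        if predicted != actual_target:
            self.mispredicts += 1


CALLER_RESUME = 0x1000  # hypothetical address right after the call site
EXIT_HANDLER = 0x9000   # hypothetical address of the exit trampoline


def normal_return():
    """The callee returns to where the call came from."""
    p = ReturnStackPredictor()
    p.call(CALLER_RESUME)
    p.ret(CALLER_RESUME)
    return p.mispredicts


def redirected_return():
    """The callee's return address was replaced with the exit handler."""
    p = ReturnStackPredictor()
    p.call(CALLER_RESUME)  # caller invokes the instrumented function
    p.ret(EXIT_HANDLER)    # function "returns" into the exit handler
    p.ret(CALLER_RESUME)   # handler then returns to the real caller
    return p.mispredicts
```

In this toy model the normal path never mispredicts, while the redirected path misses on the return into the exit handler and again on the handler's own return, because the predictor's entry was already consumed. Whether and how much this costs on a given CPU would need to be measured.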

Cheers.

Tim.

Serge Rogatch via llvm-dev

2016-Jul-29 19:07 UTC

Thanks for pointing this out, Tim. Then maybe this approach is not the best
choice for x86, though ideally it should be measured. On ARM, however, the
current x86 approach is not applicable because ARM doesn't have a single
return instruction (such as RETQ on x86_64); furthermore, the return
instructions on ARM can be conditional.

I have another question: what happens if the instrumented function (or one of
its callees) throws an exception and doesn't catch it? I understand that
currently XRay will not report an exit from this function in such a case
because the function doesn't return with RETQ, but rather the stack unwinder
jumps through functions, calling the destructors of local variable objects.

If so, why not instrument the functions by placing a tracing object as
the first local variable, with its constructor calling __xray_FunctionEntry
and its destructor calling __xray_FunctionExit? Perhaps this approach requires
changes in the front-end (the C++ compiler, before emitting IR).
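
The tracing-object idea can be illustrated with a Python analogue of the C++ constructor/destructor pairing, using a context manager. This is only a sketch of the concept, not how XRay works: `xray_function_entry`, `xray_function_exit`, `traced`, and `instrumented` are hypothetical stand-ins, and the `events` list stands in for a trace log.

```python
from contextlib import contextmanager

events = []  # stand-in for the trace log


def xray_function_entry(func_id):
    """Hypothetical stand-in for the entry handler."""
    events.append(("entry", func_id))


def xray_function_exit(func_id):
    """Hypothetical stand-in for the exit handler."""
    events.append(("exit", func_id))


@contextmanager
def traced(func_id):
    """Plays the role of the tracing object: entry on construction,
    exit on destruction, including during exception unwinding."""
    xray_function_entry(func_id)
    try:
        yield
    finally:
        xray_function_exit(func_id)


def instrumented(func_id, fail):
    """A function whose whole body sits inside the tracing scope."""
    with traced(func_id):
        if fail:
            raise RuntimeError("uncaught inside the instrumented function")
        return "ok"
```

With this shape, an exception that propagates out of `instrumented` still produces an exit event, which is the behaviour the question above is after.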

Cheers.
