thr3ads.net - llvm dev - [llvm-dev] [XRay] Build instrumented Clang, some analysis results [Jul 2016]

If this information is useful, please help other people find it:
Share via:

Dean Michael Berris via llvm-dev

2016-Jul-20 10:26 UTC

[llvm-dev] [XRay] Build instrumented Clang, some analysis results

> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at pathscale.com>
wrote:
> 
> Some general questions about X-Ray
> -------------
> Is there a plan to make a separate mailing list or project around
> this? Do you have a list of planned features?
Interesting question -- so far we haven't decided yet whether XRay will live
as another project. I'm certainly open to this possibility. No concrete
plans yet. It's an open question in the original RFC too
(http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html).

There's a white paper that details what we plan to implement out in the open
(http://research.google.com/pubs/pub45287.html). We're still working our way
to getting to a full version as described in that white paper (basically blocked
by my lack of familiarity with the LLVM codebase, and other n00b-y things :D).

There's not a concrete list of features, and we're certainly open to
contributions from the community to add features that make sense. :)
> 
> Graphics tools for analysis? AMD open sourced their CodeAnalyst - What
> about some integration with that?
> 
Thanks for the pointer! Yes, I'd love to have more integration with existing
visualisation tools that read a particular well-documented format. Others have
mentioned Jumpshot which might be a little dated, but still something that some
people use for similar things.
> Linux + Perf support (planned/exists)?
> 
There are no plans to support the perf counter-lookups yet. Although I certainly
think that's a nice source of data to be logging XRay-style.

FWIW, the API for XRay allows us to decouple the things being logged at function
entry/exit. Getting performance counters at those points is a nice idea, it
should be doable.
> How much is this tied to something specific about Linux or it could be
> easily ported to another platform?
Currently, the only Linux-specific part I can remember is getting the cpu
frequency (looking at sysfs files). That can be implemented on a
platform-agnostic (or at least pluggable and portable) manner.

There are x86'isms and I'm working on understanding how to do this in
Aarch64 or ARM.
> 
> What's the benefit of this vs a stable and production ready tool like
Dtrace?
> 
I think I've pointed out the differences in a separate mail (some mail
filters may have squashed that response, so apologies if that was missed):
http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short
version is:

- Dtrace requires kernel-side support.
- XRay is completely in-process and controllable by the process through an API
(not sure if dtrace is the same).
- XRay is selective and configurable by the application developer.
- XRay's cost is borne by the application only, and does not require
stopping the application.
> How much overhead do you typically measure?
> 
We've seen in the "null logging case" something around O(100)
cycles in X86 for the trampoline side of XRay. Of course richer logging requires
more cycles, and is completely implementation-dependent. The current one under
development only writes fixed-sized records, uses __rdtscp(), does aligned
writes only, and flushes when the buffer is full. The buffer is 32k per thread.

I haven't formally done benchmarks on the current implementation yet, but
I'd be happy to do that soon.
> If you're injection calls before/after every function - does it end up
> blocking optimizations? Without looking at the implementation, if
> you're injecting the calls late enough in the compilation process it
> won't be a "problem", but if it's too early - you're
going to end up
> blocking a lot of optimizations and interfering with things a lot..
> 
It's currently implemented as a MachineFunctionPass, and as far as I can
tell is already late enough in the process that we are not interfering with
optimisations.

Cheers

C Bergström via llvm-dev

2016-Jul-20 10:58 UTC

head link

[llvm-dev] [XRay] Build instrumented Clang, some analysis results

On Wed, Jul 20, 2016 at 6:26 PM, Dean Michael Berris
<dean.berris at gmail.com> wrote:>
>> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at
pathscale.com> wrote:
>>
>> Some general questions about X-Ray
>> -------------
>> Is there a plan to make a separate mailing list or project around
>> this? Do you have a list of planned features?
>
> Interesting question -- so far we haven't decided yet whether XRay will
live as another project. I'm certainly open to this possibility. No concrete
plans yet. It's an open question in the original RFC too
(http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html).
>
> There's a white paper that details what we plan to implement out in the
open (http://research.google.com/pubs/pub45287.html). We're still working
our way to getting to a full version as described in that white paper (basically
blocked by my lack of familiarity with the LLVM codebase, and other n00b-y
things :D).
>
> There's not a concrete list of features, and we're certainly open
to contributions from the community to add features that make sense. :)
>
>>
>> Graphics tools for analysis? AMD open sourced their CodeAnalyst - What
>> about some integration with that?
>>
>
> Thanks for the pointer! Yes, I'd love to have more integration with
existing visualisation tools that read a particular well-documented format.
Others have mentioned Jumpshot which might be a little dated, but still
something that some people use for similar things.
>
>> Linux + Perf support (planned/exists)?
>>
>
> There are no plans to support the perf counter-lookups yet. Although I
certainly think that's a nice source of data to be logging XRay-style.
>
> FWIW, the API for XRay allows us to decouple the things being logged at
function entry/exit. Getting performance counters at those points is a nice
idea, it should be doable.
>
>> How much is this tied to something specific about Linux or it could be
>> easily ported to another platform?
>
> Currently, the only Linux-specific part I can remember is getting the cpu
frequency (looking at sysfs files). That can be implemented on a
platform-agnostic (or at least pluggable and portable) manner.
>
> There are x86'isms and I'm working on understanding how to do this
in Aarch64 or ARM.
Ack - actually x86 probably makes some of this a lot easier. I'm
recently (frequently) annoyed (as hell) with how AArch64 isn't
exposing a bunch of basic things that I *want* (demand!) to know.

For example:
clock frequency on AArch64 + Linux == forget it. I had to use a
benchmark in order to basically brute force calculate some processors.
(They don't hard code it in /proc/cpuinfo or sys as you'd want) and
I'm really uncertain about what happens if there's stepping involved
(current AArch64 processors that I'm aware of don't have this feature
though) /* Maybe Google can help kick the linux devs into accepting
these patches */

For FBSD and iOS - I don't know how/if they expose this information..
(Is FBSD ported to AArch64 yet.. ?)
>
>>
>> What's the benefit of this vs a stable and production ready tool
like Dtrace?
>>
>
> I think I've pointed out the differences in a separate mail (some mail
filters may have squashed that response, so apologies if that was missed):
http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short
version is:
>
> - Dtrace requires kernel-side support.
> - XRay is completely in-process and controllable by the process through an
API (not sure if dtrace is the same).
> - XRay is selective and configurable by the application developer.
> - XRay's cost is borne by the application only, and does not require
stopping the application.
Just as you're instrumenting around functions - DTrace can similarly
inject "probes" (basically the same thing) - The other more common way
for DTrace to be used is for the application to not be changed and
it's just profiled. (Ok you must leave SP otherwise it won't work.. so
for the purist I guess you're relying on applications not to be
/fully/ optimized.. I forget if DWARF or CFW is required, but I don't
think so )

Dtrace also doesn't require stopping the application fwiw and you can
control probably a lot more of what's probed/instrumented. (There's a
full scripting langauge in order to control what you instrument
actually)

I'm not trying to take away from X-Ray, I think profiling is extremely
important, but I'm just wondering how much (if any) evaluation of
existing solutions was done. Maybe the DTrace licensing, CTF dep or
linux support was a dealbreaker, I just hate to see NIH when there's
good tools available that cover a significant amount of the needs or
more.

At the end of the day it's probably quite complementary, like all the
work you do for X-Ray, someone could likely leverage to automatically
inject DTrace probes and get a lot of the same stuff.

In your response you mentioned "O(100) cycles" - is that 100
instructions of "skid" between point of measurement? (Seems really
high for instrumenting, but maybe I'm mistaken..)

Lastly and again just side comments - in terms of data formats - JSON
has pretty good support, streams and compresses nicely, high
performance parses exist with liberal licensing as well as I think
there's a binary version of it. This could be handy *if* your app is
on Node B and you'd like the logs to be sent to Node A.

Anywho - cool work..

Dean Michael Berris via llvm-dev

2016-Jul-20 11:39 UTC

head link

[llvm-dev] [XRay] Build instrumented Clang, some analysis results

> On 20 Jul 2016, at 20:58, C Bergström <cbergstrom at pathscale.com>
wrote:
> 
> On Wed, Jul 20, 2016 at 6:26 PM, Dean Michael Berris
> <dean.berris at gmail.com> wrote:
>> 
>>> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at
pathscale.com> wrote:
>>> 
>>> How much is this tied to something specific about Linux or it could
be
>>> easily ported to another platform?
>> 
>> Currently, the only Linux-specific part I can remember is getting the
cpu frequency (looking at sysfs files). That can be implemented on a
platform-agnostic (or at least pluggable and portable) manner.
>> 
>> There are x86'isms and I'm working on understanding how to do
this in Aarch64 or ARM.
> 
> Ack - actually x86 probably makes some of this a lot easier. I'm
> recently (frequently) annoyed (as hell) with how AArch64 isn't
> exposing a bunch of basic things that I *want* (demand!) to know.
> 
> For example:
> clock frequency on AArch64 + Linux == forget it. I had to use a
> benchmark in order to basically brute force calculate some processors.
> (They don't hard code it in /proc/cpuinfo or sys as you'd want) and
> I'm really uncertain about what happens if there's stepping
involved
> (current AArch64 processors that I'm aware of don't have this
feature
> though) /* Maybe Google can help kick the linux devs into accepting
> these patches */
> 
I actually haven't gotten that far yet, to be honest -- I was still just
trying to learn how to do the runtime patching and the instrumentation sleds
faster. :)

But it is good to know what other kinds of things I might run into when we cross
that bridge. :D
> For FBSD and iOS - I don't know how/if they expose this information..
> (Is FBSD ported to AArch64 yet.. ?)
I have no idea about FreeBSD. :/
> 
>> 
>>> 
>>> What's the benefit of this vs a stable and production ready
tool like Dtrace?
>>> 
>> 
>> I think I've pointed out the differences in a separate mail (some
mail filters may have squashed that response, so apologies if that was missed):
http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short
version is:
>> 
>> - Dtrace requires kernel-side support.
>> - XRay is completely in-process and controllable by the process through
an API (not sure if dtrace is the same).
>> - XRay is selective and configurable by the application developer.
>> - XRay's cost is borne by the application only, and does not
require stopping the application.
> 
> Just as you're instrumenting around functions - DTrace can similarly
> inject "probes" (basically the same thing) - The other more
common way
> for DTrace to be used is for the application to not be changed and
> it's just profiled. (Ok you must leave SP otherwise it won't work..
so
> for the purist I guess you're relying on applications not to be
> /fully/ optimized.. I forget if DWARF or CFW is required, but I don't
> think so )
> 
XRay doesn't rely on DWARF, and has a separate section for the
instrumentation maps. That section can also be removed from the final binary and
loaded externally (that feature isn't implemented yet, but I'm working
on making that happen). XRay also works even if the frame pointer is omitted
which is a nice property. :)
> Dtrace also doesn't require stopping the application fwiw and you can
> control probably a lot more of what's probed/instrumented. (There's
a
> full scripting langauge in order to control what you instrument
> actually)
> 
I'm aware that Dtrace can do what XRay does and more. I'm not so sure
about the technical details of some of how it does its thing -- for example,
XRay isn't sampling anything and instead is made to be logging stuff for
offline reconstruction/analysis.
> I'm not trying to take away from X-Ray, I think profiling is extremely
> important, but I'm just wondering how much (if any) evaluation of
> existing solutions was done.
That's fair -- XRay was developed at Google a long time ago, when Dtrace
wasn't available. Our internal implementation has a lot of...
internal'isms which integrates well into our... internal stuff. :)

The landscape has changed though considerably since XRay was developed and when
we decided to open-source an implementation of it. For example, clang wasn't
even on the radar when some of the work on XRay started happening. Certainly
there's lots of ways of doing this now, but the target at least for XRay is
so that we can:

- Widen the set of platforms where we can use it. Linux+x86 is the "feature
parity" point for us at least. And we're certainly interested in a lot
more platforms now.

- Have better hooks into how much more efficiently we can make it. LLVM IR and
the optimisation pass and analysis infrastructure gives us much more leeway into
being smarter about certain instrumentation decisions.

- Make it more useful than just what our use-case has been. For example, we do
performance analysis on long-running servers and want to be able to do
instrumentation for only a certain period of time (not during the lifetime of
the application). The logging implementation we have internally (that we're
bringing out in the open) has a lot of cleverness to get the tradeoff between
cost and coverage "just right" for our use-cases. There are other
cases where this makes sense too and we recognise that being able to get a full
execution trace (not sampled traces, not sampled profiles) for easier
performance debugging of things like compilers, command-line tools, and other
classes of applications does make sense too.

> Maybe the DTrace licensing, CTF dep or
> linux support was a dealbreaker, I just hate to see NIH when there's
> good tools available that cover a significant amount of the needs or
> more.
> 
There are a couple of other things -- like the cost of when instrumentation is
on. Dtrace currently requires a mode switch when the probe is encountered which
is a non-trivial cost for the kinds of applications we've been debugging.
Certainly that and the potential of affecting other applications/systems while
Dtrace is enabled (and frankly the kinds of things you *can* do with it) becomes
very hard to use at least for some of our systems where XRay has been used in
terms of debugging.

/me is also not a fan of NIH. :)
> At the end of the day it's probably quite complementary, like all the
> work you do for X-Ray, someone could likely leverage to automatically
> inject DTrace probes and get a lot of the same stuff.
Agreed. Also consider the non-Linux systems, and those that have stricter
requirements on resource consumption, etc. :)
> 
> In your response you mentioned "O(100) cycles" - is that 100
> instructions of "skid" between point of measurement? (Seems
really
> high for instrumenting, but maybe I'm mistaken..)
> 
That's CPU cycles, and mostly the following:

For entry points:
- calling into a trampoline (relative jump)
- saving register states
- checking a global value if it's null (the logging intercept function
pointer)
- loading register states
- returning

For exit points:
- jumping into a trampoline (relative jump)
- saving a couple of registers
- checking a global value if it's null (the logging intercept function
pointer)
- loading register states
- returning

Now the logging intercept function should be tuned to do as little as possible.
The implementation in the patches I mentioned to compiler-rt uses thread_local
buffers and attempts to do "as little as possible" to write out the
fixed sized log entries. Basically the cost of getting TSC and some stores.
> Lastly and again just side comments - in terms of data formats - JSON
> has pretty good support, streams and compresses nicely, high
> performance parses exist with liberal licensing as well as I think
> there's a binary version of it. This could be handy *if* your app is
> on Node B and you'd like the logs to be sent to Node A.
> 
Yeah, JSON is definitely one potential format. I've been meaning to use the
Chrome profile viewer too. The problem has been the amount of data we're
talking about here -- with 32-byte fixed-sized records currently per entry/exit,
we're already at 606MB for a fully instrumented clang compiling a simple
hello-world program (compressed is 81MB). I haven't tried writing this out
in JSON, but I suspect that's multiples of the fixed-sized records. :D

We can probably optimise that further with the stack de-duping support in the
JSON format, but that's still a lot of segments/events, even if it could be
converted to JSON. :)
> Anywho - cool work..
Thanks! :)

llvm dev - Jul 2016 - [XRay] Build instrumented Clang, some analysis results

[llvm-dev] [XRay] Build instrumented Clang, some analysis results

[llvm-dev] [XRay] Build instrumented Clang, some analysis results

[llvm-dev] [XRay] Build instrumented Clang, some analysis results