Dean Michael Berris via llvm-dev
2016-Jul-20 10:26 UTC
[llvm-dev] [XRay] Build instrumented Clang, some analysis results
> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at pathscale.com> wrote:
>
> Some general questions about X-Ray
> -------------
> Is there a plan to make a separate mailing list or project around
> this? Do you have a list of planned features?

Interesting question -- so far we haven't decided whether XRay will live as another project. I'm certainly open to this possibility. No concrete plans yet. It's an open question in the original RFC too (http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html).

There's a white paper that details what we plan to implement out in the open (http://research.google.com/pubs/pub45287.html). We're still working our way to getting to a full version as described in that white paper (basically blocked by my lack of familiarity with the LLVM codebase, and other n00b-y things :D).

There's not a concrete list of features, and we're certainly open to contributions from the community to add features that make sense. :)

> Graphics tools for analysis? AMD open sourced their CodeAnalyst - What
> about some integration with that?

Thanks for the pointer! Yes, I'd love to have more integration with existing visualisation tools that read a particular well-documented format. Others have mentioned Jumpshot, which might be a little dated, but is still something that some people use for similar things.

> Linux + Perf support (planned/exists)?

There are no plans to support the perf counter-lookups yet, although I certainly think that's a nice source of data to be logging XRay-style.

FWIW, the API for XRay allows us to decouple the things being logged at function entry/exit. Getting performance counters at those points is a nice idea and should be doable.

> How much is this tied to something specific about Linux or it could be
> easily ported to another platform?

Currently, the only Linux-specific part I can remember is getting the CPU frequency (looking at sysfs files). That can be implemented in a platform-agnostic (or at least pluggable and portable) manner.

There are x86'isms and I'm working on understanding how to do this in AArch64 or ARM.

> What's the benefit of this vs a stable and production ready tool like DTrace?

I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is:

- DTrace requires kernel-side support.
- XRay is completely in-process and controllable by the process through an API (not sure if DTrace is the same).
- XRay is selective and configurable by the application developer.
- XRay's cost is borne by the application only, and does not require stopping the application.

> How much overhead do you typically measure?

We've seen in the "null logging case" something around O(100) cycles on x86 for the trampoline side of XRay. Of course richer logging requires more cycles, and is completely implementation-dependent. The current one under development only writes fixed-sized records, uses __rdtscp(), does aligned writes only, and flushes when the buffer is full. The buffer is 32k per thread.

I haven't formally done benchmarks on the current implementation yet, but I'd be happy to do that soon.

> If you're injecting calls before/after every function - does it end up
> blocking optimizations? Without looking at the implementation, if
> you're injecting the calls late enough in the compilation process it
> won't be a "problem", but if it's too early - you're going to end up
> blocking a lot of optimizations and interfering with things a lot..

It's currently implemented as a MachineFunctionPass, and as far as I can tell is already late enough in the process that we are not interfering with optimisations.

Cheers
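
A rough sketch of the logging scheme described in the message above (fixed-size records, one buffer per thread, flush when the buffer is full), written in Python for brevity. Everything concrete here is an assumption for illustration and not the compiler-rt implementation: the record layout and field names are invented, the 32-byte record size is borrowed from a figure given later in this thread, "32k" is read as 32 * 1024 bytes, and `time.perf_counter_ns()` merely stands in for `__rdtscp()`.

```python
import struct
import time

# Hypothetical 32-byte record: timestamp (8), function id (4), thread id (4),
# record kind (1), cpu (1), then 14 padding bytes. "<" disables implicit padding.
RECORD = struct.Struct("<QIIBB14x")
BUFFER_SIZE = 32 * 1024  # assumed meaning of "32k per thread"
ENTRY, EXIT = 0, 1


class ThreadBuffer:
    """One of these would exist per thread; `sink` receives each full buffer."""

    def __init__(self, sink):
        self.buf = bytearray(BUFFER_SIZE)
        self.offset = 0
        self.sink = sink

    def log(self, func_id, kind, thread_id=0, cpu=0):
        # Stand-in for reading the TSC; the real code would use __rdtscp().
        stamp = time.perf_counter_ns()
        RECORD.pack_into(self.buf, self.offset, stamp, func_id, thread_id, kind, cpu)
        self.offset += RECORD.size
        # Flush as soon as another record would no longer fit.
        if self.offset + RECORD.size > BUFFER_SIZE:
            self.flush()

    def flush(self):
        if self.offset:
            self.sink(bytes(self.buf[: self.offset]))
            self.offset = 0


if __name__ == "__main__":
    chunks = []
    tb = ThreadBuffer(chunks.append)
    for i in range(1500):
        tb.log(func_id=i, kind=ENTRY)
    tb.flush()  # write out the partially filled buffer at exit
    print([len(c) for c in chunks])
```

Because every record has the same size and the buffer size is a multiple of it, writes never straddle the end of the buffer, which is one plausible reading of "aligned writes only".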
C Bergström via llvm-dev
2016-Jul-20 10:58 UTC
[llvm-dev] [XRay] Build instrumented Clang, some analysis results
On Wed, Jul 20, 2016 at 6:26 PM, Dean Michael Berris <dean.berris at gmail.com> wrote:
>
>> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at pathscale.com> wrote:
>>
>> Some general questions about X-Ray
>> -------------
>> Is there a plan to make a separate mailing list or project around
>> this? Do you have a list of planned features?
>
> Interesting question -- so far we haven't decided whether XRay will live as another project. I'm certainly open to this possibility. No concrete plans yet. It's an open question in the original RFC too (http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html).
>
> There's a white paper that details what we plan to implement out in the open (http://research.google.com/pubs/pub45287.html). We're still working our way to getting to a full version as described in that white paper (basically blocked by my lack of familiarity with the LLVM codebase, and other n00b-y things :D).
>
> There's not a concrete list of features, and we're certainly open to contributions from the community to add features that make sense. :)
>
>> Graphics tools for analysis? AMD open sourced their CodeAnalyst - What
>> about some integration with that?
>
> Thanks for the pointer! Yes, I'd love to have more integration with existing visualisation tools that read a particular well-documented format. Others have mentioned Jumpshot, which might be a little dated, but is still something that some people use for similar things.
>
>> Linux + Perf support (planned/exists)?
>
> There are no plans to support the perf counter-lookups yet, although I certainly think that's a nice source of data to be logging XRay-style.
>
> FWIW, the API for XRay allows us to decouple the things being logged at function entry/exit. Getting performance counters at those points is a nice idea and should be doable.
>
>> How much is this tied to something specific about Linux or it could be
>> easily ported to another platform?
>
> Currently, the only Linux-specific part I can remember is getting the CPU frequency (looking at sysfs files). That can be implemented in a platform-agnostic (or at least pluggable and portable) manner.
>
> There are x86'isms and I'm working on understanding how to do this in AArch64 or ARM.

Ack - actually x86 probably makes some of this a lot easier. I'm recently (frequently) annoyed (as hell) with how AArch64 isn't exposing a bunch of basic things that I *want* (demand!) to know.

For example:
clock frequency on AArch64 + Linux == forget it. I had to use a benchmark in order to basically brute-force calculate it for some processors. (They don't hard code it in /proc/cpuinfo or sys as you'd want) and I'm really uncertain about what happens if there's stepping involved (current AArch64 processors that I'm aware of don't have this feature though) /* Maybe Google can help kick the Linux devs into accepting these patches */

For FBSD and iOS - I don't know how/if they expose this information.. (Is FBSD ported to AArch64 yet.. ?)

>> What's the benefit of this vs a stable and production ready tool like DTrace?
>
> I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is:
>
> - DTrace requires kernel-side support.
> - XRay is completely in-process and controllable by the process through an API (not sure if DTrace is the same).
> - XRay is selective and configurable by the application developer.
> - XRay's cost is borne by the application only, and does not require stopping the application.

Just as you're instrumenting around functions - DTrace can similarly inject "probes" (basically the same thing) - The other more common way for DTrace to be used is for the application to not be changed and it's just profiled. (Ok you must leave SP otherwise it won't work.. so for the purist I guess you're relying on applications not to be /fully/ optimized.. I forget if DWARF or CFW is required, but I don't think so )

DTrace also doesn't require stopping the application fwiw and you can control probably a lot more of what's probed/instrumented. (There's a full scripting language in order to control what you instrument actually)

I'm not trying to take away from X-Ray, I think profiling is extremely important, but I'm just wondering how much (if any) evaluation of existing solutions was done. Maybe the DTrace licensing, CTF dep or Linux support was a dealbreaker, I just hate to see NIH when there's good tools available that cover a significant amount of the needs or more.

At the end of the day it's probably quite complementary: all the work you do for X-Ray is something someone could likely leverage to automatically inject DTrace probes and get a lot of the same stuff.

In your response you mentioned "O(100) cycles" - is that 100 instructions of "skid" between point of measurement? (Seems really high for instrumenting, but maybe I'm mistaken..)

Lastly and again just side comments - in terms of data formats - JSON has pretty good support, streams and compresses nicely, high-performance parsers exist with liberal licensing, and I think there's a binary version of it as well. This could be handy *if* your app is on Node B and you'd like the logs to be sent to Node A.

Anywho - cool work..
Dean Michael Berris via llvm-dev
2016-Jul-20 11:39 UTC
[llvm-dev] [XRay] Build instrumented Clang, some analysis results
> On 20 Jul 2016, at 20:58, C Bergström <cbergstrom at pathscale.com> wrote:
>
> On Wed, Jul 20, 2016 at 6:26 PM, Dean Michael Berris
> <dean.berris at gmail.com> wrote:
>>
>>> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at pathscale.com> wrote:
>>>
>>> How much is this tied to something specific about Linux or it could be
>>> easily ported to another platform?
>>
>> Currently, the only Linux-specific part I can remember is getting the CPU frequency (looking at sysfs files). That can be implemented in a platform-agnostic (or at least pluggable and portable) manner.
>>
>> There are x86'isms and I'm working on understanding how to do this in AArch64 or ARM.
>
> Ack - actually x86 probably makes some of this a lot easier. I'm
> recently (frequently) annoyed (as hell) with how AArch64 isn't
> exposing a bunch of basic things that I *want* (demand!) to know.
>
> For example:
> clock frequency on AArch64 + Linux == forget it. I had to use a
> benchmark in order to basically brute-force calculate it for some processors.
> (They don't hard code it in /proc/cpuinfo or sys as you'd want) and
> I'm really uncertain about what happens if there's stepping involved
> (current AArch64 processors that I'm aware of don't have this feature
> though) /* Maybe Google can help kick the Linux devs into accepting
> these patches */

I actually haven't gotten that far yet, to be honest -- I was still just trying to learn how to do the runtime patching and the instrumentation sleds faster. :) But it is good to know what other kinds of things I might run into when we cross that bridge. :D

> For FBSD and iOS - I don't know how/if they expose this information..
> (Is FBSD ported to AArch64 yet.. ?)

I have no idea about FreeBSD. :/

>>> What's the benefit of this vs a stable and production ready tool like DTrace?
>>
>> I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is:
>>
>> - DTrace requires kernel-side support.
>> - XRay is completely in-process and controllable by the process through an API (not sure if DTrace is the same).
>> - XRay is selective and configurable by the application developer.
>> - XRay's cost is borne by the application only, and does not require stopping the application.
>
> Just as you're instrumenting around functions - DTrace can similarly
> inject "probes" (basically the same thing) - The other more common way
> for DTrace to be used is for the application to not be changed and
> it's just profiled. (Ok you must leave SP otherwise it won't work.. so
> for the purist I guess you're relying on applications not to be
> /fully/ optimized.. I forget if DWARF or CFW is required, but I don't
> think so )

XRay doesn't rely on DWARF, and has a separate section for the instrumentation maps. That section can also be removed from the final binary and loaded externally (that feature isn't implemented yet, but I'm working on making that happen). XRay also works even if the frame pointer is omitted, which is a nice property. :)

> DTrace also doesn't require stopping the application fwiw and you can
> control probably a lot more of what's probed/instrumented. (There's a
> full scripting language in order to control what you instrument
> actually)

I'm aware that DTrace can do what XRay does and more. I'm not so sure about the technical details of how it does some of its thing -- for example, XRay isn't sampling anything and instead is made to be logging stuff for offline reconstruction/analysis.

> I'm not trying to take away from X-Ray, I think profiling is extremely
> important, but I'm just wondering how much (if any) evaluation of
> existing solutions was done.

That's fair -- XRay was developed at Google a long time ago, when DTrace wasn't available. Our internal implementation has a lot of... internal'isms which integrate well into our... internal stuff. :)

The landscape has changed considerably, though, between when XRay was developed and when we decided to open-source an implementation of it. For example, clang wasn't even on the radar when some of the work on XRay started happening.

Certainly there are lots of ways of doing this now, but the target at least for XRay is so that we can:

- Widen the set of platforms where we can use it. Linux+x86 is the "feature parity" point for us at least. And we're certainly interested in a lot more platforms now.
- Have better hooks into how much more efficiently we can make it. LLVM IR and the optimisation pass and analysis infrastructure give us much more leeway into being smarter about certain instrumentation decisions.
- Make it more useful than just what our use-case has been.

For example, we do performance analysis on long-running servers and want to be able to do instrumentation for only a certain period of time (not for the whole lifetime of the application). The logging implementation we have internally (that we're bringing out in the open) has a lot of cleverness to get the tradeoff between cost and coverage "just right" for our use-cases. There are other cases where this makes sense too, and we recognise that being able to get a full execution trace (not sampled traces, not sampled profiles) makes performance debugging easier for things like compilers, command-line tools, and other classes of applications.

> Maybe the DTrace licensing, CTF dep or
> Linux support was a dealbreaker, I just hate to see NIH when there's
> good tools available that cover a significant amount of the needs or
> more.

There are a couple of other things -- like the cost when instrumentation is on. DTrace currently requires a mode switch when the probe is encountered, which is a non-trivial cost for the kinds of applications we've been debugging. That, plus the potential for affecting other applications/systems while DTrace is enabled (and frankly the kinds of things you *can* do with it), makes it very hard to use on at least some of the systems where we've used XRay for debugging.

/me is also not a fan of NIH. :)

> At the end of the day it's probably quite complementary: all the
> work you do for X-Ray is something someone could likely leverage to automatically
> inject DTrace probes and get a lot of the same stuff.

Agreed. Also consider the non-Linux systems, and those that have stricter requirements on resource consumption, etc. :)

> In your response you mentioned "O(100) cycles" - is that 100
> instructions of "skid" between point of measurement? (Seems really
> high for instrumenting, but maybe I'm mistaken..)

That's CPU cycles, and mostly the following:

For entry points:
- calling into a trampoline (relative jump)
- saving register states
- checking a global value if it's null (the logging intercept function pointer)
- loading register states
- returning

For exit points:
- jumping into a trampoline (relative jump)
- saving a couple of registers
- checking a global value if it's null (the logging intercept function pointer)
- loading register states
- returning

Now the logging intercept function should be tuned to do as little as possible. The implementation in the patches I mentioned to compiler-rt uses thread_local buffers and attempts to do "as little as possible" to write out the fixed-sized log entries. Basically the cost of getting TSC and some stores.

> Lastly and again just side comments - in terms of data formats - JSON
> has pretty good support, streams and compresses nicely, high-performance
> parsers exist with liberal licensing, and I think
> there's a binary version of it as well. This could be handy *if* your app is
> on Node B and you'd like the logs to be sent to Node A.

Yeah, JSON is definitely one potential format. I've been meaning to use the Chrome profile viewer too. The problem has been the amount of data we're talking about here -- with 32-byte fixed-sized records currently per entry/exit, we're already at 606MB for a fully instrumented clang compiling a simple hello-world program (compressed is 81MB). I haven't tried writing this out in JSON, but I suspect it would be multiples of the size of the fixed-sized records. :D

We can probably optimise that further with the stack de-duping support in the JSON format, but that's still a lot of segments/events, even if it could be converted to JSON. :)

> Anywho - cool work..

Thanks! :)
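
A back-of-the-envelope check of the sizes quoted in the message above, as a small Python sketch. Two things here are assumptions and not from the mail: "MB" is taken to mean 10^6 bytes, and the JSON event is a made-up example loosely modelled on Chrome-trace-style begin events, with a hypothetical function name and timestamp. It is not a format XRay is known to emit.

```python
import json

RECORD_BYTES = 32              # fixed-size record per entry/exit (from the mail)
LOG_BYTES = 606 * 10**6        # assumes "606MB" means 606 * 10^6 bytes
COMPRESSED_BYTES = 81 * 10**6  # assumes "81MB" means 81 * 10^6 bytes


def record_count(log_bytes=LOG_BYTES, record_bytes=RECORD_BYTES):
    """Number of entry/exit records implied by the uncompressed log size."""
    return log_bytes // record_bytes


def compression_ratio(raw=LOG_BYTES, compressed=COMPRESSED_BYTES):
    """How many times smaller the compressed log is."""
    return raw / compressed


def json_event_bytes():
    """Size of one hypothetical JSON event in compact encoding."""
    event = {"name": "func_12345", "ph": "B", "ts": 123456789012, "pid": 1, "tid": 1}
    return len(json.dumps(event, separators=(",", ":")).encode("utf-8"))


if __name__ == "__main__":
    print("records:", record_count())
    print("compression ratio: %.2f" % compression_ratio())
    print("bytes per JSON event vs binary record:", json_event_bytes(), "vs", RECORD_BYTES)
```

Under these assumptions the log holds roughly 19 million records, and even a compact JSON event with a short function name is larger than one 32-byte record, which is consistent with the suspicion that a JSON dump would be a multiple of the binary size.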