Dean Michael Berris via llvm-dev
2016-Nov-30 05:08 UTC
[llvm-dev] RFC: XRay in the LLVM Library
Hi llvm-dev,

Recently, we've committed the beginnings of the llvm-xray [0] tool, which allows for conveniently working with both XRay-instrumented binaries and XRay trace/log files. In the course of the review for the conversion tool [1], which turns a binary/raw XRay log file into YAML for human consumption, a question arose as to how we intend to let users develop tools that deal with XRay traces (and the instrumentation maps in binaries).

As a bit of background, I've been working on the "flight data recorder" mode [2] for the XRay runtime library -- this mode lets the XRay-instrumented binary continuously write trace entries into an in-memory log, kept as a circular buffer of buffers [3]. FDR mode writes more concise records and has a different log format than the current "naive" logging implementation in compiler-rt (which writes to disk as soon as thread-local buffers are full).

# Problem Statement

XRay has two key pieces of information that need to be encoded in a consistent manner: the instrumentation map embedded in binaries and the XRay log files. We run into issues when the encoding of this information changes over time, whether by adding or removing information. This situation is very similar to how LLVM handles backwards compatibility with the bitcode format and its versioning. The problem is how to ensure that, as we change the data emitted by the runtime library, the tools handling this data remain able to read it. Several factors play into this, each of which may be solved in different ways (but none of which is the crux of this RFC):

- The split between the LLVM "core" library/tools and compiler-rt. We implement the writer in compiler-rt but implement the tools that read the traces in LLVM. Any change in LLVM that encodes new information into the instrumentation map therefore has to be coordinated so that compiler-rt can take advantage of it.
- The potential for user-defined additional information embedded in the XRay traces. We have ongoing projects, such as argument logging and custom data logging, that will add information to the log without necessarily changing the "format" of the data.

# Potential Resolutions

Given the current state of XRay's development, we're looking at a few ways of handling backwards/forwards compatibility of the instrumentation map, the XRay log files, and the tools that will read/manipulate them. We're seeking feedback on the following options and on any alternatives we may not have considered.

## Option A: Expose a library that supports all known formats.

We can move the currently tool-specific code behind `llvm-xray extract` [0], which ingests a binary with XRay instrumentation, into (strawman proposal) lib/XRay (i.e. headers in include/llvm/XRay/... and implementation in lib/XRay/...), so that the tools become thin wrappers around the functionality in this library. We can do the same with the `llvm-xray convert` core logic, to allow loading all known/supported formats of the log file.

This option gives us a set of canonical implementations that can handle every supported file format. It might, however, introduce some complexity: parsing many known/supported formats (YAML, compiler-emitted instrumentation maps for x86_64/armv7/aarch64/<insert platforms where XRay is yet to be ported>) in a library that not all tool writers actually need. This option closely follows what the LLVM project does for backwards compatibility when parsing LLVM IR, applied to XRay instrumentation maps and traces.

## Option B: Expose a library that only supports one canonical format.

We can keep tool-specific code alongside the tools, but define one canonical format for the instrumentation map and traces -- as both a specification document and a library implementation.
This canonical format could be what we already have today, which keeps the log-reading and instrumentation-map handling library simple and lets it evolve only when we extend/change the canonical format. For FDR mode traces, this means the conversion tool would know about the FDR trace format/encoding and transform it into the canonical format. The transformation logic is then localised to the conversion tool, while any other tool built on top of the reader library does not need to change. This also gives users who define their own log formats the option of using the XRay library interfaces to install their own handlers, implementing the transformation from their format to the XRay-canonical format in the tool, without being tied to maintaining a released library version. The canonical format can then evolve more slowly and more conservatively than the XRay runtime implementations shipped through compiler-rt.

# Open Questions

Some burning questions we'd like to get some thoughts on:

- Is there a preference between the two options provided above?
- Are there other alternatives we should consider?
- Which parts of which options do you prefer, and is there a synthesis of the two that appeals to you?

Thanks in advance!

[0] - `llvm-xray extract`, defined in https://reviews.llvm.org/D21987
[1] - `llvm-xray convert`, being reviewed in https://reviews.llvm.org/D24376
[2] - FDR mode implementation (work in progress) at https://reviews.llvm.org/D27038
[3] - Buffer Queue implementation (work in progress) at https://reviews.llvm.org/D26232

-- Dean
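Option B's idea of user-installed transformation handlers could be sketched roughly as follows. This is only an illustration of the shape such an interface might take -- `XRayRecord`, `Loader`, and `LoaderRegistry` are hypothetical names, not part of any proposed or committed API:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical canonical record: one entry per function event.
struct XRayRecord {
  uint64_t FuncId;
  uint64_t TSC;  // timestamp counter at entry/exit
  bool IsEntry;  // true = function entry, false = exit
};

// A loader turns raw bytes in some on-disk format into canonical records.
using Loader =
    std::function<std::vector<XRayRecord>(const std::vector<uint8_t> &)>;

// Registry keyed by a format tag; tools register converters for the
// formats they understand, while the reader library itself only ever
// sees the canonical form.
class LoaderRegistry {
  std::map<std::string, Loader> Loaders;

public:
  void registerLoader(const std::string &Format, Loader L) {
    Loaders[Format] = std::move(L);
  }
  std::vector<XRayRecord> load(const std::string &Format,
                               const std::vector<uint8_t> &Raw) const {
    auto It = Loaders.find(Format);
    if (It == Loaders.end())
      return {}; // unknown format: no loader installed
    return It->second(Raw);
  }
};
```

Under this sketch, only the conversion tool registers the FDR-mode loader; every other tool links against the registry and the canonical `XRayRecord` type and never changes when a new on-disk encoding appears.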
On 30 November 2016 at 05:08, Dean Michael Berris via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> - Is there a preference between the two options provided above?
> - Any other alternatives we should consider?
> - Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?

Hi Dean,

I haven't followed the XRay project that closely, but I have been around while file formats were being formed, and either of your two approaches (which are pretty standard) will fail in different ways. But that's ok, because the "fixes" work; they're just not great.

Take LLVM IR: there were lots of changes, but we always aimed to have one canonical representation. Not just in the syntax of each instruction/construct, but in how complex behaviour is represented in the same series of instructions, so that all back-ends can identify and work with it. Of course, the second (semantic) level is less stringent than the first (syntactic), but we try to make it as strict as possible.

This hasn't come for free. The two main costs were destructive semantics -- for example, when we lower C++ classes into arrays and turn all the accesses into jumbled reads and writes, because IR readers don't need to understand the ABI of all targets -- and backwards incompatibility -- for example, when we completely changed how exception handling is lowered (from special basic blocks to special constructs at the heads/tails of common basic blocks). That price was cheaper than the alternative, but it's still not free.

Another approach I followed was SwissProt [1], a manually curated machine-readable text file with protein information for cross-referencing. Cutting to the chase: they introduced "line types" with strict formatting for the most common information, plus one line type called "comment" where free text was allowed for additional information. With time, adding a new line type became impossible, so all new fields ended up being added in the comment lines with pseudo-strict formatting, which was (and probably still is) a nightmare for parsers and humans alike.

Between the two, the LLVM IR policy for changes is orders of magnitude better. I suggest you follow that. I also suggest you don't keep multiple canonical representations; instead, create tools to convert from any other format to the canonical one.

Finally, I'd separate the design into two phases:

1. Experimental, where the canonical form changes constantly in light of new input and there are no backwards/forwards compatibility guarantees at all. This is where all of you get creative and try to sort out the problems in the best way possible.
2. Stable, when most of the problems have been solved and you document a final, stable version of the representation. Every new input will have to be represented as a combination of existing ones, so make them generic enough. If real change is needed, make sure you have a process that identifies versions and compatibility (for example, a version tag on every dump) and that the canonical tool knows about all of the issues.

This last point is important if you want to continue reading old files that don't have a compatibility issue, warn when they have one that's irrelevant, or error out when they have one that would produce garbage. It also lets you write more efficient conversion tools.

From what I understood of XRay, you could in theory keep the data for years on a tape somewhere in the attic and want to read it later to compare with a current run. So being compatible is important, but having a canonical form that can be converted to and from other forms is more important, or the comparison tools will get really messy really quickly.

Hope that helps,

cheers,
--renato

[1] http://web.expasy.org/docs/swiss-prot_guideline.html
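Renato's suggestion of a version tag on every dump, with the canonical tool deciding what it can still read, might look something like this sketch. The header layout, field names, and version values here are invented purely for illustration; the real XRay log header differs:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-size header at the front of every dump.
struct TraceHeader {
  uint16_t Version;
  uint16_t Type; // e.g. 0 = naive mode, 1 = FDR mode
};

enum class Compat { Ok, Warn, Error };

// Decide, from the version tag alone, whether a file can be read
// cleanly, read with a warning, or must be rejected -- the three
// outcomes Renato describes for old files.
Compat classify(const TraceHeader &H, uint16_t CurrentVersion) {
  if (H.Version == CurrentVersion)
    return Compat::Ok;
  if (H.Version < CurrentVersion)
    return Compat::Warn; // older file with known quirks, still parseable
  return Compat::Error;  // newer than this reader understands
}
```

The point of the sketch is only that the dispatch is cheap once the tag exists: a reader can branch on a couple of bytes before committing to any particular decoding.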
Hi Dean,

I haven't looked very closely at XRay so far, but I'm wondering if making CTF (the Common Trace Format, see http://diamon.org/ctf/) the default format for XRay traces would be useful? It seems it'd be nice to be able to reuse some of the tools that already exist for CTF, such as a graphical viewer (http://tracecompass.org/) or a converter library (http://man7.org/linux/man-pages/man1/babeltrace.1.html). LTTng already uses this format, and linux perf can create traces in CTF format too. It would probably be useful for at least some users to be able to combine traces from XRay with traces from LTTng or linux perf.

Maybe the current version of CTF doesn't have all the features that you need, but the next version (CTF 2) seems to address at least some of the concerns you touch on in your RFC: http://diamon.org/ctf/files/CTF2-PROP-1.0.html#design-goals.

Any thoughts on whether CTF could be a good choice as the format to store XRay logs in?

Thanks,

Kristof
Dean Michael Berris via llvm-dev
2016-Dec-01 00:17 UTC
[llvm-dev] RFC: XRay in the LLVM Library
> On 30 Nov. 2016, at 22:26, Renato Golin <renato.golin at linaro.org> wrote:
>
> I also suggest you don't keep multiple canonical representations, and
> create tools to convert from any other to the canonical format.

Thanks Renato! Just so I understand this one sentence (to disambiguate), you meant:

1) Don't have multiple canonical forms; just have one.
2) Create tools that will convert to/from that one canonical format.

I think this follows closely the Option B mental model that I had, with the only difference being that the canonical reader is a library made part of LLVM "when it's ready", as you suggest later. Would that be accurate?

> Finally, I'd separate the design in two phases:
>
> 1. Experimental, where the canonical form changes constantly in light
> of new input and there are no backwards/forwards compatibility
> guarantees at all. This is where all of you get creative and try to
> sort out the problems in the best way possible.
> 2. Stable, when most of the problems were solved, and you now document
> a final stable version of the representation. Every new input will
> have to be represented as a combination of existing ones, so make them
> generic enough. In need of real change, make sure you have a process
> that identifies versions and compatibility (for example, having a
> version tag on every dump), and letting the canonical tool know all of
> the issues.
>
> This last point is important if you want to continue reading old files
> that don't have the compatibility issue, warn when they do but it's
> irrelevant, or error when they do and it'll produce garbage. You can
> also write more efficient converting tools.

I like this suggestion -- thanks! So in essence we can treat the current implementation as experimental, and make that abundantly clear in any point release where XRay functionality is included. Is there a place where this ought to be documented clearly (aside from the documentation at http://llvm.org/docs/XRay.html)? XRay trace file headers already contain a version identifier, intended to identify precisely how a reader should interpret the data in there.

> From what I understood of this XRay, you could in theory keep the data
> for years in a tape somewhere in the attic, and want to read it later
> to compare to a current run, so being compatible is important, but
> having a canonical form that can be converted to and from other forms
> is more important, or the comparison tools will get really messy
> really quickly.

Yep, this is definitely one of the goals, which is why we're being very careful about what we write down in the traces, optimising for efficient writing and smaller traces at the cost of potential complexity in the analysis tooling.

> Hope that helps,

Definitely does, thanks again!

Cheers

-- Dean
Dean Michael Berris via llvm-dev
2016-Dec-01 00:32 UTC
[llvm-dev] RFC: XRay in the LLVM Library
On 1 Dec. 2016, at 00:26, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>
> Hi Dean,
>
> I haven't looked very closely at XRay so far, but I'm wondering if making CTF (common trace format, e.g. see http://diamon.org/ctf/) the default format for XRay traces would be useful?

Nice! Thanks for mentioning this -- I hadn't looked at it before. There are a couple of issues I can think of, off the top of my head, as to why using CTF as the default format for XRay may be slightly problematic. More on this below.

> It seems it'd be nice to be able to reuse some of the tools that already exist for CTF, such as a graphical viewer (http://tracecompass.org/) or a converter library (http://man7.org/linux/man-pages/man1/babeltrace.1.html).
> LTTng already uses this format and linux perf can create traces in CTF format too. Probably it would be useful for at least some to be able to combine traces from XRay with traces from LTTng or linux perf?

This sounds like a great idea! I'm working on a conversion tool that aims to target multiple output formats. It's being developed at https://reviews.llvm.org/D24376, where the intent is to start with something simple and then grow support for multiple other formats. CTF sounds like a perfectly reasonable target format.

Writing CTF directly, though, might be slightly problematic for XRay, purely because of the complexity it would bring into the runtime library. While conceptually the formats are very similar (XRay uses binary logging and efficient in-memory structures to save on both the space and the time required to write records down), we'd like the XRay library to make further optimisations and evolve in a certain direction without being tied down to one particular format. I'll need to think about this a little more, but I definitely think converting from whatever XRay format we come up with to CTF sounds like a great feature for the conversion tool.

> Maybe the current version of CTF may not have all the features that you need, but the next version of CTF (CTF 2) seems to be at least addressing some of the concerns you touch on below: http://diamon.org/ctf/files/CTF2-PROP-1.0.html#design-goals.
>
> Any thoughts on whether CTF could be a good choice as the format to store XRay logs in?

I may need to think about it more, but I don't see a reason we shouldn't be able to convert XRay traces to CTF. :)

Cheers

-- Dean
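The conversion tool Dean describes -- one canonical input, multiple output targets such as YAML or CTF -- could be structured around a sink interface along these lines. All names are hypothetical, and the output syntax is merely YAML-ish, not the actual `llvm-xray convert` output:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical canonical record (see the RFC's "canonical format" idea).
struct Record {
  uint64_t FuncId;
  uint64_t TSC;
  bool IsEntry;
};

// One sink per output format; the convert tool would pick one by flag.
// A CTF writer would be another subclass with the same interface.
struct TraceSink {
  virtual ~TraceSink() = default;
  virtual void write(const Record &R) = 0;
  virtual std::string finish() = 0;
};

// An illustrative YAML-ish text sink.
class YAMLSink : public TraceSink {
  std::ostringstream OS;

public:
  void write(const Record &R) override {
    OS << "- { func: " << R.FuncId << ", tsc: " << R.TSC
       << ", kind: " << (R.IsEntry ? "enter" : "exit") << " }\n";
  }
  std::string finish() override { return OS.str(); }
};
```

The design choice this illustrates is the one from the thread: the runtime keeps writing its own compact format, and format diversity lives entirely on the conversion side, one sink per target.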
On Wed, Nov 30, 2016 at 3:26 AM Renato Golin <renato.golin at linaro.org> wrote:
> From what I understood of this XRay, you could in theory keep the data
> for years in a tape somewhere in the attic, and want to read it later
> to compare to a current run, so being compatible is important, but
> having a canonical form that can be converted to and from other forms
> is more important, or the comparison tools will get really messy
> really quickly.

Not sure I quite follow here -- perhaps some misunderstanding. My mental model is that the formats are semantically equivalent, with a common in-memory representation (like the LLVM IR APIs). It doesn't/shouldn't complicate a comparison tool to support both LLVM IR and bitcode input (or other hypothetical, semantically equivalent formats that could be integrated into a common reading API). At least that's my mental model. Is there something different here?

What I'm picturing is that we need an API for reading all these formats, and either we use that API only in the conversion tool -- so users have to run the conversion tool before running the tool they want -- or we sink that API into a common place and have all tools use it to load inputs. The latter makes the user experience simpler (no extra conversion step/tool), and it doesn't seem like it should make the development experience more complicated/messy/difficult.

- Dave

> Hope that helps,
>
> cheers,
> --renato