On Wed, Nov 30, 2016 at 3:26 AM Renato Golin <renato.golin at linaro.org> wrote:

> On 30 November 2016 at 05:08, Dean Michael Berris via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > - Is there a preference between the two options provided above?
> > - Any other alternatives we should consider?
> > - Which parts of which options do you prefer, and is there a synthesis of either of those options that appeals to you?
>
> Hi Dean,
>
> I haven't followed the XRay project that closely, but I have been around file formats being formed, and either of your two approaches (which are pretty standard) will fail in different ways. But that's ok, because the "fixes" work; they're just not great.
>
> If you take LLVM IR, there have been lots of changes, but we always aimed to have one canonical representation. Not just in the syntax of each instruction/construct, but in how complex behaviour is represented in the same series of instructions, so that all back-ends can identify and work with it. Of course, the second (semantic) level is less stringent than the first (syntactic), but we try to make it as strict as possible.
>
> This hasn't come for free. The two main costs were destructive semantics (for example, when we lower C++ classes into arrays and turn all the accesses into jumbled reads and writes, because IR readers shouldn't need to understand the ABI of every target) and backwards incompatibility (for example, when we completely changed how exception handling is lowered, from special basic blocks to special constructs at the heads/tails of common basic blocks). That price was cheaper than the alternative, but it's still not free.
>
> Another approach I followed was SwissProt [1], a manually curated, machine-readable text file with protein information for cross referencing.
> Cutting to the chase: they introduced "line types" with strict formatting for the most common information, and one line type called "comment" where free text was allowed for additional information. With time, adding a new line type became impossible, so all new fields ended up being added in the comment lines with a pseudo-strict formatting, which was (and probably still is) a nightmare for parsers and humans alike.
>
> Between the two, the LLVM IR policy for changes is orders of magnitude better. I suggest you follow that.
>
> I also suggest you don't keep multiple canonical representations, and that you create tools to convert from any other format to the canonical one.
>
> Finally, I'd separate the design into two phases:
>
> 1. Experimental, where the canonical form changes constantly in light of new input and there are no backwards/forwards compatibility guarantees at all. This is where all of you get creative and try to sort out the problems in the best way possible.
> 2. Stable, when most of the problems have been solved and you document a final stable version of the representation. Every new input will have to be represented as a combination of existing ones, so make them generic enough. If real change is needed, make sure you have a process that identifies versions and compatibility (for example, a version tag on every dump), and that the canonical tool knows about all of the issues.
>
> This last point is important if you want to continue reading old files that don't have a compatibility issue, warn when they do but it's irrelevant, or error out when they do and reading would produce garbage. You can also write more efficient conversion tools.
> From what I understood of XRay, you could in theory keep the data for years on a tape somewhere in the attic and want to read it later to compare against a current run, so being compatible is important, but having a canonical form that can be converted to and from other forms is more important, or the comparison tools will get really messy really quickly.

Not sure I quite follow here - perhaps some misunderstanding.

My mental model here is that the formats are semantically equivalent, with a common in-memory representation (like the LLVM IR APIs). It doesn't/shouldn't complicate a comparison tool to support both LLVM IR and bitcode input (or some other hypothetical, semantically equivalent formats that we could integrate into a common reading API). At least that's my mental model. Is there something different here?

What I'm picturing is that we need an API for reading all these formats, and either we use that API only in the conversion tool - so users have to run the conversion tool before running the tool they want - or we sink that API into a common place and have all tools use it to load inputs, making the user experience simpler (no extra conversion step/tool to run). It doesn't seem like the latter should make the development experience more complicated/messy/difficult.

- Dave

> Hope that helps,
>
> cheers,
> --renato
>
> [1] http://web.expasy.org/docs/swiss-prot_guideline.html

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161201/c3a06b9b/attachment.html>
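[Editor's note: the version-tag policy Renato describes above (read old files cleanly, warn when a known difference is irrelevant, error when reading would produce garbage) can be sketched as a small check against a version field in the file header. This is purely illustrative: the header layout, version numbers, and version history below are hypothetical, not XRay's actual format.]

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical header carrying a version tag on every dump, as the
// text suggests. Not the actual XRay file header.
struct FileHeader {
  uint16_t Version;
};

enum class Compat {
  Ok,             // no compatibility issue: read as usual
  WarnIrrelevant, // known difference, but irrelevant to this reader
  Error,          // reading would produce garbage: refuse
};

// Assumed version history for the sketch: v1..v2 and v4 are fully
// readable, v3 changed an optional field this reader skips anyway,
// and anything newer than v4 is unknown to this reader.
constexpr uint16_t kCurrentVersion = 4;

Compat checkCompatibility(const FileHeader &H) {
  if (H.Version > kCurrentVersion)
    return Compat::Error; // produced by a newer writer; format unknown
  if (H.Version == 3)
    return Compat::WarnIrrelevant; // known, harmless difference
  return Compat::Ok;
}
```

The point of centralising this check is that every consumer (library or tool) makes the same keep/warn/refuse decision from the same version tag, instead of each one guessing.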
Dean Michael Berris via llvm-dev
2016-Dec-02 07:15 UTC
[llvm-dev] RFC: XRay in the LLVM Library
> On 2 Dec. 2016, at 09:06, David Blaikie <dblaikie at gmail.com> wrote:
>
> Not sure I quite follow here - perhaps some misunderstanding.
>
> My mental model here is that the formats are semantically equivalent - with a common in-memory representation (like LLVM IR APIs). It doesn't/shouldn't complicate a comparison tool to support both LLVM IR and bitcode input (or some other hypothetical formats that are semantically equivalent that we could integrate into a common reading API). At least that's my mental model.

I think you mean 'conversion' instead of 'comparison', but having said that, we cannot assume that semantic equivalence implies "cheapness".

At least in FDR mode, the data in the file is laid out as one fixed-size chunk per thread's log. These chunks may be interleaved with each other, forming something like the following:

[ File Header ] [ <thread 1 buffer>, <thread 2 buffer>, <thread 1 buffer>, ... ]

This can be converted to the current "naive" format:

[ File Header ] [ <record>, <record>, <record>, ... ]

N.B. <record> here is a self-contained tuple of (tsc, cpu id, thread id, record type, function id, padding).

However, the process of doing so will be very expensive -- we'll have to denormalise the records per thread buffer, expand out the TSCs, potentially load the whole FDR trace into memory, make multiple passes, etc. While we can certainly implement that part as a library so that we can "support" this alternate format/representation, I'm not sure we want users of the library to pay for this cost in storage and processing time if all they really want is to deal with an XRay trace.
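[Editor's note: the flattening step described above can be sketched roughly as follows. All structs and field layouts here are simplified, hypothetical stand-ins for the real FDR buffer encoding; the sketch only illustrates why the conversion needs everything in memory, delta expansion, and a global sort.]

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified record matching the "naive" tuple described
// above: (tsc, cpu id, thread id, record type, function id).
struct NaiveRecord {
  uint64_t TSC;
  uint16_t CPUId;
  uint32_t ThreadId;
  uint8_t RecordType; // e.g. 0 = function entry, 1 = function exit
  int32_t FuncId;
};

// A per-thread FDR-style buffer: the thread id is stored once per chunk,
// and each event carries a TSC delta from the previous event rather than
// an absolute TSC (one of the denormalisations mentioned above).
struct ThreadBuffer {
  uint32_t ThreadId;
  uint64_t BaseTSC;
  struct Event {
    uint64_t TSCDelta;
    uint16_t CPUId;
    uint8_t Type;
    int32_t FuncId;
  };
  std::vector<Event> Events;
};

// Flatten interleaved per-thread buffers into a single, TSC-ordered
// stream of naive records. Note this keeps all buffers in memory and
// does a full sort -- the cost the text alludes to.
std::vector<NaiveRecord> flatten(const std::vector<ThreadBuffer> &Buffers) {
  std::vector<NaiveRecord> Out;
  for (const auto &Buf : Buffers) {
    uint64_t TSC = Buf.BaseTSC;
    for (const auto &E : Buf.Events) {
      TSC += E.TSCDelta; // expand deltas into absolute TSCs
      Out.push_back({TSC, E.CPUId, Buf.ThreadId, E.Type, E.FuncId});
    }
  }
  std::sort(Out.begin(), Out.end(),
            [](const NaiveRecord &A, const NaiveRecord &B) {
              return A.TSC < B.TSC;
            });
  return Out;
}
```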
The proposal is to keep the complexity involved in converting the FDR log format into the naive log format (both are binary; either one can have a YAML analogue) in the conversion tool, and to only really support one canonical format (the naive one, which could be either YAML or binary) in the library that deals with this format.

> Is there something different here?
>
> What I'm picturing is that we need an API for reading all these formats and either we use that API only in the conversion tool - and users then have to run the conversion tool before running the tool they want. Or we sink that API into a common place, and all tools use that API to load inputs - making the user experience simpler (they don't have to run an extra conversion step/tool) but it doesn't seem like it should make the development experience more complicated/messy/difficult.

I think having the complexity of conversion localised in the tool may be better than consolidating that API into something that others might use outside the tools. For instance, if we're talking about converting XRay traces to other supported formats (like CTF, the Chrome Trace Viewer format, or <insert something else>), then I suspect we want to keep that in the conversion tool's implementation rather than making those routines part of the distributed XRay library. Or if users wanted to read XRay traces in their own applications, they should just be able to support the canonical format, with any conversion happening externally, to keep their costs low.

The trade-off I'm thinking of is the support burden, not only in the development of the tools, but also of the exposed library that defines what the supported formats of XRay trace files look like.
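[Editor's note: the split proposed above can be pictured as follows. Every name here is hypothetical - `loadTrace`, `convertToNaive`, the `Format` tags, and the pre-decoded `TraceFile` are illustrative stand-ins, not the actual XRay library API. The point is only the division of responsibility: the library rejects non-canonical input instead of converting it, and the conversion lives solely in the tool.]

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical tags and records; names are illustrative only.
enum class Format { Naive, FDR };

struct Record {
  uint64_t TSC;
  uint32_t ThreadId;
  int32_t FuncId;
};

struct TraceFile {
  Format Fmt;
  std::vector<Record> Records; // already decoded, to keep the sketch small
};

// --- Library side: understands only the canonical (naive) format. ---
// Returns nothing for non-canonical inputs rather than converting them,
// so library users never pay the conversion cost.
std::optional<std::vector<Record>> loadTrace(const TraceFile &F) {
  if (F.Fmt != Format::Naive)
    return std::nullopt;
  return F.Records;
}

// --- Tool side: the only place that understands other formats. ---
// Stand-in for the expensive FDR flattening described earlier; the tool
// runs this before handing data to the library.
TraceFile convertToNaive(const TraceFile &F) {
  if (F.Fmt == Format::Naive)
    return F;
  return TraceFile{Format::Naive, F.Records};
}
```

A user who only ever produces canonical traces never links or runs the conversion code; a user with FDR data runs the tool once, then uses the same small library as everyone else.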
I suspect that iterating on a tool that gains support for multiple formats, while keeping the log-reading library released with LLVM simple, strikes the right balance: not needing to support too many formats in the API/library, while still supporting many formats in the conversion tool.

I'd think of the analogy here as the conversion tool being clang, which supports more than one programming language as source input but uses a canonical LLVM IR representation (in-memory, or written out). While LLVM has to handle backwards compatibility of LLVM IR, it doesn't have to worry about clang supporting a new programming language.

Does that make sense?

-- Dean
On 2 December 2016 at 07:15, Dean Michael Berris <dean.berris at gmail.com> wrote:

> I'd think of the analogy here as the conversion tool being clang, which supports more than one programming language as source input but uses a canonical LLVM IR representation (in-memory, or written out). While LLVM has to handle backwards compatibility of LLVM IR, it doesn't have to worry about clang supporting a new programming language.

That's how I understood it. Multiple languages with potentially different semantics (like C and Fortran), not different representations of the same semantics (like textual and binary IR).

While it's possible to convert both C and Fortran to IR, that's a complicated design cost we pay because we have to. Where we can, we decided not to pay that cost, i.e. we have a single IR semantic model and guarantee that its multiple representations are consistent and unique.

The translation of multiple formats (languages) should really be separated into different front-ends to the underlying model engine, which should only deal with a single, broad and well-defined representation.

cheers,
--renato
Eric Christopher via llvm-dev
2017-Jan-04 01:13 UTC
[llvm-dev] RFC: XRay in the LLVM Library
Sorry for coming into this thread late.

I can see a few uses for different formats, but I'm not quite convinced of the usefulness of a universal exchange library. That said, if Dean really wants to implement a way of converting between all of these things, I'm not going to stop him. I'd probably suggest just dumping some formats and using some sort of human-readable format for input as a way of testing, but that's just me.

-eric

On Thu, Dec 1, 2016 at 2:06 PM David Blaikie via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> [snip - full quote of the earlier messages in this thread]