thr3ads.net - llvm dev - [LLVMdev] RFC: Binary format for instrumentation based profiling data [Mar 2014]

If this information is useful, please help other people find it:
Share via:

Justin Bogner

2014-Mar-13 01:09 UTC

[LLVMdev] RFC: Binary format for instrumentation based profiling data

Instrumentation-based profiling data is generated by instrumented
binaries through library functions in compiler-rt, and read by the clang
frontend to feed PGO.

The current data format is a naive textual format.

The files consist of a number of counters for each function in a
program.

Use cases
========
There are a few use cases that drive our requirements:

A. Looking up the counters for a particular function needs to be
  efficient, so as to not slow down compilation too much when using the
  data for PGO.  This is also presumably how this data would be
  accessed if using it for coverage.

B. The file should be relatively quick to generate, and shouldn't
  require a large amount of memory overhead to write out.  Instrumented
  programs will write this file out on termination or on demand.

C. Iterating over all of the data must be reasonably simple, to support
  tools that summarize or merge data files.

D. The format should be relatively compact.  Programs with a large
  number of functions may be instrumented, and we don't want to create
  unnecessarily large files because of it.

E. The format should be portable between systems.  Ie, if Alice
  generates profile data for some program P using Machine A, Bob should
  be able to use that same profile data to instrument a copy of that
  same program on machine B.

Partially updating the file is not a goal.  Instrumented programs write
out a profile for a single run, and multiple runs can be merged by a
separate tool.

Semantics
========
Semantically, we have a collection of functions, each with a number of
counters associated with them.

Functions are represented by strings, determined by the part of the
frontend that both generates and uses this data.  In our case, these are
generally whatever clang thinks of as the function's name, with minor
details added to disambiguate names that aren't necessarily unique.

The counters are a list of unsigned 64 bit values.  The frontend decides
what constructs these belong to, but the first counter is for the
function itself, and always exists.  In order to validate that the
mapping is correct, we need a hash for the frontend to compare against.

Initially, we’re using the number of counters as the hash, but that
won't be sufficient as time goes on.

Format
=====
The file should be a binary format in order to satisfy use case (D).  As
such, we need to choose an endianness in order to satisfy (E).  A fixed
endianness is simpler than writing in native-endian and byte-swapping if
necessary on read, so we arbitrarily use little-endian.

The file consists of three parts:

1. Header
2. Index
3. Counter data

Header
------

The header contains metadata for parsing the file, and summarizing data
that's expensive to recompute.  Specifically:

- A magic number to identify the file format.  The character string
 "LPRF" has a nice ring to it.

- A version number, to accomodate future changes to the file format.

- An offset to the end of the index.  This could alternatively be a size
 of the index, to which the header size would need to be added to get
 the offset, but that's slightly less convenient.

- The maximum value of any function counter.  This is easy to calculate
 when writing the file, but would require reading the entire file to
 calculate later.

Index
-----

An index is necessary in order to get fast look up for particular
functions (use case A), and so that we can avoid paging in the entire
data file when we're only interested in a small subset of the data.  The
index is used as an associative array from the function name to the data
offset where the function's counter data can be found.

The first implementation uses the naive approach of an unordered list
of names and offsets.

 Each element will consist of:

   <string length> <character data> <data offset>

This obviously has an obvious drawback in the "startup" cost of
reading
these into an associative data structure, but it's relatively compact
and is an reasonable baseline to evaluate better solutions.

Future directions may include:


1. An on-disk hash table.  Questions like whether to use separate
   chaining or open addressing would need to be addressed.  This
   solution takes at least as much space as the naive approach, but
   avoids the startup cost.

1a. A format like LLVM's dwarf accelerator tables.  This is optimized
    for a different case than we're interested in (failed lookups rather
    than successful), so it's probably not appropriate.

2. An on-disk trie.  This should save a fair amount of space due to the
   fact that C language names and C++ mangling tend to lead to a large
   number of names with common prefixes. This format also allows
   iterating over the functions in alphabetical order.

Counter data
------------

The counter data is simply an array of unsigned 64 bit values.  Given an
offset found in the index, a sequence follows:

   <function hash> <number of counters> <counters...>

This is all of the data needed for a given function.

A potential future direction would be to represent counters with a
variable length encoding.  Many counters in a program are likely to be 0
or 1 for a particular run.  Using 8B for each is expensive.  However,
there’s a natural tradeoff against use case B, as the size of the
file is harder to predict when writing it out.

Diego Novillo

2014-Mar-13 12:48 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

On Wed, Mar 12, 2014 at 9:09 PM, Justin Bogner <mail at justinbogner.com>
wrote:
> Functions are represented by strings, determined by the part of the
> frontend that both generates and uses this data.  In our case, these are
> generally whatever clang thinks of as the function's name, with minor
> details added to disambiguate names that aren't necessarily unique.
Why not just use mangled names here? No need to add minor details, nor
ad-hoc disambiguation. If you're already using mangled names, then I'm
not sure why we need disambiguating details.
> The counter data is simply an array of unsigned 64 bit values.  Given an
> offset found in the index, a sequence follows:
>
>    <function hash> <number of counters> <counters...>
>
> This is all of the data needed for a given function.
How are counters represented? Are these line numbers together with the
counter? Basic blocks? Edges?

I wonder if it would make sense to use the existing gcov format for
this. OTOH, we could provide a converter in the Profile library.


Thanks. Diego.

Bob Wilson

2014-Mar-13 15:51 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

On Mar 13, 2014, at 5:48 AM, Diego Novillo <dnovillo at google.com> wrote:
> On Wed, Mar 12, 2014 at 9:09 PM, Justin Bogner <mail at
justinbogner.com> wrote:
> 
>> Functions are represented by strings, determined by the part of the
>> frontend that both generates and uses this data.  In our case, these
are
>> generally whatever clang thinks of as the function's name, with
minor
>> details added to disambiguate names that aren't necessarily unique.
> 
> Why not just use mangled names here? No need to add minor details, nor
> ad-hoc disambiguation. If you're already using mangled names, then
I'm
> not sure why we need disambiguating details.
We need to prepend the file name to distinguish functions with local linkage.
Also, Objective-C methods do not get mangled, so we don’t want to say that we
just use the mangled names.
> 
>> The counter data is simply an array of unsigned 64 bit values.  Given
an
>> offset found in the index, a sequence follows:
>> 
>>   <function hash> <number of counters> <counters...>
>> 
>> This is all of the data needed for a given function.
> 
> How are counters represented? Are these line numbers together with the
> counter? Basic blocks? Edges?
There are no line numbers, basic blocks, or edges. It is just a sequence of
counters that the front-end knows how to map to the code (the same as with our
current textual file format).
> 
> I wonder if it would make sense to use the existing gcov format for
> this. OTOH, we could provide a converter in the Profile library.
This is pretty clearly a different format from gcov, and I don’t see how we
could convert between them. But, I agree that it would be nice to collect the
code for handling different kinds of profile formats in one library, even if
those file formats are not interchangeable.

Dmitri Gribenko

2014-Mar-13 19:52 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

On Thu, Mar 13, 2014 at 1:09 AM, Justin Bogner <mail at justinbogner.com>
wrote:> 1. An on-disk hash table.  Questions like whether to use separate
>    chaining or open addressing would need to be addressed.  This
>    solution takes at least as much space as the naive approach, but
>    avoids the startup cost.
Just wanted to point out that Clang already has an on-disk hash table
implementation in:

include/clang/Basic/OnDiskHashTable.h

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at
gmail.com>*/

Chandler Carruth

2014-Mar-17 21:07 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

On Wed, Mar 12, 2014 at 6:09 PM, Justin Bogner <mail at
justinbogner.com>wrote:
> Instrumentation-based profiling data is generated by instrumented
> binaries through library functions in compiler-rt, and read by the clang
> frontend to feed PGO.
>
Note that compiler-rt *cannot* link in LLVM libraries for many reasons:

1) The compiler-rt must be built for the target, the LLVM libraries are
built for the host.
2) The compiler-rt source is under a different license
3) We typically want to avoid the large dependencies of LLVM from runtime
libraries that need to be extremely light weight such as the profile
runtime.

So I think you will at least need to have the header file with basic
structs, and endian-aware writing code duplicated. =/ This may explain some
of the trouble you had with build systems at least.

Use cases> ========>
> There are a few use cases that drive our requirements:
>
> A. Looking up the counters for a particular function needs to be
>   efficient, so as to not slow down compilation too much when using the
>   data for PGO.  This is also presumably how this data would be
>   accessed if using it for coverage.
>
> B. The file should be relatively quick to generate, and shouldn't
>   require a large amount of memory overhead to write out.  Instrumented
>   programs will write this file out on termination or on demand.
>
> C. Iterating over all of the data must be reasonably simple, to support
>   tools that summarize or merge data files.
>
> D. The format should be relatively compact.  Programs with a large
>   number of functions may be instrumented, and we don't want to create
>   unnecessarily large files because of it.
>
> E. The format should be portable between systems.  Ie, if Alice
>   generates profile data for some program P using Machine A, Bob should
>   be able to use that same profile data to instrument a copy of that
>   same program on machine B.
>
> Partially updating the file is not a goal.  Instrumented programs write
> out a profile for a single run, and multiple runs can be merged by a
> separate tool.
>
The other assumption here is that you want the same file format written by
instrumentation and read back by the compiler. While I think that is an
unsurprising goal, I think it creates quite a few limitations that I'd like
to point out. I think it would be worthwhile to consider the alternative of
having the profile library write out data files in a format which is
essentially "always" transformed by a post-processing tool before
being
used during compilation.

Limitations of using the same format in both places:
- High burden on writing the file constrains the format (must be fast, must
not use libraries, etc...)
- Have to write and index even though the writer doesn't really need it.
- Have to have the function name passed through the instrumentation,
potentially duplicating it with debug info.
- Can't use an extensible file format (like bitcode) to insulate readers of
profile data from format changes.

I'm imagining it might be nicer to have something along the lines of the
following counter proposal. Define two formats: the format written by
instrumentation, and the format read by the compiler. Split the use cases
up. Specialize the formats based on the use cases. It does require the user
to post-process the results, but it isn't clear that this is really a
burden. Historically it has been needed to merge gcov profiles from
different TUs, and it is still required to merge them from multiple runs.

I think the results could be superior for both the writer and reader:

Instrumentation written format:
- No index, just header and counters
- (optional) Omit function names, and use PC at a known point of the
function, and rely on debug info to map back to function names.
- Use a structure which can be mmap-ed directly by the instrumentation code
(at least on LE systems) so that "writing the file on close" is just
flushing the memory region to disk
- Explicitly version format, and provide no stability going forward

Profile reading format:
- Use a bitcoded format much like Clang's ASTs do (or some other tagged
format which allows extensions)
- Leverage the existing partial reading which has been heavily optimized
for modules, LLVM IR, etc.
- Use implicit-zero semantics for missing counters within a function where
we have *some* instrumentation results, and remove all zero counters
- Maybe other compression techniques

Thoughts? Specific reasons to avoid this? I'm very much interested in
minimizing the space and runtime overhead of instrumentation, as well as
getting more advanced features in the format read by Clang itself.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140317/d516e1c2/attachment.html>

Justin Bogner

2014-Mar-18 00:22 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

Chandler Carruth <chandlerc at google.com> writes:> The other assumption here is that you want the same file format written by
> instrumentation and read back by the compiler. While I think that is an
> unsurprising goal, I think it creates quite a few limitations that I'd
like to
> point out. I think it would be worthwhile to consider the alternative of
> having the profile library write out data files in a format which is
> essentially "always" transformed by a post-processing tool before
being used
> during compilation.
>
> Limitations of using the same format in both places:
> - High burden on writing the file constrains the format (must be fast, must
>   not use libraries, etc...)
> - Have to write and index even though the writer doesn't really need
it.
> - Have to have the function name passed through the instrumentation,
>   potentially duplicating it with debug info.
> - Can't use an extensible file format (like bitcode) to insulate
readers of
>   profile data from format changes.
>
> I'm imagining it might be nicer to have something along the lines of
the
> following counter proposal. Define two formats: the format written by
> instrumentation, and the format read by the compiler. Split the use cases
up.
> Specialize the formats based on the use cases. It does require the user to
> post-process the results, but it isn't clear that this is really a
burden.
> Historically it has been needed to merge gcov profiles from different TUs,
and
> it is still required to merge them from multiple runs.
This is an interesting idea. The counter data itself without index is
dead simple, so this approach for the instrumentation written format
would certainly be nice for compiler-rt, at the small cost of needing
two readers. We'd also need two writers, but that appears inevitable
since one needs to live in compiler-rt.
> I think the results could be superior for both the writer and reader:
>
> Instrumentation written format:
> - No index, just header and counters
> - (optional) Omit function names, and use PC at a known point of the
function,
>   and rely on debug info to map back to function names.
This depends a bit on whether or not the conversion tool should depend
on the debug info being available. We'd need to weigh the usability cost
against the size benefit.
> - Use a structure which can be mmap-ed directly by the instrumentation code
>   (at least on LE systems) so that "writing the file on close" is
just
>   flushing the memory region to disk
If this is feasible, we could also make the format is host endian and
force the post-processing to byteswap as it reads. This avoids online
work in favour of offline.
> - Explicitly version format, and provide no stability going forward
>
> Profile reading format:
> - Use a bitcoded format much like Clang's ASTs do (or some other tagged
format
>   which allows extensions)
I'm not entirely convinced a bitcoded format is going to gain us much
over a simpler on disk hash table. The variable bit rate integers might
be worthwhile, but will it be efficient to look up the counters for a
particular function name?

That said, the ASTs also make use of the on disk hash that Dmitri
mentioned for various indexes, which is definitely worth looking at.
> - Leverage the existing partial reading which has been heavily optimized
for
>   modules, LLVM IR, etc.
> - Use implicit-zero semantics for missing counters within a function where
we
>   have *some* instrumentation results, and remove all zero counters
> - Maybe other compression techniques
>
> Thoughts? Specific reasons to avoid this? I'm very much interested in
> minimizing the space and runtime overhead of instrumentation, as well as
> getting more advanced features in the format read by Clang itself.

llvm dev - Mar 2014 - [LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data