thr3ads.net - llvm dev - [LLVMdev] RFC: Binary format for instrumentation based profiling data [Apr 2014]

If this information is useful, please help other people find it:
Share via:

Chandler Carruth

2014-Apr-16 07:21 UTC

[LLVMdev] RFC: Binary format for instrumentation based profiling data

On Fri, Mar 28, 2014 at 1:33 PM, Justin Bogner <mail at
justinbogner.com>wrote:
> Chandler Carruth <chandlerc at google.com> writes:
> >     Format 2
> >     --------
> >
> >     This format should be efficient to read and preferably reasonably
> >     compact. We'll convert from format 1 to format 2 using
llvm-profdata,
> >     and clang will use format 2 for PGO.
> >
> >     Since the only particularly important operation in this use case
is
> fast
> >     lookup, I propose using the on disk hash table that's
currently used
> in
> >     clang for AST serialization/PTH/etc with a small amount of
metadata
> in a
> >     header.
> >
> >     The hash table implementation currently lives in
include/clang/Basic
> and
> >     consists of a single header. Moving it to llvm and updating the
> clients
> >     in clang should be easy. I'll send a brief RFC separately to
see if
> >     anyone's opposed to moving it.
> >
> > I can mention this and we can discuss this on the other thread if you
> would
> > rather, but I'm not a huge fan of this code. My vague memory was
that
> this was
> > a quick hack by Doug that he never really expected to live long-term.
>
> It may not be the prettiest piece of code, but given that it's used in
> several places in clang and hasn't needed any significant changes since
> 2010, I'd say it's fairly solid. It also has the very obvious
advantage
> of already existing, which makes it a pretty good candidate for a
> version 1 format, IMHO.
>
So, I've gone and read all of it to try and get a good handle on the
current state rather than dredging up memories from so many years ago. =] I
can speak more confidently now.

The OnDiskHashTable stuff seems perfectly fine for emitting just that - an
on-disk-hash-table. However, it was never designed to support long-lived
file formats in that form. The use in serialized ASTs is specifically not
supporting a long-term file format. The biggest issue there is endianness,
and I see you've already very nicely added good support for that. The only
remaining concern I might have are the 32-bit offset limitations.
While>2gb of counter data may seem unlikely, I don't think it is
inconceivable,and even >4gb of counter data might happen. Using 64-bit offsets seems an
easy fix though. Essentially, I think this part of the code could quickly
and easily be made viable for this purpose, although it would require a bit
more cleanup and documenting the intended stability.

Anyways, the part I was truly concerned about is actually nicely factored
out -- the hashing bit. The AST's hashing is *completely* unsuitable for a
long-term file format, but my assumption is that you'd just use the
existing stable PGO hashing stuff for this table? If so, it should work
fine. If you want to hash other things (function names?), I would just urge
using something like their MD5 or some other fixed, and uncontroversial
algorithm so we don't end up wondering how a bug snuck in there N years
later. So this seems workable too.

Essentially, to answer a later question:

If you're opposed to moving the existing OnDiskHashTable into
Support,> perhaps because you don't think it should proliferate to other uses,
> the obvious alternative is to include a private copy of a stripped down
> version of it for the profile reader and writer to use themselves. I'm
> not sure if this is worth the copy pasted code, but it is an
> option. What do you think?

I think with the cleanups you've started plus a bit more documentation,
this could be a fine candidate for a generic on-disk (or, raw memory
buffer) hash table writer and reader.

OK, on to the general questions rather than ones concerning specific code...

> > I have a general preference for from-disk lookups to use tries (for
> strings,
> > prefix tries) or other fast, sorted lookup structures. They have the
nice
> > property of being inherently stable and unambiguous, and not baking
any
> > hashing algorithm into it.
>
> I would like to experiment with a few trie-based approaches for this as
> we try to optimize the PGO process further (both for space and for
> lookup time). Even so, it's not a sure thing that this will work
better,
>
So the first question is whether it is really worth looking into other
solutions. I have a suspicion that there are better formats for this
because of one key idea: while the important operation is lookup, I don't
think it is truly *random* lookup. In fact, I suspect it will be extremely
structured lookup, with a few hot clusters of data due to similar function
"names" (where the names of things like file-local static functions
get
file name prefixes and such to disambiguate them). So I think that there is
a real locality win possible in this space. Maybe not all the time, and
most of the time it may be below the noise floor, but I think it will still
be there.

> and I don't think it's worth delaying getting something that people
can
> use out the door.
>
If the file format is wide open to change over the coming months, then I'm
totally down with this. Makes perfect sense. However, I get the feeling
that it isn't wide open to change, and we're going to quickly end up
locked
into a particular format here to support the backwards compatibility
concerns. Personally, I'm happy to change the format *very* fluidly, at
least until there is an LLVM release, and potentially even after that, but
it would be good to hear from others that want to consume this data.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140416/4cfa2195/attachment.html>

Bob Wilson

2014-Apr-16 17:48 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

On Apr 16, 2014, at 12:21 AM, Chandler Carruth <chandlerc at google.com>
wrote:
> On Fri, Mar 28, 2014 at 1:33 PM, Justin Bogner <mail at
justinbogner.com> wrote:
> Chandler Carruth <chandlerc at google.com> writes:
> >     Format 2
> >     --------
> >
> >     This format should be efficient to read and preferably reasonably
> >     compact. We'll convert from format 1 to format 2 using
llvm-profdata,
> >     and clang will use format 2 for PGO.
> >
> >     Since the only particularly important operation in this use case
is fast
> >     lookup, I propose using the on disk hash table that's
currently used in
> >     clang for AST serialization/PTH/etc with a small amount of
metadata in a
> >     header.
> >
> >     The hash table implementation currently lives in
include/clang/Basic and
> >     consists of a single header. Moving it to llvm and updating the
clients
> >     in clang should be easy. I'll send a brief RFC separately to
see if
> >     anyone's opposed to moving it.
> >
> > I can mention this and we can discuss this on the other thread if you
would
> > rather, but I'm not a huge fan of this code. My vague memory was
that this was
> > a quick hack by Doug that he never really expected to live long-term.
> 
> It may not be the prettiest piece of code, but given that it's used in
> several places in clang and hasn't needed any significant changes since
> 2010, I'd say it's fairly solid. It also has the very obvious
advantage
> of already existing, which makes it a pretty good candidate for a
> version 1 format, IMHO.
> 
> So, I've gone and read all of it to try and get a good handle on the
current state rather than dredging up memories from so many years ago. =] I can
speak more confidently now.
> 
> The OnDiskHashTable stuff seems perfectly fine for emitting just that - an
on-disk-hash-table. However, it was never designed to support long-lived file
formats in that form. The use in serialized ASTs is specifically not supporting
a long-term file format. The biggest issue there is endianness, and I see
you've already very nicely added good support for that. The only remaining
concern I might have are the 32-bit offset limitations. While >2gb of counter
data may seem unlikely, I don't think it is inconceivable, and even >4gb
of counter data might happen. Using 64-bit offsets seems an easy fix though.
Essentially, I think this part of the code could quickly and easily be made
viable for this purpose, although it would require a bit more cleanup and
documenting the intended stability.
It wouldn’t surprise me at all if we eventually need larger than 32-bit offsets,
but I don’t think that’s a priority right now. We have almost no knowledge of
the typical size of these PGO profiles, and no one is even using this new
feature yet. More importantly, it would be good to share the basic OnDiskHash
table implementation for clang’s serialized ASTs and modules and I’m concerned
that it might be disruptive to change that right now. We’re almost certainly
going to be changing the file format over time and we can coordinate a change to
64-bit offsets later if that turns out to be important.
> 
> Anyways, the part I was truly concerned about is actually nicely factored
out -- the hashing bit. The AST's hashing is *completely* unsuitable for a
long-term file format, but my assumption is that you'd just use the existing
stable PGO hashing stuff for this table? If so, it should work fine. If you want
to hash other things (function names?), I would just urge using something like
their MD5 or some other fixed, and uncontroversial algorithm so we don't end
up wondering how a bug snuck in there N years later. So this seems workable too.
The PGO hashing is completely unrelated. The hash here is just of the function
name (or more specifically, the name that we used in the PGO data, e.g.,
possibly mangled). We’re currently using Bernstein’s hash function, which is so
simple that I’m not worried about bugs, but it isn’t a great hash function. MD5
would make sense to me.
> 
> Essentially, to answer a later question:
> 
> If you're opposed to moving the existing OnDiskHashTable into Support,
> perhaps because you don't think it should proliferate to other uses,
> the obvious alternative is to include a private copy of a stripped down
> version of it for the profile reader and writer to use themselves. I'm
> not sure if this is worth the copy pasted code, but it is an
> option. What do you think?
> 
> I think with the cleanups you've started plus a bit more documentation,
this could be a fine candidate for a generic on-disk (or, raw memory buffer)
hash table writer and reader.
> 
> OK, on to the general questions rather than ones concerning specific
code...
> 
> 
> > I have a general preference for from-disk lookups to use tries (for
strings,
> > prefix tries) or other fast, sorted lookup structures. They have the
nice
> > property of being inherently stable and unambiguous, and not baking
any
> > hashing algorithm into it.
> 
> I would like to experiment with a few trie-based approaches for this as
> we try to optimize the PGO process further (both for space and for
> lookup time). Even so, it's not a sure thing that this will work
better,
> 
> So the first question is whether it is really worth looking into other
solutions. I have a suspicion that there are better formats for this because of
one key idea: while the important operation is lookup, I don't think it is
truly *random* lookup. In fact, I suspect it will be extremely structured
lookup, with a few hot clusters of data due to similar function
"names" (where the names of things like file-local static functions
get file name prefixes and such to disambiguate them). So I think that there is
a real locality win possible in this space. Maybe not all the time, and most of
the time it may be below the noise floor, but I think it will still be there.
>  
> and I don't think it's worth delaying getting something that people
can
> use out the door.
> 
> If the file format is wide open to change over the coming months, then
I'm totally down with this. Makes perfect sense. However, I get the feeling
that it isn't wide open to change, and we're going to quickly end up
locked into a particular format here to support the backwards compatibility
concerns. Personally, I'm happy to change the format *very* fluidly, at
least until there is an LLVM release, and potentially even after that, but it
would be good to hear from others that want to consume this data.
We need to settle of a file format ASAP for our internal work, but from the
perspective of the LLVM community, it makes sense to me that this should remain
wide open to change, at least until it goes out in an open-source LLVM release.

I’d like to proceed in the usual incremental fashion. This change is essentially
just moving the OnDiskHashTable code from clang to llvm, and given that we seem
to have agreement on the general direction, I see no reason why that can’t be
committed now, along with the llvm-profdata changes to use it. It should be
followed up with a change to address Chandler’s concern about the hash function.

Chandler, do you think we need to investigate the 64-bit offset issue now or can
that wait until we have a better idea whether it matters?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140416/dfd4a9d7/attachment.html>

Justin Bogner

2014-Apr-16 19:48 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

Bob Wilson <bob.wilson at apple.com> writes:> On Apr 16, 2014, at 12:21 AM, Chandler Carruth <chandlerc at
google.com> wrote:
>     The OnDiskHashTable stuff seems perfectly fine for emitting just that -
an
>     on-disk-hash-table. However, it was never designed to support
long-lived
>     file formats in that form. The use in serialized ASTs is specifically
not
>     supporting a long-term file format. The biggest issue there is
endianness,
>     and I see you've already very nicely added good support for that.
The only
>     remaining concern I might have are the 32-bit offset limitations. While
>
>     2gb of counter data may seem unlikely, I don't think it is
inconceivable,
>     and even >4gb of counter data might happen. Using 64-bit offsets
seems an
>     easy fix though. Essentially, I think this part of the code could
quickly
>     and easily be made viable for this purpose, although it would require a
>     bit more cleanup and documenting the intended stability.
>
> It wouldn’t surprise me at all if we eventually need larger than 32-bit
> offsets, but I don’t think that’s a priority right now. We have almost no
> knowledge of the typical size of these PGO profiles, and no one is even
using
> this new feature yet. More importantly, it would be good to share the basic
> OnDiskHash table implementation for clang’s serialized ASTs and modules and
> I’m concerned that it might be disruptive to change that right now. We’re
> almost certainly going to be changing the file format over time and we can
> coordinate a change to 64-bit offsets later if that turns out to be
important.
Adding an extra template parameter (offset_type?) to the OnDiskHashTable
classes would be pretty trivial, we can even default it to uint32_t for
the pre-existing clang usages to minimize churn. Given how easy this is
to do, it's probably worth doing now.
>     Anyways, the part I was truly concerned about is actually nicely
factored
>     out -- the hashing bit. The AST's hashing is *completely*
unsuitable for a
>     long-term file format, but my assumption is that you'd just use the
>     existing stable PGO hashing stuff for this table? If so, it should work
>     fine. If you want to hash other things (function names?), I would just
>     urge using something like their MD5 or some other fixed, and
>     uncontroversial algorithm so we don't end up wondering how a bug
snuck in
>     there N years later. So this seems workable too.
>
> The PGO hashing is completely unrelated. The hash here is just of the
function
> name (or more specifically, the name that we used in the PGO data, e.g.,
> possibly mangled). We’re currently using Bernstein’s hash function, which
is
> so simple that I’m not worried about bugs, but it isn’t a great hash
function.
> MD5 would make sense to me.
Yes, these are hashes of the function names. The OnDiskHashTable
currently expects a 32 bit hash, so I think we should add a
hash_value_type template parameter as well, then use 64 bits of an MD5
here. Agreed?

Chandler Carruth <chandlerc at google.com> writes:> Essentially, to answer a later question:
>
>     If you're opposed to moving the existing OnDiskHashTable into
Support,
>     perhaps because you don't think it should proliferate to other
uses,
>     the obvious alternative is to include a private copy of a stripped down
>     version of it for the profile reader and writer to use themselves.
I'm
>     not sure if this is worth the copy pasted code, but it is an
>     option. What do you think?
>
> I think with the cleanups you've started plus a bit more documentation,
this
> could be a fine candidate for a generic on-disk (or, raw memory buffer)
hash
> table writer and reader.
Great. I'll go ahead and move the hash table now, and I'll add some
documentation while I'm at it. I'm going to put it in llvm/Support/,
and keep the OnDiskHashTable.h name.

Then I'll go ahead and write up the patches for choosing the offset and
hash value types, and then move on to using this in InstrProf and
updating the clang clients.
> OK, on to the general questions rather than ones concerning specific
code...
>
>     > I have a general preference for from-disk lookups to use tries
(for
>     strings,
>     > prefix tries) or other fast, sorted lookup structures. They have
the
>     nice
>     > property of being inherently stable and unambiguous, and not
baking any
>     > hashing algorithm into it.
>    
>     I would like to experiment with a few trie-based approaches for this as
>     we try to optimize the PGO process further (both for space and for
>     lookup time). Even so, it's not a sure thing that this will work
better,
>
> So the first question is whether it is really worth looking into other
> solutions. I have a suspicion that there are better formats for this
because
> of one key idea: while the important operation is lookup, I don't think
it is
> truly *random* lookup. In fact, I suspect it will be extremely structured
> lookup, with a few hot clusters of data due to similar function
"names" (where
> the names of things like file-local static functions get file name prefixes
> and such to disambiguate them). So I think that there is a real locality
win
> possible in this space. Maybe not all the time, and most of the time it may
be
> below the noise floor, but I think it will still be there.
Right, data locality and an abundance of shared prefixes make this data
set kind of interesting. I agree it's worth experimenting, but I do
think we need this in as a baseline first.
>     and I don't think it's worth delaying getting something that
people can
>     use out the door.
>
> If the file format is wide open to change over the coming months, then
I'm
> totally down with this. Makes perfect sense. However, I get the feeling
that
> it isn't wide open to change, and we're going to quickly end up
locked into a
> particular format here to support the backwards compatibility concerns.
> Personally, I'm happy to change the format *very* fluidly, at least
until
> there is an LLVM release, and potentially even after that, but it would be
> good to hear from others that want to consume this data.
I'm happy with making changes here too. Once we have an LLVM release
with this in it, then changes will need to come with a story on how
users upgrade their old data to the changed format, but we shouldn't let
that stop us from improving things.

Chandler Carruth

2014-Apr-16 21:01 UTC

head link

[LLVMdev] RFC: Binary format for instrumentation based profiling data

On Wed, Apr 16, 2014 at 10:48 AM, Bob Wilson <bob.wilson at apple.com>
wrote:
> We need to settle of a file format ASAP for our internal work, but from
> the perspective of the LLVM community, it makes sense to me that this
> should remain wide open to change, at least until it goes out in an
> open-source LLVM release.

Ok, I don't really know how to reconcile this.

Is it fine to make breaking, backwards incompatible changes to the file
format in the coming months or not?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140416/099cb0fb/attachment.html>

llvm dev - Apr 2014 - [LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data

[LLVMdev] RFC: Binary format for instrumentation based profiling data