thr3ads.net - llvm dev - [llvm-dev] RFC: A binary serialization format for MemProf [Oct 2021]

Snehasish Kumar via llvm-dev

2021-Oct-07 19:06 UTC

[llvm-dev] RFC: A binary serialization format for MemProf

Hi Wenlei,

Thanks for taking a look! Added responses inline.

On Thu, Oct 7, 2021 at 9:29 AM Xinliang David Li <davidxl at google.com> wrote:
>
> Just a quick note -- IRPGO profile is not deterministic with multi-threaded
> programs due to contention (there is of course atomic update mode, but it can
> be slow). Asynchronous dumping is another reason that the profile is not
> guaranteed to be repeatable.
>
> David
>
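
A small sketch of the lost-update effect David describes, written as a deterministic simulation rather than real threads. It assumes a counter increment is a separate load and store; `run` and the schedule format are hypothetical names used only for this illustration, not anything from the IRPGO runtime.

```python
def run(schedule):
    """Replay a list of (thread, step) pairs against one shared counter.

    Each thread increments the counter non-atomically: it first loads the
    value into a private register, then stores register + 1.
    """
    counter = 0
    registers = {}
    for thread, step in schedule:
        if step == "load":
            registers[thread] = counter
        else:  # "store"
            counter = registers[thread] + 1
    return counter

# Both increments are kept when the threads do not overlap.
serial = run([("t1", "load"), ("t1", "store"), ("t2", "load"), ("t2", "store")])

# One increment is lost when both threads load before either stores.
raced = run([("t1", "load"), ("t2", "load"), ("t1", "store"), ("t2", "store")])
```

The same program and input can therefore produce different counts depending on scheduling, which is why an atomic update mode avoids the loss at some runtime cost.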
> On Thu, Oct 7, 2021 at 9:18 AM Wenlei He <wenlei at fb.com> wrote:
>>
>> Thanks for sharing the progress and details on the binary format.
>> Overall this looks like a clean design that fits the current PGO profile
>> format with extensions.
>>
>>
>>
>> Some high level comments:
>>
>>
>>
Our focus is to have a single combined IR instrumentation and PGHO
instrumentation phase to keep operational costs low. For CSPGO today,
this would be the second IR instrumentation phase. We also intend to
support a separate PGHO instrumentation phase.

>> Does memprof/PGHO work together with today's IRPGO, i.e. can we have one
>> instrumented build to collect both PGO and PGHO profiles, or will we need
>> separate PGO instrumentation builds for each? In the latter case CSPGO + PGHO
>> would need three iterations of training and build, which would be a
>> significant operational cost.
>>
Yes, the context tracker is quite relevant to the IR matching need.
Teresa will share the detailed design soon and we can evaluate the
benefit of reusing the existing logic for CSSPGO. I think this is
orthogonal to this RFC (serialization format) so we can defer to the
next one for a detailed discussion.

>> I think some of the problems memprof faces when storing calling context
>> and mapping context to IR are very similar to CSSPGO's. I'm wondering if it
>> makes sense to promote some existing infrastructure to be more general,
>> beyond just serving CSSPGO. One example is the IR mapping you mentioned
>> (quoted below). In CSSPGO, we have the exact same need, and it's handled by
>> `SampleContextTracker`, which queries a context trie using an
>> instruction/DILocation.
>>
>>
>>
>> > Because the MIB corresponding to the A->B context is associated with
>> > function B in the profile, we do not find it by looking at function A’s
>> > profile when we see function A’s malloc call during matching. To address
>> > this we need to keep a correspondence from debug locations to the
>> > associated profile information.
>>
>>
>>
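
To make the matching problem in the quoted paragraph concrete, here is a minimal sketch. It is not MemProf's or CSSPGO's actual data structure: `Frame`, `MIB`, and `build_location_index` are hypothetical names, and the sketch assumes B has been inlined into A so that the malloc call now sits in A's body. Each memory info block is indexed by every debug location in its allocation context, so a lookup from A's side still finds the block stored under B.

```python
from collections import defaultdict
from typing import NamedTuple


class Frame(NamedTuple):
    function: str
    file: str
    line: int
    discriminator: int = 0


class MIB(NamedTuple):
    context: tuple      # call stack, allocation site first
    alloc_count: int


def build_location_index(profile):
    """Map (file, line, discriminator) to every MIB whose context contains it."""
    index = defaultdict(list)
    for mibs in profile.values():
        for mib in mibs:
            for frame in mib.context:
                key = (frame.file, frame.line, frame.discriminator)
                index[key].append(mib)
    return index


# B contains the malloc and A calls B. The profile is keyed by the function
# that contains the allocation, so the A->B block lives under "B".
malloc_in_b = Frame("B", "b.cc", 10)
call_in_a = Frame("A", "a.cc", 42)
profile = {"B": [MIB(context=(malloc_in_b, call_in_a), alloc_count=7)]}

index = build_location_index(profile)

# Looking in A's own profile finds nothing, but the debug location of the
# call site in A leads to the block through the index.
found_via_a = index[("a.cc", 42, 0)]
```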
We intend to retain as much of the calling context information as possible
until the IR matching. This is where we can leverage common solutions. We
would be happy to generalize where appropriate and intend to tackle
this topic in detail in the next RFC.

>> The serialization and pruning of calling context are also examples of
>> shared problems, and we've put in some effort to have effective solutions
>> (e.g. the offline preinliner for the most effective pruning, which I think
>> could be adapted to help keep the most important allocation contexts).
>> Perhaps some of the frameworks can be merged, so LLVM has general
>> context-aware PGO support that can be leveraged by different kinds of PGO
>> (IRPGO, PGHO, CSSPGO). If you think this is worth pursuing, we’d be happy to
>> help too.
>>
>>
>>
>> More on the details:
>>
>>
As David mentioned, keeping the PGHO profile deterministic is a
non-goal since IR PGO profile is non-deterministic.

>> I saw that MemInfoBlock contains alloc/dealloc cpuid. Does that make the
>> memprof profile non-deterministic, in the sense that running memprof twice
>> on the exact same program and input would yield bit-wise different memory
>> profiles? I think IR PGO profile is deterministic?
>>
>>
We need to use the file path instead of the function to be able to
distinguish COMDAT functions. The line_offset based matching is more
resilient if the entire function is moved. I think it's a good idea
and we can incorporate it into the IR matching phase.

>> Why do we use `file:line:discriminator` instead of
>> `func:line_offset:discriminator`? The latter would be more resilient to
>> source changes. If the function name string is too long, we could perhaps
>> leverage the MD5 encoding used by sample PGO?
>>
>>
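
The trade-off between the two location keys can be sketched as follows. This is an illustration only: `file_key`, `func_key`, and `md5_name_hash` are hypothetical helpers, and taking the first 8 bytes of the MD5 digest as a little-endian integer is an assumption about the encoding, not a statement of what sample PGO does.

```python
import hashlib


def md5_name_hash(name: str) -> int:
    """Fixed-size 64-bit stand-in for a long function name."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:8], "little")


def file_key(file, line, discriminator):
    """Key in the `file:line:discriminator` style."""
    return (file, line, discriminator)


def func_key(func, func_start_line, line, discriminator):
    """Key in the `func:line_offset:discriminator` style."""
    return (md5_name_hash(func), line - func_start_line, discriminator)


# Profiled build: foo() starts at line 100 and allocates at line 103.
old_file = file_key("foo.cc", 103, 0)
old_func = func_key("foo", 100, 103, 0)

# Later build: 20 unrelated lines were inserted above foo().
new_file = file_key("foo.cc", 123, 0)
new_func = func_key("foo", 120, 123, 0)
```

The file-based key changes when the function moves, while the offset-based key does not. On the other hand, `func_key` as written cannot tell apart same-named COMDAT copies that come from different files, which is the reason given above for keeping the file path.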
While we only intend to support Memprof optimizations for the main
binary, retaining all executable mappings allows future analysis tools
to symbolize shared library code.

>> Is the design of the mmap section (quoted below) trying to support memprof
>> for multiple binaries in the same process at the same time, or is it mainly
>> for handling multiple non-consecutive executable segments of a single binary?
>>
>>
>>
>> > The process memory mappings for the executable segment during profiling
>> > are stored in this section. This allows symbolization during post
>> > processing for binaries which are built with position independent code.
>> > For now all read only, executable mappings are recorded, however in the
>> > future, mappings for heap data can also potentially be stored.
>>
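
A rough sketch of why the recorded mappings are needed for position independent code: a runtime address only becomes symbolizable once it is translated back to an offset within the file it was loaded from. The `Mapping` fields and `to_binary_offset` are hypothetical and do not describe the actual layout of the mmap section.

```python
from typing import NamedTuple, Optional, Tuple


class Mapping(NamedTuple):
    start: int        # runtime start address of the segment
    end: int          # runtime end address (exclusive)
    file_offset: int  # offset of the segment within the file on disk
    path: str         # binary or shared library backing the segment


def to_binary_offset(addr: int, mappings) -> Optional[Tuple[str, int]]:
    """Translate a runtime address to (path, offset in that file)."""
    for m in mappings:
        if m.start <= addr < m.end:
            return (m.path, addr - m.start + m.file_offset)
    return None  # address is not in any recorded executable mapping


mappings = [
    Mapping(0x555555554000, 0x555555560000, 0x1000, "/bin/app"),
    Mapping(0x7FFFF7A00000, 0x7FFFF7B00000, 0x2000, "/lib/libfoo.so"),
]

in_main = to_binary_offset(0x555555554010, mappings)
in_lib = to_binary_offset(0x7FFFF7A00100, mappings)
unmapped = to_binary_offset(0x1000, mappings)
```

Keeping the shared library entry costs one extra record here, and it is what would let a later tool resolve `in_lib` even if optimizations only target the main binary.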
Yes, we do intend to support Memprof profile section merging via
`llvm-profdata merge`. The schema overhead per function is low now, so
we opted for function granularity. We can revisit if the overheads are
high or if the IR metadata scheme intends to keep it at module
granularity (in which case we don't need the extra fidelity).

>> Do we need each function record to have its own schema? Do we expect
>> different functions to use different versions/schemas? This is very
>> flexible, but I'm wondering what the use case is. If the schema is for
>> compatibility across versions, perhaps a file-level schema would be enough?
>>
>>
>>
>> > The InstrProfRecord for each function will hold the schema and an array
>> > of Memprof info blocks, one for each unique allocation context.
>>
>>
>>
>>
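
As a sketch of the per-function layout discussed here (a schema followed by an array of info blocks), the snippet below packs and unpacks one record. The field ids, field names, and byte layout are all made up for illustration and are not the format proposed in the RFC.

```python
import struct

# Hypothetical field ids; every field is stored as an unsigned 64-bit value.
FIELDS = {0: "alloc_count", 1: "total_size", 2: "alloc_cpu_id"}


def pack_record(schema, blocks):
    """Serialize: field count, field ids, block count, then block values."""
    n = len(schema)
    out = struct.pack("<I", n)
    out += struct.pack(f"<{n}I", *schema)
    out += struct.pack("<I", len(blocks))
    for block in blocks:
        out += struct.pack(f"<{n}Q", *(block[FIELDS[f]] for f in schema))
    return out


def unpack_record(data):
    """Read a record back, using the embedded schema to name each value."""
    (n,) = struct.unpack_from("<I", data, 0)
    schema = struct.unpack_from(f"<{n}I", data, 4)
    offset = 4 + 4 * n
    (count,) = struct.unpack_from("<I", data, offset)
    offset += 4
    blocks = []
    for _ in range(count):
        values = struct.unpack_from(f"<{n}Q", data, offset)
        offset += 8 * n
        blocks.append({FIELDS[f]: v for f, v in zip(schema, values)})
    return list(schema), blocks


# This record's schema omits alloc_cpu_id, so that field is never written.
block = {"alloc_count": 3, "total_size": 96, "alloc_cpu_id": 5}
data = pack_record([0, 1], [block])
schema, decoded = unpack_record(data)
```

In this layout the schema costs 4 bytes plus 4 bytes per field in every record. A file-level schema would pay that once, at the price of requiring all records in the file to agree on the same fields.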
>>
>> Thanks,
>>
>> Wenlei
>>

Wenlei He via llvm-dev

2021-Oct-07 19:59 UTC

Thanks for the reply and clarification. Having a single combined IR
instrumentation and PGHO instrumentation sounds good.

I’m also wondering if you have any data you could share on the overall benefit
of memprof-driven optimization since the last RFC, perhaps from an early
prototype on small/synthetic workloads? I'm asking because, even though this all
looks promising, from runtime support to binary format to the later profile
loader and optimization, there’s non-trivial complexity being added in a few
places.

Thanks,
Wenlei


Teresa Johnson via llvm-dev

2021-Oct-10 15:34 UTC

On Thu, Oct 7, 2021 at 12:06 PM Snehasish Kumar <snehasishk at google.com> wrote:
> Hi Wenlei,
>
> Thanks for taking a look! Added responses inline.
>
> On Thu, Oct 7, 2021 at 9:29 AM Xinliang David Li <davidxl at google.com>
> wrote:
> >
> > Just a quick note -- IRPGO profile is not deterministic with
> multi-threaded programs due to contentions (there is of course atomic
> update mode, but it can be slow). Asynchronous dumping is another reason
> that the profile is not guaranteed to be repeatable.
> >
> > David
> >
> > On Thu, Oct 7, 2021 at 9:18 AM Wenlei He <wenlei at fb.com> wrote:
> >>
> >> Thanks for sharing the progress and details on the binary format.
> Overall this looks like a clean design that fits current PGO profile format
> with extensions.
> >>
> >>
> >>
> >> Some high level comments:
> >>
> >>
> >>
>
> Our focus is to have a single combined IR instrumentation and PGHO
> instrumentation phase to keep operational costs low. For CSPGO today,
> this would be the second IR instrumentation phase. We also intend to
> support a separate PGHO instrumentation phase.
> >> Does memprof/PGHO work together with today's IRPGO, i.e. can we have one
> >> instrumented build to collect both PGO and PGHO profiles, or will we need
> >> separate PGO instrumentation builds for each? In the latter case CSPGO +
> >> PGHO would need three iterations of training and build, which would be a
> >> significant operational cost.
>
> Yes, the context tracker is quite relevant to the IR matching need.
> Teresa will share the detailed design soon and we can evaluate the
> benefit of reusing the existing logic for CSSPGO. I think this is
> orthogonal to this RFC (serialization format) so we can defer to the
> next one for a detailed discussion.
> >> I think some of the problems memprof faces when storing calling context
> >> and mapping context to IR are very similar to CSSPGO's. I'm wondering if
> >> it makes sense to promote some existing infrastructure to be more general,
> >> beyond just serving CSSPGO. One example is the IR mapping you mentioned
> >> (quoted below). In CSSPGO, we have the exact same need, and it's handled
> >> by `SampleContextTracker`, which queries a context trie using an
> >> instruction/DILocation.
> >>
> >>
> >>
> >> > Because the MIB corresponding to the A->B context is associated with
> >> > function B in the profile, we do not find it by looking at function A’s
> >> > profile when we see function A’s malloc call during matching. To address
> >> > this we need to keep a correspondence from debug locations to the
> >> > associated profile information.
> >>
> >>
> >>
>
> We intend to retain as much of the calling context information until
> the IR matching. This is where we can leverage common solutions. We
> would be happy to generalize where appropriate and intend to tackle
> this topic in detail in the next RFC.
>
In fact, we need to retain the calling context beyond matching, so that we
can perform the context disambiguation transformations that Snehasish
described in an earlier email. The next RFC will focus on the IR metadata
needed to carry the PGHO data as well as the context.

From reading through the CSSPGO RFC, it sounds like the context info is
never annotated onto the IR, but rather just used during the sample PGO
loading/inlining step to help generate more accurate IR prof md counts. Is
that correct? In that case perhaps some of the infrastructure can be shared
for performing the matching for already inlined contexts, which I think is
what the ContextTrieNode structures are used for, from what I can tell
perusing the code. It is a little unclear to me how the profile for a
partially inlined context is found in the data structure, i.e. how do you
look up the ContextTrieNode for a given out-of-line function?

Thanks,
Teresa

-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |
