Snehasish Kumar via llvm-dev
2021-Oct-07 19:06 UTC
[llvm-dev] RFC: A binary serialization format for MemProf
Hi Wenlei,

Thanks for taking a look! Added responses inline.

On Thu, Oct 7, 2021 at 9:29 AM Xinliang David Li <davidxl at google.com> wrote:
>
> Just a quick note -- IRPGO profile is not deterministic with multi-threaded programs due to contention (there is of course atomic update mode, but it can be slow). Asynchronous dumping is another reason that the profile is not guaranteed to be repeatable.
>
> David
>
> On Thu, Oct 7, 2021 at 9:18 AM Wenlei He <wenlei at fb.com> wrote:
>>
>> Thanks for sharing the progress and details on the binary format. Overall this looks like a clean design that fits current PGO profile format with extensions.
>>
>> Some high level comments:

Our focus is to have a single combined IR instrumentation and PGHO instrumentation phase to keep operational costs low. For CSPGO today, this would be the second IR instrumentation phase. We also intend to support a separate PGHO instrumentation phase.

>> Does memprof/PGHO work together with today's IRPGO, i.e. can we have one instrumented build to collect both PGO and PGHO profile, or will we need separate PGO instrumentation builds for each, in which case CSPGO + PGHO would need three iterations of training and build, which would be significant operational cost.

Yes, the context tracker is quite relevant to the IR matching need. Teresa will share the detailed design soon and we can evaluate the benefit of reusing the existing logic for CSSPGO. I think this is orthogonal to this RFC (serialization format) so we can defer to the next one for a detailed discussion.

>> I think some of the problems memprof faced when dealing with storing calling context and mapping context to IR are very similar to CSSPGO. I'm wondering if it makes sense to promote some existing infrastructure to be more general beyond just serving CSSPGO. One example is the IR mapping you mentioned (quoted below). In CSSPGO, we have the exact same need, and it's handled by `SampleContextTracker` which queries a context trie using an instruction/DILocation.
>>
>> > Because the MIB corresponding to the A->B context is associated with function B in the profile, we do not find it by looking at function A’s profile when we see function A’s malloc call during matching. To address this we need to keep a correspondence from debug locations to the associated profile information.

We intend to retain as much of the calling context information until the IR matching. This is where we can leverage common solutions. We would be happy to generalize where appropriate and intend to tackle this topic in detail in the next RFC.

>> The serialization of calling context and the pruning of calling context are also examples of shared problems, and we've put in some effort to have effective solutions (e.g. offline preinliner for most effective pruning, which I think could be adapted to help keep most important allocation context). Perhaps some of the frameworks can be merged, so LLVM has general context aware PGO support that can be leveraged by different kinds of PGO (IRPGO, PGHO, CSSPGO). If you think this is worth pursuing, we’d be happy to help too.
>>
>> More on the details:

As David mentioned, keeping the PGHO profile deterministic is a non-goal since IR PGO profile is non-deterministic.

>> I saw that MemInfoBlock contains alloc/dealloc cpuid, does that make memprof profile non-deterministic in the sense that running memprof twice on the exact program and input would yield bit-wise different memory profile? I think IR PGO profile is deterministic?

We need to use the file path instead of the function to be able to distinguish COMDAT functions. The line_offset based matching is more resilient if the entire function is moved. I think it's a good idea and we can incorporate it into the IR matching phase.

>> Why do we use `file:line:discriminator` instead of `func:line_offset:discriminator`? The latter would be more resilient to source change. If function name string is too long, we could perhaps leverage the MD5 encoding used by sample PGO?

While we only intend to support Memprof optimizations for the main binary, retaining all executable mappings allows future analysis tools to symbolize shared library code.

>> Is the design of mmap section (quoted below) trying to support memprof for multiple binaries in the same process at the same time, or mainly for handling multiple non-consecutive executable segments for a single binary?
>>
>> > The process memory mappings for the executable segment during profiling are stored in this section. This allows symbolization during post processing for binaries which are built with position independent code. For now all read only, executable mappings are recorded, however in the future, mappings for heap data can also potentially be stored.

Yes, we do intend to support Memprof profile section merging via `llvm-profdata merge`. The schema overhead per function is low now, so we opted for function granularity. We can revisit if the overheads are high or if the IR metadata scheme intends to keep it at module granularity (in which case we don't need the extra fidelity).

>> Do we need each function record to have its own schema, do we expect different functions to use different versions/schemas? This is very flexible, but wondering what’s the use case. If the schema is for compatibility across versions, perhaps a file level schema would be enough?
>>
>> > The InstrProfRecord for each function will hold the schema and an array of Memprof info blocks, one for each unique allocation context.
>>
>> Thanks,
>>
>> Wenlei
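
The trade-off between `file:line:discriminator` and `func:line_offset:discriminator` keys discussed above can be illustrated with a small sketch. It is not the MemProf or sample PGO implementation: the names `FileLineKey`, `FuncOffsetKey`, and `name_hash` are hypothetical, and taking the first 8 bytes of the MD5 digest is only an assumption standing in for "the MD5 encoding used by sample PGO", so the values may not match what LLVM computes.

```python
import hashlib
from typing import NamedTuple


class FileLineKey(NamedTuple):
    """Hypothetical key in the file:line:discriminator style."""
    file: str
    line: int
    discriminator: int


class FuncOffsetKey(NamedTuple):
    """Hypothetical key in the func:line_offset:discriminator style."""
    func_hash: int
    line_offset: int
    discriminator: int


def name_hash(func_name: str) -> int:
    # Assumption: a fixed-width 64-bit id taken from the first 8 bytes of
    # the MD5 digest, used here only to avoid storing long name strings.
    digest = hashlib.md5(func_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def func_offset_key(func_name: str, func_start_line: int, line: int,
                    discriminator: int = 0) -> FuncOffsetKey:
    # The line is stored relative to the first line of the function.
    return FuncOffsetKey(name_hash(func_name), line - func_start_line,
                         discriminator)


# A function `foo` starts at line 40 and allocates at line 42. A later edit
# inserts 10 lines above `foo`, so it now starts at 50 and allocates at 52.
before_file = FileLineKey("a.cc", 42, 0)
after_file = FileLineKey("a.cc", 52, 0)
before_func = func_offset_key("foo", 40, 42)
after_func = func_offset_key("foo", 50, 52)

print(before_file == after_file)  # absolute lines shifted
print(before_func == after_func)  # relative offset unchanged
```

The offset-based key survives the function being moved as a whole, which is the resilience argument made in the thread. The thread's counterpoint also shows up in the sketch: a key built only from the function name cannot tell apart COMDAT copies that come from different files, whereas a key that carries the file path can.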
Wenlei He via llvm-dev
2021-Oct-07 19:59 UTC
[llvm-dev] RFC: A binary serialization format for MemProf
Thanks for the reply and clarification. Having a single combined IR instrumentation and PGHO instrumentation sounds good.

I’m also wondering if you have any data you could share that tells the overall benefit of memprof driven optimization since last RFC, perhaps with some early prototype and on small/synthetic workload? Asking because even though this all looks promising, from runtime support to binary format, later profile loader and optimization, there’s non-trivial complexity being added to a few places.

Thanks,
Wenlei
Teresa Johnson via llvm-dev
2021-Oct-10 15:34 UTC
[llvm-dev] RFC: A binary serialization format for MemProf
On Thu, Oct 7, 2021 at 12:06 PM Snehasish Kumar <snehasishk at google.com> wrote:
> Yes, the context tracker is quite relevant to the IR matching need. Teresa will share the detailed design soon and we can evaluate the benefit of reusing the existing logic for CSSPGO. I think this is orthogonal to this RFC (serialization format) so we can defer to the next one for a detailed discussion.
>
> > On Thu, Oct 7, 2021 at 9:18 AM Wenlei He <wenlei at fb.com> wrote:
> >> I think some of the problems memprof faced when dealing with storing calling context and mapping context to IR is very similar to CSSPGO. I'm wondering if it makes sense to promote some existing infrastructure to be more general beyond just serving CSSPGO. One example is the IR mapping you mentioned (quoted below). In CSSPGO, we have the exact same need, and it's handled by `SampleContextTracker` which queries a context trie using an instruction/DILocation.
> >>
> >> > Because the MIB corresponding to the A->B context is associated with function B in the profile, we do not find it by looking at function A’s profile when we see function A’s malloc call during matching. To address this we need to keep a correspondence from debug locations to the associated profile information.
>
> We intend to retain as much of the calling context information until the IR matching. This is where we can leverage common solutions. We would be happy to generalize where appropriate and intend to tackle this topic in detail in the next RFC.

In fact, we need to retain the calling context beyond matching, so that we can perform the context disambiguation transformations that Snehasish described in an earlier email. The next RFC will focus on the IR metadata needed to carry the PGHO data as well as the context.

From reading through the CSSPGO RFC, it sounds like the context info is never annotated onto the IR, but rather just used during the sample PGO loading/inlining step to help generate more accurate IR prof md counts - is that correct? In that case perhaps some of the infrastructure can be shared for performing the matching for already inlined contexts, which I think is what the ContextTrieNode structures are used for, from what I can tell perusing the code. It is a little unclear to me how the profile for a partially inlined context is found in the data structure, i.e. how do you look up the ContextTrieNode for a given out of line function?

Thanks,
Teresa

--
Teresa Johnson | Software Engineer | tejohnson at google.com |
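
The "correspondence from debug locations to the associated profile information" mentioned in the quoted RFC text, and the question of matching already inlined contexts, can be sketched as follows. This is only an illustration under stated assumptions: `LocationIndex`, the `(function, line)` frame shape, and the string MIB labels are hypothetical, and the code does not reflect how `SampleContextTracker`, `ContextTrieNode`, or MemProf actually store or look up contexts.

```python
from collections import defaultdict


class LocationIndex:
    """Hypothetical index from a leaf debug location to profiled contexts."""

    def __init__(self):
        # leaf frame -> list of (full context, MIB) pairs
        self._by_leaf = defaultdict(list)

    def add(self, context, mib):
        # `context` lists (function, line) frames from the outermost caller
        # down to the frame that contains the allocation call.
        context = tuple(context)
        self._by_leaf[context[-1]].append((context, mib))

    def match(self, inline_stack):
        # `inline_stack` is the non-empty list of frames visible in the IR
        # for one call instruction, outermost first (the inlined-at chain
        # of its debug location). A profiled context matches when the
        # inline stack is a suffix of it.
        inline_stack = tuple(inline_stack)
        n = len(inline_stack)
        candidates = self._by_leaf.get(inline_stack[-1], [])
        return [mib for ctx, mib in candidates if ctx[-n:] == inline_stack]


index = LocationIndex()
# B allocates at line 12; it is reached from A (call at line 7) and from
# C (call at line 9).
index.add([("main", 3), ("A", 7), ("B", 12)], "MIB for A->B")
index.add([("main", 5), ("C", 9), ("B", 12)], "MIB for C->B")

# After B is inlined into A, the malloc call sits in A's body but its debug
# location still says "B:12 inlined at A:7".
inlined_in_a = index.match([("A", 7), ("B", 12)])

# The out-of-line copy of B only knows its own frame, so both contexts
# remain candidates.
out_of_line_b = index.match([("B", 12)])
```

Indexing by the leaf location means the lookup does not depend on which function body the call instruction currently lives in, which is the problem the quoted text describes for the A->B case. For the out-of-line copy, the contexts stay ambiguous, which is in line with the point above that the calling context has to be retained beyond matching.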