Xinliang David Li
2015-Feb-25 20:40 UTC
[LLVMdev] RFC - Improvements to PGO profile support
On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames <listmail at philipreames.com> wrote:
> On 02/24/2015 03:31 PM, Diego Novillo wrote:
>>
>> We (Google) have started to look more closely at the profiling
>> infrastructure in LLVM. Internally, we have a large dependency on PGO to
>> get peak performance in generated code.
>>
>> Some of the dependencies we have on profiling are still not present in
>> LLVM (e.g., the inliner) but we will still need to incorporate changes to
>> support our work on these optimizations. Some of the changes may be
>> addressed as individual bug fixes on the existing profiling
>> infrastructure. Other changes may be better implemented as either new
>> extensions or as replacements of existing code.
>>
>> I think we will try to minimize infrastructure replacement at least in
>> the short/medium term. After all, it doesn't make too much sense to
>> replace infrastructure that is broken for code that doesn't exist yet.
>>
>> David Li and I are preparing a document where we describe the major
>> issues that we'd like to address. The document is a bit on the lengthy
>> side, so it may be easier to start with an email discussion.
>
> I would personally be interested in seeing a copy of that document, but it
> might be more appropriate for a blog post than a discussion on llvm-dev. I
> worry that we'd end up with a very unfocused discussion. It might be
> better to frame this as your plan of attack and reserve discussion on
> llvm-dev for things that are being proposed semi near term. Just my 2
> cents.
>
>> This is a summary of the main changes we are looking at:
>>
>> Need to faithfully represent the execution count taken from dynamic
>> profiles. Currently, MD_prof does not really represent an execution
>> count. This makes things like comparing hotness across functions hard or
>> impossible. We need a concept of global hotness.
>
> What does MD_prof actually represent when used from Clang? I know I've
> been using it for execution counters in my frontend. Am I approaching
> that wrong?
>
> As a side comment: I'm a bit leery of the notion of a consistent notion
> of hotness based on counters across functions. These counters are almost
> always approximate in practice and counting problems run rampant.

Having representative training runs is a prerequisite for using FDO/PGO.

> I'd almost rather see a consistent count inferred from data that's
> assumed to be questionable than make the frontend try to generate
> consistent profiling metadata.

The frontend does not generate profile data -- it is just a messenger that
should pass the data faithfully to the middle end. That messenger (the
profile reader) can be in the middle end too.

> I think either approach could be made to work, we just need to think
> about it carefully.
>
>> When the CFG or callgraph change, there needs to exist an API for
>> incrementally updating/scaling counts. For instance, when a function is
>> inlined or partially inlined, when the CFG is modified, etc. These counts
>> need to be updated incrementally (or perhaps re-computed as a first step
>> in that direction).
>
> Agreed. Do you have a sense of how much of an issue this is in practice?
> I haven't seen it kick in much, but it's also not something I've been
> looking for.
>
>> The inliner (and other optimizations) needs to use profile information
>> and update it accordingly. This is predicated on Chandler's work on the
>> pass manager, of course.
>
> It's worth noting that the inliner work can be done independently of the
> pass manager work. We can always explicitly recompute relevant analyses in
> the inliner if needed. This will cost compile time, so we might need to
> make this an off-by-default option. (Maybe -O3 only?) Being able to work
> on the inliner independently of the pass management structure is valuable
> enough that we should probably consider doing this.
>
> PGO inlining is an area I'm very interested in. I'd really encourage you
> to work incrementally in tree. I'm likely to start putting non-trivial
> amounts of time into this topic in the next few weeks. I just need to
> clear a few things off my plate first.
>
> Other than the inliner, can you list the passes you think are profitable
> to teach about profiling data? My list so far is: PRE (particularly of
> loads!), the vectorizer (i.e. duplicate work down both a hot and cold
> path when it can be vectorized on the hot path), LoopUnswitch, IRCE, &
> LoopUnroll (avoiding code size explosion in cold code). I'm much more
> interested in sources of improved performance than I am simply code size
> reduction. (Reducing code size can improve performance of course.)

PGO is very effective in code size reduction. In reality, a large
percentage of functions are globally cold.

David

>> Need to represent global profile summary data. For example, for global
>> hotness determination, it is useful to compute additional global summary
>> info, such as a histogram of counts that can be used to determine hotness
>> and working set size estimates for a large percentage of the profiled
>> execution.
>
> Er, not clear what you're trying to say here?
>
>> There are other changes that we will need to incorporate. David, Teresa,
>> Chandler, please add anything large that I missed.
>>
>> My main question at the moment is what would be the best way of
>> addressing them. Some seem to require new concepts to be implemented
>> (e.g., execution counts). Others could be addressed as simple bugs to be
>> fixed in the current framework.
>>
>> Would it make sense to present everything in a unified document and
>> discuss that? I've got some reservations about that approach because we
>> will end up discussing everything at once and it may not lead to concrete
>> progress. Another approach would be to present each issue individually
>> either as patches or RFCs or bugs.
>
> See above.
>
>> I will be taking on the implementation of several of these issues. Some
>> of them involve the SamplePGO harness that I added last year. I would
>> also like to know what other bugs or problems people have in mind that I
>> could also roll into this work.
>>
>> Thanks. Diego.

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu          http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Philip Reames
2015-Feb-25 22:14 UTC
[LLVMdev] RFC - Improvements to PGO profile support
On 02/25/2015 12:40 PM, Xinliang David Li wrote:
> On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames
> <listmail at philipreames.com> wrote:
>> On 02/24/2015 03:31 PM, Diego Novillo wrote:
>>
>> Need to faithfully represent the execution count taken from dynamic
>> profiles. Currently, MD_prof does not really represent an execution
>> count. This makes things like comparing hotness across functions hard or
>> impossible. We need a concept of global hotness.
>>
>> What does MD_prof actually represent when used from Clang? I know I've
>> been using it for execution counters in my frontend. Am I approaching
>> that wrong?
>>
>> As a side comment: I'm a bit leery of the notion of a consistent notion
>> of hotness based on counters across functions. These counters are almost
>> always approximate in practice and counting problems run rampant.
>
> Having representative training runs is a prerequisite for using FDO/PGO.

Representativeness is not the issue I'm raising. Profiling systems
(particularly instrumentation-based ones) have systemic biases. Not
accounting for that can lead to some very odd results. As an example:

void foo() {
  if (member)
    for (int i = 0; i < 100000; i++)
      if (member2)
        bar();
}

With multiple threads in play, it's entirely possible that the sum of the
absolute weights on the second branch is lower than the sum of the absolute
counts on the first branch (i.e. due to racy updating). While you can avoid
this by using race-free updates, I know of very few systems that actually
do.

If your optimization is radically unstable in such scenarios, that's a
serious problem. Pessimization is bad enough (if tolerable); incorrect
transforms are not. It's very easy to write a transform that implicitly
assumes the counts for the first branch must be less than the counts for
the second.

>> I'd almost rather see a consistent count inferred from data that's
>> assumed to be questionable than make the frontend try to generate
>> consistent profiling metadata.
>
> Frontend does not generate profile data -- it is just a messenger that
> should pass the data faithfully to the middle end. That messenger
> (profile reader) can be in middle end too.

Er, we may be arguing terminology here. I was including the profiling
system as part of the "frontend" - I'm working with a JIT - whereas you're
assuming a separate collection system. It doesn't actually matter which
terms we use. My point was that assuming clean profiling data is just not
reasonable in practice. At minimum, some type of normalization step is
required.

>> Other than the inliner, can you list the passes you think are profitable
>> to teach about profiling data? My list so far is: PRE (particularly of
>> loads!), the vectorizer (i.e. duplicate work down both a hot and cold
>> path when it can be vectorized on the hot path), LoopUnswitch, IRCE, &
>> LoopUnroll (avoiding code size explosion in cold code). I'm much more
>> interested in sources of improved performance than I am simply code size
>> reduction. (Reducing code size can improve performance of course.)
>
> PGO is very effective in code size reduction. In reality, a large
> percentage of functions are globally cold.

For a traditional C++ application, yes. For a JIT which is only compiling
warm code paths in hot methods, not so much. It's still helpful, but the
impact is much smaller.

Philip
Xinliang David Li
2015-Feb-25 22:29 UTC
[LLVMdev] RFC - Improvements to PGO profile support
On Wed, Feb 25, 2015 at 2:14 PM, Philip Reames <listmail at philipreames.com> wrote:
> On 02/25/2015 12:40 PM, Xinliang David Li wrote:
>> On Wed, Feb 25, 2015 at 10:52 AM, Philip Reames
>> <listmail at philipreames.com> wrote:
>>> On 02/24/2015 03:31 PM, Diego Novillo wrote:
>>>
>>> Need to faithfully represent the execution count taken from dynamic
>>> profiles. Currently, MD_prof does not really represent an execution
>>> count. This makes things like comparing hotness across functions hard
>>> or impossible. We need a concept of global hotness.
>>>
>>> What does MD_prof actually represent when used from Clang? I know I've
>>> been using it for execution counters in my frontend. Am I approaching
>>> that wrong?
>>>
>>> As a side comment: I'm a bit leery of the notion of a consistent notion
>>> of hotness based on counters across functions. These counters are
>>> almost always approximate in practice and counting problems run
>>> rampant.
>>
>> Having representative training runs is a prerequisite for using FDO/PGO.
>
> Representativeness is not the issue I'm raising. Profiling systems
> (particularly instrumentation-based ones) have systemic biases. Not
> accounting for that can lead to some very odd results. As an example:
>
> void foo() {
>   if (member)
>     for (int i = 0; i < 100000; i++)
>       if (member2)
>         bar();
> }
>
> With multiple threads in play, it's entirely possible that the sum of the
> absolute weights on the second branch is lower than the sum of the
> absolute counts on the first branch (i.e. due to racy updating). While
> you can avoid this by using race-free updates, I know of very few systems
> that actually do.

Are you speculating, or do you have data to show it? We have large programs
run with hundreds of threads; race conditions contribute only very small
count variations -- and there are ways to smooth out the differences.

> If your optimization is radically unstable in such scenarios, that's a
> serious problem. Pessimization is bad enough (if tolerable); incorrect
> transforms are not.

This has never been our experience with using PGO in the past. We also have
tools to compare profile consistency from one training run to another. If
you experience such problems in real apps, can you file a bug?

> It's very easy to write a transform that implicitly assumes the counts
> for the first branch must be less than the counts for the second.

The compiler can detect an insane profile -- it can either ignore it,
correct it, or use it with warnings, depending on options.

>>> I'd almost rather see a consistent count inferred from data that's
>>> assumed to be questionable than make the frontend try to generate
>>> consistent profiling metadata.
>>
>> Frontend does not generate profile data -- it is just a messenger that
>> should pass the data faithfully to the middle end. That messenger
>> (profile reader) can be in middle end too.
>
> Er, we may be arguing terminology here. I was including the profiling
> system as part of the "frontend" - I'm working with a JIT - whereas
> you're assuming a separate collection system. It doesn't actually matter
> which terms we use. My point was that assuming clean profiling data is
> just not reasonable in practice. At minimum, some type of normalization
> step is required.

If you are talking about making a slightly inconsistent profile
flow-consistent, yes, there are mechanisms to do that.

David

>>> Other than the inliner, can you list the passes you think are
>>> profitable to teach about profiling data? My list so far is: PRE
>>> (particularly of loads!), the vectorizer (i.e. duplicate work down both
>>> a hot and cold path when it can be vectorized on the hot path),
>>> LoopUnswitch, IRCE, & LoopUnroll (avoiding code size explosion in cold
>>> code). I'm much more interested in sources of improved performance than
>>> I am simply code size reduction. (Reducing code size can improve
>>> performance of course.)
>>
>> PGO is very effective in code size reduction. In reality, a large
>> percentage of functions are globally cold.
>
> For a traditional C++ application, yes. For a JIT which is only compiling
> warm code paths in hot methods, not so much. It's still helpful, but the
> impact is much smaller.
>
> Philip