thr3ads.net - llvm dev - [LLVMdev] RFC - Improvements to PGO profile support [Mar 2015]


Bob Wilson

2015-Mar-05 16:29 UTC

[LLVMdev] RFC - Improvements to PGO profile support

> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>
> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com> wrote:
>
> I've created a few bugzilla issues with details of some of the things
> I'll be looking into. I'm not yet done wordsmithing the overall design
> document. I'll try to finish it by early next week at the latest.
>
> The document is available at
>
> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>
> There are several topics covered. Ideally, I would prefer that we discuss
> each topic separately. The main ones I will start working on are the ones
> described in the bugzilla links we have in the doc.
>
> This is just a starting point for us. I am not at all concerned with
> implementing exactly what is proposed in the document. In fact, if we can get
> the same value using the existing support, all the better.
>
> OTOH, any other ideas that folks may have that work better than this are
> more than welcome. I don't have really strong opinions on the matter. I am
> fine with whatever works.

Thanks for the detailed write-up on this. Some of the issues definitely need to
be addressed. I am concerned, though, that some of the ideas may be leading
toward a scenario where we have essentially two completely different ways of
representing profile information in LLVM IR. It is great to have two
complementary approaches to collecting profile data, but two representations in
the IR would not make sense.

The first issue raised is that profile execution counts are not represented in
the IR. This was a very intentional decision. I know it goes against what other
compilers have done in the past. It took me a while to get used to the idea when
Andy first suggested it, so I know it seems awkward at first. The advantage is
that branch probabilities are much easier to keep updated in the face of
compiler transformations, compared to execution counts. We are definitely
missing the per-function execution counts that are needed to be able to compare
relative “hotness” across functions, and I think that would be a good place to
start making improvements. In the long term, we should keep our options open to
making major changes, but before we go there, we should try to make incremental
improvements to fix the existing infrastructure.

Many of the other issues you raise seem like they could also be addressed
without major changes to the existing infrastructure. Let’s try to fix those
first.

Philip Reames

2015-Mar-07 01:49 UTC

[LLVMdev] RFC - Improvements to PGO profile support

On 03/05/2015 08:29 AM, Bob Wilson wrote:
>
>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>>     I've created a few bugzilla issues with details of some of the
>>     things I'll be looking into. I'm not yet done wordsmithing the
>>     overall design document. I'll try to finish it by early next week
>>     at the latest.
>>
>> The document is available at
>>
>> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>
>> There are several topics covered. Ideally, I would prefer that we 
>> discuss each topic separately. The main ones I will start working on 
>> are the ones described in the bugzilla links we have in the doc.
>>
>> This is just a starting point for us. I am not at all concerned with 
>> implementing exactly what is proposed in the document. In fact, if we 
>> can get the same value using the existing support, all the better.
>>
>> OTOH, any other ideas that folks may have that work better than this 
>> are more than welcome. I don't have really strong opinions on the 
>> matter. I am fine with whatever works.
>
> Thanks for the detailed write-up on this. Some of the issues 
> definitely need to be addressed. I am concerned, though, that some of 
> the ideas may be leading toward a scenario where we have essentially 
> two completely different ways of representing profile information in 
> LLVM IR. It is great to have two complementary approaches to 
> collecting profile data, but two representations in the IR would not 
> make sense.
>
> The first issue raised is that profile execution counts are not 
> represented in the IR. This was a very intentional decision. I know it 
> goes against what other compilers have done in the past. It took me a 
> while to get used to the idea when Andy first suggested it, so I know 
> it seems awkward at first. The advantage is that branch probabilities 
> are much easier to keep updated in the face of compiler 
> transformations, compared to execution counts. We are definitely 
> missing the per-function execution counts that are needed to be able 
> to compare relative “hotness” across functions, and I think that would 
> be a good place to start making improvements. In the long term, we 
> should keep our options open to making major changes, but before we go 
> there, we should try to make incremental improvements to fix the 
> existing infrastructure.
>
> Many of the other issues you raise seem like they could also be 
> addressed without major changes to the existing infrastructure. Let’s 
> try to fix those first.

After reading the document, I agree with Bob's perspective here.

I would strongly recommend that you start with the optimizations that 
can be implemented within the current framework.  The current 
infrastructure gives a fairly reasonable idea of relative hotness within 
a function.  There's a lot to be done to exploit that information (even 
in the inliner!) without resorting to cross function analysis.  If, 
after most of those have been implemented, we need more fundamental 
changes we could consider them.  Starting with a fundamental rewrite of 
the profiling system within LLVM seems like a mistake.

At a meta level, as someone who uses LLVM for JITing I would be opposed 
to a system that assumed consistent profiling counts across function 
boundaries and gave up on relative hotness information.  At least if I'm 
understanding your proposal, this would *completely break* a 
multi-tiered JIT.  In practice, you generally stop collecting 
instrumentation profiling once something is compiled at a high enough 
tier.  When compiling its caller, you'll get very deceptive results if 
you rely on the execution counts to line up across functions.  On the 
other hand, merging two relative hotness profiles by scaling based on 
the hotness of the callsite works out quite well in practice.  You can 
use some information about global hotness to make decisions, but those 
decisions need to be resilient to such systematic under-counting.
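
As an illustration of the merge described above, here is a minimal sketch in C. It is not LLVM code: `merge_relative_profile` is a hypothetical helper, and it assumes each function's block frequencies are stored relative to that function's own entry (entry == 1.0). The callee's frequencies are simply multiplied by the frequency of the call block in the caller, so no absolute execution counts have to line up across the two functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption for illustration, not an LLVM API):
 * express a callee's block frequencies in the caller's units.
 *   callee_freq[i]  frequency of callee block i, relative to callee entry
 *   callsite_freq   frequency of the call block, relative to caller entry
 *   out[i]          frequency of callee block i, relative to caller entry */
static void merge_relative_profile(const double *callee_freq, size_t n,
                                   double callsite_freq, double *out)
{
    assert(callsite_freq >= 0.0);
    for (size_t i = 0; i < n; i++)
        out[i] = callee_freq[i] * callsite_freq;
}
```

Because only ratios are used, the result is unaffected if one of the two functions stopped being instrumented earlier than the other.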

Philip

Xinliang David Li

2015-Mar-07 03:00 UTC

[LLVMdev] RFC - Improvements to PGO profile support

Bob, Philip, thanks for the feedback.

Diego is planning to give a more detailed reply next Monday. There seems
to be some misunderstanding about the proposals, so I will just give
some highlights here:

1) The proposal is not intended to fundamentally change the current
framework, but to enhance the framework so that
  a) more profile information is preserved
  b) block/edge count/frequency becomes faster to compute
  c) profile information becomes faster to access and update
(inter-procedurally)

2) Changes to profile APIs and profile client code will be minimized,
except that we will add IPA clients (once Chandler's pass manager
change is ready)

3) The proposed change does *not* give up relative hotness as
mentioned by Philip. All clients that rely on relative hotness are
not affected -- except that the data is better and more reliable

4) With real profile data available, the current infrastructure does *not*
provide reasonable hotness (e.g., you can try comparing the BBs that
execute the same number of times, but in loops with different depths in
the same function and see how big the diff is), let alone fast
updating.

I am reasonably confident that the proposal
1) does not affect compilations using static profile (with branch prediction)
2) is strictly better for -fprofile-instr-use optimizations.

The area I am not so sure about is the JIT, but I am really interested in
knowing the details and proposing solutions for you if the current
proposal does not work for you (which I doubt -- because if the
current framework works, the new one should work too :) ).

I am looking forward to more detailed discussions next week! We shall
sit down together and discuss changes, rationale, concerns one by one
-- with concrete examples.

thanks,

David


Diego Novillo

2015-Mar-10 17:14 UTC

[LLVMdev] RFC - Improvements to PGO profile support

On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>
> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com>
> wrote:
>
>> I've created a few bugzilla issues with details of some of the things I'll
>> be looking into. I'm not yet done wordsmithing the overall design document.
>> I'll try to finish it by early next week at the latest.
>
> The document is available at
>
> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>
> There are several topics covered. Ideally, I would prefer that we discuss
> each topic separately. The main ones I will start working on are the ones
> described in the bugzilla links we have in the doc.
>
> This is just a starting point for us. I am not at all concerned with
> implementing exactly what is proposed in the document. In fact, if we can
> get the same value using the existing support, all the better.
>
> OTOH, any other ideas that folks may have that work better than this are
> more than welcome. I don't have really strong opinions on the matter. I am
> fine with whatever works.
>
>
> Thanks for the detailed write-up on this. Some of the issues definitely
> need to be addressed. I am concerned, though, that some of the ideas may be
> leading toward a scenario where we have essentially two completely
> different ways of representing profile information in LLVM IR. It is great
> to have two complementary approaches to collecting profile data, but two
> representations in the IR would not make sense.
>
Yeah, I don't think I'll continue to push for a new MD_count attribute. If
we were to make MD_prof be a "real" execution count, that would be enough.
Note that by re-using MD_prof we are not changing its meaning at all. The
execution count is still a weight and the ratio is still branch
probability. All that we are changing are the absolute values of the number
and increasing its data type width to remove the 32bit limitation.


> The first issue raised is that profile execution counts are not
> represented in the IR. This was a very intentional decision. I know it goes
> against what other compilers have done in the past. It took me a while to
> get used to the idea when Andy first suggested it, so I know it seems
> awkward at first. The advantage is that branch probabilities are much
> easier to keep updated in the face of compiler transformations, compared to
> execution counts.
>
Sorry. I don't follow. Updating counts as the CFG is transformed is not
difficult at all. What examples do you have in mind?  The big advantage of
making MD_prof an actual execution count is that it is a meaningful metric
wrt scaling and transformation.

Say, for instance, that we have a branch instruction with two targets with
counts {100, 300} inside a function 'foo' that has entry count 2. The edge
probability for the first edge (count 100) is 100/(100+300) = 25%.

If we inline foo() inside another function bar() at a callsite with profile
count == 1, the cloned branch instruction gets its counters scaled with the
callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} =
{50, 150}.  But the branch probability did not change. Currently, we are
cloning the branch without changing the edge weights.

This scaling is not difficult at all and can be done incrementally very quickly.
We cannot afford to recompute all frequencies on the fly because it would
be detrimental to compile time. If foo() itself has N callees inlined into
it, each inlined callee needs to trigger a re-computation. When foo() is
inlined into bar(), the frequencies will need to be recomputed for foo()
and all N callees inlined into foo().
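
The scaling arithmetic in the foo()/bar() example above can be sketched in C as follows. This is only an illustration under stated assumptions, not LLVM code: `scale_count` and `first_edge_percent` are hypothetical helper names, and counts are assumed to be plain 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an LLVM API): scale one profile count that is
 * cloned from a callee into a caller during inlining.
 *   scaled = count * callsite_count / callee_entry_count
 * Note: the multiplication can overflow for very large counts; a real
 * implementation would need wider or saturating arithmetic. */
static uint64_t scale_count(uint64_t count, uint64_t callsite_count,
                            uint64_t callee_entry_count)
{
    assert(callee_entry_count > 0);
    return count * callsite_count / callee_entry_count;
}

/* Probability, in percent, of taking the first edge of a two-way branch
 * whose edges carry the given counts. */
static double first_edge_percent(uint64_t first, uint64_t second)
{
    assert(first + second > 0);
    return 100.0 * (double)first / (double)(first + second);
}
```

With the numbers from the example, `scale_count(100, 1, 2)` and `scale_count(300, 1, 2)` give 50 and 150, and `first_edge_percent` returns 25.0 both before and after scaling, since both counts are multiplied by the same factor.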


> We are definitely missing the per-function execution counts that are
> needed to be able to compare relative “hotness” across functions, and I
> think that would be a good place to start making improvements. In the long
> term, we should keep our options open to making major changes, but before
> we go there, we should try to make incremental improvements to fix the
> existing infrastructure.
>
Right, and that's the core of our proposal. We don't really want to make
major infrastructure changes at this point. The only thing I'd like to
explore is making MD_prof a real count. This will be useful for the inliner
changes and it would also make incremental updates easier, because the
scaling that needs to be done is very straightforward and quick.

Note that this change should not modify the current behaviour we get from
profile analysis. Things that were hot before should continue to be hot now.

> Many of the other issues you raise seem like they could also be addressed
> without major changes to the existing infrastructure. Let’s try to fix
> those first.
>
That's exactly the point of the proposal.  We definitely don't want to make
major changes to the infrastructure at first. My thinking is to start
working on making MD_prof a real count. One of the things that are
happening is that the combination of real profile data plus the frequency
propagation that we are currently doing is misleading the analysis.

For example (thanks David for the code and data). In the following code:

int g;
__attribute__((noinline)) void bar() {
  g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  g = 0;

  // Loop 1.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
      for (k = 0; k < 100; k++)
        bar();

  printf ("g = %d\n", g);
  g = 0;

  // Loop 2.
  for (i = 0; i < 100; i++)
    for (j = 0; j < 10000; j++)
      bar();

  printf ("g = %d\n", g);
  g = 0;

  // Loop 3.
  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;
}

When compiled with profile instrumentation, frequency propagation is
distorting the real profile because it gives a different frequency to the
calls to bar() in the 3 different loops. Each of the 3 loops executes bar()
1,000,000 times, but after frequency propagation, the call to bar() gets a
weight of 520,202 in loop #1, 210,944 in loop #2 and 4,096 in loop #3. In
reality, every call to bar() should have a weight of 1,000,000.


Thanks.  Diego.

Duncan P. N. Exon Smith

2015-Mar-12 21:42 UTC

[LLVMdev] RFC - Improvements to PGO profile support

> On 2015-Mar-10, at 10:14, Diego Novillo <dnovillo at google.com> wrote:
> When compiled with profile instrumentation, frequency propagation is
> distorting the real profile because it gives different frequency to the calls to
> bar() in the 3 different loops. All 3 loops execute 1,000,000 times, but after
> frequency propagation, the first call to bar() gets a weight of 520,202 in loop
> #1, 210,944 in  loop #2 and 4,096 in loop #3. In reality, every call to bar()
> should have a weight of 1,000,000.

(Sorry for the delay responding; I've been on holiday.)

There are two things going on here.

Firstly, the loop scales are being capped at 4096.  I propagated this
approximation from the previous version of BFI.  If it's causing a
problem (which it looks like it is), we should drop it and fix what
breaks.  You can play around with this by commenting out the `if`
statement at the end of `computeLoopScale()` in
BlockFrequencyInfoImpl.cpp.

For example, without that logic this testcase gives:

    Printing analysis 'Block Frequency Analysis' for function 'main':
    block-frequency-info: main
     - entry: float = 1.0, int = 8
     - for.cond: float = 51.5, int = 411
     - for.body: float = 50.5, int = 403
     - for.cond1: float = 5051.0, int = 40407
     - for.body3: float = 5000.5, int = 40003
     - for.cond4: float = 505001.0, int = 4040007
     - for.body6: float = 500000.5, int = 4000003
     - for.inc: float = 500000.5, int = 4000003
     - for.end: float = 5000.5, int = 40003
     - for.inc7: float = 5000.5, int = 40003
     - for.end9: float = 50.5, int = 403
     - for.inc10: float = 50.5, int = 403
     - for.end12: float = 1.0, int = 8
     - for.cond13: float = 51.5, int = 411
     - for.body15: float = 50.5, int = 403
     - for.cond16: float = 500051.0, int = 4000407
     - for.body18: float = 500000.5, int = 4000003
     - for.inc19: float = 500000.5, int = 4000003
     - for.end21: float = 50.5, int = 403
     - for.inc22: float = 50.5, int = 403
     - for.end24: float = 1.0, int = 8
     - for.cond26: float = 500001.5, int = 4000011
     - for.body28: float = 500000.5, int = 4000003
     - for.inc29: float = 500000.5, int = 4000003
     - for.end31: float = 1.0, int = 8

(Now we get 500000.5 for all the inner loop bodies.)
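
To make the effect of the cap concrete, here is a simplified model in C. It is a sketch under assumptions and not the code in BlockFrequencyInfoImpl.cpp: `loop_scale` is a hypothetical function that treats a loop's scale as the inverse of its exit probability, computed from a back-edge weight and an exit weight, with an optional upper bound.

```c
#include <assert.h>

/* Simplified model (an assumption for illustration, not the LLVM source):
 * the scale of a loop is the expected number of header executions per loop
 * entry, i.e. 1 / P(exit) = (backedge_weight + exit_weight) / exit_weight.
 * A max_scale <= 0 means "no cap". */
static double loop_scale(double backedge_weight, double exit_weight,
                         double max_scale)
{
    assert(exit_weight > 0.0);
    double scale = (backedge_weight + exit_weight) / exit_weight;
    if (max_scale > 0.0 && scale > max_scale)
        scale = max_scale;   /* the approximation under discussion */
    return scale;
}
```

Assuming weights of 1000001 and 2 for "loop 3", `loop_scale(1000001.0, 2.0, 0.0)` gives 500001.5, the value listed for `for.cond26` above, while `loop_scale(1000001.0, 2.0, 4096.0)` gives 4096, which is consistent with the 4,096 reported for loop 3 earlier in the thread.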

Secondly, instrumentation-based profiling intentionally fuzzes the
profile data in the frontend using Laplace's Rule of Succession (look at
`scaleBranchWeight()` in CodeGenPGO.cpp).

For example, "loop 1" (which isn't affected by the 4096 cap) should give
a loop scale of 500000.5, not 1000000.  (The profile data says
1000000/10000 for the inner loop, 10000/100 for the middle, and 100/1
for the outer loop.  Laplace says that we should fuzz these branch
weights to 1000001/10001, 10001/101, and 101/2, which works out to
1000001/2 == 500000.5 total.)
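
The arithmetic in that parenthetical can be checked with a small C sketch. The helper names below (`laplace_weight`, `level_scale`, `loop1_total_scale`) are hypothetical and only mirror the numbers quoted above; the sketch assumes the adjustment amounts to adding one to each raw count before it is used as a weight.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the adjustment described above, modeled as adding
 * one to a raw profile count (an assumption based on the quoted numbers). */
static uint64_t laplace_weight(uint64_t raw_count)
{
    return raw_count + 1;
}

/* Scale contributed by one loop level: adjusted body count divided by
 * adjusted entry count. */
static double level_scale(uint64_t body_count, uint64_t entry_count)
{
    uint64_t denom = laplace_weight(entry_count);
    assert(denom > 0);   /* guards against wraparound at UINT64_MAX */
    return (double)laplace_weight(body_count) / (double)denom;
}

/* Total scale for "loop 1": inner * middle * outer level. */
static double loop1_total_scale(void)
{
    return level_scale(1000000, 10000)   /* 1000001/10001 */
         * level_scale(10000, 100)       /* 10001/101 */
         * level_scale(100, 1);          /* 101/2 */
}
```

The intermediate factors cancel, so the product is 1000001/2, i.e. about 500000.5 (up to floating-point rounding) rather than the raw 1000000.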

Philip Reames

2015-Mar-24 16:59 UTC

[LLVMdev] RFC - Improvements to PGO profile support

On 03/10/2015 10:14 AM, Diego Novillo wrote:
>
> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
>
>>     On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>>     On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo
>>     <dnovillo at google.com> wrote:
>>
>>         I've created a few bugzilla issues with details of some of
>>         the things I'll be looking into. I'm not yet done
>>         wordsmithing the overall design document. I'll try to finish
>>         it by early next week at the latest.
>>
>>
>>     The document is available at
>>
>>     https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>
>>     There are several topics covered. Ideally, I would prefer that we
>>     discuss each topic separately. The main ones I will start working
>>     on are the ones described in the bugzilla links we have in the doc.
>>
>>     This is just a starting point for us. I am not at all concerned
>>     with implementing exactly what is proposed in the document. In
>>     fact, if we can get the same value using the existing support,
>>     all the better.
>>
>>     OTOH, any other ideas that folks may have that work better than
>>     this are more than welcome. I don't have really strong opinions
>>     on the matter. I am fine with whatever works.
>
>     Thanks for the detailed write-up on this. Some of the issues
>     definitely need to be addressed. I am concerned, though, that some
>     of the ideas may be leading toward a scenario where we have
>     essentially two completely different ways of representing profile
>     information in LLVM IR. It is great to have two complementary
>     approaches to collecting profile data, but two representations in
>     the IR would not make sense.
>
>
> Yeah, I don't think I'll continue to push for a new MD_count 
> attribute. If we were to make MD_prof be a "real" execution count,
> that would be enough. Note that by re-using MD_prof we are not 
> changing its meaning at all. The execution count is still a weight and 
> the ratio is still branch probability. All that we are changing are 
> the absolute values of the number and increasing its data type width 
> to remove the 32bit limitation.

Independent of everything else, relaxing the 32 bit restriction is 
clearly a good idea.  This would make a great standalone patch.

>
>
>     The first issue raised is that profile execution counts are not
>     represented in the IR. This was a very intentional decision. I
>     know it goes against what other compilers have done in the past.
>     It took me a while to get used to the idea when Andy first
>     suggested it, so I know it seems awkward at first. The advantage
>     is that branch probabilities are much easier to keep updated in
>     the face of compiler transformations, compared to execution counts.
>
>
> Sorry. I don't follow. Updating counts as the CFG is transformed is 
> not difficult at all. What examples do you have in mind?  The big 
> advantage of making MD_prof an actual execution count is that it is a 
> meaningful metric wrt scaling and transformation.
>
> Say, for instance, that we have a branch instruction with two targets 
> with counts {100, 300} inside a function 'foo' that has entry count 2.
> The edge probability for the first edge (count 100) is 100/(100+300) = 
> 25%.
>
> If we inline foo() inside another function bar() at a callsite with 
> profile count == 1, the cloned branch instruction gets its counters 
> scaled with the callsite count. So the new branch has counts {100 * 1 
> / 2, 300 * 1 / 2} = {50, 150}.  But the branch probability did not 
> change. Currently, we are cloning the branch without changing the edge 
> weights.
>
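The arithmetic in the inlining example above can be sketched in C. This is only an illustration of the scaling being described, not LLVM code: the type branch_counts and the functions scale_count, scale_branch and first_edge_percent are hypothetical names, and the sketch assumes that count * callsite_count fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of edge counts for a two-way branch. */
typedef struct {
    uint64_t first;
    uint64_t second;
} branch_counts;

/* Scale one count by callsite_count / entry_count.
   Assumption: count * callsite_count does not overflow 64 bits. */
uint64_t scale_count(uint64_t count, uint64_t callsite_count,
                     uint64_t entry_count) {
    assert(entry_count != 0);
    return count * callsite_count / entry_count;
}

/* Scale both edge counts of a cloned branch after inlining. */
branch_counts scale_branch(branch_counts b, uint64_t callsite_count,
                           uint64_t entry_count) {
    branch_counts scaled;
    scaled.first = scale_count(b.first, callsite_count, entry_count);
    scaled.second = scale_count(b.second, callsite_count, entry_count);
    return scaled;
}

/* Probability of the first edge as an integer percentage (truncated). */
uint64_t first_edge_percent(branch_counts b) {
    uint64_t total = b.first + b.second;
    assert(total != 0);
    return b.first * 100 / total;
}
```

With the numbers from the example, scale_branch applied to {100, 300} with a callsite count of 1 and an entry count of 2 gives {50, 150}, and first_edge_percent returns 25 both before and after scaling. Integer division can introduce rounding differences for counts that do not divide evenly.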
> This scaling is not difficult at all and can be done incrementally very 
> quickly. We cannot afford to recompute all frequencies on the fly 
> because it would be detrimental to compile time. If foo() itself has N 
> callees inlined into it, each inlined callee needs to trigger a 
> re-computation. When foo() is inlined into bar(), the frequencies will 
> need to be recomputed for foo() and all N callees inlined into foo().

It really sounds like your proposal is to essentially eagerly compute 
scaling rather than lazily compute it on demand.  Assuming perfect 
implementations for both (with no rounding losses), the results should 
be the same.  Is that a correct restatement?  I'm going to hold off on 
responding to why that's a bad idea until you confirm, because I'm not 
sure I follow what you're trying to say. :)

Also, trusting exact entry counts is going to be somewhat suspect. These 
are *highly* susceptible to racy updates, overflow, etc... Anything 
which puts too much implicit trust in these numbers is going to be 
problematic.

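As a small illustration of the overflow concern, here is one way a counter update could clamp instead of wrapping. This is a sketch under assumptions, not how any particular profiling runtime behaves: saturating_add_u64 is a hypothetical helper, and it does nothing about racy updates from multiple threads.

```c
#include <stdint.h>

/* Add two 64-bit counts, clamping at UINT64_MAX instead of wrapping.
   Unsigned overflow wraps in C, so a wrapped sum is smaller than
   either operand; that is how the overflow is detected here. */
uint64_t saturating_add_u64(uint64_t a, uint64_t b) {
    uint64_t sum = a + b;
    return (sum < a) ? UINT64_MAX : sum;
}
```

A clamped counter is no longer an exact execution count, but it still orders a very hot routine above a cold one, whereas a wrapped counter can make the hottest code look cold.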
>
>     We are definitely missing the per-function execution counts that
>     are needed to be able to compare relative “hotness” across
>     functions, and I think that would be a good place to start making
>     improvements. In the long term, we should keep our options open to
>     making major changes, but before we go there, we should try to
>     make incremental improvements to fix the existing infrastructure.
>
>
> Right, and that's the core of our proposal. We don't really want to
> make major infrastructure changes at this point. The only thing I'd 
> like to explore is making MD_prof a real count. This will be useful 
> for the inliner changes and it would also make incremental updates 
> easier, because the scaling that needs to be done is very 
> straightforward and quick.
>
> Note that this change should not modify the current behaviour we get 
> from profile analysis. Things that were hot before should continue to 
> be hot now.

I have no objection to adding a mechanism for expressing an entry 
count.  I am still very hesitant about the proposals with regard to 
redefining the current MD_prof.

I'd encourage you to post a patch for the entry count mechanism, but not 
tie its semantics to exact execution count.  (Something like "the value 
provided must correctly describe the relative hotness of this routine 
against others in the program annotated with the same metadata.  It is 
the relative scaling that is important, not the absolute value.  In 
particular, the value need not be an exact execution count.")

>
>     Many of the other issues you raise seem like they could also be
>     addressed without major changes to the existing infrastructure.
>     Let’s try to fix those first.
>
>
> That's exactly the point of the proposal.  We definitely don't want to
> make major changes to the infrastructure at first. My thinking is to 
> start working on making MD_prof a real count. One of the things that 
> are happening is that the combination of real profile data plus the 
> frequency propagation that we are currently doing is misleading the 
> analysis.

I consider this a major change.  You're trying to redefine a major part 
of the current system.

Multiple people have spoken up and objected to this change (as currently 
described).  Please start somewhere else.

>
> For example (thanks David for the code and data). In the following code:
>
> int g;
> __attribute__((noinline)) void bar() {
>   g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   g = 0;
>
>   // Loop 1.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 100; j++)
>       for (k = 0; k < 100; k++)
>         bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   // Loop 2.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 10000; j++)
>       bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   // Loop 3.
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
> }
>
> When compiled with profile instrumentation, frequency propagation is 
> distorting the real profile because it gives different frequency to 
> the calls to bar() in the 3 different loops. All 3 loops execute 
> 1,000,000 times, but after frequency propagation, the first call to 
> bar() gets a weight of 520,202 in loop #1, 210,944 in  loop #2 and 
> 4,096 in loop #3. In reality, every call to bar() should have a weight 
> of 1,000,000.

Duncan responded to this. My conclusion from his response: this is a 
bug, not a fundamental issue.  Remove the max scaling factor, switch the 
counts to 64 bits and everything should be fine.  If you disagree, let's 
discuss.

>
> Thanks.  Diego.
>
>