thr3ads.net - llvm dev - [LLVMdev] RFC - Improvements to PGO profile support [Mar 2015]

If this information is useful, please help other people find it:
Share via:

Philip Reames

2015-Mar-24 16:59 UTC

[LLVMdev] RFC - Improvements to PGO profile support

On 03/10/2015 10:14 AM, Diego Novillo wrote:>
>
> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com 
> <mailto:bob.wilson at apple.com>> wrote:
>
>
>>     On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at
google.com
>>     <mailto:dnovillo at google.com>> wrote:
>>
>>     On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo
>>     <dnovillo at google.com <mailto:dnovillo at
google.com>> wrote:
>>
>>         I've created a few bugzilla issues with details of some of
>>         the things I'll be looking into. I'm not yet done
>>         wordsmithing the overall design document. I'll try to
finish
>>         it by early next week at the latest.
>>
>>
>>     The document is available at
>>
>>    
https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>    
<https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing>
>>
>>     There are several topics covered. Ideally, I would prefer that we
>>     discuss each topic separately. The main ones I will start working
>>     on are the ones described in the bugzilla links we have in the doc.
>>
>>     This is just a starting point for us. I am not at all concerned
>>     with implementing exactly what is proposed in the document. In
>>     fact, if we can get the same value using the existing support,
>>     all the better.
>>
>>     OTOH, any other ideas that folks may have that work better than
>>     this are more than welcome. I don't have really strong opinions
>>     on the matter. I am fine with whatever works.
>
>     Thanks for the detailed write-up on this. Some of the issues
>     definitely need to be addressed. I am concerned, though, that some
>     of the ideas may be leading toward a scenario where we have
>     essentially two completely different ways of representing profile
>     information in LLVM IR. It is great to have two complementary
>     approaches to collecting profile data, but two representations in
>     the IR would not make sense.
>
>
> Yeah, I don't think I'll continue to push for a new MD_count 
> attribute. If we were to make MD_prof be a "real" execution
count,
> that would be enough. Note that by re-using MD_prof we are not 
> changing its meaning at all. The execution count is still a weight and 
> the ratio is still branch probability. All that we are changing are 
> the absolute values of the number and increasing its data type width 
> to remove the 32bit limitation.Independent of everything else, relaxing the 32 bit restriction is 
clearly a good idea.  This would make a great standalone
patch.>
>
>
>     The first issue raised is that profile execution counts are not
>     represented in the IR. This was a very intentional decision. I
>     know it goes against what other compilers have done in the past.
>     It took me a while to get used to the idea when Andy first
>     suggested it, so I know it seems awkward at first. The advantage
>     is that branch probabilities are much easier to keep updated in
>     the face of compiler transformations, compared to execution counts.
>
>
> Sorry. I don't follow. Updating counts as the CFG is transformed is 
> not difficult at all. What examples do you have in mind?  The big 
> advantage of making MD_prof an actual execution count is that it is a 
> meaningful metric wrt scaling and transformation.
>
> Say, for instance, that we have a branch instruction with two targets 
> with counts {100, 300} inside a function 'foo' that has entry count
2.
> The edge probability for the first edge (count 100) is 100/(100+300) = 
> 25%.
>
> If we inline foo() inside another function bar() at a callsite with 
> profile count == 1, the cloned branch instruction gets its counters 
> scaled with the callsite count. So the new branch has counts {100 * 1 
> / 2, 300 * 1 / 2} = {50, 150}.  But the branch probability did not 
> change. Currently, we are cloning the branch without changing the edge 
> weights.
>
> This scaling is not difficult at all and can be incrementally very 
> quickly. We cannot afford to recompute all frequencies on the fly 
> because it would be detrimental to compile time. If foo() itself has N 
> callees inlined into it, each inlined callee needs to trigger a 
> re-computation. When foo() is inlined into bar(), the frequencies will 
> need to be recomputed for foo() and all N callees inlined into foo().It really sounds like your proposal is to essentially eagerly compute 
scaling rather than lazyily compute it on demand.  Assuming perfect 
implementations for both (with no rounding losses), the results should 
be the same.  Is that a correct restatement?  I'm going to hold off on 
responding to why that's a bad idea until you confirm, because I'm not 
sure I follow what you're trying to say. :)

Also, trusting exact entry counts is going to be somewhat suspect. These 
are *highly* susceptible to racy updates, overflow, etc... Anything 
which puts too much implicit trust in these numbers is going to be 
problematic.>
>
>     We are definitely missing the per-function execution counts that
>     are needed to be able to compare relative “hotness” across
>     functions, and I think that would be a good place to start making
>     improvements. In the long term, we should keep our options open to
>     making major changes, but before we go there, we should try to
>     make incremental improvements to fix the existing infrastructure.
>
>
> Right, and that's the core of our proposal. We don't really want to
> make major infrastructure changes at this point. The only thing I'd 
> like to explore is making MD_prof a real count. This will be useful 
> for the inliner changes and it would also make incremental updates 
> easier, because the scaling that needs to be done is very 
> straightforward and quick.
>
> Note that this change should not modify the current behaviour we get 
> from profile analysis. Things that were hot before should continue to 
> be hot now.I have no objection to adding a mechanism for expressing an entry 
count.  I am still very hesitant about the proposals with regards to 
redefining the current MD_prof.

I'd encourage you to post a patch for the entry count mechanism, but not 
tie its semantics to exact execution count.  (Something like "the value 
provided must correctly describe the relative hotness of this routine 
against others in the program annoatated with the same metadata.  It is 
the relative scaling that is important, not the absolute value.  In 
particular, the value need not be an exact execution
count.")>
>
>     Many of the other issues you raise seem like they could also be
>     addressed without major changes to the existing infrastructure.
>     Let’s try to fix those first.
>
>
> That's exactly the point of the proposal.  We definitely don't want
to
> make major changes to the infrastructure at first. My thinking is to 
> start working on making MD_prof a real count. One of the things that 
> are happening is that the combination of real profile data plus the 
> frequency propagation that we are currently doing is misleading the 
> analysis.I consider this a major change.  You're trying to redefine a major part 
of the current system.

Multiple people have spoken up and objected to this change (as currently 
described).  Please start somewhere else.>
> For example (thanks David for the code and data). In the following code:
>
> int g;
> __attribute__((noinline)) void bar() {
>  g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
> int i, j, k;
>
>   g = 0;
>
> // Loop 1.
> for (i = 0; i < 100; i++)
> for (j = 0; j < 100; j++)
>    for (k = 0; k < 100; k++)
>        bar();
>
> printf ("g = %d\n", g);
>   g = 0;
>
> // Loop 2.
> for (i = 0; i < 100; i++)
> for (j = 0; j < 10000; j++)
>     bar();
>
> printf ("g = %d\n", g);
>   g = 0;
>
>
> // Loop 3.
> for (i = 0; i < 1000000; i++)
> bar();
>
> printf ("g = %d\n", g);
>   g = 0;
> }
>
> When compiled with profile instrumentation, frequency propagation is 
> distorting the real profile because it gives different frequency to 
> the calls to bar() in the 3 different loops. All 3 loops execute 
> 1,000,000 times, but after frequency propagation, the first call to 
> bar() gets a weight of 520,202 in loop #1, 210,944 in  loop #2 and 
> 4,096 in loop #3. In reality, every call to bar() should have a weight 
> of 1,000,000.Duncan responded to this. My conclusion from his response: this is a 
bug, not a fundamental issue.  Remove the max scaling factor, switch the 
counts to 64 bits and everything should be fine.  If you disagree, let's 
discuss.>
>
> Thanks.  Diego.
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150324/2caf9dc3/attachment.html>

Philip Reames

2015-Mar-24 17:13 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

On 03/24/2015 09:59 AM, Philip Reames wrote:> On 03/10/2015 10:14 AM, Diego Novillo wrote:
>>
>>
>> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com
>> <mailto:bob.wilson at apple.com>> wrote:
>>
>>
>>>     On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at
google.com
>>>     <mailto:dnovillo at google.com>> wrote:
>>>
>>>     On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo
>>>     <dnovillo at google.com <mailto:dnovillo at
google.com>> wrote:
>>>
>>>         I've created a few bugzilla issues with details of some
of
>>>         the things I'll be looking into. I'm not yet done
>>>         wordsmithing the overall design document. I'll try to
finish
>>>         it by early next week at the latest.
>>>
>>>
>>>     The document is available at
>>>
>>>    
https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>>    
<https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing>
>>>
>>>     There are several topics covered. Ideally, I would prefer that
>>>     we discuss each topic separately. The main ones I will start
>>>     working on are the ones described in the bugzilla links we have
>>>     in the doc.
>>>
>>>     This is just a starting point for us. I am not at all concerned
>>>     with implementing exactly what is proposed in the document. In
>>>     fact, if we can get the same value using the existing support,
>>>     all the better.
>>>
>>>     OTOH, any other ideas that folks may have that work better than
>>>     this are more than welcome. I don't have really strong
opinions
>>>     on the matter. I am fine with whatever works.
>>
>>     Thanks for the detailed write-up on this. Some of the issues
>>     definitely need to be addressed. I am concerned, though, that
>>     some of the ideas may be leading toward a scenario where we have
>>     essentially two completely different ways of representing profile
>>     information in LLVM IR. It is great to have two complementary
>>     approaches to collecting profile data, but two representations in
>>     the IR would not make sense.
>>
>>
>> Yeah, I don't think I'll continue to push for a new MD_count 
>> attribute. If we were to make MD_prof be a "real" execution
count,
>> that would be enough. Note that by re-using MD_prof we are not 
>> changing its meaning at all. The execution count is still a weight 
>> and the ratio is still branch probability. All that we are changing 
>> are the absolute values of the number and increasing its data type 
>> width to remove the 32bit limitation.
> Independent of everything else, relaxing the 32 bit restriction is 
> clearly a good idea.  This would make a great standalone patch.
>>
>>
>>
>>     The first issue raised is that profile execution counts are not
>>     represented in the IR. This was a very intentional decision. I
>>     know it goes against what other compilers have done in the past.
>>     It took me a while to get used to the idea when Andy first
>>     suggested it, so I know it seems awkward at first. The advantage
>>     is that branch probabilities are much easier to keep updated in
>>     the face of compiler transformations, compared to execution counts.
>>
>>
>> Sorry. I don't follow. Updating counts as the CFG is transformed is
>> not difficult at all. What examples do you have in mind?  The big 
>> advantage of making MD_prof an actual execution count is that it is a 
>> meaningful metric wrt scaling and transformation.
>>
>> Say, for instance, that we have a branch instruction with two targets 
>> with counts {100, 300} inside a function 'foo' that has entry
count
>> 2. The edge probability for the first edge (count 100) is 
>> 100/(100+300) = 25%.
>>
>> If we inline foo() inside another function bar() at a callsite with 
>> profile count == 1, the cloned branch instruction gets its counters 
>> scaled with the callsite count. So the new branch has counts {100 * 1 
>> / 2, 300 * 1 / 2} = {50, 150}.  But the branch probability did not 
>> change. Currently, we are cloning the branch without changing the 
>> edge weights.
>>
>> This scaling is not difficult at all and can be incrementally very 
>> quickly. We cannot afford to recompute all frequencies on the fly 
>> because it would be detrimental to compile time. If foo() itself has 
>> N callees inlined into it, each inlined callee needs to trigger a 
>> re-computation. When foo() is inlined into bar(), the frequencies 
>> will need to be recomputed for foo() and all N callees inlined into 
>> foo().
> It really sounds like your proposal is to essentially eagerly compute 
> scaling rather than lazyily compute it on demand. Assuming perfect 
> implementations for both (with no rounding losses), the results should 
> be the same.  Is that a correct restatement?  I'm going to hold off on 
> responding to why that's a bad idea until you confirm, because I'm
not
> sure I follow what you're trying to say.  :)
>
> Also, trusting exact entry counts is going to be somewhat suspect.  
> These are *highly* susceptible to racy updates, overflow, etc...  
> Anything which puts too much implicit trust in these numbers is going 
> to be problematic.
>>
>>
>>     We are definitely missing the per-function execution counts that
>>     are needed to be able to compare relative “hotness” across
>>     functions, and I think that would be a good place to start making
>>     improvements. In the long term, we should keep our options open
>>     to making major changes, but before we go there, we should try to
>>     make incremental improvements to fix the existing infrastructure.
>>
>>
>> Right, and that's the core of our proposal. We don't really
want to
>> make major infrastructure changes at this point. The only thing I'd
>> like to explore is making MD_prof a real count. This will be useful 
>> for the inliner changes and it would also make incremental updates 
>> easier, because the scaling that needs to be done is very 
>> straightforward and quick.
>>
>> Note that this change should not modify the current behaviour we get 
>> from profile analysis. Things that were hot before should continue to 
>> be hot now.
> I have no objection to adding a mechanism for expressing an entry 
> count.  I am still very hesitant about the proposals with regards to 
> redefining the current MD_prof.
>
> I'd encourage you to post a patch for the entry count mechanism, but 
> not tie its semantics to exact execution count.  (Something like "the 
> value provided must correctly describe the relative hotness of this 
> routine against others in the program annoatated with the same 
> metadata.  It is the relative scaling that is important, not the 
> absolute value.  In particular, the value need not be an exact 
> execution count.")
>>
>>
>>     Many of the other issues you raise seem like they could also be
>>     addressed without major changes to the existing infrastructure.
>>     Let’s try to fix those first.
>>
>>
>> That's exactly the point of the proposal.  We definitely don't
want
>> to make major changes to the infrastructure at first. My thinking is 
>> to start working on making MD_prof a real count. One of the things 
>> that are happening is that the combination of real profile data plus 
>> the frequency propagation that we are currently doing is misleading 
>> the analysis.
> I consider this a major change.  You're trying to redefine a major 
> part of the current system.
>
> Multiple people have spoken up and objected to this change (as 
> currently described).  Please start somewhere else.Sorry, "objected" is too strong a word here.  I should have said
"have
reservations about".

My point here - which I'm not sure I expressed well - is that I'm 
concerned we're going to bog down on a controversial change rather than 
make progress on the parts we agree on.  I'm not saying that we should 
never redefine things in the way your proposing, but I would like to see 
the parts that work under both schemes be addressed first.  We can 
continue this discussion in parallel.>>
>> For example (thanks David for the code and data). In the following
code:
>>
>> int g;
>> __attribute__((noinline)) void bar() {
>>  g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>> int i, j, k;
>>
>> g = 0;
>>
>> // Loop 1.
>> for (i = 0; i < 100; i++)
>>   for (j = 0; j < 100; j++)
>>      for (k = 0; k < 100; k++)
>>          bar();
>>
>> printf ("g = %d\n", g);
>> g = 0;
>>
>> // Loop 2.
>> for (i = 0; i < 100; i++)
>>   for (j = 0; j < 10000; j++)
>>       bar();
>>
>> printf ("g = %d\n", g);
>> g = 0;
>>
>>
>> // Loop 3.
>> for (i = 0; i < 1000000; i++)
>>   bar();
>>
>> printf ("g = %d\n", g);
>> g = 0;
>> }
>>
>> When compiled with profile instrumentation, frequency propagation is 
>> distorting the real profile because it gives different frequency to 
>> the calls to bar() in the 3 different loops. All 3 loops execute 
>> 1,000,000 times, but after frequency propagation, the first call to 
>> bar() gets a weight of 520,202 in loop #1, 210,944 in  loop #2 and 
>> 4,096 in loop #3. In reality, every call to bar() should have a 
>> weight of 1,000,000.
> Duncan responded to this.  My conclusion from his response: this is a 
> bug, not a fundamental issue.  Remove the max scaling factor, switch 
> the counts to 64 bits and everything should be fine.  If you disagree, 
> let's discuss.
>>
>>
>> Thanks.  Diego.
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu          http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150324/a92786c3/attachment.html>

Diego Novillo

2015-Mar-24 17:17 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

On Tue, Mar 24, 2015 at 1:13 PM, Philip Reames
<listmail at philipreames.com> wrote:
> Sorry, "objected" is too strong a word here.  I should have said
"have
> reservations about".
>
> My point here - which I'm not sure I expressed well - is that I'm
concerned
> we're going to bog down on a controversial change rather than make
progress
> on the parts we agree on.  I'm not saying that we should never redefine
> things in the way your proposing, but I would like to see the parts that
> work under both schemes be addressed first.  We can continue this
discussion
> in parallel.
No worries. As I stated upthread, my plan is to address all the issues
bottom up, starting with the more immediately obvious (no changes for
change sake). If we can get the usefulness we are looking for out of
the current harness, then no big changes will be needed.

If we run into some brick wall down the road (say when we're doing the
profile-based inliner changes), then we'll see what other options we
have available.


Diego.

Xinliang David Li

2015-Mar-24 17:27 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

Diego and I have discussed this according to the feedback received. We
have revised plan for this (see Diego's last reply).  Here is a more
detailed re-cap:

1) keep MD_prof definition as it is today; also keep using the
frequency propagation as it is (assuming programs with irreducible
loops are not common and not important. If it turns out to be
otherwise, we will revisit this).
2) fix all problems that lead to wrong 'frequency/count' computed from
the frequency propagation algorithm
   2.1) relax 32bit limit
   2.2) remove 4096 iteration count cap
   2.3) remove the 'laplace rule of succession' which can be very
harmful
   2.4) Fix the bug in  'BranchProbabilityInfo::calcMetadataWeights'
that does capping without scaling
   2.5) Fix a similar bug in
'MachineBranchProbabilityInfo::getSumForBlock('
    .. other bugs found
3) introduce a new meta data to represent function entry global
hotness (the implementation can choose to use entry execution count)
4) implement frequency/profile update interfaces for Inline
transformations -- i.e., allowing importing callee's profile info into
caller with scaling factor computed from callsite count and callee
entry count
5) Implement mechanism to compute, record and use profile program
summary data (This will be discussed separately).


thanks,

David



On Tue, Mar 24, 2015 at 9:59 AM, Philip Reames
<listmail at philipreames.com> wrote:> On 03/10/2015 10:14 AM, Diego Novillo wrote:
>
>
>
> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com>
wrote:
>>
>>
>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at
google.com> wrote:
>>
>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at
google.com>
>> wrote:
>>
>>> I've created a few bugzilla issues with details of some of the
things
>>> I'll be looking into. I'm not yet done wordsmithing the
overall design
>>> document. I'll try to finish it by early next week at the
latest.
>>
>>
>> The document is available at
>>
>>
>>
https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>
>> There are several topics covered. Ideally, I would prefer that we
discuss
>> each topic separately. The main ones I will start working on are the
ones
>> described in the bugzilla links we have in the doc.
>>
>> This is just a starting point for us. I am not at all concerned with
>> implementing exactly what is proposed in the document. In fact, if we
can
>> get the same value using the existing support, all the better.
>>
>> OTOH, any other ideas that folks may have that work better than this
are
>> more than welcome. I don't have really strong opinions on the
matter. I am
>> fine with whatever works.
>>
>>
>> Thanks for the detailed write-up on this. Some of the issues definitely
>> need to be addressed. I am concerned, though, that some of the ideas
may be
>> leading toward a scenario where we have essentially two completely
different
>> ways of representing profile information in LLVM IR. It is great to
have two
>> complementary approaches to collecting profile data, but two
representations
>> in the IR would not make sense.
>
>
> Yeah, I don't think I'll continue to push for a new MD_count
attribute. If
> we were to make MD_prof be a "real" execution count, that would
be enough.
> Note that by re-using MD_prof we are not changing its meaning at all. The
> execution count is still a weight and the ratio is still branch
probability.
> All that we are changing are the absolute values of the number and
> increasing its data type width to remove the 32bit limitation.
>
> Independent of everything else, relaxing the 32 bit restriction is clearly
a
> good idea.  This would make a great standalone patch.
>
>
>
>>
>> The first issue raised is that profile execution counts are not
>> represented in the IR. This was a very intentional decision. I know it
goes
>> against what other compilers have done in the past. It took me a while
to
>> get used to the idea when Andy first suggested it, so I know it seems
>> awkward at first. The advantage is that branch probabilities are much
easier
>> to keep updated in the face of compiler transformations, compared to
>> execution counts.
>
>
> Sorry. I don't follow. Updating counts as the CFG is transformed is not
> difficult at all. What examples do you have in mind?  The big advantage of
> making MD_prof an actual execution count is that it is a meaningful metric
> wrt scaling and transformation.
>
> Say, for instance, that we have a branch instruction with two targets with
> counts {100, 300} inside a function 'foo' that has entry count 2.
The edge
> probability for the first edge (count 100) is 100/(100+300) = 25%.
>
> If we inline foo() inside another function bar() at a callsite with profile
> count == 1, the cloned branch instruction gets its counters scaled with the
> callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2}
> {50, 150}.  But the branch probability did not change. Currently, we are
> cloning the branch without changing the edge weights.
>
> This scaling is not difficult at all and can be incrementally very quickly.
> We cannot afford to recompute all frequencies on the fly because it would
be
> detrimental to compile time. If foo() itself has N callees inlined into it,
> each inlined callee needs to trigger a re-computation. When foo() is
inlined
> into bar(), the frequencies will need to be recomputed for foo() and all N
> callees inlined into foo().
>
> It really sounds like your proposal is to essentially eagerly compute
> scaling rather than lazyily compute it on demand.  Assuming perfect
> implementations for both (with no rounding losses), the results should be
> the same.  Is that a correct restatement?  I'm going to hold off on
> responding to why that's a bad idea until you confirm, because I'm
not sure
> I follow what you're trying to say.  :)
>
> Also, trusting exact entry counts is going to be somewhat suspect.  These
> are *highly* susceptible to racy updates, overflow, etc...  Anything which
> puts too much implicit trust in these numbers is going to be problematic.
>
>
>
>>
>> We are definitely missing the per-function execution counts that are
>> needed to be able to compare relative “hotness” across functions, and I
>> think that would be a good place to start making improvements. In the
long
>> term, we should keep our options open to making major changes, but
before we
>> go there, we should try to make incremental improvements to fix the
existing
>> infrastructure.
>
>
> Right, and that's the core of our proposal. We don't really want to
make
> major infrastructure changes at this point. The only thing I'd like to
> explore is making MD_prof a real count. This will be useful for the inliner
> changes and it would also make incremental updates easier, because the
> scaling that needs to be done is very straightforward and quick.
>
> Note that this change should not modify the current behaviour we get from
> profile analysis. Things that were hot before should continue to be hot
now.
>
> I have no objection to adding a mechanism for expressing an entry count.  I
> am still very hesitant about the proposals with regards to redefining the
> current MD_prof.
>
> I'd encourage you to post a patch for the entry count mechanism, but
not tie
> its semantics to exact execution count.  (Something like "the value
provided
> must correctly describe the relative hotness of this routine against others
> in the program annoatated with the same metadata.  It is the relative
> scaling that is important, not the absolute value.  In particular, the
value
> need not be an exact execution count.")
>
>
>>
>> Many of the other issues you raise seem like they could also be
addressed
>> without major changes to the existing infrastructure. Let’s try to fix
those
>> first.
>
>
> That's exactly the point of the proposal.  We definitely don't want
to make
> major changes to the infrastructure at first. My thinking is to start
> working on making MD_prof a real count. One of the things that are
happening
> is that the combination of real profile data plus the frequency propagation
> that we are currently doing is misleading the analysis.
>
> I consider this a major change.  You're trying to redefine a major part
of
> the current system.
>
> Multiple people have spoken up and objected to this change (as currently
> described).  Please start somewhere else.
>
>
> For example (thanks David for the code and data). In the following code:
>
> int g;
> __attribute__((noinline)) void bar() {
>  g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   g = 0;
>
>   // Loop 1.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 100; j++)
>        for (k = 0; k < 100; k++)
>            bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   // Loop 2.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 10000; j++)
>         bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>
>   // Loop 3.
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
> }
>
> When compiled with profile instrumentation, frequency propagation is
> distorting the real profile because it gives different frequency to the
> calls to bar() in the 3 different loops. All 3 loops execute 1,000,000
> times, but after frequency propagation, the first call to bar() gets a
> weight of 520,202 in loop #1, 210,944 in  loop #2 and 4,096 in loop #3. In
> reality, every call to bar() should have a weight of 1,000,000.
>
> Duncan responded to this.  My conclusion from his response: this is a bug,
> not a fundamental issue.  Remove the max scaling factor, switch the counts
> to 64 bits and everything should be fine.  If you disagree, let's
discuss.
>
>
>
> Thanks.  Diego.
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>

Xinliang David Li

2015-Mar-24 17:34 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

> I follow what you're trying to say.  :)
>
> Also, trusting exact entry counts is going to be somewhat suspect.  These
> are *highly* susceptible to racy updates, overflow, etc...  Anything which
> puts too much implicit trust in these numbers is going to be problematic.
No, we won't be using profile counts in the absolute sense, a.k.a.  if
(count == 100000).  The counts is used as global hotness relative to
others. It is also used together with program summary data.
>
> I have no objection to adding a mechanism for expressing an entry count.  I
> am still very hesitant about the proposals with regards to redefining the
> current MD_prof.
For now, we won't touch MD_prof's definition.
>
> I'd encourage you to post a patch for the entry count mechanism, but
not tie
> its semantics to exact execution count.  (Something like "the value
provided
> must correctly describe the relative hotness of this routine against others
> in the program annoatated with the same metadata.  It is the relative
> scaling that is important, not the absolute value.  In particular, the
value
> need not be an exact execution count.")
>
Define entry count in a more flexible way is fine -- as long as the
implementation can choose to use 'execution count' :)

thanks,

David
>
>>
>> Many of the other issues you raise seem like they could also be
addressed
>> without major changes to the existing infrastructure. Let’s try to fix
those
>> first.
>
>
> That's exactly the point of the proposal.  We definitely don't want
to make
> major changes to the infrastructure at first. My thinking is to start
> working on making MD_prof a real count. One of the things that are
happening
> is that the combination of real profile data plus the frequency propagation
> that we are currently doing is misleading the analysis.
>
> I consider this a major change.  You're trying to redefine a major part
of
> the current system.
>
> Multiple people have spoken up and objected to this change (as currently
> described).  Please start somewhere else.
>
>
> For example (thanks David for the code and data). In the following code:
>
> int g;
> __attribute__((noinline)) void bar() {
>  g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   g = 0;
>
>   // Loop 1.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 100; j++)
>        for (k = 0; k < 100; k++)
>            bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   // Loop 2.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 10000; j++)
>         bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>
>   // Loop 3.
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
> }
>
> When compiled with profile instrumentation, frequency propagation is
> distorting the real profile because it gives different frequency to the
> calls to bar() in the 3 different loops. All 3 loops execute 1,000,000
> times, but after frequency propagation, the first call to bar() gets a
> weight of 520,202 in loop #1, 210,944 in  loop #2 and 4,096 in loop #3. In
> reality, every call to bar() should have a weight of 1,000,000.
>
> Duncan responded to this.  My conclusion from his response: this is a bug,
> not a fundamental issue.  Remove the max scaling factor, switch the counts
> to 64 bits and everything should be fine.  If you disagree, let's
discuss.
>
>
>
> Thanks.  Diego.
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>

Bob Wilson

2015-Mar-24 17:48 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <davidxl at
google.com> wrote:
> 
> Diego and I have discussed this according to the feedback received. We
> have revised plan for this (see Diego's last reply).  Here is a more
> detailed re-cap:
> 
> 1) keep MD_prof definition as it is today; also keep using the
> frequency propagation as it is (assuming programs with irreducible
> loops are not common and not important. If it turns out to be
> otherwise, we will revisit this).
> 2) fix all problems that lead to wrong 'frequency/count' computed
from
> the frequency propagation algorithm
>   2.1) relax 32bit limit
>   2.2) remove 4096 iteration count cap
>   2.3) remove the 'laplace rule of succession' which can be very
harmful
I have not seen any consensus on this. I’m not at all convinced this would be a
good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior
in cases where the counts are very small and has almost no impact when the
counts are large.
>   2.4) Fix the bug in  'BranchProbabilityInfo::calcMetadataWeights'
> that does capping without scaling
>   2.5) Fix a similar bug in
'MachineBranchProbabilityInfo::getSumForBlock('
>    .. other bugs found
> 3) introduce a new meta data to represent function entry global
> hotness (the implementation can choose to use entry execution count)
> 4) implement frequency/profile update interfaces for Inline
> transformations -- i.e., allowing importing callee's profile info into
> caller with scaling factor computed from callsite count and callee
> entry count
> 5) Implement mechanism to compute, record and use profile program
> summary data (This will be discussed separately).
> 
> 
> thanks,
> 
> David
> 
> 
> 
> On Tue, Mar 24, 2015 at 9:59 AM, Philip Reames
> <listmail at philipreames.com> wrote:
>> On 03/10/2015 10:14 AM, Diego Novillo wrote:
>> 
>> 
>> 
>> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at
apple.com> wrote:
>>> 
>>> 
>>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at
google.com> wrote:
>>> 
>>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at
google.com>
>>> wrote:
>>> 
>>>> I've created a few bugzilla issues with details of some of
the things
>>>> I'll be looking into. I'm not yet done wordsmithing the
overall design
>>>> document. I'll try to finish it by early next week at the
latest.
>>> 
>>> 
>>> The document is available at
>>> 
>>> 
>>>
https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>> 
>>> There are several topics covered. Ideally, I would prefer that we
discuss
>>> each topic separately. The main ones I will start working on are
the ones
>>> described in the bugzilla links we have in the doc.
>>> 
>>> This is just a starting point for us. I am not at all concerned
with
>>> implementing exactly what is proposed in the document. In fact, if
we can
>>> get the same value using the existing support, all the better.
>>> 
>>> OTOH, any other ideas that folks may have that work better than
this are
>>> more than welcome. I don't have really strong opinions on the
matter. I am
>>> fine with whatever works.
>>> 
>>> 
>>> Thanks for the detailed write-up on this. Some of the issues
definitely
>>> need to be addressed. I am concerned, though, that some of the
ideas may be
>>> leading toward a scenario where we have essentially two completely
different
>>> ways of representing profile information in LLVM IR. It is great to
have two
>>> complementary approaches to collecting profile data, but two
representations
>>> in the IR would not make sense.
>> 
>> 
>> Yeah, I don't think I'll continue to push for a new MD_count
attribute. If
>> we were to make MD_prof be a "real" execution count, that
would be enough.
>> Note that by re-using MD_prof we are not changing its meaning at all.
The
>> execution count is still a weight and the ratio is still branch
probability.
>> All that we are changing are the absolute values of the number and
>> increasing its data type width to remove the 32bit limitation.
>> 
>> Independent of everything else, relaxing the 32 bit restriction is
clearly a
>> good idea.  This would make a great standalone patch.
>> 
>> 
>> 
>>> 
>>> The first issue raised is that profile execution counts are not
>>> represented in the IR. This was a very intentional decision. I know
it goes
>>> against what other compilers have done in the past. It took me a
while to
>>> get used to the idea when Andy first suggested it, so I know it
seems
>>> awkward at first. The advantage is that branch probabilities are
much easier
>>> to keep updated in the face of compiler transformations, compared
to
>>> execution counts.
>> 
>> 
>> Sorry. I don't follow. Updating counts as the CFG is transformed is
not
>> difficult at all. What examples do you have in mind?  The big advantage
of
>> making MD_prof an actual execution count is that it is a meaningful
metric
>> wrt scaling and transformation.
>> 
>> Say, for instance, that we have a branch instruction with two targets
with
>> counts {100, 300} inside a function 'foo' that has entry count
2. The edge
>> probability for the first edge (count 100) is 100/(100+300) = 25%.
>> 
>> If we inline foo() inside another function bar() at a callsite with
profile
>> count == 1, the cloned branch instruction gets its counters scaled with
the
>> callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2}
>> {50, 150}.  But the branch probability did not change. Currently, we
are
>> cloning the branch without changing the edge weights.
>> 
>> This scaling is not difficult at all and can be incrementally very
quickly.
>> We cannot afford to recompute all frequencies on the fly because it
would be
>> detrimental to compile time. If foo() itself has N callees inlined into
it,
>> each inlined callee needs to trigger a re-computation. When foo() is
inlined
>> into bar(), the frequencies will need to be recomputed for foo() and
all N
>> callees inlined into foo().
>> 
>> It really sounds like your proposal is to essentially eagerly compute
>> scaling rather than lazyily compute it on demand.  Assuming perfect
>> implementations for both (with no rounding losses), the results should
be
>> the same.  Is that a correct restatement?  I'm going to hold off on
>> responding to why that's a bad idea until you confirm, because
I'm not sure
>> I follow what you're trying to say.  :)
>> 
>> Also, trusting exact entry counts is going to be somewhat suspect. 
These
>> are *highly* susceptible to racy updates, overflow, etc...  Anything
which
>> puts too much implicit trust in these numbers is going to be
problematic.
>> 
>> 
>> 
>>> 
>>> We are definitely missing the per-function execution counts that
are
>>> needed to be able to compare relative “hotness” across functions,
and I
>>> think that would be a good place to start making improvements. In
the long
>>> term, we should keep our options open to making major changes, but
before we
>>> go there, we should try to make incremental improvements to fix the
existing
>>> infrastructure.
>> 
>> 
>> Right, and that's the core of our proposal. We don't really
want to make
>> major infrastructure changes at this point. The only thing I'd like
to
>> explore is making MD_prof a real count. This will be useful for the
inliner
>> changes and it would also make incremental updates easier, because the
>> scaling that needs to be done is very straightforward and quick.
>> 
>> Note that this change should not modify the current behaviour we get
from
>> profile analysis. Things that were hot before should continue to be hot
now.
>> 
>> I have no objection to adding a mechanism for expressing an entry
count.  I
>> am still very hesitant about the proposals with regards to redefining
the
>> current MD_prof.
>> 
>> I'd encourage you to post a patch for the entry count mechanism,
but not tie
>> its semantics to exact execution count.  (Something like "the
value provided
>> must correctly describe the relative hotness of this routine against
others
>> in the program annoatated with the same metadata.  It is the relative
>> scaling that is important, not the absolute value.  In particular, the
value
>> need not be an exact execution count.")
>> 
>> 
>>> 
>>> Many of the other issues you raise seem like they could also be
addressed
>>> without major changes to the existing infrastructure. Let’s try to
fix those
>>> first.
>> 
>> 
>> That's exactly the point of the proposal.  We definitely don't
want to make
>> major changes to the infrastructure at first. My thinking is to start
>> working on making MD_prof a real count. One of the things that are
happening
>> is that the combination of real profile data plus the frequency
propagation
>> that we are currently doing is misleading the analysis.
>> 
>> I consider this a major change.  You're trying to redefine a major
part of
>> the current system.
>> 
>> Multiple people have spoken up and objected to this change (as
currently
>> described).  Please start somewhere else.
>> 
>> 
>> For example (thanks David for the code and data). In the following
code:
>> 
>> int g;
>> __attribute__((noinline)) void bar() {
>> g++;
>> }
>> 
>> extern int printf(const char*, ...);
>> 
>> int main()
>> {
>>  int i, j, k;
>> 
>>  g = 0;
>> 
>>  // Loop 1.
>>  for (i = 0; i < 100; i++)
>>    for (j = 0; j < 100; j++)
>>       for (k = 0; k < 100; k++)
>>           bar();
>> 
>>  printf ("g = %d\n", g);
>>  g = 0;
>> 
>>  // Loop 2.
>>  for (i = 0; i < 100; i++)
>>    for (j = 0; j < 10000; j++)
>>        bar();
>> 
>>  printf ("g = %d\n", g);
>>  g = 0;
>> 
>> 
>>  // Loop 3.
>>  for (i = 0; i < 1000000; i++)
>>    bar();
>> 
>>  printf ("g = %d\n", g);
>>  g = 0;
>> }
>> 
>> When compiled with profile instrumentation, frequency propagation is
>> distorting the real profile because it gives different frequency to the
>> calls to bar() in the 3 different loops. All 3 loops execute 1,000,000
>> times, but after frequency propagation, the first call to bar() gets a
>> weight of 520,202 in loop #1, 210,944 in  loop #2 and 4,096 in loop #3.
In
>> reality, every call to bar() should have a weight of 1,000,000.
>> 
>> Duncan responded to this.  My conclusion from his response: this is a
bug,
>> not a fundamental issue.  Remove the max scaling factor, switch the
counts
>> to 64 bits and everything should be fine.  If you disagree, let's
discuss.
>> 
>> 
>> 
>> Thanks.  Diego.
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
>>

Chandler Carruth

2015-Mar-24 18:29 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

Sorry I haven't responded earlier, but one point here still doesn't make
sense to me:

On Tue, Mar 24, 2015 at 10:27 AM, Xinliang David Li <davidxl at
google.com>
wrote:
> Diego and I have discussed this according to the feedback received. We
> have revised plan for this (see Diego's last reply).  Here is a more
> detailed re-cap:
>
> 1) keep MD_prof definition as it is today; also keep using the
> frequency propagation as it is (assuming programs with irreducible
> loops are not common and not important. If it turns out to be
> otherwise, we will revisit this).
> 2) fix all problems that lead to wrong 'frequency/count' computed
from
> the frequency propagation algorithm
>    2.1) relax 32bit limit
>
I still don't understand why this is important or useful.... Maybe I'm
just
missing something.

Given the current meaning of MD_prof, it seems like the result of limiting
this to 32-bits is that the maximum relative ratio of probabilities between
two successors of a basic block with N successors is (2 billion / N):1 --
what is the circumstance that makes this resolution insufficient?

It also doesn't seem *bad* per-se, I just don't see what it improves,
and
it does cost memory...
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150324/cf698d75/attachment.html>

Philip Reames

2015-Mar-24 20:54 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

On 03/24/2015 10:34 AM, Xinliang David Li wrote:>> I follow what you're trying to say.  :)
>>
>> Also, trusting exact entry counts is going to be somewhat suspect. 
These
>> are *highly* susceptible to racy updates, overflow, etc...  Anything
which
>> puts too much implicit trust in these numbers is going to be
problematic.
> No, we won't be using profile counts in the absolute sense, a.k.a.  if
> (count == 100000).  The counts is used as global hotness relative to
> others. It is also used together with program summary data.
>
>> I have no objection to adding a mechanism for expressing an entry
count.  I
>> am still very hesitant about the proposals with regards to redefining
the
>> current MD_prof.
> For now, we won't touch MD_prof's definition.
>
>> I'd encourage you to post a patch for the entry count mechanism,
but not tie
>> its semantics to exact execution count.  (Something like "the
value provided
>> must correctly describe the relative hotness of this routine against
others
>> in the program annoatated with the same metadata.  It is the relative
>> scaling that is important, not the absolute value.  In particular, the
value
>> need not be an exact execution count.")
>>
> Define entry count in a more flexible way is fine -- as long as the
> implementation can choose to use 'execution count' :)
Absolutely agreed here.  :)>
> thanks,
>
> David
>
>>> Many of the other issues you raise seem like they could also be
addressed
>>> without major changes to the existing infrastructure. Let’s try to
fix those
>>> first.
>>
>> That's exactly the point of the proposal.  We definitely don't
want to make
>> major changes to the infrastructure at first. My thinking is to start
>> working on making MD_prof a real count. One of the things that are
happening
>> is that the combination of real profile data plus the frequency
propagation
>> that we are currently doing is misleading the analysis.
>>
>> I consider this a major change.  You're trying to redefine a major
part of
>> the current system.
>>
>> Multiple people have spoken up and objected to this change (as
currently
>> described).  Please start somewhere else.
>>
>>
>> For example (thanks David for the code and data). In the following
code:
>>
>> int g;
>> __attribute__((noinline)) void bar() {
>>   g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>    int i, j, k;
>>
>>    g = 0;
>>
>>    // Loop 1.
>>    for (i = 0; i < 100; i++)
>>      for (j = 0; j < 100; j++)
>>         for (k = 0; k < 100; k++)
>>             bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>>
>>    // Loop 2.
>>    for (i = 0; i < 100; i++)
>>      for (j = 0; j < 10000; j++)
>>          bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>>
>>
>>    // Loop 3.
>>    for (i = 0; i < 1000000; i++)
>>      bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>> }
>>
>> When compiled with profile instrumentation, frequency propagation is
>> distorting the real profile because it gives different frequency to the
>> calls to bar() in the 3 different loops. All 3 loops execute 1,000,000
>> times, but after frequency propagation, the first call to bar() gets a
>> weight of 520,202 in loop #1, 210,944 in  loop #2 and 4,096 in loop #3.
In
>> reality, every call to bar() should have a weight of 1,000,000.
>>
>> Duncan responded to this.  My conclusion from his response: this is a
bug,
>> not a fundamental issue.  Remove the max scaling factor, switch the
counts
>> to 64 bits and everything should be fine.  If you disagree, let's
discuss.
>>
>>
>>
>> Thanks.  Diego.
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>

Philip Reames

2015-Mar-24 20:57 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

On 03/24/2015 10:27 AM, Xinliang David Li wrote:> Diego and I have discussed this according to the feedback received. We
> have revised plan for this (see Diego's last reply).  Here is a more
> detailed re-cap:
>
> 1) keep MD_prof definition as it is today; also keep using the
> frequency propagation as it is (assuming programs with irreducible
> loops are not common and not important. If it turns out to be
> otherwise, we will revisit this).
> 2) fix all problems that lead to wrong 'frequency/count' computed
from
> the frequency propagation algorithm
>     2.1) relax 32bit limit
>     2.2) remove 4096 iteration count cap
>     2.3) remove the 'laplace rule of succession' which can be very
harmful
>     2.4) Fix the bug in 
'BranchProbabilityInfo::calcMetadataWeights'
> that does capping without scaling
>     2.5) Fix a similar bug in
'MachineBranchProbabilityInfo::getSumForBlock('
>      .. other bugs found
> 3) introduce a new meta data to represent function entry global
> hotness (the implementation can choose to use entry execution count)
> 4) implement frequency/profile update interfaces for Inline
> transformations -- i.e., allowing importing callee's profile info into
> caller with scaling factor computed from callsite count and callee
> entry count
> 5) Implement mechanism to compute, record and use profile program
> summary data (This will be discussed separately).This sounds like a very workable approach to me.  I think 2.3 needs a 
bit more discussion, but given it doesn't really directly effect me, I'm
probably going to stay out of it for now.

Thanks for working on this!>
>
> thanks,
>
> David
>
>
>
> On Tue, Mar 24, 2015 at 9:59 AM, Philip Reames
> <listmail at philipreames.com> wrote:
>> On 03/10/2015 10:14 AM, Diego Novillo wrote:
>>
>>
>>
>> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at
apple.com> wrote:
>>>
>>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at
google.com> wrote:
>>>
>>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at
google.com>
>>> wrote:
>>>
>>>> I've created a few bugzilla issues with details of some of
the things
>>>> I'll be looking into. I'm not yet done wordsmithing the
overall design
>>>> document. I'll try to finish it by early next week at the
latest.
>>>
>>> The document is available at
>>>
>>>
>>>
https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>>
>>> There are several topics covered. Ideally, I would prefer that we
discuss
>>> each topic separately. The main ones I will start working on are
the ones
>>> described in the bugzilla links we have in the doc.
>>>
>>> This is just a starting point for us. I am not at all concerned
with
>>> implementing exactly what is proposed in the document. In fact, if
we can
>>> get the same value using the existing support, all the better.
>>>
>>> OTOH, any other ideas that folks may have that work better than
this are
>>> more than welcome. I don't have really strong opinions on the
matter. I am
>>> fine with whatever works.
>>>
>>>
>>> Thanks for the detailed write-up on this. Some of the issues
definitely
>>> need to be addressed. I am concerned, though, that some of the
ideas may be
>>> leading toward a scenario where we have essentially two completely
different
>>> ways of representing profile information in LLVM IR. It is great to
have two
>>> complementary approaches to collecting profile data, but two
representations
>>> in the IR would not make sense.
>>
>> Yeah, I don't think I'll continue to push for a new MD_count
attribute. If
>> we were to make MD_prof be a "real" execution count, that
would be enough.
>> Note that by re-using MD_prof we are not changing its meaning at all.
The
>> execution count is still a weight and the ratio is still branch
probability.
>> All that we are changing are the absolute values of the number and
>> increasing its data type width to remove the 32bit limitation.
>>
>> Independent of everything else, relaxing the 32 bit restriction is
clearly a
>> good idea.  This would make a great standalone patch.
>>
>>
>>
>>> The first issue raised is that profile execution counts are not
>>> represented in the IR. This was a very intentional decision. I know
it goes
>>> against what other compilers have done in the past. It took me a
while to
>>> get used to the idea when Andy first suggested it, so I know it
seems
>>> awkward at first. The advantage is that branch probabilities are
much easier
>>> to keep updated in the face of compiler transformations, compared
to
>>> execution counts.
>>
>> Sorry. I don't follow. Updating counts as the CFG is transformed is
not
>> difficult at all. What examples do you have in mind?  The big advantage
of
>> making MD_prof an actual execution count is that it is a meaningful
metric
>> wrt scaling and transformation.
>>
>> Say, for instance, that we have a branch instruction with two targets
with
>> counts {100, 300} inside a function 'foo' that has entry count
2. The edge
>> probability for the first edge (count 100) is 100/(100+300) = 25%.
>>
>> If we inline foo() inside another function bar() at a callsite with
profile
>> count == 1, the cloned branch instruction gets its counters scaled with
the
>> callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2}
>> {50, 150}.  But the branch probability did not change. Currently, we
are
>> cloning the branch without changing the edge weights.
>>
>> This scaling is not difficult at all and can be incrementally very
quickly.
>> We cannot afford to recompute all frequencies on the fly because it
would be
>> detrimental to compile time. If foo() itself has N callees inlined into
it,
>> each inlined callee needs to trigger a re-computation. When foo() is
inlined
>> into bar(), the frequencies will need to be recomputed for foo() and
all N
>> callees inlined into foo().
>>
>> It really sounds like your proposal is to essentially eagerly compute
>> scaling rather than lazyily compute it on demand.  Assuming perfect
>> implementations for both (with no rounding losses), the results should
be
>> the same.  Is that a correct restatement?  I'm going to hold off on
>> responding to why that's a bad idea until you confirm, because
I'm not sure
>> I follow what you're trying to say.  :)
>>
>> Also, trusting exact entry counts is going to be somewhat suspect. 
These
>> are *highly* susceptible to racy updates, overflow, etc...  Anything
which
>> puts too much implicit trust in these numbers is going to be
problematic.
>>
>>
>>
>>> We are definitely missing the per-function execution counts that
are
>>> needed to be able to compare relative “hotness” across functions,
and I
>>> think that would be a good place to start making improvements. In
the long
>>> term, we should keep our options open to making major changes, but
before we
>>> go there, we should try to make incremental improvements to fix the
existing
>>> infrastructure.
>>
>> Right, and that's the core of our proposal. We don't really
want to make
>> major infrastructure changes at this point. The only thing I'd like
to
>> explore is making MD_prof a real count. This will be useful for the
inliner
>> changes and it would also make incremental updates easier, because the
>> scaling that needs to be done is very straightforward and quick.
>>
>> Note that this change should not modify the current behaviour we get
from
>> profile analysis. Things that were hot before should continue to be hot
now.
>>
>> I have no objection to adding a mechanism for expressing an entry
count.  I
>> am still very hesitant about the proposals with regards to redefining
the
>> current MD_prof.
>>
>> I'd encourage you to post a patch for the entry count mechanism,
but not tie
>> its semantics to exact execution count.  (Something like "the
value provided
>> must correctly describe the relative hotness of this routine against
others
>> in the program annoatated with the same metadata.  It is the relative
>> scaling that is important, not the absolute value.  In particular, the
value
>> need not be an exact execution count.")
>>
>>
>>> Many of the other issues you raise seem like they could also be
addressed
>>> without major changes to the existing infrastructure. Let’s try to
fix those
>>> first.
>>
>> That's exactly the point of the proposal.  We definitely don't
want to make
>> major changes to the infrastructure at first. My thinking is to start
>> working on making MD_prof a real count. One of the things that are
happening
>> is that the combination of real profile data plus the frequency
propagation
>> that we are currently doing is misleading the analysis.
>>
>> I consider this a major change.  You're trying to redefine a major
part of
>> the current system.
>>
>> Multiple people have spoken up and objected to this change (as
currently
>> described).  Please start somewhere else.
>>
>>
>> For example (thanks David for the code and data). In the following
code:
>>
>> int g;
>> __attribute__((noinline)) void bar() {
>>   g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>    int i, j, k;
>>
>>    g = 0;
>>
>>    // Loop 1.
>>    for (i = 0; i < 100; i++)
>>      for (j = 0; j < 100; j++)
>>         for (k = 0; k < 100; k++)
>>             bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>>
>>    // Loop 2.
>>    for (i = 0; i < 100; i++)
>>      for (j = 0; j < 10000; j++)
>>          bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>>
>>
>>    // Loop 3.
>>    for (i = 0; i < 1000000; i++)
>>      bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>> }
>>
>> When compiled with profile instrumentation, frequency propagation is
>> distorting the real profile because it gives different frequency to the
>> calls to bar() in the 3 different loops. All 3 loops execute 1,000,000
>> times, but after frequency propagation, the first call to bar() gets a
>> weight of 520,202 in loop #1, 210,944 in  loop #2 and 4,096 in loop #3.
In
>> reality, every call to bar() should have a weight of 1,000,000.
>>
>> Duncan responded to this.  My conclusion from his response: this is a
bug,
>> not a fundamental issue.  Remove the max scaling factor, switch the
counts
>> to 64 bits and everything should be fine.  If you disagree, let's
discuss.
>>
>>
>>
>> Thanks.  Diego.
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>

llvm dev - Mar 2015 - [LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support