thr3ads.net - llvm dev - [LLVMdev] RFC - Improvements to PGO profile support [Mar 2015]

If this information is useful, please help other people find it:
Share via:

Bob Wilson

2015-Mar-24 21:47 UTC

[LLVMdev] RFC - Improvements to PGO profile support

> On Mar 24, 2015, at 12:08 PM, Xinliang David Li <davidxl at
google.com> wrote:
> 
> On Tue, Mar 24, 2015 at 10:54 AM, Bob Wilson <bob.wilson at
apple.com> wrote:
>> 
>>> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnovillo at
google.com> wrote:
>>> 
>>> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.wilson at
apple.com> wrote:
>>>> 
>>>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <davidxl
at google.com> wrote:
>>>>> 
>>>>> 2.3) remove the 'laplace rule of succession' which
can be very harmful
>>>> 
>>>> I have not seen any consensus on this. I’m not at all convinced
this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids
really bad behavior in cases where the counts are very small and has almost no
impact when the counts are large.
>>> 
>>> Duncan and I chatted briefly about this on IRC. I'm going to
>>> experiment with only applying laplace smoothing if any scaled
weights
>>> are 0. AFAIU, usage of this rule is really just trying to avoid
having
>>> downstream users deal with 0 weights. Duncan, please whack me with
a
>>> clue stick if I totally misrepresented our conclusions.
>> 
>> Zero is the extreme case, but it’s also important for other small
counts. I’d like to see some specific examples of why you think the current
behavior is harmful.
> 
> Using the rule has at least the following bad side effects:
> 1) wrong estimation of average loop trip count -- possibly leading to
> wrong unroll decisions
I remain skeptical. If the loop only ran once in your profile run, how much
confidence do you have in the trip count? The LaPlace rule will tend to make the
compiler more conservative in exactly the cases where there is insufficient
information to be confident about what to do.
> 2) result in bad inlining decisions. For instance:
>    for (...)
>        bar();  // (1)
> 
> where (1) is the only callsite to bar().   Using the rule, BB count
> enclosing the call to bar() can be as low as half of the entry count
> of bar().  Inliner will get confused and think there are more hot
> callsites to 'bar' and  make suboptimal decisions ..
> 
> Also if bar has calls to other functions, those callsites will look
> hotter than the call to 'bar' …
Your own proposal for recording entry counts is to record “relative hotness”,
not absolute profile counts. On the caller’s side, we’ve got a branch weight
influenced by LaPlace’s rule that is then used to compute BlockFrequency and
you’re concerned about a mismatch between that the “relative hotness” recorded
for the callee??
> 
> The attached are two cases as well as the frequency graph computed
> today (with the laplace rule) and the correct frequency expected.
I’d be a lot more interested to see a real-world example.

Xinliang David Li

2015-Mar-24 22:22 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

On Tue, Mar 24, 2015 at 2:47 PM, Bob Wilson <bob.wilson at apple.com>
wrote:>
>> On Mar 24, 2015, at 12:08 PM, Xinliang David Li <davidxl at
google.com> wrote:
>>
>> On Tue, Mar 24, 2015 at 10:54 AM, Bob Wilson <bob.wilson at
apple.com> wrote:
>>>
>>>> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnovillo at
google.com> wrote:
>>>>
>>>> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.wilson at
apple.com> wrote:
>>>>>
>>>>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li
<davidxl at google.com> wrote:
>>>>>>
>>>>>> 2.3) remove the 'laplace rule of succession'
which can be very harmful
>>>>>
>>>>> I have not seen any consensus on this. I’m not at all
convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule
avoids really bad behavior in cases where the counts are very small and has
almost no impact when the counts are large.
>>>>
>>>> Duncan and I chatted briefly about this on IRC. I'm going
to
>>>> experiment with only applying laplace smoothing if any scaled
weights
>>>> are 0. AFAIU, usage of this rule is really just trying to avoid
having
>>>> downstream users deal with 0 weights. Duncan, please whack me
with a
>>>> clue stick if I totally misrepresented our conclusions.
>>>
>>> Zero is the extreme case, but it’s also important for other small
counts. I’d like to see some specific examples of why you think the current
behavior is harmful.
>>
>> Using the rule has at least the following bad side effects:
>> 1) wrong estimation of average loop trip count -- possibly leading to
>> wrong unroll decisions
>
> I remain skeptical. If the loop only ran once in your profile run, how much
confidence do you have in the trip count? The LaPlace rule will tend to make the
compiler more conservative in exactly the cases where there is insufficient
information to be confident about what to do.
Training run is expensive, it is common for PGO users to reduce
workload as much as possible without sacrificing the
representativeness of the workload.

Workload 1:  loop run once, iterating 100 times in total
workload 2:  loop run twice, iterating 200 times in total

Which workload  will user prefer to use?

Without the rule, the two workload at least produces consistent
profile data.  With the Laplace rule, you get 50 in one case, and 66
in the other.

Having some technology to improve confidence of the profile data is
fine, but I don't see
1) how laplace rule is good for it
2) why this can not be done in the consumer side (i.e., faithfully
record the profile data).

>
>> 2) result in bad inlining decisions. For instance:
>>    for (...)
>>        bar();  // (1)
>>
>> where (1) is the only callsite to bar().   Using the rule, BB count
>> enclosing the call to bar() can be as low as half of the entry count
>> of bar().  Inliner will get confused and think there are more hot
>> callsites to 'bar' and  make suboptimal decisions ..
>>
>> Also if bar has calls to other functions, those callsites will look
>> hotter than the call to 'bar' …
>
> Your own proposal for recording entry counts is to record “relative
hotness”, not absolute profile counts.
The proposal is to record 'global hotness' that can used to compare
relative hotness across procedural boundaries (e.g. callsites in
different callers). Profile counts satisfies this condition.
>On the caller’s side, we’ve got a branch weight influenced by LaPlace’s rule
that is then used to compute BlockFrequency and you’re concerned about a
mismatch between that the “relative hotness” recorded for the callee??
Basically, say the caller is test()

bar(){
  // ENTRY count =  100 (from profile data)
  // ENTRY freq = 1

  // BB2: Freq(BB2) = 1, count = 100
  foo ();              (2)
}


test() {
   // ENTRY count = 1 (from profile data)
   // Entry Freq = 1
   for (i = 0; i < 100; i++) {
       // BB1: Freq(BB1) = 50 due to Laplace rule
       bar();  // Freq = 50, count = 50    (1)
    }
}

With laplace rule, the block freq computed for bar's enclosing BB will
be wrong -- as a result, the bar's enclosing BB's count will  be wrong
too: 50*1/1 = 50.

The global hotness of call site (1) & (2) should be the same, but
distorted when Laplace rule is used.

Yes, we care about using PGO across routine boundaries for IPO.
>
>>
>> The attached are two cases as well as the frequency graph computed
>> today (with the laplace rule) and the correct frequency expected.
>
> I’d be a lot more interested to see a real-world example.
See my reply above. On the other hand, I'd like to see examples where
LaPlace Rule can actually help improve the profile data quality.

thanks,

David

Bob Wilson

2015-Mar-26 04:32 UTC

head link

[LLVMdev] RFC - Improvements to PGO profile support

> On Mar 24, 2015, at 3:22 PM, Xinliang David Li <davidxl at
google.com> wrote:
> 
> On Tue, Mar 24, 2015 at 2:47 PM, Bob Wilson <bob.wilson at apple.com
<mailto:bob.wilson at apple.com>> wrote:
>> 
>>> On Mar 24, 2015, at 12:08 PM, Xinliang David Li <davidxl at
google.com> wrote:
>>> 
>>> On Tue, Mar 24, 2015 at 10:54 AM, Bob Wilson <bob.wilson at
apple.com> wrote:
>>>> 
>>>>> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnovillo at
google.com> wrote:
>>>>> 
>>>>> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.wilson
at apple.com> wrote:
>>>>>> 
>>>>>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li
<davidxl at google.com> wrote:
>>>>>>> 
>>>>>>> 2.3) remove the 'laplace rule of
succession' which can be very harmful
>>>>>> 
>>>>>> I have not seen any consensus on this. I’m not at all
convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule
avoids really bad behavior in cases where the counts are very small and has
almost no impact when the counts are large.
>>>>> 
>>>>> Duncan and I chatted briefly about this on IRC. I'm
going to
>>>>> experiment with only applying laplace smoothing if any
scaled weights
>>>>> are 0. AFAIU, usage of this rule is really just trying to
avoid having
>>>>> downstream users deal with 0 weights. Duncan, please whack
me with a
>>>>> clue stick if I totally misrepresented our conclusions.
>>>> 
>>>> Zero is the extreme case, but it’s also important for other
small counts. I’d like to see some specific examples of why you think the
current behavior is harmful.
>>> 
>>> Using the rule has at least the following bad side effects:
>>> 1) wrong estimation of average loop trip count -- possibly leading
to
>>> wrong unroll decisions
>> 
>> I remain skeptical. If the loop only ran once in your profile run, how
much confidence do you have in the trip count? The LaPlace rule will tend to
make the compiler more conservative in exactly the cases where there is
insufficient information to be confident about what to do.
> 
> Training run is expensive, it is common for PGO users to reduce
> workload as much as possible without sacrificing the
> representativeness of the workload.
> 
> Workload 1:  loop run once, iterating 100 times in total
> workload 2:  loop run twice, iterating 200 times in total
> 
> Which workload  will user prefer to use?
Which workload is better? I don’t at all trust users to get this right, at least
for real, non-benchmark code.
> 
> Without the rule, the two workload at least produces consistent
> profile data.  With the Laplace rule, you get 50 in one case, and 66
> in the other.
Yes, but you’ve got more information in one case than the other. This is a
feature IMO, not a bug. It’s entirely possible that with workload 2, the loop
may have executed for a drastically different number of iterations. The fact
that it did not, i.e., that it was consistent with workload 1, is more
information that you did not have before. It makes sense for the compiler to be
more aggressive when it has more data.
> 
> Having some technology to improve confidence of the profile data is
> fine, but I don't see
> 1) how laplace rule is good for it
What do you not understand about it? As the counts get larger, LaPlace’s rule
fades into the noise. It only makes a difference for cases where some of the
counts are *very* small, and in those cases, it very simply adjust the weights
to make optimizations less aggressive.
> 2) why this can not be done in the consumer side (i.e., faithfully
> record the profile data).
What does this have to do with how faithfully the profile is recorded? We’ve got
fully accurate data, but if the profiling inputs are too small or not
representative, you may still get poor optimization choices.
> 
> 
>> 
>>> 2) result in bad inlining decisions. For instance:
>>>   for (...)
>>>       bar();  // (1)
>>> 
>>> where (1) is the only callsite to bar().   Using the rule, BB count
>>> enclosing the call to bar() can be as low as half of the entry
count
>>> of bar().  Inliner will get confused and think there are more hot
>>> callsites to 'bar' and  make suboptimal decisions ..
>>> 
>>> Also if bar has calls to other functions, those callsites will look
>>> hotter than the call to 'bar' …
>> 
>> Your own proposal for recording entry counts is to record “relative
hotness”, not absolute profile counts.
> 
> The proposal is to record 'global hotness' that can used to compare
> relative hotness across procedural boundaries (e.g. callsites in
> different callers). Profile counts satisfies this condition.
> 
>> On the caller’s side, we’ve got a branch weight influenced by LaPlace’s
rule that is then used to compute BlockFrequency and you’re concerned about a
mismatch between that the “relative hotness” recorded for the callee??
> 
> Basically, say the caller is test()
> 
> bar(){
>  // ENTRY count =  100 (from profile data)
>  // ENTRY freq = 1
> 
>  // BB2: Freq(BB2) = 1, count = 100
>  foo ();              (2)
> }
> 
> 
> test() {
>   // ENTRY count = 1 (from profile data)
>   // Entry Freq = 1
>   for (i = 0; i < 100; i++) {
>       // BB1: Freq(BB1) = 50 due to Laplace rule
>       bar();  // Freq = 50, count = 50    (1)
>    }
> }
> 
> With laplace rule, the block freq computed for bar's enclosing BB will
> be wrong -- as a result, the bar's enclosing BB's count will  be
wrong
> too: 50*1/1 = 50.
> 
> The global hotness of call site (1) & (2) should be the same, but
> distorted when Laplace rule is used.
> 
> Yes, we care about using PGO across routine boundaries for IPO.
I understand the issue, but my point was that you should simply not do that.
You’re objecting to LaPlace’s rule based on a hypothetical comparison of block
frequencies vs. entry counts. There is nothing in LLVM that does that now. We
don’t even have entry counts.

I don’t see how you can argue that LaPlace’s rule is bad because it could affect
an apples vs. oranges comparison of something that does not even exist yet.
> 
>> 
>>> 
>>> The attached are two cases as well as the frequency graph computed
>>> today (with the laplace rule) and the correct frequency expected.
>> 
>> I’d be a lot more interested to see a real-world example.
> 
> See my reply above. On the other hand, I'd like to see examples where
> LaPlace Rule can actually help improve the profile data quality.
It’s not about improving the results — it’s about preventing clang from being
overly aggressive about optimizing based on limited profile data.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<lists.llvm.org/pipermail/llvm-dev/attachments/20150325/57443f23/attachment.html>

llvm dev - Mar 2015 - [LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support

[LLVMdev] RFC - Improvements to PGO profile support