Diego Novillo
2015-Apr-24 18:18 UTC
[LLVMdev] Loss of precision with very large branch weights
In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies. For example, in these two loops:

int g = 0;
__attribute__((noinline)) void bar() {
  g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;

  for (i = 0; i < 500000; i++)
    bar();

  printf ("g = %d\n", g);
}

I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
 - entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 500001.5, int = 4000011
 - for.body: float = 500000.5, int = 4000003
 - for.inc: float = 500000.5, int = 4000003
 - for.end: float = 1.0, int = 8
 - for.cond1: float = 250001.5, int = 2000011
 - for.body3: float = 250000.5, int = 2000003
 - for.inc4: float = 250000.5, int = 2000003
 - for.end6: float = 1.0, int = 8

But if I manually modify the frequencies of both loops to get close to MAX_INT32, the ratios between the frequencies no longer reflect reality. For example, if I change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647 in the second:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
 - entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 1073741824.4, int = 8589934595
 - for.body: float = 1073741823.4, int = 8589934587
 - for.inc: float = 1073741823.4, int = 8589934587
 - for.end: float = 1.0, int = 8
 - for.cond1: float = 1073741824.4, int = 8589934595
 - for.body3: float = 1073741823.4, int = 8589934587
 - for.inc4: float = 1073741823.4, int = 8589934587
 - for.end6: float = 1.0, int = 8

Now both loops are considered equally hot.

Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure whether I'm missing something else here.

Thanks.  Diego.
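For concreteness, the manual edit described above amounts to changing the !prof metadata on each loop's condition branch in the .ll file. A rough sketch of what that edit looks like follows; the value and metadata-node names (%cmp, %cmp2, !1, !2) are illustrative assumptions, and the small exit-edge weight of 2 is read off the -branch-prob output shown later in the thread:

  ; for.cond: first loop's condition branch
  br i1 %cmp, label %for.body, label %for.end, !prof !1

  ; for.cond1: second loop's condition branch
  br i1 %cmp2, label %for.body3, label %for.end6, !prof !2

  !1 = !{!"branch_weights", i32 4294967295, i32 2}
  !2 = !{!"branch_weights", i32 2147483647, i32 2}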
Xinliang David Li
2015-Apr-24 18:32 UTC
[LLVMdev] Loss of precision with very large branch weights
Since we have decided to capture the global hotness using the function entry count, it is OK to cap the max weight as of today. I think the remaining bug in PR22718 is that when capping happens, the scaling is not done. That needs to be fixed.

The efficiency of the interface is a different issue.

David

On Fri, Apr 24, 2015 at 11:18 AM, Diego Novillo <dnovillo at google.com> wrote:
>
> In PR 22718, we are looking at issues with long-running applications
> producing non-representative frequencies. For example, in these two loops:
>
> int g = 0;
> __attribute__((noinline)) void bar() {
>   g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   for (i = 0; i < 500000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
> }
>
> I expect the loop body frequency for the second loop to be about half of the
> first one. This holds fine for this test case:
>
> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 500001.5, int = 4000011
>  - for.body: float = 500000.5, int = 4000003
>  - for.inc: float = 500000.5, int = 4000003
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 250001.5, int = 2000011
>  - for.body3: float = 250000.5, int = 2000003
>  - for.inc4: float = 250000.5, int = 2000003
>  - for.end6: float = 1.0, int = 8
>
>
> But if I manually modify the frequencies of both loops to get close to MAX_INT32,
> the ratios between the frequencies no longer reflect reality. For example, if I
> change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647
> in the second:
>
> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 1073741824.4, int = 8589934595
>  - for.body: float = 1073741823.4, int = 8589934587
>  - for.inc: float = 1073741823.4, int = 8589934587
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 1073741824.4, int = 8589934595
>  - for.body3: float = 1073741823.4, int = 8589934587
>  - for.inc4: float = 1073741823.4, int = 8589934587
>  - for.end6: float = 1.0, int = 8
>
> Now both loops are considered equally hot.
>
> Duncan, I think that if I were to make branch_weights a 64-bit integer, this
> would not be an issue. But I'm not sure whether I'm missing something else
> here.
>
>
> Thanks.  Diego.
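A back-of-the-envelope illustration of the capping David is describing, using the weights from Diego's example; the per-edge limit of 2147483647 assumed here is UINT32_MAX divided by the branch's two successors, which matches the 2147483647 / 2147483649 probabilities shown later in the thread:

  first loop   4294967295 : 2  --clamp each weight-->  2147483647 : 2
  second loop  2147483647 : 2  --clamp each weight-->  2147483647 : 2

Because each weight is clamped independently, both loops end up with identical weights and look equally hot. Scaling the first loop's weights down proportionally instead (to roughly 2147483647 : 1) would keep its roughly 2x stronger bias toward the loop body visible.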
Diego Novillo
2015-Apr-24 18:36 UTC
[LLVMdev] Loss of precision with very large branch weights
On Fri, Apr 24, 2015 at 2:32 PM, Xinliang David Li <davidxl at google.com> wrote:

> Since we have decided to capture the global hotness using the function
> entry count, it is OK to cap the max weight as of today. I think the
> remaining bug in PR22718 is that when capping happens, the scaling is
> not done. That needs to be fixed.

The problem here is that the two loops show up equally hot. So the function entry count for bar() won't really help, since it seems like both loops call bar() equally frequently.

> The efficiency of the interface is a different issue.

Yeah. I'm looking at the hotness calculation first.

Diego.
Duncan P. N. Exon Smith
2015-Apr-24 19:40 UTC
[LLVMdev] Loss of precision with very large branch weights
> On 2015-Apr-24, at 11:18, Diego Novillo <dnovillo at google.com> wrote:
>
>
> In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies. For example, in these two loops:
>
> int g = 0;
> __attribute__((noinline)) void bar() {
>   g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   for (i = 0; i < 500000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
> }
>
> I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:
>
> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 500001.5, int = 4000011
>  - for.body: float = 500000.5, int = 4000003
>  - for.inc: float = 500000.5, int = 4000003
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 250001.5, int = 2000011
>  - for.body3: float = 250000.5, int = 2000003
>  - for.inc4: float = 250000.5, int = 2000003
>  - for.end6: float = 1.0, int = 8
>
>
> But if I manually modify the frequencies of both loops to get close to MAX_INT32, the ratios between the frequencies no longer reflect reality. For example, if I change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647 in the second:

What does -analyze -branch-prob give?

> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 1073741824.4, int = 8589934595
>  - for.body: float = 1073741823.4, int = 8589934587
>  - for.inc: float = 1073741823.4, int = 8589934587
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 1073741824.4, int = 8589934595
>  - for.body3: float = 1073741823.4, int = 8589934587
>  - for.inc4: float = 1073741823.4, int = 8589934587
>  - for.end6: float = 1.0, int = 8
>
> Now both loops are considered equally hot.
>
> Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure whether I'm missing something else here.

I don't understand where the fidelity is being lost; this seems strange to me. 4B fits inside UINT32_MAX, so I don't see how we hit a cap here.

But certainly there *is* a limit. We can't store a ratio larger than ~4B:1. Is this really a problem? I understand that you can't tell which loop is hotter, but why does that matter? Is there some optimization that will do the wrong thing? Which one?

Regarding changing branch weights to 64 bits, I think we should be sure it's valuable before we change it. I imagine it's doable, but it'll complicate logic in a few places that don't currently have to worry about overflowing `uint64_t`.
Xinliang David Li
2015-Apr-24 19:51 UTC
[LLVMdev] Loss of precision with very large branch weights
On Fri, Apr 24, 2015 at 12:40 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>
>> On 2015-Apr-24, at 11:18, Diego Novillo <dnovillo at google.com> wrote:
>>
>>
>> In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies. For example, in these two loops:
>>
>> int g = 0;
>> __attribute__((noinline)) void bar() {
>>   g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>   int i, j, k;
>>
>>   for (i = 0; i < 1000000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>>
>>   for (i = 0; i < 500000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>> }
>>
>> I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:
>>
>> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
>> Printing analysis 'Block Frequency Analysis' for function 'bar':
>> block-frequency-info: bar
>>  - entry: float = 1.0, int = 8
>>
>> Printing analysis 'Block Frequency Analysis' for function 'main':
>> block-frequency-info: main
>>  - entry: float = 1.0, int = 8
>>  - for.cond: float = 500001.5, int = 4000011
>>  - for.body: float = 500000.5, int = 4000003
>>  - for.inc: float = 500000.5, int = 4000003
>>  - for.end: float = 1.0, int = 8
>>  - for.cond1: float = 250001.5, int = 2000011
>>  - for.body3: float = 250000.5, int = 2000003
>>  - for.inc4: float = 250000.5, int = 2000003
>>  - for.end6: float = 1.0, int = 8
>>
>>
>> But if I manually modify the frequencies of both loops to get close to MAX_INT32, the ratios between the frequencies no longer reflect reality. For example, if I change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647 in the second:
>
> What does -analyze -branch-prob give?
>
>> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
>> Printing analysis 'Block Frequency Analysis' for function 'bar':
>> block-frequency-info: bar
>>  - entry: float = 1.0, int = 8
>>
>> Printing analysis 'Block Frequency Analysis' for function 'main':
>> block-frequency-info: main
>>  - entry: float = 1.0, int = 8
>>  - for.cond: float = 1073741824.4, int = 8589934595
>>  - for.body: float = 1073741823.4, int = 8589934587
>>  - for.inc: float = 1073741823.4, int = 8589934587
>>  - for.end: float = 1.0, int = 8
>>  - for.cond1: float = 1073741824.4, int = 8589934595
>>  - for.body3: float = 1073741823.4, int = 8589934587
>>  - for.inc4: float = 1073741823.4, int = 8589934587
>>  - for.end6: float = 1.0, int = 8
>>
>> Now both loops are considered equally hot.
>>
>> Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure whether I'm missing something else here.
>
> I don't understand where the fidelity is being lost; this seems
> strange to me. 4B fits inside UINT32_MAX, so I don't see how we
> hit a cap here.

The max weight is UINT32_MAX / BB->getTerminator()->getNumSuccessors().

> But certainly there *is* a limit. We can't store a ratio larger
> than ~4B:1. Is this really a problem?

Right, this is not the issue. The issue is that when one branch's weight reaches the cap, the other branch's weight is not scaled accordingly:

  uint32_t WeightLimit = getMaxWeightFor(BB);
  Weights.reserve(TI->getNumSuccessors());
  for (unsigned i = 1, e = WeightsNode->getNumOperands(); i != e; ++i) {
    ConstantInt *Weight =
        mdconst::dyn_extract<ConstantInt>(WeightsNode->getOperand(i));
    if (!Weight)
      return false;
    Weights.push_back(
        std::max<uint32_t>(1, Weight->getLimitedValue(WeightLimit)));  // <---- may hit limit
  }

> I understand that you can't
> tell which loop is hotter, but why does that matter? Is there some
> optimization that will do the wrong thing? Which one?

Two calls (in the loops) that are not equally hot may be considered equally hot, which confuses the cost/benefit analysis. This is just an example. Our goal is to make sure the profile data is faithfully kept. Which pass uses it, and whether it is used now or in the future, is a totally different matter.

> Regarding changing branch weights to 64 bits, I think we should be
> sure it's valuable before we change it. I imagine it's doable, but
> it'll complicate logic in a few places that don't currently have to
> worry about overflowing `uint64_t`.

I think we can fix this without using 64-bit weights.

David
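A minimal sketch of the scaling David is suggesting (not the actual LLVM change; the helper name and std::vector signature are illustrative assumptions): when any weight exceeds the per-edge limit, divide all of the branch's weights by a common factor so the largest one fits, instead of clamping each weight on its own.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Rescale all weights of one branch by the same factor so the largest
  // weight fits under WeightLimit; the ratios between the edges survive,
  // up to rounding.
  static void scaleBranchWeights(std::vector<uint32_t> &Weights,
                                 uint32_t WeightLimit) {
    if (Weights.empty())
      return;
    uint32_t MaxWeight = *std::max_element(Weights.begin(), Weights.end());
    if (MaxWeight <= WeightLimit)
      return; // Everything already fits; no scaling needed.
    for (uint32_t &W : Weights) {
      // Use a 64-bit intermediate so W * WeightLimit cannot overflow, and
      // keep a floor of 1 so a taken edge never collapses to weight 0.
      uint64_t Scaled = uint64_t(W) * WeightLimit / MaxWeight;
      W = uint32_t(std::max<uint64_t>(1, Scaled));
    }
  }

With this kind of scaling, the first loop's weights of {4294967295, 2} would come out as roughly {2147483647, 1}, while the second loop's {2147483647, 2} would be left untouched, so the first loop would still look about twice as strongly biased toward its body.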
Diego Novillo
2015-Apr-24 20:06 UTC
[LLVMdev] Loss of precision with very large branch weights
On Fri, Apr 24, 2015 at 3:40 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>
> What does -analyze -branch-prob give?

Similar indication of hotness. Both loops appear equally hot.

$ bin/opt -analyze -branch-prob -S unbiased-branches.ll
Printing analysis 'Branch Probability Analysis' for function 'bar':
---- Branch Probabilities ----
Printing analysis 'Branch Probability Analysis' for function 'main':
---- Branch Probabilities ----
  edge entry -> for.cond probability is 16 / 16 = 100% [HOT edge]
  edge for.cond -> for.body probability is 2147483647 / 2147483649 = 100% [HOT edge]
  edge for.cond -> for.end probability is 2 / 2147483649 = 9.31323e-08%
  edge for.body -> for.inc probability is 16 / 16 = 100% [HOT edge]
  edge for.inc -> for.cond probability is 124 / 124 = 100% [HOT edge]
  edge for.end -> for.cond1 probability is 16 / 16 = 100% [HOT edge]
  edge for.cond1 -> for.body3 probability is 2147483647 / 2147483649 = 100% [HOT edge]
  edge for.cond1 -> for.end6 probability is 2 / 2147483649 = 9.31323e-08%
  edge for.body3 -> for.inc4 probability is 16 / 16 = 100% [HOT edge]
  edge for.inc4 -> for.cond1 probability is 124 / 124 = 100% [HOT edge]

> But certainly there *is* a limit. We can't store a ratio larger
> than ~4B:1. Is this really a problem? I understand that you can't
> tell which loop is hotter, but why does that matter? Is there some
> optimization that will do the wrong thing? Which one?

It will matter for cost analysis in the inliner, for instance. I don't think this matters in the few places where the profile is used now. But long term, we want to make sure that we have a good measure of relative temperatures within a CFG and across CFGs.

> Regarding changing branch weights to 64 bits, I think we should be
> sure it's valuable before we change it. I imagine it's doable, but
> it'll complicate logic in a few places that don't currently have to
> worry about overflowing `uint64_t`.

Sure, OK. I'll see about making sure we scale things properly during frequency computation.

Diego.