Diego Novillo
2015-Apr-24 18:18 UTC
[LLVMdev] Loss of precision with very large branch weights
In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies. For example, in these two loops:

int g = 0;
__attribute__((noinline)) void bar() {
  g++;
}

extern int printf(const char*, ...);

int main()
{
  int i, j, k;

  for (i = 0; i < 1000000; i++)
    bar();

  printf ("g = %d\n", g);
  g = 0;

  for (i = 0; i < 500000; i++)
    bar();

  printf ("g = %d\n", g);
}

I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
 - entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 500001.5, int = 4000011
 - for.body: float = 500000.5, int = 4000003
 - for.inc: float = 500000.5, int = 4000003
 - for.end: float = 1.0, int = 8
 - for.cond1: float = 250001.5, int = 2000011
 - for.body3: float = 250000.5, int = 2000003
 - for.inc4: float = 250000.5, int = 2000003
 - for.end6: float = 1.0, int = 8

But if I manually modify the frequencies of both loops to get close to MAX_INT32, the ratios between the frequencies no longer reflect reality. For example, if I change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647 in the second:

$ bin/opt -analyze -block-freq -S unbiased-branches.ll
Printing analysis 'Block Frequency Analysis' for function 'bar':
block-frequency-info: bar
 - entry: float = 1.0, int = 8

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 1073741824.4, int = 8589934595
 - for.body: float = 1073741823.4, int = 8589934587
 - for.inc: float = 1073741823.4, int = 8589934587
 - for.end: float = 1.0, int = 8
 - for.cond1: float = 1073741824.4, int = 8589934595
 - for.body3: float = 1073741823.4, int = 8589934587
 - for.inc4: float = 1073741823.4, int = 8589934587
 - for.end6: float = 1.0, int = 8

Now both loops are considered equally hot.

Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure whether I'm missing something else here.

Thanks.  Diego.
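For concreteness, the manual edit described above amounts to changing the !prof metadata on each loop's condition branch in the .ll file. A rough sketch of what that edit looks like follows; the value and metadata-node names (%cmp, %cmp2, !1, !2) are illustrative assumptions, and the small exit-edge weight of 2 is read off the -branch-prob output shown later in the thread:

  ; for.cond: first loop's condition branch
  br i1 %cmp, label %for.body, label %for.end, !prof !1

  ; for.cond1: second loop's condition branch
  br i1 %cmp2, label %for.body3, label %for.end6, !prof !2

  !1 = !{!"branch_weights", i32 4294967295, i32 2}
  !2 = !{!"branch_weights", i32 2147483647, i32 2}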
Xinliang David Li
2015-Apr-24 18:32 UTC
[LLVMdev] Loss of precision with very large branch weights
Since we have decided to capture the global hotness using the function entry count, it is OK to cap the max weight as of today. I think the remaining bug in PR22718 is that when capping happens, the scaling is not done. That needs to be fixed.

The efficiency of the interface is a different issue.

David

On Fri, Apr 24, 2015 at 11:18 AM, Diego Novillo <dnovillo at google.com> wrote:
>
> In PR 22718, we are looking at issues with long-running applications
> producing non-representative frequencies. For example, in these two loops:
>
> int g = 0;
> __attribute__((noinline)) void bar() {
>   g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   for (i = 0; i < 500000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
> }
>
> I expect the loop body frequency for the second loop to be about half of the
> first one. This holds fine for this test case:
>
> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 500001.5, int = 4000011
>  - for.body: float = 500000.5, int = 4000003
>  - for.inc: float = 500000.5, int = 4000003
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 250001.5, int = 2000011
>  - for.body3: float = 250000.5, int = 2000003
>  - for.inc4: float = 250000.5, int = 2000003
>  - for.end6: float = 1.0, int = 8
>
>
> But if I manually modify the frequencies of both loops to get close to MAX_INT32,
> the ratios between the frequencies no longer reflect reality. For example, if I
> change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647
> in the second:
>
> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 1073741824.4, int = 8589934595
>  - for.body: float = 1073741823.4, int = 8589934587
>  - for.inc: float = 1073741823.4, int = 8589934587
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 1073741824.4, int = 8589934595
>  - for.body3: float = 1073741823.4, int = 8589934587
>  - for.inc4: float = 1073741823.4, int = 8589934587
>  - for.end6: float = 1.0, int = 8
>
> Now both loops are considered equally hot.
>
> Duncan, I think that if I were to make branch_weights a 64-bit integer, this
> would not be an issue. But I'm not sure whether I'm missing something else
> here.
>
>
> Thanks.  Diego.
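A back-of-the-envelope illustration of the capping David is describing, using the weights from Diego's example; the per-edge limit of 2147483647 assumed here is UINT32_MAX divided by the branch's two successors, which matches the 2147483647 / 2147483649 probabilities shown later in the thread:

  first loop   4294967295 : 2  --clamp each weight-->  2147483647 : 2
  second loop  2147483647 : 2  --clamp each weight-->  2147483647 : 2

Because each weight is clamped independently, both loops end up with identical weights and look equally hot. Scaling the first loop's weights down proportionally instead (to roughly 2147483647 : 1) would keep its roughly 2x stronger bias toward the loop body visible.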
Diego Novillo
2015-Apr-24 18:36 UTC
[LLVMdev] Loss of precision with very large branch weights
On Fri, Apr 24, 2015 at 2:32 PM, Xinliang David Li <davidxl at google.com> wrote:

> Since we have decided to capture the global hotness using the function
> entry count, it is OK to cap the max weight as of today. I think the
> remaining bug in PR22718 is that when capping happens, the scaling is
> not done. That needs to be fixed.

The problem here is that the two loops show up equally hot. So the function entry count for bar() won't really help, since it seems like both loops call bar() equally frequently.

> The efficiency of the interface is a different issue.

Yeah. I'm looking at the hotness calculation first.

Diego.
Duncan P. N. Exon Smith
2015-Apr-24 19:40 UTC
[LLVMdev] Loss of precision with very large branch weights
> On 2015-Apr-24, at 11:18, Diego Novillo <dnovillo at google.com> wrote:
>
>
> In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies. For example, in these two loops:
>
> int g = 0;
> __attribute__((noinline)) void bar() {
>   g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   for (i = 0; i < 500000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
> }
>
> I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:
>
> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 500001.5, int = 4000011
>  - for.body: float = 500000.5, int = 4000003
>  - for.inc: float = 500000.5, int = 4000003
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 250001.5, int = 2000011
>  - for.body3: float = 250000.5, int = 2000003
>  - for.inc4: float = 250000.5, int = 2000003
>  - for.end6: float = 1.0, int = 8
>
>
> But if I manually modify the frequencies of both loops to get close to MAX_INT32, the ratios between the frequencies no longer reflect reality. For example, if I change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647 in the second:

What does -analyze -branch-prob give?

> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
> Printing analysis 'Block Frequency Analysis' for function 'bar':
> block-frequency-info: bar
>  - entry: float = 1.0, int = 8
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
>  - entry: float = 1.0, int = 8
>  - for.cond: float = 1073741824.4, int = 8589934595
>  - for.body: float = 1073741823.4, int = 8589934587
>  - for.inc: float = 1073741823.4, int = 8589934587
>  - for.end: float = 1.0, int = 8
>  - for.cond1: float = 1073741824.4, int = 8589934595
>  - for.body3: float = 1073741823.4, int = 8589934587
>  - for.inc4: float = 1073741823.4, int = 8589934587
>  - for.end6: float = 1.0, int = 8
>
> Now both loops are considered equally hot.
>
> Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure whether I'm missing something else here.

I don't understand where the fidelity is being lost; this seems strange to me. 4B fits inside UINT32_MAX, so I don't see how we hit a cap here.

But certainly there *is* a limit. We can't store a ratio larger than ~4B:1. Is this really a problem? I understand that you can't tell which loop is hotter, but why does that matter? Is there some optimization that will do the wrong thing? Which one?

Regarding changing branch weights to 64 bits, I think we should be sure it's valuable before we change it. I imagine it's doable, but it'll complicate logic in a few places that don't currently have to worry about overflowing `uint64_t`.
Xinliang David Li
2015-Apr-24 19:51 UTC
[LLVMdev] Loss of precision with very large branch weights
On Fri, Apr 24, 2015 at 12:40 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>
>> On 2015-Apr-24, at 11:18, Diego Novillo <dnovillo at google.com> wrote:
>>
>>
>> In PR 22718, we are looking at issues with long-running applications producing non-representative frequencies. For example, in these two loops:
>>
>> int g = 0;
>> __attribute__((noinline)) void bar() {
>>   g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>   int i, j, k;
>>
>>   for (i = 0; i < 1000000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>>
>>   for (i = 0; i < 500000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>> }
>>
>> I expect the loop body frequency for the second loop to be about half of the first one. This holds fine for this test case:
>>
>> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
>> Printing analysis 'Block Frequency Analysis' for function 'bar':
>> block-frequency-info: bar
>>  - entry: float = 1.0, int = 8
>>
>> Printing analysis 'Block Frequency Analysis' for function 'main':
>> block-frequency-info: main
>>  - entry: float = 1.0, int = 8
>>  - for.cond: float = 500001.5, int = 4000011
>>  - for.body: float = 500000.5, int = 4000003
>>  - for.inc: float = 500000.5, int = 4000003
>>  - for.end: float = 1.0, int = 8
>>  - for.cond1: float = 250001.5, int = 2000011
>>  - for.body3: float = 250000.5, int = 2000003
>>  - for.inc4: float = 250000.5, int = 2000003
>>  - for.end6: float = 1.0, int = 8
>>
>>
>> But if I manually modify the frequencies of both loops to get close to MAX_INT32, the ratios between the frequencies no longer reflect reality. For example, if I change the branch_weights to 4,294,967,295 in the first loop and 2,147,483,647 in the second:
>
> What does -analyze -branch-prob give?
>
>> $ bin/opt -analyze -block-freq -S unbiased-branches.ll
>> Printing analysis 'Block Frequency Analysis' for function 'bar':
>> block-frequency-info: bar
>>  - entry: float = 1.0, int = 8
>>
>> Printing analysis 'Block Frequency Analysis' for function 'main':
>> block-frequency-info: main
>>  - entry: float = 1.0, int = 8
>>  - for.cond: float = 1073741824.4, int = 8589934595
>>  - for.body: float = 1073741823.4, int = 8589934587
>>  - for.inc: float = 1073741823.4, int = 8589934587
>>  - for.end: float = 1.0, int = 8
>>  - for.cond1: float = 1073741824.4, int = 8589934595
>>  - for.body3: float = 1073741823.4, int = 8589934587
>>  - for.inc4: float = 1073741823.4, int = 8589934587
>>  - for.end6: float = 1.0, int = 8
>>
>> Now both loops are considered equally hot.
>>
>> Duncan, I think that if I were to make branch_weights a 64-bit integer, this would not be an issue. But I'm not sure whether I'm missing something else here.
>
> I don't understand where the fidelity is being lost; this seems
> strange to me. 4B fits inside UINT32_MAX, so I don't see how we
> hit a cap here.

The max weight is UINT32_MAX / BB->getTerminator()->getNumSuccessors().

> But certainly there *is* a limit. We can't store a ratio larger
> than ~4B:1. Is this really a problem?

Right, this is not the issue. The issue is that when one branch's weight reaches the cap, the other branch's weight is not scaled accordingly:

  uint32_t WeightLimit = getMaxWeightFor(BB);
  Weights.reserve(TI->getNumSuccessors());
  for (unsigned i = 1, e = WeightsNode->getNumOperands(); i != e; ++i) {
    ConstantInt *Weight =
        mdconst::dyn_extract<ConstantInt>(WeightsNode->getOperand(i));
    if (!Weight)
      return false;
    Weights.push_back(
        std::max<uint32_t>(1, Weight->getLimitedValue(WeightLimit)));  // <---- may hit limit
  }

> I understand that you can't
> tell which loop is hotter, but why does that matter? Is there some
> optimization that will do the wrong thing? Which one?

Two calls (in the loops) that are not equally hot may be considered equally hot, which confuses the cost/benefit analysis. This is just an example. Our goal is to make sure the profile data is faithfully kept. Which pass uses it, and whether it is used now or in the future, is a totally different matter.

> Regarding changing branch weights to 64 bits, I think we should be
> sure it's valuable before we change it. I imagine it's doable, but
> it'll complicate logic in a few places that don't currently have to
> worry about overflowing `uint64_t`.

I think we can fix this without using 64-bit weights.

David
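A minimal sketch of the scaling David is suggesting (not the actual LLVM change; the helper name and std::vector signature are illustrative assumptions): when any weight exceeds the per-edge limit, divide all of the branch's weights by a common factor so the largest one fits, instead of clamping each weight on its own.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Rescale all weights of one branch by the same factor so the largest
  // weight fits under WeightLimit; the ratios between the edges survive,
  // up to rounding.
  static void scaleBranchWeights(std::vector<uint32_t> &Weights,
                                 uint32_t WeightLimit) {
    if (Weights.empty())
      return;
    uint32_t MaxWeight = *std::max_element(Weights.begin(), Weights.end());
    if (MaxWeight <= WeightLimit)
      return; // Everything already fits; no scaling needed.
    for (uint32_t &W : Weights) {
      // Use a 64-bit intermediate so W * WeightLimit cannot overflow, and
      // keep a floor of 1 so a taken edge never collapses to weight 0.
      uint64_t Scaled = uint64_t(W) * WeightLimit / MaxWeight;
      W = uint32_t(std::max<uint64_t>(1, Scaled));
    }
  }

With this kind of scaling, the first loop's weights of {4294967295, 2} would come out as roughly {2147483647, 1}, while the second loop's {2147483647, 2} would be left untouched, so the first loop would still look about twice as strongly biased toward its body.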
Diego Novillo
2015-Apr-24 20:06 UTC
[LLVMdev] Loss of precision with very large branch weights
On Fri, Apr 24, 2015 at 3:40 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>
> What does -analyze -branch-prob give?

Similar indication of hotness. Both loops appear equally hot.

$ bin/opt -analyze -branch-prob -S unbiased-branches.ll
Printing analysis 'Branch Probability Analysis' for function 'bar':
---- Branch Probabilities ----
Printing analysis 'Branch Probability Analysis' for function 'main':
---- Branch Probabilities ----
  edge entry -> for.cond probability is 16 / 16 = 100% [HOT edge]
  edge for.cond -> for.body probability is 2147483647 / 2147483649 = 100% [HOT edge]
  edge for.cond -> for.end probability is 2 / 2147483649 = 9.31323e-08%
  edge for.body -> for.inc probability is 16 / 16 = 100% [HOT edge]
  edge for.inc -> for.cond probability is 124 / 124 = 100% [HOT edge]
  edge for.end -> for.cond1 probability is 16 / 16 = 100% [HOT edge]
  edge for.cond1 -> for.body3 probability is 2147483647 / 2147483649 = 100% [HOT edge]
  edge for.cond1 -> for.end6 probability is 2 / 2147483649 = 9.31323e-08%
  edge for.body3 -> for.inc4 probability is 16 / 16 = 100% [HOT edge]
  edge for.inc4 -> for.cond1 probability is 124 / 124 = 100% [HOT edge]

> But certainly there *is* a limit. We can't store a ratio larger
> than ~4B:1. Is this really a problem? I understand that you can't
> tell which loop is hotter, but why does that matter? Is there some
> optimization that will do the wrong thing? Which one?

It will matter for cost analysis in the inliner, for instance. I don't think this matters in the few places where the profile is used now. But long term, we want to make sure that we have a good measure of relative temperatures within a CFG and across CFGs.

> Regarding changing branch weights to 64 bits, I think we should be
> sure it's valuable before we change it. I imagine it's doable, but
> it'll complicate logic in a few places that don't currently have to
> worry about overflowing `uint64_t`.

Sure, OK. I'll see about making sure we scale things properly during frequency computation.

Diego.