Hi Diego, I work for Qualcomm and we are also interested in improvements to
PGO profile support, especially in supporting relative hotness across
functions. We are willing to contribute in this area, so please keep me in
this discussion and work distribution. Some questions and comments below.

> Date: Tue, 10 Mar 2015 13:14:22 -0400
> From: Diego Novillo <dnovillo at google.com>
> To: Bob Wilson <bob.wilson at apple.com>
> Cc: Xinliang David Li <davidxl at google.com>, LLVM Developers Mailing
>     List <llvmdev at cs.uiuc.edu>
> Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support
> Message-ID: <CAD_=9DSzNSqpku0FjpeLyugaxshcxY9fLyyZ5EmD=i5TVDiqRA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com> wrote:
>>
>>> I've created a few bugzilla issues with details of some of the things
>>> I'll be looking into. I'm not yet done wordsmithing the overall design
>>> document. I'll try to finish it by early next week at the latest.
>>
>> The document is available at
>>
>> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>
>> There are several topics covered. Ideally, I would prefer that we discuss
>> each topic separately. The main ones I will start working on are the ones
>> described in the bugzilla links we have in the doc.
>>
>> This is just a starting point for us. I am not at all concerned with
>> implementing exactly what is proposed in the document. In fact, if we can
>> get the same value using the existing support, all the better.
>>
>> OTOH, any other ideas that folks may have that work better than this are
>> more than welcome.
>> I don't have really strong opinions on the matter. I am fine with
>> whatever works.
>>
>> Thanks for the detailed write-up on this. Some of the issues definitely
>> need to be addressed. I am concerned, though, that some of the ideas may
>> be leading toward a scenario where we have essentially two completely
>> different ways of representing profile information in LLVM IR. It is
>> great to have two complementary approaches to collecting profile data,
>> but two representations in the IR would not make sense.
>
> Yeah, I don't think I'll continue to push for a new MD_count attribute. If
> we were to make MD_prof be a "real" execution count, that would be enough.
> Note that by re-using MD_prof we are not changing its meaning at all. The
> execution count is still a weight and the ratio is still the branch
> probability. All that we are changing are the absolute values of the
> numbers, and we are increasing the data type width to remove the 32-bit
> limitation.

[ivan] What new data type do you suggest? Increasing it to 64 bits for
64-bit architectures would be reasonable, and would reduce the need for
scaling counts.

>> The first issue raised is that profile execution counts are not
>> represented in the IR. This was a very intentional decision. I know it
>> goes against what other compilers have done in the past. It took me a
>> while to get used to the idea when Andy first suggested it, so I know it
>> seems awkward at first. The advantage is that branch probabilities are
>> much easier to keep updated in the face of compiler transformations,
>> compared to execution counts.
>
> Sorry. I don't follow. Updating counts as the CFG is transformed is not
> difficult at all. What examples do you have in mind? The big advantage of
> making MD_prof an actual execution count is that it is a meaningful metric
> wrt scaling and transformation.
> Say, for instance, that we have a branch instruction with two targets with
> counts {100, 300} inside a function 'foo' that has entry count 2. The edge
> probability for the first edge (count 100) is 100/(100+300) = 25%.
>
> If we inline foo() inside another function bar() at a call site with
> profile count == 1, the cloned branch instruction gets its counters scaled
> by the call site count. So the new branch has counts
> {100 * 1 / 2, 300 * 1 / 2} = {50, 150}. But the branch probability did not
> change. Currently, we are cloning the branch without changing the edge
> weights.
>
> This scaling is not difficult at all and can be done incrementally very
> quickly. We cannot afford to recompute all frequencies on the fly because
> it would be detrimental to compile time. If foo() itself has N callees
> inlined into it, each inlined callee needs to trigger a re-computation.
> When foo() is inlined into bar(), the frequencies would need to be
> recomputed for foo() and all N callees inlined into foo().

[ivan] Scaling branch counts when inlining looks good, but the current PGO
support doesn't maintain function entry counts at the LLVM IR level. How do
you plan to maintain these in LLVM IR?

>> We are definitely missing the per-function execution counts that are
>> needed to be able to compare relative "hotness" across functions, and I
>> think that would be a good place to start making improvements. In the
>> long term, we should keep our options open to making major changes, but
>> before we go there, we should try to make incremental improvements to
>> fix the existing infrastructure.
>
> Right, and that's the core of our proposal. We don't really want to make
> major infrastructure changes at this point. The only thing I'd like to
> explore is making MD_prof a real count. This will be useful for the
> inliner changes, and it would also make incremental updates easier,
> because the scaling that needs to be done is very straightforward and
> quick.
> Note that this change should not modify the current behaviour we get from
> profile analysis. Things that were hot before should continue to be hot
> now.
>
>> Many of the other issues you raise seem like they could also be
>> addressed without major changes to the existing infrastructure. Let's
>> try to fix those first.
>
> That's exactly the point of the proposal. We definitely don't want to make
> major changes to the infrastructure at first. My thinking is to start
> working on making MD_prof a real count. One of the things that is
> happening is that the combination of real profile data plus the frequency
> propagation that we are currently doing is misleading the analysis.
>
> For example (thanks to David for the code and data), in the following code:
>
> int g;
> __attribute__((noinline)) void bar() {
>   g++;
> }
>
> extern int printf(const char*, ...);
>
> int main()
> {
>   int i, j, k;
>
>   g = 0;
>
>   // Loop 1.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 100; j++)
>       for (k = 0; k < 100; k++)
>         bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   // Loop 2.
>   for (i = 0; i < 100; i++)
>     for (j = 0; j < 10000; j++)
>       bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
>
>   // Loop 3.
>   for (i = 0; i < 1000000; i++)
>     bar();
>
>   printf ("g = %d\n", g);
>   g = 0;
> }
>
> When compiled with profile instrumentation, frequency propagation is
> distorting the real profile because it gives a different frequency to the
> call to bar() in each of the 3 loops. All 3 loops execute 1,000,000 times,
> but after frequency propagation, the call to bar() gets a weight of
> 520,202 in loop #1, 210,944 in loop #2 and 4,096 in loop #3.
> In reality, every call to bar() should have a weight of 1,000,000.

[ivan] I ran the example with

> opt -analyze -block-freq test1.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 101.0, int = 807
 - for.cond1: float = 10100.0, int = 80799
 - for.cond4: float = 1010000.0, int = 8079999
 - for.body6: float = 1000000.0, int = 7999999
 - for.inc7: float = 10000.0, int = 79999
 - for.inc10: float = 100.0, int = 799
 - for.end12: float = 1.0, int = 8
 - for.cond13: float = 101.0, int = 807
 - for.cond16: float = 409600.0, int = 3276799
 - for.body18: float = 409559.0441, int = 3276472
 - for.inc22: float = 100.0, int = 799
 - for.end24: float = 1.0, int = 8
 - for.cond26: float = 4096.0, int = 32768
 - for.body28: float = 4095.995904, int = 32767
 - for.end31: float = 1.0, int = 8

As shown above, the three calls to bar() get float BBFreq of 1000000 (in
for.body6), 409559.0441 (in for.body18), and 4095.995904 (in for.body28).

When the example is modified for trip counts of 10, 100, and 1000:

> opt -analyze -block-freq test1.10.ll

Printing analysis 'Block Frequency Analysis' for function 'main':
block-frequency-info: main
 - entry: float = 1.0, int = 8
 - for.cond: float = 11.0, int = 87
 - for.cond1: float = 110.0, int = 879
 - for.cond4: float = 1100.0, int = 8799
 - for.body6: float = 1000.0, int = 7999
 - for.inc7: float = 100.0, int = 799
 - for.inc10: float = 10.0, int = 79
 - for.end12: float = 1.0, int = 8
 - for.cond13: float = 11.0, int = 87
 - for.cond16: float = 1010.0, int = 8079
 - for.body18: float = 1000.0, int = 7999
 - for.inc22: float = 10.0, int = 79
 - for.end24: float = 1.0, int = 8
 - for.cond26: float = 1001.0, int = 8007
 - for.body28: float = 1000.0, int = 7999
 - for.end31: float = 1.0, int = 8

the three calls to bar() get the correct count of 1000. We should try to
remove the limitation of 4096 in the block frequency propagation algorithm
if possible.

Ivan
Xinliang David Li
2015-Mar-11 04:44 UTC
[LLVMdev] RFC - Improvements to PGO profile support
> [ivan] What new data type do you suggest? Increasing it to 64 bits for
> 64-bit architectures would be reasonable, and would reduce the need for
> scaling counts.

A 64-bit int in MD_prof.

>>> The first issue raised is that profile execution counts are not
>>> represented in the IR. This was a very intentional decision. I know it
>>> goes against what other compilers have done in the past. It took me a
>>> while to get used to the idea when Andy first suggested it, so I know
>>> it seems awkward at first. The advantage is that branch probabilities
>>> are much easier to keep updated in the face of compiler
>>> transformations, compared to execution counts.
>>
>> Sorry. I don't follow. Updating counts as the CFG is transformed is not
>> difficult at all. What examples do you have in mind? The big advantage
>> of making MD_prof an actual execution count is that it is a meaningful
>> metric wrt scaling and transformation.
>>
>> Say, for instance, that we have a branch instruction with two targets
>> with counts {100, 300} inside a function 'foo' that has entry count 2.
>> The edge probability for the first edge (count 100) is
>> 100/(100+300) = 25%.
>>
>> If we inline foo() inside another function bar() at a call site with
>> profile count == 1, the cloned branch instruction gets its counters
>> scaled by the call site count. So the new branch has counts
>> {100 * 1 / 2, 300 * 1 / 2} = {50, 150}. But the branch probability did
>> not change. Currently, we are cloning the branch without changing the
>> edge weights.
>>
>> This scaling is not difficult at all and can be done incrementally very
>> quickly. We cannot afford to recompute all frequencies on the fly
>> because it would be detrimental to compile time. If foo() itself has N
>> callees inlined into it, each inlined callee needs to trigger a
>> re-computation. When foo() is inlined into bar(), the frequencies would
>> need to be recomputed for foo() and all N callees inlined into foo().
>> [ivan] Scaling branch counts when inlining looks good, but the current
>> PGO support doesn't maintain function entry counts at LLVM IR. How do
>> you plan to maintain these at LLVM IR?

MD_prof is the only thing that needs to be maintained at the IR level. For
the in-memory CFG profile representation, the function entry count is
needed. This means that when the CFG profile data is incrementally updated
during inlining, the MD_prof weight/count for the cloned branches needs to
be scaled in the same way.

> As shown above, the three calls to bar() get float BBFreq of 1000000 (in
> for.body6), 409559.0441 (in for.body18), and 4095.995904 (in for.body28).
>
> When the example is modified for trip counts of 10, 100, and 1000:
>
>> opt -analyze -block-freq test1.10.ll
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
> - entry: float = 1.0, int = 8
> - for.cond: float = 11.0, int = 87
> - for.cond1: float = 110.0, int = 879
> - for.cond4: float = 1100.0, int = 8799
> - for.body6: float = 1000.0, int = 7999
> - for.inc7: float = 100.0, int = 799
> - for.inc10: float = 10.0, int = 79
> - for.end12: float = 1.0, int = 8
> - for.cond13: float = 11.0, int = 87
> - for.cond16: float = 1010.0, int = 8079
> - for.body18: float = 1000.0, int = 7999
> - for.inc22: float = 10.0, int = 79
> - for.end24: float = 1.0, int = 8
> - for.cond26: float = 1001.0, int = 8007
> - for.body28: float = 1000.0, int = 7999
> - for.end31: float = 1.0, int = 8
>
> the three calls to bar() get the correct count of 1000.
> We should try to remove the limitation of 4096 in the block frequency
> propagation algorithm if possible.

4096 is needed for static profile estimation -- there is no need to remove
the limit when real profile data is available. The solution is to simply
not use frequency propagation at all. For your example, the values 11,
1010, etc. are all bogus.
When the loop trip count is small, the ratio of the loop body count to the
preheader count (the average trip count) can be very different from the
real trip count (see PR22719). In the same bug, there is another example (a
multi-entry loop) showing the weakness.

Storing the profile count in MD_prof allows the compiler to skip frequency
propagation completely. It also allows faster incremental updates during
inlining via simple scaling.

David

> Ivan
On 03/10/15 21:30, Ivan Baev wrote:

> [ivan] What new data type do you suggest? Increasing it to 64 bits for
> 64-bit architectures would be reasonable, and would reduce the need for
> scaling counts.

Yes. That's what we have in mind. MD_prof would eventually become a 64-bit
integer.

> [ivan] Scaling branch counts when inlining looks good, but the current
> PGO support doesn't maintain function entry counts at LLVM IR. How do
> you plan to maintain these at LLVM IR?

Well, we'd have to incrementally update the entry counts as inlining
happens, while scaling the MD_prof counts in the IR. This is a bit far out
right now. The inliner is not even capable of reading profiles today. This
is the part of the plan that will happen much further down the road.

>> opt -analyze -block-freq test1.10.ll
>
> Printing analysis 'Block Frequency Analysis' for function 'main':
> block-frequency-info: main
> - entry: float = 1.0, int = 8
> - for.cond: float = 11.0, int = 87
> - for.cond1: float = 110.0, int = 879
> - for.cond4: float = 1100.0, int = 8799
> - for.body6: float = 1000.0, int = 7999
> - for.inc7: float = 100.0, int = 799
> - for.inc10: float = 10.0, int = 79
> - for.end12: float = 1.0, int = 8
> - for.cond13: float = 11.0, int = 87
> - for.cond16: float = 1010.0, int = 8079
> - for.body18: float = 1000.0, int = 7999
> - for.inc22: float = 10.0, int = 79
> - for.end24: float = 1.0, int = 8
> - for.cond26: float = 1001.0, int = 8007
> - for.body28: float = 1000.0, int = 7999
> - for.end31: float = 1.0, int = 8
>
> the three calls to bar() get the correct count of 1000.
> We should try to remove the limitation of 4096 in the block frequency
> propagation algorithm if possible.

Long term, the presence of real profile data would make the frequency
propagator unnecessary. In the shorter term, coming up with some
intermediate fix for this is certainly possible. I'd rather continue this
discussion in the bug itself (https://llvm.org/bugs/show_bug.cgi?id=22719).

Diego.