> On Mar 24, 2015, at 3:22 PM, Xinliang David Li <davidxl at google.com> wrote:
>
> On Tue, Mar 24, 2015 at 2:47 PM, Bob Wilson <bob.wilson at apple.com> wrote:
>>
>>> On Mar 24, 2015, at 12:08 PM, Xinliang David Li <davidxl at google.com> wrote:
>>>
>>> On Tue, Mar 24, 2015 at 10:54 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>>>>
>>>>> On Mar 24, 2015, at 10:53 AM, Diego Novillo <dnovillo at google.com> wrote:
>>>>>
>>>>> On Tue, Mar 24, 2015 at 1:48 PM, Bob Wilson <bob.wilson at apple.com> wrote:
>>>>>>
>>>>>>> On Mar 24, 2015, at 10:27 AM, Xinliang David Li <davidxl at google.com> wrote:
>>>>>>>
>>>>>>> 2.3) remove the 'laplace rule of succession' which can be very harmful
>>>>>>
>>>>>> I have not seen any consensus on this. I’m not at all convinced this would be a good idea. From what I’ve seen, using LaPlace’s rule avoids really bad behavior in cases where the counts are very small and has almost no impact when the counts are large.
>>>>>
>>>>> Duncan and I chatted briefly about this on IRC. I'm going to
>>>>> experiment with only applying laplace smoothing if any scaled weights
>>>>> are 0. AFAIU, usage of this rule is really just trying to avoid having
>>>>> downstream users deal with 0 weights. Duncan, please whack me with a
>>>>> clue stick if I totally misrepresented our conclusions.
>>>>
>>>> Zero is the extreme case, but it’s also important for other small counts. I’d like to see some specific examples of why you think the current behavior is harmful.
>>>
>>> Using the rule has at least the following bad side effects:
>>> 1) wrong estimation of average loop trip count -- possibly leading to
>>> wrong unroll decisions
>>
>> I remain skeptical. If the loop only ran once in your profile run, how much confidence do you have in the trip count? The LaPlace rule will tend to make the compiler more conservative in exactly the cases where there is insufficient information to be confident about what to do.
> Training runs are expensive, so it is common for PGO users to reduce the
> workload as much as possible without sacrificing its
> representativeness.
>
> Workload 1: loop run once, iterating 100 times in total
> Workload 2: loop run twice, iterating 200 times in total
>
> Which workload will users prefer to use?

Which workload is better? I don’t at all trust users to get this right, at least for real, non-benchmark code.

> Without the rule, the two workloads at least produce consistent
> profile data. With the Laplace rule, you get 50 in one case, and 66
> in the other.

Yes, but you’ve got more information in one case than the other. This is a feature IMO, not a bug. It’s entirely possible that with workload 2, the loop may have executed for a drastically different number of iterations. The fact that it did not, i.e., that it was consistent with workload 1, is more information that you did not have before. It makes sense for the compiler to be more aggressive when it has more data.

> Having some technology to improve confidence in the profile data is
> fine, but I don't see
> 1) how the Laplace rule is good for it

What do you not understand about it? As the counts get larger, LaPlace’s rule fades into the noise. It only makes a difference for cases where some of the counts are *very* small, and in those cases, it very simply adjusts the weights to make optimizations less aggressive.

> 2) why this cannot be done on the consumer side (i.e., faithfully
> record the profile data).

What does this have to do with how faithfully the profile is recorded? We’ve got fully accurate data, but if the profiling inputs are too small or not representative, you may still get poor optimization choices.

>>> 2) result in bad inlining decisions. For instance:
>>>   for (...)
>>>     bar(); // (1)
>>>
>>> where (1) is the only callsite to bar(). Using the rule, the BB count
>>> enclosing the call to bar() can be as low as half of the entry count
>>> of bar().
>>> The inliner will get confused and think there are more hot
>>> callsites to 'bar' and make suboptimal decisions...
>>>
>>> Also if bar has calls to other functions, those callsites will look
>>> hotter than the call to 'bar' …
>>
>> Your own proposal for recording entry counts is to record “relative hotness”, not absolute profile counts.
>
> The proposal is to record 'global hotness' that can be used to compare
> relative hotness across procedure boundaries (e.g. callsites in
> different callers). Profile counts satisfy this condition.
>
>> On the caller’s side, we’ve got a branch weight influenced by LaPlace’s rule that is then used to compute BlockFrequency, and you’re concerned about a mismatch between that and the “relative hotness” recorded for the callee??
>
> Basically, say the caller is test():
>
>   bar() {
>     // ENTRY count = 100 (from profile data)
>     // ENTRY freq = 1
>
>     // BB2: Freq(BB2) = 1, count = 100
>     foo();  // (2)
>   }
>
>   test() {
>     // ENTRY count = 1 (from profile data)
>     // ENTRY freq = 1
>     for (i = 0; i < 100; i++) {
>       // BB1: Freq(BB1) = 50 due to Laplace rule
>       bar();  // Freq = 50, count = 50  (1)
>     }
>   }
>
> With the Laplace rule, the block freq computed for bar's enclosing BB will
> be wrong -- as a result, bar's enclosing BB's count will be wrong
> too: 50*1/1 = 50.
>
> The global hotness of call sites (1) & (2) should be the same, but it is
> distorted when the Laplace rule is used.
>
> Yes, we care about using PGO across routine boundaries for IPO.

I understand the issue, but my point was that you should simply not do that. You’re objecting to LaPlace’s rule based on a hypothetical comparison of block frequencies vs. entry counts. There is nothing in LLVM that does that now. We don’t even have entry counts. I don’t see how you can argue that LaPlace’s rule is bad because it could affect an apples vs. oranges comparison of something that does not even exist yet.

>>> The attached are two cases as well as the frequency graphs computed
>>> today (with the Laplace rule) and the correct frequencies expected.
>>
>> I’d be a lot more interested to see a real-world example.
>
> See my reply above. On the other hand, I'd like to see examples where
> the Laplace rule can actually help improve the profile data quality.

It’s not about improving the results — it’s about preventing clang from being overly aggressive about optimizing based on limited profile data.
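The 50-vs-66 numbers debated above can be reproduced with a few lines of arithmetic. This is an illustrative model of add-one smoothing ("Laplace's rule of succession") applied to branch weights, not clang's actual implementation, and the trip-count estimate is the simple backedge/exit probability ratio:

```python
def smoothed_probs(counts):
    """Laplace rule of succession: add 1 to each outcome count
    before forming probabilities."""
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]

def est_trip_count(backedge, exits):
    """Estimated iterations per loop entry = P(backedge) / P(exit)."""
    p_back, p_exit = smoothed_probs([backedge, exits])
    return p_back / p_exit

# Workload 1: loop entered once, 100 iterations total.
print(est_trip_count(100, 1))   # 50.5 -- the ~50 quoted in the thread
# Workload 2: loop entered twice, 200 iterations total.
print(est_trip_count(200, 2))   # 67.0 -- vs. exactly 100 without smoothing
```

Without smoothing, both workloads estimate 100 iterations per entry; with it, the estimate converges toward 100 only as the raw counts grow, which is exactly the behavior both sides are arguing about.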
Xinliang David Li
2015-Mar-26 05:47 UTC
[LLVMdev] RFC - Improvements to PGO profile support
Bob,

> Which workload is better? I don’t at all trust users to get this right, at
> least for real, non-benchmark code.

We do have a lot of users (real world apps) who can get this right --
I am not joking ;)

>> Without the rule, the two workloads at least produce consistent
>> profile data. With the Laplace rule, you get 50 in one case, and 66
>> in the other.
>
> Yes, but you’ve got more information in one case than the other. This is a
> feature IMO, not a bug. It’s entirely possible that with workload 2, the
> loop may have executed for a drastically different number of iterations. The
> fact that it did not, i.e., that it was consistent with workload 1, is more
> information that you did not have before. It makes sense for the compiler to
> be more aggressive when it has more data.

But the decision by the compiler is arbitrary and not necessarily
correct. For instance, the single run used in training may have
actually executed far fewer iterations than average. With the
Laplace rule, the iteration count becomes even smaller. My point is
that there is no way for the compiler to tell how good the data is, nor is
the compiler in a good position to make that judgement. By doing so,
the users who carefully prune their workload to reduce runtime get
punished for no reason.

>> Having some technology to improve confidence in the profile data is
>> fine, but I don't see
>> 1) how the Laplace rule is good for it
>
> What do you not understand about it? As the counts get larger, LaPlace’s
> rule fades into the noise. It only makes a difference for cases where some
> of the counts are *very* small, and in those cases, it very simply adjusts
> the weights to make optimizations less aggressive.

Strictly speaking, in a loop context, it just makes optimizations
assume shorter trip counts.

>> 2) why this cannot be done on the consumer side (i.e., faithfully
>> record the profile data).
>
> What does this have to do with how faithfully the profile is recorded? We’ve
> got fully accurate data, but if the profiling inputs are too small or not
> representative, you may still get poor optimization choices.

The point is that there is no need to adjust the weights. It is very
easy to check the loop header's profile count to determine how much
confidence you want to give (possibly controlled with a flag). Control
done this way is more fine-grained than blindly changing the
weights right after reading the profile data.

>>>> 2) result in bad inlining decisions. For instance:
>>>>   for (...)
>>>>     bar(); // (1)
>>>>
>>>> where (1) is the only callsite to bar(). Using the rule, the BB count
>>>> enclosing the call to bar() can be as low as half of the entry count
>>>> of bar(). The inliner will get confused and think there are more hot
>>>> callsites to 'bar' and make suboptimal decisions...
>>>>
>>>> Also if bar has calls to other functions, those callsites will look
>>>> hotter than the call to 'bar' …
>>>
>>> Your own proposal for recording entry counts is to record “relative
>>> hotness”, not absolute profile counts.
>>
>> The proposal is to record 'global hotness' that can be used to compare
>> relative hotness across procedure boundaries (e.g. callsites in
>> different callers). Profile counts satisfy this condition.
>>>
>>> On the caller’s side, we’ve got a branch weight influenced by LaPlace’s rule
>>> that is then used to compute BlockFrequency, and you’re concerned about a
>>> mismatch between that and the “relative hotness” recorded for the callee??
>> Basically, say the caller is test():
>>
>>   bar() {
>>     // ENTRY count = 100 (from profile data)
>>     // ENTRY freq = 1
>>
>>     // BB2: Freq(BB2) = 1, count = 100
>>     foo();  // (2)
>>   }
>>
>>   test() {
>>     // ENTRY count = 1 (from profile data)
>>     // ENTRY freq = 1
>>     for (i = 0; i < 100; i++) {
>>       // BB1: Freq(BB1) = 50 due to Laplace rule
>>       bar();  // Freq = 50, count = 50  (1)
>>     }
>>   }
>>
>> With the Laplace rule, the block freq computed for bar's enclosing BB will
>> be wrong -- as a result, bar's enclosing BB's count will be wrong
>> too: 50*1/1 = 50.
>>
>> The global hotness of call sites (1) & (2) should be the same, but it is
>> distorted when the Laplace rule is used.
>>
>> Yes, we care about using PGO across routine boundaries for IPO.
>
> I understand the issue, but my point was that you should simply not do that.
> You’re objecting to LaPlace’s rule based on a hypothetical comparison of
> block frequencies vs. entry counts. There is nothing in LLVM that does that
> now. We don’t even have entry counts.

I am not sure what you mean by 'hypothetical comparison of block
frequencies vs. entry counts', but it does not seem to be what I mean.
What I mean is that:

1) We need a metric to represent global hotness. Profile (execution)
counts fit the bill.
2) There are two ways to compute profile counts for BBs:
   a) Directly compute them from the edge counts recorded in the profile data
      (BB frequency can be directly scaled from them), but this change
      requires slightly changing MD_prof's meaning or introducing MD_count
      to record edge counts without capping/scaling.
   b) Just record the entry profile count (a minimal change), but do
      not change MD_prof. This approach reuses block frequency
      propagation, but the latter relies on unaltered branch
      probabilities/weights in order to recompute the counts precisely
      (combined with the entry count).

Since people have concerns about a), we chose b).
For b), I merely pointed out in the above example that, with the Laplace
rule, the recomputed profile count at the only callsite of 'bar' can be
greatly different from the recorded entry profile count of 'bar'. The
incoming callsites' profile distribution can be a good signal for inlining
decisions, so such a difference will be bad.

> I don’t see how you can argue that LaPlace’s rule is bad because it could
> affect an apples vs. oranges comparison of something that does not even
> exist yet.

Of course, PGO support for IPA is exactly the missing (and very
important) piece we plan to add -- if it already existed, there would
be no problems.

thanks,

David

>>>> The attached are two cases as well as the frequency graphs computed
>>>> today (with the Laplace rule) and the correct frequencies expected.
>>>
>>> I’d be a lot more interested to see a real-world example.
>>
>> See my reply above. On the other hand, I'd like to see examples where
>> the Laplace rule can actually help improve the profile data quality.
>
> It’s not about improving the results — it’s about preventing clang from
> being overly aggressive about optimizing based on limited profile data.
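David's mismatch example reduces to one line of arithmetic. The sketch below is a deliberately simplified model of option b) — count(BB) = entry_count × freq(BB) / freq(entry) — not LLVM's BlockFrequencyInfo; the numbers mirror the test()/bar() example in the thread:

```python
def block_count(entry_count, block_freq, entry_freq=1.0):
    """Recompute a block's profile count from the function entry count
    and relative block frequency: count(BB) = entry_count * freq(BB) / freq(entry)."""
    return entry_count * block_freq / entry_freq

# test() has entry count 1; its loop body really runs 100x per entry.
exact_freq = 100 / 1
# With add-one smoothing of the (backedge=100, exit=1) weights, the
# estimated iterations per entry become (100+1)/(1+1) = 50.5 (~50 in the thread).
smoothed_freq = (100 + 1) / (1 + 1)

print(block_count(1, exact_freq))     # 100.0 -- matches bar's recorded entry count
print(block_count(1, smoothed_freq))  # 50.5  -- ~2x below bar's entry count of 100
```

The recomputed callsite count (≈50) no longer matches the callee's recorded entry count (100), which is the inconsistency David argues would mislead a count-based inliner.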
Hayden Livingston
2015-May-07 07:55 UTC
[LLVMdev] RFC - Improvements to PGO profile support
Can you tell us if you're continuing to use the same approach as
described in one of the LLVM meetings, i.e. instrumenting at the clang
AST level?

Also, do you generate GCOV files, some YAML, or is this a separate
format?

And finally, in the meeting you described how you assign counters to
blocks using an algorithm that minimizes the number of insertions. Is
that algorithm a well-known one or a custom one? Is it described
somewhere?

On Wed, Mar 25, 2015 at 10:47 PM, Xinliang David Li <davidxl at google.com> wrote:
> [...]

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
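On Hayden's last question: the classic way to minimize inserted counters is the Knuth / Ball–Larus observation that counters are only needed on edges outside a spanning tree of the CFG; every other edge count follows from flow conservation at each block. Whether clang's instrumentation does exactly this is not confirmed in this thread, so treat the following as a generic sketch with a made-up loop CFG:

```python
def recover_counts(edges, interior_nodes):
    """Fill in edge counts left as None using flow conservation:
    at every interior node, sum(incoming) == sum(outgoing)."""
    changed = True
    while changed:
        changed = False
        for n in interior_nodes:
            ins = [e for e in edges if e[1] == n]
            outs = [e for e in edges if e[0] == n]
            unknown = [e for e in ins + outs if edges[e] is None]
            if len(unknown) != 1:
                continue  # need exactly one unknown incident edge to solve
            e = unknown[0]
            in_sum = sum(edges[x] for x in ins if edges[x] is not None)
            out_sum = sum(edges[x] for x in outs if edges[x] is not None)
            edges[e] = out_sum - in_sum if e in ins else in_sum - out_sum
            changed = True
    return edges

# Loop CFG: ENTRY -> A (header) -> B (body) -> A, plus A -> EXIT.
# With a virtual EXIT->ENTRY edge, a spanning tree is
# {A->B, A->EXIT, EXIT->ENTRY}; counters go only on the two chords
# ENTRY->A and B->A, and the tree edges are recovered, not instrumented.
edges = {
    ("ENTRY", "A"): 1,    # counter: function entered once
    ("B", "A"): 100,      # counter: backedge taken 100 times
    ("A", "B"): None,     # recovered
    ("A", "EXIT"): None,  # recovered
}
recover_counts(edges, ["A", "B"])
print(edges[("A", "B")], edges[("A", "EXIT")])  # 100 1
```

This instruments 2 of the 4 real edges; in general the counter count is E − V + 1 on the augmented graph, which is the minimization Hayden asked about.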