Hi David, The plan makes sense. A question on:>> 4) implement frequency/profile update interfaces for Inline >> transformations -- i.e., allowing importing callee's profile info into >> caller with scaling factor computed from callsite count and callee >> entry countDo you envision the following implementation for it? Say we've decided to inline callee B at callsite C in caller A. In addition to the current inlining update work, we need to: 1. get block frequency BF of the basic block containing the callsite C by invoking BlockFrequency on function A; 2. scale BF with the function entry count of A, to get scaledBF 3. decrease function entry count for B with scaledBF 4. compute and scale block frequencies of the basic blocks of B inlined into A, or re-invoke BlockFrequency on the modified function A. Thanks, Ivan> > Date: Tue, 24 Mar 2015 13:57:30 -0700 > From: Philip Reames <listmail at philipreames.com> > To: Xinliang David Li <davidxl at google.com> > Cc: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu> > Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support > Message-ID: <5511CFBA.6060308 at philipreames.com> > Content-Type: text/plain; charset="utf-8"; format=flowed > > > On 03/24/2015 10:27 AM, Xinliang David Li wrote: >> Diego and I have discussed this according to the feedback received. We >> have revised plan for this (see Diego's last reply). Here is a more >> detailed re-cap: >> >> 1) keep MD_prof definition as it is today; also keep using the >> frequency propagation as it is (assuming programs with irreducible >> loops are not common and not important. If it turns out to be >> otherwise, we will revisit this). >> 2) fix all problems that lead to wrong 'frequency/count' computed from >> the frequency propagation algorithm >> 2.1) relax 32bit limit >> 2.2) remove 4096 iteration count cap >> 2.3) remove the 'laplace rule of succession' which can be very >> harmful >> 2.4) Fix the bug in 'BranchProbabilityInfo::calcMetadataWeights' >> that does capping without scaling >> 2.5) Fix a similar bug in >> 'MachineBranchProbabilityInfo::getSumForBlock(' >> .. other bugs found >> 3) introduce a new meta data to represent function entry global >> hotness (the implementation can choose to use entry execution count) >> 4) implement frequency/profile update interfaces for Inline >> transformations -- i.e., allowing importing callee's profile info into >> caller with scaling factor computed from callsite count and callee >> entry count >> 5) Implement mechanism to compute, record and use profile program >> summary data (This will be discussed separately). > This sounds like a very workable approach to me. I think 2.3 needs a > bit more discussion, but given it doesn't really directly effect me, I'm > probably going to stay out of it for now. > > Thanks for working on this! >> >> >> thanks, >> >> David
Xinliang David Li
2015-Mar-26 04:45 UTC
[LLVMdev] RFC - Improvements to PGO profile support
On Wed, Mar 25, 2015 at 5:54 PM, Ivan Baev <ibaev at codeaurora.org> wrote:> Hi David, > The plan makes sense. A question on: > >>> 4) implement frequency/profile update interfaces for Inline >>> transformations -- i.e., allowing importing callee's profile info into >>> caller with scaling factor computed from callsite count and callee >>> entry count > > Do you envision the following implementation for it? Say we've decided to > inline callee B at callsite C in caller A. In addition to the current > inlining update work, we need to: > > 1. get block frequency BF of the basic block containing the callsite C by > invoking BlockFrequency on function A; > 2. scale BF with the function entry count of A, to get scaledBF > 3. decrease function entry count for B with scaledBF > 4. compute and scale block frequencies of the basic blocks of B inlined > into A, or re-invoke BlockFrequency on the modified function A. >The update should work for both PGO and non-PGO. Without profile feedback, the callee's BF also needs to be re-scaled and merged with caller's BF (since we can not afford to recompute BF for the caller with every inline). The assumption is that with the new pass manager ready, BF info for caller A and callee B can co-exist. Assuming BB is a basic block in callee B, BB' is the cloned block of BB in the inline instance of B. Freq_A represents block frequency in caller A, and Freq_B represents block frequency in B. Without PGO, the incremental update is: Freq_A(BB') = Freq_B(BB)*Freq_A(C)/Freq_B(Entry_B) With PGO, the Frequency update is the same, but with additional update on the callee's Entry count: Count(Entry_B) = Count(Entry_B) - Count(Callsite_C) = Count(Entry_B) - Count(Entry_A)*Freq_A(Callsite_C)/Freq_A(Entry_A) David> Thanks, > Ivan > >> >> Date: Tue, 24 Mar 2015 13:57:30 -0700 >> From: Philip Reames <listmail at philipreames.com> >> To: Xinliang David Li <davidxl at google.com> >> Cc: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu> >> Subject: Re: [LLVMdev] RFC - Improvements to PGO profile support >> Message-ID: <5511CFBA.6060308 at philipreames.com> >> Content-Type: text/plain; charset="utf-8"; format=flowed >> >> >> On 03/24/2015 10:27 AM, Xinliang David Li wrote: >>> Diego and I have discussed this according to the feedback received. We >>> have revised plan for this (see Diego's last reply). Here is a more >>> detailed re-cap: >>> >>> 1) keep MD_prof definition as it is today; also keep using the >>> frequency propagation as it is (assuming programs with irreducible >>> loops are not common and not important. If it turns out to be >>> otherwise, we will revisit this). >>> 2) fix all problems that lead to wrong 'frequency/count' computed from >>> the frequency propagation algorithm >>> 2.1) relax 32bit limit >>> 2.2) remove 4096 iteration count cap >>> 2.3) remove the 'laplace rule of succession' which can be very >>> harmful >>> 2.4) Fix the bug in 'BranchProbabilityInfo::calcMetadataWeights' >>> that does capping without scaling >>> 2.5) Fix a similar bug in >>> 'MachineBranchProbabilityInfo::getSumForBlock(' >>> .. other bugs found >>> 3) introduce a new meta data to represent function entry global >>> hotness (the implementation can choose to use entry execution count) >>> 4) implement frequency/profile update interfaces for Inline >>> transformations -- i.e., allowing importing callee's profile info into >>> caller with scaling factor computed from callsite count and callee >>> entry count >>> 5) Implement mechanism to compute, record and use profile program >>> summary data (This will be discussed separately). >> This sounds like a very workable approach to me. I think 2.3 needs a >> bit more discussion, but given it doesn't really directly effect me, I'm >> probably going to stay out of it for now. >> >> Thanks for working on this! >>> >>> >>> thanks, >>> >>> David > >