Fedor Sergeev via llvm-dev
2019-Mar-13 21:37 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Overall seems fine to me.

On 3/11/19 8:12 PM, Hiroshi Yamauchi wrote:
> Here's a revised approach based on the discussion:
>
> - Cache PSI right after the profile summary in the IR is written in the
> pass pipeline. This would avoid the need to insert RequireAnalysisPass
> for PSI before each non-module pass that needs it. PSI can technically
> be invalidated, but that is unlikely, as PSI is immutable. If it is, we
> can insert another RequireAnalysisPass.

ProfileSummaryInfo::invalidate always returns false, so it does not need
any extra handling (as soon as it finds its way into ModuleAnalysisManager).

> - If PGO, conditionally request BFI from the passes that need it. For
> (pipelined) loop passes, we need to insert a pass that computes BFI
> conditionally (if PGO) in front of them and make them preserve BFI
> throughout. This is to avoid pipeline interruptions and potential
> invalidation/recomputation of BFI between the loop passes. We detect
> PGO based on whether PSI has profile summary info. (For the old pass
> manager, implement a similar approach using LazyBlockFrequencyInfo.)

There is already an optional analysis in LoopStandardAnalysisResults -
MemorySSA. We can do the same for BFI/BPI.
And, yes - preserving those through loop passes is a cornerstone of this
approach.

> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
> able to get to the ModuleAnalysisManager in one step, and PSI through it.

This is just a compile-time optimization that saves one indirection
through FunctionAnalysisManager.
I'm not even sure it is worth the effort. And it is definitely not
crucial for the overall idea.
regards,
Fedor.

> On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
>
> On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote:
>> On Mon, Mar 4, 2019 at 10:55 AM Hiroshi Yamauchi <yamauchi at google.com> wrote:
>>
>> On Sat, Mar 2, 2019 at 12:58 AM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
>>
>> On 3/2/19 2:38 AM, Hiroshi Yamauchi wrote:
>>> Here's a sketch of the proposed approach for just one pass (but imagine more):
>>>
>>> https://reviews.llvm.org/D58845
>>>
>>> On Fri, Mar 1, 2019 at 12:54 PM Fedor Sergeev via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 2/28/19 12:47 AM, Hiroshi Yamauchi via llvm-dev wrote:
>>>> Hi all,
>>>>
>>>> To implement more profile-guided optimizations, we’d like to use
>>>> ProfileSummaryInfo (PSI) and BlockFrequencyInfo (BFI) from more passes
>>>> of various types, under the new pass manager.
>>>>
>>>> The following is what we came up with. We would appreciate feedback. Thanks.
>>>>
>>>> Issue
>>>>
>>>> It’s not obvious (to me) how best to do this, given that we cannot
>>>> request an outer-scope analysis result from an inner-scope pass through
>>>> analysis managers [1], and that we might unnecessarily run some analyses
>>>> unless we conditionally build pass pipelines for PGO cases.
>>>
>>> Indeed, this is an intentional restriction in the new pass manager, which
>>> is more or less a reflection of a fundamental property of the outer-inner
>>> IRUnit relationship and the transformations/analyses run on those units.
>>> The main intent of having those inner IRUnits (e.g. Loops) is to run
>>> local transformations and save compile time by being local to a
>>> particular small piece of IR.
>>> The Loop Pass manager allows you to run a whole pipeline of different
>>> transformations, still locally, amplifying the saving.
>>> As soon as you run a function-level analysis from within the loop
>>> pipeline, you essentially break this pipelining.
>>> Say, as your loop transformation runs, it modifies the loop (and the
>>> function) and potentially invalidates the analysis, so you have to rerun
>>> the analysis again and again. Hence, instead of saving compile time, it
>>> ends up increasing it.
>>>
>>> Exactly.
>>>
>>> I have hit this issue somewhat recently with the dependency of loop
>>> passes on BranchProbabilityInfo (some loop passes, like IRCE, can use it
>>> for profitability analysis).
>>> The only solution that appears reasonable there is to teach all the loop
>>> passes that need to be pipelined to preserve BPI (or any other
>>> module/function-level analyses), similar to how they preserve
>>> DominatorTree and the other "LoopStandard" analyses.
>>>
>>> Is this implemented - do the loop passes preserve BPI?
>> Nope, not implemented right now.
>> One of the problems is that even the loop canonicalization passes run at
>> the start of the loop pass manager don't preserve it (and at least
>> LoopSimplifyCFG does change control flow).
>>>
>>> In buildFunctionSimplificationPipeline (where LoopFullUnrollPass is
>>> added, as in the sketch), LateLoopOptimizationsEPCallbacks and
>>> LoopOptimizerEndEPCallbacks seem to allow some arbitrary loop passes to
>>> be inserted into the pipelines (via flags)?
>>>
>>> I wonder how hard it'd be to teach all the relevant loop passes to
>>> preserve BFI (or BPI)...
>> Well, each time you restructure control flow around the loops you will
>> have to update those extra analyses, pretty much the same way DT is being
>> updated through DomTreeUpdater.
>> The trick is to design a proper update interface (and then implement it ;) ).
>> And I have not spent enough time on this issue to get a good idea of what
>> that interface would be.
>>
>> Hm, sounds non-trivial :) noting BFI depends on BPI.
>>
>> To step back, it looks like:
>>
>> want to use profiles from more passes -> need to get BFI (from loop
>> passes) -> need all the loop passes to preserve BFI.
>>
>> I wonder if there's no way around this.
> Indeed. I believe this is the general consensus here.
>
> regards,
> Fedor.
>
>> regards,
>> Fedor.
>>
>>>> It seems that for different types of passes to be able to get PSI and
>>>> BFI, we’d need to ensure PSI is cached for a non-module pass, and that
>>>> PSI, BFI, and the ModuleAnalysisManager proxy are cached for a loop pass
>>>> in the pass pipelines. This may mean potentially needing to insert
>>>> BFI/PSI in front of many passes [2]. It seems not obvious how to
>>>> conditionally insert BFI for PGO pipelines, because there isn’t always a
>>>> good flag to detect PGO cases [3], or we tend to build pass pipelines
>>>> before examining the code (or without propagating enough info down) [4].
>>>>
>>>> Proposed approach
>>>>
>>>> - Cache PSI right after the profile summary in the IR is written in the
>>>> pass pipeline [5]. This would avoid the need to insert
>>>> RequireAnalysisPass for PSI before each non-module pass that needs it.
>>>> PSI can technically be invalidated, but that is unlikely. If it is, we
>>>> insert another RequireAnalysisPass [6].
>>>>
>>>> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
>>>> before each loop pass that needs it. This doesn't seem avoidable
>>>> because BFI can be invalidated whenever the CFG changes. We detect PGO
>>>> based on the command line flags and/or whether the module has the
>>>> profile summary info (we may need to pass the module to more functions).
>>>>
>>>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>>>> able to get to the ModuleAnalysisManager in one step, and PSI through it.
>>>>
>>>> Alternative approaches
>>>>
>>>> Dropping BFI and using PSI only
>>>> We could consider not using BFI and relying solely on PSI and
>>>> function-level profiles (as opposed to block-level), but profile
>>>> precision would suffer.
>>>>
>>>> Computing BFI in-place
>>>> We could consider computing BFI “in-place” by directly running BFI
>>>> outside of the pass manager [7]. This would let us avoid the analysis
>>>> manager constraints, but it would still involve running an outer-scope
>>>> analysis from an inner-scope pass and could potentially cause problems
>>>> in terms of pass pipelining and concurrency. Moreover, a potential
>>>> downside of running analyses in-place is that it won’t take advantage
>>>> of cached analysis results provided by the pass manager.
>>>>
>>>> Adding inner-scope versions of PSI and BFI
>>>> We could consider adding function-level and loop-level PSI and
>>>> loop-level BFI, which internally act like their outer-scope versions
>>>> but provide inner-scope results only. This way, we could always call
>>>> getResult for PSI and BFI. However, this would still involve running an
>>>> outer-scope analysis from an inner-scope pass.
>>>>
>>>> Caching the FAM and the MAM proxies
>>>> We could consider caching the FunctionAnalysisManager and the
>>>> ModuleAnalysisManager proxies once early on, instead of adding a new
>>>> proxy. But it seems unlikely to work well, because the analysis cache
>>>> key type includes the function or the module, and some pass may add a
>>>> new function for which the proxy wouldn’t be cached. We’d need to write
>>>> and insert a pass in select locations just to fill the cache. Adding
>>>> the new proxy would take care of these with a three-line change.
>>>>
>>>> Conditional BFI
>>>> We could consider adding a conditional BFI analysis that is a wrapper
>>>> around BFI and computes BFI only if profiles are available (either
>>>> checking whether the module has a profile summary, or depending on the
>>>> PSI). With this, we wouldn’t need to conditionally build pass
>>>> pipelines, and it may work for the new pass manager. But a similar
>>>> approach wouldn’t work for the old pass manager, because we cannot
>>>> conditionally depend on an analysis under it.
>>>
>>> There is LazyBlockFrequencyInfo.
>>> Not sure how well it fits this idea.
>>>
>>> Good point. LazyBlockFrequencyInfo seems usable with the old pass
>>> manager (saving unnecessary BFI/BPI) and would work for function passes.
>>> I think the restriction still applies - a loop pass still cannot request
>>> (outer-scope) BFI, lazy or not, new or old (pass manager). Another
>>> assumption is that it'd be cheap and safe to unconditionally depend on
>>> PSI or check the module's profile summary.
>>>
>>> regards,
>>> Fedor.
>>>
>>>> [1] We cannot call AnalysisManager::getResult for an outer scope, only
>>>> getCachedResult. Probably because of potential pipelining or
>>>> concurrency issues.
>>>> [2] For example, potentially breaking up multiple pipelined loop passes
>>>> and inserting RequireAnalysisPass<BlockFrequencyAnalysis> in front of
>>>> each of them.
>>>> [3] For example, -fprofile-instr-use and -fprofile-sample-use aren’t
>>>> present in ThinLTO post-link builds.
>>>> [4] For example, we could check whether the module has the profile
>>>> summary metadata annotated when building pass pipelines, but we don’t
>>>> always pass the module down to the place where we build pass pipelines.
>>>> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
>>>> PGOInstrumentationUse and SampleProfileLoaderPass passes (and around
>>>> the PGOIndirectCallPromotion pass for the ThinLTO post-link pipeline).
>>>> [6] For example, context-sensitive PGO.
>>>> [7] Directly calling its constructor along with the dependent analysis
>>>> results, e.g. the jump threading pass.
Hiroshi Yamauchi via llvm-dev
2019-Mar-13 23:04 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
On Wed, Mar 13, 2019 at 2:37 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
> Overall seems fine to me.
>
> On 3/11/19 8:12 PM, Hiroshi Yamauchi wrote:
>> Here's a revised approach based on the discussion:
>>
>> - Cache PSI right after the profile summary in the IR is written in the
>> pass pipeline. This would avoid the need to insert RequireAnalysisPass
>> for PSI before each non-module pass that needs it. PSI can technically
>> be invalidated, but that is unlikely, as PSI is immutable. If it is, we
>> can insert another RequireAnalysisPass.
>
> ProfileSummaryInfo::invalidate always returns false, so it does not need
> any extra handling (as soon as it finds its way into ModuleAnalysisManager).

Right. As long as there happens to be a pass that always runs PSI before
any pass that expects it to be cached, it'd be fine. The point here is to
be extra reassuring: by having PSI explicitly run and cached right after a
pass that writes the profile summary to the IR, there's no window where
PSI may not be cached.

>> - If PGO, conditionally request BFI from the passes that need it. For
>> (pipelined) loop passes, we need to insert a pass that computes BFI
>> conditionally (if PGO) in front of them and make them preserve BFI
>> throughout. This is to avoid pipeline interruptions and potential
>> invalidation/recomputation of BFI between the loop passes. We detect
>> PGO based on whether PSI has profile summary info. (For the old pass
>> manager, implement a similar approach using LazyBlockFrequencyInfo.)
>
> There is already an optional analysis in LoopStandardAnalysisResults -
> MemorySSA. We can do the same for BFI/BPI.
> And, yes - preserving those through loop passes is a cornerstone of this
> approach.
>
>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>> able to get to the ModuleAnalysisManager in one step, and PSI through it.
>
> This is just a compile-time optimization that saves one indirection
> through FunctionAnalysisManager.
> I'm not even sure it is worth the effort. And definitely not crucial
> for the overall idea.

This should probably be clarified to something like:

- Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
able to get to the ModuleAnalysisManager and PSI, because it may not
always be reachable through the (const) FunctionAnalysisManager unless
ModuleAnalysisManagerFunctionProxy is already cached.

Since the FunctionAnalysisManager we can get from the LoopAnalysisManager
is a const ref, we cannot call getResult on it to always get the
ModuleAnalysisManager and PSI (see below). This actually happens in my
experiment.

  SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
    auto &FAM =
        LAM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
    auto *MAMProxy = FAM.getCachedResult<ModuleAnalysisManagerFunctionProxy>(
        L.getHeader()->getParent());  // Can be null
    if (MAMProxy) {
      auto &MAM = MAMProxy->getManager();
      auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
    } else {
      // Can't get MAM and PSI.
    }
    ...

  ->

  SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
    auto &MAM =
        LAM.getResult<ModuleAnalysisManagerLoopProxy>(L, AR).getManager();  // Not null
    auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
    ...

AFAICT, adding ModuleAnalysisManagerLoopProxy seems to be as simple as:

  /// A proxy from a \c ModuleAnalysisManager to a \c Loop.
  typedef OuterAnalysisManagerProxy<ModuleAnalysisManager, Loop,
                                    LoopStandardAnalysisResults &>
      ModuleAnalysisManagerLoopProxy;

> regards,
> Fedor.
Fedor Sergeev via llvm-dev
2019-Mar-13 23:20 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
On 3/14/19 2:04 AM, Hiroshi Yamauchi wrote:
> On Wed, Mar 13, 2019 at 2:37 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
>>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>>> able to get to the ModuleAnalysisManager in one step, and PSI through it.
>> This is just a compile-time optimization that saves one indirection
>> through FunctionAnalysisManager.
>> I'm not even sure it is worth the effort. And definitely not crucial
>> for the overall idea.
>
> This should probably be clarified to something like:
>
> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
> able to get to the ModuleAnalysisManager and PSI, because it may not
> always be reachable through the (const) FunctionAnalysisManager unless
> ModuleAnalysisManagerFunctionProxy is already cached.
>
> Since the FunctionAnalysisManager we can get from the LoopAnalysisManager
> is a const ref, we cannot call getResult on it to always get the
> ModuleAnalysisManager and PSI (see below). This actually happens in my
> experiment.
>
>   SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>     auto &FAM =
>         LAM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
>     auto *MAMProxy = FAM.getCachedResult<ModuleAnalysisManagerFunctionProxy>(
>         L.getHeader()->getParent());  // Can be null

Oh... well...

>     if (MAMProxy) {
>       auto &MAM = MAMProxy->getManager();
>       auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
>     } else {
>       // Can't get MAM and PSI.
>     }
>     ...
>
> ->
>
>   SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>     auto &MAM =
>         LAM.getResult<ModuleAnalysisManagerLoopProxy>(L, AR).getManager();  // Not null
>     auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
>     ...
>
> AFAICT, adding ModuleAnalysisManagerLoopProxy seems to be as simple as:
>
>   /// A proxy from a \c ModuleAnalysisManager to a \c Loop.
>   typedef OuterAnalysisManagerProxy<ModuleAnalysisManager, Loop,
>                                     LoopStandardAnalysisResults &>
>       ModuleAnalysisManagerLoopProxy;

It also needs to be added to PassBuilder::crossRegisterProxies...
But yes, that appears to be a required action.

regards,
Fedor.
>>>> Indeed, this is an intentional restriction in >>>> new pass manager, which is more or less a >>>> reflection of a fundamental property of >>>> outer-inner IRUnit relationship >>>> and transformations/analyses run on those >>>> units. The main intent for having those inner >>>> IRUnits (e.g. Loops) is to run local >>>> transformations and save compile time >>>> on being local to a particular small piece of >>>> IR. Loop Pass manager allows you to run a whole >>>> pipeline of different transformations still >>>> locally, amplifying the save. >>>> As soon as you run function-level analysis from >>>> within the loop pipeline you essentially break >>>> this pipelining. >>>> Say, as you run your loop transformation it >>>> modifies the loop (and the function) and >>>> potentially invalidates the analysis, >>>> so you have to rerun your analysis again and >>>> again. Hence instead of saving on compile time >>>> it ends up increasing it. >>>> >>>> >>>> Exactly. >>>> >>>> >>>> I have hit this issue somewhat recently with >>>> dependency of loop passes on BranchProbabilityInfo. >>>> (some loop passes, like IRCE can use it for >>>> profitability analysis). >>>> >>>> The only solution that appears to be reasonable >>>> there is to teach all the loops passes that >>>> need to be pipelined >>>> to preserve BPI (or any other >>>> module/function-level analyses) similar to how >>>> they preserve DominatorTree and >>>> other "LoopStandard" analyses. >>>> >>>> >>>> Is this implemented - do the loop passes preserve BPI? >>> Nope, not implemented right now. >>> One of the problems is that even loop >>> canonicalization passes run at the start of loop >>> pass manager dont preserve it >>> (and at least LoopSimplifyCFG does change control flow). 
>>>> >>>> In buildFunctionSimplificationPipeline >>>> (where LoopFullUnrollPass is added as in the >>>> sketch), LateLoopOptimizationsEPCallbacks >>>> and LoopOptimizerEndEPCallbacks seem to allow some >>>> arbitrary loop passes to be inserted into the >>>> pipelines (via flags)? >>>> >>>> I wonder how hard it'd be to teach all the relevant >>>> loop passes to preserve BFI(or BPI).. >>> Well, each time you restructure control flow around >>> the loops you will have to update those extra analyses, >>> pretty much the same way as DT is being updated >>> through DomTreeUpdater. >>> The trick is to design a proper update interface >>> (and then implement it ;) ). >>> And I have not spent enough time on this issue to >>> get a good idea of what that interface would be. >>> >>> >>> Hm, sounds non-trivial :) noting BFI depends on BPI. >>> >>> >>> To step back, it looks like: >>> >>> want to use profiles from more passes -> need to get BFI >>> (from loop passes) -> need all the loop passes to preserve BFI. >>> >>> I wonder if there's no way around this. >> Indeed. I believe this is a general consensus here. >> >> regards, >> Fedor. >> >>> >>> >>> >>> regards, >>> Fedor. >>> >>>> >>>>> It seems that for different types of passes to >>>>> be able to get PSI and BFI, we’d need to >>>>> ensure PSI is cached for a non-module pass, >>>>> and PSI, BFI and the ModuleAnalysisManager >>>>> proxy are cached for a loop pass in the pass >>>>> pipelines. This may mean potentially needing >>>>> to insert BFI/PSI in front of many passes [2]. >>>>> It seems not obvious how to conditionally >>>>> insert BFI for PGO pipelines because there >>>>> isn’t always a good flag to detect PGO cases >>>>> [3] or we tend to build pass pipelines before >>>>> examining the code (or without propagating >>>>> enough info down) [4]. >>>>> >>>>> Proposed approach >>>>> >>>>> - Cache PSI right after the profile summary in >>>>> the IR is written in the pass pipeline [5]. 
>>>>> This would avoid the need to insert RequireAnalysisPass for PSI
>>>>> before each non-module pass that needs it. PSI can technically be
>>>>> invalidated, but that is unlikely. If it is, we insert another
>>>>> RequireAnalysisPass [6].
>>>>>
>>>>> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
>>>>> before each loop pass that needs it. This doesn't seem avoidable
>>>>> because BFI can be invalidated whenever the CFG changes. We detect
>>>>> PGO based on the command line flags and/or whether the module has
>>>>> the profile summary info (we may need to pass the module to more
>>>>> functions).
>>>>>
>>>>> - Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass
>>>>> to be able to get to the ModuleAnalysisManager in one step, and
>>>>> PSI through it.
>>>>>
>>>>> Alternative approaches
>>>>>
>>>>> Dropping BFI and using PSI only
>>>>> We could consider not using BFI and relying solely on PSI and
>>>>> function-level profiles (as opposed to block-level), but profile
>>>>> precision would suffer.
>>>>>
>>>>> Computing BFI in-place
>>>>> We could consider computing BFI "in-place" by directly running BFI
>>>>> outside of the pass manager [7]. This would let us avoid the
>>>>> analysis manager constraints, but it would still involve running
>>>>> an outer-scope analysis from an inner-scope pass and could cause
>>>>> problems in terms of pass pipelining and concurrency. Moreover, a
>>>>> downside of running analyses in-place is that it won't take
>>>>> advantage of cached analysis results provided by the pass manager.
>>>>>
>>>>> Adding inner-scope versions of PSI and BFI
>>>>> We could consider adding function-level and loop-level PSI and a
>>>>> loop-level BFI, which would internally act like their outer-scope
>>>>> versions but provide inner-scope results only. This way, we could
>>>>> always call getResult for PSI and BFI.
>>>>> However, this would still involve running an outer-scope analysis
>>>>> from an inner-scope pass.
>>>>>
>>>>> Caching the FAM and the MAM proxies
>>>>> We could consider caching the FunctionAnalysisManager and the
>>>>> ModuleAnalysisManager proxies once, early on, instead of adding a
>>>>> new proxy. But this seems unlikely to work well because the
>>>>> analysis cache key type includes the function or the module, and
>>>>> some pass may add a new function for which the proxy wouldn't be
>>>>> cached. We'd need to write and insert a pass in select locations
>>>>> just to fill the cache. Adding the new proxy would take care of
>>>>> this with a three-line change.
>>>>>
>>>>> Conditional BFI
>>>>> We could consider adding a conditional BFI analysis that is a
>>>>> wrapper around BFI and computes BFI only if profiles are available
>>>>> (either checking that the module has a profile summary or
>>>>> depending on PSI). With this, we wouldn't need to conditionally
>>>>> build pass pipelines, and it may work for the new pass manager.
>>>>> But a similar approach wouldn't work for the old pass manager
>>>>> because we cannot conditionally depend on an analysis under it.
>>>> There is LazyBlockFrequencyInfo.
>>>> Not sure how well it fits this idea.
>>>>
>>>> Good point. LazyBlockFrequencyInfo seems usable with the old pass
>>>> manager (to save unnecessary BFI/BPI) and would work for function
>>>> passes. I think the restriction still applies - a loop pass still
>>>> cannot request (outer-scope) BFI, lazy or not, new or old (pass
>>>> manager). Another assumption is that it'd be cheap and safe to
>>>> unconditionally depend on PSI or to check the module's profile
>>>> summary.
>>>>
>>>> regards,
>>>> Fedor.
>>>>
>>>>> [1] We cannot call AnalysisManager::getResult for an outer scope,
>>>>> only getCachedResult - probably because of potential pipelining or
>>>>> concurrency issues.
>>>>> [2] For example, potentially breaking up multiple pipelined loop
>>>>> passes and inserting RequireAnalysisPass<BlockFrequencyAnalysis>
>>>>> in front of each of them.
>>>>> [3] For example, -fprofile-instr-use and -fprofile-sample-use
>>>>> aren't present in ThinLTO post-link builds.
>>>>> [4] For example, we could check whether the module has the profile
>>>>> summary metadata annotated when building pass pipelines, but we
>>>>> don't always pass the module down to the place where we build pass
>>>>> pipelines.
>>>>> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
>>>>> PGOInstrumentationUse and SampleProfileLoaderPass passes (and
>>>>> around the PGOIndirectCallPromotion pass for the ThinLTO post-link
>>>>> pipeline).
>>>>> [6] For example, for context-sensitive PGO.
>>>>> [7] Directly calling its constructor along with the dependent
>>>>> analysis results, e.g. as the jump threading pass does.
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
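The caching trick at the heart of the proposal - PSI's invalidate() always returns false, so one up-front require makes it visible to every later inner-scope pass via getCachedResult, and a pass can then decide whether BFI is worth computing - can be modeled in a small sketch. This is a toy under made-up names, not the actual LLVM interfaces:

```cpp
#include <optional>

// Stand-in for a ProfileSummaryInfo-like result (hypothetical name).
struct ToyPSI { bool hasProfileSummary; };

class ToyModuleAnalysisManager {
  std::optional<ToyPSI> cachedPSI;

public:
  // Like a RequireAnalysisPass for PSI: force computation once, early
  // in the pipeline, right after the profile summary is written.
  void requirePSI(bool moduleHasProfileSummary) {
    cachedPSI = ToyPSI{moduleHasProfileSummary};
  }

  // Inner-scope passes may only ask for the *cached* outer-scope
  // result (the getCachedResult restriction from footnote [1]).
  const ToyPSI *getCachedPSI() const {
    return cachedPSI ? &*cachedPSI : nullptr;
  }

  // PSI's invalidate() always returns false ("never stale"), so even a
  // pass that preserves nothing leaves it in the cache.
  void invalidateAll() { /* PSI deliberately kept */ }
};

// A "loop pass" (or the pipeline builder) decides whether to bother
// computing BFI based only on the cached PSI - the "conditional BFI
// if PGO" part of the proposal.
bool shouldComputeBFI(const ToyModuleAnalysisManager &MAM) {
  const ToyPSI *PSI = MAM.getCachedPSI();
  return PSI && PSI->hasProfileSummary;
}
```

In this model, once requirePSI has run, shouldComputeBFI keeps returning the right answer even after invalidateAll, mirroring why a single RequireAnalysisPass for PSI early in the pipeline is enough in the proposed approach.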