Fedor Sergeev via llvm-dev
2019-Mar-04 22:05 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote (quoting the thread to date):

[Hiroshi Yamauchi, 2/28/19:]
> Hi all,
>
> To implement more profile-guided optimizations, we'd like to use
> ProfileSummaryInfo (PSI) and BlockFrequencyInfo (BFI) from more passes
> of various types, under the new pass manager.
>
> The following is what we came up with. We would appreciate feedback.
> Thanks.
>
> Issue
>
> It's not obvious (to me) how best to do this, given that we cannot
> request an outer-scope analysis result from an inner-scope pass
> through analysis managers [1], and that we might unnecessarily run
> some analyses unless we conditionally build pass pipelines for PGO
> cases.
>
> It seems that for different types of passes to be able to get PSI and
> BFI, we'd need to ensure PSI is cached for a non-module pass, and that
> PSI, BFI and the ModuleAnalysisManager proxy are cached for a loop
> pass in the pass pipelines. This may mean needing to insert BFI/PSI in
> front of many passes [2]. It is not obvious how to conditionally
> insert BFI for PGO pipelines, because there isn't always a good flag
> to detect PGO cases [3], and we tend to build pass pipelines before
> examining the code (or without propagating enough info down) [4].
>
> Proposed approach
>
> - Cache PSI right after the profile summary in the IR is written in
> the pass pipeline [5]. This would avoid the need to insert
> RequireAnalysisPass for PSI before each non-module pass that needs it.
> PSI can technically be invalidated, but that is unlikely. If it is, we
> insert another RequireAnalysisPass [6].
>
> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
> before each loop pass that needs it. This doesn't seem avoidable
> because BFI can be invalidated whenever the CFG changes. We detect PGO
> based on the command line flags and/or whether the module has the
> profile summary info (we may need to pass the module to more
> functions).
>
> - Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass to
> be able to get to the ModuleAnalysisManager in one step, and PSI
> through it.
>
> Alternative approaches
>
> Dropping BFI and using PSI only
> We could consider not using BFI and relying solely on PSI and
> function-level profiles (as opposed to block-level), but profile
> precision would suffer.
>
> Computing BFI in-place
> We could consider computing BFI "in-place" by directly running BFI
> outside of the pass manager [7]. This would let us sidestep the
> analysis manager constraints, but it would still involve running an
> outer-scope analysis from an inner-scope pass and could cause problems
> in terms of pass pipelining and concurrency. Moreover, a downside of
> running analyses in-place is that it won't take advantage of cached
> analysis results provided by the pass manager.
>
> Adding inner-scope versions of PSI and BFI
> We could consider adding function-level and loop-level PSI and a
> loop-level BFI, which internally act like their outer-scope versions
> but provide inner-scope results only. This way, we could always call
> getResult for PSI and BFI. However, this would still involve running
> an outer-scope analysis from an inner-scope pass.
>
> Caching the FAM and the MAM proxies
> We could consider caching the FunctionAnalysisManager and
> ModuleAnalysisManager proxies once early on, instead of adding a new
> proxy. But this seems unlikely to work well, because the analysis
> cache key type includes the function or the module, and some pass may
> add a new function for which the proxy wouldn't be cached. We'd need
> to write and insert a pass in select locations just to fill the cache.
> Adding the new proxy would take care of this with a three-line change.
>
> Conditional BFI
> We could consider adding a conditional BFI analysis that is a wrapper
> around BFI and computes BFI only if profiles are available (either
> checking that the module has a profile summary, or depending on PSI).
> With this, we wouldn't need to conditionally build pass pipelines, and
> it may work for the new pass manager. But a similar wrapper wouldn't
> work for the old pass manager, because we cannot conditionally depend
> on an analysis under it.
>
> [1] We cannot call AnalysisManager::getResult for an outer scope, only
> getCachedResult. Probably because of potential pipelining or
> concurrency issues.
> [2] For example, potentially breaking up multiple pipelined loop
> passes and inserting RequireAnalysisPass<BlockFrequencyAnalysis> in
> front of each of them.
> [3] For example, -fprofile-instr-use and -fprofile-sample-use aren't
> present in ThinLTO post-link builds.
> [4] For example, we could check whether the module has the profile
> summary metadata annotated when building pass pipelines, but we don't
> always pass the module down to the place where we build pass
> pipelines.
> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
> PGOInstrumentationUse and SampleProfileLoaderPass passes (and around
> the PGOIndirectCallPromotion pass for the ThinLTO post-link pipeline).
> [6] For example, for context-sensitive PGO.
> [7] Directly calling its constructor along with the dependent
> analyses' results, e.g. as the jump threading pass does.

[Fedor Sergeev, 3/1/19:]
> Indeed, this is an intentional restriction in the new pass manager,
> which is more or less a reflection of a fundamental property of the
> outer-inner IRUnit relationship and the transformations/analyses run
> on those units. The main intent of having those inner IRUnits (e.g.
> loops) is to run local transformations and save compile time by being
> local to a particular small piece of IR. The loop pass manager allows
> you to run a whole pipeline of different transformations, still
> locally, amplifying the savings. As soon as you run a function-level
> analysis from within the loop pipeline you essentially break this
> pipelining. Say, as you run your loop transformation it modifies the
> loop (and the function) and potentially invalidates the analysis, so
> you have to rerun your analysis again and again. Hence instead of
> saving compile time it ends up increasing it.
>
> I have hit this issue somewhat recently with the dependency of loop
> passes on BranchProbabilityInfo (some loop passes, like IRCE, can use
> it for profitability analysis). The only solution that appears
> reasonable there is to teach all the loop passes that need to be
> pipelined to preserve BPI (or any other module/function-level
> analyses), similar to how they preserve DominatorTree and the other
> "LoopStandard" analyses.
>
> On the "Conditional BFI" alternative: there is LazyBlockFrequencyInfo.
> Not sure how well it fits this idea.

[Hiroshi Yamauchi, 3/2/19:]
> Here's a sketch of the proposed approach for just one pass (but
> imagine more): https://reviews.llvm.org/D58845
>
> Exactly.
>
> Is this implemented - do the loop passes preserve BPI?
>
> Good point. LazyBlockFrequencyInfo seems usable with the old pass
> manager (to save unnecessary BFI/BPI) and would work for function
> passes. I think the restriction still applies - a loop pass still
> cannot request (outer-scope) BFI, lazy or not, new or old (pass
> manager). Another assumption is that it'd be cheap and safe to
> unconditionally depend on PSI or to check the module's profile
> summary.

[Fedor Sergeev, 3/2/19:]
> Nope, not implemented right now. One of the problems is that even the
> loop canonicalization passes run at the start of the loop pass manager
> don't preserve it (and at least LoopSimplifyCFG does change control
> flow).

[Hiroshi Yamauchi, 3/4/19:]
> In buildFunctionSimplificationPipeline (where LoopFullUnrollPass is
> added as in the sketch), LateLoopOptimizationsEPCallbacks and
> LoopOptimizerEndEPCallbacks seem to allow arbitrary loop passes to be
> inserted into the pipelines (via flags)? I wonder how hard it'd be to
> teach all the relevant loop passes to preserve BFI (or BPI)...

[Fedor Sergeev, 3/4/19:]
> Well, each time you restructure control flow around the loops you will
> have to update those extra analyses, pretty much the same way DT is
> updated through DomTreeUpdater. The trick is to design a proper update
> interface (and then implement it ;) ). And I have not spent enough
> time on this issue to get a good idea of what that interface would be.

[Hiroshi Yamauchi, 3/4/19:]
> Hm, sounds non-trivial :) noting BFI depends on BPI.
>
> To step back, it looks like:
>
> want to use profiles from more passes -> need to get BFI (from loop
> passes) -> need all the loop passes to preserve BFI.
>
> I wonder if there's no way around this.

Indeed. I believe this is the general consensus here.

regards,
Fedor.

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Hiroshi Yamauchi via llvm-dev
2019-Mar-11 17:12 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Here's a revised approach based on the discussion:

- Cache PSI right after the profile summary in the IR is written in the
pass pipeline. This would avoid the need to insert RequireAnalysisPass
for PSI before each non-module pass that needs it. PSI can technically
be invalidated, but that is unlikely, as PSI is immutable. If it is, we
can insert another RequireAnalysisPass.

- If PGO, conditionally request BFI from the passes that need it. For
(pipelined) loop passes, we need to insert a pass that conditionally
(if PGO) computes BFI in front of them, and make them preserve BFI
throughout. This is to avoid pipeline interruptions and the potential
invalidation/recomputation of BFI between the loop passes. We detect
PGO based on whether PSI has profile summary info. (For the old pass
manager, implement a similar approach using LazyBlockFrequencyInfo.)

- Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass to
be able to get to the ModuleAnalysisManager in one step, and PSI
through it.

On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
> [snip - earlier thread quoted in full]
Fedor Sergeev via llvm-dev
2019-Mar-13 21:37 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Overall seems fine to me.

On 3/11/19 8:12 PM, Hiroshi Yamauchi wrote:
> Here's a revised approach based on the discussion:
>
> - Cache PSI right after the profile summary in the IR is written in
> the pass pipeline. This would avoid the need to insert
> RequireAnalysisPass for PSI before each non-module pass that needs it.
> PSI can technically be invalidated, but that is unlikely, as PSI is
> immutable. If it is, we can insert another RequireAnalysisPass.

ProfileSummaryInfo::invalidate always returns false, so it does not
need any extra handling (as soon as it finds its way into the
ModuleAnalysisManager).

> - If PGO, conditionally request BFI from the passes that need it. For
> (pipelined) loop passes, we need to insert a pass that conditionally
> (if PGO) computes BFI in front of them, and make them preserve BFI
> throughout. This is to avoid pipeline interruptions and the potential
> invalidation/recomputation of BFI between the loop passes. We detect
> PGO based on whether PSI has profile summary info. (For the old pass
> manager, implement a similar approach using LazyBlockFrequencyInfo.)

There is already an optional analysis in LoopStandardAnalysisResults -
MemorySSA. We can do the same for BFI/BPI. And, yes - preserving those
through the loop passes is a cornerstone of this approach.

> - Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass to
> be able to get to the ModuleAnalysisManager in one step, and PSI
> through it.

This is just a compile-time optimization; it saves one indirection
through the FunctionAnalysisManager. I'm not even sure it is worth the
effort, and it is definitely not crucial to the overall idea.

regards,
Fedor.