Fedor Sergeev via llvm-dev
2019-Mar-04 22:05 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote (quoting the thread to date):

[Hiroshi Yamauchi, 2/28/19:]
> Hi all,
>
> To implement more profile-guided optimizations, we'd like to use
> ProfileSummaryInfo (PSI) and BlockFrequencyInfo (BFI) from more passes
> of various types, under the new pass manager.
>
> The following is what we came up with. We would appreciate feedback.
> Thanks.
>
> Issue
>
> It's not obvious (to me) how best to do this, given that we cannot
> request an outer-scope analysis result from an inner-scope pass
> through analysis managers [1], and that we might unnecessarily run
> some analyses unless we conditionally build pass pipelines for PGO
> cases.
>
> It seems that for different types of passes to be able to get PSI and
> BFI, we'd need to ensure PSI is cached for a non-module pass, and that
> PSI, BFI and the ModuleAnalysisManager proxy are cached for a loop
> pass in the pass pipelines. This may mean needing to insert BFI/PSI in
> front of many passes [2]. It is not obvious how to conditionally
> insert BFI for PGO pipelines, because there isn't always a good flag
> to detect PGO cases [3], and we tend to build pass pipelines before
> examining the code (or without propagating enough info down) [4].
>
> Proposed approach
>
> - Cache PSI right after the profile summary in the IR is written in
> the pass pipeline [5]. This would avoid the need to insert
> RequireAnalysisPass for PSI before each non-module pass that needs it.
> PSI can technically be invalidated, but that is unlikely. If it is, we
> insert another RequireAnalysisPass [6].
>
> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
> before each loop pass that needs it. This doesn't seem avoidable
> because BFI can be invalidated whenever the CFG changes. We detect PGO
> based on the command line flags and/or whether the module has the
> profile summary info (we may need to pass the module to more
> functions).
>
> - Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass to
> be able to get to the ModuleAnalysisManager in one step, and PSI
> through it.
>
> Alternative approaches
>
> Dropping BFI and using PSI only
> We could consider not using BFI and relying solely on PSI and
> function-level profiles (as opposed to block-level), but profile
> precision would suffer.
>
> Computing BFI in-place
> We could consider computing BFI "in-place" by directly running BFI
> outside of the pass manager [7]. This would let us sidestep the
> analysis manager constraints, but it would still involve running an
> outer-scope analysis from an inner-scope pass and could cause problems
> in terms of pass pipelining and concurrency. Moreover, a downside of
> running analyses in-place is that it won't take advantage of cached
> analysis results provided by the pass manager.
>
> Adding inner-scope versions of PSI and BFI
> We could consider adding function-level and loop-level PSI and a
> loop-level BFI, which internally act like their outer-scope versions
> but provide inner-scope results only. This way, we could always call
> getResult for PSI and BFI. However, this would still involve running
> an outer-scope analysis from an inner-scope pass.
>
> Caching the FAM and the MAM proxies
> We could consider caching the FunctionAnalysisManager and
> ModuleAnalysisManager proxies once early on, instead of adding a new
> proxy. But this seems unlikely to work well, because the analysis
> cache key type includes the function or the module, and some pass may
> add a new function for which the proxy wouldn't be cached. We'd need
> to write and insert a pass in select locations just to fill the cache.
> Adding the new proxy would take care of this with a three-line change.
>
> Conditional BFI
> We could consider adding a conditional BFI analysis that is a wrapper
> around BFI and computes BFI only if profiles are available (either
> checking that the module has a profile summary, or depending on PSI).
> With this, we wouldn't need to conditionally build pass pipelines, and
> it may work for the new pass manager. But a similar wrapper wouldn't
> work for the old pass manager, because we cannot conditionally depend
> on an analysis under it.
>
> [1] We cannot call AnalysisManager::getResult for an outer scope, only
> getCachedResult. Probably because of potential pipelining or
> concurrency issues.
> [2] For example, potentially breaking up multiple pipelined loop
> passes and inserting RequireAnalysisPass<BlockFrequencyAnalysis> in
> front of each of them.
> [3] For example, -fprofile-instr-use and -fprofile-sample-use aren't
> present in ThinLTO post-link builds.
> [4] For example, we could check whether the module has the profile
> summary metadata annotated when building pass pipelines, but we don't
> always pass the module down to the place where we build pass
> pipelines.
> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
> PGOInstrumentationUse and SampleProfileLoaderPass passes (and around
> the PGOIndirectCallPromotion pass for the ThinLTO post-link pipeline).
> [6] For example, for context-sensitive PGO.
> [7] Directly calling its constructor along with the dependent
> analyses' results, e.g. as the jump threading pass does.

[Fedor Sergeev, 3/1/19:]
> Indeed, this is an intentional restriction in the new pass manager,
> which is more or less a reflection of a fundamental property of the
> outer-inner IRUnit relationship and the transformations/analyses run
> on those units. The main intent of having those inner IRUnits (e.g.
> loops) is to run local transformations and save compile time by being
> local to a particular small piece of IR. The loop pass manager allows
> you to run a whole pipeline of different transformations, still
> locally, amplifying the savings. As soon as you run a function-level
> analysis from within the loop pipeline you essentially break this
> pipelining. Say, as you run your loop transformation it modifies the
> loop (and the function) and potentially invalidates the analysis, so
> you have to rerun your analysis again and again. Hence instead of
> saving compile time it ends up increasing it.
>
> I have hit this issue somewhat recently with the dependency of loop
> passes on BranchProbabilityInfo (some loop passes, like IRCE, can use
> it for profitability analysis). The only solution that appears
> reasonable there is to teach all the loop passes that need to be
> pipelined to preserve BPI (or any other module/function-level
> analyses), similar to how they preserve DominatorTree and the other
> "LoopStandard" analyses.
>
> On the "Conditional BFI" alternative: there is LazyBlockFrequencyInfo.
> Not sure how well it fits this idea.

[Hiroshi Yamauchi, 3/2/19:]
> Here's a sketch of the proposed approach for just one pass (but
> imagine more): https://reviews.llvm.org/D58845
>
> Exactly.
>
> Is this implemented - do the loop passes preserve BPI?
>
> Good point. LazyBlockFrequencyInfo seems usable with the old pass
> manager (to save unnecessary BFI/BPI) and would work for function
> passes. I think the restriction still applies - a loop pass still
> cannot request (outer-scope) BFI, lazy or not, new or old (pass
> manager). Another assumption is that it'd be cheap and safe to
> unconditionally depend on PSI or to check the module's profile
> summary.

[Fedor Sergeev, 3/2/19:]
> Nope, not implemented right now. One of the problems is that even the
> loop canonicalization passes run at the start of the loop pass manager
> don't preserve it (and at least LoopSimplifyCFG does change control
> flow).

[Hiroshi Yamauchi, 3/4/19:]
> In buildFunctionSimplificationPipeline (where LoopFullUnrollPass is
> added as in the sketch), LateLoopOptimizationsEPCallbacks and
> LoopOptimizerEndEPCallbacks seem to allow arbitrary loop passes to be
> inserted into the pipelines (via flags)? I wonder how hard it'd be to
> teach all the relevant loop passes to preserve BFI (or BPI)...

[Fedor Sergeev, 3/4/19:]
> Well, each time you restructure control flow around the loops you will
> have to update those extra analyses, pretty much the same way DT is
> updated through DomTreeUpdater. The trick is to design a proper update
> interface (and then implement it ;) ). And I have not spent enough
> time on this issue to get a good idea of what that interface would be.

[Hiroshi Yamauchi, 3/4/19:]
> Hm, sounds non-trivial :) noting BFI depends on BPI.
>
> To step back, it looks like:
>
> want to use profiles from more passes -> need to get BFI (from loop
> passes) -> need all the loop passes to preserve BFI.
>
> I wonder if there's no way around this.

Indeed. I believe this is the general consensus here.

regards,
Fedor.

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Hiroshi Yamauchi via llvm-dev
2019-Mar-11 17:12 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Here's a revised approach based on the discussion:

- Cache PSI right after the profile summary in the IR is written in the
pass pipeline. This would avoid the need to insert RequireAnalysisPass
for PSI before each non-module pass that needs it. PSI can technically
be invalidated, but that is unlikely, as PSI is immutable. If it is, we
can insert another RequireAnalysisPass.

- If PGO, conditionally request BFI from the passes that need it. For
(pipelined) loop passes, we need to insert a pass that conditionally
(if PGO) computes BFI in front of them, and make them preserve BFI
throughout. This is to avoid pipeline interruptions and the potential
invalidation/recomputation of BFI between the loop passes. We detect
PGO based on whether PSI has profile summary info. (For the old pass
manager, implement a similar approach using LazyBlockFrequencyInfo.)

- Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass to
be able to get to the ModuleAnalysisManager in one step, and PSI
through it.

On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
> [snip - earlier thread quoted in full]
Fedor Sergeev via llvm-dev
2019-Mar-13 21:37 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Overall seems fine to me.

On 3/11/19 8:12 PM, Hiroshi Yamauchi wrote:
> Here's a revised approach based on the discussion:
>
> - Cache PSI right after the profile summary in the IR is written in
> the pass pipeline. This would avoid the need to insert
> RequireAnalysisPass for PSI before each non-module pass that needs it.
> PSI can technically be invalidated, but that is unlikely, as PSI is
> immutable. If it is, we can insert another RequireAnalysisPass.

ProfileSummaryInfo::invalidate always returns false, so it does not
need any extra handling (as soon as it finds its way into the
ModuleAnalysisManager).

> - If PGO, conditionally request BFI from the passes that need it. For
> (pipelined) loop passes, we need to insert a pass that conditionally
> (if PGO) computes BFI in front of them, and make them preserve BFI
> throughout. This is to avoid pipeline interruptions and the potential
> invalidation/recomputation of BFI between the loop passes. We detect
> PGO based on whether PSI has profile summary info. (For the old pass
> manager, implement a similar approach using LazyBlockFrequencyInfo.)

There is already an optional analysis in LoopStandardAnalysisResults -
MemorySSA. We can do the same for BFI/BPI. And, yes - preserving those
through the loop passes is a cornerstone of this approach.

> - Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass to
> be able to get to the ModuleAnalysisManager in one step, and PSI
> through it.

This is just a compile-time optimization; it saves one indirection
through the FunctionAnalysisManager. I'm not even sure it is worth the
effort, and it is definitely not crucial to the overall idea.

regards,
Fedor.