Fedor Sergeev via llvm-dev
2019-Mar-13 21:37 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Overall seems fine to me.

On 3/11/19 8:12 PM, Hiroshi Yamauchi wrote:
> Here's a revised approach based on the discussion:
>
> - Cache PSI right after the profile summary in the IR is written in the
> pass pipeline. This would avoid the need to insert RequireAnalysisPass
> for PSI before each non-module pass that needs it. PSI can technically
> be invalidated, but that is unlikely, as PSI is immutable. If it is, we
> can insert another RequireAnalysisPass.

ProfileSummaryInfo::invalidate always returns false, so it does not need
any extra handling (as soon as it finds its way into ModuleAnalysisManager).

> - If PGO, conditionally request BFI from the passes that need it. For
> (pipelined) loop passes, we need to insert a pass that computes BFI
> conditionally (if PGO) in front of them and make them preserve BFI
> throughout. This is to avoid pipeline interruptions and potential
> invalidation/recomputation of BFI between the loop passes. We detect
> PGO based on whether PSI has profile summary info. (For the old pass
> manager, implement a similar approach using LazyBlockFrequencyInfo.)

There is already an optional analysis in LoopStandardAnalysisResults -
MemorySSA. We can do the same for BFI/BPI.
And, yes - preserving those through loop passes is a cornerstone of this
approach.

> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
> able to get to the ModuleAnalysisManager in one step, and PSI through it.

This is just a compile-time optimization that saves one indirection
through FunctionAnalysisManager.
I'm not even sure it is worth the effort. And it is definitely not
crucial for the overall idea.
regards,
Fedor.

> On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
>
> On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote:
>> On Mon, Mar 4, 2019 at 10:55 AM Hiroshi Yamauchi <yamauchi at google.com> wrote:
>>
>> On Sat, Mar 2, 2019 at 12:58 AM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
>>
>> On 3/2/19 2:38 AM, Hiroshi Yamauchi wrote:
>>> Here's a sketch of the proposed approach for just one pass (but imagine more):
>>>
>>> https://reviews.llvm.org/D58845
>>>
>>> On Fri, Mar 1, 2019 at 12:54 PM Fedor Sergeev via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 2/28/19 12:47 AM, Hiroshi Yamauchi via llvm-dev wrote:
>>>> Hi all,
>>>>
>>>> To implement more profile-guided optimizations, we’d like to use
>>>> ProfileSummaryInfo (PSI) and BlockFrequencyInfo (BFI) from more passes
>>>> of various types, under the new pass manager.
>>>>
>>>> The following is what we came up with. We would appreciate feedback. Thanks.
>>>>
>>>> Issue
>>>>
>>>> It’s not obvious (to me) how best to do this, given that we cannot
>>>> request an outer-scope analysis result from an inner-scope pass through
>>>> analysis managers [1], and that we might unnecessarily run some analyses
>>>> unless we conditionally build pass pipelines for PGO cases.
>>>
>>> Indeed, this is an intentional restriction in the new pass manager, which
>>> is more or less a reflection of a fundamental property of the outer-inner
>>> IRUnit relationship and the transformations/analyses run on those units.
>>> The main intent of having those inner IRUnits (e.g. Loops) is to run
>>> local transformations and save compile time by being local to a
>>> particular small piece of IR.
>>> The Loop Pass manager allows you to run a whole pipeline of different
>>> transformations, still locally, amplifying the saving.
>>> As soon as you run a function-level analysis from within the loop
>>> pipeline, you essentially break this pipelining.
>>> Say, as your loop transformation runs, it modifies the loop (and the
>>> function) and potentially invalidates the analysis, so you have to rerun
>>> the analysis again and again. Hence, instead of saving compile time, it
>>> ends up increasing it.
>>>
>>> Exactly.
>>>
>>> I have hit this issue somewhat recently with the dependency of loop
>>> passes on BranchProbabilityInfo (some loop passes, like IRCE, can use it
>>> for profitability analysis).
>>> The only solution that appears reasonable there is to teach all the loop
>>> passes that need to be pipelined to preserve BPI (or any other
>>> module/function-level analyses), similar to how they preserve
>>> DominatorTree and the other "LoopStandard" analyses.
>>>
>>> Is this implemented - do the loop passes preserve BPI?
>> Nope, not implemented right now.
>> One of the problems is that even the loop canonicalization passes run at
>> the start of the loop pass manager don't preserve it (and at least
>> LoopSimplifyCFG does change control flow).
>>>
>>> In buildFunctionSimplificationPipeline (where LoopFullUnrollPass is
>>> added, as in the sketch), LateLoopOptimizationsEPCallbacks and
>>> LoopOptimizerEndEPCallbacks seem to allow some arbitrary loop passes to
>>> be inserted into the pipelines (via flags)?
>>>
>>> I wonder how hard it'd be to teach all the relevant loop passes to
>>> preserve BFI (or BPI)...
>> Well, each time you restructure control flow around the loops you will
>> have to update those extra analyses, pretty much the same way DT is being
>> updated through DomTreeUpdater.
>> The trick is to design a proper update interface (and then implement it ;) ).
>> And I have not spent enough time on this issue to get a good idea of what
>> that interface would be.
>>
>> Hm, sounds non-trivial :) noting BFI depends on BPI.
>>
>> To step back, it looks like:
>>
>> want to use profiles from more passes -> need to get BFI (from loop
>> passes) -> need all the loop passes to preserve BFI.
>>
>> I wonder if there's no way around this.
> Indeed. I believe this is the general consensus here.
>
> regards,
> Fedor.
>
>> regards,
>> Fedor.
>>
>>>> It seems that for different types of passes to be able to get PSI and
>>>> BFI, we’d need to ensure PSI is cached for a non-module pass, and that
>>>> PSI, BFI, and the ModuleAnalysisManager proxy are cached for a loop pass
>>>> in the pass pipelines. This may mean potentially needing to insert
>>>> BFI/PSI in front of many passes [2]. It seems not obvious how to
>>>> conditionally insert BFI for PGO pipelines, because there isn’t always a
>>>> good flag to detect PGO cases [3], or we tend to build pass pipelines
>>>> before examining the code (or without propagating enough info down) [4].
>>>>
>>>> Proposed approach
>>>>
>>>> - Cache PSI right after the profile summary in the IR is written in the
>>>> pass pipeline [5]. This would avoid the need to insert
>>>> RequireAnalysisPass for PSI before each non-module pass that needs it.
>>>> PSI can technically be invalidated, but that is unlikely. If it is, we
>>>> insert another RequireAnalysisPass [6].
>>>>
>>>> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
>>>> before each loop pass that needs it. This doesn't seem avoidable
>>>> because BFI can be invalidated whenever the CFG changes. We detect PGO
>>>> based on the command line flags and/or whether the module has the
>>>> profile summary info (we may need to pass the module to more functions).
>>>>
>>>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>>>> able to get to the ModuleAnalysisManager in one step, and PSI through it.
>>>>
>>>> Alternative approaches
>>>>
>>>> Dropping BFI and using PSI only
>>>> We could consider not using BFI and relying solely on PSI and
>>>> function-level profiles (as opposed to block-level), but profile
>>>> precision would suffer.
>>>>
>>>> Computing BFI in-place
>>>> We could consider computing BFI “in-place” by directly running BFI
>>>> outside of the pass manager [7]. This would let us avoid the analysis
>>>> manager constraints, but it would still involve running an outer-scope
>>>> analysis from an inner-scope pass and could potentially cause problems
>>>> in terms of pass pipelining and concurrency. Moreover, a potential
>>>> downside of running analyses in-place is that it won’t take advantage
>>>> of cached analysis results provided by the pass manager.
>>>>
>>>> Adding inner-scope versions of PSI and BFI
>>>> We could consider adding function-level and loop-level PSI and
>>>> loop-level BFI, which internally act like their outer-scope versions
>>>> but provide inner-scope results only. This way, we could always call
>>>> getResult for PSI and BFI. However, this would still involve running an
>>>> outer-scope analysis from an inner-scope pass.
>>>>
>>>> Caching the FAM and the MAM proxies
>>>> We could consider caching the FunctionAnalysisManager and the
>>>> ModuleAnalysisManager proxies once early on, instead of adding a new
>>>> proxy. But it seems unlikely to work well, because the analysis cache
>>>> key type includes the function or the module, and some pass may add a
>>>> new function for which the proxy wouldn’t be cached. We’d need to write
>>>> and insert a pass in select locations just to fill the cache. Adding
>>>> the new proxy would take care of these with a three-line change.
>>>>
>>>> Conditional BFI
>>>> We could consider adding a conditional BFI analysis that is a wrapper
>>>> around BFI and computes BFI only if profiles are available (either
>>>> checking whether the module has a profile summary, or depending on the
>>>> PSI). With this, we wouldn’t need to conditionally build pass
>>>> pipelines, and it may work for the new pass manager. But a similar
>>>> approach wouldn’t work for the old pass manager, because we cannot
>>>> conditionally depend on an analysis under it.
>>>
>>> There is LazyBlockFrequencyInfo.
>>> Not sure how well it fits this idea.
>>>
>>> Good point. LazyBlockFrequencyInfo seems usable with the old pass
>>> manager (saving unnecessary BFI/BPI) and would work for function passes.
>>> I think the restriction still applies - a loop pass still cannot request
>>> (outer-scope) BFI, lazy or not, new or old (pass manager). Another
>>> assumption is that it'd be cheap and safe to unconditionally depend on
>>> PSI or check the module's profile summary.
>>>
>>> regards,
>>> Fedor.
>>>
>>>> [1] We cannot call AnalysisManager::getResult for an outer scope, only
>>>> getCachedResult. Probably because of potential pipelining or
>>>> concurrency issues.
>>>> [2] For example, potentially breaking up multiple pipelined loop passes
>>>> and inserting RequireAnalysisPass<BlockFrequencyAnalysis> in front of
>>>> each of them.
>>>> [3] For example, -fprofile-instr-use and -fprofile-sample-use aren’t
>>>> present in ThinLTO post-link builds.
>>>> [4] For example, we could check whether the module has the profile
>>>> summary metadata annotated when building pass pipelines, but we don’t
>>>> always pass the module down to the place where we build pass pipelines.
>>>> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
>>>> PGOInstrumentationUse and SampleProfileLoaderPass passes (and around
>>>> the PGOIndirectCallPromotion pass for the ThinLTO post-link pipeline).
>>>> [6] For example, context-sensitive PGO.
>>>> [7] Directly calling its constructor along with the dependent analysis
>>>> results, e.g. the jump threading pass.
Hiroshi Yamauchi via llvm-dev
2019-Mar-13 23:04 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
On Wed, Mar 13, 2019 at 2:37 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
> Overall seems fine to me.
>
> On 3/11/19 8:12 PM, Hiroshi Yamauchi wrote:
>> Here's a revised approach based on the discussion:
>>
>> - Cache PSI right after the profile summary in the IR is written in the
>> pass pipeline. This would avoid the need to insert RequireAnalysisPass
>> for PSI before each non-module pass that needs it. PSI can technically
>> be invalidated, but that is unlikely, as PSI is immutable. If it is, we
>> can insert another RequireAnalysisPass.
>
> ProfileSummaryInfo::invalidate always returns false, so it does not need
> any extra handling (as soon as it finds its way into ModuleAnalysisManager).

Right. As long as there happens to be a pass that always runs PSI before
any pass that expects it to be cached, it'd be fine. The point here is to
be extra reassuring: by having PSI explicitly run and cached right after a
pass that writes the profile summary to the IR, there's no window where
PSI may not be cached.

>> - If PGO, conditionally request BFI from the passes that need it. For
>> (pipelined) loop passes, we need to insert a pass that computes BFI
>> conditionally (if PGO) in front of them and make them preserve BFI
>> throughout. This is to avoid pipeline interruptions and potential
>> invalidation/recomputation of BFI between the loop passes. We detect
>> PGO based on whether PSI has profile summary info. (For the old pass
>> manager, implement a similar approach using LazyBlockFrequencyInfo.)
>
> There is already an optional analysis in LoopStandardAnalysisResults -
> MemorySSA. We can do the same for BFI/BPI.
> And, yes - preserving those through loop passes is a cornerstone of this
> approach.
>
>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>> able to get to the ModuleAnalysisManager in one step, and PSI through it.
>
> This is just a compile-time optimization that saves one indirection
> through FunctionAnalysisManager.
> I'm not even sure it is worth the effort. And definitely not crucial
> for the overall idea.

This should probably be clarified to something like:

- Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
able to get to the ModuleAnalysisManager and PSI, because it may not
always be reachable through the (const) FunctionAnalysisManager unless
ModuleAnalysisManagerFunctionProxy is already cached.

Since the FunctionAnalysisManager we can get from the LoopAnalysisManager
is a const ref, we cannot call getResult on it to always get the
ModuleAnalysisManager and PSI (see below). This actually happens in my
experiment.

  SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
    auto &FAM =
        LAM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
    auto *MAMProxy = FAM.getCachedResult<ModuleAnalysisManagerFunctionProxy>(
        L.getHeader()->getParent());  // Can be null
    if (MAMProxy) {
      auto &MAM = MAMProxy->getManager();
      auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
    } else {
      // Can't get MAM and PSI.
    }
    ...

  ->

  SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
    auto &MAM =
        LAM.getResult<ModuleAnalysisManagerLoopProxy>(L, AR).getManager();  // Not null
    auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
    ...

AFAICT, adding ModuleAnalysisManagerLoopProxy seems to be as simple as:

  /// A proxy from a \c ModuleAnalysisManager to a \c Loop.
  typedef OuterAnalysisManagerProxy<ModuleAnalysisManager, Loop,
                                    LoopStandardAnalysisResults &>
      ModuleAnalysisManagerLoopProxy;

> regards,
> Fedor.
Fedor Sergeev via llvm-dev
2019-Mar-13 23:20 UTC
[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
On 3/14/19 2:04 AM, Hiroshi Yamauchi wrote:
> On Wed, Mar 13, 2019 at 2:37 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:
>>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>>> able to get to the ModuleAnalysisManager in one step, and PSI through it.
>> This is just a compile-time optimization that saves one indirection
>> through FunctionAnalysisManager.
>> I'm not even sure it is worth the effort. And definitely not crucial
>> for the overall idea.
>
> This should probably be clarified to something like:
>
> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
> able to get to the ModuleAnalysisManager and PSI, because it may not
> always be reachable through the (const) FunctionAnalysisManager unless
> ModuleAnalysisManagerFunctionProxy is already cached.
>
> Since the FunctionAnalysisManager we can get from the LoopAnalysisManager
> is a const ref, we cannot call getResult on it to always get the
> ModuleAnalysisManager and PSI (see below). This actually happens in my
> experiment.
>
>   SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>     auto &FAM =
>         LAM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
>     auto *MAMProxy = FAM.getCachedResult<ModuleAnalysisManagerFunctionProxy>(
>         L.getHeader()->getParent());  // Can be null

Oh... well...

>     if (MAMProxy) {
>       auto &MAM = MAMProxy->getManager();
>       auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
>     } else {
>       // Can't get MAM and PSI.
>     }
>     ...
>
> ->
>
>   SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>     auto &MAM =
>         LAM.getResult<ModuleAnalysisManagerLoopProxy>(L, AR).getManager();  // Not null
>     auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
>     ...
>
> AFAICT, adding ModuleAnalysisManagerLoopProxy seems to be as simple as:
>
>   /// A proxy from a \c ModuleAnalysisManager to a \c Loop.
>   typedef OuterAnalysisManagerProxy<ModuleAnalysisManager, Loop,
>                                     LoopStandardAnalysisResults &>
>       ModuleAnalysisManagerLoopProxy;

It also needs to be added to PassBuilder::crossRegisterProxies...
But yes, that appears to be a required action.

regards,
Fedor.
>>>> Indeed, this is an intentional restriction in >>>> new pass manager, which is more or less a >>>> reflection of a fundamental property of >>>> outer-inner IRUnit relationship >>>> and transformations/analyses run on those >>>> units. The main intent for having those inner >>>> IRUnits (e.g. Loops) is to run local >>>> transformations and save compile time >>>> on being local to a particular small piece of >>>> IR. Loop Pass manager allows you to run a whole >>>> pipeline of different transformations still >>>> locally, amplifying the save. >>>> As soon as you run function-level analysis from >>>> within the loop pipeline you essentially break >>>> this pipelining. >>>> Say, as you run your loop transformation it >>>> modifies the loop (and the function) and >>>> potentially invalidates the analysis, >>>> so you have to rerun your analysis again and >>>> again. Hence instead of saving on compile time >>>> it ends up increasing it. >>>> >>>> >>>> Exactly. >>>> >>>> >>>> I have hit this issue somewhat recently with >>>> dependency of loop passes on BranchProbabilityInfo. >>>> (some loop passes, like IRCE can use it for >>>> profitability analysis). >>>> >>>> The only solution that appears to be reasonable >>>> there is to teach all the loops passes that >>>> need to be pipelined >>>> to preserve BPI (or any other >>>> module/function-level analyses) similar to how >>>> they preserve DominatorTree and >>>> other "LoopStandard" analyses. >>>> >>>> >>>> Is this implemented - do the loop passes preserve BPI? >>> Nope, not implemented right now. >>> One of the problems is that even loop >>> canonicalization passes run at the start of loop >>> pass manager dont preserve it >>> (and at least LoopSimplifyCFG does change control flow). 
>>>> >>>> In buildFunctionSimplificationPipeline >>>> (where LoopFullUnrollPass is added as in the >>>> sketch), LateLoopOptimizationsEPCallbacks >>>> and LoopOptimizerEndEPCallbacks seem to allow some >>>> arbitrary loop passes to be inserted into the >>>> pipelines (via flags)? >>>> >>>> I wonder how hard it'd be to teach all the relevant >>>> loop passes to preserve BFI(or BPI).. >>> Well, each time you restructure control flow around >>> the loops you will have to update those extra analyses, >>> pretty much the same way as DT is being updated >>> through DomTreeUpdater. >>> The trick is to design a proper update interface >>> (and then implement it ;) ). >>> And I have not spent enough time on this issue to >>> get a good idea of what that interface would be. >>> >>> >>> Hm, sounds non-trivial :) noting BFI depends on BPI. >>> >>> >>> To step back, it looks like: >>> >>> want to use profiles from more passes -> need to get BFI >>> (from loop passes) -> need all the loop passes to preserve BFI. >>> >>> I wonder if there's no way around this. >> Indeed. I believe this is a general consensus here. >> >> regards, >> Fedor. >> >>> >>> >>> >>> regards, >>> Fedor. >>> >>>> >>>>> It seems that for different types of passes to >>>>> be able to get PSI and BFI, we’d need to >>>>> ensure PSI is cached for a non-module pass, >>>>> and PSI, BFI and the ModuleAnalysisManager >>>>> proxy are cached for a loop pass in the pass >>>>> pipelines. This may mean potentially needing >>>>> to insert BFI/PSI in front of many passes [2]. >>>>> It seems not obvious how to conditionally >>>>> insert BFI for PGO pipelines because there >>>>> isn’t always a good flag to detect PGO cases >>>>> [3] or we tend to build pass pipelines before >>>>> examining the code (or without propagating >>>>> enough info down) [4]. >>>>> >>>>> Proposed approach >>>>> >>>>> - Cache PSI right after the profile summary in >>>>> the IR is written in the pass pipeline [5]. 
>>>>> This would avoid the need to insert RequireAnalysisPass for PSI
>>>>> before each non-module pass that needs it. PSI can technically be
>>>>> invalidated, but that is unlikely. If it is, we insert another
>>>>> RequireAnalysisPass [6].
>>>>>
>>>>> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
>>>>> before each loop pass that needs it. This doesn't seem avoidable
>>>>> because BFI can be invalidated whenever the CFG changes. We detect
>>>>> PGO based on the command line flags and/or whether the module has
>>>>> the profile summary info (we may need to pass the module to more
>>>>> functions).
>>>>>
>>>>> - Add a new proxy, ModuleAnalysisManagerLoopProxy, for a loop pass
>>>>> to be able to get to the ModuleAnalysisManager in one step, and
>>>>> PSI through it.
>>>>>
>>>>> Alternative approaches
>>>>>
>>>>> Dropping BFI and using PSI only
>>>>> We could consider not using BFI and relying solely on PSI and
>>>>> function-level profiles (as opposed to block-level), but profile
>>>>> precision would suffer.
>>>>>
>>>>> Computing BFI in-place
>>>>> We could consider computing BFI "in-place" by directly running BFI
>>>>> outside of the pass manager [7]. This would let us avoid the
>>>>> analysis manager constraints, but it would still involve running
>>>>> an outer-scope analysis from an inner-scope pass and could cause
>>>>> problems in terms of pass pipelining and concurrency. Moreover, a
>>>>> downside of running analyses in-place is that it won't take
>>>>> advantage of cached analysis results provided by the pass manager.
>>>>>
>>>>> Adding inner-scope versions of PSI and BFI
>>>>> We could consider adding function-level and loop-level PSI and a
>>>>> loop-level BFI, which would internally act like their outer-scope
>>>>> versions but provide inner-scope results only. This way, we could
>>>>> always call getResult for PSI and BFI.
>>>>> However, this would still involve running an outer-scope analysis
>>>>> from an inner-scope pass.
>>>>>
>>>>> Caching the FAM and the MAM proxies
>>>>> We could consider caching the FunctionAnalysisManager and the
>>>>> ModuleAnalysisManager proxies once, early on, instead of adding a
>>>>> new proxy. But this seems unlikely to work well because the
>>>>> analysis cache key type includes the function or the module, and
>>>>> some pass may add a new function for which the proxy wouldn't be
>>>>> cached. We'd need to write and insert a pass in select locations
>>>>> just to fill the cache. Adding the new proxy would take care of
>>>>> this with a three-line change.
>>>>>
>>>>> Conditional BFI
>>>>> We could consider adding a conditional BFI analysis that is a
>>>>> wrapper around BFI and computes BFI only if profiles are available
>>>>> (either checking that the module has a profile summary or
>>>>> depending on PSI). With this, we wouldn't need to conditionally
>>>>> build pass pipelines, and it may work for the new pass manager.
>>>>> But a similar approach wouldn't work for the old pass manager
>>>>> because we cannot conditionally depend on an analysis under it.
>>>> There is LazyBlockFrequencyInfo.
>>>> Not sure how well it fits this idea.
>>>>
>>>> Good point. LazyBlockFrequencyInfo seems usable with the old pass
>>>> manager (to save unnecessary BFI/BPI) and would work for function
>>>> passes. I think the restriction still applies - a loop pass still
>>>> cannot request (outer-scope) BFI, lazy or not, new or old (pass
>>>> manager). Another assumption is that it'd be cheap and safe to
>>>> unconditionally depend on PSI or to check the module's profile
>>>> summary.
>>>>
>>>> regards,
>>>> Fedor.
>>>>
>>>>> [1] We cannot call AnalysisManager::getResult for an outer scope,
>>>>> only getCachedResult - probably because of potential pipelining or
>>>>> concurrency issues.
>>>>> [2] For example, potentially breaking up multiple pipelined loop
>>>>> passes and inserting RequireAnalysisPass<BlockFrequencyAnalysis>
>>>>> in front of each of them.
>>>>> [3] For example, -fprofile-instr-use and -fprofile-sample-use
>>>>> aren't present in ThinLTO post-link builds.
>>>>> [4] For example, we could check whether the module has the profile
>>>>> summary metadata annotated when building pass pipelines, but we
>>>>> don't always pass the module down to the place where we build pass
>>>>> pipelines.
>>>>> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
>>>>> PGOInstrumentationUse and SampleProfileLoaderPass passes (and
>>>>> around the PGOIndirectCallPromotion pass for the ThinLTO post-link
>>>>> pipeline).
>>>>> [6] For example, for context-sensitive PGO.
>>>>> [7] Directly calling its constructor along with the dependent
>>>>> analysis results, e.g. as the jump threading pass does.
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
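The caching trick at the heart of the proposal - PSI's invalidate() always returns false, so one up-front require makes it visible to every later inner-scope pass via getCachedResult, and a pass can then decide whether BFI is worth computing - can be modeled in a small sketch. This is a toy under made-up names, not the actual LLVM interfaces:

```cpp
#include <optional>

// Stand-in for a ProfileSummaryInfo-like result (hypothetical name).
struct ToyPSI { bool hasProfileSummary; };

class ToyModuleAnalysisManager {
  std::optional<ToyPSI> cachedPSI;

public:
  // Like a RequireAnalysisPass for PSI: force computation once, early
  // in the pipeline, right after the profile summary is written.
  void requirePSI(bool moduleHasProfileSummary) {
    cachedPSI = ToyPSI{moduleHasProfileSummary};
  }

  // Inner-scope passes may only ask for the *cached* outer-scope
  // result (the getCachedResult restriction from footnote [1]).
  const ToyPSI *getCachedPSI() const {
    return cachedPSI ? &*cachedPSI : nullptr;
  }

  // PSI's invalidate() always returns false ("never stale"), so even a
  // pass that preserves nothing leaves it in the cache.
  void invalidateAll() { /* PSI deliberately kept */ }
};

// A "loop pass" (or the pipeline builder) decides whether to bother
// computing BFI based only on the cached PSI - the "conditional BFI
// if PGO" part of the proposal.
bool shouldComputeBFI(const ToyModuleAnalysisManager &MAM) {
  const ToyPSI *PSI = MAM.getCachedPSI();
  return PSI && PSI->hasProfileSummary;
}
```

In this model, once requirePSI has run, shouldComputeBFI keeps returning the right answer even after invalidateAll, mirroring why a single RequireAnalysisPass for PSI early in the pipeline is enough in the proposed approach.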