thr3ads.net - llvm dev - [llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks [Aug 2017]


Graham Yiu via llvm-dev

2017-Aug-24 19:47 UTC

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hi David,

The only reason I can see to use the 'pattern matching' part as a fall-back
is in case we cannot, for whatever reason, inline what I'm assuming would be
a much bigger hot-path-only cloned function.  What I'm assuming here is that
after cold-region outlining, we may still have a large portion of the
original function body to attempt to inline, whereas the clone produced by
the pattern matching method will only contain a few basic blocks, giving a
better chance to inline something.

For your (2) point, I think we'll have to be careful here.  Without a sense
of how likely we are to inline the new function, we'll have to make sure
our outlining of cold regions will not degrade the performance of the
function in 99.xx% of the cases, as it's unclear how much performance we'll
gain from just outlining (without inlining to increase the odds of some
performance gain).  My initial thought was to ditch the new function and
its outlined children if we cannot immediately inline it.

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-707/8200/Markham
Email: gyiu at ca.ibm.com



From:	Xinliang David Li <xinliangli at gmail.com>
To:	Graham Yiu <gyiu at ca.ibm.com>
Cc:	llvm-dev <llvm-dev at lists.llvm.org>
Date:	08/24/2017 03:05 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks





On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.ibm.com> wrote:
  Hi David,

  So I've begun doing some implementation on the outlining portion of the
  code. Currently, I have the partial inliner outlining cold regions
  (single entry, single exit) of the code, based solely on the existence of
  ProfileSummaryInfo (ie. profiling data). However, I have some concerns
  about how this will co-exist with the existing code that peels early
  returns.

  The control flow looks something like this:

  // New code: find cold regions to outline
  if (!computeOutliningInfoForColdRegions()) {
    // If we can't find any cold regions, fall back to early return peeling
    if (!computeOutliningInfo()) {
      return nullptr;
    }
  }
  // Try to outline the identified regions
  // Then try to inline the cloned function

  My concern is during inlining, if we fail to inline the cloned function,
  we give up and discard all cloned and outlined functions. But with these
  two types of outlining we're doing, it's possible to attempt to inline
  the cloned function that has outlined cold regions, and if we cannot do
  so, try to inline a different clone that has peeled early returns (ie.
  the way we have it today). This would require us to clone the original
  function twice and modify one based on cold region outlining and the
  other early return peeling, with the latter being our fall-back option if
  we fail to inline the first clone.

  What are your thoughts?




I expect computeOutliningInfoForColdRegions can produce a superset of the
regions outlinable by the current 'pattern matching' approach. In other
words, most of the cases currently caught by 'computeOutliningInfo' should
be caught by the new algorithm, so why not ditch the current
'computeOutliningInfo' completely?

My suggestion was to enhance the pass to 1) support outlining multiple
regions; and 2) add a mode to do function outlining only (not the inlining
part).  The second is important because it can be used before the regular
inliner pass.  With the new pass manager and profile aware inlining, the
inliner won't undo the outline decision, but in the meantime becomes more
powerful due to the reduced hot function size.

David


  Graham Yiu
  LLVM Compiler Development
  IBM Toronto Software Lab
  Office: (905) 413-4077 C2-707/8200/Markham
  Email: gyiu at ca.ibm.com


  From: Graham Yiu/Toronto/IBM
  To: Xinliang David Li <xinliangli at gmail.com>
  Cc: llvm-dev <llvm-dev at lists.llvm.org>
  Date: 08/15/2017 08:04 PM
  Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
  outlining scheme for cold blocks


  Hey David,

  Yes, we'll need to consider the effect on live ranges for regions we want
  to outline. In my experience, outlining live-exit regions seems to cause
  the most harm, as we ruin chances to keep data in registers, as you were
  alluding to. It's unclear, however, what the exact effect of outlining
  regions with live-entries would be.

  I'll probably try to avoid regions that are not single entry & single
  exit at least initially, to simplify the transformation and analysis. Are
  multi-exit regions common in your experience?

  And of course, I agree, we should reuse as much of the current partial
  inlining infrastructure as possible. I'll likely run some ideas by you as
  I begin to make changes.

  Cheers,

  Graham Yiu
  LLVM Compiler Development
  IBM Toronto Software Lab
  Office: (905) 413-4077 C2-407/8200/Markham
  Email: gyiu at ca.ibm.com



  From: Xinliang David Li <xinliangli at gmail.com>
  To: Graham Yiu <gyiu at ca.ibm.com>
  Cc: llvm-dev <llvm-dev at lists.llvm.org>
  Date: 08/15/2017 05:36 PM
  Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
  outlining scheme for cold blocks




  Hi Graham, Making partial inlining more general is something worth
  doing.  Regarding your implementation plan, I have some suggestions here:

  *) Function outlining introduces additional runtime cost: passing of live
  in values, returning of live out values (via memory), glue code in the
  caller to handle regions without a single exit block, etc.  The cost
  analysis needs to factor those in carefully.
  *) Remove the limitation that there is only *one* outlined routine.
  Instead, the algorithm can compute multiple single-entry/single-exit or
  single-entry/multiple-exit regions (cold ones) in the routine, and
  outline each region into its own function. The benefits include:
     1) a simpler design and implementation, with most of the existing
  code reusable;
     2) more flexibility to allow the most effective outlining;
     3) reduced runtime overhead of making calls to the outlined functions.
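The cost components listed in the first point can be put into a rough per-region estimate. The following is only a sketch under stated assumptions: `RegionSummary`, the unit costs, and both function names are hypothetical and are not part of LLVM's inline cost analysis.

```cpp
#include <cassert>

// Hypothetical per-region summary; none of these names come from LLVM.
struct RegionSummary {
  unsigned NumLiveIns;  // values passed to the outlined function as arguments
  unsigned NumLiveOuts; // values returned through memory
  unsigned NumExits;    // distinct exit targets of the region
  unsigned RegionSize;  // instruction-count estimate of the region body
};

// Assumed unit costs in "instruction equivalents"; these would need tuning.
constexpr unsigned CallCost = 5;
constexpr unsigned ArgCost = 1;
constexpr unsigned LiveOutCost = 2;  // one store in the callee, one reload
constexpr unsigned ExitGlueCost = 1; // extra compare/branch per extra exit

// Extra work executed each time control enters the outlined region.
unsigned outliningOverhead(const RegionSummary &R) {
  assert(R.NumExits >= 1 && "a region must have at least one exit");
  return CallCost + R.NumLiveIns * ArgCost + R.NumLiveOuts * LiveOutCost +
         (R.NumExits - 1) * ExitGlueCost;
}

// Size removed from the hot function once the region body is replaced by
// the call sequence. A negative value means outlining makes it larger.
int sizeSaving(const RegionSummary &R) {
  return static_cast<int>(R.RegionSize) -
         static_cast<int>(outliningOverhead(R));
}
```

With these assumed costs, a 40-instruction region with two live-ins, one live-out and a single exit has an overhead of 9 and a saving of 31, while a 4-instruction region with three exits has a negative saving and would be rejected.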

  thanks,

  David

  On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
  llvm-dev at lists.llvm.org> wrote:
        Hello,

        My team and I are looking to do some enhancements in the partial
        inliner in opt. Would appreciate any feedback that folks might
        have.

        # Partial Inlining in LLVM opt

        ## Summary

        ### Background

        Currently, the partial inliner searches the first few blocks of the
        callee and looks for a branch to the return block (ie. early
        return). If found, it attempts to outline the rest of the slow (or
        heavy) code so the inliner will be able to inline the fast (or
        light) code. If no early returns are found, the partial inliner
        will give up. As far as I can tell, BlockFrequency and
        BranchProbability information is only used when attempting to
        inline the early return code, and not used to determine whether to
        outline the slow code.

        ### Proposed changes

        In addition to looking for early returns, we should utilize profile
        information to outline blocks that are considered cold. If we can
        sufficiently reduce the size of the original function via this type
        of outlining, inlining should be able to inline the rest of the hot
        code.

        ## Details

        With profile information present, we have a view of what code is
        infrequently executed and can make better decisions on what to
        outline. Early return blocks that are infrequently executed should
        still be included as candidates for outlining, but will be treated
        just like any other cold blocks. Without profiling information,
        however, we should remain conservative and only partial inline in
        the presence of an early return in the first few blocks of a
        function (ie. peel the early return out of the function).

        To find cold regions to outline, we will traverse the CFG to find
        edges deemed 'cold' and look at the blocks dominated by the
        successor node. If, for some reason, that block has more than one
        predecessor, then we will skip this candidate as there should be a
        node that dominates this successor that has a single entry point.
        The last node in the dominance vector should also have a single
        successor. If it does not, then further investigation of the CFG is
        necessary to see when/how this situation occurs.
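The search described above can be sketched on a toy CFG. This is an illustration only, not the partial inliner's code: `ToyCFG`, `reachableAvoiding` and `findColdRegions` are hypothetical names, dominance is computed by a simple reachability test rather than a dominator tree, and "single exit" is assumed to mean that every edge leaving the region targets the same outside block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy CFG (hypothetical, not LLVM's classes). Block 0 is the entry.
// Succs[B][i] is a successor of B; Count[B][i] is that edge's profile count.
struct ToyCFG {
  std::vector<std::vector<int>> Succs;
  std::vector<std::vector<std::uint64_t>> Count;
};

// Marks blocks reachable from the entry without entering block Skip
// (pass -1 to skip nothing).
static std::vector<bool> reachableAvoiding(const ToyCFG &G, int Skip) {
  std::vector<bool> Seen(G.Succs.size(), false);
  if (Skip == 0 || G.Succs.empty())
    return Seen;
  std::vector<int> Work{0};
  Seen[0] = true;
  while (!Work.empty()) {
    int B = Work.back();
    Work.pop_back();
    for (int S : G.Succs[B])
      if (S != Skip && !Seen[S]) {
        Seen[S] = true;
        Work.push_back(S);
      }
  }
  return Seen;
}

// Returns one block list per accepted cold region.
std::vector<std::vector<int>> findColdRegions(const ToyCFG &G,
                                              double ColdRatio) {
  assert(G.Succs.size() == G.Count.size() && "one count list per block");
  const std::size_t N = G.Succs.size();
  std::vector<int> NumPreds(N, 0);
  for (const auto &Out : G.Succs)
    for (int S : Out)
      ++NumPreds[S];

  const std::vector<bool> Reachable = reachableAvoiding(G, -1);
  std::vector<std::vector<int>> Regions;
  for (std::size_t B = 0; B < N; ++B) {
    std::uint64_t Total = 0;
    for (std::uint64_t C : G.Count[B])
      Total += C;
    for (std::size_t I = 0; I < G.Succs[B].size(); ++I) {
      const int Head = G.Succs[B][I];
      // Skip edges with no profile data and edges that are not cold.
      if (Total == 0 || static_cast<double>(G.Count[B][I]) >
                            ColdRatio * static_cast<double>(Total))
        continue;
      // Single entry: the head must have exactly one predecessor.
      if (NumPreds[Head] != 1)
        continue;
      // Blocks dominated by Head: reachable, but not once Head is removed.
      const std::vector<bool> Without = reachableAvoiding(G, Head);
      std::vector<bool> InRegion(N, false);
      std::vector<int> Region;
      for (std::size_t X = 0; X < N; ++X)
        if (Reachable[X] && !Without[X]) {
          InRegion[X] = true;
          Region.push_back(static_cast<int>(X));
        }
      // Single exit (assumed meaning): all edges leaving the region
      // target the same outside block.
      int Exit = -1;
      bool SingleExit = true;
      for (int X : Region)
        for (int S : G.Succs[X])
          if (!InRegion[S]) {
            if (Exit != -1 && Exit != S)
              SingleExit = false;
            Exit = S;
          }
      if (Exit != -1 && SingleExit)
        Regions.push_back(Region);
    }
  }
  return Regions;
}
```

In this sketch a region nested inside another accepted region would be reported as well, so a real implementation would still have to choose between overlapping candidates.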

        We will need several heuristics to make sure we only outline in
        cases where we are confident it will result in a performance gain.
        Things such as the threshold for when a branch is considered cold,
        the minimum number of times the predecessor node has to be executed
        in order for an edge to be considered (confidence factor), and the
        minimum size of the region to be outlined (we can use inlining cost
        analysis like we currently do) will require some level of tuning.
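The three tunables named above could be combined into a single filter along these lines. This is a hedged sketch: `ColdEdgeParams`, its field names and `isOutliningCandidate` are made up for illustration, and the example values are not proposed defaults.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tuning knobs; names and meanings are assumptions.
struct ColdEdgeParams {
  double MaxColdProbability;         // edge is cold at or below this ratio
  std::uint64_t MinPredecessorCount; // confidence factor
  unsigned MinRegionSize;            // smallest region worth a call
};

// PredCount: execution count of the predecessor block.
// EdgeCount: how often the edge into the candidate region was taken.
// RegionSize: size estimate of the region reached through that edge.
bool isOutliningCandidate(std::uint64_t PredCount, std::uint64_t EdgeCount,
                          unsigned RegionSize, const ColdEdgeParams &P) {
  assert(EdgeCount <= PredCount && "edge cannot run more than its source");
  // Too few samples on the predecessor: we know too little about the edge.
  if (PredCount == 0 || PredCount < P.MinPredecessorCount)
    return false;
  // Region too small to pay for the call overhead.
  if (RegionSize < P.MinRegionSize)
    return false;
  const double Prob =
      static_cast<double>(EdgeCount) / static_cast<double>(PredCount);
  return Prob <= P.MaxColdProbability;
}
```

For example, with a 1% probability threshold, a confidence factor of 1000 and a minimum size of 10, an edge taken 50 times out of 10000 into a 20-instruction region qualifies, but the same ratio observed on a predecessor that only ran 100 times does not.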

        Similar to the current implementation, we will attempt to inline
        the leftover (hot) parts of the code, and if for some reason we
        cannot then we discard the modified function and its outlined code.

        ### Code changes

        The current Partial Inlining code first clones the function of
        interest and looks for a single set of blocks to outline. It then
        creates a function with that set of blocks, and saves the outlined
        function and outline callsite information as part of the function
        cloning container. In order to outline multiple regions of the
        function, we will need to change these containers to keep track of
        a list of regions to outline. We will also need to update the cost
        analysis to take into account multiple outlined functions.
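One possible shape for such a container is sketched below. It assumes block ids stand in for basic blocks and that costs are plain integers; `OutlinedRegionInfo`, `FunctionCloneInfo` and their members are hypothetical names, not the ones used in the partial inliner.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical bookkeeping for one outlined region.
struct OutlinedRegionInfo {
  std::vector<int> Blocks;  // block ids that were extracted from the clone
  std::string OutlinedName; // name of the function holding those blocks
  int CallSiteBlock;        // block in the clone that now calls it
  unsigned CallOverhead;    // estimated cost of that call sequence
};

// Hypothetical per-clone container: a list of regions instead of one.
struct FunctionCloneInfo {
  std::string ClonedName;
  unsigned ClonedSize = 0; // size of the clone without the call sequences
  std::vector<OutlinedRegionInfo> Regions;

  // Sum of the call overhead over every outlined region.
  unsigned totalOutliningOverhead() const {
    unsigned Sum = 0;
    for (const OutlinedRegionInfo &R : Regions)
      Sum += R.CallOverhead;
    return Sum;
  }

  // Size the inliner would see: remaining hot code plus one call sequence
  // per outlined region.
  unsigned inlineCandidateSize() const {
    return ClonedSize + totalOutliningOverhead();
  }
};
```

The point of the sketch is that the cost query iterates over all regions, so adding a second or third outlined function changes the data but not the callers of the cost analysis.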

        When a ProfileSummary is available, then we should skip the code
        that looks for early returns and go into new code that looks for
        cold regions to outline. When ProfileSummary is not available, then
        we can fall back to the existing code and look for early returns
        only.

        ### Tuning

        - The outlining heuristics will need to determine if a set of cold
        blocks is large enough to warrant the overhead of a function call.
        We also don't want the inliner to attempt to inline the outlined
        code later.
        - The threshold for determining whether a block is cold will also
        need to be tuned. In the case that profiling information is not
        accurate, we will pay the price of the additional call overhead for
        executing cold code.
        - The confidence factor, which can be viewed as the minimum number
        of times the predecessor has to be executed in order for an edge to
        be considered cold, should also be taken into account to avoid
        outlining code paths we have little information on.

        Graham Yiu
        LLVM Compiler Development
        IBM Toronto Software Lab
        Office: (905) 413-4077 C2-407/8200/Markham
        Email: gyiu at ca.ibm.com

        _______________________________________________
        LLVM Developers mailing list
        llvm-dev at lists.llvm.org
        http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev













Xinliang David Li via llvm-dev

2017-Aug-26 16:52 UTC


[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

On Thu, Aug 24, 2017 at 12:47 PM, Graham Yiu <gyiu at ca.ibm.com> wrote:
> Hi David,
>
> The only reason I can see to use the 'pattern matching' part as a
> fall-back is in case we cannot inline the (what I'm assuming would be)
> a much bigger hot-path-only cloned function for whatever reason. What I'm
> assuming here is that after cold-region outlining, we may still have a
> large portion of the original function body to attempt to inline, whereas
> the pattern matching method will only contain a few basic blocks, giving a
> better chance to inline something.
>
With profile data, the overhead of outlining a cold region can be estimated
more accurately. With the new PM, the threshold for inlining a hot callsite
is also much higher. Without a profile, the pattern matching method won't
work too well in general, even though it can enable more inlining, because
the call overhead introduced to call the outlined function may outweigh the
benefit of inlining the caller.

Whatever region can be found by the pattern matching method should be
identified by the new method as well. If multiple (but mutually exclusive)
candidate regions are found, the cost analysis heuristic should pick the
best candidate region for outlining.
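Choosing among mutually exclusive candidates could look like the sketch below. It assumes the "best" candidate is the one with the largest net size reduction; `Candidate` and `pickBestCandidate` are hypothetical names, and a real heuristic would likely also weigh profile counts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of one candidate region.
struct Candidate {
  unsigned SizeRemoved;  // size taken out of the hot function
  unsigned CallOverhead; // cost added by the call to the outlined code
};

// Returns the index of the candidate with the largest positive net
// benefit, or -1 if no candidate is worth outlining.
int pickBestCandidate(const std::vector<Candidate> &Cs) {
  int Best = -1;
  int BestBenefit = 0;
  for (std::size_t I = 0; I < Cs.size(); ++I) {
    const int Benefit = static_cast<int>(Cs[I].SizeRemoved) -
                        static_cast<int>(Cs[I].CallOverhead);
    if (Benefit > BestBenefit) {
      BestBenefit = Benefit;
      Best = static_cast<int>(I);
    }
  }
  return Best;
}
```

A region found by the early-return pattern would simply be one more entry in the candidate list, which is how the new method can subsume the old one.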

>
>
> For your (2) point, I think we'll have to be careful here. Without a
> sense of how 'likely' we're going to inline the new function, we'll have
> to make sure our outlining of cold regions will not degrade the performance
> of the function in 99.xx% of the cases, as it's unclear how much performance
> we'll
> gain from just outlining (without inlining to increase the odds of some
> performance gain). My initial thought was to ditch the new function and its
> outlined children if we cannot immediately inline it.
>
The outlining-only mode is useful to enable more aggressive inlining in
the regular inlining pass. Slightly different heuristics can be applied
here. For instance, it can prefer the largest candidate region (to maximize
the chance of inlining the caller). The outlined region does not need to be
super cold; we can leave it to the inliner to do deeper analysis and decide
whether to inline it right back in.

David


keita abdoul-kader via llvm-dev

2017-Aug-29 16:15 UTC


[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

I second the point that a way to outline specific function regions
independently of the partial inliner sounds very useful. I am not sure,
however, whether we would want a mode within the partial inliner or
something completely independent.

As a general question, does anybody have a clear idea of what the
constraints are on the regions CodeExtractor is currently able to handle?
Going through the code, it looks like the only requirement is for the
header to dominate all the basic blocks in the region.
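The dominance requirement mentioned above can be stated as a small check on a toy graph. This sketch only illustrates that one condition; it does not model whatever other checks CodeExtractor performs, and `Graph` and `headerDominatesRegion` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Toy CFG: G[B] lists the successors of block B; block 0 is the entry.
using Graph = std::vector<std::vector<int>>;

// True if every path from the entry to each block in Region goes through
// Header. Blocks unreachable from the entry count as dominated.
bool headerDominatesRegion(const Graph &G, int Header,
                           const std::vector<int> &Region) {
  assert(!G.empty() && Header >= 0 &&
         Header < static_cast<int>(G.size()) && "header must be a block");
  // Walk the graph from the entry, never stepping into Header.
  std::vector<bool> Seen(G.size(), false);
  std::vector<int> Work;
  if (Header != 0) {
    Seen[0] = true;
    Work.push_back(0);
  }
  while (!Work.empty()) {
    int B = Work.back();
    Work.pop_back();
    for (int S : G[B])
      if (S != Header && !Seen[S]) {
        Seen[S] = true;
        Work.push_back(S);
      }
  }
  // Any region block still reached has a path that bypasses Header.
  for (int B : Region)
    if (Seen[B])
      return false;
  return true;
}
```

On a diamond-shaped graph (0 branches to 1 and 2, both join at 3), block 1 dominates only itself, so a region of {1, 3} headed by block 1 fails the check.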

On Sat, Aug 26, 2017 at 9:52 AM, Xinliang David Li via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
>
> On Thu, Aug 24, 2017 at 12:47 PM, Graham Yiu <gyiu at ca.ibm.com>
wrote:
>
>> Hi David,
>>
>> The only reason I can see to use the 'pattern matching' part as
a
>> fall-back is in case we cannot inline the (what I'm assuming would
be) a
>> much bigger hot-path-only cloned function for whatever reason. What
I'm
>> assuming here is that after cold-region outlining, we may still have a
>> large portion of the original function body to attempt to inline,
whereas
>> the pattern matching method will only contain a few basic blocks,
giving a
>> better chance to inline something.
>>
>
> With profile data, the overhead of outlining a cold region can be
> estimated more accurately. (With the new PM), the threshold of inlining a
> hot callsite is also much higher. Without profile, the pattern matching
> method won't work too well in general even though it can enable more
more
> inlining because the call overhead introduced to call the outlined function
> may outweigh the benefit of inlining the caller.
>
> What ever region that can be found by the pattern matching method should
> be identified by the new method as well. If there are multiple (but
> mutually exclusive) candidate regions found, the cost analysis heuristic
> should pick the best candidate region for outlining .
>
>
>>
>>
>> For your (2) point, I think we'll have to be careful here. Without
a
>> sense of how 'likely' we're going to inline the new
function, we'll have to
>> make sure our outlining of cold regions will not degrade the
performance of
>> the function in 99.xx% of the cases, as it's unclear how much
performance
>> we'll gain from just outlining (without inlining to increase the
odds of
>> some performance gain). My initial thought was to ditch the new
function
>> and its outlined children if we cannot immediately inline it.
>>
>
> The outlining only mode is useful to enable more aggressive inlining for
> the regular inlining pass. Slightly different heuristics can be applied
> here. For instance it can prefer largest candidate region (to maximiize the
> chance to inline the caller). The outlined region does not need to be super
> cold and leave it to the inliner to do more deeper analysis and decide to
> inline it right back in.
>
> David
>

Graham Yiu via llvm-dev

2017-Aug-29 17:19 UTC

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hi David,

"Without profile, the pattern matching method won't work too well in
general even though it can enable more inlining because the call
overhead introduced to call the outlined function may outweigh the benefit
of inlining the caller"
   - I'm not sure I understand why or how there'd be any additional call
   overhead if we partially inlined one or more early returns.  The call
   overhead of not inlining the function at all vs. call overhead of the
   outlined function should be similar, right?
   - i.e. if we partially inlined bar into foo, we'd have some early return
   checks and then a function call to bar_outlined, but if we didn't inline
   bar into foo at all, we'd still have the call overhead of calling bar.
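
To make the comparison concrete, here is a hand-written C++ sketch of what partially inlining bar into foo amounts to. The names foo, bar and bar_outlined follow the bullet above; the function bodies are invented purely for illustration and are not taken from LLVM or from any real workload.

```cpp
#include <cstdint>

// Original callee: an early-return check followed by the "heavy" code.
std::int64_t bar(std::int64_t x) {
    if (x <= 0)                       // early return: the cheap, inlinable part
        return 0;
    std::int64_t acc = 0;             // hypothetical heavy part
    for (std::int64_t i = 0; i < x; ++i)
        acc += i * i;
    return acc;
}

// What the partial inliner would extract: only the heavy part of bar.
std::int64_t bar_outlined(std::int64_t x) {
    std::int64_t acc = 0;
    for (std::int64_t i = 0; i < x; ++i)
        acc += i * i;
    return acc;
}

// Caller with no inlining at all: every invocation pays for a call to bar.
std::int64_t foo_not_inlined(std::int64_t x) {
    return bar(x) + 1;
}

// Caller after partial inlining: the early-return check lives in foo, and
// only the slow path pays for a call (to bar_outlined).
std::int64_t foo_partially_inlined(std::int64_t x) {
    std::int64_t r;
    if (x <= 0)
        r = 0;
    else
        r = bar_outlined(x);
    return r + 1;
}
```

In this sketch the slow path makes exactly one call in both versions, and the early-return path makes none in the partially inlined version. Any extra cost would have to come from how values are passed to and from bar_outlined, which this toy example does not model.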

"The outlining only mode is useful to enable more aggressive inlining for
the regular inlining pass. Slightly different heuristics can be applied
here. For instance, it can prefer the largest candidate region (to maximize
the chance to inline the caller). The outlined region does not need to be
super cold; we can leave it to the inliner to do deeper analysis and decide
whether to inline it right back in."
   - Interesting point, I never thought of that.  The partial inliner would
   then be a 'function splitter' rather than a partial inliner at that
   point.  Maybe worthwhile to create a separate pass for this so we don't
   have the partial inliner trying to do too many things.
   - So I did some digging and it seems like the 'regular' inlining pass in
   opt is run before the partial inliner, which makes sense to me since we
   want only the candidates left over from inlining, i.e. those it could not
   inline (maybe due to code size).  Is there another inlining pass
   downstream that I may have missed?  Or perhaps you're referring to
   inlining with ThinLTO?

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-707/8200/Markham
Email: gyiu at ca.ibm.com



From:	Xinliang David Li <xinliangli at gmail.com>
To:	Graham Yiu <gyiu at ca.ibm.com>
Cc:	llvm-dev <llvm-dev at lists.llvm.org>
Date:	08/26/2017 12:53 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks





On Thu, Aug 24, 2017 at 12:47 PM, Graham Yiu <gyiu at ca.ibm.com> wrote:
  Hi David,

  The only reason I can see to use the 'pattern matching' part as a
  fall-back is in case we cannot inline what I'm assuming would be a much
  bigger hot-path-only cloned function, for whatever reason. What I'm
  assuming here is that after cold-region outlining, we may still have a
  large portion of the original function body to attempt to inline, whereas
  the clone produced by the pattern matching method will only contain a few
  basic blocks, giving a better chance to inline something.



With profile data, the overhead of outlining a cold region can be estimated
more accurately. With the new PM, the threshold for inlining a hot
callsite is also much higher. Without profile, the pattern matching method
won't work too well in general, even though it can enable more inlining,
because the call overhead introduced to call the outlined function may
outweigh the benefit of inlining the caller.

Whatever region can be found by the pattern matching method should be
identified by the new method as well. If there are multiple (but mutually
exclusive) candidate regions found, the cost analysis heuristic should pick
the best candidate region for outlining.



  For your (2) point, I think we'll have to be careful here. Without a
  sense of how 'likely' we're going to inline the new function, we'll have
  to make sure our outlining of cold regions will not degrade the
  performance of the function in 99.xx% of the cases, as it's unclear how
  much performance we'll gain from just outlining (without inlining to
  increase the odds of some performance gain). My initial thought was to
  ditch the new function and its outlined children if we cannot immediately
  inline it.



The outlining-only mode is useful to enable more aggressive inlining for
the regular inlining pass. Slightly different heuristics can be applied
here. For instance, it can prefer the largest candidate region (to maximize
the chance to inline the caller). The outlined region does not need to be
super cold; we can leave it to the inliner to do deeper analysis and decide
whether to inline it right back in.

David




  Graham Yiu
  LLVM Compiler Development
  IBM Toronto Software Lab
  Office: (905) 413-4077 C2-707/8200/Markham
  Email: gyiu at ca.ibm.com

  From: Xinliang David Li <xinliangli at gmail.com>
  To: Graham Yiu <gyiu at ca.ibm.com>
  Cc: llvm-dev <llvm-dev at lists.llvm.org>
  Date: 08/24/2017 03:05 PM



  Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
  outlining scheme for cold blocks





  On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.ibm.com> wrote:
        Hi David,

        So I've begun implementing the outlining portion of the code.
        Currently, I have the partial inliner outlining cold regions
        (single entry, single exit) of the code, based solely on
        the existence of ProfileSummaryInfo (i.e. profiling data). However,
        I have some concerns about how this will co-exist with the existing
        code that peels early returns.

        The control flow looks something like this:

        // New code: find cold regions to outline
        if (!computeOutliningInfoForColdRegions()) {
          // If we can't find any cold regions, then fall back to
          // early return peeling
          if (!computeOutliningInfo()) {
            return nullptr;
          }
        }
        // Try to outline the identified regions
        // Then try to inline the cloned function

        My concern is during inlining: if we fail to inline the cloned
        function, we give up and discard all cloned and outlined functions.
        But with these two types of outlining we're doing, it's possible to
        attempt to inline the cloned function that has outlined cold
        regions, and if we cannot do so, try to inline a different clone
        that has peeled early returns (i.e. the way we have it today). This
        would require us to clone the original function twice and modify
        one based on cold region outlining and the other on early return
        peeling, with the latter being our fall-back option if we fail to
        inline the first clone.
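
The two-clone fall-back described above can be sketched as trying candidates in priority order. This is only an illustration: CloneCandidate, residualCost and pickInlinableClone are hypothetical names invented for this sketch, and the inlining decision is reduced to a caller-supplied predicate rather than LLVM's real cost analysis.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one cloned-and-outlined version of the callee.
struct CloneCandidate {
    std::string name;   // label used only for illustration
    int residualCost;   // assumed cost of the body that is left to inline
};

// Try the candidates in priority order and return the index of the first one
// the (assumed) inlining check accepts, or -1 if every clone is rejected and
// all clones and outlined functions would have to be discarded.
int pickInlinableClone(
    const std::vector<CloneCandidate> &candidates,
    const std::function<bool(const CloneCandidate &)> &canInline) {
    for (std::size_t i = 0; i < candidates.size(); ++i)
        if (canInline(candidates[i]))
            return static_cast<int>(i);
    return -1;
}
```

The cold-region clone would be listed first and the early-return clone second, so the smaller clone is only considered when the larger one cannot be inlined. The cost of this scheme is that the original function is cloned twice even when the first clone succeeds.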

        What are your thoughts?


  I expect computeOutliningInfoForColdRegions can produce a superset of
  the outlinable regions found by the current 'pattern matching' approach.
  In other words, most of the cases currently caught by
  'computeOutliningInfo' should be caught by the new algorithm, so why not
  ditch the current 'computeOutliningInfo' completely?

  My suggestion was to enhance the pass to 1) support outlining multiple
  regions; and 2) add a mode to do function outlining only (not the
  inlining part).  The second is important because it can be used before
  the regular inliner pass.   With the new pass manager and profile aware
  inlining, the inliner won't undo the outline decision, but in the
  meantime becomes more powerful due to the reduced hot function size.

  David


        Graham Yiu
        LLVM Compiler Development
        IBM Toronto Software Lab
        Office: (905) 413-4077 C2-707/8200/Markham
        Email: gyiu at ca.ibm.com

        From: Graham Yiu/Toronto/IBM
        To: Xinliang David Li <xinliangli at gmail.com>
        Cc: llvm-dev <llvm-dev at lists.llvm.org>
        Date: 08/15/2017 08:04 PM
        Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
        general outlining scheme for cold blocks


        Hey David,

        Yes, we'll need to consider the effect on live ranges for regions
        we want to outline. In my experience, outlining live-exit regions
        seems to cause the most harm, as we ruin chances to keep data in
        registers as you were alluding to. It's unclear, however, what the
        exact effect of outlining regions with live-entries would be.

        I'll probably try to avoid regions that are not single entry &
        single exit at least initially, to simplify the transformation and
        analysis. Are multi-exit regions common in your experience?

        And of course, I agree, we should reuse as much of the current
        partial inlining infrastructure as possible. I'll likely run some
        ideas by you as I begin to make changes.

        Cheers,

        Graham Yiu
        LLVM Compiler Development
        IBM Toronto Software Lab
        Office: (905) 413-4077 C2-407/8200/Markham
        Email: gyiu at ca.ibm.com


        From: Xinliang David Li <xinliangli at gmail.com>
        To: Graham Yiu <gyiu at ca.ibm.com>
        Cc: llvm-dev <llvm-dev at lists.llvm.org>
        Date: 08/15/2017 05:36 PM
        Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
        general outlining scheme for cold blocks




        Hi Graham, Making partial inlining more general is something worth
        doing.  Regarding your implementation plan, I have some suggestions
        here:

        *) Function outlining introduces additional runtime cost: passing
        of live-in values, returning of live-out values (via memory), glue
        code in the caller to handle regions without a single exit block,
        etc.  The cost analysis needs to factor those in carefully.
        *) Remove the limitation that there is only *one* outlined routine.
        Instead, the algorithm can compute multiple single-entry/single-exit
        or single-entry/multiple-exit regions (cold ones) in the
        routine, and outline each region into its own function. The benefits
        include:
           1) a simpler design and implementation, with most of the
              existing code reusable;
           2) more flexibility to allow the most effective outlining;
           3) reduced runtime overhead of making calls to the outlined
              functions.
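
As a rough illustration of the first point, the overhead terms listed there could be combined into a single estimate. This is a sketch under stated assumptions: RegionInfo, OverheadWeights and outliningOverhead are hypothetical names, and the linear weighting is an assumption made for this example, not LLVM's actual cost model.

```cpp
// Hypothetical summary of a candidate region's interface with its caller.
struct RegionInfo {
    unsigned numLiveIns;   // values passed into the outlined function
    unsigned numLiveOuts;  // values returned through memory
    unsigned numExits;     // exit blocks; more than one needs glue code
};

// Hypothetical per-item weights; real values would need tuning.
struct OverheadWeights {
    unsigned call;          // fixed cost of the call itself
    unsigned perLiveIn;     // cost of passing one live-in value
    unsigned perLiveOut;    // cost of storing and reloading one live-out
    unsigned perExtraExit;  // glue code for each exit beyond the first
};

// Estimated runtime overhead added by outlining the region.
unsigned outliningOverhead(const RegionInfo &r, const OverheadWeights &w) {
    unsigned extraExits = r.numExits > 1 ? r.numExits - 1 : 0;
    return w.call + r.numLiveIns * w.perLiveIn +
           r.numLiveOuts * w.perLiveOut + extraExits * w.perExtraExit;
}
```

A region would then only be worth outlining if this estimate, scaled by how often the region runs, stays below the benefit expected from inlining the smaller function.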

        thanks,

        David

        On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
        llvm-dev at lists.llvm.org> wrote:
                    Hello,

                    My team and I are looking to do some enhancements in
                    the partial inliner in opt. Would appreciate any
                    feedback that folks might have.

                    # Partial Inlining in LLVM opt

                    ## Summary

                    ### Background

                    Currently, the partial inliner searches the first few
                    blocks of the callee and looks for a branch to the
                    return block (ie. early return). If found, it attempts
                    to outline the rest of the slow (or heavy) code so the
                    inliner will be able to inline the fast (or light)
                    code. If no early returns are found, the partial
                    inliner will give up. As far as I can tell,
                    BlockFrequency and BranchProbability information is
                    only used when attempting to inline the early return
                    code, and not used to determine whether to outline the
                    slow code.

                    ### Proposed changes

                    In addition to looking for early returns, we should
                    utilize profile information to outline blocks that are
                    considered cold. If we can sufficiently reduce the size
                    of the original function via this type of outlining,
                    inlining should be able to inline the rest of the hot
                    code.

                    ## Details

                    With the presence of profile information, we have a
                    view of what code is infrequently executed and can make
                    better decisions on what to outline. Early return
                    blocks that are infrequently executed should still be
                    included as candidates for outlining, but will be
                    treated just like any other cold blocks. Without
                    profiling information, however, we should remain
                    conservative and only partial inline in the presence of
                    an early return in the first few blocks of a function
                    (ie. peel the early return out of the function).

                    To find cold regions to outline, we will traverse the
                    CFG to find edges deemed 'cold' and look at the blocks
                    dominated by the successor node. If, for some reason,
                    that block has more than one predecessor, then we will
                    skip this candidate as there should be a node that
                    dominates this successor that has a single entry point.
                    The last node in the dominance vector should also have
                    a single successor. If it does not, then further
                    investigation of the CFG is necessary to see when/how
                    this situation occurs.
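
The search described above can be sketched on a toy CFG. Everything here is an assumption made for illustration: the CFG is a plain adjacency list rather than LLVM IR, dominance is computed by a simple "remove the node and see what becomes unreachable" check instead of a dominator tree, and "single exit" is interpreted as all edges leaving the region targeting one block.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Toy CFG: block ids are 0..N-1, block 0 is the entry, and succs[b] lists
// the successors of block b. This is not LLVM's data structure.
using CFG = std::vector<std::vector<int>>;

// Blocks reachable from the entry when block `removed` is deleted (-1: none).
static std::vector<bool> reachable(const CFG &succs, int removed) {
    std::vector<bool> seen(succs.size(), false);
    std::vector<int> work;
    if (removed != 0 && !succs.empty()) { seen[0] = true; work.push_back(0); }
    while (!work.empty()) {
        int b = work.back();
        work.pop_back();
        for (int s : succs[b])
            if (s != removed && !seen[s]) { seen[s] = true; work.push_back(s); }
    }
    return seen;
}

// Blocks dominated by `head`: reachable normally, unreachable without head.
static std::set<int> dominatedBy(const CFG &succs, int head) {
    std::vector<bool> all = reachable(succs, -1);
    std::vector<bool> without = reachable(succs, head);
    std::set<int> region;
    for (std::size_t b = 0; b < succs.size(); ++b)
        if (all[b] && !without[b]) region.insert(static_cast<int>(b));
    return region;
}

// Given a cold edge pred->succ, return the candidate region, or an empty set
// if the successor has several predecessors or the region has several exits.
std::set<int> coldRegionFor(const CFG &succs, std::pair<int, int> coldEdge) {
    int head = coldEdge.second;
    int numPreds = 0;
    for (const auto &out : succs)
        for (int s : out)
            if (s == head) ++numPreds;
    if (numPreds != 1) return {};            // not single entry: skip

    std::set<int> region = dominatedBy(succs, head);
    std::set<int> exitTargets;
    for (int b : region)
        for (int s : succs[b])
            if (!region.count(s)) exitTargets.insert(s);
    if (exitTargets.size() != 1) return {};  // not single exit: skip
    return region;
}
```

For a diamond-shaped CFG where block 2 is reached only through the cold edge 0->2 and both of its successors rejoin the hot path at one block, this returns block 2 together with those successors. Which edges count as cold is left to the caller here.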

                    We will need several heuristics to make sure we only
                    outline in cases where we are confident it will result
                    in a performance gain. Things such as threshold on when
                    a branch is considered cold, the minimum number of
                    times the predecessor node has to be executed in order
                    for an edge to be considered (confidence factor), and
                    the minimum size of the region to be outlined (can use
                    inlining cost analysis like we currently do) will
                    require some level of tuning.

                    Similar to the current implementation, we will attempt
                    to inline the leftover (hot) parts of the code, and if
                    for some reason we cannot then we discard the modified
                    function and its outlined code.

                    ### Code changes

                    The current Partial Inlining code first clones the
                    function of interest and looks for a single set of
                    blocks to outline. It then creates a function with that
                    set of blocks, and saves the outlined function and
                    outline callsite information as part of the function
                    cloning container. In order to outline multiple regions
                    of the function, we will need to change these
                    containers to keep track of a list of regions to
                    outline. We will also need to update the cost analysis
                    to take into account multiple outlined functions.

                    When a ProfileSummary is available, then we should skip
                    the code that looks for early returns and go into new
                    code that looks for cold regions to outline. When
                    ProfileSummary is not available, then we can fall back
                    to the existing code and look for early returns only.

                    ### Tuning

                    - The outlining heuristics will need to determine if a
                    set of cold blocks is large enough to warrant the
                    overhead of a function call. We also don't want the
                    inliner to attempt to inline the outlined code later.
                    - The threshold for determining whether a block is cold
                    will also need to be tuned. In the case that profiling
                    information is not accurate, we will pay the price of
                    the additional call overhead for executing cold code.
                    - The confidence factor, which can be viewed as the
                    minimum number of times the predecessor has to be
                    executed in order for an edge to be considered cold,
                    should also be taken into account to avoid outlining
                    code paths we have little information on.
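
The three tuning knobs above could be combined into one predicate along the following lines. This is a minimal sketch: ColdEdgeParams, its field names and shouldOutline are hypothetical, and the example thresholds in the note below are placeholders, not values proposed in this RFC.

```cpp
#include <cstdint>

// All three knobs are hypothetical tuning parameters named for this sketch.
struct ColdEdgeParams {
    double maxBranchProbability;  // edge is cold if taken at most this often
    std::uint64_t minPredCount;   // confidence factor: min predecessor executions
    unsigned minRegionCost;       // region must cost at least this much
};

// Decide whether the region behind an edge is worth outlining, given the
// profile counts for the edge and its predecessor block.
bool shouldOutline(std::uint64_t edgeCount, std::uint64_t predCount,
                   unsigned regionCost, const ColdEdgeParams &p) {
    if (predCount == 0 || predCount < p.minPredCount)
        return false;                         // too little profile data
    double prob = static_cast<double>(edgeCount) /
                  static_cast<double>(predCount);
    if (prob > p.maxBranchProbability)
        return false;                         // branch is not cold enough
    return regionCost >= p.minRegionCost;     // large enough to justify a call
}
```

With placeholder parameters of 1% probability, 1000 predecessor executions and a minimum cost of 20, an edge taken 5 times out of 10000 into a region of cost 50 would qualify, while the same edge with only 100 predecessor executions would be rejected for lack of confidence.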

                    Graham Yiu
                    LLVM Compiler Development
                    IBM Toronto Software Lab
                    Office: (905) 413-4077 C2-407/8200/Markham
                    Email: gyiu at ca.ibm.com

                    _______________________________________________
                    LLVM Developers mailing list
                    llvm-dev at lists.llvm.org
                    http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev














