thr3ads.net - llvm dev - [llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks [Aug 2017]

If this information is useful, please help other people find it:
Share via:

Graham Yiu via llvm-dev

2017-Aug-15 18:22 UTC

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hello,

 My team and I are looking to do some enhancements in the partial inliner
 in opt.  Would appreciate any feedback that folks might have.

 # Partial Inlining in LLVM opt

 ## Summary

 ### Background

 Currently, the partial inliner searches the first few blocks of the callee
 and looks for a branch to the return block (ie. early return).  If found,
 it attempts to outline the rest of the slow (or heavy) code so the inliner
 will be able to inline the fast (or light) code.  If no early returns are
 found, the partial inliner will give up.  As far as I can tell,
 BlockFrequency and BranchProbability information is only used when
 attempting to inline the early return code, and not used to determine
 whether to outline the slow code.

 ### Proposed changes

 In addition to looking for early returns, we should utilize profile
 information to outline blocks that are considered cold.  If we can
 sufficiently reduce the size of the original function via this type of
 outlining, inlining should be able to inline the rest of the hot code.

 ## Details

 With the presence of profile information, we have a view of what code is
 infrequently executed and make better decisions on what to outline.  Early
 return blocks that are infrequently executed should still be included as
 candidates for outlining, but will be treated just like any other cold
 blocks.  Without profiling information, however, we should remain
 conservative and only partial inline in the presence of an early return in
 the first few blocks of a function (ie. peel the early return out of the
 function).

 To find cold regions to outline, we will traverse the CFG to find edges
 deemed 'cold' and look at the blocks dominated by the successor node. 
If,
 for some reason, that block has more than one predecessor, then we will
 skip this candidate as there should be a node that dominates this
 successor that has a single entry point.  The last node in the dominance
 vector should also have a single successor.  If it does not, then further
 investigation of the CFG is necessary to see when/how this situation
 occurs.

 We will need several heuristics to make sure we only outline in cases
 where we are confident it will result in a performance gain.  Things such
 as threshold on when a branch is considered cold, the minimum number of
 times the predecessor node has to be executed in order for an edge to be
 considered (confidence factor), and the minimum size of the region to be
 outlined (can use inlining cost analysis like we currently do) will
 require some level of tuning.

 Similar to the current implementation, we will attempt to inline the
 leftover (hot) parts of the code, and if for some reason we cannot then we
 discard the modified function and its outlined code.

 ### Code changes

 The current Partial Inlining code first clones the function of interest
 and looks for a single set of blocks to outline.  It then creates a
 function with the set the blocks, and saves the outlined function and
 outline callsite information as part of the function cloning container.
 In order to outline multiple regions of the function, we will need to
 change these containers to keep track of a list of regions to outline.  We
 will also need to update the cost analysis to take into account multiple
 outlined functions.

 When a ProfileSummary is available, then we should skip the code that
 looks for early returns and go into new code that looks for cold regions
 to outline.  When ProfileSummary is not available, then we can fall back
 to the existing code and look for early returns only.

 ### Tuning

 - The outlining heuristics will need to determine if a set of cold blocks
 is large enough to warrant the overhead of a function call.  We also don't
 want the inliner to attempt to inline the outlined code later.
 - The threshold for determining whether a block is cold will also need to
 be tuned.  In the case that profiling information is not accurate, we will
 pay the price of the additional call overhead for executing cold code.
 - The confidence factor, which can be viewed as the minimum number of
 times the predecessor has to be executed in order for an edge to be
 considered cold, should also be taken into account to avoid outlining code
 paths we have little information on.

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-407/8200/Markham
Email: gyiu at ca.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/17fc26c9/attachment.html>

Jessica Paquette via llvm-dev

2017-Aug-15 18:45 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hi Graham,

Have you looked at using the existing outliner framework to implement partial
inlining? In the past, people have combined outlining + inlining to get partial
inlining “for free” (see "Function outlining and partial inlining” by Zhao
and Amaral).

There is the MachineOutliner. River Riddle has been working on an IR-level
outliner as well. The MachineOutliner doesn’t currently consider the heat of a
block, but might be worth experimenting with. It might be good to experiment
with River’s work and the existing outlining pass to see if this functionality
can effectively fall out of what we have already.

Here’s the recent IR-level outlining thread for reference wrt River’s work:
http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html
<http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html>

- Jessica
> On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Hello,
> 
> My team and I are looking to do some enhancements in the partial inliner in
opt. Would appreciate any feedback that folks might have.
> 
> # Partial Inlining in LLVM opt
> 
> ## Summary
> 
> ### Background
> 
> Currently, the partial inliner searches the first few blocks of the callee
and looks for a branch to the return block (ie. early return). If found, it
attempts to outline the rest of the slow (or heavy) code so the inliner will be
able to inline the fast (or light) code. If no early returns are found, the
partial inliner will give up. As far as I can tell, BlockFrequency and
BranchProbability information is only used when attempting to inline the early
return code, and not used to determine whether to outline the slow code.
> 
> ### Proposed changes
> 
> In addition to looking for early returns, we should utilize profile
information to outline blocks that are considered cold. If we can sufficiently
reduce the size of the original function via this type of outlining, inlining
should be able to inline the rest of the hot code.
> 
> ## Details
> 
> With the presence of profile information, we have a view of what code is
infrequently executed and make better decisions on what to outline. Early return
blocks that are infrequently executed should still be included as candidates for
outlining, but will be treated just like any other cold blocks. Without
profiling information, however, we should remain conservative and only partial
inline in the presence of an early return in the first few blocks of a function
(ie. peel the early return out of the function).
> 
> To find cold regions to outline, we will traverse the CFG to find edges
deemed 'cold' and look at the blocks dominated by the successor node.
If, for some reason, that block has more than one predecessor, then we will skip
this candidate as there should be a node that dominates this successor that has
a single entry point. The last node in the dominance vector should also have a
single successor. If it does not, then further investigation of the CFG is
necessary to see when/how this situation occurs.
> 
> We will need several heuristics to make sure we only outline in cases where
we are confident it will result in a performance gain. Things such as threshold
on when a branch is considered cold, the minimum number of times the predecessor
node has to be executed in order for an edge to be considered (confidence
factor), and the minimum size of the region to be outlined (can use inlining
cost analysis like we currently do) will require some level of tuning.
> 
> Similar to the current implementation, we will attempt to inline the
leftover (hot) parts of the code, and if for some reason we cannot then we
discard the modified function and its outlined code.
> 
> ### Code changes
> 
> The current Partial Inlining code first clones the function of interest and
looks for a single set of blocks to outline. It then creates a function with the
set the blocks, and saves the outlined function and outline callsite information
as part of the function cloning container. In order to outline multiple regions
of the function, we will need to change these containers to keep track of a list
of regions to outline. We will also need to update the cost analysis to take
into account multiple outlined functions.
> 
> When a ProfileSummary is available, then we should skip the code that looks
for early returns and go into new code that looks for cold regions to outline.
When ProfileSummary is not available, then we can fall back to the existing code
and look for early returns only.
> 
> ### Tuning
> 
> - The outlining heuristics will need to determine if a set of cold blocks
is large enough to warrant the overhead of a function call. We also don't
want the inliner to attempt to inline the outlined code later.
> - The threshold for determining whether a block is cold will also need to
be tuned. In the case that profiling information is not accurate, we will pay
the price of the additional call overhead for executing cold code.
> - The confidence factor, which can be viewed as the minimum number of times
the predecessor has to be executed in order for an edge to be considered cold,
should also be taken into account to avoid outlining code paths we have little
information on.
> 
> Graham Yiu
> LLVM Compiler Development
> IBM Toronto Software Lab
> Office: (905) 413-4077 C2-407/8200/Markham
> Email: gyiu at ca.ibm.com
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/3b590d13/attachment.html>

Graham Yiu via llvm-dev

2017-Aug-15 20:14 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hi Jessica,

Thanks for the feedback.

I believe the existing partial inliner pass does use some common utilities
like the code extractor to do outlining.  Is that what you're referring to?
I don't recall seeing any other passes that has outlining other than the
machine outliner, but I may have missed something.

I briefly looked at River's RFC and it seems he's mainly utilizing
outlining as a tool to reduce code size while maintaining code performance.
Our proposal is to improve on the existing outlining scheme of the partial
inliner to give us more opportunities to inline hot code.  We will only do
this type of general outlining with the presence of profiling or user data
(such as builtin_expect).  With that said, I am interested in how River
plans to handle live ranges and heuristics he will use with profiling data
available.

I haven't looked at the machine outliner yet, but my understanding is it
also prioritizes code size reduction over performance gains.  Is my
assumption correct?  Do you think there's some tuning we can do in the
machine outliner to tailor for performance improvements?

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-407/8200/Markham
Email: gyiu at ca.ibm.com



From:	Jessica Paquette <jpaquette at apple.com>
To:	Graham Yiu <gyiu at ca.ibm.com>
Cc:	llvm-dev at lists.llvm.org
Date:	08/15/2017 02:45 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks
Sent by:	jpaquette at apple.com



Hi Graham,

Have you looked at using the existing outliner framework to implement
partial inlining? In the past, people have combined outlining + inlining to
get partial inlining “for free” (see "Function outlining and partial
inlining” by Zhao and Amaral).

There is the MachineOutliner. River Riddle has been working on an IR-level
outliner as well. The MachineOutliner doesn’t currently consider the heat
of a block, but might be worth experimenting with. It might be good to
experiment with River’s work and the existing outlining pass to see if this
functionality can effectively fall out of what we have already.

Here’s the recent IR-level outlining thread for reference wrt River’s work:
http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html

- Jessica

      On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
      llvm-dev at lists.llvm.org> wrote:



      Hello,

      My team and I are looking to do some enhancements in the partial
      inliner in opt. Would appreciate any feedback that folks might have.

      # Partial Inlining in LLVM opt

      ## Summary

      ### Background

      Currently, the partial inliner searches the first few blocks of the
      callee and looks for a branch to the return block (ie. early return).
      If found, it attempts to outline the rest of the slow (or heavy) code
      so the inliner will be able to inline the fast (or light) code. If no
      early returns are found, the partial inliner will give up. As far as
      I can tell, BlockFrequency and BranchProbability information is only
      used when attempting to inline the early return code, and not used to
      determine whether to outline the slow code.

      ### Proposed changes

      In addition to looking for early returns, we should utilize profile
      information to outline blocks that are considered cold. If we can
      sufficiently reduce the size of the original function via this type
      of outlining, inlining should be able to inline the rest of the hot
      code.

      ## Details

      With the presence of profile information, we have a view of what code
      is infrequently executed and make better decisions on what to
      outline. Early return blocks that are infrequently executed should
      still be included as candidates for outlining, but will be treated
      just like any other cold blocks. Without profiling information,
      however, we should remain conservative and only partial inline in the
      presence of an early return in the first few blocks of a function
      (ie. peel the early return out of the function).

      To find cold regions to outline, we will traverse the CFG to find
      edges deemed 'cold' and look at the blocks dominated by the
successor
      node. If, for some reason, that block has more than one predecessor,
      then we will skip this candidate as there should be a node that
      dominates this successor that has a single entry point. The last node
      in the dominance vector should also have a single successor. If it
      does not, then further investigation of the CFG is necessary to see
      when/how this situation occurs.

      We will need several heuristics to make sure we only outline in cases
      where we are confident it will result in a performance gain. Things
      such as threshold on when a branch is considered cold, the minimum
      number of times the predecessor node has to be executed in order for
      an edge to be considered (confidence factor), and the minimum size of
      the region to be outlined (can use inlining cost analysis like we
      currently do) will require some level of tuning.

      Similar to the current implementation, we will attempt to inline the
      leftover (hot) parts of the code, and if for some reason we cannot
      then we discard the modified function and its outlined code.

      ### Code changes

      The current Partial Inlining code first clones the function of
      interest and looks for a single set of blocks to outline. It then
      creates a function with the set the blocks, and saves the outlined
      function and outline callsite information as part of the function
      cloning container. In order to outline multiple regions of the
      function, we will need to change these containers to keep track of a
      list of regions to outline. We will also need to update the cost
      analysis to take into account multiple outlined functions.

      When a ProfileSummary is available, then we should skip the code that
      looks for early returns and go into new code that looks for cold
      regions to outline. When ProfileSummary is not available, then we can
      fall back to the existing code and look for early returns only.

      ### Tuning

      - The outlining heuristics will need to determine if a set of cold
      blocks is large enough to warrant the overhead of a function call. We
      also don't want the inliner to attempt to inline the outlined code
      later.
      - The threshold for determining whether a block is cold will also
      need to be tuned. In the case that profiling information is not
      accurate, we will pay the price of the additional call overhead for
      executing cold code.
      - The confidence factor, which can be viewed as the minimum number of
      times the predecessor has to be executed in order for an edge to be
      considered cold, should also be taken into account to avoid outlining
      code paths we have little information on.

      Graham Yiu
      LLVM Compiler Development
      IBM Toronto Software Lab
      Office: (905) 413-4077 C2-407/8200/Markham
      Email: gyiu at ca.ibm.com


      _______________________________________________
      LLVM Developers mailing list
      llvm-dev at lists.llvm.org
      http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/93e2a93b/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/93e2a93b/attachment.gif>

Xinliang David Li via llvm-dev

2017-Aug-15 21:36 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hi Graham, Making partial inlining more general is something worth doing.
Regarding your implementation plan, I have some suggestions here:

*) Function outlining introduces additional runtime cost: passing of live
in values, returning of live out values (via memory), glue code in the
caller to handle regions without a single exit block etc.  The cost
analysis needs to factor in those carefully
*) Remove the limitation that there is only *one* outlined routine.
Instead, the algorithm can compute multiple single-entry/single exit or
single entry/multiple exit regions (cold ones) in the routine, and outline
each region into its own function. The benefit include
   1) simplify the design and implementation and most of the existing code
can be reused;
   2) provide more flexibility to allow most effective outlining;
   3) reduced runtime overhead of making calls to the outline functions.

thanks,

David

On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hello,
>
> My team and I are looking to do some enhancements in the partial inliner
> in opt. Would appreciate any feedback that folks might have.
>
> # Partial Inlining in LLVM opt
>
> ## Summary
>
> ### Background
>
> Currently, the partial inliner searches the first few blocks of the callee
> and looks for a branch to the return block (ie. early return). If found, it
> attempts to outline the rest of the slow (or heavy) code so the inliner
> will be able to inline the fast (or light) code. If no early returns are
> found, the partial inliner will give up. As far as I can tell,
> BlockFrequency and BranchProbability information is only used when
> attempting to inline the early return code, and not used to determine
> whether to outline the slow code.
>
> ### Proposed changes
>
> In addition to looking for early returns, we should utilize profile
> information to outline blocks that are considered cold. If we can
> sufficiently reduce the size of the original function via this type of
> outlining, inlining should be able to inline the rest of the hot code.
>
> ## Details
>
> With the presence of profile information, we have a view of what code is
> infrequently executed and make better decisions on what to outline. Early
> return blocks that are infrequently executed should still be included as
> candidates for outlining, but will be treated just like any other cold
> blocks. Without profiling information, however, we should remain
> conservative and only partial inline in the presence of an early return in
> the first few blocks of a function (ie. peel the early return out of the
> function).
>
> To find cold regions to outline, we will traverse the CFG to find edges
> deemed 'cold' and look at the blocks dominated by the successor
node. If,
> for some reason, that block has more than one predecessor, then we will
> skip this candidate as there should be a node that dominates this successor
> that has a single entry point. The last node in the dominance vector should
> also have a single successor. If it does not, then further investigation of
> the CFG is necessary to see when/how this situation occurs.
>
> We will need several heuristics to make sure we only outline in cases
> where we are confident it will result in a performance gain. Things such as
> threshold on when a branch is considered cold, the minimum number of times
> the predecessor node has to be executed in order for an edge to be
> considered (confidence factor), and the minimum size of the region to be
> outlined (can use inlining cost analysis like we currently do) will require
> some level of tuning.
>
> Similar to the current implementation, we will attempt to inline the
> leftover (hot) parts of the code, and if for some reason we cannot then we
> discard the modified function and its outlined code.
>
> ### Code changes
>
> The current Partial Inlining code first clones the function of interest
> and looks for a single set of blocks to outline. It then creates a function
> with the set the blocks, and saves the outlined function and outline
> callsite information as part of the function cloning container. In order to
> outline multiple regions of the function, we will need to change these
> containers to keep track of a list of regions to outline. We will also need
> to update the cost analysis to take into account multiple outlined
> functions.
>
> When a ProfileSummary is available, then we should skip the code that
> looks for early returns and go into new code that looks for cold regions to
> outline. When ProfileSummary is not available, then we can fall back to the
> existing code and look for early returns only.
>
> ### Tuning
>
> - The outlining heuristics will need to determine if a set of cold blocks
> is large enough to warrant the overhead of a function call. We also
don't
> want the inliner to attempt to inline the outlined code later.
> - The threshold for determining whether a block is cold will also need to
> be tuned. In the case that profiling information is not accurate, we will
> pay the price of the additional call overhead for executing cold code.
> - The confidence factor, which can be viewed as the minimum number of
> times the predecessor has to be executed in order for an edge to be
> considered cold, should also be taken into account to avoid outlining code
> paths we have little information on.
>
> Graham Yiu
> LLVM Compiler Development
> IBM Toronto Software Lab
> Office: (905) 413-4077 C2-407/8200/Markham
> Email: gyiu at ca.ibm.com
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/2bd300a3/attachment.html>

Graham Yiu via llvm-dev

2017-Aug-16 00:04 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hey David,

Yes, we'll need to consider the effect on live ranges for regions we want
to outline.  In my experience, outlining live-exit regions seem to cause
the most harm as we ruin chances to keep data in registers as you were
alluding to.  It's unclear, however, what the exact effect of outlining
regions with live-entries would be.

I'll probably try to avoid regions that are not single entry & single
exit
at least initially, to simplify the transformation and analysis.  Are
multi-exit regions common in your experience?

And of course, I agree, we should reuse as much of the current partial
inlining infrastructure as possible.  I'll likely run some ideas by you as
I begin to make changes.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-407/8200/Markham
Email: gyiu at ca.ibm.com



From:	Xinliang David Li <xinliangli at gmail.com>
To:	Graham Yiu <gyiu at ca.ibm.com>
Cc:	llvm-dev <llvm-dev at lists.llvm.org>
Date:	08/15/2017 05:36 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks



Hi Graham, Making partial inlining more general is something worth doing.
Regarding your implementation plan, I have some suggestions here:

*) Function outlining introduces additional runtime cost: passing of live
in values, returning of live out values (via memory), glue code in the
caller to handle regions without a single exit block etc.  The cost
analysis needs to factor in those carefully
*) Remove the limitation that there is only *one* outlined routine.
Instead, the algorithm can compute multiple single-entry/single exit or
single entry/multiple exit regions (cold ones) in the routine, and outline
each region into its own function. The benefit include
   1) simplify the design and implementation and most of the existing code
can be reused;
   2) provide more flexibility to allow most effective outlining;
   3) reduced runtime overhead of making calls to the outline functions.

thanks,

David

On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
  Hello,

  My team and I are looking to do some enhancements in the partial inliner
  in opt. Would appreciate any feedback that folks might have.

  # Partial Inlining in LLVM opt

  ## Summary

  ### Background

  Currently, the partial inliner searches the first few blocks of the
  callee and looks for a branch to the return block (ie. early return). If
  found, it attempts to outline the rest of the slow (or heavy) code so the
  inliner will be able to inline the fast (or light) code. If no early
  returns are found, the partial inliner will give up. As far as I can
  tell, BlockFrequency and BranchProbability information is only used when
  attempting to inline the early return code, and not used to determine
  whether to outline the slow code.

  ### Proposed changes

  In addition to looking for early returns, we should utilize profile
  information to outline blocks that are considered cold. If we can
  sufficiently reduce the size of the original function via this type of
  outlining, inlining should be able to inline the rest of the hot code.

  ## Details

  With the presence of profile information, we have a view of what code is
  infrequently executed and make better decisions on what to outline. Early
  return blocks that are infrequently executed should still be included as
  candidates for outlining, but will be treated just like any other cold
  blocks. Without profiling information, however, we should remain
  conservative and only partial inline in the presence of an early return
  in the first few blocks of a function (ie. peel the early return out of
  the function).

  To find cold regions to outline, we will traverse the CFG to find edges
  deemed 'cold' and look at the blocks dominated by the successor node.
If,
  for some reason, that block has more than one predecessor, then we will
  skip this candidate as there should be a node that dominates this
  successor that has a single entry point. The last node in the dominance
  vector should also have a single successor. If it does not, then further
  investigation of the CFG is necessary to see when/how this situation
  occurs.

  We will need several heuristics to make sure we only outline in cases
  where we are confident it will result in a performance gain. Things such
  as threshold on when a branch is considered cold, the minimum number of
  times the predecessor node has to be executed in order for an edge to be
  considered (confidence factor), and the minimum size of the region to be
  outlined (can use inlining cost analysis like we currently do) will
  require some level of tuning.

  Similar to the current implementation, we will attempt to inline the
  leftover (hot) parts of the code, and if for some reason we cannot then
  we discard the modified function and its outlined code.

  ### Code changes

  The current Partial Inlining code first clones the function of interest
  and looks for a single set of blocks to outline. It then creates a
  function with the set the blocks, and saves the outlined function and
  outline callsite information as part of the function cloning container.
  In order to outline multiple regions of the function, we will need to
  change these containers to keep track of a list of regions to outline. We
  will also need to update the cost analysis to take into account multiple
  outlined functions.

  When a ProfileSummary is available, then we should skip the code that
  looks for early returns and go into new code that looks for cold regions
  to outline. When ProfileSummary is not available, then we can fall back
  to the existing code and look for early returns only.

  ### Tuning

  - The outlining heuristics will need to determine if a set of cold blocks
  is large enough to warrant the overhead of a function call. We also don't
  want the inliner to attempt to inline the outlined code later.
  - The threshold for determining whether a block is cold will also need to
  be tuned. In the case that profiling information is not accurate, we will
  pay the price of the additional call overhead for executing cold code.
  - The confidence factor, which can be viewed as the minimum number of
  times the predecessor has to be executed in order for an edge to be
  considered cold, should also be taken into account to avoid outlining
  code paths we have little information on.

  Graham Yiu
  LLVM Compiler Development
  IBM Toronto Software Lab
  Office: (905) 413-4077 C2-407/8200/Markham
  Email: gyiu at ca.ibm.com

  _______________________________________________
  LLVM Developers mailing list
  llvm-dev at lists.llvm.org
  http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev



-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/b663c474/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/b663c474/attachment.gif>

Adrian Prantl via llvm-dev

2017-Aug-16 16:34 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

(How) does your proposal affect the debug info for the transformed code?

-- adrian

Graham Yiu via llvm-dev

2017-Aug-16 17:05 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hi Adrian,

That's a good question, I haven't looked into the effect on debug
information, but I imagine it won't be any different from what the partial
inliner does today (David Li might be able to answer that question better
than I).  There's existing utilities that do the code extraction for
outlining, so I'll need to look into that code to see if it also modifies
debug information.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-707/8200/Markham
Email: gyiu at ca.ibm.com



From:	Adrian Prantl <aprantl at apple.com>
To:	Graham Yiu <gyiu at ca.ibm.com>
Cc:	llvm-dev at lists.llvm.org
Date:	08/16/2017 12:34 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks
Sent by:	aprantl at apple.com



(How) does your proposal affect the debug info for the transformed code?

-- adrian



-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170816/65b77dc5/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170816/65b77dc5/attachment.gif>

Graham Yiu via llvm-dev

2017-Aug-24 17:40 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hi David,

So I've began doing some implementation on the outlining portion of the
code.  Currently, I got the partial inliner to outline cold regions (single
entry, single exit) of the code, based solely on the existence of
ProfileSummaryInfo (ie. profiling data).  However, I have some concerns on
how this will co-exist with the existing code that peels early returns.

The control flow looks something like this:

// New Code: find cold regions to outline
if (!computeOutliningInfoForColdRegions()) {
  // If we can't find any cold regions, then fall-back to early return
peeling
  if (!computeOutliningInfo) {
     return nullptr;
  }
}
// Try to outline the identified regions
// Then try to inline the cloned function

My concern is during inlining, if we fail to inline the cloned function, we
give up and discard all cloned and outlined functions.  But with these two
types of outlining we're doing, it's possible to attempt to inline the
cloned function that has outlined cold regions, and if we cannot do so, try
to inline a different clone that has peeled early returns (ie. the way we
have it today).  This would require us to clone the original function twice
and modify one based on cold region outlining and the other early return
peeling, with the latter being our fall-back option if we fail to inline
the first clone.

What are your thoughts?

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-707/8200/Markham
Email: gyiu at ca.ibm.com



From:	Graham Yiu/Toronto/IBM
To:	Xinliang David Li <xinliangli at gmail.com>
Cc:	llvm-dev <llvm-dev at lists.llvm.org>
Date:	08/15/2017 08:04 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks


Hey David,

Yes, we'll need to consider the effect on live ranges for regions we want
to outline.  In my experience, outlining live-exit regions seem to cause
the most harm as we ruin chances to keep data in registers as you were
alluding to.  It's unclear, however, what the exact effect of outlining
regions with live-entries would be.

I'll probably try to avoid regions that are not single entry & single
exit
at least initially, to simplify the transformation and analysis.  Are
multi-exit regions common in your experience?

And of course, I agree, we should reuse as much of the current partial
inlining infrastructure as possible.  I'll likely run some ideas by you as
I begin to make changes.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-407/8200/Markham
Email: gyiu at ca.ibm.com




From:	Xinliang David Li <xinliangli at gmail.com>
To:	Graham Yiu <gyiu at ca.ibm.com>
Cc:	llvm-dev <llvm-dev at lists.llvm.org>
Date:	08/15/2017 05:36 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks



Hi Graham, Making partial inlining more general is something worth doing.
Regarding your implementation plan, I have some suggestions here:

*) Function outlining introduces additional runtime cost: passing of live
in values, returning of live out values (via memory), glue code in the
caller to handle regions without a single exit block etc.  The cost
analysis needs to factor in those carefully
*) Remove the limitation that there is only *one* outlined routine.
Instead, the algorithm can compute multiple single-entry/single exit or
single entry/multiple exit regions (cold ones) in the routine, and outline
each region into its own function. The benefit include
   1) simplify the design and implementation and most of the existing code
can be reused;
   2) provide more flexibility to allow most effective outlining;
   3) reduced runtime overhead of making calls to the outline functions.

thanks,

David

On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
  Hello,

  My team and I are looking to do some enhancements in the partial inliner
  in opt. Would appreciate any feedback that folks might have.

  # Partial Inlining in LLVM opt

  ## Summary

  ### Background

  Currently, the partial inliner searches the first few blocks of the
  callee and looks for a branch to the return block (ie. early return). If
  found, it attempts to outline the rest of the slow (or heavy) code so the
  inliner will be able to inline the fast (or light) code. If no early
  returns are found, the partial inliner will give up. As far as I can
  tell, BlockFrequency and BranchProbability information is only used when
  attempting to inline the early return code, and not used to determine
  whether to outline the slow code.

  ### Proposed changes

  In addition to looking for early returns, we should utilize profile
  information to outline blocks that are considered cold. If we can
  sufficiently reduce the size of the original function via this type of
  outlining, inlining should be able to inline the rest of the hot code.

  ## Details

  With the presence of profile information, we have a view of what code is
  infrequently executed and make better decisions on what to outline. Early
  return blocks that are infrequently executed should still be included as
  candidates for outlining, but will be treated just like any other cold
  blocks. Without profiling information, however, we should remain
  conservative and only partial inline in the presence of an early return
  in the first few blocks of a function (ie. peel the early return out of
  the function).

  To find cold regions to outline, we will traverse the CFG to find edges
  deemed 'cold' and look at the blocks dominated by the successor node.
If,
  for some reason, that block has more than one predecessor, then we will
  skip this candidate as there should be a node that dominates this
  successor that has a single entry point. The last node in the dominance
  vector should also have a single successor. If it does not, then further
  investigation of the CFG is necessary to see when/how this situation
  occurs.

  We will need several heuristics to make sure we only outline in cases
  where we are confident it will result in a performance gain. Things such
  as threshold on when a branch is considered cold, the minimum number of
  times the predecessor node has to be executed in order for an edge to be
  considered (confidence factor), and the minimum size of the region to be
  outlined (can use inlining cost analysis like we currently do) will
  require some level of tuning.

  Similar to the current implementation, we will attempt to inline the
  leftover (hot) parts of the code, and if for some reason we cannot then
  we discard the modified function and its outlined code.

  ### Code changes

  The current Partial Inlining code first clones the function of interest
  and looks for a single set of blocks to outline. It then creates a
  function with the set the blocks, and saves the outlined function and
  outline callsite information as part of the function cloning container.
  In order to outline multiple regions of the function, we will need to
  change these containers to keep track of a list of regions to outline. We
  will also need to update the cost analysis to take into account multiple
  outlined functions.

  When a ProfileSummary is available, then we should skip the code that
  looks for early returns and go into new code that looks for cold regions
  to outline. When ProfileSummary is not available, then we can fall back
  to the existing code and look for early returns only.

  ### Tuning

  - The outlining heuristics will need to determine if a set of cold blocks
  is large enough to warrant the overhead of a function call. We also don't
  want the inliner to attempt to inline the outlined code later.
  - The threshold for determining whether a block is cold will also need to
  be tuned. In the case that profiling information is not accurate, we will
  pay the price of the additional call overhead for executing cold code.
  - The confidence factor, which can be viewed as the minimum number of
  times the predecessor has to be executed in order for an edge to be
  considered cold, should also be taken into account to avoid outlining
  code paths we have little information on.

  Graham Yiu
  LLVM Compiler Development
  IBM Toronto Software Lab
  Office: (905) 413-4077 C2-407/8200/Markham
  Email: gyiu at ca.ibm.com

  _______________________________________________
  LLVM Developers mailing list
  llvm-dev at lists.llvm.org
  http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev




-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170824/77c78ff3/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170824/77c78ff3/attachment.gif>

Xinliang David Li via llvm-dev

2017-Aug-24 19:04 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.ibm.com> wrote:
> Hi David,
>
> So I've began doing some implementation on the outlining portion of the
> code. Currently, I got the partial inliner to outline cold regions (single
> entry, single exit) of the code, based solely on the existence of
> ProfileSummaryInfo (ie. profiling data). However, I have some concerns on
> how this will co-exist with the existing code that peels early returns.
>
> The control flow looks something like this:
>
> // New Code: find cold regions to outline
> if (!computeOutliningInfoForColdRegions()) {
> // If we can't find any cold regions, then fall-back to early return
> peeling
> if (!computeOutliningInfo) {
> return nullptr;
> }
> }
> // Try to outline the identified regions
> // Then try to inline the cloned function
>
> My concern is during inlining, if we fail to inline the cloned function,
> we give up and discard all cloned and outlined functions. But with these
> two types of outlining we're doing, it's possible to attempt to
inline the
> cloned function that has outlined cold regions, and if we cannot do so, try
> to inline a different clone that has peeled early returns (ie. the way we
> have it today). This would require us to clone the original function twice
> and modify one based on cold region outlining and the other early return
> peeling, with the latter being our fall-back option if we fail to inline
> the first clone.
>
> What are your thoughts?
>
>I expect  computeOutliningInfoForColdRegions can produce a super set of
outlinable regions to the current 'pattern matching' approach. In other
words, most of the cases currently caught by 'computeOutlineInfo' should
be
caught by the new algorithm, so why not ditching the current
'computeOutlningInfo' completely?

My suggestion was to enhance the pass to 1) support outlining multiple
regions; and 2) add a mode to do function outlining only (not the inlining
part).  The second is important can be used before the regular inliner
pass.   With the new pass manager and profile aware inlining, the inliner
won't undo the outline decision, but in meantime becomes more powerful due
to the reduced hot function size.

David


> Graham Yiu
> LLVM Compiler Development
> IBM Toronto Software Lab
> Office: (905) 413-4077 C2-707/8200/Markham
> Email: gyiu at ca.ibm.com
>
> [image: Inactive hide details for Graham Yiu---08/15/2017 08:04:28
> PM---Hey David, Yes, we'll need to consider the effect on live
range]Graham
> Yiu---08/15/2017 08:04:28 PM---Hey David, Yes, we'll need to consider
the
> effect on live ranges for regions we want to outline. In
>
> From: Graham Yiu/Toronto/IBM
> To: Xinliang David Li <xinliangli at gmail.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Date: 08/15/2017 08:04 PM
> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
> outlining scheme for cold blocks
> ------------------------------
>
>
> Hey David,
>
> Yes, we'll need to consider the effect on live ranges for regions we
want
> to outline. In my experience, outlining live-exit regions seem to cause the
> most harm as we ruin chances to keep data in registers as you were alluding
> to. It's unclear, however, what the exact effect of outlining regions
with
> live-entries would be.
>
> I'll probably try to avoid regions that are not single entry &
single exit
> at least initially, to simplify the transformation and analysis. Are
> multi-exit regions common in your experience?
>
> And of course, I agree, we should reuse as much of the current partial
> inlining infrastructure as possible. I'll likely run some ideas by you
as I
> begin to make changes.
>
> Cheers,
>
> Graham Yiu
> LLVM Compiler Development
> IBM Toronto Software Lab
> Office: (905) 413-4077 C2-407/8200/Markham
> Email: gyiu at ca.ibm.com
>
>
> [image: Inactive hide details for Xinliang David Li ---08/15/2017 05:36:07
> PM---Hi Graham, Making partial inlining more general is some]Xinliang
> David Li ---08/15/2017 05:36:07 PM---Hi Graham, Making partial inlining
> more general is something worth doing. Regarding your implementat
>
> From: Xinliang David Li <xinliangli at gmail.com>
> To: Graham Yiu <gyiu at ca.ibm.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Date: 08/15/2017 05:36 PM
> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
> outlining scheme for cold blocks
> ------------------------------
>
>
>
> Hi Graham, Making partial inlining more general is something worth doing.
> Regarding your implementation plan, I have some suggestions here:
>
> *) Function outlining introduces additional runtime cost: passing of live
> in values, returning of live out values (via memory), glue code in the
> caller to handle regions without a single exit block etc.  The cost
> analysis needs to factor in those carefully
> *) Remove the limitation that there is only *one* outlined routine.
> Instead, the algorithm can compute multiple single-entry/single exit or
> single entry/multiple exit regions (cold ones) in the routine, and outline
> each region into its own function. The benefit include
>    1) simplify the design and implementation and most of the existing code
> can be reused;
>    2) provide more flexibility to allow most effective outlining;
>    3) reduced runtime overhead of making calls to the outline functions.
>
> thanks,
>
> David
>
> On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>> wrote:
>
>    Hello,
>
>    My team and I are looking to do some enhancements in the partial
>    inliner in opt. Would appreciate any feedback that folks might have.
>
>    # Partial Inlining in LLVM opt
>
>    ## Summary
>
>    ### Background
>
>    Currently, the partial inliner searches the first few blocks of the
>    callee and looks for a branch to the return block (ie. early return). If
>    found, it attempts to outline the rest of the slow (or heavy) code so
the
>    inliner will be able to inline the fast (or light) code. If no early
>    returns are found, the partial inliner will give up. As far as I can
tell,
>    BlockFrequency and BranchProbability information is only used when
>    attempting to inline the early return code, and not used to determine
>    whether to outline the slow code.
>
>    ### Proposed changes
>
>    In addition to looking for early returns, we should utilize profile
>    information to outline blocks that are considered cold. If we can
>    sufficiently reduce the size of the original function via this type of
>    outlining, inlining should be able to inline the rest of the hot code.
>
>    ## Details
>
>    With the presence of profile information, we have a view of what code
>    is infrequently executed and make better decisions on what to outline.
>    Early return blocks that are infrequently executed should still be
included
>    as candidates for outlining, but will be treated just like any other
cold
>    blocks. Without profiling information, however, we should remain
>    conservative and only partial inline in the presence of an early return
in
>    the first few blocks of a function (ie. peel the early return out of the
>    function).
>
>    To find cold regions to outline, we will traverse the CFG to find
>    edges deemed 'cold' and look at the blocks dominated by the
successor node.
>    If, for some reason, that block has more than one predecessor, then we
will
>    skip this candidate as there should be a node that dominates this
successor
>    that has a single entry point. The last node in the dominance vector
should
>    also have a single successor. If it does not, then further investigation
of
>    the CFG is necessary to see when/how this situation occurs.
>
>    We will need several heuristics to make sure we only outline in cases
>    where we are confident it will result in a performance gain. Things such
as
>    threshold on when a branch is considered cold, the minimum number of
times
>    the predecessor node has to be executed in order for an edge to be
>    considered (confidence factor), and the minimum size of the region to be
>    outlined (can use inlining cost analysis like we currently do) will
require
>    some level of tuning.
>
>    Similar to the current implementation, we will attempt to inline the
>    leftover (hot) parts of the code, and if for some reason we cannot then
we
>    discard the modified function and its outlined code.
>
>    ### Code changes
>
>    The current Partial Inlining code first clones the function of
>    interest and looks for a single set of blocks to outline. It then
creates a
>    function with the set the blocks, and saves the outlined function and
>    outline callsite information as part of the function cloning container.
In
>    order to outline multiple regions of the function, we will need to
change
>    these containers to keep track of a list of regions to outline. We will
>    also need to update the cost analysis to take into account multiple
>    outlined functions.
>
>    When a ProfileSummary is available, then we should skip the code that
>    looks for early returns and go into new code that looks for cold regions
to
>    outline. When ProfileSummary is not available, then we can fall back to
the
>    existing code and look for early returns only.
>
>    ### Tuning
>
>    - The outlining heuristics will need to determine if a set of cold
>    blocks is large enough to warrant the overhead of a function call. We
also
>    don't want the inliner to attempt to inline the outlined code later.
>    - The threshold for determining whether a block is cold will also need
>    to be tuned. In the case that profiling information is not accurate, we
>    will pay the price of the additional call overhead for executing cold
code.
>    - The confidence factor, which can be viewed as the minimum number of
>    times the predecessor has to be executed in order for an edge to be
>    considered cold, should also be taken into account to avoid outlining
code
>    paths we have little information on.
>
>    Graham Yiu
>    LLVM Compiler Development
>    IBM Toronto Software Lab
>    Office: *(905) 413-4077* <(905)%20413-4077> C2-407/8200/Markham
>    Email: *gyiu at ca.ibm.com* <gyiu at ca.ibm.com>
>
>    _______________________________________________
>    LLVM Developers mailing list
> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>
> *http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev*
>    <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>
>
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170824/dd343d96/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170824/dd343d96/attachment.gif>

llvm dev - Aug 2017 - [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks