thr3ads.net - llvm dev - [llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks [Aug 2017]

If this information is useful, please help other people find it:
Share via:

Xinliang David Li via llvm-dev

2017-Aug-15 23:22 UTC

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

On Tue, Aug 15, 2017 at 4:14 PM, River Riddle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hey Graham,
>  I worked on pretty much this exact thing last year.  I did something
> similar to what you described, I traversed the CFG and built potentially
> profitable regions from any given valid start node. At that point there
> were several road blocks that prevented it from really going anywhere( like
> the inliner not preserving pgo data) but those problems should be gone by
> now.
> Some thoughts:
>
> : In terms of cost modeling, you will want to consider the interaction
> with loops:
>  - Whats the impact of outlining from inside of a loop.
>  - What if the hot region does not include the entry/exit, i.e hot loop.
>
> : Something that would also benefit, but would require more tuning, is the
> relationship between size and "hotness". If for example you have
a function
> that looks like this:
>
> foo(bool B) {
> ... // Code that prevents early return.
>
> if(B) // RegionA : Taken 50% of the time.
>   ... Very large amount of code.
>
> else // RegionB : Taken 50% of the time.
>   ... Very small amount of code.
> }
>
> In the above you have two regions that aren't exactly cold. Outlining
> RegionA will make foo profitable to inline, but only if you take into
> account its large size. Using profile data as the sole cutoff will miss
> this opportunity.
>
>
Modelling the cost/benefit for this case can be very tricky though.


>
> : You may also want to look into hot switch statement splitting, though I
> don't know if it should really go into the partial inliner. This is
> splitting a switch statement into multiple switch statements if there are
> only a few cases that are actually hot.
>
> SWITCH1:
> switch(val) {
>   HOT: ...
>   COLD1: ...
>   COLD2: ...
>   ....
>   COLDN: ...
>   default: ...
> }
>
> would get transformed into:
>
>
> SWITCH1:
> switch(val) {
>  HOT: ...
>  default:
>    br SWITCH2:
> }
>
> SWITCH2:
> switch(val) {
>   COLD1: ...
>   COLD2: ...
>   ....
>   COLDN: ...
>   default: ...
> }
>
>
>
This is profile based switch peeling -- which is a separate optimization
from partial inlining.

David




> On Tue, Aug 15, 2017 at 1:14 PM, Graham Yiu via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi Jessica,
>>
>> Thanks for the feedback.
>>
>> I believe the existing partial inliner pass does use some common
>> utilities like the code extractor to do outlining.
>>
> AFAIK, Code extractor is the only IR level utility for outlining, but
it's
> not the best and definitely needs some improvement. It currently returns
> outputs via live out pointer parameters, which means that outlined
> functions can't be tagged with attributes like readonly.
>
> There were also some known problems with extracting in the presence of
> critical edges, but I don't know if thats still the case.
>
>
>> Is that what you're referring to? I don't recall seeing any
other passes
>> that has outlining other than the machine outliner, but I may have
missed
>> something.
>>
>> I briefly looked at River's RFC and it seems he's mainly
utilizing
>> outlining as a tool to reduce code size while maintaining code
performance.
>>
>
> The main point of the outliner I proposed in that RFC is benefiting code
> size. The use of PGO data is to simply avoid detrimenting runtime
> performance.
>
>
>> Our proposal is to improve on the existing outlining scheme of the
>> partial inliner to give us more opportunities to inline hot code. We
will
>> only do this type of general outlining with the presence of profiling
or
>> user data (such as builtin_expect).
>>
> With that said, I am interested in how River plans to handle live ranges
>>
>
>  It's a pretty difficult problem. I have estimation for register usage
> within an outlined region, using something similar to the register cost
> calculation in the loop vectorizer. I don't currently have anything for
the
> actual call to the function itself. I haven't found a nice enough
solution
> that can work for a large number of candidates.
>
> and heuristics he will use with profiling data available.
>>
> As I said above, the use of profile data, currently, is to avoid
> regressing execution time. So the heuristics are basically "Is this
too hot
> to outline from". That outliner also has a completely different
goal/cost
> model, which makes applying it to this problem difficult.
>
> Thanks,
>  River Riddle
>
>
>>
>> I haven't looked at the machine outliner yet, but my understanding
is it
>> also prioritizes code size reduction over performance gains. Is my
>> assumption correct? Do you think there's some tuning we can do in
the
>> machine outliner to tailor for performance improvements?
>>
>> Graham Yiu
>> LLVM Compiler Development
>> IBM Toronto Software Lab
>> Office: (905) 413-4077 C2-407/8200/Markham
>> Email: gyiu at ca.ibm.com
>>
>> [image: Inactive hide details for Jessica Paquette ---08/15/2017
02:45:06
>> PM---Hi Graham, Have you looked at using the existing outline]Jessica
>> Paquette ---08/15/2017 02:45:06 PM---Hi Graham, Have you looked at
using
>> the existing outliner framework to implement partial inlining? I
>>
>> From: Jessica Paquette <jpaquette at apple.com>
>> To: Graham Yiu <gyiu at ca.ibm.com>
>> Cc: llvm-dev at lists.llvm.org
>> Date: 08/15/2017 02:45 PM
>> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
general
>> outlining scheme for cold blocks
>> Sent by: jpaquette at apple.com
>> ------------------------------
>>
>>
>>
>> Hi Graham,
>>
>> Have you looked at using the existing outliner framework to implement
>> partial inlining? In the past, people have combined outlining +
inlining to
>> get partial inlining “for free” (see "Function outlining and
partial
>> inlining” by Zhao and Amaral).
>>
>> There is the MachineOutliner. River Riddle has been working on an
>> IR-level outliner as well. The MachineOutliner doesn’t currently
consider
>> the heat of a block, but might be worth experimenting with. It might be
>> good to experiment with River’s work and the existing outlining pass to
see
>> if this functionality can effectively fall out of what we have already.
>>
>> Here’s the recent IR-level outlining thread for reference wrt River’s
>> work: *http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html*
>> <http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html>
>>
>> - Jessica
>>
>>    On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
>>       *llvm-dev at lists.llvm.org* <llvm-dev at
lists.llvm.org>> wrote:
>>
>>       Hello,
>>
>>       My team and I are looking to do some enhancements in the partial
>>       inliner in opt. Would appreciate any feedback that folks might
have.
>>
>>       # Partial Inlining in LLVM opt
>>
>>       ## Summary
>>
>>       ### Background
>>
>>       Currently, the partial inliner searches the first few blocks of
>>       the callee and looks for a branch to the return block (ie. early
return).
>>       If found, it attempts to outline the rest of the slow (or heavy)
code so
>>       the inliner will be able to inline the fast (or light) code. If
no early
>>       returns are found, the partial inliner will give up. As far as I
can tell,
>>       BlockFrequency and BranchProbability information is only used
when
>>       attempting to inline the early return code, and not used to
determine
>>       whether to outline the slow code.
>>
>>       ### Proposed changes
>>
>>       In addition to looking for early returns, we should utilize
>>       profile information to outline blocks that are considered cold.
If we can
>>       sufficiently reduce the size of the original function via this
type of
>>       outlining, inlining should be able to inline the rest of the hot
code.
>>
>>       ## Details
>>
>>       With the presence of profile information, we have a view of what
>>       code is infrequently executed and make better decisions on what
to outline.
>>       Early return blocks that are infrequently executed should still
be included
>>       as candidates for outlining, but will be treated just like any
other cold
>>       blocks. Without profiling information, however, we should remain
>>       conservative and only partial inline in the presence of an early
return in
>>       the first few blocks of a function (ie. peel the early return out
of the
>>       function).
>>
>>       To find cold regions to outline, we will traverse the CFG to find
>>       edges deemed 'cold' and look at the blocks dominated by
the successor node.
>>       If, for some reason, that block has more than one predecessor,
then we will
>>       skip this candidate as there should be a node that dominates this
successor
>>       that has a single entry point. The last node in the dominance
vector should
>>       also have a single successor. If it does not, then further
investigation of
>>       the CFG is necessary to see when/how this situation occurs.
>>
>>       We will need several heuristics to make sure we only outline in
>>       cases where we are confident it will result in a performance
gain. Things
>>       such as threshold on when a branch is considered cold, the
minimum number
>>       of times the predecessor node has to be executed in order for an
edge to be
>>       considered (confidence factor), and the minimum size of the
region to be
>>       outlined (can use inlining cost analysis like we currently do)
will require
>>       some level of tuning.
>>
>>       Similar to the current implementation, we will attempt to inline
>>       the leftover (hot) parts of the code, and if for some reason we
cannot then
>>       we discard the modified function and its outlined code.
>>
>>       ### Code changes
>>
>>       The current Partial Inlining code first clones the function of
>>       interest and looks for a single set of blocks to outline. It then
creates a
>>       function with the set the blocks, and saves the outlined function
and
>>       outline callsite information as part of the function cloning
container. In
>>       order to outline multiple regions of the function, we will need
to change
>>       these containers to keep track of a list of regions to outline.
We will
>>       also need to update the cost analysis to take into account
multiple
>>       outlined functions.
>>
>>       When a ProfileSummary is available, then we should skip the code
>>       that looks for early returns and go into new code that looks for
cold
>>       regions to outline. When ProfileSummary is not available, then we
can fall
>>       back to the existing code and look for early returns only.
>>
>>       ### Tuning
>>
>>       - The outlining heuristics will need to determine if a set of
cold
>>       blocks is large enough to warrant the overhead of a function
call. We also
>>       don't want the inliner to attempt to inline the outlined code
later.
>>       - The threshold for determining whether a block is cold will also
>>       need to be tuned. In the case that profiling information is not
accurate,
>>       we will pay the price of the additional call overhead for
executing cold
>>       code.
>>       - The confidence factor, which can be viewed as the minimum
number
>>       of times the predecessor has to be executed in order for an edge
to be
>>       considered cold, should also be taken into account to avoid
outlining code
>>       paths we have little information on.
>>
>>       Graham Yiu
>>       LLVM Compiler Development
>>       IBM Toronto Software Lab
>>       Office: (905) 413-4077 C2-407/8200/Markham
>> *Email: gyiu at ca.ibm.com* <gyiu at ca.ibm.com>
>>
>>       _______________________________________________
>>       LLVM Developers mailing list
>> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>
>>       http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/9af62420/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/9af62420/attachment-0001.gif>

River Riddle via llvm-dev

2017-Aug-15 23:26 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

On Tue, Aug 15, 2017 at 4:22 PM Xinliang David Li <xinliangli at
gmail.com>
wrote:
> On Tue, Aug 15, 2017 at 4:14 PM, River Riddle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hey Graham,
>>  I worked on pretty much this exact thing last year.  I did something
>> similar to what you described, I traversed the CFG and built
potentially
>> profitable regions from any given valid start node. At that point there
>> were several road blocks that prevented it from really going anywhere(
like
>> the inliner not preserving pgo data) but those problems should be gone
by
>> now.
>> Some thoughts:
>>
>> : In terms of cost modeling, you will want to consider the interaction
>> with loops:
>>  - Whats the impact of outlining from inside of a loop.
>>  - What if the hot region does not include the entry/exit, i.e hot
loop.
>>
>> : Something that would also benefit, but would require more tuning, is
>> the relationship between size and "hotness". If for example
you have a
>> function that looks like this:
>>
>> foo(bool B) {
>> ... // Code that prevents early return.
>>
>> if(B) // RegionA : Taken 50% of the time.
>>   ... Very large amount of code.
>>
>> else // RegionB : Taken 50% of the time.
>>   ... Very small amount of code.
>> }
>>
>> In the above you have two regions that aren't exactly cold.
Outlining
>> RegionA will make foo profitable to inline, but only if you take into
>> account its large size. Using profile data as the sole cutoff will miss
>> this opportunity.
>>
>>
>
> Modelling the cost/benefit for this case can be very tricky though.
>
Most definitely, but it's something to think about when moving forward.

>
>
>>
>> : You may also want to look into hot switch statement splitting, though
I
>> don't know if it should really go into the partial inliner. This is
>> splitting a switch statement into multiple switch statements if there
are
>> only a few cases that are actually hot.
>>
>> SWITCH1:
>> switch(val) {
>>   HOT: ...
>>   COLD1: ...
>>   COLD2: ...
>>   ....
>>   COLDN: ...
>>   default: ...
>> }
>>
>> would get transformed into:
>>
>
>
>>
>> SWITCH1:
>> switch(val) {
>>  HOT: ...
>>  default:
>>    br SWITCH2:
>> }
>>
>> SWITCH2:
>> switch(val) {
>>   COLD1: ...
>>   COLD2: ...
>>   ....
>>   COLDN: ...
>>   default: ...
>> }
>>
>>
>>
>
> This is profile based switch peeling -- which is a separate optimization
> from partial inlining.
>
> David
>
>Ah thanks, I wasn't sure of the name.
River Riddle

>
>
>
>> On Tue, Aug 15, 2017 at 1:14 PM, Graham Yiu via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hi Jessica,
>>>
>>> Thanks for the feedback.
>>>
>>> I believe the existing partial inliner pass does use some common
>>> utilities like the code extractor to do outlining.
>>>
>> AFAIK, Code extractor is the only IR level utility for outlining, but
>> it's not the best and definitely needs some improvement. It
currently
>> returns outputs via live out pointer parameters, which means that
outlined
>> functions can't be tagged with attributes like readonly.
>>
>> There were also some known problems with extracting in the presence of
>> critical edges, but I don't know if thats still the case.
>>
>>
>>> Is that what you're referring to? I don't recall seeing any
other passes
>>> that has outlining other than the machine outliner, but I may have
missed
>>> something.
>>>
>>> I briefly looked at River's RFC and it seems he's mainly
utilizing
>>> outlining as a tool to reduce code size while maintaining code
performance.
>>>
>>
>> The main point of the outliner I proposed in that RFC is benefiting
code
>> size. The use of PGO data is to simply avoid detrimenting runtime
>> performance.
>>
>>
>>> Our proposal is to improve on the existing outlining scheme of the
>>> partial inliner to give us more opportunities to inline hot code.
We will
>>> only do this type of general outlining with the presence of
profiling or
>>> user data (such as builtin_expect).
>>>
>> With that said, I am interested in how River plans to handle live
ranges
>>>
>>
>>  It's a pretty difficult problem. I have estimation for register
usage
>> within an outlined region, using something similar to the register cost
>> calculation in the loop vectorizer. I don't currently have anything
for the
>> actual call to the function itself. I haven't found a nice enough
solution
>> that can work for a large number of candidates.
>>
>> and heuristics he will use with profiling data available.
>>>
>> As I said above, the use of profile data, currently, is to avoid
>> regressing execution time. So the heuristics are basically "Is
this too hot
>> to outline from". That outliner also has a completely different
goal/cost
>> model, which makes applying it to this problem difficult.
>>
>> Thanks,
>>  River Riddle
>>
>>
>>>
>>> I haven't looked at the machine outliner yet, but my
understanding is it
>>> also prioritizes code size reduction over performance gains. Is my
>>> assumption correct? Do you think there's some tuning we can do
in the
>>> machine outliner to tailor for performance improvements?
>>>
>>> Graham Yiu
>>> LLVM Compiler Development
>>> IBM Toronto Software Lab
>>> Office: (905) 413-4077 C2-407/8200/Markham
>>> Email: gyiu at ca.ibm.com
>>>
>>> [image: Inactive hide details for Jessica Paquette ---08/15/2017
>>> 02:45:06 PM---Hi Graham, Have you looked at using the existing
outline]Jessica
>>> Paquette ---08/15/2017 02:45:06 PM---Hi Graham, Have you looked at
using
>>> the existing outliner framework to implement partial inlining? I
>>>
>>> From: Jessica Paquette <jpaquette at apple.com>
>>> To: Graham Yiu <gyiu at ca.ibm.com>
>>> Cc: llvm-dev at lists.llvm.org
>>> Date: 08/15/2017 02:45 PM
>>> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
>>> general outlining scheme for cold blocks
>>> Sent by: jpaquette at apple.com
>>> ------------------------------
>>>
>>>
>>>
>>> Hi Graham,
>>>
>>> Have you looked at using the existing outliner framework to
implement
>>> partial inlining? In the past, people have combined outlining +
inlining to
>>> get partial inlining “for free” (see "Function outlining and
partial
>>> inlining” by Zhao and Amaral).
>>>
>>> There is the MachineOutliner. River Riddle has been working on an
>>> IR-level outliner as well. The MachineOutliner doesn’t currently
consider
>>> the heat of a block, but might be worth experimenting with. It
might be
>>> good to experiment with River’s work and the existing outlining
pass to see
>>> if this functionality can effectively fall out of what we have
already.
>>>
>>> Here’s the recent IR-level outlining thread for reference wrt
River’s
>>> work:
*http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html*
>>>
<http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html>
>>>
>>> - Jessica
>>>
>>>    On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
>>>       *llvm-dev at lists.llvm.org* <llvm-dev at
lists.llvm.org>> wrote:
>>>
>>>       Hello,
>>>
>>>       My team and I are looking to do some enhancements in the
partial
>>>       inliner in opt. Would appreciate any feedback that folks
might have.
>>>
>>>       # Partial Inlining in LLVM opt
>>>
>>>       ## Summary
>>>
>>>       ### Background
>>>
>>>       Currently, the partial inliner searches the first few blocks
of
>>>       the callee and looks for a branch to the return block (ie.
early return).
>>>       If found, it attempts to outline the rest of the slow (or
heavy) code so
>>>       the inliner will be able to inline the fast (or light) code.
If no early
>>>       returns are found, the partial inliner will give up. As far
as I can tell,
>>>       BlockFrequency and BranchProbability information is only used
when
>>>       attempting to inline the early return code, and not used to
determine
>>>       whether to outline the slow code.
>>>
>>>       ### Proposed changes
>>>
>>>       In addition to looking for early returns, we should utilize
>>>       profile information to outline blocks that are considered
cold. If we can
>>>       sufficiently reduce the size of the original function via
this type of
>>>       outlining, inlining should be able to inline the rest of the
hot code.
>>>
>>>       ## Details
>>>
>>>       With the presence of profile information, we have a view of
what
>>>       code is infrequently executed and make better decisions on
what to outline.
>>>       Early return blocks that are infrequently executed should
still be included
>>>       as candidates for outlining, but will be treated just like
any other cold
>>>       blocks. Without profiling information, however, we should
remain
>>>       conservative and only partial inline in the presence of an
early return in
>>>       the first few blocks of a function (ie. peel the early return
out of the
>>>       function).
>>>
>>>       To find cold regions to outline, we will traverse the CFG to
find
>>>       edges deemed 'cold' and look at the blocks dominated
by the successor node.
>>>       If, for some reason, that block has more than one
predecessor, then we will
>>>       skip this candidate as there should be a node that dominates
this successor
>>>       that has a single entry point. The last node in the dominance
vector should
>>>       also have a single successor. If it does not, then further
investigation of
>>>       the CFG is necessary to see when/how this situation occurs.
>>>
>>>       We will need several heuristics to make sure we only outline
in
>>>       cases where we are confident it will result in a performance
gain. Things
>>>       such as threshold on when a branch is considered cold, the
minimum number
>>>       of times the predecessor node has to be executed in order for
an edge to be
>>>       considered (confidence factor), and the minimum size of the
region to be
>>>       outlined (can use inlining cost analysis like we currently
do) will require
>>>       some level of tuning.
>>>
>>>       Similar to the current implementation, we will attempt to
inline
>>>       the leftover (hot) parts of the code, and if for some reason
we cannot then
>>>       we discard the modified function and its outlined code.
>>>
>>>       ### Code changes
>>>
>>>       The current Partial Inlining code first clones the function
of
>>>       interest and looks for a single set of blocks to outline. It
then creates a
>>>       function with the set the blocks, and saves the outlined
function and
>>>       outline callsite information as part of the function cloning
container. In
>>>       order to outline multiple regions of the function, we will
need to change
>>>       these containers to keep track of a list of regions to
outline. We will
>>>       also need to update the cost analysis to take into account
multiple
>>>       outlined functions.
>>>
>>>       When a ProfileSummary is available, then we should skip the
code
>>>       that looks for early returns and go into new code that looks
for cold
>>>       regions to outline. When ProfileSummary is not available,
then we can fall
>>>       back to the existing code and look for early returns only.
>>>
>>>       ### Tuning
>>>
>>>       - The outlining heuristics will need to determine if a set of
>>>       cold blocks is large enough to warrant the overhead of a
function call. We
>>>       also don't want the inliner to attempt to inline the
outlined code later.
>>>       - The threshold for determining whether a block is cold will
also
>>>       need to be tuned. In the case that profiling information is
not accurate,
>>>       we will pay the price of the additional call overhead for
executing cold
>>>       code.
>>>       - The confidence factor, which can be viewed as the minimum
>>>       number of times the predecessor has to be executed in order
for an edge to
>>>       be considered cold, should also be taken into account to
avoid outlining
>>>       code paths we have little information on.
>>>
>>>       Graham Yiu
>>>       LLVM Compiler Development
>>>       IBM Toronto Software Lab
>>>       Office: (905) 413-4077 C2-407/8200/Markham
>>> *Email: gyiu at ca.ibm.com* <gyiu at ca.ibm.com>
>>>
>>>       _______________________________________________
>>>       LLVM Developers mailing list
>>> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>
>>>       http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>> --Thank you,
River Riddle
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/3a83b212/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/3a83b212/attachment.gif>

Graham Yiu via llvm-dev

2017-Aug-16 00:21 UTC

head link

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Hey River,

It's good to know that you have been down this path before, hopefully I can
share with you any roadblocks I'll encounter along the way.

Your first example illustrates a very good point, where ideally we should
take into account a coldness-to-size ratio.  However, this type of case
requires a lot of additional analysis to get right, and may not give you
the most 'bang-for-your-buck'.  With that said, I think it's
worthwhile to
build in some flexibility as I begin implementation in case we want to
tackle something like this in the future.

Second example is probably outside of the scope of what I'm trying to do,
at least in the immediate term, but interesting nonetheless.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-407/8200/Markham
Email: gyiu at ca.ibm.com



From:	River Riddle <riddleriver at gmail.com>
To:	Xinliang David Li <xinliangli at gmail.com>
Cc:	Graham Yiu <gyiu at ca.ibm.com>, llvm-dev
            <llvm-dev at lists.llvm.org>
Date:	08/15/2017 07:27 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks




On Tue, Aug 15, 2017 at 4:22 PM Xinliang David Li <xinliangli at
gmail.com>
wrote:
  On Tue, Aug 15, 2017 at 4:14 PM, River Riddle via llvm-dev <
  llvm-dev at lists.llvm.org> wrote:
   Hey Graham,
    I worked on pretty much this exact thing last year.  I did something
   similar to what you described, I traversed the CFG and built potentially
   profitable regions from any given valid start node. At that point there
   were several road blocks that prevented it from really going anywhere
   ( like the inliner not preserving pgo data) but those problems should be
   gone by now.
   Some thoughts:

   : In terms of cost modeling, you will want to consider the interaction
   with loops:
    - Whats the impact of outlining from inside of a loop.
    - What if the hot region does not include the entry/exit, i.e hot loop.

   : Something that would also benefit, but would require more tuning, is
   the relationship between size and "hotness". If for example you
have a
   function that looks like this:

   foo(bool B) {
   ... // Code that prevents early return.

   if(B) // RegionA : Taken 50% of the time.
     ... Very large amount of code.

   else // RegionB : Taken 50% of the time.
     ... Very small amount of code.
   }

   In the above you have two regions that aren't exactly cold. Outlining
   RegionA will make foo profitable to inline, but only if you take into
   account its large size. Using profile data as the sole cutoff will miss
   this opportunity.



  Modelling the cost/benefit for this case can be very tricky though.

Most definitely, but it's something to think about when moving forward.




   : You may also want to look into hot switch statement splitting, though
   I don't know if it should really go into the partial inliner. This is
   splitting a switch statement into multiple switch statements if there
   are only a few cases that are actually hot.

   SWITCH1:
   switch(val) {
     HOT: ...
     COLD1: ...
     COLD2: ...
     ....
     COLDN: ...
     default: ...
   }

   would get transformed into:


   SWITCH1:
   switch(val) {
    HOT: ...
    default:
      br SWITCH2:
   }

   SWITCH2:
   switch(val) {
     COLD1: ...
     COLD2: ...
     ....
     COLDN: ...
     default: ...
   }




  This is profile based switch peeling -- which is a separate optimization
  from partial inlining.

  David


Ah thanks, I wasn't sure of the name.
River Riddle




   On Tue, Aug 15, 2017 at 1:14 PM, Graham Yiu via llvm-dev <
   llvm-dev at lists.llvm.org> wrote:
     Hi Jessica,

     Thanks for the feedback.

     I believe the existing partial inliner pass does use some common
     utilities like the code extractor to do outlining.


   AFAIK, Code extractor is the only IR level utility for outlining, but
   it's not the best and definitely needs some improvement. It currently
   returns outputs via live out pointer parameters, which means that
   outlined functions can't be tagged with attributes like readonly.

   There were also some known problems with extracting in the presence of
   critical edges, but I don't know if thats still the case.

     Is that what you're referring to? I don't recall seeing any other
     passes that has outlining other than the machine outliner, but I may
     have missed something.

     I briefly looked at River's RFC and it seems he's mainly utilizing
     outlining as a tool to reduce code size while maintaining code
     performance.



   The main point of the outliner I proposed in that RFC is benefiting code
   size. The use of PGO data is to simply avoid detrimenting runtime
   performance.

     Our proposal is to improve on the existing outlining scheme of the
     partial inliner to give us more opportunities to inline hot code. We
     will only do this type of general outlining with the presence of
     profiling or user data (such as builtin_expect).
     With that said, I am interested in how River plans to handle live
     ranges



    It's a pretty difficult problem. I have estimation for register usage
   within an outlined region, using something similar to the register cost
   calculation in the loop vectorizer. I don't currently have anything for
   the actual call to the function itself. I haven't found a nice enough
   solution that can work for a large number of candidates.

     and heuristics he will use with profiling data available.


   As I said above, the use of profile data, currently, is to avoid
   regressing execution time. So the heuristics are basically "Is this too
   hot to outline from". That outliner also has a completely different
   goal/cost model, which makes applying it to this problem difficult.

   Thanks,
    River Riddle



     I haven't looked at the machine outliner yet, but my understanding is
     it also prioritizes code size reduction over performance gains. Is my
     assumption correct? Do you think there's some tuning we can do in the
     machine outliner to tailor for performance improvements?

     Graham Yiu
     LLVM Compiler Development
     IBM Toronto Software Lab
     Office: (905) 413-4077 C2-407/8200/Markham
     Email: gyiu at ca.ibm.com

     Inactive hide details for Jessica Paquette ---08/15/2017 02:45:06
     PM---Hi Graham, Have you looked at using the existing outlineJessica
     Paquette ---08/15/2017 02:45:06 PM---Hi Graham, Have you looked at
     using the existing outliner framework to implement partial inlining? I

     From: Jessica Paquette <jpaquette at apple.com>
     To: Graham Yiu <gyiu at ca.ibm.com>
     Cc: llvm-dev at lists.llvm.org
     Date: 08/15/2017 02:45 PM
     Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
     general outlining scheme for cold blocks
     Sent by: jpaquette at apple.com




     Hi Graham,

     Have you looked at using the existing outliner framework to implement
     partial inlining? In the past, people have combined outlining +
     inlining to get partial inlining “for free” (see "Function outlining
     and partial inlining” by Zhao and Amaral).

     There is the MachineOutliner. River Riddle has been working on an
     IR-level outliner as well. The MachineOutliner doesn’t currently
     consider the heat of a block, but might be worth experimenting with.
     It might be good to experiment with River’s work and the existing
     outlining pass to see if this functionality can effectively fall out
     of what we have already.

     Here’s the recent IR-level outlining thread for reference wrt River’s
     work: http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html

     - Jessica
                 On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
                 llvm-dev at lists.llvm.org> wrote:


                 Hello,

                 My team and I are looking to do some enhancements in the
                 partial inliner in opt. Would appreciate any feedback that
                 folks might have.

                 # Partial Inlining in LLVM opt

                 ## Summary

                 ### Background

                 Currently, the partial inliner searches the first few
                 blocks of the callee and looks for a branch to the return
                 block (ie. early return). If found, it attempts to outline
                 the rest of the slow (or heavy) code so the inliner will
                 be able to inline the fast (or light) code. If no early
                 returns are found, the partial inliner will give up. As
                 far as I can tell, BlockFrequency and BranchProbability
                 information is only used when attempting to inline the
                 early return code, and not used to determine whether to
                 outline the slow code.

                 ### Proposed changes

                 In addition to looking for early returns, we should
                 utilize profile information to outline blocks that are
                 considered cold. If we can sufficiently reduce the size of
                 the original function via this type of outlining, inlining
                 should be able to inline the rest of the hot code.

                 ## Details

                 With the presence of profile information, we have a view
                 of what code is infrequently executed and make better
                 decisions on what to outline. Early return blocks that are
                 infrequently executed should still be included as
                 candidates for outlining, but will be treated just like
                 any other cold blocks. Without profiling information,
                 however, we should remain conservative and only partial
                 inline in the presence of an early return in the first few
                 blocks of a function (ie. peel the early return out of the
                 function).

                 To find cold regions to outline, we will traverse the CFG
                 to find edges deemed 'cold' and look at the blocks
                 dominated by the successor node. If, for some reason, that
                 block has more than one predecessor, then we will skip
                 this candidate as there should be a node that dominates
                 this successor that has a single entry point. The last
                 node in the dominance vector should also have a single
                 successor. If it does not, then further investigation of
                 the CFG is necessary to see when/how this situation
                 occurs.

                 We will need several heuristics to make sure we only
                 outline in cases where we are confident it will result in
                 a performance gain. Things such as threshold on when a
                 branch is considered cold, the minimum number of times the
                 predecessor node has to be executed in order for an edge
                 to be considered (confidence factor), and the minimum size
                 of the region to be outlined (can use inlining cost
                 analysis like we currently do) will require some level of
                 tuning.

                 Similar to the current implementation, we will attempt to
                 inline the leftover (hot) parts of the code, and if for
                 some reason we cannot then we discard the modified
                 function and its outlined code.

                 ### Code changes

                 The current Partial Inlining code first clones the
                 function of interest and looks for a single set of blocks
                 to outline. It then creates a function with the set the
                 blocks, and saves the outlined function and outline
                 callsite information as part of the function cloning
                 container. In order to outline multiple regions of the
                 function, we will need to change these containers to keep
                 track of a list of regions to outline. We will also need
                 to update the cost analysis to take into account multiple
                 outlined functions.

                 When a ProfileSummary is available, then we should skip
                 the code that looks for early returns and go into new code
                 that looks for cold regions to outline. When
                 ProfileSummary is not available, then we can fall back to
                 the existing code and look for early returns only.

                 ### Tuning

                 - The outlining heuristics will need to determine if a set
                 of cold blocks is large enough to warrant the overhead of
                 a function call. We also don't want the inliner to attempt
                 to inline the outlined code later.
                 - The threshold for determining whether a block is cold
                 will also need to be tuned. In the case that profiling
                 information is not accurate, we will pay the price of the
                 additional call overhead for executing cold code.
                 - The confidence factor, which can be viewed as the
                 minimum number of times the predecessor has to be executed
                 in order for an edge to be considered cold, should also be
                 taken into account to avoid outlining code paths we have
                 little information on.

                 Graham Yiu
                 LLVM Compiler Development
                 IBM Toronto Software Lab
                 Office: (905) 413-4077 C2-407/8200/Markham
                 Email: gyiu at ca.ibm.com


                 _______________________________________________
                 LLVM Developers mailing list
                 llvm-dev at lists.llvm.org
                 http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev






     _______________________________________________
     LLVM Developers mailing list
     llvm-dev at lists.llvm.org
     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev



   _______________________________________________
   LLVM Developers mailing list
   llvm-dev at lists.llvm.org
   http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Thank you,
River Riddle

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/41523463/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/41523463/attachment-0001.gif>

llvm dev - Aug 2017 - [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks