Graham Yiu via llvm-dev
2017-Aug-15 20:14 UTC
[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
Hi Jessica,
Thanks for the feedback.
I believe the existing partial inliner pass does use some common utilities
like the code extractor to do outlining. Is that what you're referring to?
I don't recall seeing any other passes that has outlining other than the
machine outliner, but I may have missed something.
I briefly looked at River's RFC and it seems he's mainly utilizing
outlining as a tool to reduce code size while maintaining code performance.
Our proposal is to improve on the existing outlining scheme of the partial
inliner to give us more opportunities to inline hot code. We will only do
this type of general outlining with the presence of profiling or user data
(such as builtin_expect). With that said, I am interested in how River
plans to handle live ranges and heuristics he will use with profiling data
available.
I haven't looked at the machine outliner yet, but my understanding is it
also prioritizes code size reduction over performance gains. Is my
assumption correct? Do you think there's some tuning we can do in the
machine outliner to tailor for performance improvements?
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
From: Jessica Paquette <jpaquette at apple.com>
To: Graham Yiu <gyiu at ca.ibm.com>
Cc: llvm-dev at lists.llvm.org
Date: 08/15/2017 02:45 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
outlining scheme for cold blocks
Sent by: jpaquette at apple.com
Hi Graham,
Have you looked at using the existing outliner framework to implement
partial inlining? In the past, people have combined outlining + inlining to
get partial inlining “for free” (see "Function outlining and partial
inlining” by Zhao and Amaral).
There is the MachineOutliner. River Riddle has been working on an IR-level
outliner as well. The MachineOutliner doesn’t currently consider the heat
of a block, but might be worth experimenting with. It might be good to
experiment with River’s work and the existing outlining pass to see if this
functionality can effectively fall out of what we have already.
Here’s the recent IR-level outlining thread for reference wrt River’s work:
http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html
- Jessica
On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
Hello,
My team and I are looking to do some enhancements in the partial
inliner in opt. Would appreciate any feedback that folks might have.
# Partial Inlining in LLVM opt
## Summary
### Background
Currently, the partial inliner searches the first few blocks of the
callee and looks for a branch to the return block (ie. early return).
If found, it attempts to outline the rest of the slow (or heavy) code
so the inliner will be able to inline the fast (or light) code. If no
early returns are found, the partial inliner will give up. As far as
I can tell, BlockFrequency and BranchProbability information is only
used when attempting to inline the early return code, and not used to
determine whether to outline the slow code.
### Proposed changes
In addition to looking for early returns, we should utilize profile
information to outline blocks that are considered cold. If we can
sufficiently reduce the size of the original function via this type
of outlining, inlining should be able to inline the rest of the hot
code.
## Details
With the presence of profile information, we have a view of what code
is infrequently executed and make better decisions on what to
outline. Early return blocks that are infrequently executed should
still be included as candidates for outlining, but will be treated
just like any other cold blocks. Without profiling information,
however, we should remain conservative and only partial inline in the
presence of an early return in the first few blocks of a function
(ie. peel the early return out of the function).
To find cold regions to outline, we will traverse the CFG to find
edges deemed 'cold' and look at the blocks dominated by the
successor
node. If, for some reason, that block has more than one predecessor,
then we will skip this candidate as there should be a node that
dominates this successor that has a single entry point. The last node
in the dominance vector should also have a single successor. If it
does not, then further investigation of the CFG is necessary to see
when/how this situation occurs.
We will need several heuristics to make sure we only outline in cases
where we are confident it will result in a performance gain. Things
such as threshold on when a branch is considered cold, the minimum
number of times the predecessor node has to be executed in order for
an edge to be considered (confidence factor), and the minimum size of
the region to be outlined (can use inlining cost analysis like we
currently do) will require some level of tuning.
Similar to the current implementation, we will attempt to inline the
leftover (hot) parts of the code, and if for some reason we cannot
then we discard the modified function and its outlined code.
### Code changes
The current Partial Inlining code first clones the function of
interest and looks for a single set of blocks to outline. It then
creates a function with the set the blocks, and saves the outlined
function and outline callsite information as part of the function
cloning container. In order to outline multiple regions of the
function, we will need to change these containers to keep track of a
list of regions to outline. We will also need to update the cost
analysis to take into account multiple outlined functions.
When a ProfileSummary is available, then we should skip the code that
looks for early returns and go into new code that looks for cold
regions to outline. When ProfileSummary is not available, then we can
fall back to the existing code and look for early returns only.
### Tuning
- The outlining heuristics will need to determine if a set of cold
blocks is large enough to warrant the overhead of a function call. We
also don't want the inliner to attempt to inline the outlined code
later.
- The threshold for determining whether a block is cold will also
need to be tuned. In the case that profiling information is not
accurate, we will pay the price of the additional call overhead for
executing cold code.
- The confidence factor, which can be viewed as the minimum number of
times the predecessor has to be executed in order for an edge to be
considered cold, should also be taken into account to avoid outlining
code paths we have little information on.
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/93e2a93b/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/93e2a93b/attachment.gif>
River Riddle via llvm-dev
2017-Aug-15 23:14 UTC
[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
Hey Graham,
I worked on pretty much this exact thing last year. I did something
similar to what you described, I traversed the CFG and built potentially
profitable regions from any given valid start node. At that point there
were several road blocks that prevented it from really going anywhere( like
the inliner not preserving pgo data) but those problems should be gone by
now.
Some thoughts:
: In terms of cost modeling, you will want to consider the interaction with
loops:
- Whats the impact of outlining from inside of a loop.
- What if the hot region does not include the entry/exit, i.e hot loop.
: Something that would also benefit, but would require more tuning, is the
relationship between size and "hotness". If for example you have a
function
that looks like this:
foo(bool B) {
... // Code that prevents early return.
if(B) // RegionA : Taken 50% of the time.
... Very large amount of code.
else // RegionB : Taken 50% of the time.
... Very small amount of code.
}
In the above you have two regions that aren't exactly cold. Outlining
RegionA will make foo profitable to inline, but only if you take into
account its large size. Using profile data as the sole cutoff will miss
this opportunity.
: You may also want to look into hot switch statement splitting, though I
don't know if it should really go into the partial inliner. This is
splitting a switch statement into multiple switch statements if there are
only a few cases that are actually hot.
SWITCH1:
switch(val) {
HOT: ...
COLD1: ...
COLD2: ...
....
COLDN: ...
default: ...
}
would get transformed into:
SWITCH1:
switch(val) {
HOT: ...
default:
br SWITCH2:
}
SWITCH2:
switch(val) {
COLD1: ...
COLD2: ...
....
COLDN: ...
default: ...
}
On Tue, Aug 15, 2017 at 1:14 PM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi Jessica,
>
> Thanks for the feedback.
>
> I believe the existing partial inliner pass does use some common utilities
> like the code extractor to do outlining.
>
AFAIK, Code extractor is the only IR level utility for outlining, but it's
not the best and definitely needs some improvement. It currently returns
outputs via live out pointer parameters, which means that outlined
functions can't be tagged with attributes like readonly.
There were also some known problems with extracting in the presence of
critical edges, but I don't know if thats still the case.
> Is that what you're referring to? I don't recall seeing any other
passes
> that has outlining other than the machine outliner, but I may have missed
> something.
>
> I briefly looked at River's RFC and it seems he's mainly utilizing
> outlining as a tool to reduce code size while maintaining code performance.
>
The main point of the outliner I proposed in that RFC is benefiting code
size. The use of PGO data is to simply avoid detrimenting runtime
performance.
> Our proposal is to improve on the existing outlining scheme of the partial
> inliner to give us more opportunities to inline hot code. We will only do
> this type of general outlining with the presence of profiling or user data
> (such as builtin_expect).
>
With that said, I am interested in how River plans to handle live
ranges>
It's a pretty difficult problem. I have estimation for register usage
within an outlined region, using something similar to the register cost
calculation in the loop vectorizer. I don't currently have anything for the
actual call to the function itself. I haven't found a nice enough solution
that can work for a large number of candidates.
and heuristics he will use with profiling data
available.>
As I said above, the use of profile data, currently, is to avoid regressing
execution time. So the heuristics are basically "Is this too hot to outline
from". That outliner also has a completely different goal/cost model, which
makes applying it to this problem difficult.
Thanks,
River Riddle
>
> I haven't looked at the machine outliner yet, but my understanding is
it
> also prioritizes code size reduction over performance gains. Is my
> assumption correct? Do you think there's some tuning we can do in the
> machine outliner to tailor for performance improvements?
>
> Graham Yiu
> LLVM Compiler Development
> IBM Toronto Software Lab
> Office: (905) 413-4077 C2-407/8200/Markham
> Email: gyiu at ca.ibm.com
>
> [image: Inactive hide details for Jessica Paquette ---08/15/2017 02:45:06
> PM---Hi Graham, Have you looked at using the existing outline]Jessica
> Paquette ---08/15/2017 02:45:06 PM---Hi Graham, Have you looked at using
> the existing outliner framework to implement partial inlining? I
>
> From: Jessica Paquette <jpaquette at apple.com>
> To: Graham Yiu <gyiu at ca.ibm.com>
> Cc: llvm-dev at lists.llvm.org
> Date: 08/15/2017 02:45 PM
> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
> outlining scheme for cold blocks
> Sent by: jpaquette at apple.com
> ------------------------------
>
>
>
> Hi Graham,
>
> Have you looked at using the existing outliner framework to implement
> partial inlining? In the past, people have combined outlining + inlining to
> get partial inlining “for free” (see "Function outlining and partial
> inlining” by Zhao and Amaral).
>
> There is the MachineOutliner. River Riddle has been working on an IR-level
> outliner as well. The MachineOutliner doesn’t currently consider the heat
> of a block, but might be worth experimenting with. It might be good to
> experiment with River’s work and the existing outlining pass to see if this
> functionality can effectively fall out of what we have already.
>
> Here’s the recent IR-level outlining thread for reference wrt River’s
> work: *http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html*
> <http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html>
>
> - Jessica
>
> On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>>
wrote:
>
> Hello,
>
> My team and I are looking to do some enhancements in the partial
> inliner in opt. Would appreciate any feedback that folks might have.
>
> # Partial Inlining in LLVM opt
>
> ## Summary
>
> ### Background
>
> Currently, the partial inliner searches the first few blocks of the
> callee and looks for a branch to the return block (ie. early return).
If
> found, it attempts to outline the rest of the slow (or heavy) code so
the
> inliner will be able to inline the fast (or light) code. If no early
> returns are found, the partial inliner will give up. As far as I can
tell,
> BlockFrequency and BranchProbability information is only used when
> attempting to inline the early return code, and not used to determine
> whether to outline the slow code.
>
> ### Proposed changes
>
> In addition to looking for early returns, we should utilize profile
> information to outline blocks that are considered cold. If we can
> sufficiently reduce the size of the original function via this type
of
> outlining, inlining should be able to inline the rest of the hot
code.
>
> ## Details
>
> With the presence of profile information, we have a view of what
> code is infrequently executed and make better decisions on what to
outline.
> Early return blocks that are infrequently executed should still be
included
> as candidates for outlining, but will be treated just like any other
cold
> blocks. Without profiling information, however, we should remain
> conservative and only partial inline in the presence of an early
return in
> the first few blocks of a function (ie. peel the early return out of
the
> function).
>
> To find cold regions to outline, we will traverse the CFG to find
> edges deemed 'cold' and look at the blocks dominated by the
successor node.
> If, for some reason, that block has more than one predecessor, then
we will
> skip this candidate as there should be a node that dominates this
successor
> that has a single entry point. The last node in the dominance vector
should
> also have a single successor. If it does not, then further
investigation of
> the CFG is necessary to see when/how this situation occurs.
>
> We will need several heuristics to make sure we only outline in
> cases where we are confident it will result in a performance gain.
Things
> such as threshold on when a branch is considered cold, the minimum
number
> of times the predecessor node has to be executed in order for an edge
to be
> considered (confidence factor), and the minimum size of the region to
be
> outlined (can use inlining cost analysis like we currently do) will
require
> some level of tuning.
>
> Similar to the current implementation, we will attempt to inline
> the leftover (hot) parts of the code, and if for some reason we
cannot then
> we discard the modified function and its outlined code.
>
> ### Code changes
>
> The current Partial Inlining code first clones the function of
> interest and looks for a single set of blocks to outline. It then
creates a
> function with the set the blocks, and saves the outlined function and
> outline callsite information as part of the function cloning
container. In
> order to outline multiple regions of the function, we will need to
change
> these containers to keep track of a list of regions to outline. We
will
> also need to update the cost analysis to take into account multiple
> outlined functions.
>
> When a ProfileSummary is available, then we should skip the code
> that looks for early returns and go into new code that looks for cold
> regions to outline. When ProfileSummary is not available, then we can
fall
> back to the existing code and look for early returns only.
>
> ### Tuning
>
> - The outlining heuristics will need to determine if a set of cold
> blocks is large enough to warrant the overhead of a function call. We
also
> don't want the inliner to attempt to inline the outlined code
later.
> - The threshold for determining whether a block is cold will also
> need to be tuned. In the case that profiling information is not
accurate,
> we will pay the price of the additional call overhead for executing
cold
> code.
> - The confidence factor, which can be viewed as the minimum number
> of times the predecessor has to be executed in order for an edge to
be
> considered cold, should also be taken into account to avoid outlining
code
> paths we have little information on.
>
> Graham Yiu
> LLVM Compiler Development
> IBM Toronto Software Lab
> Office: (905) 413-4077 C2-407/8200/Markham
> *Email: gyiu at ca.ibm.com* <gyiu at ca.ibm.com>
>
> _______________________________________________
> LLVM Developers mailing list
> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/30ee7abe/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/30ee7abe/attachment.gif>
Xinliang David Li via llvm-dev
2017-Aug-15 23:22 UTC
[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
On Tue, Aug 15, 2017 at 4:14 PM, River Riddle via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hey Graham, > I worked on pretty much this exact thing last year. I did something > similar to what you described, I traversed the CFG and built potentially > profitable regions from any given valid start node. At that point there > were several road blocks that prevented it from really going anywhere( like > the inliner not preserving pgo data) but those problems should be gone by > now. > Some thoughts: > > : In terms of cost modeling, you will want to consider the interaction > with loops: > - Whats the impact of outlining from inside of a loop. > - What if the hot region does not include the entry/exit, i.e hot loop. > > : Something that would also benefit, but would require more tuning, is the > relationship between size and "hotness". If for example you have a function > that looks like this: > > foo(bool B) { > ... // Code that prevents early return. > > if(B) // RegionA : Taken 50% of the time. > ... Very large amount of code. > > else // RegionB : Taken 50% of the time. > ... Very small amount of code. > } > > In the above you have two regions that aren't exactly cold. Outlining > RegionA will make foo profitable to inline, but only if you take into > account its large size. Using profile data as the sole cutoff will miss > this opportunity. > >Modelling the cost/benefit for this case can be very tricky though.> > : You may also want to look into hot switch statement splitting, though I > don't know if it should really go into the partial inliner. This is > splitting a switch statement into multiple switch statements if there are > only a few cases that are actually hot. > > SWITCH1: > switch(val) { > HOT: ... > COLD1: ... > COLD2: ... > .... > COLDN: ... > default: ... > } > > would get transformed into: >> > SWITCH1: > switch(val) { > HOT: ... > default: > br SWITCH2: > } > > SWITCH2: > switch(val) { > COLD1: ... > COLD2: ... > .... > COLDN: ... > default: ... > } > > >This is profile based switch peeling -- which is a separate optimization from partial inlining. David> On Tue, Aug 15, 2017 at 1:14 PM, Graham Yiu via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> Hi Jessica, >> >> Thanks for the feedback. >> >> I believe the existing partial inliner pass does use some common >> utilities like the code extractor to do outlining. >> > AFAIK, Code extractor is the only IR level utility for outlining, but it's > not the best and definitely needs some improvement. It currently returns > outputs via live out pointer parameters, which means that outlined > functions can't be tagged with attributes like readonly. > > There were also some known problems with extracting in the presence of > critical edges, but I don't know if thats still the case. > > >> Is that what you're referring to? I don't recall seeing any other passes >> that has outlining other than the machine outliner, but I may have missed >> something. >> >> I briefly looked at River's RFC and it seems he's mainly utilizing >> outlining as a tool to reduce code size while maintaining code performance. >> > > The main point of the outliner I proposed in that RFC is benefiting code > size. The use of PGO data is to simply avoid detrimenting runtime > performance. > > >> Our proposal is to improve on the existing outlining scheme of the >> partial inliner to give us more opportunities to inline hot code. We will >> only do this type of general outlining with the presence of profiling or >> user data (such as builtin_expect). >> > With that said, I am interested in how River plans to handle live ranges >> > > It's a pretty difficult problem. I have estimation for register usage > within an outlined region, using something similar to the register cost > calculation in the loop vectorizer. I don't currently have anything for the > actual call to the function itself. I haven't found a nice enough solution > that can work for a large number of candidates. > > and heuristics he will use with profiling data available. >> > As I said above, the use of profile data, currently, is to avoid > regressing execution time. So the heuristics are basically "Is this too hot > to outline from". That outliner also has a completely different goal/cost > model, which makes applying it to this problem difficult. > > Thanks, > River Riddle > > >> >> I haven't looked at the machine outliner yet, but my understanding is it >> also prioritizes code size reduction over performance gains. Is my >> assumption correct? Do you think there's some tuning we can do in the >> machine outliner to tailor for performance improvements? >> >> Graham Yiu >> LLVM Compiler Development >> IBM Toronto Software Lab >> Office: (905) 413-4077 C2-407/8200/Markham >> Email: gyiu at ca.ibm.com >> >> [image: Inactive hide details for Jessica Paquette ---08/15/2017 02:45:06 >> PM---Hi Graham, Have you looked at using the existing outline]Jessica >> Paquette ---08/15/2017 02:45:06 PM---Hi Graham, Have you looked at using >> the existing outliner framework to implement partial inlining? I >> >> From: Jessica Paquette <jpaquette at apple.com> >> To: Graham Yiu <gyiu at ca.ibm.com> >> Cc: llvm-dev at lists.llvm.org >> Date: 08/15/2017 02:45 PM >> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general >> outlining scheme for cold blocks >> Sent by: jpaquette at apple.com >> ------------------------------ >> >> >> >> Hi Graham, >> >> Have you looked at using the existing outliner framework to implement >> partial inlining? In the past, people have combined outlining + inlining to >> get partial inlining “for free” (see "Function outlining and partial >> inlining” by Zhao and Amaral). >> >> There is the MachineOutliner. River Riddle has been working on an >> IR-level outliner as well. The MachineOutliner doesn’t currently consider >> the heat of a block, but might be worth experimenting with. It might be >> good to experiment with River’s work and the existing outlining pass to see >> if this functionality can effectively fall out of what we have already. >> >> Here’s the recent IR-level outlining thread for reference wrt River’s >> work: *http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html* >> <http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html> >> >> - Jessica >> >> On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev < >> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>> wrote: >> >> Hello, >> >> My team and I are looking to do some enhancements in the partial >> inliner in opt. Would appreciate any feedback that folks might have. >> >> # Partial Inlining in LLVM opt >> >> ## Summary >> >> ### Background >> >> Currently, the partial inliner searches the first few blocks of >> the callee and looks for a branch to the return block (ie. early return). >> If found, it attempts to outline the rest of the slow (or heavy) code so >> the inliner will be able to inline the fast (or light) code. If no early >> returns are found, the partial inliner will give up. As far as I can tell, >> BlockFrequency and BranchProbability information is only used when >> attempting to inline the early return code, and not used to determine >> whether to outline the slow code. >> >> ### Proposed changes >> >> In addition to looking for early returns, we should utilize >> profile information to outline blocks that are considered cold. If we can >> sufficiently reduce the size of the original function via this type of >> outlining, inlining should be able to inline the rest of the hot code. >> >> ## Details >> >> With the presence of profile information, we have a view of what >> code is infrequently executed and make better decisions on what to outline. >> Early return blocks that are infrequently executed should still be included >> as candidates for outlining, but will be treated just like any other cold >> blocks. Without profiling information, however, we should remain >> conservative and only partial inline in the presence of an early return in >> the first few blocks of a function (ie. peel the early return out of the >> function). >> >> To find cold regions to outline, we will traverse the CFG to find >> edges deemed 'cold' and look at the blocks dominated by the successor node. >> If, for some reason, that block has more than one predecessor, then we will >> skip this candidate as there should be a node that dominates this successor >> that has a single entry point. The last node in the dominance vector should >> also have a single successor. If it does not, then further investigation of >> the CFG is necessary to see when/how this situation occurs. >> >> We will need several heuristics to make sure we only outline in >> cases where we are confident it will result in a performance gain. Things >> such as threshold on when a branch is considered cold, the minimum number >> of times the predecessor node has to be executed in order for an edge to be >> considered (confidence factor), and the minimum size of the region to be >> outlined (can use inlining cost analysis like we currently do) will require >> some level of tuning. >> >> Similar to the current implementation, we will attempt to inline >> the leftover (hot) parts of the code, and if for some reason we cannot then >> we discard the modified function and its outlined code. >> >> ### Code changes >> >> The current Partial Inlining code first clones the function of >> interest and looks for a single set of blocks to outline. It then creates a >> function with the set the blocks, and saves the outlined function and >> outline callsite information as part of the function cloning container. In >> order to outline multiple regions of the function, we will need to change >> these containers to keep track of a list of regions to outline. We will >> also need to update the cost analysis to take into account multiple >> outlined functions. >> >> When a ProfileSummary is available, then we should skip the code >> that looks for early returns and go into new code that looks for cold >> regions to outline. When ProfileSummary is not available, then we can fall >> back to the existing code and look for early returns only. >> >> ### Tuning >> >> - The outlining heuristics will need to determine if a set of cold >> blocks is large enough to warrant the overhead of a function call. We also >> don't want the inliner to attempt to inline the outlined code later. >> - The threshold for determining whether a block is cold will also >> need to be tuned. In the case that profiling information is not accurate, >> we will pay the price of the additional call overhead for executing cold >> code. >> - The confidence factor, which can be viewed as the minimum number >> of times the predecessor has to be executed in order for an edge to be >> considered cold, should also be taken into account to avoid outlining code >> paths we have little information on. >> >> Graham Yiu >> LLVM Compiler Development >> IBM Toronto Software Lab >> Office: (905) 413-4077 C2-407/8200/Markham >> *Email: gyiu at ca.ibm.com* <gyiu at ca.ibm.com> >> >> _______________________________________________ >> LLVM Developers mailing list >> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> >> >> >> >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/9af62420/attachment-0001.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/9af62420/attachment-0001.gif>
Jessica Paquette via llvm-dev
2017-Aug-16 15:50 UTC
[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
Hi Graham,> I haven't looked at the machine outliner yet, but my understanding is it also prioritizes code size reduction over performance gains. Is my assumption correct? Do you think there's some tuning we can do in the machine outliner to tailor for performance improvements > >Yes, it’s only concerned with code size reduction. It currently doesn’t consider performance gains in the cost model at all. It also doesn’t use any profiling information. (I also don’t know if that information is preserved until the point where the MachineOutliner runs, but if it is, it would be possible to hook it in.) If the profiling information *is* available, it would be possible to ignore instructions in, say, hot blocks during the preprocessing step in the MachineOutliner. Before even considering candidates for outlining, the MachineOutliner walks the module to assign instructions integer values. If it’s something we can’t outline, then we effectively ignore it. The heat of an overall block could be considered during this step as well. I’m not sure if this would be sufficient for your use-case though. - Jessica> On Aug 15, 2017, at 1:14 PM, Graham Yiu <gyiu at ca.ibm.com> wrote: > > Hi Jessica, > > Thanks for the feedback. > > I believe the existing partial inliner pass does use some common utilities like the code extractor to do outlining. Is that what you're referring to? I don't recall seeing any other passes that has outlining other than the machine outliner, but I may have missed something. > > I briefly looked at River's RFC and it seems he's mainly utilizing outlining as a tool to reduce code size while maintaining code performance. Our proposal is to improve on the existing outlining scheme of the partial inliner to give us more opportunities to inline hot code. We will only do this type of general outlining with the presence of profiling or user data (such as builtin_expect). With that said, I am interested in how River plans to handle live ranges and heuristics he will use with profiling data available. > > I haven't looked at the machine outliner yet, but my understanding is it also prioritizes code size reduction over performance gains. Is my assumption correct? Do you think there's some tuning we can do in the machine outliner to tailor for performance improvements? > > Graham Yiu > LLVM Compiler Development > IBM Toronto Software Lab > Office: (905) 413-4077 C2-407/8200/Markham > Email: gyiu at ca.ibm.com > > <graycol.gif>Jessica Paquette ---08/15/2017 02:45:06 PM---Hi Graham, Have you looked at using the existing outliner framework to implement partial inlining? I > > From: Jessica Paquette <jpaquette at apple.com> > To: Graham Yiu <gyiu at ca.ibm.com> > Cc: llvm-dev at lists.llvm.org > Date: 08/15/2017 02:45 PM > Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks > Sent by: jpaquette at apple.com > > > > > Hi Graham, > > Have you looked at using the existing outliner framework to implement partial inlining? In the past, people have combined outlining + inlining to get partial inlining “for free” (see "Function outlining and partial inlining” by Zhao and Amaral). > > There is the MachineOutliner. River Riddle has been working on an IR-level outliner as well. The MachineOutliner doesn’t currently consider the heat of a block, but might be worth experimenting with. It might be good to experiment with River’s work and the existing outlining pass to see if this functionality can effectively fall out of what we have already. > > Here’s the recent IR-level outlining thread for reference wrt River’s work: http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html <http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html> > > - Jessica > On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > Hello, > > My team and I are looking to do some enhancements in the partial inliner in opt. Would appreciate any feedback that folks might have. > > # Partial Inlining in LLVM opt > > ## Summary > > ### Background > > Currently, the partial inliner searches the first few blocks of the callee and looks for a branch to the return block (ie. early return). If found, it attempts to outline the rest of the slow (or heavy) code so the inliner will be able to inline the fast (or light) code. If no early returns are found, the partial inliner will give up. As far as I can tell, BlockFrequency and BranchProbability information is only used when attempting to inline the early return code, and not used to determine whether to outline the slow code. > > ### Proposed changes > > In addition to looking for early returns, we should utilize profile information to outline blocks that are considered cold. If we can sufficiently reduce the size of the original function via this type of outlining, inlining should be able to inline the rest of the hot code. > > ## Details > > With the presence of profile information, we have a view of what code is infrequently executed and make better decisions on what to outline. Early return blocks that are infrequently executed should still be included as candidates for outlining, but will be treated just like any other cold blocks. Without profiling information, however, we should remain conservative and only partial inline in the presence of an early return in the first few blocks of a function (ie. peel the early return out of the function). > > To find cold regions to outline, we will traverse the CFG to find edges deemed 'cold' and look at the blocks dominated by the successor node. If, for some reason, that block has more than one predecessor, then we will skip this candidate as there should be a node that dominates this successor that has a single entry point. The last node in the dominance vector should also have a single successor. If it does not, then further investigation of the CFG is necessary to see when/how this situation occurs. > > We will need several heuristics to make sure we only outline in cases where we are confident it will result in a performance gain. Things such as threshold on when a branch is considered cold, the minimum number of times the predecessor node has to be executed in order for an edge to be considered (confidence factor), and the minimum size of the region to be outlined (can use inlining cost analysis like we currently do) will require some level of tuning. > > Similar to the current implementation, we will attempt to inline the leftover (hot) parts of the code, and if for some reason we cannot then we discard the modified function and its outlined code. > > ### Code changes > > The current Partial Inlining code first clones the function of interest and looks for a single set of blocks to outline. It then creates a function with the set the blocks, and saves the outlined function and outline callsite information as part of the function cloning container. In order to outline multiple regions of the function, we will need to change these containers to keep track of a list of regions to outline. We will also need to update the cost analysis to take into account multiple outlined functions. > > When a ProfileSummary is available, then we should skip the code that looks for early returns and go into new code that looks for cold regions to outline. When ProfileSummary is not available, then we can fall back to the existing code and look for early returns only. > > ### Tuning > > - The outlining heuristics will need to determine if a set of cold blocks is large enough to warrant the overhead of a function call. We also don't want the inliner to attempt to inline the outlined code later. > - The threshold for determining whether a block is cold will also need to be tuned. In the case that profiling information is not accurate, we will pay the price of the additional call overhead for executing cold code. > - The confidence factor, which can be viewed as the minimum number of times the predecessor has to be executed in order for an edge to be considered cold, should also be taken into account to avoid outlining code paths we have little information on. > > Graham Yiu > LLVM Compiler Development > IBM Toronto Software Lab > Office: (905) 413-4077 C2-407/8200/Markham > Email: gyiu at ca.ibm.com <mailto:gyiu at ca.ibm.com> > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170816/2f9d8910/attachment-0001.html>
Reasonably Related Threads
- [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
- [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
- [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
- [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
- [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks