thr3ads.net - llvm dev - [LLVMdev] Path forward on profile guided inlining? [Jun 2015]

If this information is useful, please help other people find it:
Share via:

Philip Reames

2015-Jun-17 22:49 UTC

[LLVMdev] Path forward on profile guided inlining?

I would like to start prototyping changes towards profile guided 
inlining.  Before doing so, I wanted to get feedback from the community 
on the appropriate approach.  I'm going to layout a strawman proposal, 
but I'm open to other ideas on how to approach the problem.

Depending on what approach we settle on, I *may* be able to commit 
resources to actually implement this in the near term.  I can't commit 
to too much work here, so if we settle on something which will need a 
lot of enabling infrastructure work first, I'm likely going to need to 
just hack this within my local tree.  I'm hoping to avoid that, but I 
want to be up front about the practicalities here.

Now, on to the strawman proposal...

This proposal is intended to not be dependent on the ongoing pass manger 
changes by Chandler.  It's intended to be compatible with that work, but 
not reliant on it.

Background
One of the problems we have today is that we can't ask for information 
about the frequency of the call site we're examining within the inlining 
iteration loop.  We can get at this information from normal transform 
passes run within the outer inlining loop (CallGraphSCCPass), but the 
inner loop does not have the ability to keep function analysis up to 
date while inlining.

We have an existing piece of metadata on functions which record the 
function entry count.  This is not currently kept up to date in any way; 
it records the entry count of the function w/o.r.t. to inlining.  This 
metadata was only recently added and doesn't effect optimization today.

Objective
I'm looking to be able to implement two key inlining heuristics.

The first is to recognize when we have the last remaining *actual* call 
to a callee.  If we can prove there's a single caller today, we 
essentially just assume it's profitable to inline.  I want to extend 
this logic to the case where we have a single call site which dominates 
the caller profile for a given callee.  We might need to still keep 
around a copy of the callee (external, or rare callers), but the 
remaining callee definition will be a cold function, can be optimized 
for size, and can be placed far from hot code.

Second, I would like to use the frequency of the call site to adjust the 
inline threshold within the inliner.  This basically means that we 
become more willing to inline hot functions (and probably less willing 
to inline cold ones).

At least to start with, both of these optimizations would be off by 
default under a flag.  I figure there's a lot to be discussed here, but 
I'd prefer to defer that a bit.  We need to get the enabling parts in 
place before worrying too much about the heuristics.


The Ideal Answer
If we had the pass manager changes done, we'd be able to just ask BFI to 
compute an estimated execution count for the call site.  Both of the 
inliner heuristics I'm interested in could be implemented using that 
information combined with an up to date estimate of the function entry 
count.

If we had the pass manager changes, the only thing we'd need to do is 
update InlineFunction to subtract the estimated call count from the 
entry count of the remaining callee definition.  This would result in 
the entry count of B reflecting the estimated entry count from the 
remaining callers of B.  (Given the call site count is an estimate, this 
implies that entry counts are also now approximate. Given typical 
profiling mechanisms are lossy, that's not a big deal, but is worth 
noting explicitly.)

(See also my note at the bottom about iteration order.  I consider that 
somewhat out of scope for this proposal, but it effects both the ideal 
and practical sections herein.)


The Practical Answer
Essentially, the remainder of this is an approximation of the pass 
manager based solution which let's us start experimenting with the 
inliner changes while waiting for the pass manager.  At least in theory, 
when we have the pass manager in place, we can simply drop most of this.

The basic idea is to use metadata to effectively cache the estimated 
frequency for a given call site.  We can use BFI to generate/update this 
information once per iteration of the outer inlining loop.  As a result, 
we only need to keep it reasonably up to date within the inner inlining 
loop.  (See note below about alternate approaches.)

The metadata on the call site would look something like:
call void @foo() !prof !"estimated_execution_count" 2000 (where 2000
is
the estimated count)

We'd have a new FunctionPass which removes any existing metadata and 
adds new metadata which basically comes down to a product of the 
functions entry count and the estimated block frequency (provided by 
BFI) of the block containing the call site.  This would run as the last 
FunctionPass within the outer inlining loop.  Assuming that the entry 
counts are kept reasonably up to date, the resulting estimated call site 
counts should be reasonable.

Within the inliner itself, we'd need to update InlineFunction to keep 
the estimated counts in sync:
- When inlining B into A, split the estimated call counts on the calls 
within B between the remaining instance of B and the newly created call 
sites within A.  I plan to split the estimated count in roughly the 
ratio of the entries into B.  As a result, a given call site BC 
(originally from B into C) would be split into two sites AC, and BC with 
estimated counts (AB.count/B.entry_count * OrigBC.count) and 
((1-AB.count/B.entry_count) * OrigBC.count).  This does require updating 
the body of the callee, not just the caller when inlining.

It's important to note that the resulting estimated call counts can be 
just flat out wrong.  As the easiest example, consider a case where B 
called C, but only when a boolean parameter was set to true.  If we 
split the count of BC into AC, BC and then drop the call site AC, we've 
essentially lost information about the frequency of the remaining BC 
w.r.t. any other callers of C.  We could try to adjust for this by only 
updating calls which don't get pruned away by constant propagation 
within InlineFunction, but I'm not sure how worthwhile it is trying to 
be smart here given the information will be roughly restored once we're 
out of the inner loop again.

What we can assume (and thus make inlining decisions on), is that a) a 
given call sites count is a reasonable approximation of it's actual 
execution count and b) that the sum of the call site counts for a given 
callee is less than or equal to the callee's entry count.  What we can't
do is compare two call sites counts for the same callee within two 
different functions and decide with high confidence which is more frequent.


Alternate Approaches:
1) We could just recompute BFI explicitly in the inliner right before 
passing the result to ICA for the purposes of prototyping. If this was 
off by default, this might be a reasonable scheme for investigation.  
This could likely never be enabled for real uses.
2) We could pre-compute BFI once per function within a given SCC and 
then try to keep it up to date during inlining.  If we cached the call 
frequencies for the initial call sites, we could adjust the visit order 
to minimize the number of times we need to recompute a given functions 
block frequencies.  (e.g. we can look at all the original call sites 
within a function before looking at newly inlined ones)

For the record, it's not clear that the bottom up inlining approach 
works anywhere near as well once you starting using profiling 
information about the caller/callee pair.  You would seem to end up with 
cases where inlining BC makes it more likely you'll inline DC, but if 
you visited them in the other order you wouldn't inline DC. Consider the 
case where BC is a hot call site, and DC is the only other caller of C.  
I have no plans to actually tackle this problem.

Xinliang David Li

2015-Jun-18 00:25 UTC

head link

[LLVMdev] Path forward on profile guided inlining?

Phillip, thanks for looking into this.

On Wed, Jun 17, 2015 at 3:49 PM, Philip Reames
<listmail at philipreames.com> wrote:> I would like to start prototyping changes towards profile guided inlining.
> Before doing so, I wanted to get feedback from the community on the
> appropriate approach.  I'm going to layout a strawman proposal, but
I'm open
> to other ideas on how to approach the problem.
>
> Depending on what approach we settle on, I *may* be able to commit
resources
> to actually implement this in the near term.  I can't commit to too
much
> work here, so if we settle on something which will need a lot of enabling
> infrastructure work first, I'm likely going to need to just hack this
within
> my local tree.  I'm hoping to avoid that, but I want to be up front
about
> the practicalities here.
>
> Now, on to the strawman proposal...
>
> This proposal is intended to not be dependent on the ongoing pass manger
> changes by Chandler.  It's intended to be compatible with that work,
but not
> reliant on it.
I assume what you are proposing is only for the short term before Pass
manager work is ready?

Longer term, we do need function analysis results (BB level, Loop
level) to be available during IPA/Inliner. For instance to estimate
icache footprint using weighted size etc.

I have some preliminary data that shows the memory overhead of keeping
those data for all functions (and maintained with incremental update)
is small.
>
> Background
> One of the problems we have today is that we can't ask for information
about
> the frequency of the call site we're examining within the inlining
iteration
> loop.  We can get at this information from normal transform passes run
> within the outer inlining loop (CallGraphSCCPass), but the inner loop does
> not have the ability to keep function analysis up to date while inlining.
>
> We have an existing piece of metadata on functions which record the
function
> entry count.  This is not currently kept up to date in any way; it records
> the entry count of the function w/o.r.t. to inlining.  This metadata was
> only recently added and doesn't effect optimization today.
>
> Objective
> I'm looking to be able to implement two key inlining heuristics.
>
> The first is to recognize when we have the last remaining *actual* call to
a
> callee.  If we can prove there's a single caller today, we essentially
just
> assume it's profitable to inline.  I want to extend this logic to the
case
> where we have a single call site which dominates the caller profile for a
> given callee.  We might need to still keep around a copy of the callee
> (external, or rare callers), but the remaining callee definition will be a
> cold function, can be optimized for size, and can be placed far from hot
> code.
>
> Second, I would like to use the frequency of the call site to adjust the
> inline threshold within the inliner.  This basically means that we become
> more willing to inline hot functions (and probably less willing to inline
> cold ones).
yes. We plan to do that too.
>
> At least to start with, both of these optimizations would be off by default
> under a flag.  I figure there's a lot to be discussed here, but I'd
prefer
> to defer that a bit.  We need to get the enabling parts in place before
> worrying too much about the heuristics.
>
>
> The Ideal Answer
> If we had the pass manager changes done, we'd be able to just ask BFI
to
> compute an estimated execution count for the call site.  Both of the
inliner
> heuristics I'm interested in could be implemented using that
information
> combined with an up to date estimate of the function entry count.
>
> If we had the pass manager changes, the only thing we'd need to do is
update
> InlineFunction to subtract the estimated call count from the entry count of
> the remaining callee definition.  This would result in the entry count of B
> reflecting the estimated entry count from the remaining callers of B.
> (Given the call site count is an estimate, this implies that entry counts
> are also now approximate. Given typical profiling mechanisms are lossy,
> that's not a big deal, but is worth noting explicitly.)
yes. BB count is computed from entry_count and BB freq; call count is
obtained from BB count. Two things need to be updated during inlining

1) entry count of the remaining out of line instance of the callee
2) BB frequency of inline instance of the callee.

see also http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083943.html

>
> (See also my note at the bottom about iteration order.  I consider that
> somewhat out of scope for this proposal, but it effects both the ideal and
> practical sections herein.)
>
>
> The Practical Answer
> Essentially, the remainder of this is an approximation of the pass manager
> based solution which let's us start experimenting with the inliner
changes
> while waiting for the pass manager.  At least in theory, when we have the
> pass manager in place, we can simply drop most of this.
>
> The basic idea is to use metadata to effectively cache the estimated
> frequency for a given call site.  We can use BFI to generate/update this
> information once per iteration of the outer inlining loop.  As a result, we
> only need to keep it reasonably up to date within the inner inlining loop.
> (See note below about alternate approaches.)
>
> The metadata on the call site would look something like:
> call void @foo() !prof !"estimated_execution_count" 2000 (where
2000 is the
> estimated count)
Can you use callgraph to keep this information instead?

>
> We'd have a new FunctionPass which removes any existing metadata and
adds
> new metadata which basically comes down to a product of the functions entry
> count and the estimated block frequency (provided by BFI) of the block
> containing the call site.  This would run as the last FunctionPass within
> the outer inlining loop.  Assuming that the entry counts are kept
reasonably
> up to date, the resulting estimated call site counts should be reasonable.
>
> Within the inliner itself, we'd need to update InlineFunction to keep
the
> estimated counts in sync:
> - When inlining B into A, split the estimated call counts on the calls
> within B between the remaining instance of B and the newly created call
> sites within A.  I plan to split the estimated count in roughly the ratio
of
> the entries into B.  As a result, a given call site BC (originally from B
> into C) would be split into two sites AC, and BC with estimated counts
> (AB.count/B.entry_count * OrigBC.count) and ((1-AB.count/B.entry_count) *
> OrigBC.count).  This does require updating the body of the callee, not just
> the caller when inlining.
This scaling scheme assumes context-insensitivity. See your example
below which it does not apply. However this is no better way.

>
> It's important to note that the resulting estimated call counts can be
just
> flat out wrong.  As the easiest example, consider a case where B called C,
> but only when a boolean parameter was set to true.  If we split the count
of
> BC into AC, BC and then drop the call site AC, we've essentially lost
> information about the frequency of the remaining BC w.r.t. any other
callers
> of C.  We could try to adjust for this by only updating calls which
don't
> get pruned away by constant propagation within InlineFunction, but I'm
not
> sure how worthwhile it is trying to be smart here given the information
will
> be roughly restored once we're out of the inner loop again.
Why is wrong? should it make the result more precise? Basically C is
never called from B if B is called from A in this case. In such as
case,  edge BC's count does not need to be adjusted (though B's entry
count needs to be adjusted).
>
> What we can assume (and thus make inlining decisions on), is that a) a
given
> call sites count is a reasonable approximation of it's actual execution
> count and b) that the sum of the call site counts for a given callee is
less
> than or equal to the callee's entry count.  What we can't do is
compare two
> call sites counts for the same callee within two different functions and
> decide with high confidence which is more frequent.
>
>
> Alternate Approaches:
> 1) We could just recompute BFI explicitly in the inliner right before
> passing the result to ICA for the purposes of prototyping. If this was off
> by default, this might be a reasonable scheme for investigation.  This
could
> likely never be enabled for real uses.
> 2) We could pre-compute BFI once per function within a given SCC and then
> try to keep it up to date during inlining.  If we cached the call
> frequencies for the initial call sites, we could adjust the visit order to
> minimize the number of times we need to recompute a given functions block
> frequencies.  (e.g. we can look at all the original call sites within a
> function before looking at newly inlined ones)
>
> For the record, it's not clear that the bottom up inlining approach
works
> anywhere near as well once you starting using profiling information about
> the caller/callee pair.  You would seem to end up with cases where inlining
> BC makes it more likely you'll inline DC, but if you visited them in
the
> other order you wouldn't inline DC. Consider the case where BC is a hot
call
> site, and DC is the only other caller of C.  I have no plans to actually
> tackle this problem.
Inliner can be taught to delay inlining and use a top down scheme. For
instance if B has lots of incoming calls, but there is only one
dominating call edge from A -- in such as case, it is likely better to
delay inlining decisions of callsites in B that may have large
overhead until  B gets inlined ..

thanks,

David

Philip Reames

2015-Jun-18 01:02 UTC

head link

[LLVMdev] Path forward on profile guided inlining?

On 06/17/2015 05:25 PM, Xinliang David Li wrote:> Phillip, thanks for looking into this.
>
> On Wed, Jun 17, 2015 at 3:49 PM, Philip Reames
> <listmail at philipreames.com> wrote:
>> I would like to start prototyping changes towards profile guided
inlining.
>> Before doing so, I wanted to get feedback from the community on the
>> appropriate approach.  I'm going to layout a strawman proposal, but
I'm open
>> to other ideas on how to approach the problem.
>>
>> Depending on what approach we settle on, I *may* be able to commit
resources
>> to actually implement this in the near term.  I can't commit to too
much
>> work here, so if we settle on something which will need a lot of
enabling
>> infrastructure work first, I'm likely going to need to just hack
this within
>> my local tree.  I'm hoping to avoid that, but I want to be up front
about
>> the practicalities here.
>>
>> Now, on to the strawman proposal...
>>
>> This proposal is intended to not be dependent on the ongoing pass
manger
>> changes by Chandler.  It's intended to be compatible with that
work, but not
>> reliant on it.
> I assume what you are proposing is only for the short term before Pass
> manager work is ready?
>
> Longer term, we do need function analysis results (BB level, Loop
> level) to be available during IPA/Inliner. For instance to estimate
> icache footprint using weighted size etc.I addressed this specifically in my original email.  The use of the 
extra metadata based caching scheme is temporary.  The inlining 
heuristics and updating of entry counts would not be.  They'd just be 
driven by BFI directly.> I have some preliminary data that shows the memory overhead of keeping
> those data for all functions (and maintained with incremental update)
> is small.The memory size part doesn't surprise me for at least some analysis 
passes.  What's the runtime cost of the incremental
update?>
>> Background
>> One of the problems we have today is that we can't ask for
information about
>> the frequency of the call site we're examining within the inlining
iteration
>> loop.  We can get at this information from normal transform passes run
>> within the outer inlining loop (CallGraphSCCPass), but the inner loop
does
>> not have the ability to keep function analysis up to date while
inlining.
>>
>> We have an existing piece of metadata on functions which record the
function
>> entry count.  This is not currently kept up to date in any way; it
records
>> the entry count of the function w/o.r.t. to inlining.  This metadata
was
>> only recently added and doesn't effect optimization today.
>>
>> Objective
>> I'm looking to be able to implement two key inlining heuristics.
>>
>> The first is to recognize when we have the last remaining *actual* call
to a
>> callee.  If we can prove there's a single caller today, we
essentially just
>> assume it's profitable to inline.  I want to extend this logic to
the case
>> where we have a single call site which dominates the caller profile for
a
>> given callee.  We might need to still keep around a copy of the callee
>> (external, or rare callers), but the remaining callee definition will
be a
>> cold function, can be optimized for size, and can be placed far from
hot
>> code.
>>
>> Second, I would like to use the frequency of the call site to adjust
the
>> inline threshold within the inliner.  This basically means that we
become
>> more willing to inline hot functions (and probably less willing to
inline
>> cold ones).
> yes. We plan to do that too.Good, let's collaborate.  It'll get done sooner. 
:)>
>> At least to start with, both of these optimizations would be off by
default
>> under a flag.  I figure there's a lot to be discussed here, but
I'd prefer
>> to defer that a bit.  We need to get the enabling parts in place before
>> worrying too much about the heuristics.
>>
>>
>> The Ideal Answer
>> If we had the pass manager changes done, we'd be able to just ask
BFI to
>> compute an estimated execution count for the call site.  Both of the
inliner
>> heuristics I'm interested in could be implemented using that
information
>> combined with an up to date estimate of the function entry count.
>>
>> If we had the pass manager changes, the only thing we'd need to do
is update
>> InlineFunction to subtract the estimated call count from the entry
count of
>> the remaining callee definition.  This would result in the entry count
of B
>> reflecting the estimated entry count from the remaining callers of B.
>> (Given the call site count is an estimate, this implies that entry
counts
>> are also now approximate. Given typical profiling mechanisms are lossy,
>> that's not a big deal, but is worth noting explicitly.)
> yes. BB count is computed from entry_count and BB freq; call count is
> obtained from BB count. Two things need to be updated during inlining
>
> 1) entry count of the remaining out of line instance of the callee
> 2) BB frequency of inline instance of the callee.
>
> see also http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083943.htmlThis is exactly correct.  I was assuming that the BFI function pass new 
how to update itself (i.e. lazy rebuild), but if you wanted the inliner 
to explicitly keep the analysis up to date, that would work
too.>
>
>> (See also my note at the bottom about iteration order.  I consider that
>> somewhat out of scope for this proposal, but it effects both the ideal
and
>> practical sections herein.)
>>
>>
>> The Practical Answer
>> Essentially, the remainder of this is an approximation of the pass
manager
>> based solution which let's us start experimenting with the inliner
changes
>> while waiting for the pass manager.  At least in theory, when we have
the
>> pass manager in place, we can simply drop most of this.
>>
>> The basic idea is to use metadata to effectively cache the estimated
>> frequency for a given call site.  We can use BFI to generate/update
this
>> information once per iteration of the outer inlining loop.  As a
result, we
>> only need to keep it reasonably up to date within the inner inlining
loop.
>> (See note below about alternate approaches.)
>>
>> The metadata on the call site would look something like:
>> call void @foo() !prof !"estimated_execution_count" 2000
(where 2000 is the
>> estimated count)
> Can you use callgraph to keep this information instead?That would be another option.  Not a bad one actually.  It would mean 
having to explicitly populate the information into the callgraph by 
calling per function analysis passes.  Not sure how easy that is to do.

I'm also leery relying on the call graph since I know that's going to 
radically change with the new pass manager.  I really, really, really do 
not want to be blocked behind Chandlers progress.>
>
>> We'd have a new FunctionPass which removes any existing metadata
and adds
>> new metadata which basically comes down to a product of the functions
entry
>> count and the estimated block frequency (provided by BFI) of the block
>> containing the call site.  This would run as the last FunctionPass
within
>> the outer inlining loop.  Assuming that the entry counts are kept
reasonably
>> up to date, the resulting estimated call site counts should be
reasonable.
>>
>> Within the inliner itself, we'd need to update InlineFunction to
keep the
>> estimated counts in sync:
>> - When inlining B into A, split the estimated call counts on the calls
>> within B between the remaining instance of B and the newly created call
>> sites within A.  I plan to split the estimated count in roughly the
ratio of
>> the entries into B.  As a result, a given call site BC (originally from
B
>> into C) would be split into two sites AC, and BC with estimated counts
>> (AB.count/B.entry_count * OrigBC.count) and ((1-AB.count/B.entry_count)
*
>> OrigBC.count).  This does require updating the body of the callee, not
just
>> the caller when inlining.
> This scaling scheme assumes context-insensitivity. See your example
> below which it does not apply. However this is no better way.I assume you meant "there is no better way".  If so, agreed. Assuming 
that you're not recomputing the call count from another source at
least.>
>
>> It's important to note that the resulting estimated call counts can
be just
>> flat out wrong.  As the easiest example, consider a case where B called
C,
>> but only when a boolean parameter was set to true.  If we split the
count of
>> BC into AC, BC and then drop the call site AC, we've essentially
lost
>> information about the frequency of the remaining BC w.r.t. any other
callers
>> of C.  We could try to adjust for this by only updating calls which
don't
>> get pruned away by constant propagation within InlineFunction, but
I'm not
>> sure how worthwhile it is trying to be smart here given the information
will
>> be roughly restored once we're out of the inner loop again.
> Why is wrong? should it make the result more precise? Basically C is
> never called from B if B is called from A in this case. In such as
> case,  edge BC's count does not need to be adjusted (though B's
entry
> count needs to be adjusted).To clarify: "wrong" was meant to imply inaccurate, and confusing. Not 
"wrong" in the sense of miscompile.

The problem is not the count at call site AC or even BC.  The problem is 
that we can have some other call site XC.

We start with:
A calls B with call site count 200
B calls C with call site count 200 (but never when called from A)
X calls C with call site count 50
A's entry count is 200
B's entry count is 210
C's entry count is 250

After inlining AB, we get:
A calls C with call site count 0 (because we proved away the call site)
B calls C with call site count 10/210 (small fraction of original)
X calls C with call site count 50
A's entry count is 200
B's entry count is 10
C's entry count is 250

The information for BC is wildly inaccurate and we've mistated the ratio 
between call sites BC and BX.  Any decision we make off this data is 
likely to be questionable.  This isn't a correctness issue, but it's 
definitely makes writing optimization heuristics harder.

(Note that the ability to rescale by block frequency within B solves 
this problem quite nicely.  Thus, we want the pass manager changes!)
>
>> What we can assume (and thus make inlining decisions on), is that a) a
given
>> call sites count is a reasonable approximation of it's actual
execution
>> count and b) that the sum of the call site counts for a given callee is
less
>> than or equal to the callee's entry count.  What we can't do is
compare two
>> call sites counts for the same callee within two different functions
and
>> decide with high confidence which is more frequent.
>>
>>
>> Alternate Approaches:
>> 1) We could just recompute BFI explicitly in the inliner right before
>> passing the result to ICA for the purposes of prototyping. If this was
off
>> by default, this might be a reasonable scheme for investigation.  This
could
>> likely never be enabled for real uses.
>> 2) We could pre-compute BFI once per function within a given SCC and
then
>> try to keep it up to date during inlining.  If we cached the call
>> frequencies for the initial call sites, we could adjust the visit order
to
>> minimize the number of times we need to recompute a given functions
block
>> frequencies.  (e.g. we can look at all the original call sites within a
>> function before looking at newly inlined ones)
>>
>> For the record, it's not clear that the bottom up inlining approach
works
>> anywhere near as well once you starting using profiling information
about
>> the caller/callee pair.  You would seem to end up with cases where
inlining
>> BC makes it more likely you'll inline DC, but if you visited them
in the
>> other order you wouldn't inline DC. Consider the case where BC is a
hot call
>> site, and DC is the only other caller of C.  I have no plans to
actually
>> tackle this problem.
> Inliner can be taught to delay inlining and use a top down scheme. For
> instance if B has lots of incoming calls, but there is only one
> dominating call edge from A -- in such as case, it is likely better to
> delay inlining decisions of callsites in B that may have large
> overhead until  B gets inlined ..It can be done; agreed.  Just out of scope for the
moment.>
> thanks,
>
> David

Xinliang David Li

2015-Jul-31 05:35 UTC

head link

[LLVMdev] Path forward on profile guided inlining?

Hi Chandler, I'd like to hear your thought about this interim solution
to make global profile information available to inliner.

thanks,

David


On Wed, Jun 17, 2015 at 3:49 PM, Philip Reames
<listmail at philipreames.com> wrote:> I would like to start prototyping changes towards profile guided inlining.
> Before doing so, I wanted to get feedback from the community on the
> appropriate approach.  I'm going to layout a strawman proposal, but
I'm open
> to other ideas on how to approach the problem.
>
> Depending on what approach we settle on, I *may* be able to commit
resources
> to actually implement this in the near term.  I can't commit to too
much
> work here, so if we settle on something which will need a lot of enabling
> infrastructure work first, I'm likely going to need to just hack this
within
> my local tree.  I'm hoping to avoid that, but I want to be up front
about
> the practicalities here.
>
> Now, on to the strawman proposal...
>
> This proposal is intended to not be dependent on the ongoing pass manger
> changes by Chandler.  It's intended to be compatible with that work,
but not
> reliant on it.
>
> Background
> One of the problems we have today is that we can't ask for information
about
> the frequency of the call site we're examining within the inlining
iteration
> loop.  We can get at this information from normal transform passes run
> within the outer inlining loop (CallGraphSCCPass), but the inner loop does
> not have the ability to keep function analysis up to date while inlining.
>
> We have an existing piece of metadata on functions which record the
function
> entry count.  This is not currently kept up to date in any way; it records
> the entry count of the function w/o.r.t. to inlining.  This metadata was
> only recently added and doesn't effect optimization today.
>
> Objective
> I'm looking to be able to implement two key inlining heuristics.
>
> The first is to recognize when we have the last remaining *actual* call to
a
> callee.  If we can prove there's a single caller today, we essentially
just
> assume it's profitable to inline.  I want to extend this logic to the
case
> where we have a single call site which dominates the caller profile for a
> given callee.  We might need to still keep around a copy of the callee
> (external, or rare callers), but the remaining callee definition will be a
> cold function, can be optimized for size, and can be placed far from hot
> code.
>
> Second, I would like to use the frequency of the call site to adjust the
> inline threshold within the inliner.  This basically means that we become
> more willing to inline hot functions (and probably less willing to inline
> cold ones).
>
> At least to start with, both of these optimizations would be off by default
> under a flag.  I figure there's a lot to be discussed here, but I'd
prefer
> to defer that a bit.  We need to get the enabling parts in place before
> worrying too much about the heuristics.
>
>
> The Ideal Answer
> If we had the pass manager changes done, we'd be able to just ask BFI
to
> compute an estimated execution count for the call site.  Both of the
inliner
> heuristics I'm interested in could be implemented using that
information
> combined with an up to date estimate of the function entry count.
>
> If we had the pass manager changes, the only thing we'd need to do is
update
> InlineFunction to subtract the estimated call count from the entry count of
> the remaining callee definition.  This would result in the entry count of B
> reflecting the estimated entry count from the remaining callers of B.
> (Given the call site count is an estimate, this implies that entry counts
> are also now approximate. Given typical profiling mechanisms are lossy,
> that's not a big deal, but is worth noting explicitly.)
>
> (See also my note at the bottom about iteration order.  I consider that
> somewhat out of scope for this proposal, but it effects both the ideal and
> practical sections herein.)
>
>
> The Practical Answer
> Essentially, the remainder of this is an approximation of the pass manager
> based solution which let's us start experimenting with the inliner
changes
> while waiting for the pass manager.  At least in theory, when we have the
> pass manager in place, we can simply drop most of this.
>
> The basic idea is to use metadata to effectively cache the estimated
> frequency for a given call site.  We can use BFI to generate/update this
> information once per iteration of the outer inlining loop.  As a result, we
> only need to keep it reasonably up to date within the inner inlining loop.
> (See note below about alternate approaches.)
>
> The metadata on the call site would look something like:
> call void @foo() !prof !"estimated_execution_count" 2000 (where
2000 is the
> estimated count)
>
> We'd have a new FunctionPass which removes any existing metadata and
adds
> new metadata which basically comes down to a product of the functions entry
> count and the estimated block frequency (provided by BFI) of the block
> containing the call site.  This would run as the last FunctionPass within
> the outer inlining loop.  Assuming that the entry counts are kept
reasonably
> up to date, the resulting estimated call site counts should be reasonable.
>
> Within the inliner itself, we'd need to update InlineFunction to keep
the
> estimated counts in sync:
> - When inlining B into A, split the estimated call counts on the calls
> within B between the remaining instance of B and the newly created call
> sites within A.  I plan to split the estimated count in roughly the ratio
of
> the entries into B.  As a result, a given call site BC (originally from B
> into C) would be split into two sites AC, and BC with estimated counts
> (AB.count/B.entry_count * OrigBC.count) and ((1-AB.count/B.entry_count) *
> OrigBC.count).  This does require updating the body of the callee, not just
> the caller when inlining.
>
> It's important to note that the resulting estimated call counts can be
just
> flat out wrong.  As the easiest example, consider a case where B called C,
> but only when a boolean parameter was set to true.  If we split the count
of
> BC into AC, BC and then drop the call site AC, we've essentially lost
> information about the frequency of the remaining BC w.r.t. any other
callers
> of C.  We could try to adjust for this by only updating calls which
don't
> get pruned away by constant propagation within InlineFunction, but I'm
not
> sure how worthwhile it is trying to be smart here given the information
will
> be roughly restored once we're out of the inner loop again.
>
> What we can assume (and thus make inlining decisions on), is that a) a
given
> call sites count is a reasonable approximation of it's actual execution
> count and b) that the sum of the call site counts for a given callee is
less
> than or equal to the callee's entry count.  What we can't do is
compare two
> call sites counts for the same callee within two different functions and
> decide with high confidence which is more frequent.
>
>
> Alternate Approaches:
> 1) We could just recompute BFI explicitly in the inliner right before
> passing the result to ICA for the purposes of prototyping. If this was off
> by default, this might be a reasonable scheme for investigation.  This
could
> likely never be enabled for real uses.
> 2) We could pre-compute BFI once per function within a given SCC and then
> try to keep it up to date during inlining.  If we cached the call
> frequencies for the initial call sites, we could adjust the visit order to
> minimize the number of times we need to recompute a given functions block
> frequencies.  (e.g. we can look at all the original call sites within a
> function before looking at newly inlined ones)
>
> For the record, it's not clear that the bottom up inlining approach
works
> anywhere near as well once you starting using profiling information about
> the caller/callee pair.  You would seem to end up with cases where inlining
> BC makes it more likely you'll inline DC, but if you visited them in
the
> other order you wouldn't inline DC. Consider the case where BC is a hot
call
> site, and DC is the only other caller of C.  I have no plans to actually
> tackle this problem.

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - Jun 2015 - [LLVMdev] Path forward on profile guided inlining?

[LLVMdev] Path forward on profile guided inlining?

[LLVMdev] Path forward on profile guided inlining?

[LLVMdev] Path forward on profile guided inlining?

[LLVMdev] Path forward on profile guided inlining?

Possibly Parallel Threads