thr3ads.net - llvm dev - [llvm-dev] [LLVMdev] Path forward on profile guided inlining? [Aug 2015]

If this information is useful, please help other people find it:
Share via:

Ivan Baev via llvm-dev

2015-Aug-31 19:25 UTC

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

Hi Philip, David,
Thanks for looking into this. We are also interested in contributing to
this effort. In addition to improving heuristics in the inliner itself,
another motivation is to have correct function entry profile counts after
the inliner so subsequent phases can use them. For example, reduce or
disable optimizations on cold functions for size improvements, or function
placement.

some further comments/questions inline...


On 06/17/2015 05:25 PM, Xinliang David Li wrote:> Phillip, thanks for looking into this.
> On Wed, Jun 17, 2015 at 3:49 PM, Philip Reames
> <list... at philipreames.com> wrote:
>> I would like to start prototyping changes towards profile guided
inlining.>> Before doing so, I wanted to get feedback from the community on theappropriate approach.  I'm going to layout a strawman proposal, but I'm
open to other ideas on how to approach the problem.>> Depending on what approach we settle on, I *may* be able to commit
resources>> to actually implement this in the near term.  I can't commit to too
much>> work here, so if we settle on something which will need a lot of
enabling>> infrastructure work first, I'm likely going to need to just hack
this
within>> my local tree.  I'm hoping to avoid that, but I want to be up front
about the practicalities here.>> Now, on to the strawman proposal...
>> This proposal is intended to not be dependent on the ongoing pass
manager>> changes by Chandler.  It's intended to be compatible with that
work,
but not reliant on it.> I assume what you are proposing is only for the short term before Pass
manager work is ready?> Longer term, we do need function analysis results (BB level, Looplevel) to be available during IPA/Inliner. For instance to estimate 
icache footprint using weighted size etc.

I addressed this specifically in my original email.  The use of the  extra
metadata based caching scheme is temporary.  The inlining
heuristics and updating of entry counts would not be.  They'd just be 
driven by BFI directly.
> I have some preliminary data that shows the memory overhead of keepingthose data for all functions (and maintained with incremental update)  is
small.

The memory size part doesn't surprise me for at least some analysis 
passes.  What's the runtime cost of the incremental update?

Good, let's collaborate.  It'll get done sooner.  :)
>> At least to start with, both of these optimizations would be off by
default>> under a flag.  I figure there's a lot to be discussed here, but
I'd
prefer>> to defer that a bit.  We need to get the enabling parts in place before
worrying too much about the heuristics.>> The Ideal Answer
>> If we had the pass manager changes done, we'd be able to just ask
BFIto compute an estimated execution count for the call site.  Both of the
inliner  heuristics I'm interested in could be implemented using that
information>> combined with an up to date estimate of the function entry count.  Ifwe had the pass manager changes, the only thing we'd need to do is  update
>> InlineFunction to subtract the estimated call count from the entry
count of>> the remaining callee definition.  This would result in the entry count
of B>> reflecting the estimated entry count from the remaining callers of B.(Given the call site count is an estimate, this implies that entry  counts
>> are also now approximate. Given typical profiling mechanisms are lossy,that's not a big deal, but is worth noting
explicitly.)> yes. BB count is computed from entry_count and BB freq; call count isobtained from BB count. Two things need to be updated during inlining  1)
entry count of the remaining out of line instance of the callee  2) BB
frequency of inline instance of the callee.> see alsohttp://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083943.html

This is exactly correct.  I was assuming that the BFI function pass new 
how to update itself (i.e. lazy rebuild), but if you wanted the inliner 
to explicitly keep the analysis up to date, that would work too.


[Ivan] Calculation for recursive functions should be slightly different.

>> (See also my note at the bottom about iteration order.  I consider thatsomewhat out of scope for this proposal, but it effects both the ideal and
>> practical sections herein.)
>> The Practical Answer
>> Essentially, the remainder of this is an approximation of the pass
manager>> based solution which let's us start experimenting with the inliner
changes>> while waiting for the pass manager.  At least in theory, when we havethe pass manager in place, we can simply drop most of
this.>> The basic idea is to use metadata to effectively cache the estimatedfrequency for a given call site.  We can use BFI to generate/update
this>> information once per iteration of the outer inlining loop.  As a
result, we>> only need to keep it reasonably up to date within the inner inlining
loop. (See note below about alternate approaches.)>> The metadata on the call site would look something like:
>> call void @foo() !prof !"estimated_execution_count" 2000
(where 2000 is
the estimated count)> Can you use callgraph to keep this information instead?
That would be another option.  Not a bad one actually.  It would mean 
having to explicitly populate the information into the callgraph by 
calling per function analysis passes.  Not sure how easy that is to do.

I'm also leery relying on the call graph since I know that's going to 
radically change with the new pass manager.  I really, really, really do 
not want to be blocked behind Chandlers progress.


[Ivan] At this point, it seems better to use metadata on the call site
(Philip's proposal).

Soon indirect call sites will carry the following metadata:
  call void %funPtr(i32 10), !prof !1
  !1 = !{!"indirect_call_targets", i64 6000, !"foo", i64
5000, !"bar", i64
100}
details at http://lists.llvm.org/pipermail/llvm-dev/2015-April/084576.html
We should plan to use these for inline analysis and profile count updates.

>> We'd have a new FunctionPass which removes any existing metadata
and
adds>> new metadata which basically comes down to a product of the functions
entry>> count and the estimated block frequency (provided by BFI) of the blockcontaining the call site.  This would run as the last FunctionPass  within
the outer inlining loop.  Assuming that the entry counts are
kept>> reasonably up to date, the resulting estimated call site counts should
be>> reasonable.
>> Within the inliner itself, we'd need to update InlineFunction to
keep
the estimated counts in sync:>> - When inlining B into A, split the estimated call counts on the callswithin B between the remaining instance of B and the newly created call
sites within A.  I plan to split the estimated count in roughly the ratio
of>> the entries into B.  As a result, a given call site BC (originally fromB into C) would be split into two sites AC, and BC with estimated
counts>> (AB.count/B.entry_count * OrigBC.count) and ((1-AB.count/B.entry_count)* OrigBC.count).  This does require updating the body of the callee,
not>> just the caller when inlining.
> This scaling scheme assumes context-insensitivity. See your examplebelow which it does not apply. However this is no better way.

I assume you meant "there is no better way".  If so, agreed. Assuming 
that you're not recomputing the call count from another source at least.


[Ivan] That is the Constant ratio assumption: for any procedure body and
any call site contained in the body, the expected number of executions of
the call site per execution of the body is constant (Scheiffer, An
analysis of inline substitution for a structured programming language,
CACM, 1977).
The results in this paper show that the use of multilevel history
(context-sensitivity) is not warranted, and several compilers use this
constant ratio assumption. For now, it might be reasonable to use it in
LLVM too.

>> It's important to note that the resulting estimated call counts can
be
> just
>> flat out wrong.  As the easiest example, consider a case where B called
C,>> but only when a boolean parameter was set to true.  If we split the
count of>> BC into AC, BC and then drop the call site AC, we've essentially
lostinformation about the frequency of the remaining BC w.r.t. any other
callers>> of C.  We could try to adjust for this by only updating calls which
don't>> get pruned away by constant propagation within InlineFunction, but
I'm
not>> sure how worthwhile it is trying to be smart here given the information
will>> be roughly restored once we're out of the inner loop again.
> Why is wrong? should it make the result more precise? Basically C isnever called from B if B is called from A in this case. In such as  case, 
edge BC's count does not need to be adjusted (though B's entry  count
needs to be adjusted).

To clarify: "wrong" was meant to imply inaccurate, and confusing. Not 
"wrong" in the sense of miscompile.

The problem is not the count at call site AC or even BC.  The problem is 
that we can have some other call site XC.

We start with:
A calls B with call site count 200
B calls C with call site count 200 (but never when called from A)  X calls
C with call site count 50
A's entry count is 200
B's entry count is 210
C's entry count is 250

After inlining AB, we get:
A calls C with call site count 0 (because we proved away the call site)  B
calls C with call site count 10/210 (small fraction of original)  X calls
C with call site count 50
A's entry count is 200
B's entry count is 10
C's entry count is 250

The information for BC is wildly inaccurate and we've mistated the ratio 
between call sites BC and BX.  Any decision we make off this data is 
likely to be questionable.  This isn't a correctness issue, but it's 
definitely makes writing optimization heuristics harder.

(Note that the ability to rescale by block frequency within B solves  this
problem quite nicely.  Thus, we want the pass manager changes!)
>> What we can assume (and thus make inlining decisions on), is that a) a
given>> call sites count is a reasonable approximation of it's actual
executioncount and b) that the sum of the call site counts for a given callee is
less>> than or equal to the callee's entry count.  What we can't do is
compare
two>> call sites counts for the same callee within two different functions
and>> decide with high confidence which is more frequent.
>> Alternate Approaches:
>> 1) We could just recompute BFI explicitly in the inliner right beforepassing the result to ICA for the purposes of prototyping. If this was off
by default, this might be a reasonable scheme for
investigation.>> This could  likely never be enabled for real uses.
>> 2) We could pre-compute BFI once per function within a given SCC and
then>> try to keep it up to date during inlining.  If we cached the callfrequencies for the initial call sites, we could adjust the visit order to
minimize the number of times we need to recompute a given
functions>> block
>> frequencies.  (e.g. we can look at all the original call sites within afunction before looking at newly inlined ones)

[Ivan] How about a pass called immediately before the Inliner, which will
use BFI to compute and add calsite frequencies as metadata to callsite
instructions? During each inline step/transformation, the Inliner will
update the impacted callsite frequencies as well as function entry counts.



_______________________________________________
LLVM Developers mailing list
LLV... at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Xinliang David Li via llvm-dev

2015-Sep-01 19:40 UTC

head link

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

Ivan, thanks for looking into this. See my reply below.
> level) to be available during IPA/Inliner. For instance to estimate
> icache footprint using weighted size etc.
>
> I addressed this specifically in my original email.  The use of the  extra
> metadata based caching scheme is temporary.  The inlining
> heuristics and updating of entry counts would not be.  They'd just be
> driven by BFI directly.
yes -- any profile update patch that is independent of pass manager
change is welcome :)


>>> are also now approximate. Given typical profiling mechanisms are
lossy,
> that's not a big deal, but is worth noting explicitly.)
>> yes. BB count is computed from entry_count and BB freq; call count is
> obtained from BB count. Two things need to be updated during inlining  1)
> entry count of the remaining out of line instance of the callee  2) BB
> frequency of inline instance of the callee.
>> see also
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-March/083943.html
>
> This is exactly correct.  I was assuming that the BFI function pass new
> how to update itself (i.e. lazy rebuild), but if you wanted the inliner
> to explicitly keep the analysis up to date, that would work too.
I am not sure what you mean by BFI function pass know how to update
itself. What we want here is that we need  to compute BFI for inline
instances via scaling (of the callee BFI) to avoid expensive
recomputation of BFI in caller after inlining happens.

The good news is that Cong has made changes in BFI/BPI so that they
can be computed on demand and decoupled from pass manager. This
greatly simplifies the profile update work in inliner.  Since the
inliner is bottom up, we can also release BFI info for the callees
once the inliner is done with them (i.e. inlined into callers).
>> Can you use callgraph to keep this information instead?
>
> That would be another option.  Not a bad one actually.  It would mean
> having to explicitly populate the information into the callgraph by
> calling per function analysis passes.  Not sure how easy that is to do.
>
> I'm also leery relying on the call graph since I know that's going
to
> radically change with the new pass manager.  I really, really, really do
> not want to be blocked behind Chandlers progress.
>
>
> [Ivan] At this point, it seems better to use metadata on the call site
> (Philip's proposal).
See above. I think it is possible to do BFI update now without relying
on the new pass manager. I also expect this change to be compatible
with (or easily adaptable to) the new pass manager once it is in
place.

> below which it does not apply. However this is no better way.
>
> I assume you meant "there is no better way".  If so, agreed.
Assuming
> that you're not recomputing the call count from another source at
least.
>
>
> [Ivan] That is the Constant ratio assumption: for any procedure body and
> any call site contained in the body, the expected number of executions of
> the call site per execution of the body is constant (Scheiffer, An
> analysis of inline substitution for a structured programming language,
> CACM, 1977).
> The results in this paper show that the use of multilevel history
> (context-sensitivity) is not warranted, and several compilers use this
> constant ratio assumption. For now, it might be reasonable to use it in
> LLVM too.
You are quoting a pretty old paper. Context sensitivity is pretty
important for modern C++ programs with lots of abstractions. Same
small functions can be used in large number of contexts with totally
different profile: the same loop can have iterations from 1 to
millions; the same indirect callsite can have too many hot targets in
the merged context; the merged branch prob can be close to 50/50
making it useless for block layout; the merged switch profile can be
evenly distributed making it less useful.

The best/efficient way to solve this is to enable pre-inlining (see
Rong's RFC) -- but that is a different topic.  Smarter post-inline
update can also be helpful -- but probably not something we need to
worry about initially.
>
>
> minimize the number of times we need to recompute a given
> functions
>>> block
>>> frequencies.  (e.g. we can look at all the original call sites
within a
> function before looking at newly inlined ones)
>
> [Ivan] How about a pass called immediately before the Inliner, which will
> use BFI to compute and add calsite frequencies as metadata to callsite
> instructions? During each inline step/transformation, the Inliner will
> update the impacted callsite frequencies as well as function entry counts.
>
>
See above, I think it is possible to do this without requiring  using
new meta data.

thanks,

David
>
> _______________________________________________
> LLVM Developers mailing list
> LLV... at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

llvm dev - Aug 2015 - [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?

[llvm-dev] [LLVMdev] Path forward on profile guided inlining?