thr3ads.net - llvm dev - [LLVMdev] LLVM Inliner [Nov 2010]

Xinliang David Li

2010-Nov-29 07:39 UTC

[LLVMdev] LLVM Inliner

On Sun, Nov 28, 2010 at 2:37 PM, Chris Lattner <clattner at apple.com> wrote:
> On Nov 23, 2010, at 5:07 PM, Xinliang David Li wrote:
> > Hi, I browsed the LLVM inliner implementation, and it seems there is room
> > for improvement.  (I have not read it too carefully, so correct me if what
> > I observed is wrong).
> >
> > First the good side of the inliner -- the function level summary and
> > inline cost estimation are more elaborate and complete than gcc's. For
> > instance, it considers callsite arguments and the effects of optimization
> > enabled by inlining.
>
> Yep, as others pointed out, this is intended to interact closely with the
> per-function optimizations that get mixed in due to the inliner being a
> callgraphscc pass.  This is actually a really important property of the
> inliner.  If you have a function foo that calls a leaf function bar, the
> sequence of optimization is:
>
> 1. Run the inliner on bar (noop, since it has no call sites)
> 2. Run the per-function passes on bar.  This generally shrinks it, and
> prevents "abstraction penalty" from making bar look too big to inline.
> 3. Run the inliner on foo.  Since foo calls bar, we consider inlining bar
> into foo and do so if profitable.
> 4. Run the per-function passes on foo.  If bar got inlined, this means that
> we're running the per-function passes over the inlined contents of bar
> again.
>
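
To illustrate the ordering described above (callees are processed before their callers), here is a minimal C++ sketch. It is not LLVM code: the `CallGraph` alias and the `bottomUpOrder` function are hypothetical names for this example, and recursion cycles are simply skipped rather than collapsed into SCCs as a real callgraph SCC pass manager would do.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy call graph: caller name -> names of its callees.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Post-order DFS: a function is appended only after all of its callees.
// Assumption: back edges (recursion) are ignored instead of forming SCCs.
static void visit(const CallGraph &cg, const std::string &fn,
                  std::set<std::string> &seen,
                  std::vector<std::string> &order) {
  if (!seen.insert(fn).second)
    return;  // already visited
  auto it = cg.find(fn);
  if (it != cg.end())
    for (const std::string &callee : it->second)
      visit(cg, callee, seen, order);
  order.push_back(fn);
}

// Returns the order in which "inline, then run per-function passes"
// would be applied: leaf functions first, their callers afterwards.
std::vector<std::string> bottomUpOrder(const CallGraph &cg) {
  std::set<std::string> seen;
  std::vector<std::string> order;
  for (const auto &entry : cg)
    visit(cg, entry.first, seen, order);
  return order;
}
```

For a graph in which qux calls foo and foo calls bar, this yields bar, foo, qux, so bar has already been simplified by the time the inliner considers the call site in foo.
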
On-the-fly cleanup (optimization) while doing bottom-up inlining is nice, as
you described. Many other compilers chose not to do it this way due to
scalability concerns (with IPO) -- this can make IPO the biggest bottleneck
in terms of compile time (as it is serialized).  Memory may not be a big
issue for LLVM, as I can see the good locality in the pass manager. (Just
curious, what is the biggest application LLVM can build with IPO?)

>
> In a traditional optimizer like GCC's, you end up with problems where you
> have to set a high inline threshold due to inlining-before-optimizing
> causing "abstraction penalty problems".  An early inliner is a hack that
> tries to address this.

It is a hack in some sense (though a common practice), but it enables other
kinds of flexibility.

>  Another problem with this approach from the compile time perspective is
> that you end up repeating work multiple times.  For example, if there is a
> common subexpression in a small function, you end up inlining it into many
> places, then having to eliminate the common subexpression in each copy.
>
Early inlining + scalar opt can do the same, right?

>
> The LLVM inliner avoids these problems, but (as you point out) this really
> does force it to being a bottom-up inliner.  This means that the bottom-up
> inliner needs to make decisions in strange ways in some cases: for example
> if qux called foo, and foo were static, then (when processing foo) we may
> decide not to inline bar into foo because it would be more profitable to
> inline foo into qux.
>
> > Now more to the weakness of the inliner:
> >
> > 1) It is bottom up.  The inlining is not done in the order based on the
> > priority of the callsites.  It may leave important callsites (at top of the
> > cg) uninlined due to higher cost after inline cost update. It also eliminates
> > the possibility of inline specialization. To change this, the inliner pass
> > may not be able to use the pass manager infrastructure.  (I noticed a hack
> > in the inliner to work around the problem -- for static functions, avoid
> > inlining their callees if it causes them to become too big.)
>
> This is true, but I don't think it's a really large problem in practice.
> We don't have a "global inline threshold limit" (which I've never
> understood, except as a hack to prevent run-away inlining) so not visiting
> in priority order shouldn't prevent high-priority-but-processed-late
> candidates from being inlined.
>
A global threshold can be used to control unnecessary size growth. In some
cases, the size increase may also increase the icache footprint, leading
to poor performance. In fact, with IPO/CMO, icache footprint can be modeled
in some way and used as one kind of global limit.

>
> The only potential issue I'm aware of is if we have A->B->C and we decide
> to inline C into B when it would be more profitable to inline B into A and
> leave C out of line.  This can be handled with a heuristic like the one
> above.
>
> > 2) There seems to be only one inliner pass.  For calls to small
> > functions, it is better to perform early inlining as one of the local (per
> > function) optimizations followed by scalar opt clean up. This will sharpen
> > the summary information.  (Note the inline summary update does not consider
> > the possible cleanup)
>
> Yep.  This is a feature :)
>
> > 3)  recursive inlining is not supported
>
> This is a policy decision.  It's not clear whether it is really a good
> idea, though I have seen some bugzilla or something about it.  I agree that
> it should be revisited.
>
> > 4) A function with an indirect branch is not inlined. What source construct
> > does the indirect branch instruction correspond to? A variable jump?
>
> See:
> http://blog.llvm.org/2010/01/address-of-label-and-indirect-branches.html
>
> for more details.
>
> > 6) One heuristic used in the inline-cost computation seems wrong:
> >
> >   // Calls usually take a long time, so they make the inlining gain smaller.
> >   InlineCost += CalleeFI->Metrics.NumCalls * InlineConstants::CallPenalty;
> >
> > Does it try to block inlining of callees with lots of calls? Note
> > inlining such a function only increases static call counts.
>
> I think that this is a heuristic that Jakob came up with, but I think it's
> a good one, also discussed elsewhere on the thread.
>
> When talking about inlining and tuning thresholds and heuristics, it is a
> good idea to quantify what the expected or possible wins of inlining a
> function are.  Some of the ones I'm aware of:
>
> 1. In some cases, inlining shrinks code.
>
> 2. Inlining a function exposes optimization opportunities on the inlined
> code, because constant propagation and other simplifications can take place.
>
> 3. Inlining a function exposes optimizations in the caller because
> address-taken values can be promoted to registers.
>
> 4. Inlining a function can improve optimization in a caller because
> interprocedural side-effect analysis isn't needed.  For example, load/call
> dependence may not be precise.  This is something we should continue to
> improve in the optimizer though.
>
> 5. Inlining code with indirect call sites and switches can improve branch
> prediction if some callers of the function are biased differently than other
> callers.  This is pretty hard to predict without profile info though.
>
Besides these, there are also: 1) reducing call overhead; 2) scheduling
freedom; 3) enabling optimizations across inlined instances of the callee(s);
4) sharpening local analysis results (mainly aliasing), such as points-to,
malloc etc.

Inlining may also lose aliasing assertions (such as restrict aliasing) if not
done properly.

>
> The "punish functions containing lots of calls" is based on the assumption
> that functions which are mostly calls (again, this decision happens after
> the callee has been inlined and simplified) aren't themselves doing much
> work.
>
My point is that using the static count of callsites as an indicator for this
can be misleading. All the calls may be calls to cold external functions, for
instance.
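
The disagreement above can be made concrete with a small C++ sketch of a call-penalty cost model. It is only an illustration: `FunctionMetrics`, `estimateInlineCost`, and the two constants are hypothetical stand-ins chosen for this example, not LLVM's actual types or values.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-function metrics mentioned in the thread.
struct FunctionMetrics {
  int NumInsts;  // instruction count of the (already simplified) callee
  int NumCalls;  // static number of call sites inside the callee
};

// Illustrative values only; these are assumptions, not LLVM's constants.
constexpr int kInstrCost = 5;
constexpr int kCallPenalty = 25;

// Cost grows with callee size, plus a fixed penalty per static call site.
// The penalty is applied regardless of how often each call executes, so a
// callee whose calls are all cold is charged the same as one whose calls
// are hot.
int estimateInlineCost(const FunctionMetrics &callee) {
  int cost = callee.NumInsts * kInstrCost;
  cost += callee.NumCalls * kCallPenalty;
  return cost;
}
```

Under these assumed constants, a 10-instruction callee with four call sites costs three times as much as one with none, even if those four calls sit on rarely executed paths. Weighting each call by an execution-frequency estimate would be one way to address that objection, but it needs profile or static frequency information that this sketch does not model.
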

Thanks,

David


>
> -Chris
>

Rafael Espíndola

2010-Nov-29 12:07 UTC

[LLVMdev] LLVM Inliner

> On-the-fly clean up (optimization) while doing bottom up inlining is nice as
> you described. Many other compilers chose not to do it this way due to
> scalability concerns (with IPO) -- this can make the IPO the biggest
> bottleneck in terms of compile time (as it is serialized).  Memory may not be
> a big issue for LLVM as I can see the good locality in pass manager. (Just
> curious, what is the biggest application LLVM can build with IPO?)

I am not sure which is the biggest one, but the one I tried was clang
itself with LTO:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-October/035584.html

Cheers,
Rafael

Chris Lattner

2010-Nov-29 18:56 UTC

[LLVMdev] LLVM Inliner

On Nov 28, 2010, at 11:39 PM, Xinliang David Li wrote:
>> 1. Run the inliner on bar (noop, since it has no call sites)
>> 2. Run the per-function passes on bar.  This generally shrinks it, and
>> prevents "abstraction penalty" from making bar look too big to inline.
>> 3. Run the inliner on foo.  Since foo calls bar, we consider inlining bar
>> into foo and do so if profitable.
>> 4. Run the per-function passes on foo.  If bar got inlined, this means that
>> we're running the per-function passes over the inlined contents of bar
>> again.
>
> On-the-fly clean up (optimization) while doing bottom up inlining is nice
> as you described. Many other compilers chose not to do it this way due to
> scalability concerns (with IPO) -- this can make the IPO the biggest
> bottleneck in terms of compile time (as it is serialized).  Memory may not be
> a big issue for LLVM as I can see the good locality in pass manager. (Just
> curious, what is the biggest application LLVM can build with IPO?)

I don't really know, and I agree with you that LLVM's LTO isn't very
scalable (it currently loads all the IR into memory).  I haven't thought a
lot about this, but I'd tackle that problem in three stages:

1. Our LTO model runs optimizations at both compile and link time; the
compile-time optimizations should work as they do now IMO.  This is
controversial though, because doing so could cause (e.g.) an inlining to happen
"early" that would be seen as a bad idea with full LTO information.
The advantage of doing compile-time optimizations is that it both shrinks the
IR and speeds up an incremental rebuild by avoiding having to do simple
optimizations again.

2. At LTO time, the bottom-up processing of the callgraph is still goodness and
presents good locality (unless you have very very large SCC's).  The tweak
that we'd have to implement is lazy deserialization (already implemented)
and reserialization to disk (which is missing).  With this, you get much better
memory footprint than "hold everything in memory at once".

3. To support multiple cores/machines, you break the callgraph SCC DAG into
parallel chunks that can be farmed out.  There is a lot of parallelism in a DAG.
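
As a rough illustration of stage 3, the C++ sketch below groups the nodes of a dependency DAG into "waves" whose members do not depend on each other and could therefore be handed to different workers. Everything here is an assumption made for the example: each node stands for one call-graph SCC, `deps[i]` lists the callee SCCs of node `i`, and `parallelWaves` is a hypothetical helper, not something LLVM provides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// deps[i] lists the nodes (callee SCCs) that node i depends on.
// Returns waves of node indices: every node appears in the first wave in
// which all of its dependencies have already been placed in earlier waves.
std::vector<std::vector<int>>
parallelWaves(const std::vector<std::vector<int>> &deps) {
  const std::size_t n = deps.size();
  std::vector<bool> done(n, false);
  std::vector<std::vector<int>> waves;
  std::size_t assigned = 0;
  while (assigned < n) {
    std::vector<int> wave;
    for (std::size_t i = 0; i < n; ++i) {
      if (done[i])
        continue;
      bool ready = true;
      for (int d : deps[i])
        if (!done[static_cast<std::size_t>(d)]) {
          ready = false;
          break;
        }
      if (ready)
        wave.push_back(static_cast<int>(i));
    }
    if (wave.empty())
      break;  // remaining nodes form a cycle: the input was not a DAG
    // Mark the wave as done only after it is complete, so that nodes in the
    // same wave never depend on one another.
    for (int i : wave)
      done[static_cast<std::size_t>(i)] = true;
    assigned += wave.size();
    waves.push_back(wave);
  }
  return waves;
}
```

With two independent leaves 0 and 1, a node 2 that calls both, and a node 3 that calls 2, the waves are {0, 1}, then {2}, then {3}. Only the first wave offers any parallelism here, which hints that the achievable speedup depends heavily on the shape of the graph.
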

I don't know of anyone planning on working on LTO at the moment though.

>> In a traditional optimizer like GCC's, you end up with problems where
>> you have to set a high inline threshold due to inlining-before-optimizing
>> causing "abstraction penalty problems".  An early inliner is a hack
>> that tries to address this.
>
> It is a hack in some sense (but a common practice) -- but enables other
> flexibilities.

The hack I'm referring to is the "raise the inline threshold".  If
the inliner has any language specificity to its inline threshold, I consider it
a hack.  There is no reason the inliner should have to know if it's building
C or C++ code.  It should be guided based on the structure of the code.

>> Another problem with this approach from the compile time perspective is
>> that you end up repeating work multiple times.  For example, if there is a
>> common subexpression in a small function, you end up inlining it into many
>> places, then having to eliminate the common subexpression in each copy.
>
> Early inlining + scalar opt can do the same, right?

In some cases, but not in general, because you run into phase ordering problems.

>> This is true, but I don't think it's a really large problem in
>> practice.  We don't have a "global inline threshold limit" (which
>> I've never understood, except as a hack to prevent run-away inlining) so not
>> visiting in priority order shouldn't prevent
>> high-priority-but-processed-late candidates from being inlined.
>
> global threshold can be used to control the unnecessary size growth. In
> some cases, the size increase may also cause increase in icache footprint
> leading to poor performance. In fact, with IPO/CMO, icache footprint can be
> modeled in some way and be used as one kind of global limit.

I understand that, but that implies that you have some model for code locality.
Setting a global code growth limit is (in my opinion) a hack unless you are
aiming for the whole program to fit in the icache (which I don't think
anyone tries to do :).

With any other limit that is higher than your icache size, you are basically
picking an *arbitrary* limit that is not based on the machine model or the
instruction locality of the program.

>> The "punish functions containing lots of calls" is based on the
>> assumption that functions which are mostly calls (again, this decision
>> happens after the callee has been inlined and simplified) aren't themselves
>> doing much work.
>
> My point is that using static count of callsites as an indicator for this
> can be misleading. All the calls may be calls to cold external functions for
> instance.

Absolutely true.  It may also be completely wrong for some functions.  It's
a heuristic :)

-Chris


Xinliang David Li

2010-Nov-30 22:19 UTC

[LLVMdev] LLVM Inliner

On Mon, Nov 29, 2010 at 10:56 AM, Chris Lattner <clattner at apple.com> wrote:
> On Nov 28, 2010, at 11:39 PM, Xinliang David Li wrote:
>
>>> 1. Run the inliner on bar (noop, since it has no call sites)
>>> 2. Run the per-function passes on bar.  This generally shrinks it, and
>>> prevents "abstraction penalty" from making bar look too big to inline.
>>> 3. Run the inliner on foo.  Since foo calls bar, we consider inlining bar
>>> into foo and do so if profitable.
>>> 4. Run the per-function passes on foo.  If bar got inlined, this means
>>> that we're running the per-function passes over the inlined contents of
>>> bar again.
>>
>> On-the-fly clean up (optimization) while doing bottom up inlining is nice
>> as you described. Many other compilers chose not to do it this way due to
>> scalability concerns (with IPO) -- this can make the IPO the biggest
>> bottleneck in terms of compile time (as it is serialized).  Memory may not
>> be a big issue for LLVM as I can see the good locality in pass manager.
>> (Just curious, what is the biggest application LLVM can build with IPO?)
>
> I don't really know, and I agree with you that LLVM's LTO isn't very
> scalable (it currently loads all the IR into memory).  I haven't thought a
> lot about this, but I'd tackle that problem in three stages:
>
> 1. Our LTO model runs optimizations at both compile and link time, the
> compile-time optimizations should work as they do now IMO.  This is
> controversial though, because doing so could cause (e.g.) an inlining to
> happen "early" that would be seen as a bad idea with full LTO information.
> The advantage of doing compile-time optimizations is that it both shrinks
> the IR, and speeds up an incremental rebuild by avoiding having to do simple
> optimizations again.
>
> 2. At LTO time, the bottom-up processing of the callgraph is still goodness
> and presents good locality (unless you have very very large SCC's).  The
> tweak that we'd have to implement is lazy deserialization (already
> implemented) and reserialization to disk (which is missing).  With this, you
> get much better memory footprint than "hold everything in memory at once".
>

IR is just one memory consumer.  In LTO, there are also global data
structures: global symtab, global type table, call graph, points-to graph,
mod-ref info etc. In a compiler I worked with before, serialization is done
on the points-to graph and mod-ref info after the info is mapped from a global
view to a per-TU local view, and the IR for each TU is mapped/unmapped on
demand. The type/symbol info per TU is in a different segment from the
code segment.

>
> 3. To support multiple cores/machines, you break the callgraph SCC DAG into
> parallel chunks that can be farmed out.  There is a lot of parallelism in a
> DAG.
>
Parallelism distributed across machines can be tricky -- it involves lots of
overhead such as RPC and data passing.

You may also be surprised by side effects of multi-core parallelism, such as
thrashing due to memory contention -- some per-function passes may use lots
of memory for temporary data structures. It may not scale well for the
compiler workload -- e.g. you might get only a 2x speedup using 8 cores.


>
> I understand that, but that implies that you have some model for code
> locality.  Setting a global code growth limit is (in my opinion) a hack
> unless you are aiming for the whole program to fit in the icache (which I
> don't think anyone tries to do :).
>
>
Yes, a global growth limit may be good for size control, but is a hack for
controlling icache footprint. However, as I mentioned, the bottom-up inline
scheme makes it impossible to use any heuristics involving a 'global limit',
which can be more complicated and fancier than the simple growth limit.  For
instance, there is no restriction that only one global limit can be used --
the compiler can partition the call graph into multiple locality regions,
and set an icache limit for each region. The inlining order can be decided on
a region by region basis. For each region, the region limit is applied and a
priority queue must be used.
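
A minimal C++ sketch of the priority-ordered, budget-limited selection described above follows. It is only an illustration under stated assumptions: `Candidate`, `ByBenefit`, and `selectInlines` are hypothetical names, the benefit and growth numbers are invented inputs, and a real inliner would re-estimate the remaining candidates after each inlining rather than use fixed values.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical description of one call site within a locality region.
struct Candidate {
  std::string callSite;  // label such as "A->B"
  double benefit;        // assumed estimate of the gain from inlining
  int growth;            // assumed estimate of the code size increase
};

// Orders the queue so that the highest-benefit candidate is on top.
struct ByBenefit {
  bool operator()(const Candidate &a, const Candidate &b) const {
    return a.benefit < b.benefit;
  }
};

// Greedily accepts candidates in decreasing benefit order while the
// region's size budget allows it; candidates that do not fit are skipped.
std::vector<std::string> selectInlines(const std::vector<Candidate> &cands,
                                       int budget) {
  std::priority_queue<Candidate, std::vector<Candidate>, ByBenefit> queue;
  for (const Candidate &c : cands)
    queue.push(c);

  std::vector<std::string> chosen;
  while (!queue.empty()) {
    Candidate c = queue.top();
    queue.pop();
    if (c.growth > budget)
      continue;  // would exceed the remaining budget for this region
    budget -= c.growth;
    chosen.push_back(c.callSite);
  }
  return chosen;
}
```

Calling `selectInlines` once per region, each with its own budget, corresponds to the region-by-region scheme. The important call sites are visited first regardless of their position in the call graph, which is the property a strictly bottom-up traversal cannot provide.
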

Thanks,

David



> With any other limit that is higher than your icache size, you are
> basically picking an *arbitrary* limit that is not based on the machine
> model or the instruction locality of the program.
>
>>> The "punish functions containing lots of calls" is based on the assumption
>>> that functions which are mostly calls (again, this decision happens after
>>> the callee has been inlined and simplified) aren't themselves doing much
>>> work.
>>
>> My point is that using static count of callsites as an indicator for this
>> can be misleading. All the calls may be calls to cold external functions
>> for instance.
>
> Absolutely true.  It may also be completely wrong for some functions.  It's
> a heuristic :)
>
> -Chris
>
