Xinliang David Li via llvm-dev
2016-Mar-10 17:00 UTC
[llvm-dev] [RFC] Target-specific parametrization of function inliner
IMO, a good inliner with a precise cost/benefit model will eventually need
what Art is proposing here.

Take function call overhead as an example. It depends on several factors:
1) call/return instruction latency; 2) the function prologue/epilogue;
3) the calling convention (argument passing, whether registers are used,
which register classes, etc.). All of these factors depend on target
information. Going deeper, we know certain microarchitectures use a stack
of call/return pairs to help branch prediction of ret instructions -- such
a stack has a target-specific depth limit that can be exceeded when a
callsite sits deep in the call chain. Register file size, and the register
pressure increase caused by inlining, is another example.

Another relevant example is the icache/itlb sizes. A more precise analysis
of the 'speed' cost due to icache/itlb pressure increase requires target
information, profile information, and some global analysis. Easwaran has
done research in this area in the past and can share the analysis design
when other things are ready.

> Hi Art,
>
> I've long thought that we should have a more principled way of doing
> inline profitability. There is obviously some cost to executing a
> function body, some call site overhead, and some cost reduction
> associated with any post-inlining simplifications. If inlining reduces
> the overall call site cost by more than some factor, say 1% (this should
> probably depend on the optimization level), then we should inline. With
> profiling information, we might even use global speedup instead of local
> speedup.

Yes -- with target-specific cost information, global speedup analysis can
be more precise :)

> Whether we need a target customization of this threshold, or just a way
> for a target to supplement the final inlining decision, is unclear to
> me. It is also true that the result of a bunch of locally-optimal
> decisions might be far from the global optimum. Maybe the target has
> something to say about that?

The concept of threshold can be a topic for another discussion. In the
current design, I think the threshold should remain target independent;
it is the cost that is target specific.

thanks,

David

> In short, I'm fine with what you're proposing, but to the extent
> possible, I want the numbers provided by the target to mean something.
> Replacing a global set of somewhat-arbitrary magic numbers with
> target-specific sets of somewhat-arbitrary magic numbers should be our
> last choice.
>
> Thanks again,
> Hal
>
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
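[Editor's note] The per-call overhead factors David lists (call/return latency, prologue/epilogue, calling convention, return-address-stack depth) can be sketched as a small target-parameterized cost model. This is an illustrative sketch only, not LLVM's actual TargetTransformInfo interface; every name and number below is hypothetical.

```cpp
#include <algorithm>

// Hypothetical per-target parameters for the overhead of one call site.
struct TargetCallParams {
  int CallRetLatency;       // 1) call/return instruction latency
  int PrologueEpilogueCost; // 2) callee prologue/epilogue
  int CostPerStackArg;      // 3) calling convention: stack-passed arguments
  int NumArgRegs;           //    arguments beyond this go through memory
  int RetStackDepth;        // return-address-stack limit (branch prediction)
};

// Estimated dynamic overhead of one call site, in abstract cost units.
int callOverhead(const TargetCallParams &TP, int NumArgs, int CallDepth) {
  int Cost = TP.CallRetLatency + TP.PrologueEpilogueCost;
  // Arguments that do not fit in registers are passed on the stack.
  Cost += std::max(0, NumArgs - TP.NumArgRegs) * TP.CostPerStackArg;
  // A callsite deep in the call chain can overflow the return-address
  // stack, hurting prediction of the matching ret.
  if (CallDepth > TP.RetStackDepth)
    Cost += TP.CallRetLatency; // crude mispredict penalty
  return Cost;
}
```

Inlining the call removes exactly this overhead, which is why each term has to come from target information rather than a global constant.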
Hal Finkel via llvm-dev
2016-Apr-01 19:10 UTC
[llvm-dev] [RFC] Target-specific parametrization of function inliner
----- Original Message -----
> From: "Xinliang David Li" <davidxl at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Artem Belevich" <tra at google.com>, "llvm-dev"
> <llvm-dev at lists.llvm.org>, "chandlerc" <chandlerc at gmail.com>,
> "Easwaran Raman" <eraman at google.com>
> Sent: Thursday, March 10, 2016 11:00:30 AM
> Subject: Re: [llvm-dev] [RFC] Target-specific parametrization of
> function inliner
>
> IMO, a good inliner with a precise cost/benefit model will eventually
> need what Art is proposing here.
>
> [...]
>
> Another relevant example is the icache/itlb sizes. A more precise
> analysis of the 'speed' cost due to icache/itlb pressure increase
> requires target information, profile information, and some global
> analysis. Easwaran has done research in this area in the past and can
> share the analysis design when other things are ready.

I don't know what you mean by "when other things are ready", but what you
say above sounds exactly right. I'm certainly curious what Easwaran has
found.

Generally, there seem to be two categories here:

1. Locally decidable issues, for which there are (or can be) good static
heuristics (call latencies, costs associated with parameter passing, stack
spilling, etc.)
2. Globally decidable issues, like reducing the number of pages consumed
by temporally-correlated hot code regions -- profiling data is likely
necessary for good decision-making here (although it might be possible to
set a reasonable function-local threshold based on page size without it)

and then there are things like icache/itlb effects due to multiple
applications running simultaneously, for which profiling might help, but
which are also policy-level decisions over which users may need
more-direct control.

> The concept of threshold can be a topic for another discussion. In the
> current design, I think the threshold should remain target independent;
> it is the cost that is target specific.

That's fine, but the units are important here. Having a target-independent
threshold in terms of, roughly, instruction count makes little sense: how
instruction count correlates with either performance or code size is
highly target specific (although the correlation is certainly closer for
code size). That, however, is roughly what our TTI.getUserCost gives us.
Having target-independent thresholds like % speedup (e.g. inlining should
be done when the speedup is > some %) or code-size thresholds (e.g.
functions spanning more than 4 KB are bad) makes sense.

 -Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
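[Editor's note] Hal's distinction between units can be sketched directly: the costs come from the target, while the threshold is a target-independent % speedup. The struct and function names below are hypothetical and not LLVM APIs; they only illustrate the units involved.

```cpp
// Hypothetical target-specific cost inputs for one call site, in abstract
// cycles. In a real implementation these would come from TTI-style queries.
struct SiteCosts {
  double CallOverhead;    // cost of the call/return mechanics themselves
  double CalleeBodyCost;  // cost of executing the callee body
  double SimplifySavings; // cost removed by post-inlining simplification
};

// Local speedup (as a fraction) obtained by inlining this call site.
double localSpeedup(const SiteCosts &C) {
  double Before = C.CallOverhead + C.CalleeBodyCost;
  double After = C.CalleeBodyCost - C.SimplifySavings;
  return (Before - After) / Before;
}

// Target-independent policy: inline when the speedup beats a % threshold.
// Only the inputs to localSpeedup() are target specific.
bool shouldInline(const SiteCosts &C, double ThresholdPct) {
  return localSpeedup(C) * 100.0 > ThresholdPct;
}
```

For example, a callee costing 98 cycles behind a 2-cycle call gives a 2% local speedup, so it clears a 1% threshold but not a 3% one; the threshold never needs to know the target's instruction costs.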
Xinliang David Li via llvm-dev
2016-Apr-06 17:42 UTC
[llvm-dev] [RFC] Target-specific parametrization of function inliner
On Fri, Apr 1, 2016 at 12:10 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> I don't know what you mean by "when other things are ready", but what
> you say above sounds exactly right. I'm certainly curious what Easwaran
> has found.

By readiness, I mean the basic infrastructure support (such as making
profile data available to the inliner) and related tuning based on simple
heuristics. The former will be revisited very soon. That work will be
useful for establishing a good baseline before we start engaging in more
sophisticated analysis.

> Generally, there seem to be two categories here:
>
> 1. Locally decidable issues, for which there are (or can be) good static
> heuristics (call latencies, costs associated with parameter passing,
> stack spilling, etc.)
> 2. Globally decidable issues, like reducing the number of pages consumed
> by temporally-correlated hot code regions -- profiling data is likely
> necessary for good decision-making here (although it might be possible
> to set a reasonable function-local threshold based on page size without
> it)

Program-level static analysis needs to be combined with profile data to
form independent or nested hot regions (with cache/tlb reuse). For global
inlining decisions, there won't be a single budget: the decision will
depend heavily on the region nest in which the callsite sits. Another side
effect is that we may need a more flexible inlining order (based on the
net benefit of inlining each callsite) than the current bottom-up scheme.

> and then there are things like icache/itlb effects due to multiple
> applications running simultaneously, for which profiling might help, but
> which are also policy-level decisions over which users may need
> more-direct control.

Some thread-level profiling may also be useful in guiding the decision --
i.e., information about the shared instruction footprint across different
threads: programs running heterogeneous threads vs. programs with
homogeneous worker threads running identical code.

thanks,

David
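[Editor's note] The "more flexible inlining order" David mentions, driven by net benefit rather than a fixed bottom-up walk, might look roughly like the greedy, budget-constrained sketch below. This is illustrative only; it is not how LLVM's inliner is structured, and all names are made up.

```cpp
#include <queue>
#include <string>
#include <vector>

// One candidate call site with profile-weighted estimates.
struct CallSite {
  std::string Caller, Callee;
  double Benefit;    // estimated cycles saved by inlining
  double SizeGrowth; // estimated code-size increase
  double netBenefit() const { return Benefit - SizeGrowth; }
};

// Order call sites so the highest net benefit is popped first.
struct ByNetBenefit {
  bool operator()(const CallSite &A, const CallSite &B) const {
    return A.netBenefit() < B.netBenefit(); // max-heap on net benefit
  }
};

// Greedily pick the highest-net-benefit sites until the size budget for
// the enclosing hot region runs out; sites with no net benefit are skipped.
std::vector<std::string> planInlining(std::vector<CallSite> Sites,
                                      double SizeBudget) {
  std::priority_queue<CallSite, std::vector<CallSite>, ByNetBenefit> Q(
      ByNetBenefit{}, std::move(Sites));
  std::vector<std::string> Plan;
  while (!Q.empty()) {
    CallSite CS = Q.top();
    Q.pop();
    if (CS.netBenefit() <= 0 || CS.SizeGrowth > SizeBudget)
      continue;
    SizeBudget -= CS.SizeGrowth;
    Plan.push_back(CS.Caller + "->" + CS.Callee);
  }
  return Plan;
}
```

A per-region budget, as discussed above, would replace the single SizeBudget here with one budget per hot region nest; in a real inliner, the estimates for remaining sites would also be updated after each decision.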