Mehdi Amini via llvm-dev
2016-Apr-01 19:26 UTC
[llvm-dev] [RFC] Target-specific parametrization of function inliner
> On Mar 10, 2016, at 10:34 AM, Xinliang David Li via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Thu, Mar 10, 2016 at 6:49 AM, Chandler Carruth <chandlerc at google.com> wrote:
>> IMO, the appropriate thing for TTI to inform the inliner about is how costly the actual act of a "call" is likely to be. I would hope that this would only be used on targets where there is some really dramatic overhead of actually doing a function call, such that the code-size cost incurred by inlining is completely dwarfed by the improvements. GPUs are one of the few platforms that exhibit this kind of behavior, although I don't think they're truly unique, just a common example.
>>
>> This isn't quite the same thing as the cost of the call instruction, which has much more to do with size. Instead, it has to do with the expected consequences of actually leaving a call edge in the program.
>>
>> To me, this pretty accurately reflects the TTI hook we have for customizing loop unrolling, where the cost of having a cyclic CFG is modeled to help indicate that on some targets (also GPUs) it is worth a very large amount of code-size growth to simplify the control flow in a particular way.
>
> From 10,000 feet, the LLVM inliner implements a size-based heuristic: if the inline instance's size*/cost after simplification via propagating the call context (actually the relative size -- the callsite cost is subtracted from it) is smaller than a threshold (adjusted from a base value), then the callsite is considered an inline candidate. In most cases, the decision is made locally due to the bottom-up order (there are tweaks to bypass it).
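David's size/threshold scheme can be sketched roughly as follows (a deliberately simplified model for illustration only -- the names and constants are invented here, and the real logic in LLVM's InlineCost analysis is far richer):

```cpp
#include <cassert>

// Hypothetical, simplified model of the size/threshold heuristic described
// above; not LLVM's actual InlineCost implementation.
struct CallSiteInfo {
  int calleeSimplifiedSize; // callee size after propagating the call context
  int callsiteCost;         // cost of the call itself (setup, the call, etc.)
};

// The relative size is the simplified callee size minus the callsite cost;
// the callsite is an inline candidate if it falls below the threshold.
bool isInlineCandidate(const CallSiteInfo &cs, int threshold) {
  int relativeCost = cs.calleeSimplifiedSize - cs.callsiteCost;
  return relativeCost < threshold;
}
```

Note that the callsite cost is subtracted up front, so a more expensive call on a given target mechanically makes inlining more attractive -- which is exactly the knob the GPU discussion below turns.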
> The size/cost can be remotely tied and serves as a proxy for the real runtime cost due to icache/itlb effects, but it seems the size/threshold scheme is mainly used to model the runtime-speedup vs. compile-time/binary-size tradeoffs.

Other than the call cost itself, I've been surprised that the TTI is not more involved when it comes to this tradeoff: instructions don't have the same tradeoff depending on the platform (oh, this operation is not legal on this type and will be expanded into multiple instructions in SelectionDAG, too bad...).

-- Mehdi

> Setting aside what we need longer term for the inliner, the GPU-specific problems can be addressed by:
> 1) if the call overhead is really large, define a target-specific getCallCost and subtract it from the initial Cost when analyzing a callsite (this will help boost all targets with high call costs);
> 2) if not, but GPU users can instead tolerate large code growth, then it is better to do this by adjusting the threshold -- perhaps with a user-level option, -finline-limit=?
>
> thanks,
>
> David
>
> * some target-dependent info may be used: TTI.getUserCost
>
>> Does that make sense to you, Hal? Based on that, it would really just be a scaling factor of the inline heuristics. Unsure of how to more scientifically express this construct.
>>
>> -Chandler
>>
>> On Thu, Mar 10, 2016 at 3:42 PM Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> ----- Original Message -----
>>>> From: "Artem Belevich via llvm-dev" <llvm-dev at lists.llvm.org>
>>>> To: "llvm-dev" <llvm-dev at lists.llvm.org>
>>>> Sent: Tuesday, March 1, 2016 6:31:06 PM
>>>> Subject: [llvm-dev] [RFC] Target-specific parametrization of function inliner
>>>>
>>>> Hi,
>>>>
>>>> I propose to make function inliner parameters adjustable for specific targets.
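David's first suggestion -- subtracting a target-specific call cost from the initial cost -- might look roughly like this. This is a hypothetical sketch: `getTargetCallOverhead` and its values are invented stand-ins for a target-specific getCallCost, not an existing interface.

```cpp
#include <cassert>

// Invented stand-in for a per-target call overhead, standing in for a
// target-specific TTI::getCallCost. The values are illustrative only.
enum class Target { GenericCPU, NVPTX };

int getTargetCallOverhead(Target t) {
  switch (t) {
  case Target::NVPTX:
    return 100; // calls assumed very expensive on the GPU target
  default:
    return 25;  // modest overhead on a generic CPU
  }
}

// Suggestion 1): start the callsite analysis with the call overhead already
// subtracted, so targets with expensive calls inline more aggressively
// without touching the shared threshold.
bool shouldInline(int calleeSimplifiedCost, int threshold, Target t) {
  int cost = calleeSimplifiedCost - getTargetCallOverhead(t);
  return cost < threshold;
}
```

Suggestion 2) is the complementary knob: leave the cost alone and raise the threshold itself for targets (or users) that tolerate code growth.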
>>>> Currently the function inlining pass appears to be target-agnostic, with the various constants for calculating call cost hardcoded. While it works reasonably well for general-purpose CPUs, some quirkier targets like NVPTX would benefit from target-specific tuning.
>>>>
>>>> Currently it appears that there are two things that need to be done:
>>>>
>>>> * Add inliner preferences to TargetTransformInfo in a way similar to how we customize loop unrolling. Use it to provide the inliner with target-specific thresholds and other parameters.
>>>> * Augment the inliner pass to use the existing TargetTransformInfo API to figure out the cost of a particular call on a given target. TargetTransformInfo already has getCallCost(), though it does not look like anything uses it.
>>>>
>>>> Comments? Concerns? Suggestions?
>>>
>>> Hi Art,
>>>
>>> I've long thought that we should have a more principled way of doing inline profitability. There is obviously some cost to executing a function body, some call site overhead, and some cost reduction associated with any post-inlining simplifications. If inlining reduces the overall call site cost by more than some factor, say 1% (this should probably depend on the optimization level), then we should inline. With profiling information, we might even use global speedup instead of local speedup.
>>>
>>> Whether we need a target customization of this threshold, or just a way for a target to supplement the final inlining decision, is unclear to me. It is also true that the result of a bunch of locally-optimal decisions might be far from the global optimum. Maybe the target has something to say about that?
>>>
>>> In short, I'm fine with what you're proposing, but to the extent possible, I want the numbers provided by the target to mean something. Replacing a global set of somewhat-arbitrary magic numbers with target-specific sets of somewhat-arbitrary magic numbers should be our last choice.
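Hal's relative-speedup criterion could be expressed roughly as follows (a hypothetical sketch: the three cost quantities and the 1% factor come from the prose above, not from any existing LLVM API):

```cpp
#include <cassert>

// Estimated costs for one call site, in arbitrary "time" units, following
// Hal's decomposition: body cost, call-site overhead, and the cost reduction
// expected from post-inlining simplifications.
struct CallSiteCosts {
  double bodyCost;           // cost of executing the callee body
  double callOverhead;       // overhead of the call itself
  double simplificationGain; // expected savings from post-inlining simplification
};

// Inline when doing so reduces the overall call-site cost by more than
// minSpeedup (e.g. 0.01 for the 1% factor suggested above).
bool profitableToInline(const CallSiteCosts &c, double minSpeedup = 0.01) {
  double costAsCall = c.bodyCost + c.callOverhead;
  double costInlined = c.bodyCost - c.simplificationGain;
  return (costAsCall - costInlined) / costAsCall > minSpeedup;
}
```

The appeal of this formulation is that the target-supplied numbers (overheads, per-instruction costs) have physical meaning, rather than being another layer of magic constants.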
>>> Thanks again,
>>> Hal
>>>
>>>> Thanks,
>>>> --
>>>> --Artem Belevich
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
Hal Finkel via llvm-dev
2016-Apr-01 19:35 UTC
[llvm-dev] [RFC] Target-specific parametrization of function inliner
----- Original Message -----
> From: "Mehdi Amini via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Xinliang David Li" <davidxl at google.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, April 1, 2016 2:26:27 PM
> Subject: Re: [llvm-dev] [RFC] Target-specific parametrization of function inliner
>
> [snip]
>
> Other than the call cost itself, I've been surprised that the TTI is not more involved when it comes to this tradeoff: instructions don't have the same tradeoff depending on the platform (oh, this operation is not legal on this type and will be expanded into multiple instructions in SelectionDAG, too bad...).

I think that doing this was intended, but we've not done it yet (as we did for the throughput model used for vectorization). I think we should (I also think we should combine the cost models, so that we have a single model that returns multiple kinds of costs: throughput, size, latency, etc.).

-Hal
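Hal's idea of a single model returning multiple kinds of costs could have roughly this shape (purely a hypothetical sketch of the interface; the opcode table and values are invented, and LLVM's actual TTI cost-kind query looks different):

```cpp
#include <cassert>

// Hypothetical combined cost model: one query returns every kind of cost,
// instead of maintaining separate models for each client pass.
struct InstructionCosts {
  int throughput; // reciprocal-throughput-style cost (the vectorizer's view)
  int size;       // code-size cost (the inliner's / unroller's view)
  int latency;    // latency cost
};

// An illustrative per-opcode table; real values would come from the target.
InstructionCosts getCosts(char opcode) {
  switch (opcode) {
  case '+': return {1, 1, 1};  // cheap on every axis
  case '/': return {4, 1, 20}; // small to encode, but slow
  default:  return {1, 1, 1};
  }
}
```

The point David makes in his reply below falls out of this shape naturally: the size axis and the time axes are separate fields, so a speedup estimate need not be entangled with a size-growth estimate.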
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Xinliang David Li via llvm-dev
2016-Apr-06 17:46 UTC
[llvm-dev] [RFC] Target-specific parametrization of function inliner
On Fri, Apr 1, 2016 at 12:35 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> [snip]
>
>> Other than the call cost itself, I've been surprised that the TTI is not more involved when it comes to this tradeoff: instructions don't have the same tradeoff depending on the platform (oh, this operation is not legal on this type and will be expanded into multiple instructions in SelectionDAG, too bad...).
>
> I think that doing this was intended, but we've not done it yet (as we did for the throughput model used for vectorization). I think we should (I also think we should combine the cost models, so that we have a single model that returns multiple kinds of costs: throughput, size, latency, etc.).

Yes -- the time/speedup estimate should be independent of the size-increase estimate.

David