Easwaran Raman
2015-Jul-30 21:25 UTC
[LLVMdev] RFC: Callee speedup estimation in inline cost analysis
TLDR - The proposal below is intended to allow inlining of larger callees when such inlining is expected to reduce the dynamic instruction count.

Proposal
-------------

LLVM inlines a function if the size growth (in the given context) is less than a threshold. The threshold is increased based on certain characteristics of the called function (the inline keyword and the fraction of vector instructions, for example). I propose using the estimated speedup (the estimated reduction in dynamic instruction count, to be precise) as another factor that controls the threshold. This would allow larger functions whose inlining potentially reduces execution time to be inlined.

The dynamic instruction count of an (uninlined) function F is

DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB))

* The above summation is over all basic blocks BB in F.
* Freq(BB) = BlockFrequency(BB) / BlockFrequency(Entry(F))

This dynamic instruction count measurement doesn't distinguish between a single-cycle instruction and a long-latency instruction. Instead of using InstructionCount(BB), we could use Sum_I(Weight(I)), where the summation is over all instructions I in BB and Weight(I) represents the time it takes to execute instruction I.

The dynamic instruction count of F inlined into a callsite C, InlinedDI(F, C), can be similarly computed, taking into account the instructions that are simplified away due to inlining. The estimated speedup is

Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F)

A speedup above a pre-determined cutoff implies there is an expected benefit in inlining the callee F, and hence a bonus may be applied to the associated threshold at that callsite.
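To make the formulas concrete, here is a minimal sketch of how DI(F) and the bonus decision could be computed on top of BlockFrequencyInfo. This is purely illustrative: the helper names (getDynamicInstructionCount, deservesSpeedupBonus) are hypothetical, and Weight(I) is taken to be 1 for every instruction.

```cpp
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// DI(F) = Sum_BB(Freq(BB) * InstructionCount(BB)), where
// Freq(BB) = BlockFrequency(BB) / BlockFrequency(Entry(F)).
static double getDynamicInstructionCount(Function &F,
                                         BlockFrequencyInfo &BFI) {
  uint64_t EntryFreq = BFI.getBlockFreq(&F.getEntryBlock()).getFrequency();
  if (EntryFreq == 0)
    return 0.0;
  double DI = 0.0;
  for (BasicBlock &BB : F) {
    double Freq = double(BFI.getBlockFreq(&BB).getFrequency()) / EntryFreq;
    DI += Freq * BB.size(); // InstructionCount(BB); could be Sum_I(Weight(I))
  }
  return DI;
}

// Speedup(F, C) = (DI(F) - InlinedDI(F, C)) / DI(F). InlinedDI would come
// from the inline cost analysis, after accounting for instructions that
// are simplified away at this particular callsite.
static bool deservesSpeedupBonus(double DI, double InlinedDI,
                                 double Cutoff = 0.15) {
  return DI > 0.0 && (DI - InlinedDI) / DI >= Cutoff;
}
```

When deservesSpeedupBonus returns true, the inliner would scale the threshold at that callsite by a bonus factor (the 12X figure used in the preliminary results below is one such factor).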
Details
----------

This proposal depends on the new pass manager, which would allow the inline cost analysis to see function-level analysis passes. The outlined function's dynamic instruction count can be provided by an analysis pass. This dynamic instruction count and the block frequencies can be either updated (preferable, imo) or recomputed after each inlining.

I prototyped a version of this (the prototype augments the BasicBlock and Function classes to store the block frequency and function execution times) and did some initial evaluation with clang and some internal benchmarks used at Google. Implementing it as described above resulted in a large size growth when the parameters were chosen to deliver a performance improvement. Some ways to control size growth include applying the heuristic only to hot callsites, only to inline functions, and measuring the speedup using both caller and callee time (instead of just the latter). In any case, without profile feedback it is very likely that there is going to be a size increase when this heuristic is triggered on cold modules. With profile feedback, the absolute count of the callsite can be used to limit this to only hot callsites.

Alternative approach
--------------------------

An alternative approach is to set the threshold proportional to the estimated speedup, instead of just having low and high thresholds (for callees whose estimated speedup is below and above the cutoff, respectively). In my opinion, such an approach adds complexity to performance debugging and tuning. My guess is that it is not going to bring significant additional improvement over a well-tuned simple approach, but it is perhaps worth exploring later.

Preliminary Results
---------------------------

With the caveat that the prototype implementation is not well validated and very little tuning has been done, I have some preliminary numbers. These were obtained by using a very aggressive 12X bonus for an estimated speedup of 15% or more.

* Clang (run with -O2 on a large preprocessed file): 1% performance improvement and a 4.6% text size increase.
* Google benchmark suite (geomean of ~20 benchmarks): 5.5% performance improvement and a 7.7% text size increase.
* Spec (all C and C++ benchmarks): 0.6% speedup and a 2% text size increase.
* Chrome: performance neutral, 3.8% text size increase.

I propose to implement this change guarded by a flag. This flag could be turned on at -O2 with profile guided optimization. If the size increase under -O3 is not a huge concern, this could be turned on at -O3 even without PGO.

Looking forward to your feedback.

Thanks,
Easwaran
Xinliang David Li
2015-Jul-31 21:25 UTC
[LLVMdev] RFC: Callee speedup estimation in inline cost analysis
Just nitpicking:

1) DI(F) should include a component that estimates the prologue/epilogue cost (frame setup cost), which InlinedDI(F, C) does not have.

2) The speedup should include the callsite cost associated with C (the call instruction and argument passing):

Speedup(F, C) = (DI(F) + CallCost(C) - InlinedDI(F, C)) / DI(F)
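To illustrate with made-up numbers: if DI(F) = 1000, CallCost(C) = 20 and InlinedDI(F, C) = 820, the corrected formula gives Speedup(F, C) = (1000 + 20 - 820) / 1000 = 20%, whereas the formula without CallCost(C) reports only 18%; the call overhead that inlining eliminates is part of the benefit.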
Otherwise the proposal looks reasonable to me.

David
Easwaran Raman
2015-Jul-31 21:36 UTC
[LLVMdev] RFC: Callee speedup estimation in inline cost analysis
On Fri, Jul 31, 2015 at 2:25 PM, Xinliang David Li <xinliangli at gmail.com> wrote:

> Just nitpicking:
>
> 1) DI(F) should include a component that estimates the prologue/epilogue
> cost (frame setup cost), which InlinedDI(F, C) does not have.

The cost of frame setup is accounted for in the existing inline cost computation. For each argument, it adds a negative cost, since the instructions setting that argument up will go away after inlining. Since I was piggybacking on this (InlinedDI will be the actual dynamic instruction count minus the dynamic instructions due to argument setup), I didn't mention it explicitly, but yes, I agree that this has to be accounted for.
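For reference, the per-argument accounting being referred to amounts to something like the following paraphrase (a sketch, not the exact code in LLVM's InlineCost.cpp):

```cpp
#include "llvm/Analysis/InlineCost.h"

// Each argument gets a negative cost, because the instructions that set it
// up are expected to disappear after inlining. InlineConstants::InstrCost
// is the per-instruction cost unit used by the inline cost analysis.
static int applyArgumentSetupBonus(int Cost, unsigned NumArgs) {
  return Cost - int(NumArgs) * llvm::InlineConstants::InstrCost;
}
```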
> 2) The speedup should include the callsite cost associated with C (the call
> instruction and argument passing):
>
> Speedup(F, C) = (DI(F) + CallCost(C) - InlinedDI(F, C)) / DI(F)

Agreed.

> Otherwise the proposal looks reasonable to me.
Hal Finkel
2015-Aug-05 06:51 UTC
[llvm-dev] [LLVMdev] RFC: Callee speedup estimation in inline cost analysis
----- Original Message -----
> From: "Easwaran Raman" <eraman at google.com>
> To: "<llvmdev at cs.uiuc.edu> List" <llvmdev at cs.uiuc.edu>
> Sent: Thursday, July 30, 2015 4:25:41 PM
> Subject: [LLVMdev] RFC: Callee speedup estimation in inline cost analysis
>
> [...]
>
> The dynamic instruction count of F inlined into a callsite C,
> InlinedDI(F, C), can be similarly computed, taking into account the
> instructions that are simplified away due to inlining.

Are you computing this cost in the current fashion, or using some other mechanism? As I recall, we currently limit the number of instructions visited here to prevent the overall analysis from being quadratic. This seems somewhat at odds with a change specifically targeted at large callees.
> [...]
>
> * Clang (run with -O2 on a large preprocessed file): 1% performance
> improvement and a 4.6% text size increase.
> * Google benchmark suite (geomean of ~20 benchmarks): 5.5% performance
> improvement and a 7.7% text size increase.
> * Spec (all C and C++ benchmarks): 0.6% speedup and a 2% text size increase.
> * Chrome: performance neutral, 3.8% text size increase.

What was the compile-time effect?

> I propose to implement this change guarded by a flag. This flag could be
> turned on at -O2 with profile guided optimization. If the size increase
> under -O3 is not a huge concern, this could be turned on at -O3 even
> without PGO.

For my users, such an increase would be generally acceptable; however, others might have a different opinion. I'd certainly support a -O4 level with more-aggressive inlining in that case.

-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Easwaran Raman
2015-Aug-05 22:08 UTC
[llvm-dev] [LLVMdev] RFC: Callee speedup estimation in inline cost analysis
On Tue, Aug 4, 2015 at 11:51 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> > The dynamic instruction count of F inlined into a callsite C,
> > InlinedDI(F, C), can be similarly computed, taking into account the
> > instructions that are simplified away due to inlining.
>
> Are you computing this cost in the current fashion, or using some other
> mechanism? As I recall, we currently limit the number of instructions
> visited here to prevent the overall analysis from being quadratic. This
> seems somewhat at odds with a change specifically targeted at large
> callees.

I'm not sure what limit you're referring to here. There are early exits from the analyzeCall method once the maximum possible threshold (the threshold if all bonuses are applied) is reached. This change will increase the maximum possible threshold, and hence you end up processing more. AFAICT, the cost is linear in the number of instructions in the callee per callsite.
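For illustration, the early-exit pattern being described is roughly the following sketch (simplified; the real logic in LLVM's CallAnalyzer also simplifies instructions as it scans, and the names here are hypothetical):

```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"

// Accumulate an estimated cost block by block and bail out as soon as it
// exceeds the maximum possible threshold (the threshold with all bonuses
// applied). This keeps the per-callsite analysis linear in callee size,
// but a larger maximum threshold means the exit triggers later and more
// instructions are processed.
static bool costStaysUnderMaxThreshold(const llvm::Function &Callee,
                                       int MaxThreshold,
                                       int PerInstrCost = 5) {
  int Cost = 0;
  for (const llvm::BasicBlock &BB : Callee) {
    Cost += int(BB.size()) * PerInstrCost;
    if (Cost > MaxThreshold)
      return false; // early exit: no point scanning further
  }
  return true;
}
```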
> > * Clang (run with -O2 on a large preprocessed file): 1% performance
> > improvement and a 4.6% text size increase.
> > [...]
>
> What was the compile-time effect?

Good question :) I just measured a build of clang with this change ('time ninja -j1 clang'). This increases *build* time by 13%, which is certainly an issue that needs to be addressed. Assuming the time spent on non-compile actions does not change, the actual compile-time increase is likely a bit more than 13%. A few points regarding this:

* Independent of this change, we could cache the results of the inline cost computation. One simple approach is to cache when none of the parameters at the callsite evaluates to a constant (see the sketch after this list). The effect of such savings is likely to be greater with this change.
* The 12X threshold bonus needs to be tuned. I chose this factor to enable one beneficial inline in an internal benchmark I used.
* I want to re-emphasize that this is a quick prototype, and I am sure there are many inefficiencies in it.
* There may be opportunities to further prune the set of callsites where this heuristic is applied.
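A rough sketch of that caching idea (the names and structure are illustrative, not an actual LLVM interface, and a real version would also need invalidation when the callee itself changes, e.g. after inlining into it):

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/CallSite.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Inline cost per callee for "generic" callsites. Constant arguments can
// simplify the callee, making the cost context-sensitive, so callsites
// with constant arguments are always recomputed.
static DenseMap<Function *, int> GenericInlineCostCache;

static bool hasConstantArgument(CallSite CS) {
  for (auto AI = CS.arg_begin(), AE = CS.arg_end(); AI != AE; ++AI)
    if (isa<Constant>(*AI))
      return true;
  return false;
}

static int getCachedInlineCost(CallSite CS,
                               function_ref<int(CallSite)> ComputeCost) {
  Function *Callee = CS.getCalledFunction();
  if (!Callee || hasConstantArgument(CS))
    return ComputeCost(CS); // context-sensitive; don't cache
  auto It = GenericInlineCostCache.find(Callee);
  if (It != GenericInlineCostCache.end())
    return It->second;
  int Cost = ComputeCost(CS);
  GenericInlineCostCache[Callee] = Cost;
  return Cost;
}
```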
> > I propose to implement this change guarded by a flag. This flag could be
> > turned on at -O2 with profile guided optimization. If the size increase
> > under -O3 is not a huge concern, this could be turned on at -O3 even
> > without PGO.
>
> For my users, such an increase would be generally acceptable; however,
> others might have a different opinion. I'd certainly support a -O4 level
> with more-aggressive inlining in that case.

I would like to hear from other users of -O3 what an acceptable performance/code size tradeoff is for them. I measured the effects of -O3 on our internal benchmark suite. The geomean performance improved (compared to -O2, both without this proposed change) by 0.6% at a cost of a 4.5% size increase. If this is representative of what others see, I would argue that it is reasonable to enable this proposal at -O3.

- Easwaran