thr3ads.net - llvm dev - [LLVMdev] [Polly] Move Polly's execution later [Sep 2013]

Star Tan

2013-Sep-17 02:12 UTC

[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

Here are more evaluation results, available at
http://188.40.87.11:8000/db_default/v4/nts/recent_activity

I mainly care about the compile-time and execution-time impact of the following
configurations:
pBasic (run 45): clang -O3 -load LLVMPolly.so
pNoGenSCEV (run 44): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly -polly-optimizer=none -polly-code-generator=none
pNoGenSCEV_nocan (run 47): same options as pNoGenSCEV, but with an LLVMPolly.so
built without any Polly canonicalization passes
pNoGenSCEV_procomb (run 51): same options as pNoGenSCEV, but with an
LLVMPolly.so built without the "InstructionCombining" and
"PromoteMemoryToRegister" canonicalization passes (all others kept)
pOptSCEV (run 48): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly
pOptSCEV_nocan (run 50): same options as pOptSCEV, but with an LLVMPolly.so
built without any Polly canonicalization passes
pOptSCEV_procomb (run 52): same options as pOptSCEV, but with an
LLVMPolly.so built without the "InstructionCombining" and
"PromoteMemoryToRegister" canonicalization passes (all others kept)
pollyOpt (run 53): clang -O3 -load LLVMPolly.so -mllvm -polly

Discovery 1: Polly optimization and code generation rely heavily on the
"InstructionCombining" and "PromoteMemoryToRegister"
canonicalization passes.
http://188.40.87.11:8000/db_default/v4/nts/52?compare_to=45&baseline=45
shows the comparison between pOptSCEV_procomb and pBasic. As the results show,
without these two passes Polly optimization and code generation add only a
small compile-time overhead (about 20% at most) compared with clang. The top
four benchmarks are:
SingleSource/UnitTests/SignlessTypes/rem  20.37%
SingleSource/Benchmarks/Misc/oourafft  11.34%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl  10.22%
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset  10.21%
This means that most of the expensive Polly analysis/optimization/code
generation passes are not enabled unless these two canonicalization passes run.
Of course, Polly also brings little performance gain in this case. The top
benchmarks for execution-time improvement are:
SingleSource/Benchmarks/Shootout/nestedloop  -100.00%
SingleSource/Benchmarks/Shootout-C++/nestedloop  -100.00%
MultiSource/Benchmarks/Ptrdist/anagram/anagram  -14.26%
SingleSource/Benchmarks/Shootout/lists  -10.77%
BTW, this bug (llvm.org/bugs/show_bug.cgi?id=17159), which showed up in general
SCEV optimization, no longer appears.
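
For readers of the tables in this thread, here is a minimal sketch of how such percentages can be read, assuming each figure is the relative change of the compared run against the baseline run. The function name relativeChangePercent is hypothetical, and this is not code from LNT or Polly.

```cpp
#include <cassert>

// Relative change of `current` against `baseline`, in percent.
// Negative values mean the measured time went down, positive values mean
// it went up. Assumption: this mirrors how the deltas in the report are
// read; it is not taken from the reporting tool itself.
double relativeChangePercent(double baseline, double current) {
  assert(baseline > 0.0 && "baseline time must be positive");
  return (current - baseline) / baseline * 100.0;
}
```

Read this way, a compile time that grows from 4.0s to 5.0s is reported as 25%, and a -100.00% entry would correspond to a measured time of (nearly) zero in the compared run.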

Discovery 2: Removing Polly canonicalization passes significantly reduces
compile time and may also reduce execution time.
http://188.40.87.11:8000/db_default/v4/nts/50?compare_to=48&baseline=48 shows
the comparison between "full polly canonicalization" and "non
polly canonicalization". As expected, removing the canonicalization passes
significantly reduces compile-time overhead, and it may hurt execution-time
performance, since the canonicalization passes can provide more
opportunities for optimization. However, we find that removing Polly
canonicalization passes may also improve execution-time performance for some
benchmarks, as shown below:

Performance Regressions - Execution Time
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt  45.89%
SingleSource/Benchmarks/CoyoteBench/huffbench  22.24%
SingleSource/Benchmarks/Shootout/fib2  15.06%
SingleSource/Benchmarks/Stanford/FloatMM  13.98%
SingleSource/Benchmarks/Misc-C++/mandel-text  13.16%

Performance Improvements - Execution Time
SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect  -37.50%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog  -27.69%
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt  -22.59%
SingleSource/Benchmarks/Misc/himenobmtxpa  -21.98%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt  -16.44%

This means that Polly's optimization does not always improve performance; it
may also cause performance regressions. The same can be seen in the comparison
between "clang -O3 with Polly" and "clang -O3 without Polly" at
http://188.40.87.11:8000/db_default/v4/nts/48?compare_to=45&baseline=45.
Many benchmarks show execution-time regressions. So we need to further refine
Polly's optimization. At the least, we should avoid performance regressions.

In the next step, I will evaluate those polly canonicalization passes without
-polly-codegen-scev to understand their compile-time and execution-time impact.

Best,
Mingxing

At 2013-09-14 09:51:10,"Star Tan" <tanmx_star at yeah.net>
wrote:

Hello all,

I have evaluated the compile-time and execution-time performance of the Polly
canonicalization passes. Details are available at
http://188.40.87.11:8000/db_default/v4/nts/recent_activity. There are four runs:
pollyBasic (run 45): clang -O3 -Xclang -load -Xclang LLVMPolly.so
pollyNoGenSCEV (run 44): clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-codegen-scev
pollyNoGenSCEV_1comb (run 46): same options as pollyNoGenSCEV, but with the
first "InstructionCombining" canonicalization pass removed when building
LLVMPolly.so
pollyNoGenSCEV_nocan (run 47): same options as pollyNoGenSCEV, but with all
canonicalization passes removed (only "createCodePreparationPass" is kept)
when building LLVMPolly.so

First, let's see the results of removing the first
"InstructionCombining" pass like this:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
//  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flop benchmark
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  PM.add(polly::createCodePreparationPass());
}
Results are shown on
http://188.40.87.11:8000/db_default/v4/nts/46?baseline=44&compare_to=44. As
shown in the results, 13 benchmarks improve their compile time by more than 5%
simply by removing the first "createInstructionCombiningPass". The top 5
benchmarks are:
SingleSource/Regression/C++/2003-09-29-NonPODsByValue  -38.46%
SingleSource/Benchmarks/Misc/flops  -19.30%
SingleSource/Benchmarks/Misc/himenobmtxpa  -12.94%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes  -12.68%
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000  -10.68%
Unfortunately, there are also two serious execution-time performance
regressions:
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding  204.19%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog  44.58%
Looking into the simple_types_constant_folding benchmark, I find the regression
is mainly caused by an unexpected effect of createPromoteMemoryToRegisterPass().
Removing "createPromoteMemoryToRegisterPass" eliminates the
execution-time performance regression for the simple_types_constant_folding
benchmark. Right now, I have no idea why
"createPromoteMemoryToRegisterPass" leads to such a large execution-time
performance regression.

http://188.40.87.11:8000/db_default/v4/nts/46?baseline=45&compare_to=45
shows the extra compile-time overhead of Polly canonicalization passes without
the first "InstructionCombining" pass. By removing the  first
"InstructionCombining" pass, we see the extra compile-time overhead of
Polly canonicalization is at most 13.5%, which is much smaller than the original
Polly canonicalization overhead (>20%).

Second, let's look into the total impact of those polly canonicalization
passes by removing all optional canonicalization passes as follows:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
//  PM.add(llvm::createPromoteMemoryToRegisterPass());
//  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flop benchmark
//  PM.add(llvm::createCFGSimplificationPass());
//  PM.add(llvm::createTailCallEliminationPass());
//  PM.add(llvm::createCFGSimplificationPass());
//  PM.add(llvm::createReassociatePass());
//  PM.add(llvm::createLoopRotatePass());
//  PM.add(llvm::createInstructionCombiningPass());
  PM.add(polly::createCodePreparationPass());
}
As shown on
http://188.40.87.11:8000/db_default/v4/nts/47?baseline=45&compare_to=45, the
extra compile-time overhead is very small (5.04% at most) once all
optional Polly canonicalization passes are removed. However, I think it is not
practical to remove all these canonicalizations, for the sake of Polly's
optimization performance. I will further evaluate Polly's performance (with
optimization and code generation) when all optional canonicalization passes
are removed.
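
The two experiments above differ only in which entries of an ordered pass list are kept. As a rough illustration, here is a toy model of that list using plain strings instead of LLVM passes. CanonMode and buildCanonicalizationPipeline are hypothetical names introduced for this sketch; they are not part of Polly or LLVM.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the canonicalization pipeline: pass names only, no LLVM.
// All names below are hypothetical and exist only for this illustration.
enum class CanonMode {
  Full,              // every pass in the original list
  SkipFirstInstComb, // drop only the first InstructionCombining (run 46)
  CodePrepOnly       // keep only CodePreparation (run 47)
};

std::vector<std::string> buildCanonicalizationPipeline(CanonMode Mode) {
  std::vector<std::string> Passes;
  if (Mode != CanonMode::CodePrepOnly) {
    Passes.push_back("PromoteMemoryToRegister");
    if (Mode != CanonMode::SkipFirstInstComb)
      Passes.push_back("InstructionCombining");
    Passes.push_back("CFGSimplification");
    Passes.push_back("TailCallElimination");
    Passes.push_back("CFGSimplification");
    Passes.push_back("Reassociate");
    Passes.push_back("LoopRotate");
    Passes.push_back("InstructionCombining");
  }
  // CodePreparation is the one pass kept in every variant.
  Passes.push_back("CodePreparation");
  assert(!Passes.empty() && "CodePreparation must always be scheduled");
  return Passes;
}
```

In this model, Full yields nine entries, SkipFirstInstComb eight (with the later InstructionCombining still present after LoopRotate), and CodePrepOnly a single entry.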

As a simple informal conclusion, I think we should revise Polly's
canonicalization passes. At least we should consider removing the first
"InstructionCombining" pass!

Best,
Star Tan

At 2013-09-13 12:46:33,"Star Tan" <tanmx_star at yeah.net>
wrote:

At 2013-09-09 13:07:07,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
>
>I believe so too, but please verify with -debug-pass=Structure
I have verified. It indeed does not involve Polly Dependence analysis.
"Polly Dependence Pass" for flop is still high for some benchmarks as
we have discussed before.
>> SingleSource/Benchmarks/Misc/flops	28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
>> When I look into the compile time for the flop benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> The top 5 passes when compiled with Polly canonicalization passes:
>>     ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>>     0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
>>     0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang are:
>>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>>     0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
>>     0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
>> We can see that "Combine redundant instructions" is invoked many times, and the extra invocation caused by Polly's canonicalization is the most significant one. We have found this problem before and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".
>
>OK.
By investigating the flop benchmark, I find the key is the first
"InstructionCombining" pass in the series of canonicalization passes
listed below:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
  PM.add(llvm::createInstructionCombiningPass());  // this is the most expensive canonicalization pass for the flop benchmark
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  if (!SCEVCodegen)
    PM.add(polly::createIndVarSimplifyPass());
  PM.add(polly::createCodePreparationPass());
}
If we remove the first "InstructionCombining" pass, then the
compile time is reduced by more than 10%. The results reported by -ftime-report
become very similar to the case without Polly canonicalization:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)  Greedy Register Allocator
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)  Polly - Create polyhedral description of Scops
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)  Global Value Numbering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)  Combine redundant instructions
Similar results have been found for the whetstone benchmark. I will run a full
test using the LLVM test-suite tonight to see whether the change is similarly
effective for other test-suite benchmarks.
@Tobias, do you have any idea about the performance impact and other
consequences of removing such a canonicalization pass? In my opinion, it
should not be important, since we still run the "InstructionCombining"
pass after the "createLoopRotatePass" pass, and in fact there are many more
runs of the "InstructionCombine" pass after this point.
Best,
Star Tan

Tobias Grosser

2013-Sep-18 05:46 UTC

[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

On 09/17/2013 04:12 AM, Star Tan wrote:
> Now, we come to more evaluations on
> http://188.40.87.11:8000/db_default/v4/nts/recent_activity

Hi Star Tan,

thanks for this very extensive analysis. The results look very
interesting. As you found out, just removing some canonicalization
passes will reduce compile time, but this reduction may in large part
be due to Polly not being able to optimise certain pieces of code.

Instead of removing canonicalization passes, I believe we may want to 
move Polly to a later place in the pass manager. Possibly at the 
beginning of the loop optimizer right before
PM.add(createLoopRotatePass());

We would then only need a very low number of canonicalization passes 
(possibly zero) and instead would put a couple of cleanup passes right
after Polly. What do you think?


Cheers,
Tobias

Star Tan

2013-Sep-18 13:47 UTC

[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

At 2013-09-18 13:46:13,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/17/2013 04:12 AM, Star Tan wrote:
>> Now, we come to more evaluations on
>> http://188.40.87.11:8000/db_default/v4/nts/recent_activity
>
>Hi Star Tan,
>
>thanks for this very extensive analysis. The results look very
>interesting. As you found out, just removing some canonicalization
>passes will reduce compile time, but this reduction may in large part
>be due to Polly not being able to optimise certain pieces of code.
>
>Instead of removing canonicalization passes, I believe we may want to 
>move Polly to a later place in the pass manager. Possibly at the 
>beginning of the loop optimizer right before
>PM.add(createLoopRotatePass());
>
>We would then only need a very low number of canonicalization passes 
>(possibly zero) and instead would put a couple of cleanup passes right
>after Polly. What do you think?

Sure, I agree with you. I did the previous evaluations to see the
impact of each Polly canonicalization pass. The results show that the
"InstructionCombining" and "PromoteMemoryToRegister" passes
are critical to enabling Polly optimization. These passes may also be run by
other LLVM components, so I am trying to find a later point at which we can
start Polly, so that it can reuse those existing LLVM passes instead of running
its own canonicalization passes.
Thanks for your helpful suggestion. I will look into where we should start
Polly.
Best,
Star Tan

Star Tan

2013-Sep-19 14:46 UTC

[LLVMdev] [Polly] Move Polly's execution later

Hi Tobias,

I am trying to move Polly later.

LLVM provides some predefined ExtensionPointTy:
EP_EarlyAsPossible,
EP_ModuleOptimizerEarly,
EP_LoopOptimizerEnd,
EP_ScalarOptimizerLate,
...

Currently Polly uses "EP_EarlyAsPossible" to run as early as possible.
As you suggested:
>Instead of removing canonicalization passes, I believe we may want to
>move Polly to a later place in the pass manager. Possibly at the
>beginning of the loop optimizer right before PM.add(createLoopRotatePass());
I want to move it to the point immediately after some loop optimization pass,
e.g. MPM.add(createLoopRotatePass()). However, no predefined ExtensionPointTy is
available for this purpose. Instead, "EP_ModuleOptimizerEarly"
would move Polly before all loop optimization passes.

In my opinion, there are two solutions. One is to use
"EP_ModuleOptimizerEarly" (only modifying
tool/polly/lib/RegisterPasses.cpp) to move Polly before all loop optimization
passes. The other is to add a new ExtensionPointTy, e.g.
"EP_LoopRotateEnd", and move Polly to the point immediately after the
"LoopRotate" pass (this requires modifying tool/polly/lib/RegisterPasses.cpp,
include/llvm/Transforms/IPO/PassManagerBuilder.h and
lib/Transforms/IPO/PassManagerBuilder.cpp). We can use the second way to
investigate other points at which to start Polly.
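
To make the second option more concrete, here is a self-contained toy model of extension points: callbacks registered for a named point are run when the builder reaches that point while assembling the pipeline. The enumerator names mirror those in this message, but ToyPipelineBuilder, its fixed pass order, and the pass-name strings are assumptions made for illustration; none of this is the actual PassManagerBuilder code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only. EP_LoopRotateEnd is the hypothetical new point proposed
// above; it is not an existing extension point.
enum ExtensionPoint {
  EP_EarlyAsPossible,
  EP_ModuleOptimizerEarly,
  EP_LoopRotateEnd
};

using PassList = std::vector<std::string>;
using Extension = std::function<void(PassList &)>;

class ToyPipelineBuilder {
  std::multimap<ExtensionPoint, Extension> Extensions;

  // Run every callback registered for EP, in registration order.
  void runExtensions(ExtensionPoint EP, PassList &PM) const {
    auto Range = Extensions.equal_range(EP);
    for (auto It = Range.first; It != Range.second; ++It)
      It->second(PM);
  }

public:
  void addExtension(ExtensionPoint EP, Extension Fn) {
    assert(Fn && "extension callback must be callable");
    Extensions.emplace(EP, std::move(Fn));
  }

  // The fixed passes and their order are illustrative, not the real -O3 list.
  PassList build() const {
    PassList PM;
    runExtensions(EP_EarlyAsPossible, PM);
    PM.push_back("early-cleanup");
    runExtensions(EP_ModuleOptimizerEarly, PM);
    PM.push_back("scalar-opts");
    PM.push_back("loop-rotate");
    runExtensions(EP_LoopRotateEnd, PM);
    PM.push_back("other-loop-passes");
    return PM;
  }
};
```

Registering a callback that appends "polly" at EP_LoopRotateEnd places it directly after "loop-rotate" in this model, while registering it at EP_EarlyAsPossible places it first. Trying a different start point then only means adding one more enumerator and one more runExtensions call.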

Is my understanding correct? Do you have any further suggestion?

Thanks,
Star Tan

Sebastian Pop

2013-Sep-26 20:16 UTC

[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

Tobias Grosser wrote:
> On 09/17/2013 04:12 AM, Star Tan wrote:
> >Now, we come to more evaluations on
> >http://188.40.87.11:8000/db_default/v4/nts/recent_activity
> 
> Hi Star Tan,
> 
> thanks for this very extensive analysis. The results look very
> interesting. As you found out, just removing some canonicalization
> passes will reduce compile time, but this reduction may in large
> part be due to Polly not being able to optimise certain pieces of
> code.
> 
> Instead of removing canonicalization passes, I believe we may want
> to move Polly to a later place in the pass manager. Possibly at the
> beginning of the loop optimizer right before
> PM.add(createLoopRotatePass());
> 
> We would then only need a very low number of canonicalization passes
> (possibly zero) and instead would put a couple of cleanup passes
> right after Polly. What do you think?

We experimented with moving Polly down the pass pipeline: when moving Polly past
CSE, PRE and other scalar opts, Polly stops recognizing a number of loops.

Sebastian
-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
