Star Tan
2013-Sep-17 02:12 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
Now, we come to more evaluations on http://188.40.87.11:8000/db_default/v4/nts/recent_activity

I mainly care about the compile-time and execution-time impact for the following cases:

pBasic (run 45): clang -O3 -load LLVMPolly.so
pNoGenSCEV (run 44): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly -polly-optimizer=none -polly-code-generator=none
pNoGenSCEV_nocan (run 47): same options as pNoGenSCEV, but with an LLVMPolly.so built without any Polly canonicalization passes
pNoGenSCEV_procomb (run 51): same options as pNoGenSCEV, but with an LLVMPolly.so built without only the "InstructionCombining" and "PromoteMemoryToRegister" canonicalization passes
pOptSCEV (run 48): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly
pOptSCEV_nocan (run 50): same options as pOptSCEV, but with an LLVMPolly.so built without any Polly canonicalization passes
pOptSCEV_procomb (run 52): same options as pOptSCEV, but with an LLVMPolly.so built without only the "InstructionCombining" and "PromoteMemoryToRegister" canonicalization passes
pollyOpt (run 53): clang -O3 -load LLVMPolly.so -mllvm -polly

Discovery 1: Polly optimization and code generation rely heavily on the "InstructionCombining" and "PromoteMemoryToRegister" canonicalization passes.

http://188.40.87.11:8000/db_default/v4/nts/52?compare_to=45&baseline=45 shows the comparison between pOptSCEV_procomb and pBasic. As the results show, Polly optimization and code generation lead to only a small compile-time overhead (about 20% at most) compared with clang. The top four benchmarks are:

SingleSource/UnitTests/SignlessTypes/rem                                  20.37%
SingleSource/Benchmarks/Misc/oourafft                                     11.34%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl     10.22%
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset          10.21%

This means that most of the expensive Polly analysis/optimization/code generation passes are not enabled unless these two canonicalization passes are run.
Of course, Polly also yields only small performance gains in this case. The top benchmarks for performance improvements are:

SingleSource/Benchmarks/Shootout/nestedloop         -100.00%
SingleSource/Benchmarks/Shootout-C++/nestedloop     -100.00%
MultiSource/Benchmarks/Ptrdist/anagram/anagram       -14.26%
SingleSource/Benchmarks/Shootout/lists               -10.77%

BTW, the bug (llvm.org/bugs/show_bug.cgi?id=17159) seen with general SCEV optimization does not appear any more.

Discovery 2: Removing Polly canonicalization passes significantly reduces compile time and may also reduce execution time.

http://188.40.87.11:8000/db_default/v4/nts/50?compare_to=48&baseline=48 shows the comparison between "full Polly canonicalization" and "no Polly canonicalization". As expected, removing canonicalization passes significantly reduces compile-time overhead and may decrease execution-time performance, since the canonicalization passes provide more opportunities for optimization. However, we find that removing Polly canonicalization passes may also improve the execution-time performance of some benchmarks, as shown below:

Performance Regressions - Execution Time
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt       45.89%
SingleSource/Benchmarks/CoyoteBench/huffbench                                 22.24%
SingleSource/Benchmarks/Shootout/fib2                                         15.06%
SingleSource/Benchmarks/Stanford/FloatMM                                      13.98%
SingleSource/Benchmarks/Misc-C++/mandel-text                                  13.16%

Performance Improvements - Execution Time
SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect               -37.50%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog     -27.69%
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt                      -22.59%
SingleSource/Benchmarks/Misc/himenobmtxpa                                    -21.98%
MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt            -16.44%

This means Polly's optimization does not always improve performance. It may also lead to performance regressions.
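The percentages quoted from the result pages read as relative changes against the baseline run. A minimal sketch of that arithmetic follows, assuming a plain (current - baseline) / baseline definition. The helper name percentChange is hypothetical and is not part of LNT or the test-suite.

```cpp
#include <cassert>

// Relative change of `current` against `baseline`, in percent.
// Assumption: negative values mean the metric shrank (faster compile or
// faster execution), positive values mean it grew (a regression).
double percentChange(double baseline, double current) {
    assert(baseline > 0.0 && "baseline time must be positive");
    return (current - baseline) / baseline * 100.0;
}
```

Under this definition, a value of -100.00% such as the one listed for nestedloop would correspond to a measured time that dropped to roughly zero, which may mean the loop nest was removed entirely rather than merely sped up.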
This finding is also visible in the comparison between "clang -O3 with Polly" and "clang -O3 without Polly" at http://188.40.87.11:8000/db_default/v4/nts/48?compare_to=45&baseline=45. Many benchmarks show execution-time regressions, so we need to further refine Polly's optimization. At the least, we should avoid performance regressions.

In the next step, I will evaluate those Polly canonicalization passes without -polly-codegen-scev to understand their compile-time and execution-time impact.

Best,
Mingxing

At 2013-09-14 09:51:10,"Star Tan" <tanmx_star at yeah.net> wrote:

Hello all,

I have evaluated the compile-time and execution-time performance of the Polly canonicalization passes. Details can be found at http://188.40.87.11:8000/db_default/v4/nts/recent_activity. There are four runs:

pollyBasic (run 45): clang -O3 -Xclang -load -Xclang LLVMPolly.so
pollyNoGenSCEV (run 44): clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-codegen-scev
pollyNoGenSCEV_1comb (run 46): same options as pollyNoGenSCEV, but with the first "InstructionCombining" canonicalization pass removed when building LLVMPolly.so
pollyNoGenSCEV_nocan (run 47): same options as pollyNoGenSCEV, but with all canonicalization passes removed (only "createCodePreparationPass" is kept) when building LLVMPolly.so

First,
let's see the results of removing the first "InstructionCombining" pass like this:

    static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
      PM.add(llvm::createPromoteMemoryToRegisterPass());
      // PM.add(llvm::createInstructionCombiningPass()); // this is the most expensive canonicalization pass for flop benchmark
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createTailCallEliminationPass());
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createReassociatePass());
      PM.add(llvm::createLoopRotatePass());
      PM.add(llvm::createInstructionCombiningPass());
      PM.add(polly::createCodePreparationPass());
    }

Results are shown on http://188.40.87.11:8000/db_default/v4/nts/46?baseline=44&compare_to=44. As shown in the results, 13 benchmarks have >5% compile-time improvements from simply removing the first "createInstructionCombiningPass". The top 5 benchmarks are:

SingleSource/Regression/C++/2003-09-29-NonPODsByValue     -38.46%
SingleSource/Benchmarks/Misc/flops                        -19.30%
SingleSource/Benchmarks/Misc/himenobmtxpa                 -12.94%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes           -12.68%
MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000        -10.68%

Unfortunately, there are also two serious execution-time performance regressions:

SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding              204.19%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog      44.58%

By looking into the simple_types_constant_folding benchmark, I find that the regression is mainly caused by an unexpected impact of createPromoteMemoryToRegisterPass(). Removing "createPromoteMemoryToRegisterPass" eliminates the execution-time regression for the simple_types_constant_folding benchmark. Right now, I have no idea why "createPromoteMemoryToRegisterPass" would lead to such a large execution-time regression.
http://188.40.87.11:8000/db_default/v4/nts/46?baseline=45&compare_to=45 shows the extra compile-time overhead of the Polly canonicalization passes without the first "InstructionCombining" pass. With the first "InstructionCombining" pass removed, the extra compile-time overhead of Polly canonicalization is at most 13.5%, which is much smaller than the original Polly canonicalization overhead (>20%).

Second, let's look into the total impact of those Polly canonicalization passes by removing all optional canonicalization passes as follows:

    static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
      // PM.add(llvm::createPromoteMemoryToRegisterPass());
      // PM.add(llvm::createInstructionCombiningPass()); // this is the most expensive canonicalization pass for flop benchmark
      // PM.add(llvm::createCFGSimplificationPass());
      // PM.add(llvm::createTailCallEliminationPass());
      // PM.add(llvm::createCFGSimplificationPass());
      // PM.add(llvm::createReassociatePass());
      // PM.add(llvm::createLoopRotatePass());
      // PM.add(llvm::createInstructionCombiningPass());
      PM.add(polly::createCodePreparationPass());
    }

As shown on http://188.40.87.11:8000/db_default/v4/nts/47?baseline=45&compare_to=45, the extra compile-time overhead is very small (5.04% at most) when all optional Polly canonicalization passes are removed. However, I think it is not practical to remove all these canonicalizations, because Polly's optimization depends on them. I will further evaluate Polly's performance (with optimization and code generation) in the case where all optional canonicalization passes are removed.

As a simple informal conclusion, I think we should revise Polly's canonicalization passes. At the least, we should consider removing the first "InstructionCombining" pass!
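The two experiments above differ only in which entries of the canonicalization sequence are kept. As a rough illustration, the sequence can be modeled as a plain list of pass names from which only the first matching entry is dropped. This is a toy model that does not call any LLVM API; the function names canonicalizationPasses and withoutFirst are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy model: pass names only, in the order quoted in this thread.
// Nothing here touches LLVM; it only mirrors the ordering.
std::vector<std::string> canonicalizationPasses() {
    return {"PromoteMemoryToRegister", "InstructionCombining",
            "CFGSimplification",       "TailCallElimination",
            "CFGSimplification",       "Reassociate",
            "LoopRotate",              "InstructionCombining",
            "CodePreparation"};
}

// Drop only the first occurrence of `name`; later occurrences stay.
std::vector<std::string> withoutFirst(std::vector<std::string> passes,
                                      const std::string &name) {
    auto it = std::find(passes.begin(), passes.end(), name);
    if (it != passes.end())
        passes.erase(it);
    return passes;
}
```

Calling withoutFirst(canonicalizationPasses(), "InstructionCombining") leaves the second "InstructionCombining" entry after "LoopRotate" in place, which matches the run 46 configuration described above.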
Best,
Star Tan

At 2013-09-13 12:46:33,"Star Tan" <tanmx_star at yeah.net> wrote:

At 2013-09-09 13:07:07,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmark shows >20% extra compile-time overhead:
>
>I believe so to, but please verify with -debug-pass=Structure

I have verified. It indeed does not involve Polly Dependence analysis. "Polly Dependence Pass" for flop is still high for some benchmarks as we have discussed before.

>> SingleSource/Benchmarks/Misc/flops 28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha 22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes 21.05%
>> When I look into the compile-time for the flop benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> the top 5 passes when compiled with Polly canonicalization passes:
>>  ---User Time---  --User+System--  ---Wall Time---  --- Name ---
>>  0.0160 ( 20.0%)  0.0160 ( 20.0%)  0.0164 ( 20.8%)  Combine redundant instructions
>>  0.0120 ( 15.0%)  0.0120 ( 15.0%)  0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
>>  0.0040 (  5.0%)  0.0040 (  5.0%)  0.0045 (  5.7%)  Greedy Register Allocator
>>  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0029 (  3.7%)  Global Value Numbering
>>  0.0040 (  5.0%)  0.0040 (  5.0%)  0.0028 (  3.6%)  Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang is:
>>  ---User Time---  --System Time--  --User+System--  ---Wall Time---  --- Name ---
>>  0.0120 ( 25.0%)  0.0000 (  0.0%)  0.0120 ( 21.4%)  0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
>>  0.0040 (  8.3%)  0.0000 (  0.0%)  0.0040 (  7.1%)  0.0047 (  8.4%)  Greedy Register Allocator
>>  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0034 (  6.1%)  Combine redundant instructions
>>  0.0000 (  0.0%)  0.0040 ( 50.0%)  0.0040 (  7.1%)  0.0029 (  5.2%)  Global Value Numbering
>>  0.0040 (  8.3%)  0.0000 (  0.0%)  0.0040 (  7.1%)  0.0029 (  5.2%)  Combine redundant instructions
>> We can see the "Combine redundant instructions" are invoked many times and the extra invoke resulted by Polly's canonicalization is the most significant one. We have found this problem before and I need to look into the details of canonicalization passes related to "Combine redundant instructions".
>
>OK.

By investigating the flop benchmark, I find that the key is the first "InstructionCombining" pass in the series of canonicalization passes listed below:

    static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
      PM.add(llvm::createPromoteMemoryToRegisterPass());
      PM.add(llvm::createInstructionCombiningPass()); // this is the most expensive canonicalization pass for flop benchmark
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createTailCallEliminationPass());
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createReassociatePass());
      PM.add(llvm::createLoopRotatePass());
      PM.add(llvm::createInstructionCombiningPass());
      if (!SCEVCodegen)
        PM.add(polly::createIndVarSimplifyPass());
      PM.add(polly::createCodePreparationPass());
    }

If we remove the first "InstructionCombining" pass, the compile time is reduced by more than 10%. The results reported by -ftime-report become very similar to the case without Polly canonicalization:

 ---User Time---  --System Time--  --User+System--  ---Wall Time---  --- Name ---
 0.0120 ( 23.1%)  0.0000 (  0.0%)  0.0120 ( 21.4%)  0.0138 ( 21.5%)  X86 DAG->DAG Instruction Selection
 0.0040 (  7.7%)  0.0000 (  0.0%)  0.0040 (  7.1%)  0.0045 (  7.1%)  Greedy Register Allocator
 0.0040 (  7.7%)  0.0000 (  0.0%)  0.0040 (  7.1%)  0.0042 (  6.6%)  Polly - Create polyhedral description of Scops
 0.0040 (  7.7%)  0.0000 (  0.0%)  0.0040 (  7.1%)  0.0038 (  5.9%)  Combine redundant instructions
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0029 (  4.5%)  Global Value Numbering
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0027 (  4.2%)  Combine redundant instructions
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0020 (  3.2%)  Combine redundant instructions
 0.0000 (  0.0%)  0.0000 (  0.0%)  0.0000 (  0.0%)  0.0020 (  3.1%)  Combine redundant instructions

Similar results have been found for the whetstone benchmark. I will run a full test using the LLVM test-suite tonight to see whether the change is similarly effective for other test-suite benchmarks.
@Tobias, do you have any idea about the performance impact and other consequences of removing such a canonicalization pass? In my opinion, it should not be important, since we still run the "InstructionCombining" pass after the "createLoopRotatePass" pass, and in fact there are many more runs of the "InstructionCombine" pass after this point.

Best,
Star Tan
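Because "Combine redundant instructions" appears several times in each -ftime-report listing, comparing configurations is easier once the rows are summed per pass name. The sketch below does that for the wall-time column of the table reported after removing the first "InstructionCombining" pass. The rows are copied by hand from that table rather than parsed from real -ftime-report output, and the names TimingRow, totalWallTimeByPass and rowsWithoutFirstInstCombine are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// One row of a timing table: wall time in seconds plus the pass name.
struct TimingRow {
    double wallSeconds;
    std::string passName;
};

// Sum wall time per pass name, so repeated invocations of the same pass
// show up as a single total.
std::map<std::string, double>
totalWallTimeByPass(const std::vector<TimingRow> &rows) {
    std::map<std::string, double> totals;
    for (const TimingRow &row : rows)
        totals[row.passName] += row.wallSeconds;
    return totals;
}

// Wall-time column of the table measured without the first
// "InstructionCombining" pass, transcribed from this thread.
std::vector<TimingRow> rowsWithoutFirstInstCombine() {
    return {{0.0138, "X86 DAG->DAG Instruction Selection"},
            {0.0045, "Greedy Register Allocator"},
            {0.0042, "Polly - Create polyhedral description of Scops"},
            {0.0038, "Combine redundant instructions"},
            {0.0029, "Global Value Numbering"},
            {0.0027, "Combine redundant instructions"},
            {0.0020, "Combine redundant instructions"},
            {0.0020, "Combine redundant instructions"}};
}
```

For the rows listed, the four "Combine redundant instructions" entries add up to about 0.0105 s of wall time. Only the top rows of the report are quoted in the thread, so this is a lower bound on the pass's total rather than the full figure.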
Tobias Grosser
2013-Sep-18 05:46 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
On 09/17/2013 04:12 AM, Star Tan wrote:
> Now, we come to more evaluations on http://188.40.87.11:8000/db_default/v4/nts/recent_activity

Hi Star Tan,

thanks for this very extensive analysis. The results look very interesting. As you found out, just removing some canonicalization passes will reduce compile time, but this reduction may in large part be due to Polly not being able to optimise certain pieces of code.

Instead of removing canonicalization passes, I believe we may want to move Polly to a later place in the pass manager. Possibly at the beginning of the loop optimizer right before PM.add(createLoopRotatePass());

We would then only need a very low number of canonicalization passes (possibly zero) and instead would put a couple of cleanup passes right after Polly. What do you think?

Cheers,
Tobias
Star Tan
2013-Sep-18 13:47 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
At 2013-09-18 13:46:13,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/17/2013 04:12 AM, Star Tan wrote:
>> Now, we come to more evaluations on http://188.40.87.11:8000/db_default/v4/nts/recent_activity
>
>Hi Star Tan,
>
>thanks for this very extensive analysis. The results look very
>interesting. As you found out, just removing some canonicalization
>passes will reduce compile time, but this reduction may in large part
>being due to Polly not being able to optimise certain pieces of code.
>
>Instead of removing canonicalization passes, I believe we may want to
>move Polly to a later place in the pass manager. Possibly at the
>beginning of the loop optimizer right before
>PM.add(createLoopRotatePass());
>
>We would then only need a very low number of canonicalization passes
>(possibly zero) and instead would put a couple of cleanup passes right
>after Polly. What do you think?

Sure, I agree with you. I did the previous evaluations to see the impact of each Polly canonicalization pass. The results show that the "InstructionCombining" and "PromoteMemoryToRegister" passes are critical to enabling Polly optimization. These passes may also be run by other LLVM components, so I am trying to find out at which later point we can start Polly, so that we avoid Polly's own canonicalization passes by reusing the existing LLVM passes.

Thanks for your helpful suggestion. I will look into where we should start Polly.

Best,
Star Tan
Hi Tobias,

I am trying to move Polly later. LLVM provides some predefined ExtensionPointTy values: EP_EarlyAsPossible, EP_ModuleOptimizerEarly, EP_LoopOptimizerEnd, EP_ScalarOptimizerLate, ... Currently Polly uses "EP_EarlyAsPossible" to run as early as possible. As you suggested:

>Instead of removing canonicalization passes, I believe we may want to
>move Polly to a later place in the pass manager. Possibly at the
>beginning of the loop optimizer right before PM.add(createLoopRotatePass());

I want to move it to the point immediately after some loop optimization pass, e.g. MPM.add(createLoopRotatePass()). However, no predefined ExtensionPointTy is available for this purpose. Instead, "EP_ModuleOptimizerEarly" would move Polly before all loop optimization passes. In my opinion, there are two solutions:

One is to use "EP_ModuleOptimizerEarly" (only modifying tool/polly/lib/RegisterPasses.cpp) to move Polly before all loop optimization passes.

The other is to add a new ExtensionPointTy, e.g. "EP_LoopRotateEnd", and move Polly to exactly the point immediately after the "LoopRotate" pass (this needs changes to tool/polly/lib/RegisterPasses.cpp, include/llvm/Transforms/IPO/PassManagerBuilder.h and lib/Transforms/IPO/PassManagerBuilder.cpp).

We can use the second way to investigate other points at which to start Polly. Is my understanding correct? Do you have any further suggestions?

Thanks,
Star Tan
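The difference between the two proposals can be pictured with a toy pipeline in which Polly is inserted at one of three hook positions. This is only a sketch of the ordering described in the message above: it does not use PassManagerBuilder, the stage names other than "LoopRotate" are placeholders, and LoopRotateEnd stands for the proposed "EP_LoopRotateEnd", which the message describes as not yet existing.

```cpp
#include <string>
#include <vector>

// Hook positions discussed in the thread. LoopRotateEnd models the
// proposed (not yet existing) EP_LoopRotateEnd extension point.
enum class ExtensionPoint { EarlyAsPossible, ModuleOptimizerEarly, LoopRotateEnd };

// Build a simplified, hypothetical pipeline with "Polly" placed at the
// chosen hook. The real LLVM pipeline contains many more passes.
std::vector<std::string> buildPipeline(ExtensionPoint pollyAt) {
    std::vector<std::string> pipeline;
    if (pollyAt == ExtensionPoint::EarlyAsPossible)
        pipeline.push_back("Polly");
    pipeline.push_back("EarlyCleanup");            // placeholder stage
    if (pollyAt == ExtensionPoint::ModuleOptimizerEarly)
        pipeline.push_back("Polly");
    pipeline.push_back("ModuleLevelOptimizations"); // placeholder stage
    pipeline.push_back("LoopRotate");
    if (pollyAt == ExtensionPoint::LoopRotateEnd)
        pipeline.push_back("Polly");
    pipeline.push_back("RemainingLoopPasses");      // placeholder stage
    return pipeline;
}
```

In this model, the first solution places Polly ahead of every loop pass, while the second places it directly after "LoopRotate", so the passes that ran earlier would take over the role of Polly's own canonicalization.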
Sebastian Pop
2013-Sep-26 20:16 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
Tobias Grosser wrote:
> On 09/17/2013 04:12 AM, Star Tan wrote:
> >Now, we come to more evaluations on http://188.40.87.11:8000/db_default/v4/nts/recent_activity
>
> Hi Star Tan,
>
> thanks for this very extensive analysis. The results look very
> interesting. As you found out, just removing some canonicalization
> passes will reduce compile time, but this reduction may in large
> part being due to Polly not being able to optimise certain pieces of
> code.
>
> Instead of removing canonicalization passes, I believe we may want
> to move Polly to a later place in the pass manager. Possibly at the
> beginning of the loop optimizer right before
> PM.add(createLoopRotatePass());
>
> We would then only need a very low number of canonicalization passes
> (possibly zero) and instead would put a couple of cleanup passes
> right
> after Polly. What do you think?

We experimented with moving Polly down the pass pipeline: when moving Polly past CSE, PRE and other scalar opts, Polly stops recognizing a number of loops.

Sebastian
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation