Star Tan
2013-Sep-13 04:46 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
At 2013-09-09 13:07:07, "Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35, "Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling command is:
>>   clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
>
>I believe so too, but please verify with -debug-pass=Structure

I have verified. It indeed does not involve Polly Dependence analysis. "Polly Dependence Pass" for flop is still high for some benchmarks, as we have discussed before.

>> SingleSource/Benchmarks/Misc/flops                          28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha    22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes             21.05%
>> When I look into the compile time for the flops benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> The top 5 passes when compiled with Polly canonicalization passes:
>>   ---User Time---   --User+System--   ---Wall Time---   --- Name ---
>>   0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)   Combine redundant instructions
>>   0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)   X86 DAG->DAG Instruction Selection
>>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)   Greedy Register Allocator
>>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)   Global Value Numbering
>>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)   Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang are:
>>   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
>>   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)   X86 DAG->DAG Instruction Selection
>>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)   Greedy Register Allocator
>>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)   Combine redundant instructions
>>   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Global Value Numbering
>>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Combine redundant instructions
>> We can see that "Combine redundant instructions" is invoked many times, and the extra invocation caused by Polly's canonicalization is the most significant one. We have found this problem before, and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".
>
>OK.

By investigating the flops benchmark, I find the key is the first "InstructionCombining" pass in a series of canonicalization passes listed as follows:

    static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
      PM.add(llvm::createPromoteMemoryToRegisterPass());
      // This is the most expensive canonicalization pass for the flops benchmark.
      PM.add(llvm::createInstructionCombiningPass());
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createTailCallEliminationPass());
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createReassociatePass());
      PM.add(llvm::createLoopRotatePass());
      PM.add(llvm::createInstructionCombiningPass());
      if (!SCEVCodegen)
        PM.add(polly::createIndVarSimplifyPass());
      PM.add(polly::createCodePreparationPass());
    }

If we remove the first "InstructionCombining" pass, the compile time is reduced by more than 10%. The results reported by -ftime-report become very similar to the case without Polly canonicalization:

    ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
    0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)   X86 DAG->DAG Instruction Selection
    0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)   Greedy Register Allocator
    0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)   Polly - Create polyhedral description of Scops
    0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)   Combine redundant instructions
    0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)   Global Value Numbering
    0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)   Combine redundant instructions
    0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)   Combine redundant instructions
    0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)   Combine redundant instructions

Similar results have been found for the whetstone benchmark. I will run a full test using the LLVM test-suite tonight to see whether the change is similarly effective for other test-suite benchmarks.
@Tobias, do you have any idea about the performance impact and other consequences of removing such a canonicalization pass? In my opinion, it should not be important, since we still run the "InstructionCombining" pass after the "createLoopRotatePass" pass, and in fact there are many more runs of the "InstructionCombine" pass after this point.

Best,
Star Tan
Star Tan
2013-Sep-14 01:51 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
Hello all,

I have evaluated the compile-time and execution-time performance of the Polly canonicalization passes. Details are available at http://188.40.87.11:8000/db_default/v4/nts/recent_activity. There are four runs:

pollyBasic (run 45): clang -O3 -Xclang -load -Xclang LLVMPolly.so
pollyNoGenSCEV (run 44): clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-codegen-scev
pollyNoGenSCEV_1comb (run 46): same options as pollyNoGenSCEV, but with the first "InstructionCombining" canonicalization pass removed when building LLVMPolly.so
pollyNoGenSCEV_nocan (run 47): same options as pollyNoGenSCEV, but with all canonicalization passes removed (actually only "createCodePreparationPass" is kept) when building LLVMPolly.so

First, let's see the results of removing the first "InstructionCombining" pass like this:

    static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
      PM.add(llvm::createPromoteMemoryToRegisterPass());
      // This is the most expensive canonicalization pass for the flops benchmark.
      // PM.add(llvm::createInstructionCombiningPass());
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createTailCallEliminationPass());
      PM.add(llvm::createCFGSimplificationPass());
      PM.add(llvm::createReassociatePass());
      PM.add(llvm::createLoopRotatePass());
      PM.add(llvm::createInstructionCombiningPass());
      PM.add(polly::createCodePreparationPass());
    }

Results are shown at http://188.40.87.11:8000/db_default/v4/nts/46?baseline=44&compare_to=44. As shown in the results, 13 benchmarks improve their compile time by more than 5% simply by removing the first "createInstructionCombiningPass".
The top 5 benchmarks are listed as follows:

    SingleSource/Regression/C++/2003-09-29-NonPODsByValue       -38.46%
    SingleSource/Benchmarks/Misc/flops                          -19.30%
    SingleSource/Benchmarks/Misc/himenobmtxpa                   -12.94%
    MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes             -12.68%
    MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000          -10.68%

Unfortunately, there are also two serious execution-time performance regressions:

    SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding             204.19%
    SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog     44.58%

By looking into the simple_types_constant_folding benchmark, I find the regression is mainly caused by an unexpected impact of createPromoteMemoryToRegisterPass(). Removing "createPromoteMemoryToRegisterPass" eliminates the execution-time regression for the simple_types_constant_folding benchmark. Right now, I have no idea why "createPromoteMemoryToRegisterPass" would lead to such a large execution-time regression.

http://188.40.87.11:8000/db_default/v4/nts/46?baseline=45&compare_to=45 shows the extra compile-time overhead of the Polly canonicalization passes without the first "InstructionCombining" pass. With that pass removed, the extra compile-time overhead of Polly canonicalization is at most 13.5%, which is much smaller than the original Polly canonicalization overhead (>20%).
Second, let's look into the total impact of the Polly canonicalization passes by removing all optional canonicalization passes as follows:

    static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
      // PM.add(llvm::createPromoteMemoryToRegisterPass());
      // This is the most expensive canonicalization pass for the flops benchmark.
      // PM.add(llvm::createInstructionCombiningPass());
      // PM.add(llvm::createCFGSimplificationPass());
      // PM.add(llvm::createTailCallEliminationPass());
      // PM.add(llvm::createCFGSimplificationPass());
      // PM.add(llvm::createReassociatePass());
      // PM.add(llvm::createLoopRotatePass());
      // PM.add(llvm::createInstructionCombiningPass());
      PM.add(polly::createCodePreparationPass());
    }

As shown at http://188.40.87.11:8000/db_default/v4/nts/47?baseline=45&compare_to=45, the extra compile-time overhead is very small (5.04% at most) when all optional Polly canonicalization passes are removed. However, I think it is not practical to remove all these canonicalizations, for the sake of Polly optimization performance. I will further evaluate Polly's performance (with optimization and code generation) in the case where all optional canonicalization passes are removed.

As a simple informal conclusion, I think we should revise Polly's canonicalization passes. At least we should consider removing the first "InstructionCombining" pass!

Best,
Star Tan
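The percentages quoted in this message are relative changes against a baseline run. Below is a minimal sketch of that arithmetic, assuming the usual (current - baseline) / baseline definition. The function names and the default 5% threshold are illustrative choices, and the exact formula used by the LNT server behind the links is not confirmed here.

```cpp
#include <cmath>

// Relative change of `current` against `baseline`, in percent.
// Negative values mean the current run took less time (an improvement);
// positive values mean it took more (a regression).
// Assumes baseline > 0.
double percentChange(double baseline, double current) {
  return (current - baseline) / baseline * 100.0;
}

// True when the change exceeds the threshold in either direction.
// The 5% default mirrors the ">5%" cut-off mentioned in the message and is
// otherwise an arbitrary, illustrative choice.
bool isSignificant(double percent, double thresholdPercent = 5.0) {
  return std::fabs(percent) > thresholdPercent;
}
```

For example, with a hypothetical baseline of 2.0 s and a current time of 1.5 s, `percentChange(2.0, 1.5)` gives -25, which `isSignificant` classifies as a change worth reporting.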
Star Tan
2013-Sep-17 02:12 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
Now we come to more evaluations at http://188.40.87.11:8000/db_default/v4/nts/recent_activity

I mainly care about the compile-time and execution-time impact for the following cases:

pBasic (run 45): clang -O3 -load LLVMPolly.so
pNoGenSCEV (run 44): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly -polly-optimizer=none -polly-code-generator=none
pNoGenSCEV_nocan (run 47): same options as pNoGenSCEV, but with an LLVMPolly.so built without any Polly canonicalization passes
pNoGenSCEV_procomb (run 51): same options as pNoGenSCEV, but with an LLVMPolly.so built without only the "InstructionCombining" and "PromoteMemoryToRegister" canonicalization passes
pOptSCEV (run 48): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly
pOptSCEV_nocan (run 50): same options as pOptSCEV, but with an LLVMPolly.so built without any Polly canonicalization passes
pOptSCEV_procomb (run 52): same options as pOptSCEV, but with an LLVMPolly.so built without only the "InstructionCombining" and "PromoteMemoryToRegister" canonicalization passes
pollyOpt (run 53): clang -O3 -load LLVMPolly.so -mllvm -polly

Discovery 1: Polly optimization and code generation rely heavily on the "InstructionCombining" and "PromoteMemoryToRegister" canonicalization passes.

http://188.40.87.11:8000/db_default/v4/nts/52?compare_to=45&baseline=45 shows the comparison between pOptSCEV_procomb and pBasic. As the results show, Polly optimization and code generation lead to a small compile-time overhead (about 20% at most) compared with clang. The top four benchmarks are:

    SingleSource/UnitTests/SignlessTypes/rem                                     20.37%
    SingleSource/Benchmarks/Misc/oourafft                                        11.34%
    MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl        10.22%
    MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset             10.21%

This means that most of the expensive Polly analysis/optimization/code generation passes are not enabled without running these two canonicalization passes.
Of course, Polly also brings little performance gain in this case. The top benchmarks for performance improvements are:

    SingleSource/Benchmarks/Shootout/nestedloop        -100.00%
    SingleSource/Benchmarks/Shootout-C++/nestedloop    -100.00%
    MultiSource/Benchmarks/Ptrdist/anagram/anagram      -14.26%
    SingleSource/Benchmarks/Shootout/lists              -10.77%

BTW, this bug (llvm.org/bugs/show_bug.cgi?id=17159), seen in general SCEV optimization, does not appear any more.

Discovery 2: Removing Polly canonicalization passes significantly reduces compile time and may also reduce execution time.

http://188.40.87.11:8000/db_default/v4/nts/50?compare_to=48&baseline=48 shows the comparison between "full polly canonicalization" and "non polly canonicalization". As expected, removing canonicalization passes can significantly reduce compile-time overhead and may degrade execution-time performance, since the canonicalization passes can provide more opportunities for optimization. However, we find that removing Polly canonicalization passes may also improve the execution-time performance for some benchmarks, as shown below:

Performance Regressions - Execution Time
    MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt      45.89%
    SingleSource/Benchmarks/CoyoteBench/huffbench                                22.24%
    SingleSource/Benchmarks/Shootout/fib2                                        15.06%
    SingleSource/Benchmarks/Stanford/FloatMM                                     13.98%
    SingleSource/Benchmarks/Misc-C++/mandel-text                                 13.16%

Performance Improvements - Execution Time
    SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect              -37.50%
    SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog    -27.69%
    MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt                     -22.59%
    SingleSource/Benchmarks/Misc/himenobmtxpa                                   -21.98%
    MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt           -16.44%

This means Polly's optimization does not always improve performance. It may also lead to performance regressions.
This discovery can also be seen in the comparison between "clang -O3 with Polly" and "clang -O3 without Polly" at http://188.40.87.11:8000/db_default/v4/nts/48?compare_to=45&baseline=45. Many benchmarks show execution-time regressions, so we need to further refine Polly's optimization. At least we should avoid the performance regressions.

In the next step, I will evaluate the Polly canonicalization passes without -polly-codegen-scev to understand their compile-time and execution-time impact.

Best,
Mingxing
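The run list in this message abbreviates the command lines. Going by the full command quoted in the first message of the thread, the plugin is loaded with -Xclang -load -Xclang LLVMPolly.so and each Polly option is forwarded with -mllvm. A small sketch of that expansion follows; the helper name `pollyClangCommand` is hypothetical, and the sketch only builds the command string rather than running clang.

```cpp
#include <string>
#include <vector>

// Build the full clang command line for a Polly run.
// The plugin is loaded through "-Xclang -load -Xclang LLVMPolly.so" and every
// Polly option is forwarded to LLVM with its own "-mllvm" prefix, following
// the full command quoted in the thread. Hypothetical helper for illustration.
std::string pollyClangCommand(const std::vector<std::string> &pollyOptions) {
  std::string command = "clang -O3 -Xclang -load -Xclang LLVMPolly.so";
  for (const std::string &option : pollyOptions)
    command += " -mllvm " + option;
  return command;
}
```

Passing the four options `-polly`, `-polly-optimizer=none`, `-polly-code-generator=none` and `-polly-codegen-scev` reproduces the command quoted in the first message, and passing an empty list gives the pBasic configuration, which loads the plugin without enabling Polly.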