Star Tan
2013-Sep-09 03:18 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
At 2013-09-09 05:52:35, "Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/08/2013 08:03 PM, Star Tan wrote:
>> Hello all,
>>
>> I have done some basic experiments on the Polly canonicalization passes, and I found that SCEV canonicalization has a significant impact on both compile-time and execution-time performance.
>
>Interesting.
>
>> Detailed results for SCEV and default canonicalization can be viewed at: http://188.40.87.11:8000/db_default/v4/nts/32 (or 33, 34)
>> *pNoGen with SCEV canonicalization (run 32): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>> *pNoGen with default canonicalization (run 33): -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
>> *pBasic without any canonicalization (run 34): -O3 -Xclang -load -Xclang LLVMPolly.so
>>
>> Impact of SCEV canonicalization:
>> http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34
>> Impact of default canonicalization:
>> http://188.40.87.11:8000/db_default/v4/nts/33?compare_to=34&baseline=34
>> Comparison of SCEV canonicalization with default canonicalization:
>> http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33
>>
>> As we expected, both SCEV canonicalization and default canonicalization slightly increase the compile-time overhead (at most 30% extra compile time). They also lead to some execution-time regressions and improvements.
>>
>> The only difference between SCEV canonicalization and default canonicalization is the "IndVarSimplify" pass, as shown in the code at RegisterPasses.cpp:212:
>>   if (!SCEVCodegen)
>>     PM.add(polly::createIndVarSimplifyPass());
>
>There are actually more differences (see grep -R SCEVCodegen polly/),
>but the other differences will mainly be code generation differences.

Thanks for your reminder. Since we are currently focusing on canonicalization passes, the other differences for code generation do not matter.

>> However, I find it interesting to look into the comparison between SCEV canonicalization and default canonicalization (http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33):
>
>Yes, this is definitely a good start.
>
>> First of all, we can expect SCEV canonicalization to have better compile-time performance since it avoids the "IndVarSimplify" pass. Indeed, it improves compile time by more than 5% for 32 benchmarks, especially for the following benchmarks:
>>   MultiSource/Applications/lemon/lemon                                    -11.02%
>>   SingleSource/Benchmarks/Misc/oourafft                                   -10.53%
>>   SingleSource/Benchmarks/Linpack/linpack-pc                              -10.00%
>>   MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan         -8.31%
>>   MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt    -8.18%
>>
>> Second, we find that SCEV canonicalization brings both regressions and improvements in execution performance compared with default canonicalization. There are many execution-time regressions, such as:
>>   SingleSource/Benchmarks/Shootout/nestedloop        +16363.64%
>>   SingleSource/Benchmarks/Shootout-C++/nestedloop    +16200.00%
>
>Those two have a huge impact. Understanding what is going on here would
>be nice.

Yes, I am investigating these cases.

>> I think the execution-time performance regression is mainly because of the unexpected performance improvements from non-SCEV canonicalization, as shown in the following bug: http://llvm.org/bugs/show_bug.cgi?id=17153. I will try to find out why "IndVarSimplify" can produce better code in the next step. If we can eliminate "IndVarSimplify" canonicalization but keep on producing high-quality code, then we can gain better compile-time performance without execution-time performance loss.
>
>Previous experience has shown that the indvars pass as we run it in
>Polly can unpredictably change performance both negatively and
>positively. It was disabled as people did not manage to eliminate all
>regressions it introduced, such that the positive performance changes
>could not really be valued.
>
>So regarding performance tuning, I do not think we need to get this
>optimal. As soon as -polly-codegen-scev reaches similar performance to
>the original approach, we are fine.

I see. I agree with you. I think we care more about compile-time performance for Polly's canonicalization passes since no Polly optimization or Polly code generation happens here.

>Also, I wonder if your runs include the dependence analysis. If this is
>the case, the numbers are very good. Otherwise, 30% overhead seems still
>to be a little bit much.

I think no Polly Dependence analysis is involved since our compile command is:

  clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev

Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:

  SingleSource/Benchmarks/Misc/flops                          28.57%
  MultiSource/Benchmarks/MiBench/security-sha/security-sha    22.22%
  MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes             21.05%

When I look into the compile time for the flops benchmark using "-ftime-report", I find that the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.

The top 5 passes when compiled with Polly canonicalization passes:

  ---User Time---   --User+System--   ---Wall Time---   --- Name ---
  0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)   Combine redundant instructions
  0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)   X86 DAG->DAG Instruction Selection
  0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)   Greedy Register Allocator
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)   Global Value Numbering
  0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)   Polly - Create polyhedral description of Scops

But the top 5 passes for clang are:

  ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
  0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)   X86 DAG->DAG Instruction Selection
  0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)   Greedy Register Allocator
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)   Combine redundant instructions
  0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Global Value Numbering
  0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Combine redundant instructions

We can see that "Combine redundant instructions" is invoked many times and that the extra invocation caused by Polly's canonicalization is the most significant one. We have found this problem before and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".

BTW, I want to point out that although SCEV-based Polly canonicalization (with -polly-codegen-scev) runs faster than default canonicalization, it can lead to 5 extra compile errors and 3 extra runtime errors as shown on http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34.
I have done some basic analysis for one of the compile errors (7zip-benchmark). Results can be viewed at http://llvm.org/bugs/show_bug.cgi?id=17159

Best,
Star Tan
Tobias Grosser
2013-Sep-09 05:07 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
On 09/09/2013 05:18 AM, Star Tan wrote:
>
> At 2013-09-09 05:52:35, "Tobias Grosser" <tobias at grosser.es> wrote:
>
>> On 09/08/2013 08:03 PM, Star Tan wrote:
>> Also, I wonder if your runs include the dependence analysis. If this is
>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>> to be a little bit much.
> I think no Polly Dependence analysis is involved since our compile command is:
> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
> Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:

I believe so too, but please verify with -debug-pass=Structure

>   SingleSource/Benchmarks/Misc/flops                          28.57%
>   MultiSource/Benchmarks/MiBench/security-sha/security-sha    22.22%
>   MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes             21.05%
> When I look into the compile time for the flops benchmark using "-ftime-report", I find that the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
> The top 5 passes when compiled with Polly canonicalization passes:
>   ---User Time---   --User+System--   ---Wall Time---   --- Name ---
>   0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)   Combine redundant instructions
>   0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)   X86 DAG->DAG Instruction Selection
>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)   Greedy Register Allocator
>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)   Global Value Numbering
>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)   Polly - Create polyhedral description of Scops
>
> But the top 5 passes for clang are:
>   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
>   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)   X86 DAG->DAG Instruction Selection
>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)   Greedy Register Allocator
>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)   Combine redundant instructions
>   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Global Value Numbering
>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Combine redundant instructions
> We can see that "Combine redundant instructions" is invoked many times and that the extra invocation caused by Polly's canonicalization is the most significant one. We have found this problem before and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".

OK.

> BTW, I want to point out that although SCEV-based Polly canonicalization (with -polly-codegen-scev) runs faster than default canonicalization, it can lead to 5 extra compile errors and 3 extra runtime errors as shown on http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34.
> I have done some basic analysis for one of the compile errors (7zip-benchmark). Results can be viewed at http://llvm.org/bugs/show_bug.cgi?id=17159

Great. I will help look into this starting this weekend.

Cheers,
Tobias
Star Tan
2013-Sep-13 04:46 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
At 2013-09-09 13:07:07, "Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35, "Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compile command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmarks show >20% extra compile-time overhead:
>
>I believe so too, but please verify with -debug-pass=Structure

I have verified: this configuration indeed does not involve the Polly Dependence analysis. The compile time of the "Polly Dependence Pass" is still high for some benchmarks, as we have discussed before.

>>   SingleSource/Benchmarks/Misc/flops                          28.57%
>>   MultiSource/Benchmarks/MiBench/security-sha/security-sha    22.22%
>>   MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes             21.05%
>> When I look into the compile time for the flops benchmark using "-ftime-report", I find that the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> The top 5 passes when compiled with Polly canonicalization passes:
>>   ---User Time---   --User+System--   ---Wall Time---   --- Name ---
>>   0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)   Combine redundant instructions
>>   0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)   X86 DAG->DAG Instruction Selection
>>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)   Greedy Register Allocator
>>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)   Global Value Numbering
>>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)   Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang are:
>>   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
>>   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)   X86 DAG->DAG Instruction Selection
>>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)   Greedy Register Allocator
>>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)   Combine redundant instructions
>>   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Global Value Numbering
>>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Combine redundant instructions
>> We can see that "Combine redundant instructions" is invoked many times and that the extra invocation caused by Polly's canonicalization is the most significant one. We have found this problem before and I need to look into the details of the canonicalization passes related to "Combine redundant instructions".
>
>OK.

By investigating the flops benchmark, I find that the key is the first "InstructionCombining" pass in the series of canonicalization passes listed below:

  static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
    PM.add(llvm::createPromoteMemoryToRegisterPass());
    PM.add(llvm::createInstructionCombiningPass()); // this is the most expensive canonicalization pass for the flops benchmark
    PM.add(llvm::createCFGSimplificationPass());
    PM.add(llvm::createTailCallEliminationPass());
    PM.add(llvm::createCFGSimplificationPass());
    PM.add(llvm::createReassociatePass());
    PM.add(llvm::createLoopRotatePass());
    PM.add(llvm::createInstructionCombiningPass());
    if (!SCEVCodegen)
      PM.add(polly::createIndVarSimplifyPass());
    PM.add(polly::createCodePreparationPass());
  }

If we remove the first "InstructionCombining" pass, the compile time is reduced by more than 10%. The results reported by -ftime-report become very similar to the case without Polly canonicalization:

  ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
  0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)   X86 DAG->DAG Instruction Selection
  0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)   Greedy Register Allocator
  0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)   Polly - Create polyhedral description of Scops
  0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)   Combine redundant instructions
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)   Global Value Numbering
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)   Combine redundant instructions
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)   Combine redundant instructions
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)   Combine redundant instructions

Similar results have been found for the whetstone benchmark. I will run a full test using the LLVM test-suite tonight to see whether the change is similarly effective for other test-suite benchmarks.

@Tobias, do you have any idea about the performance impact and other consequences if we remove such a canonicalization pass? In my opinion, it should not be important since we still run the "InstructionCombining" pass after the "createLoopRotatePass" pass, and in fact there are many more runs of the "InstructionCombine" pass after this point.

Best,
Star Tan
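The experiment described in the message above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: `MockPassManager`, `registerCanonicalizationSketch`, `countPass`, and the `DropFirstInstCombine` parameter are hypothetical names introduced here for illustration, the pass labels are plain strings derived from the `create...Pass()` calls in the quoted list, and nothing below is Polly's actual code or an existing Polly option.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for llvm::PassManagerBase: it only records labels.
struct MockPassManager {
  std::vector<std::string> Passes;
  void add(const std::string &Name) { Passes.push_back(Name); }
};

// Mirrors the order of the pass list quoted in the message. The parameter
// DropFirstInstCombine models the experiment; it is not a real Polly flag.
void registerCanonicalizationSketch(MockPassManager &PM, bool SCEVCodegen,
                                    bool DropFirstInstCombine) {
  assert(PM.Passes.empty() && "sketch expects a fresh pass list");
  PM.add("PromoteMemoryToRegister");
  if (!DropFirstInstCombine)
    PM.add("InstructionCombining"); // the run reported as expensive on flops
  PM.add("CFGSimplification");
  PM.add("TailCallElimination");
  PM.add("CFGSimplification");
  PM.add("Reassociate");
  PM.add("LoopRotate");
  PM.add("InstructionCombining"); // still scheduled after loop rotation
  if (!SCEVCodegen)
    PM.add("IndVarSimplify"); // skipped when -polly-codegen-scev is given
  PM.add("CodePreparation");
}

// Counts how many times a given label was scheduled.
int countPass(const MockPassManager &PM, const std::string &Name) {
  int N = 0;
  for (const std::string &P : PM.Passes)
    if (P == Name)
      ++N;
  return N;
}
```

In this model, calling `registerCanonicalizationSketch(PM, /*SCEVCodegen=*/true, /*DropFirstInstCombine=*/true)` leaves a single "InstructionCombining" entry, the one after "LoopRotate", which is the configuration the message proposes to evaluate on the full test-suite.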
Sebastian Pop
2013-Sep-26 20:05 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
Hi Star Tan,

Thanks for the very interesting perf analyses.

Star Tan wrote:
> We can see that "Combine redundant instructions" is invoked many times and
> that the extra invocation caused by Polly's canonicalization is the most
> significant one. We have found this problem before and I need to look into the
> details of the canonicalization passes related to "Combine redundant instructions".

It could be that the SCEV codegen produces the same subexpression again and again because we are asking the same question again and again for each array index. Basically, in the original code we have a set of array access functions A1(i), A2(i), ..., An(i) that get transformed by Polly using a linear transform function t into A1(t(i)), A2(t(i)), ..., An(t(i)). So t(i) appears again and again, and we probably do generate the same code for it redundantly.

> BTW, I want to point out that although SCEV-based Polly canonicalization (with
> -polly-codegen-scev) runs faster than default canonicalization, it can lead to
> 5 extra compile errors and 3 extra runtime errors

That's one of the reasons why we have not turned SCEV codegen on by default yet. I will address all these issues and then we'll flip the default value of the -polly-codegen-scev flag.

> I have done some basic analysis for one of the compile errors
> (7zip-benchmark). Results can be viewed at
> http://llvm.org/bugs/show_bug.cgi?id=17159

Thanks for filing that bug report: I just assigned it to me.

Sebastian
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
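The hypothesis in the message above, that t(i) is regenerated for every access A1(t(i)), ..., An(t(i)), can be illustrated with a toy model. This is a sketch only: `ToyEmitter` and `generateAccesses` are hypothetical names, expressions are plain strings, and the model does not reflect how Polly's SCEV-based code generation is implemented. It merely contrasts emitting t(i) once per access with reusing a previously emitted value.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy instruction stream: each entry stands for one emitted instruction.
struct ToyEmitter {
  std::vector<std::string> Insts;
  std::map<std::string, std::string> Cache; // expression text -> value name

  // Always emits a fresh instruction for Expr and returns its value name.
  std::string emit(const std::string &Expr) {
    std::string Name = "%v" + std::to_string(Insts.size());
    Insts.push_back(Name + " = " + Expr);
    return Name;
  }

  // Emits Expr only the first time it is requested; later requests reuse it.
  std::string emitCached(const std::string &Expr) {
    auto It = Cache.find(Expr);
    if (It != Cache.end())
      return It->second;
    std::string Name = emit(Expr);
    Cache[Expr] = Name;
    return Name;
  }
};

// Generates index computations for the accesses A1(t(i)) ... An(t(i)) and
// returns the number of instructions emitted in this toy model.
std::size_t generateAccesses(int NumAccesses, bool Reuse) {
  assert(NumAccesses >= 0 && "number of accesses must be non-negative");
  ToyEmitter E;
  for (int K = 1; K <= NumAccesses; ++K) {
    std::string T = Reuse ? E.emitCached("t(i)") : E.emit("t(i)");
    E.emit("A" + std::to_string(K) + "[" + T + "]");
  }
  return E.Insts.size();
}
```

For n accesses the naive variant emits 2n instructions while the reusing variant emits n + 1. If the real code generator behaves like the naive variant, the duplicated copies of t(i) would be left for later cleanup passes to merge, which would be consistent with the extra "Combine redundant instructions" time reported earlier in the thread; that link is a conjecture, not something the thread confirms.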
Star Tan
2013-Sep-27 03:49 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
At 2013-09-27 04:05:07, "Sebastian Pop" <spop at codeaurora.org> wrote:
>Hi Star Tan,
>
>Thanks for the very interesting perf analyses.
>
>Star Tan wrote:
>> We can see that "Combine redundant instructions" is invoked many times and
>> that the extra invocation caused by Polly's canonicalization is the most
>> significant one. We have found this problem before and I need to look into the
>> details of the canonicalization passes related to "Combine redundant instructions".
>
>It could be that the scev codegen produces the same subexpression again and
>again due to the fact that we are asking the same question again and again for
>each array index: basically, in the original code we have a set of array access
>functions A1(i), A2(i), ..., An(i), that get transformed by polly using a linear
>transform function t: A1(t(i)), A2(t(i)), ..., An(t(i)), so you see that t(i)
>appears again and again, and we probably do generate redundantly the same code
>for it.
>
>> BTW, I want to point out that although SCEV-based Polly canonicalization (with
>> -polly-codegen-scev) runs faster than default canonicalization, it can lead to
>> 5 extra compile errors and 3 extra runtime errors
>
>That's one of the reasons why we have not turned SCEV codegen on by default yet.
>I will address all these issues and then we'll flip the default value of the
>-polly-codegen-scev flag.

Great! I will try to investigate the other errors and either file them in the LLVM Bugzilla or try to fix them. I also look forward to fixing these errors and flipping the default option value as soon as possible.

>> I have done some basic analysis for one of the compile errors
>> (7zip-benchmark). Results can be viewed at
>> http://llvm.org/bugs/show_bug.cgi?id=17159
>
>Thanks for filing that bug report: I just assigned it to me.
>
>Sebastian
>--
>Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>hosted by The Linux Foundation

Thanks,
Mingxing