Star Tan
2013-Sep-13 04:46 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
At 2013-09-09 13:07:07, "Tobias Grosser" <tobias at grosser.es> wrote:
>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35, "Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three benchmark shows >20% extra compile-time overhead:
>
>I believe so to, but please verify with -debug-pass=Structure

I have verified. It indeed does not involve Polly Dependence analysis. "Polly Dependence Pass" for flop is still high for some benchmarks as we have discussed before.

>> SingleSource/Benchmarks/Misc/flops                         28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha   22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes            21.05%
>> When I look into the compile-time for the flop benchmark using "-ftime-report", I find the extra compile-time overhead mainly comes from the "Combine redundant instructions" pass.
>> the top 5 passes when compiled with Polly canonicalization passes:
>>   ---User Time---   --User+System--   ---Wall Time---   --- Name ---
>>   0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)   Combine redundant instructions
>>   0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)   X86 DAG->DAG Instruction Selection
>>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)   Greedy Register Allocator
>>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)   Global Value Numbering
>>   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)   Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang is:
>>   ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
>>   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)   X86 DAG->DAG Instruction Selection
>>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)   Greedy Register Allocator
>>   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)   Combine redundant instructions
>>   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Global Value Numbering
>>   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)   Combine redundant instructions
>> We can see the "Combine redundant instructions" are invoked many times and the extra invoke resulted by Polly's canonicalization is the most significant one. We have found this problem before and I need to look into the details of canonicalization passes related to "Combine redundant instructions".
>
>OK.

By investigating the flops benchmark, I find the key is the first "InstructionCombining" pass in the series of canonicalization passes listed below:

  static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
    PM.add(llvm::createPromoteMemoryToRegisterPass());
    PM.add(llvm::createInstructionCombiningPass()); // this is the most expensive canonicalization pass for flop benchmark
    PM.add(llvm::createCFGSimplificationPass());
    PM.add(llvm::createTailCallEliminationPass());
    PM.add(llvm::createCFGSimplificationPass());
    PM.add(llvm::createReassociatePass());
    PM.add(llvm::createLoopRotatePass());
    PM.add(llvm::createInstructionCombiningPass());
    if (!SCEVCodegen)
      PM.add(polly::createIndVarSimplifyPass());
    PM.add(polly::createCodePreparationPass());
  }

If we remove the first "InstructionCombining" pass, the compile-time is reduced by more than 10%. The results reported by -ftime-report become very similar to the case without Polly canonicalization:

  ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
  0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)   X86 DAG->DAG Instruction Selection
  0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)   Greedy Register Allocator
  0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)   Polly - Create polyhedral description of Scops
  0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)   Combine redundant instructions
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)   Global Value Numbering
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)   Combine redundant instructions
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)   Combine redundant instructions
  0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)   Combine redundant instructions

Similar results have been found for the whetstone benchmark. I will run a full test with the LLVM test-suite tonight to see whether the change is similarly effective for other test-suite benchmarks.

@Tobias, do you have any idea about the performance impact and other consequences if we remove this canonicalization pass? In my opinion, it should not be important, since we still run the "InstructionCombining" pass after the "createLoopRotatePass" pass, and in fact there are many more runs of the "InstructionCombine" pass after this point.

Best,
Star Tan
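Since "Combine redundant instructions" shows up several times in one report, it can help to add up its rows before comparing configurations. The following is only a minimal sketch of that bookkeeping, not part of Polly or LLVM: the names PassTiming, parseTimingRow and totalWallTime are hypothetical, and it assumes each row looks like the ones quoted above (the last "value ( percent%)" column is wall time, followed by a pass name that contains no parentheses).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// One parsed row of a timing report (hypothetical helper type).
struct PassTiming {
  std::string name;    // pass name, e.g. "Combine redundant instructions"
  double wallSeconds;  // value of the last timing column (wall time)
};

// Parses a row such as
//   "0.0040 ( 7.7%) 0.0000 ( 0.0%) 0.0040 ( 7.1%) 0.0038 ( 5.9%) Combine redundant instructions"
// Returns false for lines that do not match this shape (e.g. the header row).
bool parseTimingRow(const std::string &row, PassTiming &out) {
  const std::size_t open = row.rfind('(');
  const std::size_t close = row.rfind(')');
  if (open == std::string::npos || close == std::string::npos || open == 0 ||
      close < open)
    return false;
  // The wall-time value is the last token before the final '('.
  const std::size_t numEnd = row.find_last_not_of(" \t", open - 1);
  if (numEnd == std::string::npos)
    return false;
  const std::size_t sep = row.find_last_of(" \t", numEnd);
  const std::size_t numStart = (sep == std::string::npos) ? 0 : sep + 1;
  // The pass name is everything after the final ')', with blanks trimmed.
  const std::size_t nameStart = row.find_first_not_of(" \t", close + 1);
  if (nameStart == std::string::npos)
    return false;
  const std::size_t nameEnd = row.find_last_not_of(" \t");
  try {
    out.wallSeconds = std::stod(row.substr(numStart, numEnd - numStart + 1));
  } catch (const std::exception &) {
    return false;  // the token before '(' was not a number
  }
  out.name = row.substr(nameStart, nameEnd - nameStart + 1);
  return true;
}

// Sums the wall time of every row whose pass name equals passName.
double totalWallTime(const std::vector<std::string> &rows,
                     const std::string &passName) {
  assert(!passName.empty() && "pass name must not be empty");
  double total = 0.0;
  for (const std::string &row : rows) {
    PassTiming timing;
    if (parseTimingRow(row, timing) && timing.name == passName)
      total += timing.wallSeconds;
  }
  return total;
}
```

Feeding it the four "Combine redundant instructions" rows of the last table above would add 0.0038, 0.0027, 0.0020 and 0.0020 seconds of wall time.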
Star Tan
2013-Sep-14 01:51 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
Hello all,
I have evaluated the compile-time and execution-time performance of Polly
canonicalization passes. Details are available at
http://188.40.87.11:8000/db_default/v4/nts/recent_activity. There are four runs:
pollyBasic (run 45): clang -O3 -Xclang -load -Xclang LLVMPolly.so
pollyNoGenSCEV (run 44): clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm
-polly -mllvm -polly-codegen-scev
pollyNoGenSCEV_1comb (run 46): same options as pollyNoGenSCEV, but with the
first "InstructionCombining" canonicalization pass removed when building
LLVMPolly.so
pollyNoGenSCEV_nocan (run 47): same options as pollyNoGenSCEV, but with all
canonicalization passes removed (only "createCodePreparationPass" is kept)
when building LLVMPolly.so
First, let's see the results of removing the first
"InstructionCombining" pass like this:
  static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
    PM.add(llvm::createPromoteMemoryToRegisterPass());
    // PM.add(llvm::createInstructionCombiningPass()); // this is the most expensive canonicalization pass for flop benchmark
    PM.add(llvm::createCFGSimplificationPass());
    PM.add(llvm::createTailCallEliminationPass());
    PM.add(llvm::createCFGSimplificationPass());
    PM.add(llvm::createReassociatePass());
    PM.add(llvm::createLoopRotatePass());
    PM.add(llvm::createInstructionCombiningPass());
    PM.add(polly::createCodePreparationPass());
  }
Results are shown on
http://188.40.87.11:8000/db_default/v4/nts/46?baseline=44&compare_to=44. As
shown in the results, 13 benchmarks have >5% compile-time performance
improvements by simply removing the first
"createInstructionCombiningPass". The top 5 benchmarks are listed as
follows:
  SingleSource/Regression/C++/2003-09-29-NonPODsByValue       -38.46%
  SingleSource/Benchmarks/Misc/flops                          -19.30%
  SingleSource/Benchmarks/Misc/himenobmtxpa                   -12.94%
  MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes             -12.68%
  MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000          -10.68%
Unfortunately, there are also two serious execution-time performance
regressions:
  SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding                204.19%
  SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog        44.58%
Looking into the simple_types_constant_folding benchmark, I find the regression
is mainly caused by an unexpected effect of createPromoteMemoryToRegisterPass().
Removing "createPromoteMemoryToRegisterPass" eliminates the
execution-time performance regression for the simple_types_constant_folding
benchmark. Right now, I have no idea why
"createPromoteMemoryToRegisterPass" leads to such a large execution-time
performance regression.
http://188.40.87.11:8000/db_default/v4/nts/46?baseline=45&compare_to=45
shows the extra compile-time overhead of Polly canonicalization passes without
the first "InstructionCombining" pass. By removing the first
"InstructionCombining" pass, we see the extra compile-time overhead of
Polly canonicalization is at most 13.5%, which is much smaller than the original
Polly canonicalization overhead (>20%).
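For reference, here is a small sketch of how such percentages can be derived from two measured times. It assumes each figure is the change of the measured run relative to the baseline run; the function name relativeChangePercent is made up for this example and is not part of LNT or Polly.

```cpp
#include <cassert>

// Relative change of `measured` versus `baseline`, in percent.
// Positive values mean the measured run took longer (extra overhead),
// negative values mean it got faster.
double relativeChangePercent(double baseline, double measured) {
  assert(baseline > 0.0 && "baseline time must be positive");
  return (measured - baseline) / baseline * 100.0;
}
```

With purely hypothetical timings of 0.07 s for the baseline and 0.09 s for the measured run, this gives about 28.57% extra compile time; a run that takes half as long as its baseline gives -50%.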
Second, let's look into the total impact of those polly canonicalization
passes by removing all optional canonicalization passes as follows:
  static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
    // PM.add(llvm::createPromoteMemoryToRegisterPass());
    // PM.add(llvm::createInstructionCombiningPass()); // this is the most expensive canonicalization pass for flop benchmark
    // PM.add(llvm::createCFGSimplificationPass());
    // PM.add(llvm::createTailCallEliminationPass());
    // PM.add(llvm::createCFGSimplificationPass());
    // PM.add(llvm::createReassociatePass());
    // PM.add(llvm::createLoopRotatePass());
    // PM.add(llvm::createInstructionCombiningPass());
    PM.add(polly::createCodePreparationPass());
  }
As shown on
http://188.40.87.11:8000/db_default/v4/nts/47?baseline=45&compare_to=45, the
extra compile-time overhead is very small (5.04% at most) when all
optional Polly canonicalization passes are removed. However, I think it is not
practical to remove all these canonicalizations, for the sake of Polly
optimization performance. I will further evaluate Polly's performance (with
optimization and code generation) in the case where all optional
canonicalization passes are removed.
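Commenting passes in and out and rebuilding LLVMPolly.so for every experiment is tedious, so the variants could instead be selected by a few switches. The following is only a stand-alone sketch of that idea: it models the pass list with plain strings rather than the real llvm::PassManagerBase, and CanonicalizationConfig and buildCanonicalizationPipeline are hypothetical names that do not exist in Polly.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical switches for the experiments described above.
struct CanonicalizationConfig {
  bool promoteMemoryToRegister = true;
  bool firstInstructionCombining = true;
  bool remainingOptionalPasses = true;  // everything between the first
                                        // InstructionCombining and CodePreparation
};

// Returns the pass names in the order they would be added.
// This is a model of the pass list only; no LLVM API is called.
std::vector<std::string>
buildCanonicalizationPipeline(const CanonicalizationConfig &cfg) {
  std::vector<std::string> passes;
  if (cfg.promoteMemoryToRegister)
    passes.push_back("PromoteMemoryToRegister");
  if (cfg.firstInstructionCombining)
    passes.push_back("InstructionCombining");
  if (cfg.remainingOptionalPasses) {
    const char *const rest[] = {"CFGSimplification", "TailCallElimination",
                                "CFGSimplification", "Reassociate",
                                "LoopRotate",        "InstructionCombining"};
    for (const char *name : rest)
      passes.push_back(name);
  }
  // CodePreparation is kept in every variant discussed in this mail.
  passes.push_back("CodePreparation");
  assert(passes.back() == "CodePreparation");
  return passes;
}
```

Setting only firstInstructionCombining to false models run 46, and setting all three switches to false models run 47, where only CodePreparation remains.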
As a simple informal conclusion, I think we should revise Polly's
canonicalization passes. At least we should consider removing the first
"InstructionCombining" pass!
Best,
Star Tan
Star Tan
2013-Sep-17 02:12 UTC
[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization
More evaluation results are now available at
http://188.40.87.11:8000/db_default/v4/nts/recent_activity
I mainly care about the compile-time and execution-time impact for the following
cases:
pBasic (run 45): clang -O3 -load LLVMPolly.so
pNoGenSCEV (run 44): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly
-polly-optimizer=none -polly-code-generator=none
pNoGenSCEV_nocan (run 47): same options as pNoGenSCEV, but with LLVMPolly.so
rebuilt with all Polly canonicalization passes removed
pNoGenSCEV_procomb (run 51): same options as pNoGenSCEV, but with LLVMPolly.so
rebuilt with only the "InstructionCombining" and
"PromoteMemoryToRegister" canonicalization passes removed
pOptSCEV (run 48): clang -O3 -load LLVMPolly.so -polly-codegen-scev -polly
pOptSCEV_nocan (run 50): same options as pOptSCEV, but with LLVMPolly.so
rebuilt with all Polly canonicalization passes removed
pOptSCEV_procomb (run 52): same options as pOptSCEV, but with LLVMPolly.so
rebuilt with only the "InstructionCombining" and
"PromoteMemoryToRegister" canonicalization passes removed
pollyOpt (run 53): clang -O3 -load LLVMPolly.so -mllvm -polly
Discovery 1: Polly optimization and code generation rely heavily on the
"InstructionCombining" and "PromoteMemoryToRegister"
canonicalization passes.
http://188.40.87.11:8000/db_default/v4/nts/52?compare_to=45&baseline=45
shows the comparison between pOptSCEV_procomb and pBasic. As the results show,
Polly optimization and code generation lead to very small compile-time overhead
(about 20% at most) compared with clang. The top four benchmarks are:
  SingleSource/UnitTests/SignlessTypes/rem                                   20.37%
  SingleSource/Benchmarks/Misc/oourafft                                      11.34%
  MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl      10.22%
  MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset           10.21%
This means that most of the expensive Polly analysis/optimization/code
generation passes are not enabled unless these two canonicalization passes are
run. Of course, Polly also yields few performance gains in this case. The top
benchmarks for performance improvements are:
SingleSource/Benchmarks/Shootout/nestedloop -100.00%
SingleSource/Benchmarks/Shootout-C++/nestedloop -100.00%
MultiSource/Benchmarks/Ptrdist/anagram/anagram -14.26%
SingleSource/Benchmarks/Shootout/lists -10.77%
BTW, this bug (llvm.org/bugs/show_bug.cgi?id=17159), which showed up in general
SCEV optimization, no longer appears.
Discovery 2: Removing Polly canonicalization passes significantly reduces
compile-time and may also reduce execution-time.
http://188.40.87.11:8000/db_default/v4/nts/50?compare_to=48&baseline=48 shows
the comparison between "full Polly canonicalization" and "no
Polly canonicalization". As expected, removing canonicalization passes
significantly reduces compile-time overhead, and it may degrade execution-time
performance since canonicalization passes can provide more
opportunities for optimization. However, we find that removing Polly
canonicalization passes may also improve the execution-time performance for some
benchmarks, as shown below:
Performance Regressions - Execution Time
  MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt          45.89%
  SingleSource/Benchmarks/CoyoteBench/huffbench                                    22.24%
  SingleSource/Benchmarks/Shootout/fib2                                            15.06%
  SingleSource/Benchmarks/Stanford/FloatMM                                         13.98%
  SingleSource/Benchmarks/Misc-C++/mandel-text                                     13.16%
Performance Improvements - Execution Time
  SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect                  -37.50%
  SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog        -27.69%
  MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt                         -22.59%
  SingleSource/Benchmarks/Misc/himenobmtxpa                                       -21.98%
  MultiSource/Benchmarks/TSVC/GlobalDataFlow-flt/GlobalDataFlow-flt               -16.44%
This means Polly's optimization does not always improve performance; it
may also cause performance regressions. The same can be
seen in the comparison between "clang -O3 with Polly" and "clang
-O3 without Polly" on
http://188.40.87.11:8000/db_default/v4/nts/48?compare_to=45&baseline=45.
Many benchmarks show execution-time regressions, so we need to further refine
Polly's optimization. At least we should avoid the performance regressions.
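To keep an eye on such regressions across runs, the per-benchmark deltas could be split automatically into regressions and improvements. This is only a rough sketch under my own assumptions: a positive percentage means longer execution time, the threshold is an arbitrary noise cut-off, and BenchmarkDelta, Classified and classifyDeltas are hypothetical names, not LNT functionality.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Execution-time change of one benchmark, in percent (positive = slower).
struct BenchmarkDelta {
  std::string name;
  double percent;
};

struct Classified {
  std::vector<BenchmarkDelta> regressions;   // slower by more than the threshold
  std::vector<BenchmarkDelta> improvements;  // faster by more than the threshold
};

// Splits the deltas into regressions and improvements; changes whose
// magnitude does not exceed thresholdPercent are treated as noise and dropped.
Classified classifyDeltas(const std::vector<BenchmarkDelta> &deltas,
                          double thresholdPercent) {
  assert(thresholdPercent >= 0.0 && "threshold must not be negative");
  Classified result;
  for (const BenchmarkDelta &delta : deltas) {
    if (delta.percent > thresholdPercent)
      result.regressions.push_back(delta);
    else if (delta.percent < -thresholdPercent)
      result.improvements.push_back(delta);
  }
  return result;
}
```

With a 5% threshold, huffbench (22.24%) from the list above would be reported as a regression and reg_detect (-37.50%) as an improvement, while a change of 1% would be ignored.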
In the next step, I will evaluate those polly canonicalization passes without
-polly-codegen-scev to understand their compile-time and execution-time impact.
Best,
Mingxing