thr3ads.net - llvm dev - [LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization [Sep 2013]

Star Tan

2013-Sep-09 03:18 UTC

[LLVMdev] [Polly] Compile-time and Execution-time analysis for the SCEV canonicalization

At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es>
wrote:
>On 09/08/2013 08:03 PM, Star Tan wrote:
>> Hello all,
>>
>>
>> I have done some basic experiments with the Polly canonicalization passes
>> and found that the SCEV canonicalization has a significant impact on both
>> compile-time and execution-time performance.
>
>Interesting.
>
>> Detailed results for SCEV and default canonicalization can be viewed
>> on: http://188.40.87.11:8000/db_default/v4/nts/32 (or 33, 34)
>>     * pNoGen with SCEV canonicalization (run 32): -O3 -Xclang -load
>>       -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
>>       -mllvm -polly-code-generator=none -mllvm -polly-codegen-scev
>>     * pNoGen with default canonicalization (run 33): -O3 -Xclang -load
>>       -Xclang LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
>>       -mllvm -polly-code-generator=none
>>     * pBasic without any canonicalization (run 34): -O3 -Xclang -load
>>       -Xclang LLVMPolly.so
>>
>>
>> Impact of SCEV canonicalization:
>>     http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34
>> Impact of default canonicalization:
>>     http://188.40.87.11:8000/db_default/v4/nts/33?compare_to=34&baseline=34
>> Comparison of SCEV canonicalization with default canonicalization:
>>     http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33
>>
>>
>> As we expected, both SCEV canonicalization and default canonicalization
>> slightly increase compile time (by at most 30%). They also lead to some
>> execution-time regressions and improvements.
>>
>>
>> The only difference between SCEV canonicalization and default
>> canonicalization is the "IndVarSimplify" pass, as shown in
>> RegisterPasses.cpp:212:
>>        if (!SCEVCodegen)
>>          PM.add(polly::createIndVarSimplifyPass());
>
>There are actually more differences (see grep -R SCEVCodegen polly/),
>but the other differences will mainly be code generation differences.

Thanks for the reminder. Since we are currently focusing on the canonicalization
passes, the other code generation differences do not matter here.

>> However, I find it interesting to look into the comparison between
>> SCEV canonicalization and default canonicalization
>> (http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=33&baseline=33):
>
>Yes, this is definitely a good start.
>
>> First of all, we can expect SCEV canonicalization to have better
>> compile-time performance since it avoids the "IndVarSimplify" pass.
>> In fact, it improves compile time by more than 5% for 32 benchmarks,
>> most notably:
>>          MultiSource/Applications/lemon/lemon                                        -11.02%
>>          SingleSource/Benchmarks/Misc/oourafft                                       -10.53%
>>          SingleSource/Benchmarks/Linpack/linpack-pc                                  -10.00%
>>          MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan             -8.31%
>>          MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt        -8.18%
>>
>>
>> Second, SCEV canonicalization shows both regressions and improvements
>> in execution performance compared with default canonicalization.
>> There are many execution-time regressions, such as:
>>          SingleSource/Benchmarks/Shootout/nestedloop      +16363.64%
>>          SingleSource/Benchmarks/Shootout-C++/nestedloop  +16200.00%
>Those two have a huge impact. Understanding what is going on here would
>be nice.

Yes, I am investigating these cases.

>> I think the execution-time regression is mainly due to the unexpected
>> performance improvements from non-SCEV canonicalization, as shown in
>> the following bug: http://llvm.org/bugs/show_bug.cgi?id=17153. As the next
>> step, I will try to find out why "IndVarSimplify" can produce better
>> code. If we can eliminate the "IndVarSimplify" canonicalization but keep
>> producing high-quality code, then we gain compile-time performance
>> without losing execution-time performance.
>
>Previous experience has shown that the indvars pass as we run it in
>Polly can unpredictably change performance both negatively and
>positively. It was disabled as people did not manage to eliminate all
>regressions it introduced, such that the positive performance changes
>could not really be valued.
>
>So regarding performance tuning, I do not think we need to get this
>optimal. As soon as -polly-codegen-scev reaches performance similar to
>the original approach, we are fine.

I see, and I agree. I think we care more about compile-time performance for
Polly's canonicalization passes, since no Polly optimization or Polly code
generation happens here.

>Also, I wonder if your runs include the dependence analysis. If this is
>the case, the numbers are very good. Otherwise, 30% overhead seems still
>to be a little bit much.

I think no Polly dependence analysis is involved, since our compile command is:
clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm
-polly-optimizer=none -mllvm -polly-code-generator=none -mllvm
-polly-codegen-scev
Fortunately, with the option "-polly-codegen-scev", only three
benchmarks show >20% extra compile-time overhead:
SingleSource/Benchmarks/Misc/flops                          28.57%
MultiSource/Benchmarks/MiBench/security-sha/security-sha    22.22%
MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes             21.05%
When I look into the compile time for the flops benchmark using
"-ftime-report", I find that the extra compile-time overhead mainly comes
from the "Combine redundant instructions" pass.
The top 5 passes when compiled with the Polly canonicalization passes:
   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
   0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
   0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops

But the top 5 passes for clang are:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
   0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
We can see that "Combine redundant instructions" is invoked many times,
and the extra invocation caused by Polly's canonicalization is the most
significant one. We have seen this problem before, and I need to look into the
details of the canonicalization passes related to "Combine redundant
instructions".
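
Since "Combine redundant instructions" shows up on several rows, it can help to sum the wall time per pass name before comparing the two reports. The following is only a minimal sketch, assuming the rows have already been extracted by hand into (name, wall seconds) pairs. The `Row` type and the helper `sumWallTimeByPass` are hypothetical names, and the numbers are copied from the two top-5 tables in this message, so passes outside the top 5 are not counted.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One row of a -ftime-report table: pass name and wall time in seconds.
using Row = std::pair<std::string, double>;

// Hypothetical helper: sum the wall time of all rows sharing a pass name.
std::map<std::string, double> sumWallTimeByPass(const std::vector<Row> &Rows) {
  std::map<std::string, double> Total;
  for (const Row &R : Rows)
    Total[R.first] += R.second;
  return Total;
}

// Top-5 rows with the Polly canonicalization passes (wall time column).
const std::vector<Row> WithPolly = {
    {"Combine redundant instructions", 0.0164},
    {"X86 DAG->DAG Instruction Selection", 0.0138},
    {"Greedy Register Allocator", 0.0045},
    {"Global Value Numbering", 0.0029},
    {"Polly - Create polyhedral description of Scops", 0.0028},
};

// Top-5 rows for clang without Polly (wall time column).
const std::vector<Row> PlainClang = {
    {"X86 DAG->DAG Instruction Selection", 0.0141},
    {"Greedy Register Allocator", 0.0047},
    {"Combine redundant instructions", 0.0034},
    {"Global Value Numbering", 0.0029},
    {"Combine redundant instructions", 0.0029},
};
```

With these rows, the two "Combine redundant instructions" entries of the clang run add up to 0.0063 s of wall time, against 0.0164 s for the single entry listed in the Polly run.
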
BTW, I want to point out that although SCEV-based Polly canonicalization (with
-polly-codegen-scev) runs faster than default canonicalization, it leads to 5
extra compile errors and 3 extra runtime errors, as shown on
http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34.
I have done some basic analysis for one of the compile errors (7zip-benchmark).
Results can be viewed on http://llvm.org/bugs/show_bug.cgi?id=17159

Best,
Star Tan




Tobias Grosser

2013-Sep-09 05:07 UTC


On 09/09/2013 05:18 AM, Star Tan wrote:
>
> At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es> wrote:
>
>> On 09/08/2013 08:03 PM, Star Tan wrote:
>> Also, I wonder if your runs include the dependence analysis. If this is
>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>> to be a little bit much.
> I think no Polly Dependence analysis is involved since our compiling
> command is:
> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm
> -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm
> -polly-codegen-scev
> Fortunately, with the option "-polly-codegen-scev", only three
> benchmarks show >20% extra compile-time overhead:

I believe so too, but please verify with -debug-pass=Structure

> SingleSource/Benchmarks/Misc/flops	28.57%
> MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
> When I look into the compile-time for the flops benchmark using
> "-ftime-report", I find the extra compile-time overhead mainly comes
> from the "Combine redundant instructions" pass.
> The top 5 passes when compiled with Polly canonicalization passes:
>     ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>     0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
>     0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops
>
> But the top 5 passes for clang are:
>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>     0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
>     0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
> We can see that "Combine redundant instructions" is invoked many times,
> and the extra invocation caused by Polly's canonicalization is the most
> significant one. We have seen this problem before, and I need to look into
> the details of the canonicalization passes related to "Combine redundant
> instructions".

OK.

> BTW, I want to point out that although SCEV-based Polly canonicalization
> (with -polly-codegen-scev) runs faster than default canonicalization, it
> leads to 5 extra compile errors and 3 extra runtime errors, as shown on
> http://188.40.87.11:8000/db_default/v4/nts/32?compare_to=34&baseline=34.
> I have done some basic analysis for one of the compile errors
> (7zip-benchmark). Results can be viewed on
> http://llvm.org/bugs/show_bug.cgi?id=17159

Great. I will help look into this, starting this weekend.

Cheers,
Tobias

Star Tan

2013-Sep-13 04:46 UTC


At 2013-09-09 13:07:07,"Tobias Grosser" <tobias at grosser.es>
wrote:
>On 09/09/2013 05:18 AM, Star Tan wrote:
>>
>> At 2013-09-09 05:52:35,"Tobias Grosser" <tobias at grosser.es> wrote:
>>
>>> On 09/08/2013 08:03 PM, Star Tan wrote:
>>> Also, I wonder if your runs include the dependence analysis. If this is
>>> the case, the numbers are very good. Otherwise, 30% overhead seems still
>>> to be a little bit much.
>> I think no Polly Dependence analysis is involved since our compiling
>> command is:
>> clang -O3 -Xclang -load -Xclang LLVMPolly.so -mllvm -polly -mllvm
>> -polly-optimizer=none -mllvm -polly-code-generator=none -mllvm
>> -polly-codegen-scev
>> Fortunately, with the option "-polly-codegen-scev", only three
>> benchmarks show >20% extra compile-time overhead:
>
>I believe so too, but please verify with -debug-pass=Structure

I have verified this: the run indeed does not involve the Polly dependence
analysis. The cost of the "Polly Dependence Pass" is still high for some
benchmarks, as we have discussed before.

>> SingleSource/Benchmarks/Misc/flops	28.57%
>> MultiSource/Benchmarks/MiBench/security-sha/security-sha	22.22%
>> MultiSource/Benchmarks/VersaBench/ecbdes/ecbdes	21.05%
>> When I look into the compile-time for the flops benchmark using
>> "-ftime-report", I find the extra compile-time overhead mainly comes
>> from the "Combine redundant instructions" pass.
>> The top 5 passes when compiled with Polly canonicalization passes:
>>     ---User Time---   --User+System--   ---Wall Time---  --- Name ---
>>     0.0160 ( 20.0%)   0.0160 ( 20.0%)   0.0164 ( 20.8%)  Combine redundant instructions
>>     0.0120 ( 15.0%)   0.0120 ( 15.0%)   0.0138 ( 17.5%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0045 (  5.7%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  3.7%)  Global Value Numbering
>>     0.0040 (  5.0%)   0.0040 (  5.0%)   0.0028 (  3.6%)  Polly - Create polyhedral description of Scops
>>
>> But the top 5 passes for clang are:
>>     ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>>     0.0120 ( 25.0%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0141 ( 25.2%)  X86 DAG->DAG Instruction Selection
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0047 (  8.4%)  Greedy Register Allocator
>>     0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0034 (  6.1%)  Combine redundant instructions
>>     0.0000 (  0.0%)   0.0040 ( 50.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Global Value Numbering
>>     0.0040 (  8.3%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0029 (  5.2%)  Combine redundant instructions
>> We can see that "Combine redundant instructions" is invoked many times,
>> and the extra invocation caused by Polly's canonicalization is the most
>> significant one. We have seen this problem before, and I need to look into
>> the details of the canonicalization passes related to "Combine redundant
>> instructions".
>
>OK.

By investigating the flops benchmark, I find that the key is the first
"InstructionCombining" pass in the series of canonicalization passes
listed below:
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
  // This is the most expensive canonicalization pass for the flops benchmark.
  PM.add(llvm::createInstructionCombiningPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());
  if (!SCEVCodegen)
    PM.add(polly::createIndVarSimplifyPass());
  PM.add(polly::createCodePreparationPass());
}
If we remove the first "InstructionCombining" pass, the compile time is
reduced by more than 10%. The results reported by -ftime-report
become very similar to the case without Polly canonicalization:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0120 ( 23.1%)   0.0000 (  0.0%)   0.0120 ( 21.4%)   0.0138 ( 21.5%)  X86 DAG->DAG Instruction Selection
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0045 (  7.1%)  Greedy Register Allocator
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0042 (  6.6%)  Polly - Create polyhedral description of Scops
   0.0040 (  7.7%)   0.0000 (  0.0%)   0.0040 (  7.1%)   0.0038 (  5.9%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0029 (  4.5%)  Global Value Numbering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0027 (  4.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.2%)  Combine redundant instructions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0020 (  3.1%)  Combine redundant instructions
Similar results have been found for the whetstone benchmark. I will run a full
test with the LLVM test-suite tonight to see whether the change is similarly
effective for the other test-suite benchmarks.
@Tobias, do you have any idea about the performance impact and other
consequences of removing this canonicalization pass? In my opinion, it
should not be important, since we still run the "InstructionCombining"
pass after the "createLoopRotatePass" pass, and in fact there are many more
runs of the "InstructionCombining" pass after this point.
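
The experiment can be pictured with a stand-alone model of the pass list. This is only a sketch under stated assumptions: it uses no LLVM headers and builds no real pass manager, the function `canonicalizationPipeline` and its two parameters are hypothetical, and the strings merely mirror the create*Pass calls listed above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-alone model of the canonicalization pass list. It only
// returns pass names in order; it does not touch llvm::PassManagerBase.
std::vector<std::string> canonicalizationPipeline(bool EarlyInstCombine,
                                                  bool SCEVCodegen) {
  std::vector<std::string> Passes;
  Passes.push_back("PromoteMemoryToRegister");
  if (EarlyInstCombine)
    Passes.push_back("InstructionCombining"); // the pass removed in the experiment
  Passes.push_back("CFGSimplification");
  Passes.push_back("TailCallElimination");
  Passes.push_back("CFGSimplification");
  Passes.push_back("Reassociate");
  Passes.push_back("LoopRotate");
  Passes.push_back("InstructionCombining"); // still runs after loop rotation
  if (!SCEVCodegen)
    Passes.push_back("IndVarSimplify"); // skipped with -polly-codegen-scev
  Passes.push_back("CodePreparation");
  return Passes;
}
```

Calling `canonicalizationPipeline(false, true)` gives the variant measured above: one "InstructionCombining" entry remains, placed after "LoopRotate".
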
Best,
Star Tan

Sebastian Pop

2013-Sep-26 20:05 UTC


Hi Star Tan,

Thanks for the very interesting perf analyses.

Star Tan wrote:
> We can see that "Combine redundant instructions" is invoked many times,
> and the extra invocation caused by Polly's canonicalization is the most
> significant one. We have seen this problem before, and I need to look into
> the details of the canonicalization passes related to "Combine redundant
> instructions".

It could be that the scev codegen produces the same subexpression again and
again due to the fact that we are asking the same question again and again for
each array index: basically, in the original code we have a set of array access
functions A1(i), A2(i), ..., An(i), that get transformed by polly using a linear
transform function t: A1(t(i)), A2(t(i)), ..., An(t(i)), so you see that t(i)
appears again and again, and we probably do generate redundantly the same code
for it.
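
The effect can be illustrated with a toy code generator. This is only a sketch of the idea, not Polly's SCEV codegen: the `Emitter` class, its `emit` method, the helper `countEmittedForAccesses` and the string form of t(i) are all hypothetical. Without a cache, each of the n accesses emits its own copy of t(i); with a cache keyed on the expression, t(i) is emitted once and reused.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy emitter (hypothetical, not part of Polly or LLVM): it records every
// "instruction" it is asked to generate for an expression.
class Emitter {
public:
  explicit Emitter(bool UseCache) : UseCache(UseCache) {}

  // Return the name of the value holding Expr, emitting code only if needed.
  std::string emit(const std::string &Expr) {
    if (UseCache) {
      auto It = Cache.find(Expr);
      if (It != Cache.end())
        return It->second; // reuse the value generated earlier
    }
    std::string Name = "%v" + std::to_string(Emitted.size());
    Emitted.push_back(Name + " = " + Expr);
    if (UseCache)
      Cache[Expr] = Name;
    return Name;
  }

  std::size_t numEmitted() const { return Emitted.size(); }

private:
  bool UseCache;
  std::map<std::string, std::string> Cache;
  std::vector<std::string> Emitted;
};

// Generate the index t(i) for the accesses A1(t(i)), ..., An(t(i)) and
// report how many copies of t(i) were emitted.
std::size_t countEmittedForAccesses(bool UseCache, std::size_t NumAccesses) {
  Emitter E(UseCache);
  for (std::size_t K = 0; K < NumAccesses; ++K)
    E.emit("2*i + 1"); // stand-in for the linear transform t(i)
  return E.numEmitted();
}
```

If duplicates like these are emitted, later cleanup passes have more instructions to process; whether that explains the measured overhead has not been verified here.
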

> BTW, I want to point out that although SCEV-based Polly canonicalization (with
> -polly-codegen-scev) runs faster than default canonicalization, it can lead to
> 5 extra compile errors and 3 extra runtime errors

That's one of the reasons why we have not turned SCEV codegen on by default
yet.
I will address all these issues and then we'll flip the default value of the
-polly-codegen-scev flag.

> I have done some basic analysis for one of the compile errors
> (7zip-benchmark). Results can be viewed on
> http://llvm.org/bugs/show_bug.cgi?id=17159

Thanks for filing that bug report: I just assigned it to myself.

Sebastian
-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Star Tan

2013-Sep-27 03:49 UTC


At 2013-09-27 04:05:07,"Sebastian Pop" <spop at codeaurora.org> wrote:
>Hi Star Tan,
>
>Thanks for the very interesting perf analyses.
>
>Star Tan wrote:
>> We can see that "Combine redundant instructions" is invoked many times,
>> and the extra invocation caused by Polly's canonicalization is the most
>> significant one. We have seen this problem before, and I need to look into
>> the details of the canonicalization passes related to "Combine redundant
>> instructions".
>
>It could be that the scev codegen produces the same subexpression again and
>again due to the fact that we are asking the same question again and again for
>each array index: basically, in the original code we have a set of array access
>functions A1(i), A2(i), ..., An(i), that get transformed by polly using a linear
>transform function t: A1(t(i)), A2(t(i)), ..., An(t(i)), so you see that t(i)
>appears again and again, and we probably do generate redundantly the same code
>for it.
>
>> BTW, I want to point out that although SCEV based Polly canonicalization (with
>> -polly-codegen-scev) runs faster than default canonicalization, it can lead to
>> 5 extra compile errors and 3 extra runtime errors
>
>That's one of the reasons why we have not turned SCEV codegen on by default yet.
>I will address all these issues and then we'll flip the default value of the
>-polly-codegen-scev flag.

Great! I will try to investigate the other errors and either report them in the
LLVM Bugzilla or try to fix them. I also look forward to fixing these errors and
flipping the default option value as soon as possible.

>> I have done some basic analysis for one of the compile errors
>> (7zip-benchmark). Results can be viewed on
>> http://llvm.org/bugs/show_bug.cgi?id=17159
>
>Thanks for filing that bug report: I just assigned it to myself.
>
>Sebastian
>-- 
>Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>hosted by The Linux Foundation
Thanks,
Mingxing