thr3ads.net - llvm dev - [LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence [Aug 2013]

If this information is useful, please help other people find it:
Share via:

Star Tan

2013-Aug-08 08:29 UTC

[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

Hi all, 


I have summarized the top 10 compiler passes for Polly when compiling LLVM
test-ssuite. Results can be viewed on:
   
https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10


Based on the comparison between "clang -O3" and "polly -O3"
listed on:
    http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14


we can see Polly's compile-time overhead is mainly resulted by some
expensive Polly passes such as PollyDependence, PollyOptimization and
PollyCodegen. Especially, I notice that the PollyDependence can lead to
significant extra compile-time overhead. Its compile-time percentage for some
expensive benchmarks can be summarized as:
    nestedloop:  41.4% (Polly - Calculate dependence)
    salsa20:       98.5% (Polly - Calculate dependence)
    seidel-2d:     72.1% (Polly - Calculate dependence)
    multiplies:     54.3% (Poly -  Calculate dependence)
    Puzzle:         22.8% (Poly -  Calculate dependence)


As a result, it is critical to improve the PollyDependence pass.  I have
previously committed a patch file to split start value from SCEVAddRecExpr
(r187728), which can improve the PollyDependence pass for some benchmarks as
follows:
     http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18
Obviously, it is not enough.


I have investigated  the salsa20 benchmark, in which the PollyDependence pass
consumes 98.5% of total compile-time. The key code of salsa20 benchmark consists
of a serial of array operations:
  #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
  for (i = 20;i > 0;i -= 2) {
    x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
    x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);
    ....
    x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
  }
which would be translated into a serial of affine functions:
      ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[12] };
      ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
      MustWriteAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
      ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[0] };
      ...
Consequently, the PollyDependence pass would produce very complex
Read/Write/MayWrite as follows:


      Read: { Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0 <=
15; Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0
>= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[14] : (i0 >= 0 and i0 <= 9) o!
 r (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and
i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[8] : (i0 >= 0 and i0 <= 9)
or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and
i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[10] : (i0 >= 0 and i0 <= 9)
or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and
i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_fo!
 r_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and
i0
<= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >=
0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }
      Write: ...


As we can see, the reason why PollyDependence leads to expensive compile-time
lies in the very complex Read/Write/MayWrite structure.  In fact, we can run
isl_union_map_coalesce() on the Read/MayWrite/MustWrite before feeding them into
ISL analysis:


diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp index
9f918f3..39c3fb6 100644 --- a/lib/Analysis/Dependences.cpp +++
b/lib/Analysis/Dependences.cpp @@ -95,6 +95,10 @@ void
Dependences::calculateDependences(Scop &S) { collectInfo(S, &Read,
&Write, &MayWrite, &Schedule); + Read =
isl_union_map_coalesce(Read); + Write = isl_union_map_coalesce(Write); +
MayWrite = isl_union_map_coalesce(MayWrite); + DEBUG(dbgs() << "Read:
" << Read << "\n"


With this patch file, we can reduce the compile-time percentage of
PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage
of PollyDependence for benchmarks, such as "nestedloop", is still very
high.


My plan is to continue investigating why PollyDependence pass leads to such high
compile-time overhead. Could anyone give me some comments or suggestions?


Thanks,
Star Tan

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130808/145c2f20/attachment.html>

Tobias Grosser

2013-Aug-08 14:28 UTC

head link

[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

On 08/08/2013 01:29 AM, Star Tan wrote:> Hi all,
>
>
> I have summarized the top 10 compiler passes for Polly when compiling LLVM
test-ssuite. Results can be viewed on:
>     
https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10
>
>
> Based on the comparison between "clang -O3" and "polly
-O3" listed on:
>     
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14
Please compare against clang -O3 -load LLVMPolly.so, otherwise 
especially the compile time of the small binaries includes the overhead 
of loading the Polly shared object file.
> we can see Polly's compile-time overhead is mainly resulted by some
expensive Polly passes such as PollyDependence, PollyOptimization and
PollyCodegen. Especially, I notice that the PollyDependence can lead to
significant extra compile-time overhead. Its compile-time percentage for some
expensive benchmarks can be summarized as:
>      nestedloop:  41.4% (Polly - Calculate dependence)
>      salsa20:       98.5% (Polly - Calculate dependence)
>      seidel-2d:     72.1% (Polly - Calculate dependence)
>      multiplies:     54.3% (Poly -  Calculate dependence)
>      Puzzle:         22.8% (Poly -  Calculate dependence)
>
>
> As a result, it is critical to improve the PollyDependence pass.  I have
previously committed a patch file to split start value from SCEVAddRecExpr
(r187728), which can improve the PollyDependence pass for some benchmarks as
follows:
>      
http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18
Those are very nice results. Removing 80% of the compile time for the lu 
benchmark is a very nice outcome.
> Obviously, it is not enough.
Sure, there is still plenty left to be optimised. However, when Polly 
detects a code region that it can optimise it seems fair that we spend 
some time in analysing and optimising it. Especially if our 
transformation improves the run-time performance.
> I have investigated  the salsa20 benchmark, in which the PollyDependence
pass consumes 98.5% of total compile-time. The key code of salsa20 benchmark
consists of a serial of array operations:
>    #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
>    for (i = 20;i > 0;i -= 2) {
>      x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
>      x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);
>      ....
>      x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
>    }
> which would be translated into a serial of affine functions:
>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[12] };
>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
>        MustWriteAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[0] };
>        ...
> Consequently, the PollyDependence pass would produce very complex
Read/Write/MayWrite as follows:
>
>
>        Read: { Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0
<> 15; Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9)
or (i0
>> = 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and
i0 <= 9) or
> (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or
(i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or
(i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
or (i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or
(i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or
(i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[14] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
or (i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or
(i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
or (i0 >> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0
and i0 <= 9) or (i0
>> = 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0
and i0
> <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or
(i0 >= 0 and
> i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
> Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0
>= 0
> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9);
> Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0
>= 0
> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9);
> Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0
>= 0
> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9);
> Stmt_for_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0
>= 0
> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9);
> Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0
>= 0
> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9);
> Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0
>= 0
> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0
>> = 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }
>        Write: ...
>
>
> As we can see, the reason why PollyDependence leads to expensive
compile-time lies in the very complex Read/Write/MayWrite structure.  In fact,
we can run isl_union_map_coalesce() on the Read/MayWrite/MustWrite before
feeding them into ISL analysis:
Very nice.
> diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
index 9f918f3..39c3fb6 100644 --- a/lib/Analysis/Dependences.cpp +++
b/lib/Analysis/Dependences.cpp @@ -95,6 +95,10 @@ void
Dependences::calculateDependences(Scop &S) { collectInfo(S, &Read,
&Write, &MayWrite, &Schedule); + Read =
isl_union_map_coalesce(Read); + Write = isl_union_map_coalesce(Write); +
MayWrite = isl_union_map_coalesce(MayWrite); + DEBUG(dbgs() << "Read:
" << Read << "\n"
This patch is unreadable in the mail. However, the one you submitted 
looked good and was committed.
> With this patch file, we can reduce the compile-time percentage of
PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage
of PollyDependence for benchmarks, such as "nestedloop", is still very
high.
It would be good to get an up-to-date comparison with the latest patch 
having gone into Polly.

I did not yet look at the nestedloop benchmark, but it sounds basically 
like a benchmark only consisting of loop nests that we can optimise. 
This is definitely interesting to look into. Both in respect of how fast
we can analyse it faster, but also if we are able to improve the 
performance of the generated code. Especially if we improve the 
execution performance some additional compile time is justified.
> My plan is to continue investigating why PollyDependence pass leads to such
high compile-time overhead. Could anyone give me some comments or suggestions?
You already made nice progress. I propose to just continue by measuring 
the remaining overhead, taking one of the top cases and analysing it. We 
then see on a case by case basis what can be done.

Cheers,
Tobias

Star Tan

2013-Aug-09 01:27 UTC

head link

[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

At 2013-08-08 22:28:33,"Tobias Grosser" <tobias at grosser.es>
wrote:>On 08/08/2013 01:29 AM, Star Tan wrote:
>> Hi all,
>>
>>
>> I have summarized the top 10 compiler passes for Polly when compiling
LLVM test-ssuite. Results can be viewed on:
>>    
 https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10
>>
>>
>> Based on the comparison between "clang -O3" and "polly
-O3" listed on:
>>    
 http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14
>
>Please compare against clang -O3 -load LLVMPolly.so, otherwise 
>especially the compile time of the small binaries includes the overhead 
>of loading the Polly shared object file.
In fact, the compile-time overhead of loading Polly shared object file is very
small (usually less than 1%). Their overhead can be viewed on:
    http://188.40.87.11:8000/db_default/v4/nts/15?compare_to=14&baseline=14
>> we can see Polly's compile-time overhead is mainly resulted by some
expensive Polly passes such as PollyDependence, PollyOptimization and
PollyCodegen. Especially, I notice that the PollyDependence can lead to
significant extra compile-time overhead. Its compile-time percentage for some
expensive benchmarks can be summarized as:
>>      nestedloop:  41.4% (Polly - Calculate dependence)
>>      salsa20:       98.5% (Polly - Calculate dependence)
>>      seidel-2d:     72.1% (Polly - Calculate dependence)
>>      multiplies:     54.3% (Poly -  Calculate dependence)
>>      Puzzle:         22.8% (Poly -  Calculate dependence)
>>
>>
>> As a result, it is critical to improve the PollyDependence pass.  I
have previously committed a patch file to split start value from SCEVAddRecExpr
(r187728), which can improve the PollyDependence pass for some benchmarks as
follows:
>>      
http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18
>
>Those are very nice results. Removing 80% of the compile time for the lu 
>benchmark is a very nice outcome.
>
>> Obviously, it is not enough.
>
>Sure, there is still plenty left to be optimised. However, when Polly 
>detects a code region that it can optimise it seems fair that we spend 
>some time in analysing and optimising it. Especially if our 
>transformation improves the run-time performance.
>
>> I have investigated  the salsa20 benchmark, in which the
PollyDependence pass consumes 98.5% of total compile-time. The key code of
salsa20 benchmark consists of a serial of array operations:
>>    #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
>>    for (i = 20;i > 0;i -= 2) {
>>      x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
>>      x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);
>>      ....
>>      x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
>>    }
>> which would be translated into a serial of affine functions:
>>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[12] };
>>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
>>        MustWriteAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
>>        ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[0] };
>>        ...
>> Consequently, the PollyDependence pass would produce very complex
Read/Write/MayWrite as follows:
>>
>>
>>        Read: { Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and
i0 <>> 15; Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0
<= 9) or (i0
>>> = 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0
and i0 <= 9) or
>> (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[14] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
Stmt_for_body5[i0] ->
>> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9) or (i0 >>> 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0
>= 0 and i0 <= 9) or (i0
>>> = 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[0] : (i0
>= 0 and i0
>> <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
or (i0 >= 0 and
>> i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <=
9);
>> Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or
(i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9);
>> Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or
(i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9);
>> Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or
(i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9);
>> Stmt_for_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or
(i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9);
>> Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or
(i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9);
>> Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or
(i0 >= 0
>> and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9) or (i0
>>> = 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }
>>        Write: ...
>>
>>
>> As we can see, the reason why PollyDependence leads to expensive
compile-time lies in the very complex Read/Write/MayWrite structure.  In fact,
we can run isl_union_map_coalesce() on the Read/MayWrite/MustWrite before
feeding them into ISL analysis:
>
>Very nice.
>
>> diff --git a/lib/Analysis/Dependences.cpp
b/lib/Analysis/Dependences.cpp index 9f918f3..39c3fb6 100644 ---
a/lib/Analysis/Dependences.cpp +++ b/lib/Analysis/Dependences.cpp @@ -95,6
+95,10 @@ void Dependences::calculateDependences(Scop &S) { collectInfo(S,
&Read, &Write, &MayWrite, &Schedule); + Read =
isl_union_map_coalesce(Read); + Write = isl_union_map_coalesce(Write); +
MayWrite = isl_union_map_coalesce(MayWrite); + DEBUG(dbgs() << "Read:
" << Read << "\n"
>
>This patch is unreadable in the mail. However, the one you submitted 
>looked good and was committed.
>
>> With this patch file, we can reduce the compile-time percentage of
PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage
of PollyDependence for benchmarks, such as "nestedloop", is still very
high.
>
>It would be good to get an up-to-date comparison with the latest patch 
>having gone into Polly.
Yes, you can view the comparison on:
    http://188.40.87.11:8000/db_default/v4/nts/26?compare_to=25&baseline=25

Results show that this patch file is very effective for several benchmarks, such
as salsa20 (reduced by 97.72%), Obsequi (54.62%), seidel-2d (48.64%),
telecomm-gsm (33.71%).
>I did not yet look at the nestedloop benchmark, but it sounds basically 
>like a benchmark only consisting of loop nests that we can optimise. 
>This is definitely interesting to look into. Both in respect of how fast
>we can analyse it faster, but also if we are able to improve the 
>performance of the generated code. Especially if we improve the 
>execution performance some additional compile time is justified.
Yes. nestedloop.c is a very simple benchmark that contains a single nested loop
as follows:
    int n = ((argc == 2) ? atoi(argv[1]) : LENGTH);
    int a, b, c, d, e, f, x=0;
    for (a=0; a<n; a++)
        for (b=0; b<n; b++)
            for (c=0; c<n; c++)
                for (d=0; d<n; d++)
                    for (e=0; e<n; e++)
                        for (f=0; f<n; f++)
                            x++;
Polly would significantly increases the compile-time from 0.0320s to 2.3320
(70x), but it also reduces the execution time from 0.048s to 0.004s (12x). Maybe
it is worth, but I think that would be eif we can reduce the compile-time
without hurting the execution performance.
> My plan is to continue investigating why PollyDependence pass leads to such
high compile-time overhead. Could anyone give me some comments or suggestions?
>
>You already made nice progress. I propose to just continue by measuring 
>the remaining overhead, taking one of the top cases and analysing it. We 
>then see on a case by case basis what can be done.
>
>Cheers,
>Tobias
>
Thanks for your suggestion.

Cheers,
Star Tan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130809/1960abe9/attachment.html>

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - Aug 2013 - [LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

Possibly Parallel Threads