Star Tan
2013-Aug-17 07:08 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
At 2013-08-16 22:32:30, "Tobias Grosser" <tobias at grosser.es> wrote:
>>
>> Yes, I have changed the original code to the form you suggested:
>> for (i
>>     for (j
>>         ...
>>           x=1
>
> Sorry, I meant
>   x[0] +=
>

It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.

First, if the nested loop only contains a statement "X[0]+=1", like this:

// SingleAcc.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int a, b, c, d, e, f, x = 0;
  int X[10];

  for (a=0; a<n; a++)
    for (b=0; b<n; b++)
      for (c=0; c<n; c++)
        for (d=0; d<n; d++)
          for (e=0; e<n; e++)
            for (f=0; f<n; f++)
              X[0]+=1; // could be X[0]+=1, X[0]=1 or x=1

  printf("%d\n", X[0]); // could be X[0] or x
  return 0;
}

then the scop result would be:

$ clang -O0 -emit-llvm -S SingleAcc.c -o SingleAcc.ll
# Run everything up to the offending pass:
$ polly-opt SingleAcc.ll -targetlibinfo -no-aa ... -polly-indvars -o SingleAcc.preopt.ll -S
# Print the scop info:
$ polly-opt -basicaa -polly-prepare -polly-scops SingleAcc.preopt.ll -analyze

(Note: you can download SingleAcc.preopt.ll from http://llvm.org/bugs/attachment.cgi?id=11059)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main':
    Context:
    [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
        Stmt_for_body16
            Domain : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
            ReadAccess := [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
    }

Similarly, the scop result for "X[0]=1" would be:

$ polly-opt -basicaa -polly-prepare -polly-scops SingleAssign.preopt.ll -analyze

(Note: you can download SingleAssign.preopt.ll from http://llvm.org/bugs/attachment.cgi?id=11060)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main':
    Context:
    [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
        Stmt_for_body16
            Domain : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
    }

Unfortunately, the scop for "x=1" is much more complicated:

$ polly-opt -basicaa -polly-prepare -polly-scops nestedloop.preopt.ll -analyze

(Note: you can download nestedloop.preopt.ll from http://llvm.org/bugs/attachment.cgi?id=11045)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end31' in function 'main':
    Context:
    [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
        Stmt_for_cond2_preheader
            Domain : [cond.reload] -> { Stmt_for_cond2_preheader[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
            Scattering : [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> scattering[0, i0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reg2mem[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reload_s2a[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_1_lcssa_reg2mem[0] };
        Stmt_for_cond5_preheader_lr_ph
            Domain : [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] : i0 >= 0 and i0 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering : [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> scattering[0, i0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_018_reload_s2a[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_114_reg2mem[0] };
        Stmt_for_cond5_preheader
            Domain : [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering : [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> scattering[0, i0, 2, i1, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reg2mem[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reload_s2a[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_2_lcssa_reg2mem[0] };
        [...]
        Stmt_for_inc29
            Domain : [cond.reload] -> { Stmt_for_inc29[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
            Scattering : [cond.reload] -> { Stmt_for_inc29[i0] -> scattering[0, i0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_reg2mem[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_018_reg2mem[0] };
        Stmt_for_cond_for_end31_crit_edge
            Domain : [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] };
            Scattering : [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> scattering[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess := [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] };
            MustWriteAccess := [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_0_lcssa_reg2mem[0] };
    }

The compile time of polly-dependence analysis for the three cases would be:

loop with "X[0]+=1": 0.130s
loop with "X[0] =1": 0.053s
loop with "x=1":     1.037s

>> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.
>>
>>> Meaning one alloc instruction and a single load and store in the
>>> innermost loop, without any non-induction variable PHI nodes.
>>>
>>> When implementing this in C, LLVM may introduce again PHI nodes. So it
>>> may possibly be necessary to write this directly in LLVM-IR.
>
> This is the important one. I wonder what the performance of the
> dependence analysis is for increasing loop depth, if we have only a
> single statement, that just reads and writes from the very same memory
> location, but that does not have all the mess from the
> non-induction-variable PHI nodes.

If we change the "x=1" to "X[0]+=1", then there would be only one "store" instruction and one "load" instruction in the loop, no matter how deep the loop nest is. In such a case, the compile time would still increase exponentially as the depth of the loop nest increases:

nest-2:  0.063s
nest-4:  0.110s
nest-6:  0.221s
nest-8:  0.465s
nest-10: 0.893s
nest-12: 1.710s

Cheers,
Star Tan
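The nest-2 through nest-12 sources themselves are not shown in the thread, so here is a minimal sketch of how such depth variants could be produced from one template (an assumption for illustration; the file and macro names are hypothetical, not the inputs actually used for the timings). Every depth keeps exactly one load and one store in the innermost body, matching the "X[0]+=1" case measured above:

// depth-template.c (hypothetical name)
#include <stdio.h>
#include <stdlib.h>

/* One loop level; expands to a plain counted loop over [0, n). */
#define LEVEL(var) for (int var = 0; var < n; var++)

int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int X[10] = {0};

  /* nest-6 shown; remove or duplicate LEVEL() lines for nest-2 ... nest-12 */
  LEVEL(a)
  LEVEL(b)
  LEVEL(c)
  LEVEL(d)
  LEVEL(e)
  LEVEL(f)
    X[0] += 1;   /* single array access: one load and one store per iteration */

  printf("%d\n", X[0]);
  return 0;
}

Each variant can then be run through the same clang and polly-opt commands shown earlier to reproduce the timing series.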
Tobias Grosser
2013-Aug-17 15:22 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
On 08/17/2013 12:08 AM, Star Tan wrote:
> At 2013-08-16 22:32:30, "Tobias Grosser" <tobias at grosser.es> wrote:
>>>
>>> Yes, I have changed the original code to the form you suggested:
>>> for (i
>>>     for (j
>>>         ...
>>>           x=1
>>
>> Sorry, I meant
>>   x[0] +=
>>
>
> It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.

Indeed, this is interesting.

> First, if the nested loop only contains a statement "X[0]+=1", like this:
[...]
> The compile time of polly-dependence analysis for the three cases would be:
> loop with "X[0]+=1": 0.130s
> loop with "X[0] =1": 0.053s
> loop with "x=1":     1.037s

OK. So it seems that in the case where reductions are converted to PHI nodes, we see a lot slower compiles. This is possibly due to the increased number of statements as well as the increased number of data locations modelled.

I see two directions we can approach this problem with:

1) Optimize isl to work faster

We can see if there are some optimization opportunities in isl to make the code faster.

2) Generate a less complicated SCoP

If possible, detect that the SCoP can be modelled with fewer statements and fewer memory locations, and do so.

I did not look into either of the two, so those are basically open questions (that may not be easy to solve). If you are interested, you may want to take a look.

>>> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.
>>>
>>>> Meaning one alloc instruction and a single load and store in the
>>>> innermost loop, without any non-induction variable PHI nodes.
>>>>
>>>> When implementing this in C, LLVM may introduce again PHI nodes. So it
>>>> may possibly be necessary to write this directly in LLVM-IR.
>>
>> This is the important one. I wonder what the performance of the
>> dependence analysis is for increasing loop depth, if we have only a
>> single statement, that just reads and writes from the very same memory
>> location, but that does not have all the mess from the
>> non-induction-variable PHI nodes.
>
> If we change the "x=1" to "X[0]+=1", then there would be only one "store" instruction and one "load" instruction in the loop, no matter how deep the loop nest is. In such a case, the compile time would still increase exponentially as the depth of the loop nest increases:
>
> nest-2:  0.063s
> nest-4:  0.110s
> nest-6:  0.221s
> nest-8:  0.465s
> nest-10: 0.893s
> nest-12: 1.710s

OK, so we still have exponential growth with the loop depth, but in general we seem to be a lot faster. It is not surprising that the complexity grows exponentially with the loop depth. However, seeing that even for a loop nest of depth 12 we only spend about two seconds, this seems to be OK.

Cheers,
Tobias
Star Tan
2013-Aug-19 16:43 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
At 2013-08-17 23:22:32, "Tobias Grosser" <tobias at grosser.es> wrote:
> On 08/17/2013 12:08 AM, Star Tan wrote:
>> At 2013-08-16 22:32:30, "Tobias Grosser" <tobias at grosser.es> wrote:
>>>>
>>>> Yes, I have changed the original code to the form you suggested:
>>>> for (i
>>>>     for (j
>>>>         ...
>>>>           x=1
>>>
>>> Sorry, I meant
>>>   x[0] +=
>>>
>>
>> It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.
>
> Indeed, this is interesting.

Experiments show that this is because their dependence results are different.

1) When the loop statement is an array operation such as "X[0]+=1", clang generates a store instruction and a load instruction within the same basic block, like this:

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++) {
      load  %a, memory(A,0)
      add   %a, 1
      store %a, memory(A,0)
    }

Such a data dependence can easily be analysed by the polly-dependence pass.

2) When the loop statement is a scalar operation such as "x+=1", clang generates a series of scalar operations and PHI nodes like this:

x0=0
for (i=0; i<n; i++) {
  x1 = phi(x0, x6)
  for (j=0; j<n; j++) {
    x2 = phi(x1, x5)
    for (k=0; k<n; k++) {
      x3 = phi(x2, x4)
      x4 = x3 + 1
    }
    x5 = phi(x2, x4)
  }
  x6 = phi(x1, x5)
}
x7 = phi(x0, x6)

As we can see, the data dependences become very complex as the depth of the loop nest increases, which slows down the polly-dependence analysis. Unfortunately, I have not found a way to simplify such data dependences. Do you have any idea?

>> First, if the nested loop only contains a statement "X[0]+=1", like this:
> [...]
>
>> The compile time of polly-dependence analysis for the three cases would be:
>> loop with "X[0]+=1": 0.130s
>> loop with "X[0] =1": 0.053s
>> loop with "x=1":     1.037s
>
> OK. So it seems that in the case where reductions are converted to PHI nodes, we see
> a lot slower compiles. This is possibly due to the increased number of
> statements as well as the increased number of data locations modelled.
>
> I see two directions we can approach this problem with:
>
> 1) Optimize isl to work faster
>
> We can see if there are some optimization opportunities in isl to make
> the code faster.

As shown in the example above, the key problem is the complex data dependences of the application, so I do not think it is time to turn to isl yet.

> 2) Generate a less complicated SCoP
>
> If possible, detect that the SCoP can be modelled with fewer statements
> and fewer memory locations, and do so.

Do you mean we should detect smaller SCoPs, or represent the same SCoP with fewer statements and fewer memory locations? For the sake of execution performance, we should detect large SCoPs to enable more Polly optimizations.

> I did not look into either of the two, so those are basically open
> questions (that may not be easy to solve). If you are interested, you
> may want to take a look.

Sure, I would like to investigate whether they can indeed reduce the compile time of Polly.

Thanks for your suggestions!
Star Tan
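One source-level workaround for the question raised above is a minimal sketch of the rewrite the measurements already suggest (an assumption for illustration; Polly is not doing this automatically here, and the file name is hypothetical): demoting the scalar accumulator to a one-element array removes the non-induction-variable PHI nodes, so the loop body is modelled with a single load and a single store, as in the fast "X[0]+=1" case.

// demote-scalar.c (hypothetical name)
#include <stdio.h>

int main(void) {
  int n = 46;
  int acc[1] = {0};              /* stands in for the scalar "x" */
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        acc[0] += 1;             /* one load and one store, no PHI chain */
  printf("%d\n", acc[0]);
  return 0;
}

This is effectively the same transformation measured above (1.037s for the "x=1" form versus 0.130s for "X[0]+=1"), just applied by hand at the source level.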