Star Tan
2013-Aug-17 07:08 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
At 2013-08-16 22:32:30, "Tobias Grosser" <tobias at grosser.es> wrote:
>>
>> Yes, I have changed the original code to the form you suggested:
>> for (i
>>   for (j
>>     ...
>>       x=1
>
> Sorry, I meant
>   x[0] +=

It is interesting that Polly runs much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.

First, if the nested loop only contains the statement "X[0]+=1", like this:

// SingleAcc.c
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int a, b, c, d, e, f, x=0;
  int X[10];
  for (a=0; a<n; a++)
    for (b=0; b<n; b++)
      for (c=0; c<n; c++)
        for (d=0; d<n; d++)
          for (e=0; e<n; e++)
            for (f=0; f<n; f++)
              X[0]+=1; // could be X[0]+=1, X[0]=1 or x=1
  printf("%d\n", X[0]); // could be X[0] or x
  return 0;
}

then the SCoP result is:

$ clang -O0 -emit-llvm -S SingleAcc.c -o SingleAcc.ll
# Run everything up to the offending pass
$ polly-opt SingleAcc.ll -targetlibinfo -no-aa ... -polly-indvars -o SingleAcc.preopt.ll -S
# Print the scop info:
$ polly-opt -basicaa -polly-prepare -polly-scops SingleAcc.preopt.ll -analyze

(Note: you can download SingleAcc.preopt.ll from http://llvm.org/bugs/attachment.cgi?id=11059)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main':
    Context:
    [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
        Stmt_for_body16
            Domain :
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
            ReadAccess :=
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
    }

Similarly, the SCoP result for "X[0]=1" is:

$ polly-opt -basicaa -polly-prepare -polly-scops SingleAssign.preopt.ll -analyze

(Note: you can download SingleAssign.preopt.ll from http://llvm.org/bugs/attachment.cgi?id=11060)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main':
    Context:
    [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
        Stmt_for_body16
            Domain :
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] };
    }

Unfortunately, the SCoP for "x=1" is much more complicated:

$ polly-opt -basicaa -polly-prepare -polly-scops nestedloop.preopt.ll -analyze

(Note: you can download nestedloop.preopt.ll from http://llvm.org/bugs/attachment.cgi?id=11045)

Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end31' in function 'main':
    Context:
    [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 }
    p0: %cond.reload
    Statements {
        Stmt_for_cond2_preheader
            Domain :
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
            Scattering :
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> scattering[0, i0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess :=
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reg2mem[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reload_s2a[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_1_lcssa_reg2mem[0] };
        Stmt_for_cond5_preheader_lr_ph
            Domain :
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] : i0 >= 0 and i0 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> scattering[0, i0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess :=
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_018_reload_s2a[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_114_reg2mem[0] };
        Stmt_for_cond5_preheader
            Domain :
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and cond.reload >= 1 };
            Scattering :
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> scattering[0, i0, 2, i1, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess :=
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reg2mem[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reload_s2a[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_2_lcssa_reg2mem[0] };
        [...]
        Stmt_for_inc29
            Domain :
                [cond.reload] -> { Stmt_for_inc29[i0] : i0 >= 0 and i0 <= -1 + cond.reload };
            Scattering :
                [cond.reload] -> { Stmt_for_inc29[i0] -> scattering[0, i0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess :=
                [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_reg2mem[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_018_reg2mem[0] };
        Stmt_for_cond_for_end31_crit_edge
            Domain :
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] };
            Scattering :
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> scattering[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] };
            ReadAccess :=
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] };
            MustWriteAccess :=
                [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_0_lcssa_reg2mem[0] };
    }

The compile time of the polly-dependence analysis for the three cases would be:

loop with "X[0]+=1": 0.130s
loop with "X[0]=1":  0.053s
loop with "x=1":     1.037s

>> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.
>>
>>> Meaning one alloc instruction and a single load and store in the
>>> innermost loop, without any non-induction variable PHI nodes.
>>>
>>> When implementing this in C, LLVM may introduce again PHI nodes. So it
>>> may possibly be necessary to write this directly in LLVM-IR.
>
> This is the important one.
> I wonder what the performance of the dependence analysis is for
> increasing loop depth, if we have only a single statement, that just
> reads and writes from the very same memory location, but that does not
> have all the mess from the non-induction-variable PHI nodes.

If we change the "x=1" to "X[0]+=1", there is only one "store" instruction and one "load" instruction in the loop, no matter how deep the loop nest is. In that case, the compile time still grows exponentially with the depth of the loop nest:

nest-2:  0.063s
nest-4:  0.110s
nest-6:  0.221s
nest-8:  0.465s
nest-10: 0.893s
nest-12: 1.710s

Cheers,
Star Tan
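For reference, the nest-2 ... nest-12 kernels above are presumably the same benchmark as SingleAcc.c with only the loop depth varied. A minimal, self-contained sketch of the nest-2 member (with an explicit X[0] = 0 initialization added here so the accumulator starts from a defined value) would be:

// nest-2.c -- sketch of the two-level member of the benchmark family
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int a, b;
  int X[10];
  X[0] = 0;                /* added so the sketch is self-contained */
  for (a=0; a<n; a++)
    for (b=0; b<n; b++)
      X[0]+=1;             /* exactly one load and one store per iteration */
  printf("%d\n", X[0]);
  return 0;
}

The deeper variants simply wrap more identical loop levels around the X[0]+=1 statement.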
Tobias Grosser
2013-Aug-17 15:22 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
On 08/17/2013 12:08 AM, Star Tan wrote:
> At 2013-08-16 22:32:30, "Tobias Grosser" <tobias at grosser.es> wrote:
>>>
>>> Yes, I have changed the original code to the form you suggested:
>>> for (i
>>>   for (j
>>>     ...
>>>       x=1
>>
>> Sorry, I meant
>>   x[0] +=
>
> It is interesting that Polly runs much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.

Indeed, this is interesting.

> First, if the nested loop only contains the statement "X[0]+=1", like this:
[...]
> The compile time of the polly-dependence analysis for the three cases would be:
> loop with "X[0]+=1": 0.130s
> loop with "X[0]=1":  0.053s
> loop with "x=1":     1.037s

OK. So it seems that in case reductions are converted to PHI nodes, we see a lot slower compiles. This is possibly due to the increased number of statements as well as the increased number of data locations modelled.

I see two directions from which we can approach this problem:

1) Optimize isl to work faster

We can see if there are optimization opportunities in isl to make the code faster.

2) Generate a less complicated SCoP

If possible, detect that the SCoP can be modelled with fewer statements and fewer memory locations, and do so.

I did not look into either of the two, so those are basically open questions (that may not be easy to solve). If you are interested, you may want to take a look.

>>> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.
>>>
>>>> Meaning one alloc instruction and a single load and store in the
>>>> innermost loop, without any non-induction variable PHI nodes.
>>>>
>>>> When implementing this in C, LLVM may introduce again PHI nodes. So it
>>>> may possibly be necessary to write this directly in LLVM-IR.
>>
>> This is the important one. I wonder what the performance of the
>> dependence analysis is for increasing loop depth, if we have only a
>> single statement, that just reads and writes from the very same memory
>> location, but that does not have all the mess from the
>> non-induction-variable PHI nodes.
>
> If we change the "x=1" to "X[0]+=1", there is only one "store" instruction and one "load" instruction in the loop, no matter how deep the loop nest is. In that case, the compile time still grows exponentially with the depth of the loop nest:
>
> nest-2:  0.063s
> nest-4:  0.110s
> nest-6:  0.221s
> nest-8:  0.465s
> nest-10: 0.893s
> nest-12: 1.710s

OK, so we still have exponential growth with the loop depth, but in general we seem to be a lot faster. It is not surprising that the complexity grows exponentially with the loop depth. However, seeing that we only spend about two seconds even for a 12-level-deep loop nest seems OK.

Cheers,
Tobias
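To make direction 2) concrete for this particular kernel: since every loop bound is n, the whole six-level "x=1" nest is semantically just one guarded assignment to one location. The C program below is only an illustration of what "fewer statements and fewer memory locations" could mean here; it is not something Polly currently derives.

/* The ideally simplified model of the "x=1" kernel: a single guarded
   statement touching a single location. Illustration of the goal only. */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int x = 0;
  if (n > 0)   /* the innermost body runs iff all six bounds are positive */
    x = 1;
  printf("%d\n", x);
  return 0;
}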
Star Tan
2013-Aug-19 16:43 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
At 2013-08-17 23:22:32, "Tobias Grosser" <tobias at grosser.es> wrote:
> On 08/17/2013 12:08 AM, Star Tan wrote:
>> At 2013-08-16 22:32:30, "Tobias Grosser" <tobias at grosser.es> wrote:
>>>>
>>>> Yes, I have changed the original code to the form you suggested:
>>>> for (i
>>>>   for (j
>>>>     ...
>>>>       x=1
>>>
>>> Sorry, I meant
>>>   x[0] +=
>>
>> It is interesting that Polly runs much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.
>
> Indeed, this is interesting.

Experiments show that this is because their dependence results are different.

1) When the loop statement is an array operation such as "X[0]+=1", clang generates a store instruction and a load instruction within the same basic block, like this:

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++) {
      load  %a, memory(A,0)
      add   %a, 1
      store %a, memory(A,0)
    }

Such a data dependence can easily be analysed by the polly-dependence pass.

2) When the loop statement is a scalar operation such as "x+=1", clang generates a series of scalar operations and PHI nodes like this:

x0=0
for (i=0; i<n; i++) {
  x1 = phi(x0, x6)
  for (j=0; j<n; j++) {
    x2 = phi(x1, x5)
    for (k=0; k<n; k++) {
      x3 = phi(x2, x4)
      x4 = x3 + 1
    }
    x5 = phi(x2, x4)
  }
  x6 = phi(x1, x5)
}
x7 = phi(x0, x6)

As we can see, the data dependences become much more complex as the depth of the loop nest increases, which slows down the polly-dependence analysis. Unfortunately, I have not found a way to simplify such data dependences. Do you have any idea?

>> First, if the nested loop only contains the statement "X[0]+=1", like this:
> [...]
>
>> The compile time of the polly-dependence analysis for the three cases would be:
>> loop with "X[0]+=1": 0.130s
>> loop with "X[0]=1":  0.053s
>> loop with "x=1":     1.037s
>
> OK. So it seems that in case reductions are converted to PHI nodes, we see a
> lot slower compiles. This is possibly due to the increased number of
> statements as well as the increased number of data locations modelled.
>
> I see two directions from which we can approach this problem:
>
> 1) Optimize isl to work faster
>
> We can see if there are optimization opportunities in isl to make
> the code faster.

As shown in the above example, the key problem is the complex data dependences of the application, so I do not think it is time to turn to isl yet.

> 2) Generate a less complicated SCoP
>
> If possible, detect that the SCoP can be modelled with fewer statements
> and fewer memory locations, and do so.

Do you mean we should detect smaller SCoPs, or represent the same SCoP with fewer statements and fewer memory locations? For the sake of execution performance, we should detect large SCoPs to enable more Polly optimizations.

> I did not look into either of the two, so those are basically open
> questions (that may not be easy to solve). If you are interested, you
> may want to take a look.

Sure, I would like to investigate whether they are indeed useful for reducing the compile time of Polly.

Thanks for your suggestions!
Star Tan
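To make the second case a bit more concrete: the C program below is a rough, two-level picture of the shape that -polly-prepare gives the "x=1" nest once every PHI node has been demoted to its own one-element memory cell. The cell names are invented to echo the MemRef_*_reg2mem / _s2a names in the SCoP dump earlier in the thread; the real pass works on LLVM-IR, so this is only an illustration.

#include <stdio.h>
int main(void) {
  int n = 46;
  int i, j;
  /* one single-element memory cell per former PHI node / demoted scalar */
  int x_0[1], x_1_reload[1], x_2[1], x_1_lcssa[1], x_0_lcssa[1];

  x_0[0] = 0;
  for (i = 0; i < n; i++) {
    x_1_reload[0] = x_0[0];        /* outer-loop header: reload + copy     */
    x_1_lcssa[0]  = x_0[0];
    x_2[0] = x_1_reload[0];        /* inner-loop preheader copy            */
    for (j = 0; j < n; j++)
      x_2[0] = 1;                  /* the actual "x = 1"                   */
    x_1_lcssa[0] = x_2[0];         /* inner-loop exit copy                 */
    x_0[0] = x_1_lcssa[0];         /* outer-loop latch copy                */
  }
  x_0_lcssa[0] = x_1_lcssa[0];     /* value leaving the outer loop         */
  printf("%d\n", x_0_lcssa[0]);    /* prints 1, as the original does       */
  return 0;
}

Every additional loop level adds a few more such copy statements, each reading and writing its own cell, which is where the extra statements and access relations in the SCoP dump, and ultimately the extra work for the dependence analysis, come from.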