Star Tan
2013-Aug-16 09:42 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
At 2013-08-16 12:44:02,"Tobias Grosser" <tobias at grosser.es> wrote:>Hi, > >I tried to reproduce your findings, but could not do so.Sorry, I did not put all code in my previous email because the code seems a little too long and complicated. You can refer to the detailed C code and LLVM IR code on http://llvm.org/bugs/show_bug.cgi?id=16843 There are four attachments for our C code and LLVM IR code: nestedloop.c (http://llvm.org/bugs/attachment.cgi?id=11043): the simplified C code. nestedloop.ll (http://llvm.org/bugs/attachment.cgi?id=11044): the basic LLVM IR code. nestedloop.preopt.ll (http://llvm.org/bugs/attachment.cgi?id=11045): the preprocessed LLVM IR code. nestedloop.prepare.ll (http://llvm.org/bugs/attachment.cgi?id=11046): the LLVM IR code after running the polly-prepare pass.>> I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in LLVM-testsuite. (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28). Preliminary results show that such compile-time overhead is resulted by the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces a large number of store instructions, and thus significantly complicating the polly-dependence pass. >> >> Let me show the results with a tiny code example: >> >> int main(int argc, char *argv[]) { >> int n = ((argc == 2) ? atoi(argv[1]) : 46); >> int a, b, x=0; >> for (a=0; a<n; a++) >> for (b=0; b<n; b++) >> x++; >> printf("%d\n", x); >> return(0); >> } > >This one misses some includes. Also, it is more convenient if you attach >the actual C file you include.Yes, it should includes two header file <stdio.h> and <stdlib.h>. You can refer to the source code on http://llvm.org/bugs/attachment.cgi?id=11043 Compared with the original nestedloop benchmark, I changed the accumulation operation to a single assignment as follows: for (a=0; a<n; a++) for (b=0; b<n; b++) for (c=0; c<n; c++) for (d=0; d<n; d++) for (e=0; e<n; e++) for (f=0; f<n; f++) x=1; //original is x++;>This test case misses the target data. If you would attach the original >file, it would be easier to reproduce.I see, you can refer to different LLVM IR code on http://llvm.org/bugs/show_bug.cgi?id=16843.>> Such code is very simple and there is no memory instruction at all, so the polly-dependence pass runs very fast on this code. Unfortunately, when we run "opt -load LLVMPolly.so" for this basic LLVM IR code, the polly-prepare pass would introduce a large number of store instructions like this: > >Which passes and commands have you actually run? The command "opt -load >LLVMPolly.so" does not run the -polly-prepare pass.Oops, I actually meant "opt -load LLVMPolly.so -polly -O3" here.>Sorry for being picky here. Unfortunately, there are too many ways you >could have run such passes, that I am afraid I would not guess the right >one. And if we are running different experiments, it is difficult to >discuss them.Yes, you are right. I should explain the detailed experimental steps.>The way I propose to run the passes is as follows: > [...] >$ polly-opt -basicaa -polly-prepare -polly-scops out.preopt.ll -analyzeScop info varies in the two cases (with/without -polly-scops). No valid scops are detected if we run polly without -polly-prepare: $ polly-opt -basicaa -polly-scops out.preopt.ll -analyze However, a very large region "for.cond2.preheader => for.end31" is detected as valid region if we run Polly with -polly-prepare: $ polly-opt -basicaa -polly-prepare -polly-scops nestedloop.preopt.ll -analyze Note that the region "for.cond2.preheader => for.end31" is reported as invalid as the exit BasicBlock has PHI nodes, if we run Polly without -polly-prepare.># Time with and without preoptimization >~$ time polly-opt -basicaa -polly-prepare -polly-dependences >out.preopt.ll -disable-output >[..] > >Within -O3 the overhead is still far away from the 60 slowdown we have >seen on the other test case. A 2x slowdown for a kernel that we fully >analyse and code generate is not optimal, but probably not an issue. The >question is if the increasing depth of the loop nest can yield this 60x >slowdown.This is because the depth of the loop nest is only two. If we increase the depth of the loop nest, then the compile-time would significantly increase. I have posted results for the original code which contains 6 nested loops on http://llvm.org/bugs/show_bug.cgi?id=16843#c5 In fact, the compile time is exponentially growing as the depth of loop nest increases: 2-nested: ~2X 4-nested: ~10X 6-nested: ~50X 8-nested: ~200X 10-nested: ~700X 12-nested: ~2000X>> These store instructions significantly complicate the "polly-dependence" pass, and thus leading to high compile-time overhead. > >What do you mean by "complicated the 'polly-dependence' pass"? Without >-polly-prepare the scop that is detected is a lot smaller. Hence, the >two scops are not really comparable.Yes, no valid scop is detected at all without -polly-prepare.>The point of the -polly-prepare pass is to transform some scalar >dependences into memory dependences. Introducing explicit load/store >instructions simplifies the later passes as scalar dependences do not >be handled specially. However, introducing store instructions does (or >should) not make the actual dependence analysis problem more >complicated. If Polly would be extended to detect SCoPs with scalar >dependences those scalar dependences should be the very same once that >are now created for the memory dependences introduced by the >-polly-prepare pass.At first, I thought -polly-dependence mainly do dependence analysis for memory instructions, so I dump all intermediate LLVM IR code for each opt passes using "-print-after-all". Then I find all store instructions are introduced by the -polly-prepare pass. That is why I asked about the -polly-prepare pass. Now I see, -polly-prepare just transforms some scalar dependences into memory dependences and it should not introduce new dependences. So, -polly-prepare may be not the key problem ~>> I have noticed that such memory instructions are finally simplified to scalar operations by the SROA pass, so one possible way to reduce such compile-time overhead is to move the SROA pass ahead of polly-dependence analysis. >> >> Can anyone give me some hints that why the polly-prepare pass introduces such memory instructions? Is it possible to move the SROA pass ahead of polly-dependence analysis? > >Moving the SROA pass is not the solution. It would just prevent Polly >from detecting some SCoPs.I see. We must find other ways.>I think there are two angles we can have a look at: > >1) Does -polly-prepare introduce unneeded dependences? > >This means, could we model the scalar dependences more efficiently than >how they are modeled by the array accesses introduced by -polly-prepare.This is really necessary. In fact, there should be no dependence in our code example as shown in our previous example. There is only a single and constant assignment "a=1" in the nested loop. Unfortunately, Polly still reports very complex data dependences.>2) What triggers the 60x slowdown in dependence analysis > >Is this due to the increasing loop nest depth? Or the increasing number >of allocations introduced by -polly-prepare? >I have found that the compile time is exponentially growing as the depth of loop nest increases. However, increasing the loop nest depth would also increases the number of allocations and store instructions.>An interesting experiment would be to see if the dependence analysis >runs faster on a loop that uses a single element array to implement the >reduction: > >for (i > for (j > ... > X[0] = ...Yes, I have changed the original code to the form you suggested: for (i for (j ... x=1 There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions.>Meaning one alloc instruction and a single load and store in the >innermost loop, without any non-induction variable PHI nodes. > >When implementing this in C, LLVM may introduce again PHI nodes. So it >may possibly be necessary to write this directly in LLVM-IR. > >This became a long mail. Hope it gives you some ideas how to proceed. >Thank you so much, Tobias! I sincerely appreciate your helpful suggestions! Cheers, Star Tan -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130816/9c27a192/attachment.html>
Tobias Grosser
2013-Aug-16 14:32 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
On 08/16/2013 02:42 AM, Star Tan wrote:> At 2013-08-16 12:44:02,"Tobias Grosser" <tobias at grosser.es> wrote: >> Hi, >> >> I tried to reproduce your findings, but could not do so. > > > Sorry, I did not put all code in my previous email because the code seems a little too long and complicated. > You can refer to the detailed C code and LLVM IR code on http://llvm.org/bugs/show_bug.cgi?id=16843 > There are four attachments for our C code and LLVM IR code: > > > nestedloop.c (http://llvm.org/bugs/attachment.cgi?id=11043): the simplified C code. > nestedloop.ll (http://llvm.org/bugs/attachment.cgi?id=11044): the basic LLVM IR code. > nestedloop.preopt.ll (http://llvm.org/bugs/attachment.cgi?id=11045): the preprocessed LLVM IR code. > nestedloop.prepare.ll (http://llvm.org/bugs/attachment.cgi?id=11046): the LLVM IR code after running the polly-prepare pass.Thanks, this works for me.>> # Time with and without preoptimization >> ~$ time polly-opt -basicaa -polly-prepare -polly-dependences >> out.preopt.ll -disable-output >> [..] >> >> Within -O3 the overhead is still far away from the 60 slowdown we have >> seen on the other test case. A 2x slowdown for a kernel that we fully >> analyse and code generate is not optimal, but probably not an issue. The >> question is if the increasing depth of the loop nest can yield this 60x >> slowdown. > > > This is because the depth of the loop nest is only two. If we increase the depth of the loop nest, then the compile-time would significantly increase. I have posted results for the original code which contains 6 nested loops on http://llvm.org/bugs/show_bug.cgi?id=16843#c5 > > > In fact, the compile time is exponentially growing as the depth of loop nest increases: > 2-nested: ~2X > 4-nested: ~10X > 6-nested: ~50X > 8-nested: ~200X > 10-nested: ~700X > 12-nested: ~2000XPerfect. This is what I was expecting. Let's see what we can do about it.>> I think there are two angles we can have a look at: >> >> 1) Does -polly-prepare introduce unneeded dependences? >> >> This means, could we model the scalar dependences more efficiently than >> how they are modeled by the array accesses introduced by -polly-prepare. > > > This is really necessary. In fact, there should be no dependence in our code example as shown in our previous example. There is only a single and constant assignment "a=1" in the nested loop. Unfortunately, Polly still reports very complex data dependences.Sure, LLVM may even be able to replace this entire loop by a closed form expression. However, it is also a good test case to challenge Polly. Let's see what we can do.>> 2) What triggers the 60x slowdown in dependence analysis >> >> Is this due to the increasing loop nest depth? Or the increasing number >> of allocations introduced by -polly-prepare? >> > I have found that the compile time is exponentially growing as the depth of loop nest increases. However, increasing the loop nest depth would also increases the number of allocations and store instructions. > > >> An interesting experiment would be to see if the dependence analysis >> runs faster on a loop that uses a single element array to implement the >> reduction: >> >> for (i >> for (j >> ... >> X[0] = ... > > > Yes, I have changed the original code to the form you suggested: > for (i > for (j > ... > x=1Sorry, I meant x[0] +> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions. > > >> Meaning one alloc instruction and a single load and store in the >> innermost loop, without any non-induction variable PHI nodes. >> >> When implementing this in C, LLVM may introduce again PHI nodes. So it >> may possibly be necessary to write this directly in LLVM-IR.This is the important one. I wonder what the performance of the dependence analysis is for increasing loop depth, if we have only a single statement, that just reads and writes from the very same memory location, but that does not have all the mess from the non-induction-variable PHI nodes. Tobi
Star Tan
2013-Aug-17 07:08 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
At 2013-08-16 22:32:30,"Tobias Grosser" <tobias at grosser.es> wrote:>> >> Yes, I have changed the original code to the form you suggested: >> for (i >> for (j >> ... >> x=1 > >Sorry, I meant > x[0] +>It is interesting that Polly would run much faster if we change the "x=0" to "X[0]=0" or "X[0]+=0 in the nested loops. First, if the nested loop only contains a statement "X[0]+=0", like this: // SingleAcc.c #include <stdio.h> #include <stdlib.h> int main(int argc, char *argv[]) { int n = ((argc == 2) ? atoi(argv[1]) : 46); int a, b, c, d, e, f, x=0; int X[10]; for (a=0; a<n; a++) for (b=0; b<n; b++) for (c=0; c<n; c++) for (d=0; d<n; d++) for (e=0; e<n; e++) for (f=0; f<n; f++) X[0]+=1; // could be X[0]+=1, X[0]=1 or x=1 printf("%d\n", X[0]); // could be X[0] or x return 0; } Then the scops results would be: $ clang -O0 -emit-llvm -S SingleAcc.c -o SingleAcc.ll # Run everything up to the offending pass $ polly-opt SingleAcc.ll -targetlibinfo -no-aa ... -polly-indvars -o SingleAcc.preopt.ll -S # Print the scop info: $ polly-opt -basicaa -polly-prepare -polly-scops SingleAcc.preopt.ll -analyze (*Note: you can download SingleAcc.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11059) Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main': Context: [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 } p0: %cond.reload Statements { Stmt_for_body16 Domain : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 }; Scattering : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] }; ReadAccess := [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] }; } Similarly, the scops result for "X[0]=0" would be: $ polly-opt -basicaa -polly-prepare -polly-scops SingleAssign.preopt.ll -analyze (*Note: you can download SingleAssign.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11060) Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end32' in function 'main': Context: [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 } p0: %cond.reload Statements { Stmt_for_body16 Domain : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and i2 >= 0 and i2 <= -1 + cond.reload and i3 >= 0 and i3 <= -1 + cond.reload and i4 >= 0 and i4 <= -1 + cond.reload and i5 >= 0 and i5 <= -1 + cond.reload and cond.reload >= 1 }; Scattering : [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> scattering[0, i0, 0, i1, 0, i2, 0, i3, 0, i4, 0, i5, 0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_body16[i0, i1, i2, i3, i4, i5] -> MemRef_X[0] }; } Unfortunately, the scops for "x=1" is much more complicated as follows: $ polly-opt -basicaa -polly-prepare -polly-scops nestedloop.preopt.ll -analyze (*Note: you can download nestedloop.preopt.ll on http://llvm.org/bugs/attachment.cgi?id=11045) Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond2.preheader => for.end31' in function 'main': Context: [cond.reload] -> { : cond.reload >= -2147483648 and cond.reload <= 2147483647 } p0: %cond.reload Statements { Stmt_for_cond2_preheader Domain : [cond.reload] -> { Stmt_for_cond2_preheader[i0] : i0 >= 0 and i0 <= -1 + cond.reload }; Scattering : [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> scattering[0, i0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }; ReadAccess := [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reg2mem[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_018_reload_s2a[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_cond2_preheader[i0] -> MemRef_x_1_lcssa_reg2mem[0] }; Stmt_for_cond5_preheader_lr_ph Domain : [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] : i0 >= 0 and i0 <= -1 + cond.reload and cond.reload >= 1 }; Scattering : [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> scattering[0, i0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }; ReadAccess := [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_018_reload_s2a[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_cond5_preheader_lr_ph[i0] -> MemRef_x_114_reg2mem[0] }; Stmt_for_cond5_preheader Domain : [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] : i0 >= 0 and i0 <= -1 + cond.reload and i1 >= 0 and i1 <= -1 + cond.reload and cond.reload >= 1 }; Scattering : [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> scattering[0, i0, 2, i1, 0, 0, 0, 0, 0, 0, 0, 0, 0] }; ReadAccess := [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reg2mem[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_114_reload_s2a[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_cond5_preheader[i0, i1] -> MemRef_x_2_lcssa_reg2mem[0] }; [...] Stmt_for_inc29 Domain : [cond.reload] -> { Stmt_for_inc29[i0] : i0 >= 0 and i0 <= -1 + cond.reload }; Scattering : [cond.reload] -> { Stmt_for_inc29[i0] -> scattering[0, i0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }; ReadAccess := [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_reg2mem[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_inc29[i0] -> MemRef_x_018_reg2mem[0] }; Stmt_for_cond_for_end31_crit_edge Domain : [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] }; Scattering : [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> scattering[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }; ReadAccess := [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_1_lcssa_lcssa_reg2mem[0] }; MustWriteAccess := [cond.reload] -> { Stmt_for_cond_for_end31_crit_edge[] -> MemRef_x_0_lcssa_reg2mem[0] }; } The compile time of polly-dependence analysis for the three cases would be: loop with "X[0]+=1":0.130s loop with "X[0] =1": 0.053s loop with "x=1":1.037s>> There is no data dependence at all. However, it seems that -polly-prepare still introduces a lot of allocs and store instructions. >> >>> Meaning one alloc instruction and a single load and store in the >>> innermost loop, without any non-induction variable PHI nodes. >>> >>> When implementing this in C, LLVM may introduce again PHI nodes. So it >>> may possibly be necessary to write this directly in LLVM-IR. > >This is the important one. I wonder what the performance of the >dependence analysis is for increasing loop depth, if we have only a >single statement, that just reads and writes from the very same memory >location, but that does not have all the mess from the >non-induction-variable PHI nodes.If we change the "x=1" to "X[0]+=1", then there would be only one "store" instruction and one "load" instructions in the loop no matter how much the depth of the loop nest. In such a case, the compile time would increase exponentially as the depth of the loop nest increases: nest-2: 0.063s nest-4: 0.110s nest-6: 0.221s nest-8: 0.465s nest-10: 0.893s nest-12: 1.710s Cheers, Star Tan -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130817/6307b94f/attachment.html>
Apparently Analagous Threads
- [LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
- [LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
- [LLVMdev] [Polly] Analysis of the expensive compile-time overhead of Polly Dependence pass
- [LLVMdev] [Polly] Analysis of the expensive compile-time overhead of Polly Dependence pass
- [LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence