Star Tan
2013-Aug-16 00:28 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Hi Sebpop,

Thanks for your explanation.

I noticed that Polly eventually runs the SROA pass to turn these load/store instructions back into scalar operations. Is it possible to run such a pass before the polly-dependence analysis?

Star Tan

At 2013-08-15 21:12:53, "Sebastian Pop" <sebpop at gmail.com> wrote:
>Codeprepare and independent blocks are introducing these loads and stores.
>These are prepasses that polly runs prior to building the dependence graph
>to transform scalar dependences into data dependences.
>Ether was working on eliminating the rewrite of scalar dependences.
>
>On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star at yeah.net> wrote:
>> Hi all,
>>
>> I have investigated the 6X extra compile-time overhead when Polly compiles
>> the simple nestedloop benchmark in the LLVM test-suite
>> (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28).
>> Preliminary results show that the extra compile time is spent in the
>> polly-dependence analysis. However, the root cause seems to be the
>> polly-prepare pass, which introduces a large number of store instructions
>> and thereby significantly complicates the polly-dependence pass.
>>
>> Let me show the results with a tiny code example:
>>
>> int main(int argc, char *argv[]) {
>>   int n = ((argc == 2) ? atoi(argv[1]) : 46);
>>   int a, b, x = 0;
>>   for (a = 0; a < n; a++)
>>     for (b = 0; b < n; b++)
>>       x++;
>>   printf("%d\n", x);
>>   return 0;
>> }
>>
>> The basic LLVM IR code (produced by "clang -O1") is:
>>
>> @.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
>> ; Function Attrs: nounwind uwtable
>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>> entry:
>>   %cmp = icmp eq i32 %argc, 2
>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>> cond.end:
>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>   %0 = load i8** %arrayidx, align 8
>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0) #3
>>   %cmp117 = icmp sgt i32 %call, 0
>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>> for.cond2.preheader.lr.ph:
>>   %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
>>   %cmp314 = icmp sgt i32 %cond22, 0
>>   br label %for.cond2.preheader
>> for.cond2.preheader:
>>   %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa, %for.inc6 ]
>>   %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
>>   br i1 %cmp314, label %for.body4, label %for.inc6
>> for.body4:
>>   %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
>>   %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
>>   %inc = add nsw i32 %x.116, 1
>>   %inc5 = add nsw i32 %b.015, 1
>>   %cmp3 = icmp slt i32 %inc5, %cond22
>>   br i1 %cmp3, label %for.body4, label %for.inc6
>> for.inc6:
>>   %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4 ]
>>   %inc7 = add nsw i32 %a.018, 1
>>   %cmp1 = icmp slt i32 %inc7, %cond22
>>   br i1 %cmp1, label %for.cond2.preheader, label %for.end8
>> for.end8:
>>   %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
>>   %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
>>   ret i32 0
>> }
>> declare i32 @atoi(...)
>> declare i32 @printf(i8* nocapture readonly, ...)
>>
>> This code is very simple and contains no memory instructions at all, so the
>> polly-dependence pass runs very fast on it. Unfortunately, when we run
>> "opt -load LLVMPolly.so" on this IR, the polly-prepare pass introduces a
>> large number of store instructions like this:
>>
>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>> entry:
>>   %cond22.reg2mem = alloca i32
>>   %x.019.reg2mem = alloca i32
>>   %x.1.lcssa.reg2mem = alloca i32
>>   %x.1.lcssa.lcssa.reg2mem = alloca i32
>>   %x.0.lcssa.reg2mem = alloca i32
>>   br label %entry.split
>> entry.split:
>>   %cmp = icmp eq i32 %argc, 2
>>   store i32 46, i32* %cond22.reg2mem
>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>> cond.end:
>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>   %0 = load i8** %arrayidx, align 8
>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0)
>>   %cmp117 = icmp sgt i32 %call, 0
>>   store i32 0, i32* %x.0.lcssa.reg2mem
>>   store i32 %call, i32* %cond22.reg2mem
>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>> ...
>>
>> These store instructions significantly complicate the polly-dependence
>> pass and thus lead to high compile-time overhead.
>>
>> I have noticed that such memory instructions are eventually simplified back
>> into scalar operations by the SROA pass, so one possible way to reduce this
>> compile-time overhead is to move the SROA pass ahead of the polly-dependence
>> analysis.
>>
>> Can anyone give me some hints about why the polly-prepare pass introduces
>> such memory instructions? Is it possible to move the SROA pass ahead of the
>> polly-dependence analysis?
>>
>> Thanks,
>> Star Tan
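To make the effect of the preparation passes concrete, here is a C-level sketch of what the reg2mem-style demotion of the scalar x roughly corresponds to. This is only an illustration (the function name is made up, and Polly of course works on the IR, not on C source); it shows how the register dependence on x becomes explicit loads and stores, which is what the dependence analysis then has to model:

    /* Illustrative only: roughly what demoting x to a stack slot means,
       cf. the %x.019.reg2mem / %x.1.lcssa.reg2mem allocas in the IR above. */
    int count_demoted(int n) {
      int a, b;
      int x_slot;               /* stack slot instead of a register */
      x_slot = 0;               /* corresponds to a store */
      for (a = 0; a < n; a++)
        for (b = 0; b < n; b++) {
          int t = x_slot;       /* corresponds to a load  */
          x_slot = t + 1;       /* corresponds to a store */
        }
      return x_slot;            /* corresponds to a final load */
    }

Every one of these loads and stores shows up as a memory access in the SCoP, which is why the dependence computation becomes so much more expensive than on the original, purely scalar loop nest.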
Sebastian Pop
2013-Aug-16 01:38 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
I do not think that running SROA before polly is a good idea: it would defeat
the purpose of the code preparation passes that polly intentionally schedules
for the data dependence analysis. If you remove the data references before
polly runs, you would miss them in the dependence graph: that could lead to
incorrect transforms.

On Thu, Aug 15, 2013 at 7:28 PM, Star Tan <tanmx_star at yeah.net> wrote:
> Hi Sebpop,
>
> Thanks for your explanation.
>
> I noticed that Polly would finally run the SROA pass to transform these
> load/store instructions into scalar operations. Is it possible to run such
> a pass before polly-dependence analysis?
>
> Star Tan
>
> At 2013-08-15 21:12:53, "Sebastian Pop" <sebpop at gmail.com> wrote:
>>Codeprepare and independent blocks are introducing these loads and stores.
>>These are prepasses that polly runs prior to building the dependence graph
>>to transform scalar dependences into data dependences.
>>Ether was working on eliminating the rewrite of scalar dependences.
>>
>> [...]
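To make the concern concrete, here is a small hand-written C sketch (illustrative only; neither Polly's code nor the benchmark, and the function name and array A are invented for the example) of the kind of loop-carried scalar dependence that is only visible to a memory-based dependence analysis while the prepared loads and stores are still present:

    /* Illustrative only.  The store in iteration i feeds the load in
       iteration i+1 through sum_slot.  If the slot is promoted back to a
       register before the dependence graph is built, this loop-carried
       dependence is no longer represented as a data dependence, even though
       it still constrains which transformations are legal. */
    int sum_demoted(int n, int A[]) {
      int sum_slot;
      sum_slot = 0;             /* corresponds to a store */
      for (int i = 0; i < n; i++) {
        int t = sum_slot;       /* load: value produced by iteration i-1 */
        sum_slot = t + A[i];    /* store: value consumed by iteration i+1 */
      }
      return sum_slot;
    }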
Star Tan
2013-Aug-16 01:48 UTC
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
I see. Thank you.

Do you have any ideas for reducing the cost of the dependence analysis caused
by those extra load/store instructions? Any suggestion would be appreciated.

Thanks~

At 2013-08-16 09:38:51, "Sebastian Pop" <sebpop at gmail.com> wrote:
>I do not think that running SROA before polly is a good idea:
>it would defeat the purpose of the code preparation passes that
>polly intentionally schedules for the data dependence analysis.
>If you remove the data references before polly runs, you would
>miss them in the dependence graph: that could lead to incorrect
>transforms.
>
>On Thu, Aug 15, 2013 at 7:28 PM, Star Tan <tanmx_star at yeah.net> wrote:
>> Hi Sebpop,
>>
>> Thanks for your explanation.
>>
>> I noticed that Polly would finally run the SROA pass to transform these
>> load/store instructions into scalar operations. Is it possible to run such
>> a pass before polly-dependence analysis?
>>
>> Star Tan
>>
>> [...]