Hal Finkel
2015-Jul-16 06:46 UTC
[LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
----- Original Message -----> From: "Chandler Carruth" <chandlerc at google.com> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu > Sent: Thursday, July 16, 2015 1:06:03 AM > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops > with a volatile iteration variable> On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel < hfinkel at anl.gov > wrote:> > > From: "Chandler Carruth" < chandlerc at google.com > > > > > > > To: "Hyojin Sung" < hsung at us.ibm.com >, llvmdev at cs.uiuc.edu > > > > > > Sent: Wednesday, July 15, 2015 7:34:54 PM > > > > > > Subject: Re: [LLVMdev] Improving loop vectorizer support for > > > loops > > > with a volatile iteration variable > > > > > > On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung < hsung at us.ibm.com > > > > wrote: > > >> > > > Hi all, > > > > > >> > > > I would like to propose an improvement of the “almost dead” > > > > block > > > > elimination in Transforms/Local.cpp so that it will preserve > > > > the > > > > canonical loop form for loops with a volatile iteration > > > > variable. > > > > > >> > > > *** Problem statement > > > > > > > > > > Nested loops in LCALS Subset B ( > > > > https://codesign.llnl.gov/LCALS.php > > > > ) are not vectorized with LLVM -O3 because the LLVM loop > > > > vectorizer > > > > fails the test whether the loop latch and exiting block of a > > > > loop > > > > is > > > > the same. The loops are vectorizable, and get vectorized with > > > > LLVM > > > > -O2 > > > > > > > > > I would be interested to know why -O2 succeeds here. > > >> > > > and also with other commercial compilers (icc, xlc). > > > > > >> > > > *** Details > > > > > > > > > > These loops ended up with different loop latch and exiting > > > > block > > > > after a series of optimizations including loop unswitching, > > > > jump > > > > threading, simplify-the-CFG, and loop simplify. The fundamental > > > > problem here is that the above optimizations cannot recognize a > > > > loop > > > > with a volatile iteration variable and do not preserve its > > > > canonical > > > > loop structure. > > > > > > > > > Ok, meta-level question first: > > >> > > Why do we care about performance of loops with a volatile > > > iteration > > > variable? > > > > > I don't think we do, however, I think that misses the point. In > > this > > case, the volatile iteration variable is just an easy way to expose > > this problem that we have with nested loop canonicalization and the > > vectorizer. To be specific: >> > This we vectorizer just fine: >> > void foo1(float * restrict x, float * restrict y, float * restrict > > z) > > { > > > for (int i = 0; i < 1000; ++i) { > > > for (int j = 0; j < 1600; ++j) { > > > x[j] = y[j] + z[j]; > > > } > > > } > > > } >> > And, indeed, this we don't (the only change is adding volatile on > > i): >> > void foo2(float * restrict x, float * restrict y, float * restrict > > z) > > { > > > for (volatile int i = 0; i < 1000; ++i) { > > > for (int j = 0; j < 1600; ++j) { > > > x[j] = y[j] + z[j]; > > > } > > > } > > > } >> > However, this we don't either, and that's a big problem: >> > int done(float *x, float *y, float *z); > > > void foo3(float * restrict x, float * restrict y, float * restrict > > z) > > { > > > while (!done(x, y, z)) { > > > for (int j = 0; j < 1600; ++j) { > > > x[j] = y[j] + z[j]; > > > } > > > } > > > } >> > And the underlying reason is the same. The IR at the point in time > > when the loop vectorizer runs looks like this: >> > define void @foo3(float* noalias %x, float* noalias %y, float* > > noalias %z) #0 { > > > entry: > > > %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1 > > > %lnot.15 = icmp eq i32 %call.14, 0 > > > br i1 %lnot.15, label %for.body.preheader, label %while.end >> > for.body.preheader: ; preds = %entry > > > br label %for.body >> > while.cond.loopexit: ; preds = %for.body > > > %call = tail call i32 @done(float* %x, float* %y, float* %z) #1 > > > %lnot = icmp eq i32 %call, 0 > > > br i1 %lnot, label %for.body.backedge, label %while.end.loopexit >> > for.body: ; preds = %for.body.backedge, %for.body.preheader > > > %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ % indvars.iv.be > > , > > %for.body.backedge ] > > > ... > > > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 > > > %exitcond = icmp eq i64 %indvars.iv.next, 1600 > > > br i1 %exitcond, label %while.cond.loopexit, label > > %for.body.backedge >> > for.body.backedge: ; preds = %for.body, %while.cond.loopexit > > > % indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, > > %while.cond.loopexit ] > > > br label %for.body >> > while.end.loopexit: ; preds = %while.cond.loopexit > > > br label %while.end >> > while.end: ; preds = %while.end.loopexit, %entry > > > ret void > > > } >> > and we can currently only vectorize loops where the loop latch is > > also the loop's exiting block. In this case, as in the case with > > the > > volatile variable, vectorization is blocked by this constraint > > (here > > the backedge is from the terminator of %for.body.backedge, but the > > loop exiting block is %for.body). The check in the vectorizer is > > explicit: >> > // We only handle bottom-tested loops, i.e. loop in which the > > condition is > > > // checked at the end of each iteration. With that we can assume > > that > > all > > > // instructions in the loop are executed the same number of times. > > > if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { > > > ... >> Thanks for the detailed explanation. This makes much more sense why > we need to handle it. I think its much better to look at nested > loops of this form than anything to do with volatile -- the latter > is too prone to other random optimizations turning off.> Regarding this problem, it would be interesting to know based on this > explanation what the desired fix would be. I see at least these > options:> 1) Canonicalize loops harder to make them look the way the vectorizer > wants. If this can be done without causing significant problems, it > seems likely the best approach.I agree. In this case, we could certainly fold the trivial %for.body.backedge block into %for.body, meaning transforming this: for.body.backedge: ; preds = %while.cond.loopexit, %for.body %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ] br label %for.body for.body: ; preds = %for.body.backedge, %for.body.preheader %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ] ... %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 %exitcond = icmp eq i64 %indvars.iv.next, 1600 br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge into this: for.body: ; preds = %for.body.backedge, %for.body.preheader %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ] %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body ] ... %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 %exitcond = icmp eq i64 %indvars.iv.next, 1600 br i1 %exitcond, label %while.cond.loopexit, label %for.body and this seems pretty trivial when %for.body.backedge is completely empty (as in this case), but if it had non-PHI instructions in it on which the existing PHIs in %for.body depended, then maybe this becomes less trivial?> 2) Teach the vectorizer to vectorize without this constraint by > instead establishing the actual invariant it cares about.It really cares that there's no code that comes in between the latch and the exit, because such code is not really part of the loop (it only runs once), or code in between the exit and the latch (because such code runs in one fewer iterations than the code before the exit). At least nothing with side effects I presume. -Hal> Maybe there is another strategy?> > -Hal >> > > That seems both counter-intuitive and unlikely to be a useful > > > goal. > > > We simply don't optimize volatile operations well in *any* part > > > of > > > the optimizer, and I'm not sure why we need to start trying to > > > fix > > > that. This seems like an irreparably broken benchmark, but > > > perhaps > > > there is a motivation I don't yet see. > > >> > > Assuming that sufficient motivation arises to try to fix this, > > > see > > > my > > > comments below: > > >> > > > (1) Loop unswitching generates several empty placeholder BBs > > > > only > > > > with PHI nodes after separating out a shorter path with no > > > > inner > > > > loop execution from a standard path. > > > > > >> > > > (2) Jump threading and simplify-the-CFG passes independently > > > > calls > > > > TryToSimplifyUnconditionalBranchFromEmptyBlock() in > > > > Transforms/Utils/Local.cpp to get rid of almost empty BBs. > > > > > >> > > > (3) TryToSimplifyUnconditionalBranchFromEmtpyBlock() eliminates > > > > the > > > > placeholder BBs after loop unswitching and merges them into > > > > subsequent blocks including the header of the inner loop. > > > > Before > > > > eliminating the blocks, the function checks if the block is a > > > > loop > > > > header by looking at its PHI nodes so that it can be saved, but > > > > the > > > > test fails with the loops with a volatile iteration variable. > > > > > > > > > Why does this fail for a volatile iteration variable but not for > > > a > > > non-volatile one? I think understanding that will be key to > > > understanding how it should be fixed. > > >> > > _______________________________________________ > > > > > > LLVM Developers mailing list > > > > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > > > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > >> > -- >> > Hal Finkel > > > Assistant Computational Scientist > > > Leadership Computing Facility > > > Argonne National Laboratory >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150716/edb8e701/attachment.html>
Hal Finkel
2015-Jul-16 06:58 UTC
[LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
----- Original Message -----> From: "Hal Finkel" <hfinkel at anl.gov> > To: "Chandler Carruth" <chandlerc at google.com> > Cc: llvmdev at cs.uiuc.edu > Sent: Thursday, July 16, 2015 1:46:42 AM > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops > with a volatile iteration variable> ----- Original Message -----> > From: "Chandler Carruth" <chandlerc at google.com> > > > To: "Hal Finkel" <hfinkel at anl.gov> > > > Cc: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu > > > Sent: Thursday, July 16, 2015 1:06:03 AM > > > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops > > with a volatile iteration variable >> > On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel < hfinkel at anl.gov > > > wrote: >> > > > From: "Chandler Carruth" < chandlerc at google.com > > > > > > > > > > > To: "Hyojin Sung" < hsung at us.ibm.com >, llvmdev at cs.uiuc.edu > > > > > > > > > > Sent: Wednesday, July 15, 2015 7:34:54 PM > > > > > > > > > > Subject: Re: [LLVMdev] Improving loop vectorizer support for > > > > loops > > > > with a volatile iteration variable > > > > > > > > > > On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung < hsung at us.ibm.com > > > > > > > > > wrote: > > > > > >> > > > > Hi all, > > > > > > > > > >> > > > > I would like to propose an improvement of the “almost dead” > > > > > block > > > > > elimination in Transforms/Local.cpp so that it will preserve > > > > > the > > > > > canonical loop form for loops with a volatile iteration > > > > > variable. > > > > > > > > > >> > > > > *** Problem statement > > > > > > > > > > > > > > > Nested loops in LCALS Subset B ( > > > > > https://codesign.llnl.gov/LCALS.php > > > > > ) are not vectorized with LLVM -O3 because the LLVM loop > > > > > vectorizer > > > > > fails the test whether the loop latch and exiting block of a > > > > > loop > > > > > is > > > > > the same. The loops are vectorizable, and get vectorized with > > > > > LLVM > > > > > -O2 > > > > > > > > > > > > > > I would be interested to know why -O2 succeeds here. > > > > > >> > > > > and also with other commercial compilers (icc, xlc). > > > > > > > > > >> > > > > *** Details > > > > > > > > > > > > > > > These loops ended up with different loop latch and exiting > > > > > block > > > > > after a series of optimizations including loop unswitching, > > > > > jump > > > > > threading, simplify-the-CFG, and loop simplify. The > > > > > fundamental > > > > > problem here is that the above optimizations cannot recognize > > > > > a > > > > > loop > > > > > with a volatile iteration variable and do not preserve its > > > > > canonical > > > > > loop structure. > > > > > > > > > > > > > > Ok, meta-level question first: > > > > > >> > > > Why do we care about performance of loops with a volatile > > > > iteration > > > > variable? > > > > > > > > > I don't think we do, however, I think that misses the point. In > > > this > > > case, the volatile iteration variable is just an easy way to > > > expose > > > this problem that we have with nested loop canonicalization and > > > the > > > vectorizer. To be specific: > > >> > > This we vectorizer just fine: > > >> > > void foo1(float * restrict x, float * restrict y, float * > > > restrict > > > z) > > > { > > > > > > for (int i = 0; i < 1000; ++i) { > > > > > > for (int j = 0; j < 1600; ++j) { > > > > > > x[j] = y[j] + z[j]; > > > > > > } > > > > > > } > > > > > > } > > >> > > And, indeed, this we don't (the only change is adding volatile on > > > i): > > >> > > void foo2(float * restrict x, float * restrict y, float * > > > restrict > > > z) > > > { > > > > > > for (volatile int i = 0; i < 1000; ++i) { > > > > > > for (int j = 0; j < 1600; ++j) { > > > > > > x[j] = y[j] + z[j]; > > > > > > } > > > > > > } > > > > > > } > > >> > > However, this we don't either, and that's a big problem: > > >> > > int done(float *x, float *y, float *z); > > > > > > void foo3(float * restrict x, float * restrict y, float * > > > restrict > > > z) > > > { > > > > > > while (!done(x, y, z)) { > > > > > > for (int j = 0; j < 1600; ++j) { > > > > > > x[j] = y[j] + z[j]; > > > > > > } > > > > > > } > > > > > > } > > >> > > And the underlying reason is the same. The IR at the point in > > > time > > > when the loop vectorizer runs looks like this: > > >> > > define void @foo3(float* noalias %x, float* noalias %y, float* > > > noalias %z) #0 { > > > > > > entry: > > > > > > %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) > > > #1 > > > > > > %lnot.15 = icmp eq i32 %call.14, 0 > > > > > > br i1 %lnot.15, label %for.body.preheader, label %while.end > > >> > > for.body.preheader: ; preds = %entry > > > > > > br label %for.body > > >> > > while.cond.loopexit: ; preds = %for.body > > > > > > %call = tail call i32 @done(float* %x, float* %y, float* %z) #1 > > > > > > %lnot = icmp eq i32 %call, 0 > > > > > > br i1 %lnot, label %for.body.backedge, label %while.end.loopexit > > >> > > for.body: ; preds = %for.body.backedge, %for.body.preheader > > > > > > %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ % > > > indvars.iv.be > > > , > > > %for.body.backedge ] > > > > > > ... > > > > > > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 > > > > > > %exitcond = icmp eq i64 %indvars.iv.next, 1600 > > > > > > br i1 %exitcond, label %while.cond.loopexit, label > > > %for.body.backedge > > >> > > for.body.backedge: ; preds = %for.body, %while.cond.loopexit > > > > > > % indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, > > > %while.cond.loopexit ] > > > > > > br label %for.body > > >> > > while.end.loopexit: ; preds = %while.cond.loopexit > > > > > > br label %while.end > > >> > > while.end: ; preds = %while.end.loopexit, %entry > > > > > > ret void > > > > > > } > > >> > > and we can currently only vectorize loops where the loop latch is > > > also the loop's exiting block. In this case, as in the case with > > > the > > > volatile variable, vectorization is blocked by this constraint > > > (here > > > the backedge is from the terminator of %for.body.backedge, but > > > the > > > loop exiting block is %for.body). The check in the vectorizer is > > > explicit: > > >> > > // We only handle bottom-tested loops, i.e. loop in which the > > > condition is > > > > > > // checked at the end of each iteration. With that we can assume > > > that > > > all > > > > > > // instructions in the loop are executed the same number of > > > times. > > > > > > if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { > > > > > > ... > > >> > Thanks for the detailed explanation. This makes much more sense why > > we need to handle it. I think its much better to look at nested > > loops of this form than anything to do with volatile -- the latter > > is too prone to other random optimizations turning off. >> > Regarding this problem, it would be interesting to know based on > > this > > explanation what the desired fix would be. I see at least these > > options: >> > 1) Canonicalize loops harder to make them look the way the > > vectorizer > > wants. If this can be done without causing significant problems, it > > seems likely the best approach. > > I agree. In this case, we could certainly fold the trivial > %for.body.backedge block into %for.body, meaning transforming this:> for.body.backedge: ; preds = %while.cond.loopexit, %for.body > %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, > %while.cond.loopexit ] > br label %for.body> for.body: ; preds = %for.body.backedge, %for.body.preheader > %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, > %for.body.backedge ] > ... > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 > %exitcond = icmp eq i64 %indvars.iv.next, 1600 > br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge> into this:> for.body: ; preds = %for.body.backedge, %for.body.preheader > %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, > %while.cond.loopexit ] > %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, > %for.body ] > ... > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 > %exitcond = icmp eq i64 %indvars.iv.next, 1600 > br i1 %exitcond, label %while.cond.loopexit, label %for.body> and this seems pretty trivial when %for.body.backedge is completely > empty (as in this case), but if it had non-PHI instructions in it on > which the existing PHIs in %for.body depended, then maybe this > becomes less trivial?Although, based on what I said below, the case with instructions there we can't currently vectorize anyway for more-fundamental reasons. -Hal> > 2) Teach the vectorizer to vectorize without this constraint by > > instead establishing the actual invariant it cares about. > > It really cares that there's no code that comes in between the latch > and the exit, because such code is not really part of the loop (it > only runs once), or code in between the exit and the latch (because > such code runs in one fewer iterations than the code before the > exit). At least nothing with side effects I presume.> -Hal> > Maybe there is another strategy? >> > > -Hal > > >> > > > That seems both counter-intuitive and unlikely to be a useful > > > > goal. > > > > We simply don't optimize volatile operations well in *any* part > > > > of > > > > the optimizer, and I'm not sure why we need to start trying to > > > > fix > > > > that. This seems like an irreparably broken benchmark, but > > > > perhaps > > > > there is a motivation I don't yet see. > > > > > >> > > > Assuming that sufficient motivation arises to try to fix this, > > > > see > > > > my > > > > comments below: > > > > > >> > > > > (1) Loop unswitching generates several empty placeholder BBs > > > > > only > > > > > with PHI nodes after separating out a shorter path with no > > > > > inner > > > > > loop execution from a standard path. > > > > > > > > > >> > > > > (2) Jump threading and simplify-the-CFG passes independently > > > > > calls > > > > > TryToSimplifyUnconditionalBranchFromEmptyBlock() in > > > > > Transforms/Utils/Local.cpp to get rid of almost empty BBs. > > > > > > > > > >> > > > > (3) TryToSimplifyUnconditionalBranchFromEmtpyBlock() > > > > > eliminates > > > > > the > > > > > placeholder BBs after loop unswitching and merges them into > > > > > subsequent blocks including the header of the inner loop. > > > > > Before > > > > > eliminating the blocks, the function checks if the block is a > > > > > loop > > > > > header by looking at its PHI nodes so that it can be saved, > > > > > but > > > > > the > > > > > test fails with the loops with a volatile iteration variable. > > > > > > > > > > > > > > Why does this fail for a volatile iteration variable but not > > > > for > > > > a > > > > non-volatile one? I think understanding that will be key to > > > > understanding how it should be fixed. > > > > > >> > > > _______________________________________________ > > > > > > > > > > LLVM Developers mailing list > > > > > > > > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > > > > > > > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > > > > >> > > -- > > >> > > Hal Finkel > > > > > > Assistant Computational Scientist > > > > > > Leadership Computing Facility > > > > > > Argonne National Laboratory > > >> --> Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory> _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150716/ffe1982b/attachment.html>
Hal Finkel
2015-Jul-16 07:11 UTC
[LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
----- Original Message -----> From: "Hal Finkel" <hfinkel at anl.gov> > To: "Chandler Carruth" <chandlerc at google.com> > Cc: llvmdev at cs.uiuc.edu > Sent: Thursday, July 16, 2015 1:58:02 AM > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops > with a volatile iteration variable> ----- Original Message -----> > From: "Hal Finkel" <hfinkel at anl.gov> > > > To: "Chandler Carruth" <chandlerc at google.com> > > > Cc: llvmdev at cs.uiuc.edu > > > Sent: Thursday, July 16, 2015 1:46:42 AM > > > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops > > with a volatile iteration variable >> > ----- Original Message ----- >> > > From: "Chandler Carruth" <chandlerc at google.com> > > > > > > To: "Hal Finkel" <hfinkel at anl.gov> > > > > > > Cc: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu > > > > > > Sent: Thursday, July 16, 2015 1:06:03 AM > > > > > > Subject: Re: [LLVMdev] Improving loop vectorizer support for > > > loops > > > with a volatile iteration variable > > >> > > On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel < hfinkel at anl.gov > > > > wrote: > > >> > > > > From: "Chandler Carruth" < chandlerc at google.com > > > > > > > > > > > > > > > > To: "Hyojin Sung" < hsung at us.ibm.com >, llvmdev at cs.uiuc.edu > > > > > > > > > > > > > > > Sent: Wednesday, July 15, 2015 7:34:54 PM > > > > > > > > > > > > > > > Subject: Re: [LLVMdev] Improving loop vectorizer support for > > > > > loops > > > > > with a volatile iteration variable > > > > > > > > > > > > > > > On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung < > > > > > hsung at us.ibm.com > > > > > > > > > > > wrote: > > > > > > > > > >> > > > > > Hi all, > > > > > > > > > > > > > > >> > > > > > I would like to propose an improvement of the “almost dead” > > > > > > block > > > > > > elimination in Transforms/Local.cpp so that it will > > > > > > preserve > > > > > > the > > > > > > canonical loop form for loops with a volatile iteration > > > > > > variable. > > > > > > > > > > > > > > >> > > > > > *** Problem statement > > > > > > > > > > > > > > > > > > > > > Nested loops in LCALS Subset B ( > > > > > > https://codesign.llnl.gov/LCALS.php > > > > > > ) are not vectorized with LLVM -O3 because the LLVM loop > > > > > > vectorizer > > > > > > fails the test whether the loop latch and exiting block of > > > > > > a > > > > > > loop > > > > > > is > > > > > > the same. The loops are vectorizable, and get vectorized > > > > > > with > > > > > > LLVM > > > > > > -O2 > > > > > > > > > > > > > > > > > > > > I would be interested to know why -O2 succeeds here. > > > > > > > > > >> > > > > > and also with other commercial compilers (icc, xlc). > > > > > > > > > > > > > > >> > > > > > *** Details > > > > > > > > > > > > > > > > > > > > > These loops ended up with different loop latch and exiting > > > > > > block > > > > > > after a series of optimizations including loop unswitching, > > > > > > jump > > > > > > threading, simplify-the-CFG, and loop simplify. The > > > > > > fundamental > > > > > > problem here is that the above optimizations cannot > > > > > > recognize > > > > > > a > > > > > > loop > > > > > > with a volatile iteration variable and do not preserve its > > > > > > canonical > > > > > > loop structure. > > > > > > > > > > > > > > > > > > > > Ok, meta-level question first: > > > > > > > > > >> > > > > Why do we care about performance of loops with a volatile > > > > > iteration > > > > > variable? > > > > > > > > > > > > > > I don't think we do, however, I think that misses the point. In > > > > this > > > > case, the volatile iteration variable is just an easy way to > > > > expose > > > > this problem that we have with nested loop canonicalization and > > > > the > > > > vectorizer. To be specific: > > > > > >> > > > This we vectorizer just fine: > > > > > >> > > > void foo1(float * restrict x, float * restrict y, float * > > > > restrict > > > > z) > > > > { > > > > > > > > > > for (int i = 0; i < 1000; ++i) { > > > > > > > > > > for (int j = 0; j < 1600; ++j) { > > > > > > > > > > x[j] = y[j] + z[j]; > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > } > > > > > >> > > > And, indeed, this we don't (the only change is adding volatile > > > > on > > > > i): > > > > > >> > > > void foo2(float * restrict x, float * restrict y, float * > > > > restrict > > > > z) > > > > { > > > > > > > > > > for (volatile int i = 0; i < 1000; ++i) { > > > > > > > > > > for (int j = 0; j < 1600; ++j) { > > > > > > > > > > x[j] = y[j] + z[j]; > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > } > > > > > >> > > > However, this we don't either, and that's a big problem: > > > > > >> > > > int done(float *x, float *y, float *z); > > > > > > > > > > void foo3(float * restrict x, float * restrict y, float * > > > > restrict > > > > z) > > > > { > > > > > > > > > > while (!done(x, y, z)) { > > > > > > > > > > for (int j = 0; j < 1600; ++j) { > > > > > > > > > > x[j] = y[j] + z[j]; > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > } > > > > > >> > > > And the underlying reason is the same. The IR at the point in > > > > time > > > > when the loop vectorizer runs looks like this: > > > > > >> > > > define void @foo3(float* noalias %x, float* noalias %y, float* > > > > noalias %z) #0 { > > > > > > > > > > entry: > > > > > > > > > > %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) > > > > #1 > > > > > > > > > > %lnot.15 = icmp eq i32 %call.14, 0 > > > > > > > > > > br i1 %lnot.15, label %for.body.preheader, label %while.end > > > > > >> > > > for.body.preheader: ; preds = %entry > > > > > > > > > > br label %for.body > > > > > >> > > > while.cond.loopexit: ; preds = %for.body > > > > > > > > > > %call = tail call i32 @done(float* %x, float* %y, float* %z) #1 > > > > > > > > > > %lnot = icmp eq i32 %call, 0 > > > > > > > > > > br i1 %lnot, label %for.body.backedge, label > > > > %while.end.loopexit > > > > > >> > > > for.body: ; preds = %for.body.backedge, %for.body.preheader > > > > > > > > > > %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ % > > > > indvars.iv.be > > > > , > > > > %for.body.backedge ] > > > > > > > > > > ... > > > > > > > > > > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 > > > > > > > > > > %exitcond = icmp eq i64 %indvars.iv.next, 1600 > > > > > > > > > > br i1 %exitcond, label %while.cond.loopexit, label > > > > %for.body.backedge > > > > > >> > > > for.body.backedge: ; preds = %for.body, %while.cond.loopexit > > > > > > > > > > % indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, > > > > %while.cond.loopexit ] > > > > > > > > > > br label %for.body > > > > > >> > > > while.end.loopexit: ; preds = %while.cond.loopexit > > > > > > > > > > br label %while.end > > > > > >> > > > while.end: ; preds = %while.end.loopexit, %entry > > > > > > > > > > ret void > > > > > > > > > > } > > > > > >> > > > and we can currently only vectorize loops where the loop latch > > > > is > > > > also the loop's exiting block. In this case, as in the case > > > > with > > > > the > > > > volatile variable, vectorization is blocked by this constraint > > > > (here > > > > the backedge is from the terminator of %for.body.backedge, but > > > > the > > > > loop exiting block is %for.body). The check in the vectorizer > > > > is > > > > explicit: > > > > > >> > > > // We only handle bottom-tested loops, i.e. loop in which the > > > > condition is > > > > > > > > > > // checked at the end of each iteration. With that we can > > > > assume > > > > that > > > > all > > > > > > > > > > // instructions in the loop are executed the same number of > > > > times. > > > > > > > > > > if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { > > > > > > > > > > ... > > > > > >> > > Thanks for the detailed explanation. This makes much more sense > > > why > > > we need to handle it. I think its much better to look at nested > > > loops of this form than anything to do with volatile -- the > > > latter > > > is too prone to other random optimizations turning off. > > >> > > Regarding this problem, it would be interesting to know based on > > > this > > > explanation what the desired fix would be. I see at least these > > > options: > > >> > > 1) Canonicalize loops harder to make them look the way the > > > vectorizer > > > wants. If this can be done without causing significant problems, > > > it > > > seems likely the best approach. > > > > > I agree. In this case, we could certainly fold the trivial > > %for.body.backedge block into %for.body, meaning transforming this: >> > for.body.backedge: ; preds = %while.cond.loopexit, %for.body > > > %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, > > %while.cond.loopexit ] > > > br label %for.body >> > for.body: ; preds = %for.body.backedge, %for.body.preheader > > > %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, > > %for.body.backedge ] > > > ... > > > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 > > > %exitcond = icmp eq i64 %indvars.iv.next, 1600 > > > br i1 %exitcond, label %while.cond.loopexit, label > > %for.body.backedge >> > into this: >> > for.body: ; preds = %for.body.backedge, %for.body.preheader > > > %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, > > %while.cond.loopexit ] > > > %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, > > %for.body ] > > > ... > > > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 > > > %exitcond = icmp eq i64 %indvars.iv.next, 1600 > > > br i1 %exitcond, label %while.cond.loopexit, label %for.body >> > and this seems pretty trivial when %for.body.backedge is completely > > empty (as in this case), but if it had non-PHI instructions in it > > on > > which the existing PHIs in %for.body depended, then maybe this > > becomes less trivial? >> Although, based on what I said below, the case with instructions > there we can't currently vectorize anyway for more-fundamental > reasons.> -HalAlso worth pointing out that SimplifyCFG does this exact transformation. The vectorizer never sees it, however, because LoopSimplify prefers this form with the separate backedge block that the vectorizer can't handle. -Hal> > > 2) Teach the vectorizer to vectorize without this constraint by > > > instead establishing the actual invariant it cares about. > > > > > It really cares that there's no code that comes in between the > > latch > > and the exit, because such code is not really part of the loop (it > > only runs once), or code in between the exit and the latch (because > > such code runs in one fewer iterations than the code before the > > exit). At least nothing with side effects I presume. >> > -Hal >> > > Maybe there is another strategy? > > >> > > > -Hal > > > > > >> > > > > That seems both counter-intuitive and unlikely to be a useful > > > > > goal. > > > > > We simply don't optimize volatile operations well in *any* > > > > > part > > > > > of > > > > > the optimizer, and I'm not sure why we need to start trying > > > > > to > > > > > fix > > > > > that. This seems like an irreparably broken benchmark, but > > > > > perhaps > > > > > there is a motivation I don't yet see. > > > > > > > > > >> > > > > Assuming that sufficient motivation arises to try to fix > > > > > this, > > > > > see > > > > > my > > > > > comments below: > > > > > > > > > >> > > > > > (1) Loop unswitching generates several empty placeholder > > > > > > BBs > > > > > > only > > > > > > with PHI nodes after separating out a shorter path with no > > > > > > inner > > > > > > loop execution from a standard path. > > > > > > > > > > > > > > >> > > > > > (2) Jump threading and simplify-the-CFG passes > > > > > > independently > > > > > > calls > > > > > > TryToSimplifyUnconditionalBranchFromEmptyBlock() in > > > > > > Transforms/Utils/Local.cpp to get rid of almost empty BBs. > > > > > > > > > > > > > > >> > > > > > (3) TryToSimplifyUnconditionalBranchFromEmtpyBlock() > > > > > > eliminates > > > > > > the > > > > > > placeholder BBs after loop unswitching and merges them into > > > > > > subsequent blocks including the header of the inner loop. > > > > > > Before > > > > > > eliminating the blocks, the function checks if the block is > > > > > > a > > > > > > loop > > > > > > header by looking at its PHI nodes so that it can be saved, > > > > > > but > > > > > > the > > > > > > test fails with the loops with a volatile iteration > > > > > > variable. > > > > > > > > > > > > > > > > > > > > Why does this fail for a volatile iteration variable but not > > > > > for > > > > > a > > > > > non-volatile one? I think understanding that will be key to > > > > > understanding how it should be fixed. > > > > > > > > > >> > > > > _______________________________________________ > > > > > > > > > > > > > > > LLVM Developers mailing list > > > > > > > > > > > > > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > > > > > > > > > > > > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > > > > > > > > >> > > > -- > > > > > >> > > > Hal Finkel > > > > > > > > > > Assistant Computational Scientist > > > > > > > > > > Leadership Computing Facility > > > > > > > > > > Argonne National Laboratory > > > > > >> > -- >> > Hal Finkel > > > Assistant Computational Scientist > > > Leadership Computing Facility > > > Argonne National Laboratory >> > _______________________________________________ > > > LLVM Developers mailing list > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >> --> Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory> _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150716/49cc3584/attachment.html>