Hal Finkel
2015-Jul-16 07:11 UTC
[LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Chandler Carruth" <chandlerc at google.com>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Thursday, July 16, 2015 1:58:02 AM
> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
>
>> From: "Hal Finkel" <hfinkel at anl.gov>
>> To: "Chandler Carruth" <chandlerc at google.com>
>> Cc: llvmdev at cs.uiuc.edu
>> Sent: Thursday, July 16, 2015 1:46:42 AM
>>
>>> From: "Chandler Carruth" <chandlerc at google.com>
>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>> Cc: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu
>>> Sent: Thursday, July 16, 2015 1:06:03 AM
>>>
>>> On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>>>> From: "Chandler Carruth" <chandlerc at google.com>
>>>>> To: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu
>>>>> Sent: Wednesday, July 15, 2015 7:34:54 PM
>>>>>
>>>>> On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung <hsung at us.ibm.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to propose an improvement of the "almost dead" block elimination in Transforms/Local.cpp so that it will preserve the canonical loop form for loops with a volatile iteration variable.
>>>>>>
>>>>>> *** Problem statement
>>>>>> Nested loops in LCALS Subset B (https://codesign.llnl.gov/LCALS.php) are not vectorized with LLVM -O3 because the LLVM loop vectorizer fails the test whether the loop latch and the exiting block of a loop are the same. The loops are vectorizable, and get vectorized with LLVM -O2
>>>>
>>>> I would be interested to know why -O2 succeeds here.
>>>>
>>>>>> and also with other commercial compilers (icc, xlc).
>>>>>>
>>>>>> *** Details
>>>>>> These loops ended up with different loop latch and exiting block after a series of optimizations including loop unswitching, jump threading, simplify-the-CFG, and loop simplify. The fundamental problem here is that the above optimizations cannot recognize a loop with a volatile iteration variable and do not preserve its canonical loop structure.
>>>>>
>>>>> Ok, meta-level question first:
>>>>> Why do we care about performance of loops with a volatile iteration variable?
>>>>
>>>> I don't think we do; however, I think that misses the point. In this case, the volatile iteration variable is just an easy way to expose this problem that we have with nested-loop canonicalization and the vectorizer. To be specific:
>>>>
>>>> This we vectorize just fine:
>>>>
>>>> void foo1(float * restrict x, float * restrict y, float * restrict z) {
>>>>   for (int i = 0; i < 1000; ++i) {
>>>>     for (int j = 0; j < 1600; ++j) {
>>>>       x[j] = y[j] + z[j];
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> And, indeed, this we don't (the only change is adding volatile on i):
>>>>
>>>> void foo2(float * restrict x, float * restrict y, float * restrict z) {
>>>>   for (volatile int i = 0; i < 1000; ++i) {
>>>>     for (int j = 0; j < 1600; ++j) {
>>>>       x[j] = y[j] + z[j];
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> However, this we don't either, and that's a big problem:
>>>>
>>>> int done(float *x, float *y, float *z);
>>>> void foo3(float * restrict x, float * restrict y, float * restrict z) {
>>>>   while (!done(x, y, z)) {
>>>>     for (int j = 0; j < 1600; ++j) {
>>>>       x[j] = y[j] + z[j];
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> And the underlying reason is the same. The IR at the point in time when the loop vectorizer runs looks like this:
>>>>
>>>> define void @foo3(float* noalias %x, float* noalias %y, float* noalias %z) #0 {
>>>> entry:
>>>>   %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1
>>>>   %lnot.15 = icmp eq i32 %call.14, 0
>>>>   br i1 %lnot.15, label %for.body.preheader, label %while.end
>>>>
>>>> for.body.preheader:                   ; preds = %entry
>>>>   br label %for.body
>>>>
>>>> while.cond.loopexit:                  ; preds = %for.body
>>>>   %call = tail call i32 @done(float* %x, float* %y, float* %z) #1
>>>>   %lnot = icmp eq i32 %call, 0
>>>>   br i1 %lnot, label %for.body.backedge, label %while.end.loopexit
>>>>
>>>> for.body:                             ; preds = %for.body.backedge, %for.body.preheader
>>>>   %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
>>>>   ...
>>>>   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>>>   %exitcond = icmp eq i64 %indvars.iv.next, 1600
>>>>   br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge
>>>>
>>>> for.body.backedge:                    ; preds = %for.body, %while.cond.loopexit
>>>>   %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
>>>>   br label %for.body
>>>>
>>>> while.end.loopexit:                   ; preds = %while.cond.loopexit
>>>>   br label %while.end
>>>>
>>>> while.end:                            ; preds = %while.end.loopexit, %entry
>>>>   ret void
>>>> }
>>>>
>>>> and we can currently only vectorize loops where the loop latch is also the loop's exiting block. In this case, as in the case with the volatile variable, vectorization is blocked by this constraint (here the backedge is from the terminator of %for.body.backedge, but the loop exiting block is %for.body). The check in the vectorizer is explicit:
>>>>
>>>>   // We only handle bottom-tested loops, i.e. loop in which the condition is
>>>>   // checked at the end of each iteration. With that we can assume that all
>>>>   // instructions in the loop are executed the same number of times.
>>>>   if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
>>>>   ...
>>>
>>> Thanks for the detailed explanation. This makes it much clearer why we need to handle it. I think it's much better to look at nested loops of this form than anything to do with volatile -- the latter is too prone to other random optimizations turning off.
>>>
>>> Regarding this problem, it would be interesting to know based on this explanation what the desired fix would be. I see at least these options:
>>>
>>> 1) Canonicalize loops harder to make them look the way the vectorizer wants. If this can be done without causing significant problems, it seems likely the best approach.
>>
>> I agree. In this case, we could certainly fold the trivial %for.body.backedge block into %for.body, meaning transforming this:
>>
>> for.body.backedge:                     ; preds = %while.cond.loopexit, %for.body
>>   %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
>>   br label %for.body
>>
>> for.body:                              ; preds = %for.body.backedge, %for.body.preheader
>>   %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
>>   ...
>>   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>   %exitcond = icmp eq i64 %indvars.iv.next, 1600
>>   br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge
>>
>> into this:
>>
>> for.body:                              ; preds = %for.body.backedge, %for.body.preheader
>>   %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
>>   %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body ]
>>   ...
>>   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>   %exitcond = icmp eq i64 %indvars.iv.next, 1600
>>   br i1 %exitcond, label %while.cond.loopexit, label %for.body
>>
>> and this seems pretty trivial when %for.body.backedge is completely empty (as in this case), but if it had non-PHI instructions in it on which the existing PHIs in %for.body depended, then maybe this becomes less trivial?
>
> Although, based on what I said below, the case with instructions there we can't currently vectorize anyway for more-fundamental reasons.
>
> -Hal

Also worth pointing out that SimplifyCFG does this exact transformation. The vectorizer never sees it, however, because LoopSimplify prefers this form with the separate backedge block that the vectorizer can't handle.

 -Hal

>>> 2) Teach the vectorizer to vectorize without this constraint by instead establishing the actual invariant it cares about.
>>
>> It really cares that there's no code that comes in between the latch and the exit, because such code is not really part of the loop (it only runs once), or code in between the exit and the latch (because such code runs in one fewer iterations than the code before the exit). At least nothing with side effects, I presume.
>>
>> -Hal
>
>>> Maybe there is another strategy?
>>>
>>>>> That seems both counter-intuitive and unlikely to be a useful goal. We simply don't optimize volatile operations well in *any* part of the optimizer, and I'm not sure why we need to start trying to fix that. This seems like an irreparably broken benchmark, but perhaps there is a motivation I don't yet see.
>>>>
>>>> Assuming that sufficient motivation arises to try to fix this, see my comments below:
>>>>
>>>>>> (1) Loop unswitching generates several empty placeholder BBs only with PHI nodes after separating out a shorter path with no inner loop execution from a standard path.
>>>>>>
>>>>>> (2) Jump threading and simplify-the-CFG passes independently call TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp to get rid of almost empty BBs.
>>>>>>
>>>>>> (3) TryToSimplifyUnconditionalBranchFromEmptyBlock() eliminates the placeholder BBs after loop unswitching and merges them into subsequent blocks including the header of the inner loop. Before eliminating the blocks, the function checks if the block is a loop header by looking at its PHI nodes so that it can be saved, but the test fails with the loops with a volatile iteration variable.
>>>>>
>>>>> Why does this fail for a volatile iteration variable but not for a non-volatile one? I think understanding that will be key to understanding how it should be fixed.

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Hyojin Sung
2015-Aug-03 21:33 UTC
[LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
Hi,
I discussed the issue offline with Hal, and would like to clarify what
exactly is going on and what the trade-offs are for different solutions,
and to ask for more feedback on my proposed solution
(http://reviews.llvm.org/D11728). I will use the example from Hal's post:
void foo2(float * restrict x, float * restrict y, float * restrict z) {
  for (volatile int i = 0; i < 1000; ++i) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}
IR after the first loop simplify: A preheader is created.
; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x, float* noalias nocapture
readonly %y, float* noalias nocapture readonly %z) #0 {
entry:
%i = alloca i32, align 4
tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11,
metadata !25), !dbg !26
tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12,
metadata !25), !dbg !27
tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13,
metadata !25), !dbg !28
%i.0.i.0..sroa_cast = bitcast i32* %i to i8*
call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 0, i32* %i, align 4, !dbg !29
br label %for.cond, !dbg !30
for.cond:                                ; preds = %for.cond.cleanup.3, %entry
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !31
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !34
br i1 %cmp, label %for.cond.1.preheader, label
%for.cond.cleanup, !dbg !35
for.cond.1.preheader: ; preds = %for.cond
br label %for.cond.1, !dbg !36
for.cond.cleanup: ; preds = %for.cond
call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
ret void, !dbg !38
for.cond.1:                              ; preds = %for.cond.1.preheader, %for.body.4
%j.0 = phi i32 [ %inc, %for.body.4 ], [ 0, %for.cond.1.preheader ]
%cmp2 = icmp slt i32 %j.0, 1600, !dbg !36
br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !39
for.cond.cleanup.3: ; preds = %for.cond.1
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !40
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !40
tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align 4, !dbg !40
br label %for.cond, !dbg !41
for.body.4: ; preds = %for.cond.1
%idxprom = sext i32 %j.0 to i64, !dbg !42
%arrayidx = getelementptr inbounds float, float* %y, i64
%idxprom, !dbg !42
%0 = load float, float* %arrayidx, align 4, !dbg !42, !tbaa !44
%arrayidx6 = getelementptr inbounds float, float* %z, i64
%idxprom, !dbg !48
%1 = load float, float* %arrayidx6, align 4, !dbg !48, !tbaa !44
%add = fadd float %0, %1, !dbg !49
%arrayidx8 = getelementptr inbounds float, float* %x, i64
%idxprom, !dbg !50
store float %add, float* %arrayidx8, align 4, !dbg !51, !tbaa !44
%inc = add nsw i32 %j.0, 1, !dbg !52
tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18,
metadata !25), !dbg !53
br label %for.cond.1, !dbg !54
}
IR after loop rotation: After loop rotation, a rotated preheader
(for.cond.1.preheader.lr.ph) is created. A test for (i < 1000) is added at
the end of the "entry" block. If true, control jumps unconditionally to
"for.body.4" through "for.cond.1.preheader.lr.ph" and
"for.cond.1.preheader". You can see that these two blocks
("for.cond.1.preheader.lr.ph" and "for.cond.1.preheader") are practically
empty, and they will get eliminated later by Jump Threading and/or
Simplify-the-CFG. *IF* the outer loop had a non-volatile induction
variable, the loop would not be rotated this way in the first place, as
"for.cond.1.preheader" would have a PHI node for "i", and these blocks
would not be eliminated.
; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x, float* noalias nocapture
readonly %y, float* noalias nocapture readonly %z) #0 {
entry:
%i = alloca i32, align 4
tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11,
metadata !25), !dbg !26
tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12,
metadata !25), !dbg !27
tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13,
metadata !25), !dbg !28
%i.0.i.0..sroa_cast = bitcast i32* %i to i8*
call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 0, i32* %i, align 4, !dbg !29
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
%cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
br i1 %cmp.22, label %for.cond.1.preheader.lr.ph, label
%for.cond.cleanup, !dbg !34
for.cond.1.preheader.lr.ph: ; preds = %entry
br label %for.cond.1.preheader, !dbg !34
for.cond.1.preheader:                    ; preds = %for.cond.1.preheader.lr.ph, %for.cond.cleanup.3
br label %for.body.4, !dbg !35
for.cond.for.cond.cleanup_crit_edge:     ; preds = %for.cond.cleanup.3
br label %for.cond.cleanup, !dbg !34
for.cond.cleanup:                        ; preds = %for.cond.for.cond.cleanup_crit_edge, %entry
call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
ret void, !dbg !36
for.cond.cleanup.3: ; preds = %for.body.4
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !37
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !37
tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align 4, !dbg !37
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
br i1 %cmp, label %for.cond.1.preheader, label
%for.cond.for.cond.cleanup_crit_edge, !dbg !34
for.body.4:                              ; preds = %for.cond.1.preheader, %for.body.4
%j.020 = phi i32 [ 0, %for.cond.1.preheader ], [ %inc, %for.body.4 ]
%idxprom = sext i32 %j.020 to i64, !dbg !38
%arrayidx = getelementptr inbounds float, float* %y, i64
%idxprom, !dbg !38
%0 = load float, float* %arrayidx, align 4, !dbg !38, !tbaa !41
%arrayidx6 = getelementptr inbounds float, float* %z, i64
%idxprom, !dbg !45
%1 = load float, float* %arrayidx6, align 4, !dbg !45, !tbaa !41
%add = fadd float %0, %1, !dbg !46
%arrayidx8 = getelementptr inbounds float, float* %x, i64
%idxprom, !dbg !47
store float %add, float* %arrayidx8, align 4, !dbg !48, !tbaa !41
%inc = add nsw i32 %j.020, 1, !dbg !49
tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18,
metadata !25), !dbg !50
%cmp2 = icmp slt i32 %inc, 1600, !dbg !51
br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !35
After Jump Threading: "for.cond.1.preheader.lr.ph" and
"for.cond.1.preheader" are merged into "for.body.4" by
TryToSimplifyUnconditionalBranchFromEmptyBlock() in
Transforms/Utils/Local.cpp. Now "for.body.4" has three incoming edges
(two
backedges).
; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x, float* noalias nocapture
readonly %y, float* noalias nocapture readonly %z) #0 {
entry:
%i = alloca i32, align 4
tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11,
metadata !25), !dbg !26
tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12,
metadata !25), !dbg !27
tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13,
metadata !25), !dbg !28
%i.0.i.0..sroa_cast = bitcast i32* %i to i8*
call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 0, i32* %i, align 4, !dbg !29
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
%cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
br i1 %cmp.22, label %for.body.4, label %for.cond.cleanup, !dbg !34
for.cond.cleanup:                        ; preds = %for.cond.cleanup.3, %entry
call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
ret void, !dbg !35
for.cond.cleanup.3: ; preds = %for.body.4
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !36
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !36
tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align 4, !dbg !36
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
br i1 %cmp, label %for.body.4, label %for.cond.cleanup, !dbg !34
for.body.4:                              ; preds = %for.cond.cleanup.3, %entry, %for.body.4
%indvars.iv = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %entry ],
[ 0, %for.cond.cleanup.3 ]
%arrayidx = getelementptr inbounds float, float* %y, i64
%indvars.iv, !dbg !37
%0 = load float, float* %arrayidx, align 4, !dbg !37, !tbaa !40
%arrayidx6 = getelementptr inbounds float, float* %z, i64
%indvars.iv, !dbg !44
%1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
%add = fadd float %0, %1, !dbg !45
%arrayidx8 = getelementptr inbounds float, float* %x, i64
%indvars.iv, !dbg !46
store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
%exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4, !dbg !48
After another loop simplify: Loop simplify tries to separate out nested
loops but fails to do so with this loop, since it does not have a PHI node
for the outer loop variable. Instead, it creates a backedge block.
for.cond.cleanup.3: ; preds = %for.body.4
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !39
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !39
tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align 4, !dbg !39
tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14,
metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
br i1 %cmp, label %for.body.4.backedge, label
%for.cond.cleanup.loopexit, !dbg !34
for.body.4:                              ; preds = %for.body.4.backedge, %for.body.4.preheader
%indvars.iv = phi i64 [ 0, %for.body.4.preheader ], [ %indvars.iv.be,
%for.body.4.backedge ]
%arrayidx = getelementptr inbounds float, float* %y, i64
%indvars.iv, !dbg !35
%0 = load float, float* %arrayidx, align 4, !dbg !35, !tbaa !40
%arrayidx6 = getelementptr inbounds float, float* %z, i64
%indvars.iv, !dbg !44
%1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
%add = fadd float %0, %1, !dbg !45
%arrayidx8 = getelementptr inbounds float, float* %x, i64
%indvars.iv, !dbg !46
store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
%exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
br i1 %exitcond, label %for.cond.cleanup.3, label
%for.body.4.backedge, !dbg !48
for.body.4.backedge: ; preds = %for.body.4,
%for.cond.cleanup.3
%indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0,
%for.cond.cleanup.3 ]
br label %for.body.4
The LLVM loop vectorizer refuses to vectorize any loop whose loop latch
(for.body.4.backedge) is different from its exiting block
(for.cond.cleanup.3). This test is what lets the loop vectorizer assume
that all instructions in the loop are executed the same number of times.
I believe a fix is in order one way or another, because the example is
simple and common enough and is vectorized by other compilers. We can
approach it either by (1) preventing the loops from being collapsed in the
first place or by (2) teaching the loop vectorizer to handle collapsed
loops. For (2), we would need to allow the loop vectorizer to forgo the
assumption and handle the loop as it is. The assumption is fundamental to
many of the vectorization algorithms, so this would require extensive
updates, or may end up reverting the loop back to a properly nested form.
The downside of (1) is that it may slow down common optimization passes
that are repeatedly executed before vectorization.
My patch (http://reviews.llvm.org/D11728) is a prototype fix for (1) that
modifies Jump Threading and Simplify-the-CFG to not eliminate an empty loop
header BB even when the loop does not have a PHI node for its induction
variable. The details can be found at http://reviews.llvm.org/D11728. I
would welcome and appreciate any comments or feedback.
Best,
Hyojin
From: Hal Finkel <hfinkel at anl.gov>
To: Chandler Carruth <chandlerc at google.com>
Cc: llvmdev at cs.uiuc.edu
Date: 07/16/2015 03:19 AM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with
a volatile iteration variable
Sent by: llvmdev-bounces at cs.uiuc.edu
From: "Hal Finkel" <hfinkel at anl.gov>
To: "Chandler Carruth" <chandlerc at google.com>
Cc: llvmdev at cs.uiuc.edu
Sent: Thursday, July 16, 2015 1:58:02 AM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a
volatile iteration variable
From: "Hal Finkel" <hfinkel at anl.gov>
To: "Chandler Carruth" <chandlerc at google.com>
Cc: llvmdev at cs.uiuc.edu
Sent: Thursday, July 16, 2015 1:46:42 AM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a
volatile iteration variable
From: "Chandler Carruth" <chandlerc at google.com>
To: "Hal Finkel" <hfinkel at anl.gov>
Cc: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at
cs.uiuc.edu
Sent: Thursday, July 16, 2015 1:06:03 AM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a
volatile iteration variable
On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel <hfinkel at anl.gov> wrote:
From: "Chandler Carruth" <chandlerc at google.com>
To: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at
cs.uiuc.edu
Sent: Wednesday, July 15, 2015 7:34:54 PM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with
a volatile iteration variable
On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung <hsung at us.ibm.com>
wrote:
Hi all,
I would like to propose an improvement of the “almost dead” block
elimination in Transforms/Local.cpp so that it will preserve the
canonical loop form for loops with a volatile iteration variable.
*** Problem statement
Nested loops in LCALS Subset B (https://codesign.llnl.gov/LCALS.php)
are not vectorized with LLVM -O3 because the LLVM loop vectorizer
fails the test whether the loop latch and exiting block of a loop are
the same. The loops are vectorizable, and get vectorized with LLVM
-O2
I would be interested to know why -O2 succeeds here.
and also with other commercial compilers (icc, xlc).
*** Details
These loops ended up with different loop latch and exiting block
after a series of optimizations including loop unswitching, jump
threading, simplify-the-CFG, and loop simplify. The fundamental
problem here is that the above optimizations cannot recognize a loop
with a volatile iteration variable and do not preserve its canonical
loop structure.
Ok, meta-level question first:
Why do we care about performance of loops with a volatile iteration
variable?
I don't think we do, however, I think that misses the point. In this
case, the volatile iteration variable is just an easy way to expose this
problem that we have with nested loop canonicalization and the
vectorizer. To be specific:
This we vectorize just fine:
void foo1(float * restrict x, float * restrict y, float * restrict z) {
  for (int i = 0; i < 1000; ++i) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}
And, indeed, this we don't (the only change is adding volatile on i):
void foo2(float * restrict x, float * restrict y, float * restrict z) {
  for (volatile int i = 0; i < 1000; ++i) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}
However, this we don't either, and that's a big problem:
int done(float *x, float *y, float *z);

void foo3(float * restrict x, float * restrict y, float * restrict z) {
  while (!done(x, y, z)) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}
And the underlying reason is the same. The IR at the point in time when
the loop vectorizer runs looks like this:
define void @foo3(float* noalias %x, float* noalias %y, float* noalias
%z) #0 {
entry:
%call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1
%lnot.15 = icmp eq i32 %call.14, 0
br i1 %lnot.15, label %for.body.preheader, label %while.end
for.body.preheader: ; preds = %entry
br label %for.body
while.cond.loopexit: ; preds = %for.body
%call = tail call i32 @done(float* %x, float* %y, float* %z) #1
%lnot = icmp eq i32 %call, 0
br i1 %lnot, label %for.body.backedge, label %while.end.loopexit
for.body:                                ; preds = %for.body.backedge, %for.body.preheader
%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be,
%for.body.backedge ]
...
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 1600
br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge
for.body.backedge: ; preds = %for.body,
%while.cond.loopexit
%indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0,
%while.cond.loopexit ]
br label %for.body
while.end.loopexit:                      ; preds = %while.cond.loopexit
br label %while.end
while.end:                               ; preds = %while.end.loopexit, %entry
ret void
}
and we can currently only vectorize loops where the loop latch is also
the loop's exiting block. In this case, as in the case with the volatile
variable, vectorization is blocked by this constraint (here the backedge
is from the terminator of %for.body.backedge, but the loop exiting block
is %for.body). The check in the vectorizer is explicit:
// We only handle bottom-tested loops, i.e. loop in which the condition is
// checked at the end of each iteration. With that we can assume that all
// instructions in the loop are executed the same number of times.
if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
...
Thanks for the detailed explanation. This makes it much clearer why we
need to handle this. I think it's much better to look at nested loops of
this form than at anything to do with volatile -- the latter is too prone
to being turned off by other random optimizations.
Regarding this problem, it would be interesting to know based on this
explanation what the desired fix would be. I see at least these options:
1) Canonicalize loops harder to make them look the way the vectorizer
wants. If this can be done without causing significant problems, it seems
likely the best approach.
I agree. In this case, we could certainly fold the trivial
%for.body.backedge block into %for.body, meaning transforming this:
for.body.backedge: ; preds = %while.cond.loopexit, %for.body
%indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
br label %for.body
for.body: ; preds = %for.body.backedge, %for.body.preheader
%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
...
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 1600
br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge
into this:
for.body: ; preds = %for.body, %for.body.preheader
%indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body ]
...
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 1600
br i1 %exitcond, label %while.cond.loopexit, label %for.body
and this seems pretty trivial when %for.body.backedge is completely empty
(as in this case), but if it had non-PHI instructions in it on which the
existing PHIs in %for.body depended, then maybe this becomes less trivial?
Although, based on what I said below, the case with instructions there is
one we can't currently vectorize anyway, for more fundamental reasons.
-Hal
Also worth pointing out that SimplifyCFG does this exact transformation.
The vectorizer never sees it, however, because LoopSimplify prefers this
form with the separate backedge block that the vectorizer can't handle.
-Hal
2) Teach the vectorizer to vectorize without this constraint by instead
establishing the actual invariant it cares about.
It really cares that there's no code between the latch and the exit,
because such code is not really part of the loop (it only runs once), and
no code between the exit and the latch (because such code runs one fewer
iteration than the code before the exit). At least nothing with side
effects, I presume.
-Hal
Maybe there is another strategy?
-Hal
That seems both counter-intuitive and unlikely to be a useful goal. We
simply don't optimize volatile operations well in *any* part of the
optimizer, and I'm not sure why we need to start trying to fix that.
This seems like an irreparably broken benchmark, but perhaps there is a
motivation I don't yet see.
Assuming that sufficient motivation arises to try to fix this, see my
comments below:
(1) Loop unswitching generates several empty placeholder BBs containing
only PHI nodes after separating out a shorter path with no inner-loop
execution from the standard path.
(2) The jump threading and simplify-the-CFG passes independently call
TryToSimplifyUnconditionalBranchFromEmptyBlock() in
Transforms/Utils/Local.cpp to get rid of almost-empty BBs.
(3) TryToSimplifyUnconditionalBranchFromEmptyBlock() eliminates the
placeholder BBs after loop unswitching and merges them into
subsequent blocks, including the header of the inner loop. Before
eliminating a block, the function checks whether the block is a loop
header by looking at its PHI nodes so that it can be preserved, but
this test fails for loops with a volatile iteration variable.
Why does this fail for a volatile iteration variable but not for a
non-volatile one? I think understanding that will be key to
understanding how it should be fixed.
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Gerolf Hoflehner
2015-Aug-03 22:14 UTC
[LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
I see a more fundamental problem, and perhaps this example can serve as a stepping stone towards a solution. There is a desired property: in this case, the loop is vectorizable. There are N compiler transformations. These transformations must either establish the property or keep it invariant, but never destroy it. If there is agreement on this, then the first step is to have the "is vectorizable" analysis runnable after every transformation and report violations in detail, likely under a flag. Then comparing a set of loops that are not vectorizable with clang but are with other compilers (gcc, icc, …) should shape precise ideas on normal forms for the vectorizer and on general improvements to transformations to keep/establish the invariant/desired property. I'm worried that the one-example-at-a-time approach over time obfuscates matters within transformations, resulting in less maintainable code.

Gerolf

> On Aug 3, 2015, at 2:33 PM, Hyojin Sung <hsung at us.ibm.com> wrote:
>
> Hi,
>
> I discussed the issue offline with Hal, and would like to clarify what exactly is going on and what the trade-offs for different solutions are, and to ask for more feedback on my proposed solution (http://reviews.llvm.org/D11728). I will use the example from Hal's post:
>
> void foo2(float * restrict x, float * restrict y, float * restrict z) {
> for (volatile int i = 0; i < 1000; ++i) {
> for (int j = 0; j < 1600; ++j) {
> x[j] = y[j] + z[j];
> }
> }
> }
>
> IR after the first loop simplify: A preheader is created.
>
> ; Function Attrs: nounwind
> define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
> entry:
> %i = alloca i32, align 4
> tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
> tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
> tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
> %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
> call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
> tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
> store volatile i32 0, i32* %i, align 4, !dbg !29
> br label %for.cond, !dbg !30
>
> for.cond: ; preds = %for.cond.cleanup.3, %entry
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !31
> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !34
> br i1 %cmp, label %for.cond.1.preheader, label %for.cond.cleanup, !dbg !35
>
> for.cond.1.preheader: ; preds = %for.cond
> br label %for.cond.1, !dbg !36
>
> for.cond.cleanup: ; preds = %for.cond
> call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
> ret void, !dbg !38
>
> for.cond.1: ; preds = %for.cond.1.preheader, %for.body.4
> %j.0 = phi i32 [ %inc, %for.body.4 ], [ 0, %for.cond.1.preheader ]
> %cmp2 = icmp slt i32 %j.0, 1600, !dbg !36
> br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !39
>
> for.cond.cleanup.3: ; preds = %for.cond.1
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !40
> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !40
> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> store volatile i32 %inc10, i32* %i, align 4, !dbg !40
> br label %for.cond, !dbg !41
>
> for.body.4: ; preds = %for.cond.1
> %idxprom = sext i32 %j.0 to i64, !dbg !42
> %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !42
> %0 = load float, float* %arrayidx, align 4, !dbg !42, !tbaa !44
> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !48
> %1 = load float, float* %arrayidx6, align 4, !dbg !48, !tbaa !44
> %add = fadd float %0, %1, !dbg !49
> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !50
> store float %add, float* %arrayidx8, align 4, !dbg !51, !tbaa !44
> %inc = add nsw i32 %j.0, 1, !dbg !52
> tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !53
> br label %for.cond.1, !dbg !54
> }
>
> IR after loop rotation: After loop rotation, a rotated preheader (for.cond.1.preheader.lr.ph) is created. A test for (i < 1000) is added at the end of the "entry" block. If true, control jumps unconditionally to "for.body.4" through "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader". You can see that these two blocks ("for.cond.1.preheader.lr.ph" and "for.cond.1.preheader") are practically empty, and they will get eliminated later by Jump Threading and/or Simplify-the-CFG. *IF* the outer loop has a non-volatile induction variable, the loop will not be rotated in the first place as "for.cond.1.preheader" has a PHI node for "i", and these blocks will not be eliminated.
>
> ; Function Attrs: nounwind
> define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
> entry:
> %i = alloca i32, align 4
> tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
> tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
> tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
> %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
> call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
> tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
> store volatile i32 0, i32* %i, align 4, !dbg !29
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
> %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
> br i1 %cmp.22, label %for.cond.1.preheader.lr.ph, label %for.cond.cleanup, !dbg !34
>
> for.cond.1.preheader.lr.ph: ; preds = %entry
> br label %for.cond.1.preheader, !dbg !34
>
> for.cond.1.preheader: ; preds = %for.cond.1.preheader.lr.ph, %for.cond.cleanup.3
> br label %for.body.4, !dbg !35
>
> for.cond.for.cond.cleanup_crit_edge: ; preds = %for.cond.cleanup.3
> br label %for.cond.cleanup, !dbg !34
>
> for.cond.cleanup: ; preds = %for.cond.for.cond.cleanup_crit_edge, %entry
> call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
> ret void, !dbg !36
>
> for.cond.cleanup.3: ; preds = %for.body.4
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !37
> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !37
> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> store volatile i32 %inc10, i32* %i, align 4, !dbg !37
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
> br i1 %cmp, label %for.cond.1.preheader, label %for.cond.for.cond.cleanup_crit_edge, !dbg !34
>
> for.body.4: ; preds = %for.cond.1.preheader, %for.body.4
> %j.020 = phi i32 [ 0, %for.cond.1.preheader ], [ %inc, %for.body.4 ]
> %idxprom = sext i32 %j.020 to i64, !dbg !38
> %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !38
> %0 = load float, float* %arrayidx, align 4, !dbg !38, !tbaa !41
> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !45
> %1 = load float, float* %arrayidx6, align 4, !dbg !45, !tbaa !41
> %add = fadd float %0, %1, !dbg !46
> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !47
> store float %add, float* %arrayidx8, align 4, !dbg !48, !tbaa !41
> %inc = add nsw i32 %j.020, 1, !dbg !49
> tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !50
> %cmp2 = icmp slt i32 %inc, 1600, !dbg !51
> br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !35
>
> After Jump Threading: "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader" are merged into "for.body.4" by TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp. Now "for.body.4" has three incoming edges (two backedges).
>
> ; Function Attrs: nounwind
> define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
> entry:
> %i = alloca i32, align 4
> tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
> tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
> tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
> %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
> call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
> tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
> store volatile i32 0, i32* %i, align 4, !dbg !29
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
> %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
> br i1 %cmp.22, label %for.body.4, label %for.cond.cleanup, !dbg !34
>
> for.cond.cleanup: ; preds = %for.cond.cleanup.3, %entry
> call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
> ret void, !dbg !35
>
> for.cond.cleanup.3: ; preds = %for.body.4
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !36
> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !36
> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> store volatile i32 %inc10, i32* %i, align 4, !dbg !36
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
> br i1 %cmp, label %for.body.4, label %for.cond.cleanup, !dbg !34
>
> for.body.4: ; preds = %for.cond.cleanup.3, %entry, %for.body.4
> %indvars.iv = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %entry ], [ 0, %for.cond.cleanup.3 ]
> %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !37
> %0 = load float, float* %arrayidx, align 4, !dbg !37, !tbaa !40
> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44
> %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
> %add = fadd float %0, %1, !dbg !45
> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46
> store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
> %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
> br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4, !dbg !48
>
> After another loop simplify: Loop simplify tries to separate out the nested loops but fails to do so with this loop since it does not have a PHI node for the outer loop variable. Instead, it creates a backedge block.
>
> for.cond.cleanup.3: ; preds = %for.body.4
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !39
> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !39
> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> store volatile i32 %inc10, i32* %i, align 4, !dbg !39
> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
> br i1 %cmp, label %for.body.4.backedge, label %for.cond.cleanup.loopexit, !dbg !34
>
> for.body.4: ; preds = %for.body.4.backedge, %for.body.4.preheader
> %indvars.iv = phi i64 [ 0, %for.body.4.preheader ], [ %indvars.iv.be, %for.body.4.backedge ]
> %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !35
> %0 = load float, float* %arrayidx, align 4, !dbg !35, !tbaa !40
> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44
> %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
> %add = fadd float %0, %1, !dbg !45
> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46
> store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
> %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
> br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4.backedge, !dbg !48
>
> for.body.4.backedge: ; preds = %for.body.4, %for.cond.cleanup.3
> %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %for.cond.cleanup.3 ]
> br label %for.body.4
>
> The LLVM loop vectorizer refuses to vectorize any loop for which the loop latch (for.body.4.backedge) is different from the loop exiting block (for.cond.cleanup.3). With this test, the loop vectorizer can assume that all instructions in the loop are executed the same number of times.
>
> I believe a fix is in order one way or another because the example is simple and common enough, and it is vectorized by other compilers. We may approach it by either (1) preventing loops from being collapsed in the first place or (2) teaching the loop vectorizer to handle collapsed loops. For (2), we may need to allow the loop vectorizer to forego that assumption and handle the loop as it is. The assumption seems fundamental to many of the vectorization algorithms, so this will require extensive updates or may end up reverting the loop back to a properly nested form. The downside of (1) is that it may slow down common optimization passes that are repeatedly executed before vectorization.
>
> My patch (http://reviews.llvm.org/D11728) is a prototype fix for (1) that modifies Jump Threading and Simplify-the-CFG to not eliminate an empty loop header BB even when the loop does not have a PHI node for its induction variable. The details can be found at http://reviews.llvm.org/D11728. I would welcome and appreciate any comments or feedback.
>
> Best,
> Hyojin
Gerolf Hoflehner via llvm-dev
2015-Aug-06 04:11 UTC
[llvm-dev] [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
> On Aug 5, 2015, at 9:08 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote: > > Ok, assuming a verifier/checker was in place the information could look like: > > ... > LV: Not vectorizing: Cannot prove legality. > LV: Legal to vectorize. — after loop rotation > LV: Legal to vectorize. > ... > LV: Legal to vectorize. > LV: Not vectorizing: Cannot prove legality. — after Jump Threading (later also — after CFG Simplify) > > Note that in your example the second round of loop rotation fails making a loop vectorizable (from the comment that is its purpose, though!). It might be more appropriate to generalize that approach to any transformation that makes a loop vectorizable (not just rotation) rather than tweaking jump threading + simplification. > > I’m convinced there are more cases like yours and the following framework would give a systematic way to handle them: > > a) a checker for is_vectorizable (from refactoring the legal class in the loop vectorizer - but properly, not hacked like my prototype …) > b) invoke checker optionally. The checker can be used for testing, analysis and - independently - as a service for loop transformations (eg. if the loop is vectorizable already, perhaps a transformation is not needed or vice versa). I’m not sure about the architecture yet. Perhaps the pass manager could run a checker automatically after every pass when an option is set. Adding extra calls after every pass is too clumsy. > c) fix violators if appropriate and/or add transformations that morph a loop into a vectorizable form when possible (especially when it has been in vectorizable shape at some point). > > > -Gerolf > > >> On Aug 4, 2015, at 8:04 PM, Gerolf Hoflehner <ghoflehner at apple.com <mailto:ghoflehner at apple.com>> wrote: >> >> The approach would be to run the legal check before + after transformations on demand. For loops that “should” be vectorized it would give the means to isolate the transformation that introduces a barrier. 
I’ll see if I can get a dirty prototype to illustrate this on your test case. It could give a warning/remark when the code was (loop-)vectorizable at some point but no longer is when the vectorizer actually kicks in, and it could speed up diagnosis when comparing good/bad cases (like in your instance w/ and w/o the volatile modifier). >> >> Gerolf >> >>> On Aug 4, 2015, at 8:40 AM, Hyojin Sung <hsung at us.ibm.com <mailto:hsung at us.ibm.com>> wrote: >>> >>> Hi Gerolf, >>> >>> Thanks for the comment. Yes, I agree that compiler transformation as a whole should not render vectorizable loops unvectorizable. However, some passes may temporarily make a loop unvectorizable (e.g., by aggressive GVN or DCE/DBE), and later passes may then revert the effect. Trying to keep the invariants after each pass may prevent useful optimizations. Regarding your concern about the example-based approach, the LoopVectorizationLegality class gives a good idea about what such invariants would be, and they mostly amount to checks for canonical loop structure. I think that there are only a handful of transformation passes that may disrupt loop structures, mainly through aggressively eliminating BBs. Focusing on such transformations may be sufficient to prevent prior transformations from destroying vectorizable loops. I'd be glad to hear more of your thoughts. 
>>> >>> Best, >>> Hyojin >>> >>> >>> From: Gerolf Hoflehner <ghoflehner at apple.com <mailto:ghoflehner at apple.com>> >>> To: Hyojin Sung/Watson/IBM at IBMUS >>> Cc: Hal Finkel <hfinkel at anl.gov <mailto:hfinkel at anl.gov>>, llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at cs.uiuc.edu>, llvmdev at cs.uiuc.edu <mailto:llvmdev at cs.uiuc.edu> >>> Date: 08/03/2015 06:14 PM >>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable >>> >>> >>> >>> I see a more fundamental problem, and perhaps this example can serve as a stepping stone towards a solution. >>> >>> There is a desired property: in this case, the loop is vectorizable. >>> There are N compiler transformations. These transformations must either establish the property or keep it invariant, but never destroy it. >>> >>> If there is agreement on this, then the first step is to have the analysis of ‘is vectorizable’ runnable after every transformation and report violations in detail, likely under a flag. Then comparing a set of loops not vectorizable with clang but vectorizable with other compilers (gcc, icc, …) should shape precise ideas on normal forms for the vectorizer and general improvements to transformations to keep/establish the invariant/desired property. >>> >>> I’m worried that the one-example-at-a-time approach over time obfuscates matters within transformations, resulting in less maintainable code. 
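The invariant-checking idea above ("run the 'is vectorizable' analysis after every transformation and report violations") can be sketched with a small toy model. This is only an illustration, not LLVM code: the struct, the pass names, and the predicate are stand-ins for a refactored LoopVectorizationLegality running over real IR.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for the structural loop properties a refactored
// LoopVectorizationLegality would inspect on real IR.
struct LoopShape {
  bool hasPreheader = true;
  bool latchIsExitingBlock = true; // bottom-tested canonical form
};

// The "is vectorizable" checker, reduced here to two structural bits.
bool isVectorizable(const LoopShape &L) {
  return L.hasPreheader && L.latchIsExitingBlock;
}

struct Pass {
  std::string name;
  std::function<void(LoopShape &)> run;
};

// Invoke the checker after every pass; return the name of the first pass
// that turns a vectorizable loop into a non-vectorizable one, or "" if
// the invariant held throughout the pipeline.
std::string firstViolator(LoopShape &L, const std::vector<Pass> &pipeline) {
  for (const Pass &P : pipeline) {
    bool before = isVectorizable(L);
    P.run(L);
    if (before && !isVectorizable(L))
      return P.name; // this pass destroyed the desired property
  }
  return "";
}
```

For the scenario in this thread, a pipeline of {loop-rotate, jump-threading} where jump threading collapses the latch/exiting-block structure would report jump-threading as the first violator, matching the kind of verifier log Gerolf sketches.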
>>> >>> Gerolf >>> >>> >>> On Aug 3, 2015, at 2:33 PM, Hyojin Sung <hsung at us.ibm.com <mailto:hsung at us.ibm.com>> wrote: >>> Hi, >>> >>> I discussed the issue offline with Hal, and would like to clarify what is exactly going on, what are trade-offs for different solutions, and ask for more feedback on my proposed solution (http://reviews.llvm.org/D11728 <https://urldefense.proofpoint.com/v2/url?u=http-3A__reviews.llvm.org_D11728&d=AwMGaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=N2de3NabFW69oQ6in3ydIBGGRddU65HAGvjpTRC7uvA&s=q8HlQbw3XXnWU7Og4HMcdBjGw-Y2XDB7h9xSiuRtZFs&e=>). I will use the example from Hal's post: >>> >>> void foo2(float * restrict x, float * restrict y, float * restrict z) { >>> for (volatile int i = 0; i < 1000; ++i) { >>> for (int j = 0; j < 1600; ++j) { >>> x[j] = y[j] + z[j]; >>> } >>> } >>> } >>> >>> IR after the first loop simplify: A preheader is created. >>> >>> ; Function Attrs: nounwind >>> define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 { >>> entry: >>> %i = alloca i32, align 4 >>> tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26 >>> tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27 >>> tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28 >>> %i.0.i.0..sroa_cast = bitcast i32* %i to i8* >>> call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast) >>> tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29 >>> store volatile i32 0, i32* %i, align 4, !dbg !29 >>> br label %for.cond, !dbg !30 >>> >>> for.cond: ; preds = %for.cond.cleanup.3, %entry >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0. 
= load volatile i32, i32* %i, align 4, !dbg !31 >>> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !34 >>> br i1 %cmp, label %for.cond.1.preheader, label %for.cond.cleanup, !dbg !35 >>> >>> for.cond.1.preheader: ; preds = %for.cond >>> br label %for.cond.1, !dbg !36 >>> >>> for.cond.cleanup: ; preds = %for.cond >>> call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast) >>> ret void, !dbg !38 >>> >>> for.cond.1: ; preds = %for.cond.1.preheader, %for.body.4 >>> %j.0 = phi i32 [ %inc, %for.body.4 ], [ 0, %for.cond.1.preheader ] >>> %cmp2 = icmp slt i32 %j.0, 1600, !dbg !36 >>> br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !39 >>> >>> for.cond.cleanup.3: ; preds = %for.cond.1 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !40 >>> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !40 >>> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29 >>> store volatile i32 %inc10, i32* %i, align 4, !dbg !40 >>> br label %for.cond, !dbg !41 >>> >>> for.body.4: ; preds = %for.cond.1 >>> %idxprom = sext i32 %j.0 to i64, !dbg !42 >>> %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !42 >>> %0 = load float, float* %arrayidx, align 4, !dbg !42, !tbaa !44 >>> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !48 >>> %1 = load float, float* %arrayidx6, align 4, !dbg !48, !tbaa !44 >>> %add = fadd float %0, %1, !dbg !49 >>> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !50 >>> store float %add, float* %arrayidx8, align 4, !dbg !51, !tbaa !44 >>> %inc = add nsw i32 %j.0, 1, !dbg !52 >>> tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !53 >>> br label %for.cond.1, !dbg !54 >>> } >>> >>> IR after loop rotation: After loop rotation, a rotated preheader (for.cond.1.preheader.lr.ph) is created. 
A test for (i < 1000) is added at the end of "entry" block. If true, the control jumps unconditionally to "for.body.4" through "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader". You can see that these two blocks ("for.cond.1.preheader.lr.ph" and "for.cond.1.preheader") are practically empty, and they will get eliminated later by Jump Threading and/or Simplify-the-CFG. *IF* the outer loop has a non-volatile induction variable, the loop will not be rotated in the first place as "for.cond.1.preheader" has a PHI node for "i", and these blocks will not be eliminated. >>> >>> ; Function Attrs: nounwind >>> define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 { >>> entry: >>> %i = alloca i32, align 4 >>> tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26 >>> tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27 >>> tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28 >>> %i.0.i.0..sroa_cast = bitcast i32* %i to i8* >>> call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast) >>> tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29 >>> store volatile i32 0, i32* %i, align 4, !dbg !29 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30 >>> %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33 >>> br i1 %cmp.22, label %for.cond.1.preheader.lr.ph, label %for.cond.cleanup, !dbg !34 >>> >>> for.cond.1.preheader.lr.ph: ; preds = %entry >>> br label %for.cond.1.preheader, !dbg !34 >>> >>> for.cond.1.preheader: ; preds = %for.cond.1.preheader.lr.ph, %for.cond.cleanup.3 >>> br label %for.body.4, !dbg !35 >>> >>> for.cond.for.cond.cleanup_crit_edge: ; preds = %for.cond.cleanup.3 >>> br label %for.cond.cleanup, !dbg !34 >>> >>> 
for.cond.cleanup: ; preds = %for.cond.for.cond.cleanup_crit_edge, %entry >>> call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast) >>> ret void, !dbg !36 >>> >>> for.cond.cleanup.3: ; preds = %for.body.4 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !37 >>> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !37 >>> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29 >>> store volatile i32 %inc10, i32* %i, align 4, !dbg !37 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30 >>> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33 >>> br i1 %cmp, label %for.cond.1.preheader, label %for.cond.for.cond.cleanup_crit_edge, !dbg !34 >>> >>> for.body.4: ; preds = %for.cond.1.preheader, %for.body.4 >>> %j.020 = phi i32 [ 0, %for.cond.1.preheader ], [ %inc, %for.body.4 ] >>> %idxprom = sext i32 %j.020 to i64, !dbg !38 >>> %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !38 >>> %0 = load float, float* %arrayidx, align 4, !dbg !38, !tbaa !41 >>> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !45 >>> %1 = load float, float* %arrayidx6, align 4, !dbg !45, !tbaa !41 >>> %add = fadd float %0, %1, !dbg !46 >>> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !47 >>> store float %add, float* %arrayidx8, align 4, !dbg !48, !tbaa !41 >>> %inc = add nsw i32 %j.020, 1, !dbg !49 >>> tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !50 >>> %cmp2 = icmp slt i32 %inc, 1600, !dbg !51 >>> br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !35 >>> >>> >>> After Jump Threading: "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader" are merged into "for.body.4" by TryToSimplifyUnconditionalBranchFromEmptyBlock() in 
Transforms/Utils/Local.cpp. Now "for.body.4" has three incoming edges (two backedges). >>> >>> ; Function Attrs: nounwind >>> define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 { >>> entry: >>> %i = alloca i32, align 4 >>> tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26 >>> tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27 >>> tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28 >>> %i.0.i.0..sroa_cast = bitcast i32* %i to i8* >>> call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast) >>> tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29 >>> store volatile i32 0, i32* %i, align 4, !dbg !29 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30 >>> %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33 >>> br i1 %cmp.22, label %for.body.4, label %for.cond.cleanup, !dbg !34 >>> >>> for.cond.cleanup: ; preds = %for.cond.cleanup.3, %entry >>> call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast) >>> ret void, !dbg !35 >>> >>> for.cond.cleanup.3: ; preds = %for.body.4 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !36 >>> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !36 >>> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29 >>> store volatile i32 %inc10, i32* %i, align 4, !dbg !36 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0. 
= load volatile i32, i32* %i, align 4, !dbg !30 >>> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33 >>> br i1 %cmp, label %for.body.4, label %for.cond.cleanup, !dbg !34 >>> >>> for.body.4: ; preds = %for.cond.cleanup.3, %entry, %for.body.4 >>> %indvars.iv = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %entry ], [ 0, %for.cond.cleanup.3 ] >>> %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !37 >>> %0 = load float, float* %arrayidx, align 4, !dbg !37, !tbaa !40 >>> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44 >>> %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40 >>> %add = fadd float %0, %1, !dbg !45 >>> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46 >>> store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40 >>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48 >>> %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48 >>> br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4, !dbg !48 >>> >>> >>> After another loop simplify: Loop simplify tries to separate out nested loops but fails to do so with this loop since it does not have a PHI node for the outer loop variable. Instead, it creates a backedge block. >>> >>> for.cond.cleanup.3: ; preds = %for.body.4 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !39 >>> %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !39 >>> tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29 >>> store volatile i32 %inc10, i32* %i, align 4, !dbg !39 >>> tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 >>> %i.0.i.0. 
= load volatile i32, i32* %i, align 4, !dbg !30 >>> %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33 >>> br i1 %cmp, label %for.body.4.backedge, label %for.cond.cleanup.loopexit, !dbg !34 >>> >>> for.body.4: ; preds = %for.body.4.backedge, %for.body.4.preheader >>> %indvars.iv = phi i64 [ 0, %for.body.4.preheader ], [ %indvars.iv.be, %for.body.4.backedge ] >>> %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !35 >>> %0 = load float, float* %arrayidx, align 4, !dbg !35, !tbaa !40 >>> %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44 >>> %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40 >>> %add = fadd float %0, %1, !dbg !45 >>> %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46 >>> store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40 >>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48 >>> %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48 >>> br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4.backedge, !dbg !48 >>> >>> for.body.4.backedge: ; preds = %for.body.4, %for.cond.cleanup.3 >>> %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %for.cond.cleanup.3 ] >>> br label %for.body.4 >>> >>> >>> The LLVM loop vectorizer refuses to vectorize any loop for which the loop latch (for.body.4.backedge) is different from the loop exiting block (for.cond.cleanup.3). With that test, the loop vectorizer can assume that all instructions in the loop are executed the same number of times. >>> >>> I believe a fix is in order in one way or another because the example is simple and common enough and vectorized by other compilers. We may approach it by either (1) preventing loops from being collapsed in the first place or (2) teaching the loop vectorizer to handle collapsed loops. For (2), we may need to allow the loop vectorizer to forego the assumption and handle the loop as it is. 
The assumption seems fundamental to many of the vectorization algorithms, so it will require extensive updates or may end up reverting the loop back to a properly nested form. The downside of (1) is that it may slow down common optimization passes that are repeatedly executed before vectorization. >>> >>> My patch (http://reviews.llvm.org/D11728) is a prototype fix for (1) that modifies Jump Threading and Simplify-the-CFG to not eliminate an empty loop header BB even when the loop does not have a PHI node for its induction variable. The details can be found at http://reviews.llvm.org/D11728. I would welcome and appreciate any comments or feedback. 
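The intent of that patch can be illustrated with a toy model: before folding an empty block with an unconditional branch, check whether the block is the target of a backedge, rather than relying on its PHI nodes. This is not the actual Local.cpp code; the block names and the reachability-based header test below are simplifications invented for the sketch (real LLVM would consult LoopInfo/dominance).

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy CFG: block name -> successor names. Stand-in for LLVM BasicBlocks.
using CFG = std::map<std::string, std::vector<std::string>>;

// Can we walk from `From` to `To` along CFG edges?
bool reaches(const CFG &G, const std::string &From, const std::string &To) {
  std::set<std::string> Seen;
  std::vector<std::string> Work{From};
  while (!Work.empty()) {
    std::string B = Work.back();
    Work.pop_back();
    if (!Seen.insert(B).second)
      continue;
    if (B == To)
      return true;
    auto It = G.find(B);
    if (It != G.end())
      for (const std::string &S : It->second)
        Work.push_back(S);
  }
  return false;
}

// A block is (approximately) a loop header if some successor can reach it
// again, i.e. a backedge-like path exists. This is deliberately coarse;
// it marks every block inside a cycle, which is conservative but safe
// for the purpose of the sketch.
bool isLoopHeader(const CFG &G, const std::string &BB) {
  auto It = G.find(BB);
  if (It == G.end())
    return false;
  for (const std::string &S : It->second)
    if (reaches(G, S, BB))
      return true;
  return false;
}

// D11728-style guard: refuse to fold an empty block into its successor if
// doing so would erase a loop header, even when that header carries no
// PHI for the induction variable (the volatile case).
bool mayMergeEmptyBlock(const CFG &G, const std::string &BB) {
  return !isLoopHeader(G, BB);
}
```

With the blocks from the example above, the empty "for.cond.1.preheader" is the target of the outer backedge from "for.cond.cleanup.3", so this guard would keep it, while a genuinely dead empty forwarding block would still be merged.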
>>> >>> Best, >>> Hyojin >>> >>> >>> <graycol.gif>Hal Finkel ---07/16/2015 03:19:24 AM-------- Original Message ----- > From: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> >>> >>> From: Hal Finkel <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> >>> To: Chandler Carruth <chandlerc at google.com <mailto:chandlerc at google.com>> >>> Cc: llvmdev at cs.uiuc.edu <mailto:llvmdev at cs.uiuc.edu> >>> Date: 07/16/2015 03:19 AM >>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable >>> Sent by: llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at cs.uiuc.edu> >>> >>> >>> >>> >>> >>> >>> From: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> >>> To: "Chandler Carruth" <chandlerc at google.com <mailto:chandlerc at google.com>> >>> Cc: llvmdev at cs.uiuc.edu <mailto:llvmdev at cs.uiuc.edu> >>> Sent: Thursday, July 16, 2015 1:58:02 AM >>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable >>> >>> >>> >>> From: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> >>> To: "Chandler Carruth" <chandlerc at google.com <mailto:chandlerc at google.com>> >>> Cc: llvmdev at cs.uiuc.edu <mailto:llvmdev at cs.uiuc.edu> >>> Sent: Thursday, July 16, 2015 1:46:42 AM >>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable >>> >>> >>> From: "Chandler Carruth" <chandlerc at google.com <mailto:chandlerc at google.com>> >>> To: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> >>> Cc: "Hyojin Sung" <hsung at us.ibm.com <mailto:hsung at us.ibm.com>>, llvmdev at cs.uiuc.edu <mailto:llvmdev at cs.uiuc.edu> >>> Sent: Thursday, July 16, 2015 1:06:03 AM >>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable >>> >>> On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel <hfinkel at anl.gov <mailto:hfinkel at anl.gov>> wrote: >>> >>> From: "Chandler Carruth" 
<chandlerc at google.com <mailto:chandlerc at google.com>> >>> To: "Hyojin Sung" <hsung at us.ibm.com <mailto:hsung at us.ibm.com>>, llvmdev at cs.uiuc.edu <mailto:llvmdev at cs.uiuc.edu> >>> Sent: Wednesday, July 15, 2015 7:34:54 PM >>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable >>> >>> >>> On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung <hsung at us.ibm.com <mailto:hsung at us.ibm.com>> wrote: >>> Hi all, >>> >>> I would like to propose an improvement of the “almost dead” block elimination in Transforms/Local.cpp so that it will preserve the canonical loop form for loops with a volatile iteration variable. >>> >>> *** Problem statement >>> Nested loops in LCALS Subset B (https://codesign.llnl.gov/LCALS.php) are not vectorized with LLVM -O3 because the LLVM loop vectorizer fails the test of whether the loop latch and the exiting block of a loop are the same. The loops are vectorizable, and get vectorized with LLVM -O2 >>> >>> I would be interested to know why -O2 succeeds here. >>> >>> and also with other commercial compilers (icc, xlc). >>> >>> *** Details >>> These loops ended up with different loop latch and exiting blocks after a series of optimizations including loop unswitching, jump threading, simplify-the-CFG, and loop simplify. The fundamental problem here is that the above optimizations cannot recognize a loop with a volatile iteration variable and do not preserve its canonical loop structure. >>> >>> Ok, meta-level question first: >>> >>> Why do we care about performance of loops with a volatile iteration variable? >>> I don't think we do; however, I think that misses the point. 
In this case, the volatile iteration variable is just an easy way to expose this problem that we have with nested loop canonicalization and the vectorizer. To be specific: >>> >>> This we vectorize just fine: >>> >>> void foo1(float * restrict x, float * restrict y, float * restrict z) { >>> for (int i = 0; i < 1000; ++i) { >>> for (int j = 0; j < 1600; ++j) { >>> x[j] = y[j] + z[j]; >>> } >>> } >>> } >>> >>> And, indeed, this we don't (the only change is adding volatile on i): >>> >>> void foo2(float * restrict x, float * restrict y, float * restrict z) { >>> for (volatile int i = 0; i < 1000; ++i) { >>> for (int j = 0; j < 1600; ++j) { >>> x[j] = y[j] + z[j]; >>> } >>> } >>> } >>> >>> However, this we don't either, and that's a big problem: >>> >>> int done(float *x, float *y, float *z); >>> void foo3(float * restrict x, float * restrict y, float * restrict z) { >>> while (!done(x, y, z)) { >>> for (int j = 0; j < 1600; ++j) { >>> x[j] = y[j] + z[j]; >>> } >>> } >>> } >>> >>> And the underlying reason is the same. 
The IR at the point in time when the loop vectorizer runs looks like this: >>> >>> define void @foo3(float* noalias %x, float* noalias %y, float* noalias %z) #0 { >>> entry: >>> %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1 >>> %lnot.15 = icmp eq i32 %call.14, 0 >>> br i1 %lnot.15, label %for.body.preheader, label %while.end >>> >>> for.body.preheader: ; preds = %entry >>> br label %for.body >>> >>> while.cond.loopexit: ; preds = %for.body >>> %call = tail call i32 @done(float* %x, float* %y, float* %z) #1 >>> %lnot = icmp eq i32 %call, 0 >>> br i1 %lnot, label %for.body.backedge, label %while.end.loopexit >>> >>> for.body: ; preds = %for.body.backedge, %for.body.preheader >>> %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be <https://urldefense.proofpoint.com/v2/url?u=http-3A__indvars.iv.be&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=OE36IJuCyZctA3s37X7WGup5iAkOM11AM2CnhP5TBUk&s=JDHk2pO2X8ONZI6T-EtaJWjtJLzbBdKTvULQ6Te-xGI&e=>, %for.body.backedge ] >>> ... >>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 >>> %exitcond = icmp eq i64 %indvars.iv.next, 1600 >>> br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge >>> >>> for.body.backedge: ; preds = %for.body, %while.cond.loopexit >>> %indvars.iv.be <https://urldefense.proofpoint.com/v2/url?u=http-3A__indvars.iv.be&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=OE36IJuCyZctA3s37X7WGup5iAkOM11AM2CnhP5TBUk&s=JDHk2pO2X8ONZI6T-EtaJWjtJLzbBdKTvULQ6Te-xGI&e=> = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ] >>> br label %for.body >>> >>> while.end.loopexit: ; preds = %while.cond.loopexit >>> br label %while.end >>> >>> while.end: ; preds = %while.end.loopexit, %entry >>> ret void >>> } >>> >>> and we can currently only vectorize loops where the loop latch is also the loop's exiting block. 
In this case, as in the case with the volatile variable, vectorization is blocked by this constraint (here the backedge is from the terminator of %for.body.backedge, but the loop exiting block is %for.body). The check in the vectorizer is explicit: >>> >>> // We only handle bottom-tested loops, i.e. loop in which the condition is >>> // checked at the end of each iteration. With that we can assume that all >>> // instructions in the loop are executed the same number of times. >>> if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) { >>> ... >>> >>> Thanks for the detailed explanation. This makes much more sense why we need to handle it. I think its much better to look at nested loops of this form than anything to do with volatile -- the latter is too prone to other random optimizations turning off. >>> >>> Regarding this problem, it would be interesting to know based on this explanation what the desired fix would be. I see at least these options: >>> >>> 1) Canonicalize loops harder to make them look the way the vectorizer wants. If this can be done without causing significant problems, it seems likely the best approach. >>> I agree. In this case, we could certainly fold the trivial %for.body.backedge block into %for.body, meaning transforming this: >>> >>> for.body.backedge: ; preds = %while.cond.loopexit, %for.body >>> %indvars.iv.be <http://indvars.iv.be/> = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ] >>> br label %for.body >>> >>> for.body: ; preds = %for.body.backedge, %for.body.preheader >>> %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be <http://indvars.iv.be/>, %for.body.backedge ] >>> ... 
>>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 >>> %exitcond = icmp eq i64 %indvars.iv.next, 1600 >>> br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge >>> >>> into this: >>> >>> for.body: ; preds = %for.body.backedge, %for.body.preheader >>> %indvars.iv.be <http://indvars.iv.be/> = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ] >>> %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be <http://indvars.iv.be/>, %for.body ] >>> ... >>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 >>> %exitcond = icmp eq i64 %indvars.iv.next, 1600 >>> br i1 %exitcond, label %while.cond.loopexit, label %for.body >>> >>> and this seems pretty trivial when %for.body.backedge is completely empty (as in this case), but if it had non-PHI instructions in it on which the existing PHIs in %for.body depended, then maybe this becomes less trivial? >>> Although, based on what I said below, the case with instructions there we can't currently vectorize anyway for more-fundamental reasons. >>> >>> -Hal >>> >>> Also worth pointing out that SimplifyCFG does this exact transformation. The vectorizer never sees it, however, because LoopSimplify prefers this form with the separate backedge block that the vectorizer can't handle. >>> >>> -Hal >>> >>> 2) Teach the vectorizer to vectorize without this constraint by instead establishing the actual invariant it cares about. >>> It really cares that there's no code that comes in between the latch and the exit, because such code is not really part of the loop (it only runs once), or code in between the exit and the latch (because such code runs in one fewer iterations than the code before the exit). At least nothing with side effects I presume. >>> >>> -Hal >>> >>> Maybe there is another strategy? >>> >>> >>> >>> -Hal >>> That seems both counter-intuitive and unlikely to be a useful goal. 
We simply don't optimize volatile operations well in *any* part of the optimizer, and I'm not sure why we need to start trying to fix that. This seems like an irreparably broken benchmark, but perhaps there is a motivation I don't yet see. >>> >>> >>> Assuming that sufficient motivation arises to try to fix this, see my comments below: >>> >>> >>> >>> (1) Loop unswitching generates several empty placeholder BBs with only PHI nodes after separating out a shorter path with no inner loop execution from a standard path. >>> >>> (2) The jump threading and simplify-the-CFG passes independently call TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp to get rid of almost empty BBs. >>> >>> (3) TryToSimplifyUnconditionalBranchFromEmptyBlock() eliminates the placeholder BBs after loop unswitching and merges them into subsequent blocks, including the header of the inner loop. Before eliminating the blocks, the function checks if the block is a loop header by looking at its PHI nodes so that it can be saved, but the test fails for loops with a volatile iteration variable. >>> >>> Why does this fail for a volatile iteration variable but not for a non-volatile one? I think understanding that will be key to understanding how it should be fixed. 
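The answer hinted at elsewhere in this thread can be put in toy-model form: with a non-volatile i, mem2reg promotes the induction variable into a PHI node in the outer-loop header, so the PHI-based heuristic spares the block; with volatile i, the variable stays in its alloca, the header contains only a volatile load and no PHI, and the heuristic no longer recognizes it as a loop header. The struct and function below are invented for the sketch; the real check in Local.cpp walks actual PHI nodes and predecessors.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a basic block: just the variables that have PHI nodes
// in it. With "int i", the promoted induction variable appears as a PHI
// in the outer loop header. With "volatile int i", i stays in memory and
// the header has no PHI at all.
struct Block {
  std::vector<std::string> phiVars;
};

// Sketch of the heuristic described in (3) above: an empty block is
// spared from merging only if its PHIs make it look like a loop header.
bool sparedAsLoopHeader(const Block &B) {
  return !B.phiVars.empty();
}
```

This is why the same CFG survives simplification in the non-volatile case but gets its outer-loop header folded away in the volatile case.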
>>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev> >>> >>> >>> -- >>> Hal Finkel >>> Assistant Computational Scientist >>> Leadership Computing Facility >>> Argonne National Laboratory >>> >>> >>> >>> -- >>> Hal Finkel >>> Assistant Computational Scientist >>> Leadership Computing Facility >>> Argonne National Laboratory >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev> >>> >>> >>> >>> -- >>> Hal Finkel >>> Assistant Computational Scientist >>> Leadership Computing Facility >>> Argonne National Laboratory >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev> >>> >>> >>> -- >>> Hal Finkel >>> Assistant Computational Scientist >>> Leadership Computing Facility >>> Argonne National Laboratory_______________________________________________ >>> LLVM Developers mailing list >>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev> >>> >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> >>> 
Hyojin Sung via llvm-dev
2015-Aug-06 18:34 UTC
[llvm-dev] [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
Inlined.
From: Gerolf Hoflehner <ghoflehner at apple.com>
To: Hyojin Sung/Watson/IBM at IBMUS
Cc: Hal Finkel <hfinkel at anl.gov>, llvm-dev at lists.llvm.org
Date: 08/06/2015 12:11 AM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with
a volatile iteration variable
On Aug 5, 2015, at 9:08 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
Ok, assuming a verifier/checker were in place, the information could look like:
...
LV: Not vectorizing: Cannot prove legality.
LV: Legal to vectorize. — after loop rotation
LV: Legal to vectorize.
...
LV: Legal to vectorize.
LV: Not vectorizing: Cannot prove legality. — after Jump Threading
(later also — after CFG Simplify)
Note that in your example the second round of loop rotation fails to make the loop vectorizable (judging from the comment, that is its purpose, though!). It might be more appropriate to generalize that approach to any transformation that makes a loop vectorizable (not just rotation) rather than tweaking jump threading + simplification.
To clarify, the loop is vectorizable before and after loop rotation. At least for this example, loop rotation does not affect the vectorizability of the loop. The LLVM loop vectorizer and the vectorization legality checker rely on loops being in a canonical form, so if there are no interfering passes, a check should be able to tell whether a loop is vectorizable right after loop simplify.
I’m convinced there are more cases like yours, and the following framework would give a systematic way to handle them:
a) a checker for is_vectorizable (from refactoring the Legal class in the loop vectorizer - but properly, not hacked like my prototype …)
b) invoke the checker optionally. The checker can be used for testing, analysis, and - independently - as a service for loop transformations (e.g., if the loop is already vectorizable, perhaps a transformation is not needed, or vice versa). I’m not sure about the architecture yet. Perhaps the pass manager could run the checker automatically after every pass when an option is set; adding extra calls after every pass is too clumsy.
c) fix violators where appropriate and/or add transformations that morph a loop into a vectorizable form when possible (especially when it has been in vectorizable shape at some point).
A more general checker/verifier that is optionally executed after LLVM passes could indeed help with identifying interfering transformations. We could use the checker only for debugging (reporting which transformation interferes with others), or for dynamically triggering or reverting transformations to maintain desired properties like vectorizability, as you said. If the checker dynamically affects actual pass execution, though, I think it could require extensive changes to the incremental way LLVM applies transformations.
Regards,
Hyojin
-Gerolf
On Aug 4, 2015, at 8:04 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
The approach would be to run the legality check before + after transformations on demand. For loops that “should” be vectorized, it would give the means to isolate the transformation that introduces a barrier. I’ll see if I can get a dirty prototype to illustrate this on your test case. It could give a warning/remark when the code was (loop-)vectorizable at some point but no longer is when the vectorizer actually kicks in, and it could speed up diagnosis when comparing good/bad cases (like in your instance w/ and w/o the volatile modifier).
Gerolf
On Aug 4, 2015, at 8:40 AM, Hyojin Sung <hsung at us.ibm.com> wrote:
Hi Gerolf,
Thanks for the comment. Yes, I agree that compiler transformations as a whole should not render vectorizable loops unvectorizable. However, some passes may temporarily make a loop unvectorizable (e.g., by aggressive GVN or DCE/DBE), and later passes may revert the effect; trying to keep the invariants after each pass may prevent useful optimizations. Regarding your concern about the example-based approach, the LoopVectorizationLegality class gives a good idea of what such invariants would be, and they are mostly checks for canonical loop structures. I think only a handful of transformation passes may disrupt loop structures, mainly through aggressively eliminating BBs. Focusing on such transformations may be sufficient to prevent prior transformations from destroying vectorizable loops. I'd be glad to hear more of your thoughts.
Best,
Hyojin
From: Gerolf Hoflehner <ghoflehner at apple.com>
To: Hyojin Sung/Watson/IBM at IBMUS
Cc: Hal Finkel <hfinkel at anl.gov>,
llvmdev-bounces at cs.uiuc.edu, llvmdev at cs.uiuc.edu
Date: 08/03/2015 06:14 PM
Subject: Re: [LLVMdev] Improving loop vectorizer support
for loops with a volatile iteration variable
I see a more fundamental problem, and perhaps this example can serve as a stepping stone towards a solution.
There is a desired property: in this case, that the loop is vectorizable.
There are N compiler transformations. These transformations must either establish the property or keep it invariant, but never destroy it.
If there is agreement on this, then the first step is to make the ‘is vectorizable’ analysis runnable after every transformation and have it report violations in detail, likely under a flag. Then comparing a set of loops not vectorizable with clang but vectorizable with other compilers (gcc, clang, …) should shape precise ideas on normal forms for the vectorizer and on general improvements to transformations to keep/establish the invariant/desired property.
I’m worried that the one-example-at-a-time approach will, over time, obfuscate matters within transformations, resulting in less maintainable code.
Gerolf
On Aug 3, 2015, at 2:33 PM, Hyojin Sung <hsung at us.ibm.com> wrote:
Hi,
I discussed the issue offline with Hal, and would like to clarify what exactly is going on and what the trade-offs of the different solutions are, and to ask for more feedback on my proposed solution (http://reviews.llvm.org/D11728). I will use the example from Hal's post:
void foo2(float * restrict x, float * restrict y, float * restrict z) {
  for (volatile int i = 0; i < 1000; ++i) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}
IR after the first loop simplify: A preheader is
created.
; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x,
float* noalias nocapture readonly %y, float*
noalias nocapture readonly %z) #0 {
entry:
%i = alloca i32, align 4
tail call void @llvm.dbg.value(metadata float* %x,
i64 0, metadata !11, metadata !25), !dbg !26
tail call void @llvm.dbg.value(metadata float* %y,
i64 0, metadata !12, metadata !25), !dbg !27
tail call void @llvm.dbg.value(metadata float* %z,
i64 0, metadata !13, metadata !25), !dbg !28
%i.0.i.0..sroa_cast = bitcast i32* %i to i8*
call void @llvm.lifetime.start(i64 4, i8*
%i.0.i.0..sroa_cast)
tail call void @llvm.dbg.value(metadata i32 0, i64
0, metadata !14, metadata !25), !dbg !29
store volatile i32 0, i32* %i, align 4, !dbg !29
br label %for.cond, !dbg !30
for.cond: ;
preds = %for.cond.cleanup.3, %entry
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align
4, !dbg !31
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !34
br i1 %cmp, label %for.cond.1.preheader, label
%for.cond.cleanup, !dbg !35
for.cond.1.preheader: ;
preds = %for.cond
br label %for.cond.1, !dbg !36
for.cond.cleanup: ;
preds = %for.cond
call void @llvm.lifetime.end(i64 4, i8*
%i.0.i.0..sroa_cast)
ret void, !dbg !38
for.cond.1: ;
preds = %for.cond.1.preheader, %for.body.4
%j.0 = phi i32 [ %inc, %for.body.4 ], [ 0,
%for.cond.1.preheader ]
%cmp2 = icmp slt i32 %j.0, 1600, !dbg !36
br i1 %cmp2, label %for.body.4, label
%for.cond.cleanup.3, !dbg !39
for.cond.cleanup.3: ;
preds = %for.cond.1
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align
4, !dbg !40
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !40
tail call void @llvm.dbg.value(metadata i32
%inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align
4, !dbg !40
br label %for.cond, !dbg !41
for.body.4: ;
preds = %for.cond.1
%idxprom = sext i32 %j.0 to i64, !dbg !42
%arrayidx = getelementptr inbounds float, float*
%y, i64 %idxprom, !dbg !42
%0 = load float, float* %arrayidx, align
4, !dbg !42, !tbaa !44
%arrayidx6 = getelementptr inbounds float, float*
%z, i64 %idxprom, !dbg !48
%1 = load float, float* %arrayidx6, align
4, !dbg !48, !tbaa !44
%add = fadd float %0, %1, !dbg !49
%arrayidx8 = getelementptr inbounds float, float*
%x, i64 %idxprom, !dbg !50
store float %add, float* %arrayidx8, align
4, !dbg !51, !tbaa !44
%inc = add nsw i32 %j.0, 1, !dbg !52
tail call void @llvm.dbg.value(metadata i32 %inc,
i64 0, metadata !18, metadata !25), !dbg !53
br label %for.cond.1, !dbg !54
}
IR after loop rotation: After loop rotation, a rotated preheader (for.cond.1.preheader.lr.ph) is created. A test for (i < 1000) is added at the end of the "entry" block. If true, control jumps unconditionally to "for.body.4" through "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader". You can see that these two blocks ("for.cond.1.preheader.lr.ph" and "for.cond.1.preheader") are practically empty, and they will get eliminated later by Jump Threading and/or Simplify-the-CFG. *IF* the outer loop has a non-volatile induction variable, the loop will not be rotated in the first place, as "for.cond.1.preheader" has a PHI node for "i", and these blocks will not be eliminated.
; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x,
float* noalias nocapture readonly %y, float*
noalias nocapture readonly %z) #0 {
entry:
%i = alloca i32, align 4
tail call void @llvm.dbg.value(metadata float* %x,
i64 0, metadata !11, metadata !25), !dbg !26
tail call void @llvm.dbg.value(metadata float* %y,
i64 0, metadata !12, metadata !25), !dbg !27
tail call void @llvm.dbg.value(metadata float* %z,
i64 0, metadata !13, metadata !25), !dbg !28
%i.0.i.0..sroa_cast = bitcast i32* %i to i8*
call void @llvm.lifetime.start(i64 4, i8*
%i.0.i.0..sroa_cast)
tail call void @llvm.dbg.value(metadata i32 0, i64
0, metadata !14, metadata !25), !dbg !29
store volatile i32 0, i32* %i, align 4, !dbg !29
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0..21 = load volatile i32, i32* %i, align
4, !dbg !30
%cmp.22 = icmp slt i32 %i.0.i.0..21,
1000, !dbg !33
br i1 %cmp.22, label %for.cond.1.preheader.lr.ph,
label %for.cond.cleanup, !dbg !34
for.cond.1.preheader.lr.ph: ;
preds = %entry
br label %for.cond.1.preheader, !dbg !34
for.cond.1.preheader: ;
preds = %for.cond.1.preheader.lr.ph,
%for.cond.cleanup.3
br label %for.body.4, !dbg !35
for.cond.for.cond.cleanup_crit_edge: ;
preds = %for.cond.cleanup.3
br label %for.cond.cleanup, !dbg !34
for.cond.cleanup: ;
preds = %for.cond.for.cond.cleanup_crit_edge,
%entry
call void @llvm.lifetime.end(i64 4, i8*
%i.0.i.0..sroa_cast)
ret void, !dbg !36
for.cond.cleanup.3: ;
preds = %for.body.4
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align
4, !dbg !37
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !37
tail call void @llvm.dbg.value(metadata i32
%inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align
4, !dbg !37
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align
4, !dbg !30
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
br i1 %cmp, label %for.cond.1.preheader, label
%for.cond.for.cond.cleanup_crit_edge, !dbg !34
for.body.4: ;
preds = %for.cond.1.preheader, %for.body.4
%j.020 = phi i32 [ 0, %for.cond.1.preheader ],
[ %inc, %for.body.4 ]
%idxprom = sext i32 %j.020 to i64, !dbg !38
%arrayidx = getelementptr inbounds float, float*
%y, i64 %idxprom, !dbg !38
%0 = load float, float* %arrayidx, align
4, !dbg !38, !tbaa !41
%arrayidx6 = getelementptr inbounds float, float*
%z, i64 %idxprom, !dbg !45
%1 = load float, float* %arrayidx6, align
4, !dbg !45, !tbaa !41
%add = fadd float %0, %1, !dbg !46
%arrayidx8 = getelementptr inbounds float, float*
%x, i64 %idxprom, !dbg !47
store float %add, float* %arrayidx8, align
4, !dbg !48, !tbaa !41
%inc = add nsw i32 %j.020, 1, !dbg !49
tail call void @llvm.dbg.value(metadata i32 %inc,
i64 0, metadata !18, metadata !25), !dbg !50
%cmp2 = icmp slt i32 %inc, 1600, !dbg !51
br i1 %cmp2, label %for.body.4, label
%for.cond.cleanup.3, !dbg !35
After Jump Threading: "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader" are merged into "for.body.4" by TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp. Now "for.body.4" has three incoming edges (two backedges).
; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x,
float* noalias nocapture readonly %y, float*
noalias nocapture readonly %z) #0 {
entry:
%i = alloca i32, align 4
tail call void @llvm.dbg.value(metadata float* %x,
i64 0, metadata !11, metadata !25), !dbg !26
tail call void @llvm.dbg.value(metadata float* %y,
i64 0, metadata !12, metadata !25), !dbg !27
tail call void @llvm.dbg.value(metadata float* %z,
i64 0, metadata !13, metadata !25), !dbg !28
%i.0.i.0..sroa_cast = bitcast i32* %i to i8*
call void @llvm.lifetime.start(i64 4, i8*
%i.0.i.0..sroa_cast)
tail call void @llvm.dbg.value(metadata i32 0, i64
0, metadata !14, metadata !25), !dbg !29
store volatile i32 0, i32* %i, align 4, !dbg !29
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0..21 = load volatile i32, i32* %i, align
4, !dbg !30
%cmp.22 = icmp slt i32 %i.0.i.0..21,
1000, !dbg !33
br i1 %cmp.22, label %for.body.4, label
%for.cond.cleanup, !dbg !34
for.cond.cleanup: ;
preds = %for.cond.cleanup.3, %entry
call void @llvm.lifetime.end(i64 4, i8*
%i.0.i.0..sroa_cast)
ret void, !dbg !35
for.cond.cleanup.3: ;
preds = %for.body.4
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align
4, !dbg !36
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !36
tail call void @llvm.dbg.value(metadata i32
%inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align
4, !dbg !36
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align
4, !dbg !30
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
br i1 %cmp, label %for.body.4, label
%for.cond.cleanup, !dbg !34
for.body.4: ;
preds = %for.cond.cleanup.3, %entry, %for.body.4
%indvars.iv = phi i64 [ %indvars.iv.next,
%for.body.4 ], [ 0, %entry ], [ 0,
%for.cond.cleanup.3 ]
%arrayidx = getelementptr inbounds float, float*
%y, i64 %indvars.iv, !dbg !37
%0 = load float, float* %arrayidx, align
4, !dbg !37, !tbaa !40
%arrayidx6 = getelementptr inbounds float, float*
%z, i64 %indvars.iv, !dbg !44
%1 = load float, float* %arrayidx6, align
4, !dbg !44, !tbaa !40
%add = fadd float %0, %1, !dbg !45
%arrayidx8 = getelementptr inbounds float, float*
%x, i64 %indvars.iv, !dbg !46
store float %add, float* %arrayidx8, align
4, !dbg !47, !tbaa !40
%indvars.iv.next = add nuw nsw i64 %indvars.iv,
1, !dbg !48
%exitcond = icmp eq i64 %indvars.iv.next,
1600, !dbg !48
br i1 %exitcond, label %for.cond.cleanup.3, label
%for.body.4, !dbg !48
After another loop simplify: Loop simplify tries to separate out the nested loops but fails to do so with this loop, since it does not have a PHI node for the outer loop variable. Instead, it creates a backedge block.
for.cond.cleanup.3: ;
preds = %for.body.4
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0.17 = load volatile i32, i32* %i, align
4, !dbg !39
%inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !39
tail call void @llvm.dbg.value(metadata i32
%inc10, i64 0, metadata !14,
metadata !25), !dbg !29
store volatile i32 %inc10, i32* %i, align
4, !dbg !39
tail call void @llvm.dbg.value(metadata i32* %i,
i64 0, metadata !14, metadata !25), !dbg !29
%i.0.i.0. = load volatile i32, i32* %i, align
4, !dbg !30
%cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
br i1 %cmp, label %for.body.4.backedge, label
%for.cond.cleanup.loopexit, !dbg !34
for.body.4: ;
preds = %for.body.4.backedge, %for.body.4.preheader
%indvars.iv = phi i64 [ 0,
%for.body.4.preheader ], [ %indvars.iv.be,
%for.body.4.backedge ]
%arrayidx = getelementptr inbounds float, float*
%y, i64 %indvars.iv, !dbg !35
%0 = load float, float* %arrayidx, align
4, !dbg !35, !tbaa !40
%arrayidx6 = getelementptr inbounds float, float*
%z, i64 %indvars.iv, !dbg !44
%1 = load float, float* %arrayidx6, align
4, !dbg !44, !tbaa !40
%add = fadd float %0, %1, !dbg !45
%arrayidx8 = getelementptr inbounds float, float*
%x, i64 %indvars.iv, !dbg !46
store float %add, float* %arrayidx8, align
4, !dbg !47, !tbaa !40
%indvars.iv.next = add nuw nsw i64 %indvars.iv,
1, !dbg !48
%exitcond = icmp eq i64 %indvars.iv.next,
1600, !dbg !48
br i1 %exitcond, label %for.cond.cleanup.3, label
%for.body.4.backedge, !dbg !48
for.body.4.backedge: ;
preds = %for.body.4, %for.cond.cleanup.3
%indvars.iv.be = phi i64 [ %indvars.iv.next,
%for.body.4 ], [ 0, %for.cond.cleanup.3 ]
br label %for.body.4
The LLVM loop vectorizer refuses to vectorize any loop whose loop latch (for.body.4.backedge) is different from its exiting block (for.cond.cleanup.3). With that test, the loop vectorizer can assume that all instructions in the loop are executed the same number of times.
I believe a fix is in order one way or another, because the example is simple and common enough, and it is vectorized by other compilers. We may approach it by either (1) preventing loops from being collapsed in the first place or (2) teaching the loop vectorizer to handle collapsed loops. For (2), we would need to allow the loop vectorizer to forego the assumption and handle the loop as it is. The assumption seems fundamental to many of the vectorization algorithms, so this would require extensive updates, or it may end up reverting the loop back to a properly nested form anyway. The downside of (1) is that it may slow down common optimization passes that are repeatedly executed before vectorization.
My patch (http://reviews.llvm.org/D11728) is a
prototype fix for (1) that modifies Jump Threading
and Simplify-the-CFG to not eliminate an empty loop
header BB even when the loop does not have a PHI
node for its induction variable. The details can be
found at http://reviews.llvm.org/D11728. I would
welcome and appreciate any comments or feedback.
Best,
Hyojin
From: Hal Finkel <hfinkel at anl.gov>
To: Chandler Carruth <chandlerc at google.com>
Cc: llvmdev at cs.uiuc.edu
Date: 07/16/2015 03:19 AM
Subject: Re: [LLVMdev] Improving loop vectorizer
support for loops with a volatile iteration
variable
Sent by: llvmdev-bounces at cs.uiuc.edu