Hal Finkel via llvm-dev
2015-Aug-13 18:00 UTC
[llvm-dev] [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
Hi Gerolf,

I think we have several (perhaps separable) issues here:

1. Do we have a canonical form for loops, preserved through the optimizer, that allows naturally-constructed loop nests to remain separable?
2. Do we forbid non-lowering transformations that turn vectorizable loops into non-vectorizable loops?
3. How do we detect cases where transformations cause a negative answer to either of the above?

Now (2) is a subset of (1), but (1) applies as a general structural property to a much larger class of transformations. Hyojin has found, and proposed a fix for, two important instances of (1) by manual investigation, and I think we should move forward with reviewing those (the actual Phabricator differential did not cc llvm-commits, etc., and I've commented there asking her to open a new one).

Regarding a more-automatic system for finding these kinds of problems, I'm certainly in favor. We could run the vectorizer's legality check, add metadata to loops found to be vectorizable (or use some other state-keeping mechanism), and then check again later that those loops are still considered legally vectorizable (we could also check that we never move from a profitable to an unprofitable state). I suspect this would be much too expensive to run aggressively by default, but we could certainly have a mode for our own use that did this. Part of the challenge, however, is that loops are often taken in and out of 'LoopSimplify' form, and we normally modify the IR in LoopSimplify before the vectorizer's analysis runs (and, similarly, before other passes that operate on loops). As a result, we (might) need to be judicious about where to do these checks (maybe we should do them only after LoopSimplify runs?). Regardless, I think that we should work on documenting what our canonical loop form is, and specifically include the restrictions necessary to preserve (1).
For (2), this would be harder (what the vectorizer will or will not vectorize will likely change much more over time), but detecting problems here is certainly also very important.

Thanks again,
Hal

----- Original Message -----
> From: "Gerolf Hoflehner" <ghoflehner at apple.com>
> To: "Hyojin Sung" <hsung at us.ibm.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, llvmdev at cs.uiuc.edu, llvmdev-bounces at cs.uiuc.edu
> Sent: Wednesday, August 5, 2015 11:08:11 PM
> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
>
> Ok, assuming a verifier/checker was in place, the information could look like:
>
> ...
> LV: Not vectorizing: Cannot prove legality.
> LV: Legal to vectorize. — after loop rotation
> LV: Legal to vectorize.
> ...
> LV: Legal to vectorize.
> LV: Not vectorizing: Cannot prove legality. — after Jump Threading (later also — after CFG Simplify)
>
> Note that in your example the second round of loop rotation fails to make the loop vectorizable (judging from the comment, though, that is its purpose!). It might be more appropriate to generalize that approach to any transformation that makes a loop vectorizable (not just rotation) rather than tweaking jump threading + simplification.
>
> I’m convinced there are more cases like yours, and the following framework would give a systematic way to handle them:
>
> a) a checker for is_vectorizable (from refactoring the legality class in the loop vectorizer - but properly, not hacked like my prototype …)
> b) invoke the checker optionally. The checker can be used for testing, analysis and - independently - as a service for loop transformations (e.g. if the loop is vectorizable already, perhaps a transformation is not needed, or vice versa). I’m not sure about the architecture yet. Perhaps the pass manager could run a checker automatically after every pass when an option is set. Adding extra calls after every pass is too clumsy.
> c) fix violators if appropriate and/or add transformations that morph a loop into a vectorizable form when possible (especially when it has been in vectorizable shape at some point).
>
> -Gerolf
>
>> On Aug 4, 2015, at 8:04 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>>
>> The approach would be to run the legality check before + after transformations on demand. For loops that “should” be vectorized it would give the means to isolate the transformation that introduces a barrier. I’ll see that I can get a dirty prototype to illustrate this on your test case. I could give a warning/remark when the code was (loop-)vectorizable at some point but not anymore when the vectorizer actually kicks in, and speed up diagnosis when comparing good/bad cases (like in your instance w/ and w/o the volatile modifier).
>>
>> Gerolf
>>
>>> On Aug 4, 2015, at 8:40 AM, Hyojin Sung <hsung at us.ibm.com> wrote:
>>>
>>> Hi Gerolf,
>>>
>>> Thanks for the comment. Yes, I agree that compiler transformations as a whole should not render vectorizable loops unvectorizable. However, some passes may temporarily make a loop unvectorizable (e.g., by aggressive GVN or DCE/DBE), and later passes may revert the effect. Trying to keep the invariants after each pass may prevent useful optimizations. Regarding your concern about the example-based approach, the LoopVectorizationLegality class gives a good idea about what such invariants will be, and they are mostly checks for canonical loop structures. I think that there are only a handful of transformation passes that may disrupt loop structures, mainly through aggressively eliminating BBs. Focusing on such transformations may be sufficient to prevent prior transformations from destroying vectorizable loops. I'd be glad to hear more of your thoughts.
> Best,
> Hyojin

From: Gerolf Hoflehner <ghoflehner at apple.com>
To: Hyojin Sung/Watson/IBM at IBMUS
Cc: Hal Finkel <hfinkel at anl.gov>, llvmdev-bounces at cs.uiuc.edu, llvmdev at cs.uiuc.edu
Date: 08/03/2015 06:14 PM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable

I see a more fundamental problem, and perhaps this example can serve as a stepping stone towards a solution.

There is a desired property: in this case, that the loop is vectorizable. There are N compiler transformations. These transformations must either establish the property or keep it invariant, but never destroy it.

If there is agreement on this, then the first step is to have the 'is vectorizable' analysis runnable after every transformation and report violations in detail, likely under a flag. Then comparing a set of loops not vectorizable with clang but vectorizable with other compilers (gcc, icc, …) should shape precise ideas on normal forms for the vectorizer and general improvements to transformations to keep/establish the invariant/desired property.

I’m worried that the one-example-at-a-time approach over time obfuscates matters within transformations, resulting in less maintainable code.

Gerolf

> On Aug 3, 2015, at 2:33 PM, Hyojin Sung <hsung at us.ibm.com> wrote:
>
> Hi, I discussed the issue offline with Hal, and would like to clarify what exactly is going on and what the trade-offs are for different solutions, and to ask for more feedback on my proposed solution (http://reviews.llvm.org/D11728).
I will use the example from Hal's post:

  void foo2(float * restrict x, float * restrict y, float * restrict z) {
    for (volatile int i = 0; i < 1000; ++i) {
      for (int j = 0; j < 1600; ++j) {
        x[j] = y[j] + z[j];
      }
    }
  }

IR after the first loop simplify: a preheader is created.

; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
entry:
  %i = alloca i32, align 4
  tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
  tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
  tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
  %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
  call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
  tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
  store volatile i32 0, i32* %i, align 4, !dbg !29
  br label %for.cond, !dbg !30

for.cond:                                         ; preds = %for.cond.cleanup.3, %entry
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !31
  %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !34
  br i1 %cmp, label %for.cond.1.preheader, label %for.cond.cleanup, !dbg !35

for.cond.1.preheader:                             ; preds = %for.cond
  br label %for.cond.1, !dbg !36

for.cond.cleanup:                                 ; preds = %for.cond
  call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
  ret void, !dbg !38

for.cond.1:                                       ; preds = %for.cond.1.preheader, %for.body.4
  %j.0 = phi i32 [ %inc, %for.body.4 ], [ 0, %for.cond.1.preheader ]
  %cmp2 = icmp slt i32 %j.0, 1600, !dbg !36
  br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !39

for.cond.cleanup.3:                               ; preds = %for.cond.1
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !40
  %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !40
  tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
  store volatile i32 %inc10, i32* %i, align 4, !dbg !40
  br label %for.cond, !dbg !41

for.body.4:                                       ; preds = %for.cond.1
  %idxprom = sext i32 %j.0 to i64, !dbg !42
  %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !42
  %0 = load float, float* %arrayidx, align 4, !dbg !42, !tbaa !44
  %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !48
  %1 = load float, float* %arrayidx6, align 4, !dbg !48, !tbaa !44
  %add = fadd float %0, %1, !dbg !49
  %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !50
  store float %add, float* %arrayidx8, align 4, !dbg !51, !tbaa !44
  %inc = add nsw i32 %j.0, 1, !dbg !52
  tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !53
  br label %for.cond.1, !dbg !54
}

IR after loop rotation: a rotated preheader (for.cond.1.preheader.lr.ph) is created. A test for (i < 1000) is added at the end of the "entry" block. If true, control jumps unconditionally to "for.body.4" through "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader". You can see that these two blocks ("for.cond.1.preheader.lr.ph" and "for.cond.1.preheader") are practically empty, and they will get eliminated later by Jump Threading and/or Simplify-the-CFG. *IF* the outer loop has a non-volatile induction variable, the loop will not be rotated in the first place as "for.cond.1.preheader" has a PHI node for "i", and these blocks will not be eliminated.

; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
entry:
  %i = alloca i32, align 4
  tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
  tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
  tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
  %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
  call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
  tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
  store volatile i32 0, i32* %i, align 4, !dbg !29
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
  %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
  br i1 %cmp.22, label %for.cond.1.preheader.lr.ph, label %for.cond.cleanup, !dbg !34

for.cond.1.preheader.lr.ph:                       ; preds = %entry
  br label %for.cond.1.preheader, !dbg !34

for.cond.1.preheader:                             ; preds = %for.cond.1.preheader.lr.ph, %for.cond.cleanup.3
  br label %for.body.4, !dbg !35

for.cond.for.cond.cleanup_crit_edge:              ; preds = %for.cond.cleanup.3
  br label %for.cond.cleanup, !dbg !34

for.cond.cleanup:                                 ; preds = %for.cond.for.cond.cleanup_crit_edge, %entry
  call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
  ret void, !dbg !36

for.cond.cleanup.3:                               ; preds = %for.body.4
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !37
  %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !37
  tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
  store volatile i32 %inc10, i32* %i, align 4, !dbg !37
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
  %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
  br i1 %cmp, label %for.cond.1.preheader, label %for.cond.for.cond.cleanup_crit_edge, !dbg !34

for.body.4:                                       ; preds = %for.cond.1.preheader, %for.body.4
  %j.020 = phi i32 [ 0, %for.cond.1.preheader ], [ %inc, %for.body.4 ]
  %idxprom = sext i32 %j.020 to i64, !dbg !38
  %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !38
  %0 = load float, float* %arrayidx, align 4, !dbg !38, !tbaa !41
  %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !45
  %1 = load float, float* %arrayidx6, align 4, !dbg !45, !tbaa !41
  %add = fadd float %0, %1, !dbg !46
  %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !47
  store float %add, float* %arrayidx8, align 4, !dbg !48, !tbaa !41
  %inc = add nsw i32 %j.020, 1, !dbg !49
  tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !50
  %cmp2 = icmp slt i32 %inc, 1600, !dbg !51
  br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !35

After Jump Threading: "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader" are merged into "for.body.4" by TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp. Now "for.body.4" has three incoming edges (two backedges).

; Function Attrs: nounwind
define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
entry:
  %i = alloca i32, align 4
  tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
  tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
  tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
  %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
  call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
  tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
  store volatile i32 0, i32* %i, align 4, !dbg !29
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
  %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
  br i1 %cmp.22, label %for.body.4, label %for.cond.cleanup, !dbg !34

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.3, %entry
  call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
  ret void, !dbg !35

for.cond.cleanup.3:                               ; preds = %for.body.4
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !36
  %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !36
  tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
  store volatile i32 %inc10, i32* %i, align 4, !dbg !36
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
  %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
  br i1 %cmp, label %for.body.4, label %for.cond.cleanup, !dbg !34

for.body.4:                                       ; preds = %for.cond.cleanup.3, %entry, %for.body.4
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %entry ], [ 0, %for.cond.cleanup.3 ]
  %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !37
  %0 = load float, float* %arrayidx, align 4, !dbg !37, !tbaa !40
  %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44
  %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
  %add = fadd float %0, %1, !dbg !45
  %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46
  store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
  %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4, !dbg !48

After another loop simplify: loop simplify tries to separate out the nested loops but fails to do so with this loop, since it does not have a PHI node for the outer loop variable. Instead, it creates a backedge block.

for.cond.cleanup.3:                               ; preds = %for.body.4
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !39
  %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !39
  tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
  store volatile i32 %inc10, i32* %i, align 4, !dbg !39
  tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
  %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
  %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
  br i1 %cmp, label %for.body.4.backedge, label %for.cond.cleanup.loopexit, !dbg !34

for.body.4:                                       ; preds = %for.body.4.backedge, %for.body.4.preheader
  %indvars.iv = phi i64 [ 0, %for.body.4.preheader ], [ %indvars.iv.be, %for.body.4.backedge ]
  %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !35
  %0 = load float, float* %arrayidx, align 4, !dbg !35, !tbaa !40
  %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44
  %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
  %add = fadd float %0, %1, !dbg !45
  %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46
  store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
  %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4.backedge, !dbg !48

for.body.4.backedge:                              ; preds = %for.body.4, %for.cond.cleanup.3
  %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %for.cond.cleanup.3 ]
  br label %for.body.4

The LLVM loop vectorizer refuses to vectorize any loop whose loop latch (for.body.4.backedge) is different from its exiting block (for.cond.cleanup.3). With that test, the loop vectorizer can assume that all instructions in the loop are executed the same number of times. I believe a fix is in order one way or another, because the example is simple and common enough, and is vectorized by other compilers. We may approach it by either (1) preventing loops from being collapsed in the first place or (2) teaching the loop vectorizer to handle collapsed loops. For (2), we may need to allow the loop vectorizer to forego the assumption and handle the loop as it is.
The assumption seems fundamental to many of the vectorization algorithms, so it will require extensive updates or may end up reverting the loop back to a properly nested form. The downside of (1) is that it may slow down common optimization passes that are repeatedly executed before vectorization. My patch (http://reviews.llvm.org/D11728) is a prototype fix for (1) that modifies Jump Threading and Simplify-the-CFG to not eliminate an empty loop header BB even when the loop does not have a PHI node for its induction variable. The details can be found at http://reviews.llvm.org/D11728. I would welcome and appreciate any comments or feedback.

Best,
Hyojin

From: Hal Finkel <hfinkel at anl.gov>
To: Chandler Carruth <chandlerc at google.com>
Cc: llvmdev at cs.uiuc.edu
Date: 07/16/2015 03:19 AM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
Sent by: llvmdev-bounces at cs.uiuc.edu

----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Chandler Carruth" <chandlerc at google.com>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Thursday, July 16, 2015 1:58:02 AM
> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
>
>> From: "Hal Finkel" <hfinkel at anl.gov>
>> To: "Chandler Carruth" <chandlerc at google.com>
>> Cc: llvmdev at cs.uiuc.edu
>> Sent: Thursday, July 16, 2015 1:46:42 AM
>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
>>
>>> From: "Chandler Carruth" <chandlerc at google.com>
>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>> Cc: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu
>>> Sent: Thursday, July 16, 2015 1:06:03 AM
>>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration
variable

On Wed, Jul 15, 2015 at 6:36 PM Hal Finkel <hfinkel at anl.gov> wrote:

> From: "Chandler Carruth" <chandlerc at google.com>
> To: "Hyojin Sung" <hsung at us.ibm.com>, llvmdev at cs.uiuc.edu
> Sent: Wednesday, July 15, 2015 7:34:54 PM
> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
>
> On Wed, Jul 15, 2015 at 12:55 PM Hyojin Sung <hsung at us.ibm.com> wrote:
>
>> Hi all,
>>
>> I would like to propose an improvement of the “almost dead” block elimination in Transforms/Local.cpp so that it will preserve the canonical loop form for loops with a volatile iteration variable.
>>
>> *** Problem statement
>> Nested loops in LCALS Subset B (https://codesign.llnl.gov/LCALS.php) are not vectorized with LLVM -O3 because the LLVM loop vectorizer fails the test whether the loop latch and exiting block of a loop are the same. The loops are vectorizable, and get vectorized with LLVM -O2

I would be interested to know why -O2 succeeds here.

>> and also with other commercial compilers (icc, xlc).
>>
>> *** Details
>> These loops ended up with different loop latch and exiting blocks after a series of optimizations including loop unswitching, jump threading, simplify-the-CFG, and loop simplify. The fundamental problem here is that the above optimizations cannot recognize a loop with a volatile iteration variable and do not preserve its canonical loop structure.
>
> Ok, meta-level question first: Why do we care about the performance of loops with a volatile iteration variable?

I don't think we do; however, I think that misses the point. In this case, the volatile iteration variable is just an easy way to expose this problem that we have with nested loop canonicalization and the vectorizer.
To be specific: This we vectorize just fine:

  void foo1(float * restrict x, float * restrict y, float * restrict z) {
    for (int i = 0; i < 1000; ++i) {
      for (int j = 0; j < 1600; ++j) {
        x[j] = y[j] + z[j];
      }
    }
  }

And, indeed, this we don't (the only change is adding volatile on i):

  void foo2(float * restrict x, float * restrict y, float * restrict z) {
    for (volatile int i = 0; i < 1000; ++i) {
      for (int j = 0; j < 1600; ++j) {
        x[j] = y[j] + z[j];
      }
    }
  }

However, this we don't either, and that's a big problem:

  int done(float *x, float *y, float *z);
  void foo3(float * restrict x, float * restrict y, float * restrict z) {
    while (!done(x, y, z)) {
      for (int j = 0; j < 1600; ++j) {
        x[j] = y[j] + z[j];
      }
    }
  }

And the underlying reason is the same. The IR at the point in time when the loop vectorizer runs looks like this:

define void @foo3(float* noalias %x, float* noalias %y, float* noalias %z) #0 {
entry:
  %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1
  %lnot.15 = icmp eq i32 %call.14, 0
  br i1 %lnot.15, label %for.body.preheader, label %while.end

for.body.preheader:                               ; preds = %entry
  br label %for.body

while.cond.loopexit:                              ; preds = %for.body
  %call = tail call i32 @done(float* %x, float* %y, float* %z) #1
  %lnot = icmp eq i32 %call, 0
  br i1 %lnot, label %for.body.backedge, label %while.end.loopexit

for.body:                                         ; preds = %for.body.backedge, %for.body.preheader
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
  ...
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1600
  br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge

for.body.backedge:                                ; preds = %for.body, %while.cond.loopexit
  %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
  br label %for.body

while.end.loopexit:                               ; preds = %while.cond.loopexit
  br label %while.end

while.end:                                        ; preds = %while.end.loopexit, %entry
  ret void
}

and we can currently only vectorize loops where the loop latch is also the loop's exiting block. In this case, as in the case with the volatile variable, vectorization is blocked by this constraint (here the backedge is from the terminator of %for.body.backedge, but the loop exiting block is %for.body). The check in the vectorizer is explicit:

  // We only handle bottom-tested loops, i.e. loop in which the condition is
  // checked at the end of each iteration. With that we can assume that all
  // instructions in the loop are executed the same number of times.
  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
  ...

Thanks for the detailed explanation. This makes it much clearer why we need to handle it. I think it's much better to look at nested loops of this form than anything to do with volatile -- the latter is too prone to other random optimizations turning it off. Regarding this problem, it would be interesting to know, based on this explanation, what the desired fix would be. I see at least these options:

1) Canonicalize loops harder to make them look the way the vectorizer wants. If this can be done without causing significant problems, it seems likely the best approach.

I agree.
In this case, we could certainly fold the trivial %for.body.backedge block into %for.body, meaning transforming this:

for.body.backedge:                                ; preds = %while.cond.loopexit, %for.body
  %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
  br label %for.body

for.body:                                         ; preds = %for.body.backedge, %for.body.preheader
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
  ...
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1600
  br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge

into this:

for.body:                                         ; preds = %for.body.backedge, %for.body.preheader
  %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body ]
  ...
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1600
  br i1 %exitcond, label %while.cond.loopexit, label %for.body

and this seems pretty trivial when %for.body.backedge is completely empty (as in this case), but if it had non-PHI instructions in it on which the existing PHIs in %for.body depended, then maybe this becomes less trivial? Although, based on what I said below, we can't currently vectorize the case with instructions there anyway, for more fundamental reasons.

-Hal

Also worth pointing out that SimplifyCFG does this exact transformation. The vectorizer never sees it, however, because LoopSimplify prefers the form with the separate backedge block that the vectorizer can't handle.

-Hal

2) Teach the vectorizer to vectorize without this constraint by instead establishing the actual invariant it cares about.
It really cares that there's no code between the latch and the exit, because such code is not really part of the loop (it only runs once), and no code between the exit and the latch (because such code runs in one fewer iteration than the code before the exit).

At least nothing with side effects, I presume. -Hal

Maybe there is another strategy?

-Hal

That seems both counter-intuitive and unlikely to be a useful goal. We simply don't optimize volatile operations well in *any* part of the optimizer, and I'm not sure why we need to start trying to fix that. This seems like an irreparably broken benchmark, but perhaps there is a motivation I don't yet see. Assuming that sufficient motivation arises to try to fix this, see my comments below:

> (1) Loop unswitching generates several empty placeholder BBs with only PHI nodes after separating out a shorter path with no inner loop execution from a standard path.
> (2) The jump threading and simplify-the-CFG passes independently call TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp to get rid of almost-empty BBs.
> (3) TryToSimplifyUnconditionalBranchFromEmptyBlock() eliminates the placeholder BBs after loop unswitching and merges them into subsequent blocks, including the header of the inner loop. Before eliminating a block, the function checks whether it is a loop header by looking at its PHI nodes, so that it can be saved, but the test fails for loops with a volatile iteration variable.

Why does this fail for a volatile iteration variable but not for a non-volatile one? I think understanding that will be key to understanding how it should be fixed.
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu
http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Gerolf Hoflehner via llvm-dev
2015-Aug-14 00:50 UTC
[llvm-dev] [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
Hi Hal,

Please see below. I like your break down.

> On Aug 13, 2015, at 11:00 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> Hi Gerolf,
>
> I think we have several (perhaps separable) issues here:
>
> 1. Do we have a canonical form for loops, preserved through the optimizer, that allows naturally-constructed loop nests to remain separable?

Are you going to take a stab at the documentation like you mention below?

> 2. Do we forbid non-lowering transformations that turn vectorizable loops into non-vectorizable loops?

This is the long-term architecture question, which I believe requires more data and investigation. My preference for now is to be flexible. Then it is OK to allow vec -> non-vec trafos, but it should be done consciously (e.g. based on a cost model or for some other well-understood reason). Therefore the first step in answering that question is providing the means to discover and reason about such transformations.

The current solution is that the first instance of loop rotation attempts to create a vectorizable loop. Then some transformations may break that property. So another instance of rotation is invoked before the vectorizer. But in some cases, like Hyojin’s, that still fails. So it is a brittle design, and the story behind it looks like this (and I’m speculating w/o any history digging, so all of this might be totally off):

a) initial design - transform a loop into vectorizable form at some point and keep it that way (first call to loop rotation).
b) surprise 1 - some transformations don’t play along. Probably the initial answers were something like: let’s patch them (so loops stay vectorizable).
c) surprise 2 - over time the patches were getting ugly. So let’s patch everything with a new pass (second call to loop rotation) and “be done with it.”

We are now at

d) surprise 3 - after all the patches, the design space has still not been covered. Perhaps worse, how much of it has been covered?
Possible alternatives: Why not have a general pass before the vectorizer to generate a vectorizable loop? Or have a helper class for it in the vectorizer (check legality; if not legal, transform; if successful, vectorize; otherwise give up)?

> 3. How do we detect cases where transformations cause a negative answer to either of the above?
>
> Now (2) is a subset of (1), but (1) applies as a general structural property to a much larger class of transformations. Hyojin has found, and proposed a fix for, two important instances of (1) by manual investigation, and I think we should move forward with reviewing those (the actual Phabricator differential did not cc llvm-commits, etc., and I've commented there asking her to open a new one).

I guess I can see good reasons for signing up on your proposal. Similar patches have been accepted in the past, so why come out of the woods and pick on this one? My concern is maintainability. When there are implicit dependences between two passes/phases in the compiler, they will break. In my view we are at d) above, and that would favor a design overhaul. Why are these loops important? Catching important loops for a small maintenance overhead (which hopefully is temporary) would be a reasonable trade-off.

> Regarding a more-automatic system for finding these kinds of problems, I'm certainly in favor. We could certainly run the vectorizer's legality check, add metadata to loops found to be vectorizable (or use some other state-keeping mechanism), and then check again later that those loops are still considered legally vectorizable (we could also check that we never move from a profitable to unprofitable state). I suspect this would be much too expensive to run aggressively by default, but we could certainly have a mode for us to use that did this.

Agree that it would be too heavy for default mode.
I’ll see that I can clean up my patch so that we can find the best way to make use of it.

> Part of the challenge, however, is that loops are often taken in and out of 'LoopSimplify' form, and we normally modify the IR in LoopSimplify before the vectorizer's analysis runs (and, similarly, before other passes that operate on loops). As a result, we (might) need to be judicious regarding where to do these checks (maybe we should do them only after LoopSimplify runs?).

Sure.

> Regardless, I think that we should work on documenting what our canonical loop form is, and specifically include restrictions necessary to preserve (1). For (2), this would be harder (what the vectorizer will or will not vectorize will likely change much more over time), but detecting problems here is certainly also very important.

Ok, please see 1. above.

> Thanks again,
> Hal
>
> ----- Original Message -----
>> From: "Gerolf Hoflehner" <ghoflehner at apple.com>
>> To: "Hyojin Sung" <hsung at us.ibm.com>
>> Cc: "Hal Finkel" <hfinkel at anl.gov>, llvmdev at cs.uiuc.edu, llvmdev-bounces at cs.uiuc.edu
>> Sent: Wednesday, August 5, 2015 11:08:11 PM
>> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
>>
>> Ok, assuming a verifier/checker was in place, the information could look like:
>>
>> ...
>> LV: Not vectorizing: Cannot prove legality.
>> LV: Legal to vectorize. — after loop rotation
>> LV: Legal to vectorize.
>> ...
>> LV: Legal to vectorize.
>> LV: Not vectorizing: Cannot prove legality. — after Jump Threading (later also — after CFG Simplify)
>>
>> Note that in your example the second round of loop rotation fails to make the loop vectorizable (from the comment, that is its purpose, though!).
It might be more appropriate to generalize that approach >> to any transformation that makes a loop vectorizable (not just >> rotation) rather than tweaking jump threading + simplification. >> >> >> I’m convinced there are more cases like yours and the following >> framework would give a systematic way to handle them: >> >> >> a) a checker for is_vectorizable (from refactoring the legal class in >> the loop vectorizer - but properly, not hacked like my prototype …) >> b) invoke checker optionally. The checker can be used for testing, >> analysis and - independently - as a service for loop transformations >> (eg. if the loop is vectorizable already, perhaps a transformation >> is not needed or vice versa). I’m not sure about the architecture >> yet. Perhaps the pass manager could run a checker automatically >> after every pass when an option is set. Adding extra calls after >> every pass is too clumsy. >> c) fix violators if appropriate and/or add transformations that morph >> a loop into a vectorizable form when possible (especially when it >> has been in vectorizable shape at some point). >> >> >> >> >> -Gerolf >> >> >> >> >> >> >> >> On Aug 4, 2015, at 8:04 PM, Gerolf Hoflehner < ghoflehner at apple.com > >> wrote: >> >> >> The approach would be to run the legal check before + after >> transformations on demand. For loops that “should” be vectorized it >> would give the means to isolate the transformation that introduces a >> barrier. I’ll see that can get a dirty prototype to illustrate this >> on your test case. I could give a warning/remark when the code was >> (loop-) vectorizable at some point but not anymore when the >> vectorizer actually kicks in and speed-up diagnosis when comparing >> good/bad cases (like in your instance w/ and w/o the volatile >> modifier). >> >> >> Gerolf >> >> >> >> >> >> On Aug 4, 2015, at 8:40 AM, Hyojin Sung < hsung at us.ibm.com > wrote: >> >> >> >> >> Hi Gerolf, >> >> Thanks for the comment. 
Yes, I agree that compiler transformation as >> a whole should not render vectorizable loops unvectorizable. >> However, some passes may temporarily make a loop unvectorizable >> (e.g., by aggressive GVN or DCE/DBE) then later passes may revert >> the effect. Trying to keep the invariants after each pass may >> prevent useful optimizations. Regarding your concern about the >> example-based approach, LoopVectorizationLegality class gives a good >> idea about what will be such invariants and they are mostly related >> to check for canonical loop structures. I think that there are only >> a handful of transformation passes that may disrupt loop structures >> mainly through aggressively eliminating BB's. Focusing on such >> transformations may be sufficient to prevent prior transformations >> from destroying vectorizable loops. I'd be glad to hear more of your >> thoughts. >> >> Best, >> Hyojin >> >> >> <graycol.gif> Gerolf Hoflehner ---08/03/2015 06:14:22 PM---I see a >> more fundamental problem and perhaps this example can serve as a >> stepping stone towards a so >> >> From: Gerolf Hoflehner < ghoflehner at apple.com > >> To: Hyojin Sung/Watson/IBM at IBMUS >> Cc: Hal Finkel < hfinkel at anl.gov >, llvmdev-bounces at cs.uiuc.edu , >> llvmdev at cs.uiuc.edu >> Date: 08/03/2015 06:14 PM >> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops >> with a volatile iteration variable >> >> >> >> >> I see a more fundamental problem and perhaps this example can serve >> as a stepping stone towards a solution. >> >> There is a desired property: In this case, loop is vectorizable. >> There are N compiler transformations. These transformations must >> either establish the property or keep it invariant, but never >> destroy it. >> >> If there is agreement to this then the first step is to have the >> analysis of ‘is vectorizable’ runnable after every transformation >> and report violations in detail, likely under a flag. 
Then comparing >> a set of loops not vectorizable with clang but with other compilers >> (gcc, clang, …) should shape precise ideas on normal forms for the >> vectorizer and general improvements to transformations to the >> keep/establish the invariant/desired property. >> >> I’m worried that the one example at time approach over time >> confuscates matters within transformations resulting in less >> maintainable code. >> >> Gerolf >> >> >> >> >> On Aug 3, 2015, at 2:33 PM, Hyojin Sung < hsung at us.ibm.com > wrote: >> >> Hi, I discussed the issue offline with Hal, and would like to clarify >> what is exactly going on, what are trade-offs for different >> solutions, and ask for more feedback on my proposed solution ( >> http://reviews.llvm.org/D11728 ). I will use the example from Hal's >> post: void foo2(float * restrict x, float * restrict y, float * >> restrict z) { for (volatile int i = 0; i < 1000; ++i) { for (int j >> 0; j < 1600; ++j) { x[j] = y[j] + z[j]; } } } IR after the first >> loop simplify: A preheader is created. ; Function Attrs: nounwind >> define void @foo2(float* noalias nocapture %x, float* noalias >> nocapture readonly %y, float* noalias nocapture readonly %z) #0 { >> entry: %i = alloca i32, align 4 tail call void >> @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata >> !25), !dbg !26 tail call void @llvm.dbg.value(metadata float* %y, >> i64 0, metadata !12, metadata !25), !dbg !27 tail call void >> @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata >> !25), !dbg !28 %i.0.i.0..sroa_cast = bitcast i32* %i to i8* call >> void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast) tail call >> void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata >> !25), !dbg !29 store volatile i32 0, i32* %i, align 4, !dbg !29 br >> label %for.cond, !dbg !30 for.cond: ; preds = %for.cond.cleanup.3, >> %entry tail call void @llvm.dbg.value(metadata i32* %i, i64 0, >> metadata !14, metadata !25), !dbg !29 %i.0.i.0. 
= load volatile i32, >> i32* %i, align 4, !dbg !31 %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg >> !34 br i1 %cmp, label %for.cond.1.preheader, label >> %for.cond.cleanup, !dbg !35 for.cond.1.preheader: ; preds >> %for.cond br label %for.cond.1, !dbg !36 for.cond.cleanup: ; preds >> %for.cond call void @llvm.lifetime.end(i64 4, i8* >> %i.0.i.0..sroa_cast) ret void, !dbg !38 for.cond.1: ; preds >> %for.cond.1.preheader, %for.body.4 %j.0 = phi i32 [ %inc, >> %for.body.4 ], [ 0, %for.cond.1.preheader ] %cmp2 = icmp slt i32 >> %j.0, 1600, !dbg !36 br i1 %cmp2, label %for.body.4, label >> %for.cond.cleanup.3, !dbg !39 for.cond.cleanup.3: ; preds >> %for.cond.1 tail call void @llvm.dbg.value(metadata i32* %i, i64 0, >> metadata !14, metadata !25), !dbg !29 %i.0.i.0.17 = load volatile >> i32, i32* %i, align 4, !dbg !40 %inc10 = add nsw i32 %i.0.i.0.17, 1, >> !dbg !40 tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, >> metadata !14, metadata !25), !dbg !29 store volatile i32 %inc10, >> i32* %i, align 4, !dbg !40 br label %for.cond, !dbg !41 for.body.4: >> ; preds = %for.cond.1 %idxprom = sext i32 %j.0 to i64, !dbg !42 >> %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, >> !dbg !42 %0 = load float, float* %arrayidx, align 4, !dbg !42, !tbaa >> !44 %arrayidx6 = getelementptr inbounds float, float* %z, i64 >> %idxprom, !dbg !48 %1 = load float, float* %arrayidx6, align 4, !dbg >> !48, !tbaa !44 %add = fadd float %0, %1, !dbg !49 %arrayidx8 >> getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !50 >> store float %add, float* %arrayidx8, align 4, !dbg !51, !tbaa !44 >> %inc = add nsw i32 %j.0, 1, !dbg !52 tail call void >> @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata >> !25), !dbg !53 br label %for.cond.1, !dbg !54 } IR after loop >> rotation: After loop rotation, a rotated preheader >> (for.cond.1.preheader.lr.ph) is created. A test for (i < 1000) is >> added at the end of "entry" block. 
If true, the control jumps >> unconditionally to "for.body.4" through "for.cond.1.preheader.lr.ph" >> and "for.cond.1.preheader". You can see that these two blocks >> ("for.cond.1.preheader.lr.ph" and "for.cond.1.preheader") are >> practically empty, and they will get eliminated later by Jump >> Threading and/or Simplify-the-CFG. *IF* the outer loop has a >> non-volatile induction variable, the loop will not be rotated in the >> first place as "for.cond.1.preheader" has a PHI node for "i", and >> these blocks will not be eliminated. ; Function Attrs: nounwind >> define void @foo2(float* noalias nocapture %x, float* noalias >> nocapture readonly %y, float* noalias nocapture readonly %z) #0 { >> entry: %i = alloca i32, align 4 tail call void >> @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata >> !25), !dbg !26 tail call void @llvm.dbg.value(metadata float* %y, >> i64 0, metadata !12, metadata !25), !dbg !27 tail call void >> @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata >> !25), !dbg !28 %i.0.i.0..sroa_cast = bitcast i32* %i to i8* call >> void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast) tail call >> void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata >> !25), !dbg !29 store volatile i32 0, i32* %i, align 4, !dbg !29 tail >> call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, >> metadata !25), !dbg !29 %i.0.i.0..21 = load volatile i32, i32* %i, >> align 4, !dbg !30 %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg >> !33 br i1 %cmp.22, label %for.cond.1.preheader.lr.ph, label >> %for.cond.cleanup, !dbg !34 for.cond.1.preheader.lr.ph: ; preds >> %entry br label %for.cond.1.preheader, !dbg !34 >> for.cond.1.preheader: ; preds = %for.cond.1.preheader.lr.ph, >> %for.cond.cleanup.3 br label %for.body.4, !dbg !35 >> for.cond.for.cond.cleanup_crit_edge: ; preds = %for.cond.cleanup.3 >> br label %for.cond.cleanup, !dbg !34 for.cond.cleanup: ; preds >> %for.cond.for.cond.cleanup_crit_edge, %entry call 
void >> @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast) ret void, !dbg >> !36 for.cond.cleanup.3: ; preds = %for.body.4 tail call void >> @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata >> !25), !dbg !29 %i.0.i.0.17 = load volatile i32, i32* %i, align 4, >> !dbg !37 %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !37 tail call >> void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, >> metadata !25), !dbg !29 store volatile i32 %inc10, i32* %i, align 4, >> !dbg !37 tail call void @llvm.dbg.value(metadata i32* %i, i64 0, >> metadata !14, metadata !25), !dbg !29 %i.0.i.0. = load volatile i32, >> i32* %i, align 4, !dbg !30 %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg >> !33 br i1 %cmp, label %for.cond.1.preheader, label >> %for.cond.for.cond.cleanup_crit_edge, !dbg !34 for.body.4: ; preds >> %for.cond.1.preheader, %for.body.4 %j.020 = phi i32 [ 0, >> %for.cond.1.preheader ], [ %inc, %for.body.4 ] %idxprom = sext i32 >> %j.020 to i64, !dbg !38 %arrayidx = getelementptr inbounds float, >> float* %y, i64 %idxprom, !dbg !38 %0 = load float, float* %arrayidx, >> align 4, !dbg !38, !tbaa !41 %arrayidx6 = getelementptr inbounds >> float, float* %z, i64 %idxprom, !dbg !45 %1 = load float, float* >> %arrayidx6, align 4, !dbg !45, !tbaa !41 %add = fadd float %0, %1, >> !dbg !46 %arrayidx8 = getelementptr inbounds float, float* %x, i64 >> %idxprom, !dbg !47 store float %add, float* %arrayidx8, align 4, >> !dbg !48, !tbaa !41 %inc = add nsw i32 %j.020, 1, !dbg !49 tail call >> void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, >> metadata !25), !dbg !50 %cmp2 = icmp slt i32 %inc, 1600, !dbg !51 br >> i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !35 >> After Jump Threading: "for.cond.1.preheader.lr.ph" and >> "for.cond.1.preheader" are merged into "for.body.4" by >> TryToSimplifyUnconditionalBranchFromEmptyBlock() in >> Transforms/Utils/Local.cpp. Now "for.body.4" has three incoming >> edges (two backedges). 
; Function Attrs: nounwind define void >> @foo2(float* noalias nocapture %x, float* noalias nocapture readonly >> %y, float* noalias nocapture readonly %z) #0 { entry: %i = alloca >> i32, align 4 tail call void @llvm.dbg.value(metadata float* %x, i64 >> 0, metadata !11, metadata !25), !dbg !26 tail call void >> @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata >> !25), !dbg !27 tail call void @llvm.dbg.value(metadata float* %z, >> i64 0, metadata !13, metadata !25), !dbg !28 %i.0.i.0..sroa_cast >> bitcast i32* %i to i8* call void @llvm.lifetime.start(i64 4, i8* >> %i.0.i.0..sroa_cast) tail call void @llvm.dbg.value(metadata i32 0, >> i64 0, metadata !14, metadata !25), !dbg !29 store volatile i32 0, >> i32* %i, align 4, !dbg !29 tail call void @llvm.dbg.value(metadata >> i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 %i.0.i.0..21 >> load volatile i32, i32* %i, align 4, !dbg !30 %cmp.22 = icmp slt i32 >> %i.0.i.0..21, 1000, !dbg !33 br i1 %cmp.22, label %for.body.4, label >> %for.cond.cleanup, !dbg !34 for.cond.cleanup: ; preds >> %for.cond.cleanup.3, %entry call void @llvm.lifetime.end(i64 4, i8* >> %i.0.i.0..sroa_cast) ret void, !dbg !35 for.cond.cleanup.3: ; preds >> = %for.body.4 tail call void @llvm.dbg.value(metadata i32* %i, i64 >> 0, metadata !14, metadata !25), !dbg !29 %i.0.i.0.17 = load volatile >> i32, i32* %i, align 4, !dbg !36 %inc10 = add nsw i32 %i.0.i.0.17, 1, >> !dbg !36 tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, >> metadata !14, metadata !25), !dbg !29 store volatile i32 %inc10, >> i32* %i, align 4, !dbg !36 tail call void @llvm.dbg.value(metadata >> i32* %i, i64 0, metadata !14, metadata !25), !dbg !29 %i.0.i.0. 
>> load volatile i32, i32* %i, align 4, !dbg !30 %cmp = icmp slt i32 >> %i.0.i.0., 1000, !dbg !33 br i1 %cmp, label %for.body.4, label >> %for.cond.cleanup, !dbg !34 for.body.4: ; preds >> %for.cond.cleanup.3, %entry, %for.body.4 %indvars.iv = phi i64 [ >> %indvars.iv.next, %for.body.4 ], [ 0, %entry ], [ 0, >> %for.cond.cleanup.3 ] %arrayidx = getelementptr inbounds float, >> float* %y, i64 %indvars.iv, !dbg !37 %0 = load float, float* >> %arrayidx, align 4, !dbg !37, !tbaa !40 %arrayidx6 = getelementptr >> inbounds float, float* %z, i64 %indvars.iv, !dbg !44 %1 = load >> float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40 %add = fadd >> float %0, %1, !dbg !45 %arrayidx8 = getelementptr inbounds float, >> float* %x, i64 %indvars.iv, !dbg !46 store float %add, float* >> %arrayidx8, align 4, !dbg !47, !tbaa !40 %indvars.iv.next = add nuw >> nsw i64 %indvars.iv, 1, !dbg !48 %exitcond = icmp eq i64 >> %indvars.iv.next, 1600, !dbg !48 br i1 %exitcond, label >> %for.cond.cleanup.3, label %for.body.4, !dbg !48 After another loop >> simplify: Loop simplify tries to separate out nested loops but fails >> to do so with this loop since it does not has a PHI node for the >> outer loop variable. Instead, it creates a backedge block. >> for.cond.cleanup.3: ; preds = %for.body.4 tail call void >> @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata >> !25), !dbg !29 %i.0.i.0.17 = load volatile i32, i32* %i, align 4, >> !dbg !39 %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !39 tail call >> void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, >> metadata !25), !dbg !29 store volatile i32 %inc10, i32* %i, align 4, >> !dbg !39 tail call void @llvm.dbg.value(metadata i32* %i, i64 0, >> metadata !14, metadata !25), !dbg !29 %i.0.i.0. 
= load volatile i32, >> i32* %i, align 4, !dbg !30 %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg >> !33 br i1 %cmp, label %for.body.4.backedge, label >> %for.cond.cleanup.loopexit, !dbg !34 for.body.4: ; preds >> %for.body.4.backedge, %for.body.4.preheader %indvars.iv = phi i64 [ >> 0, %for.body.4.preheader ], [ % indvars.iv.be , %for.body.4.backedge >> ] %arrayidx = getelementptr inbounds float, float* %y, i64 >> %indvars.iv, !dbg !35 %0 = load float, float* %arrayidx, align 4, >> !dbg !35, !tbaa !40 %arrayidx6 = getelementptr inbounds float, >> float* %z, i64 %indvars.iv, !dbg !44 %1 = load float, float* >> %arrayidx6, align 4, !dbg !44, !tbaa !40 %add = fadd float %0, %1, >> !dbg !45 %arrayidx8 = getelementptr inbounds float, float* %x, i64 >> %indvars.iv, !dbg !46 store float %add, float* %arrayidx8, align 4, >> !dbg !47, !tbaa !40 %indvars.iv.next = add nuw nsw i64 %indvars.iv, >> 1, !dbg !48 %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48 >> br i1 %exitcond, label %for.cond.cleanup.3, label >> %for.body.4.backedge, !dbg !48 for.body.4.backedge: ; preds >> %for.body.4, %for.cond.cleanup.3 % indvars.iv.be = phi i64 [ >> %indvars.iv.next, %for.body.4 ], [ 0, %for.cond.cleanup.3 ] br label >> %for.body.4x LLVM loop vectorizer rejects to vectorize any loop for >> which a loop latch (for.body.4.backedge) is different from a loop >> exiting block (for.cond.cleanup.3). The loop vectorizer can assume >> that all instructions in the loop are executed the same number of >> times with the test. I believe a fix is in order in one way or >> another because the example is simple and common enough and >> vectorized by other compilers. We may approach it by either (1) >> preventing loops from being collapsed in the first place or (2) >> teaching loop vectorizer to handle collapsed loops. For (2), we may >> need to allow loop vectorizer to forego the assumption and handle >> the loop as it is. 
The assumption seems fundamental to many of the >> vectorization algorithms, so it will require extensive updates or >> may end up with reverting the loop back to a properly nested form. >> The downside of (1) is that it may slow down common optimization >> passes that are repeatedly executed before vectorization. My patch ( >> http://reviews.llvm.org/D11728 ) is a prototype fix for (1) that >> modifies Jump Threading and Simplify-the-CFG to not eliminate an >> empty loop header BB even when the loop does not have a PHI node for >> its induction variable. The details can be found at >> http://reviews.llvm.org/D11728 . I would welcome and appreciate any >> comments or feedback. Best, Hyojin <graycol.gif> Hal Finkel >> ---07/16/2015 03:19:24 AM-------- Original Message ----- > From: >> "Hal Finkel" < hfinkel at anl.gov > From: Hal Finkel < hfinkel at anl.gov >>> To: Chandler Carruth < chandlerc at google.com > Cc: >> llvmdev at cs.uiuc.edu Date: 07/16/2015 03:19 AM Subject: Re: [LLVMdev] >> Improving loop vectorizer support for loops with a volatile >> iteration variable Sent by: llvmdev-bounces at cs.uiuc.edu >> ----- Original Message ----- >> >> >> From: "Hal Finkel" < hfinkel at anl.gov > To: "Chandler Carruth" < >> chandlerc at google.com > Cc: llvmdev at cs.uiuc.edu Sent: Thursday, July >> 16, 2015 1:58:02 AM Subject: Re: [LLVMdev] Improving loop vectorizer >> support for loops with a volatile iteration variable >> >> From: "Hal Finkel" < hfinkel at anl.gov > To: "Chandler Carruth" < >> chandlerc at google.com > Cc: llvmdev at cs.uiuc.edu Sent: Thursday, July >> 16, 2015 1:46:42 AM Subject: Re: [LLVMdev] Improving loop vectorizer >> support for loops with a volatile iteration variable >> >> From: "Chandler Carruth" < chandlerc at google.com > To: "Hal Finkel" < >> hfinkel at anl.gov > Cc: "Hyojin Sung" < hsung at us.ibm.com >, >> llvmdev at cs.uiuc.edu Sent: Thursday, July 16, 2015 1:06:03 AM >> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops 
>> with a volatile iteration variable On Wed, Jul 15, 2015 at 6:36 PM >> Hal Finkel < hfinkel at anl.gov > wrote: >> >> >> >> From: "Chandler Carruth" < chandlerc at google.com > To: "Hyojin Sung" < >> hsung at us.ibm.com >, llvmdev at cs.uiuc.edu Sent: Wednesday, July 15, >> 2015 7:34:54 PM Subject: Re: [LLVMdev] Improving loop vectorizer >> support for loops with a volatile iteration variable On Wed, Jul 15, >> 2015 at 12:55 PM Hyojin Sung < hsung at us.ibm.com > wrote: >> >> Hi all, I would like to propose an improvement of the “almost dead” >> block elimination in Transforms/Local.cpp so that it will preserve >> the canonical loop form for loops with a volatile iteration >> variable. *** Problem statement Nested loops in LCALS Subset B ( >> https://codesign.llnl.gov/LCALS.php ) are not vectorized with LLVM >> -O3 because the LLVM loop vectorizer fails the test whether the loop >> latch and exiting block of a loop is the same. The loops are >> vectorizable, and get vectorized with LLVM -O2 I would be interested >> to know why -O2 succeeds here. >> >> and also with other commercial compilers (icc, xlc). *** Details >> These loops ended up with different loop latch and exiting block >> after a series of optimizations including loop unswitching, jump >> threading, simplify-the-CFG, and loop simplify. The fundamental >> problem here is that the above optimizations cannot recognize a loop >> with a volatile iteration variable and do not preserve its canonical >> loop structure. Ok, meta-level question first: Why do we care about >> performance of loops with a volatile iteration variable? I don't >> think we do, however, I think that misses the point. In this case, >> the volatile iteration variable is just an easy way to expose this >> problem that we have with nested loop canonicalization and the >> vectorizer. 
>> To be specific: This we vectorize just fine:
>>
>> void foo1(float * restrict x, float * restrict y, float * restrict z) {
>>   for (int i = 0; i < 1000; ++i) {
>>     for (int j = 0; j < 1600; ++j) {
>>       x[j] = y[j] + z[j];
>>     }
>>   }
>> }
>>
>> And, indeed, this we don't (the only change is adding volatile on i):
>>
>> void foo2(float * restrict x, float * restrict y, float * restrict z) {
>>   for (volatile int i = 0; i < 1000; ++i) {
>>     for (int j = 0; j < 1600; ++j) {
>>       x[j] = y[j] + z[j];
>>     }
>>   }
>> }
>>
>> However, this we don't either, and that's a big problem:
>>
>> int done(float *x, float *y, float *z);
>> void foo3(float * restrict x, float * restrict y, float * restrict z) {
>>   while (!done(x, y, z)) {
>>     for (int j = 0; j < 1600; ++j) {
>>       x[j] = y[j] + z[j];
>>     }
>>   }
>> }
>>
>> And the underlying reason is the same. The IR at the point in time when the loop vectorizer runs looks like this:
>>
>> define void @foo3(float* noalias %x, float* noalias %y, float* noalias %z) #0 {
>> entry:
>>   %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1
>>   %lnot.15 = icmp eq i32 %call.14, 0
>>   br i1 %lnot.15, label %for.body.preheader, label %while.end
>>
>> for.body.preheader: ; preds = %entry
>>   br label %for.body
>>
>> while.cond.loopexit: ; preds = %for.body
>>   %call = tail call i32 @done(float* %x, float* %y, float* %z) #1
>>   %lnot = icmp eq i32 %call, 0
>>   br i1 %lnot, label %for.body.backedge, label %while.end.loopexit
>>
>> for.body: ; preds = %for.body.backedge, %for.body.preheader
>>   %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
>>   ...
>>   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>   %exitcond = icmp eq i64 %indvars.iv.next, 1600
>>   br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge
>>
>> for.body.backedge: ; preds = %for.body, %while.cond.loopexit
>>   %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
>>   br label %for.body
>>
>> while.end.loopexit: ; preds = %while.cond.loopexit
>>   br label %while.end
>>
>> while.end: ; preds = %while.end.loopexit, %entry
>>   ret void
>> }
>>
>> and we can currently only vectorize loops where the loop latch is also the loop's exiting block. In this case, as in the case with the volatile variable, vectorization is blocked by this constraint (here the backedge is from the terminator of %for.body.backedge, but the loop exiting block is %for.body). The check in the vectorizer is explicit:
>>
>> // We only handle bottom-tested loops, i.e. loop in which the condition is
>> // checked at the end of each iteration. With that we can assume that all
>> // instructions in the loop are executed the same number of times.
>> if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
>> ...
>>
>> Thanks for the detailed explanation. This makes it much clearer why we need to handle it. I think it's much better to look at nested loops of this form than anything to do with volatile -- the latter is too prone to other random optimizations turning off. Regarding this problem, it would be interesting to know, based on this explanation, what the desired fix would be. I see at least these options:
>>
>> 1) Canonicalize loops harder to make them look the way the vectorizer wants. If this can be done without causing significant problems, it seems likely the best approach.
>>
>> I agree.
In this case, we could certainly fold the trivial >> %for.body.backedge block into %for.body, meaning transforming this: >> for.body.backedge: ; preds = %while.cond.loopexit, %for.body % >> indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, >> %while.cond.loopexit ] br label %for.body for.body: ; preds >> %for.body.backedge, %for.body.preheader %indvars.iv = phi i64 [ 0, >> %for.body.preheader ], [ % indvars.iv.be , %for.body.backedge ] ... >> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 %exitcond = icmp >> eq i64 %indvars.iv.next, 1600 br i1 %exitcond, label >> %while.cond.loopexit, label %for.body.backedge into this: for.body: >> ; preds = %for.body.backedge, %for.body.preheader % indvars.iv.be >> phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ] >> %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ % indvars.iv.be >> , %for.body ] ... %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 >> %exitcond = icmp eq i64 %indvars.iv.next, 1600 br i1 %exitcond, >> label %while.cond.loopexit, label %for.body and this seems pretty >> trivial when %for.body.backedge is completely empty (as in this >> case), but if it had non-PHI instructions in it on which the >> existing PHIs in %for.body depended, then maybe this becomes less >> trivial? Although, based on what I said below, the case with >> instructions there we can't currently vectorize anyway for >> more-fundamental reasons. -Hal Also worth pointing out that >> SimplifyCFG does this exact transformation. The vectorizer never >> sees it, however, because LoopSimplify prefers this form with the >> separate backedge block that the vectorizer can't handle. -Hal >> >> 2) Teach the vectorizer to vectorize without this constraint by >> instead establishing the actual invariant it cares about. 
>> It really cares that there's no code that comes in between the latch and the exit, because such code is not really part of the loop (it only runs once), or code in between the exit and the latch (because such code runs one fewer iteration than the code before the exit). At least nothing with side effects, I presume.
>>
>> -Hal
>>
>> Maybe there is another strategy?
>>
>> -Hal
>>
>> That seems both counter-intuitive and unlikely to be a useful goal. We simply don't optimize volatile operations well in *any* part of the optimizer, and I'm not sure why we need to start trying to fix that. This seems like an irreparably broken benchmark, but perhaps there is a motivation I don't yet see. Assuming that sufficient motivation arises to try to fix this, see my comments below:
>>
>> (1) Loop unswitching generates several empty placeholder BBs, containing only PHI nodes, after separating out a shorter path with no inner-loop execution from a standard path. (2) The jump threading and simplify-the-CFG passes independently call TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp to get rid of almost-empty BBs. (3) TryToSimplifyUnconditionalBranchFromEmptyBlock() eliminates the placeholder BBs after loop unswitching and merges them into subsequent blocks, including the header of the inner loop. Before eliminating a block, the function checks whether the block is a loop header by looking at its PHI nodes so that it can be saved, but the test fails for loops with a volatile iteration variable.
>>
>> Why does this fail for a volatile iteration variable but not for a non-volatile one? I think understanding that will be key to understanding how it should be fixed.
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Hal Finkel via llvm-dev
2015-Aug-19 03:52 UTC
[llvm-dev] [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
----- Original Message -----
> From: "Gerolf Hoflehner" <ghoflehner at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Hyojin Sung" <hsung at us.ibm.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, August 13, 2015 7:50:27 PM
> Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
>
> Hi Hal,
>
> Please see below. I like your breakdown.
>
> > On Aug 13, 2015, at 11:00 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >
> > Hi Gerolf,
> >
> > I think we have several (perhaps separable) issues here:
> >
> > 1. Do we have a canonical form for loops, preserved through the
> > optimizer, that allows naturally-constructed loop nests to remain
> > separable?
>
> Are you going to take a stab at the documentation like you mention
> below?

I'm really not the right person to do this, but I'm happy to help. Chandler also told me that he would help. The basic structure is described in the comments at the top of LoopSimplify.cpp, and there are some additional comments regarding nested loops interspersed with the code. Then there is LCSSA (which is described briefly in LCSSA.cpp). Starting from those pieces likely makes sense.

> > 2. Do we forbid non-lowering transformations that turn vectorizable
> > loops into non-vectorizable loops?
>
> This is the long-term architecture question which I believe requires
> more data and investigation. My preference for now is to be flexible.
> Then it is OK to allow vec -> non-vec trafos, but it should be done
> consciously (e.g. based on a cost model or for some other
> well-understood reason). Therefore the first step in answering that
> question is providing the means to discover and reason about such
> transformations.
>
> The current solution is that the first instance of loop rotation
> attempts to create a vectorizable loop. Then some transformations
> may break that property. So another instance of rotation is invoked
> before the vectorizer.
> But in some cases like Hyojin's that still fails. So it is a brittle
> design, and the story behind it looks like this (and I'm speculating
> w/o any history digging, so all of this might be totally off):
> a) initial design - transform a loop into vectorizable form at some
>    point and keep it that way (first call to loop rotation).
> b) surprise 1 - some transformations don't play along. Probably the
>    initial answers were something like: let's patch them (so loops
>    stay vectorizable).
> c) surprise 2 - over time the patches were getting ugly. So let's
>    patch everything with a new pass (second call to loop rotation)
>    and "be done with it."
> We are now at d) surprise 3 - after all patches the design space has
> still not been covered. Perhaps worse, how much of it has been
> covered?
>
> Possible alternatives: Why not have a general pass before the
> vectorizer to generate a vectorizable loop? Or have a helper class
> for it in the vectorizer (check legality - if not legal, transform -
> if successful, vectorize, otherwise give up)?

We already have LoopSimplify that runs prior to the vectorizer, so enhancing that certainly seems like one option. As I recall, Hyojin investigated that as one possible option.

> > 3. How do we detect cases where transformations cause a negative
> > answer to either of the above?
>
> > Now (2) is a subset of (1), but (1) applies as a general structural
> > property to a much larger class of transformations. Hyojin has
> > found, and proposed a fix for, two important instances of (1) by
> > manual investigation, and I think we should move forward with
> > reviewing those (the actual Phabricator differential did not cc
> > llvm-commits, etc., and I've commented there asking her to open a
> > new one).
>
> I guess I can see good reasons for signing up on your proposal.
> Similar patches have been accepted in the past, so why come out of
> the woods and pick on this one? My concern is maintainability.
> When there are implicit dependences between two passes/phases in the
> compiler, they will break. In my view we are at d) above, and that
> would favor a design overhaul.
>
> Why are these loops important? Catching important loops for a small
> maintenance overhead (which hopefully is temporary) would be a
> reasonable trade-off.

They're important because they show up in real performance-sensitive code. Not the volatile-variable case (I suspect that's relegated to benchmarks), but I've certainly seen loop nests that look something like:

  while (!done(...)) {
    for (...)
      ... [vectorizable]
  }

which is common in iterative solvers, and the vectorizer is crippled, as I discovered, on these loops for the same reason. Regarding maintainability, I think we need to decide on the right future direction, and accept a fix in line with that direction.

> > Regarding a more-automatic system for finding these kinds of
> > problems, I'm certainly in favor. We could certainly run the
> > vectorizer's legality check, add metadata to loops found to be
> > vectorizable (or use some other state-keeping mechanism), and then
> > check again later that those loops are still considered legally
> > vectorizable (we could also check that we never move from a
> > profitable to an unprofitable state). I suspect this would be much
> > too expensive to run aggressively by default, but we could
> > certainly have a mode for us to use that did this.
>
> Agree that it would be too heavy for default mode. I'll see that I
> can clean up my patch so then we can find the best way to make use
> of it.

Sounds good. Thanks again, Hal

> > Part of the challenge, however, is that loops are often taken in
> > and out of 'LoopSimplify' form, and we normally modify the IR in
> > LoopSimplify before the vectorizer's analysis runs (and, similarly,
> > before other passes that operate on loops).
> > As a result, we (might) need to be judicious regarding where to do
> > these checks (maybe we should do them only after LoopSimplify
> > runs?).
>
> Sure.
>
> > Regardless, I think that we should work on documenting what our
> > canonical loop form is, and specifically include restrictions
> > necessary to preserve (1). For (2), this would be harder (what the
> > vectorizer will or will not vectorize will likely change much more
> > over time), but detecting problems here is certainly also very
> > important.
>
> Ok, please see 1. above.
>
> > Thanks again,
> > Hal
>
> > ----- Original Message -----
> > > From: "Gerolf Hoflehner" <ghoflehner at apple.com>
> > > To: "Hyojin Sung" <hsung at us.ibm.com>
> > > Cc: "Hal Finkel" <hfinkel at anl.gov>, llvmdev at cs.uiuc.edu, llvmdev-bounces at cs.uiuc.edu
> > > Sent: Wednesday, August 5, 2015 11:08:11 PM
> > > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
> > >
> > > Ok, assuming a verifier/checker was in place, the information
> > > could look like:
> > >
> > >   ...
> > >   LV: Not vectorizing: Cannot prove legality.
> > >   LV: Legal to vectorize.                       — after loop rotation
> > >   LV: Legal to vectorize.
> > >   ...
> > >   LV: Legal to vectorize.
> > >   LV: Not vectorizing: Cannot prove legality.  — after Jump Threading
> > >                                                  (later also — after CFG Simplify)
> > >
> > > Note that in your example the second round of loop rotation fails
> > > to make the loop vectorizable (from the comment, that is its
> > > purpose, though!). It might be more appropriate to generalize
> > > that approach to any transformation that makes a loop
> > > vectorizable (not just rotation) rather than tweaking jump
> > > threading + simplification.
> > > I'm convinced there are more cases like yours, and the following
> > > framework would give a systematic way to handle them:
> > >
> > > a) a checker for is_vectorizable (from refactoring the legal
> > >    class in the loop vectorizer - but properly, not hacked like
> > >    my prototype ...)
> > > b) invoke the checker optionally. The checker can be used for
> > >    testing, analysis and - independently - as a service for loop
> > >    transformations (e.g. if the loop is vectorizable already,
> > >    perhaps a transformation is not needed, or vice versa). I'm
> > >    not sure about the architecture yet. Perhaps the pass manager
> > >    could run a checker automatically after every pass when an
> > >    option is set. Adding extra calls after every pass is too
> > >    clumsy.
> > > c) fix violators if appropriate and/or add transformations that
> > >    morph a loop into a vectorizable form when possible
> > >    (especially when it has been in vectorizable shape at some
> > >    point).
> > >
> > > -Gerolf
> > >
> > > On Aug 4, 2015, at 8:04 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
> > >
> > > The approach would be to run the legal check before + after
> > > transformations on demand. For loops that "should" be vectorized
> > > it would give the means to isolate the transformation that
> > > introduces a barrier. I'll see that I can get a dirty prototype
> > > to illustrate this on your test case. I could give a
> > > warning/remark when the code was (loop-) vectorizable at some
> > > point but not anymore when the vectorizer actually kicks in, and
> > > speed up diagnosis when comparing good/bad cases (like in your
> > > instance w/ and w/o the volatile modifier).
> > > Gerolf
> > >
> > > On Aug 4, 2015, at 8:40 AM, Hyojin Sung <hsung at us.ibm.com> wrote:
> > >
> > > Hi Gerolf,
> > >
> > > Thanks for the comment. Yes, I agree that compiler transformation
> > > as a whole should not render vectorizable loops unvectorizable.
> > > However, some passes may temporarily make a loop unvectorizable
> > > (e.g., by aggressive GVN or DCE/DBE), then later passes may
> > > revert the effect. Trying to keep the invariants after each pass
> > > may prevent useful optimizations. Regarding your concern about
> > > the example-based approach, the LoopVectorizationLegality class
> > > gives a good idea about what such invariants will be, and they
> > > are mostly related to checks for canonical loop structures. I
> > > think that there are only a handful of transformation passes that
> > > may disrupt loop structures, mainly through aggressively
> > > eliminating BBs. Focusing on such transformations may be
> > > sufficient to prevent prior transformations from destroying
> > > vectorizable loops. I'd be glad to hear more of your thoughts.
> > >
> > > Best,
> > > Hyojin
> > >
> > > From: Gerolf Hoflehner <ghoflehner at apple.com>
> > > To: Hyojin Sung/Watson/IBM at IBMUS
> > > Cc: Hal Finkel <hfinkel at anl.gov>, llvmdev-bounces at cs.uiuc.edu, llvmdev at cs.uiuc.edu
> > > Date: 08/03/2015 06:14 PM
> > > Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
> > >
> > > I see a more fundamental problem, and perhaps this example can
> > > serve as a stepping stone towards a solution.
> > > There is a desired property: in this case, the loop is
> > > vectorizable. There are N compiler transformations. These
> > > transformations must either establish the property or keep it
> > > invariant, but never destroy it.
> > >
> > > If there is agreement on this, then the first step is to have the
> > > analysis of 'is vectorizable' runnable after every transformation
> > > and report violations in detail, likely under a flag. Then
> > > comparing a set of loops not vectorizable with clang but with
> > > other compilers (gcc, clang, ...) should shape precise ideas on
> > > normal forms for the vectorizer and general improvements to
> > > transformations to keep/establish the invariant/desired property.
> > >
> > > I'm worried that the one-example-at-a-time approach over time
> > > obfuscates matters within transformations, resulting in less
> > > maintainable code.
> > >
> > > Gerolf
> > >
> > > On Aug 3, 2015, at 2:33 PM, Hyojin Sung <hsung at us.ibm.com> wrote:
> > >
> > > Hi, I discussed the issue offline with Hal, and would like to
> > > clarify what exactly is going on and what the trade-offs for
> > > different solutions are, and to ask for more feedback on my
> > > proposed solution (http://reviews.llvm.org/D11728). I will use
> > > the example from Hal's post:
> > >
> > >   void foo2(float * restrict x, float * restrict y, float * restrict z) {
> > >     for (volatile int i = 0; i < 1000; ++i) {
> > >       for (int j = 0; j < 1600; ++j) {
> > >         x[j] = y[j] + z[j];
> > >       }
> > >     }
> > >   }
> > >
> > > IR after the first loop simplify: a preheader is created.
> > >   ; Function Attrs: nounwind
> > >   define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
> > >   entry:
> > >     %i = alloca i32, align 4
> > >     tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
> > >     tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
> > >     tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
> > >     %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
> > >     call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
> > >     tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
> > >     store volatile i32 0, i32* %i, align 4, !dbg !29
> > >     br label %for.cond, !dbg !30
> > >
> > >   for.cond:                       ; preds = %for.cond.cleanup.3, %entry
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !31
> > >     %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !34
> > >     br i1 %cmp, label %for.cond.1.preheader, label %for.cond.cleanup, !dbg !35
> > >
> > >   for.cond.1.preheader:           ; preds = %for.cond
> > >     br label %for.cond.1, !dbg !36
> > >
> > >   for.cond.cleanup:               ; preds = %for.cond
> > >     call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
> > >     ret void, !dbg !38
> > >
> > >   for.cond.1:                     ; preds = %for.cond.1.preheader, %for.body.4
> > >     %j.0 = phi i32 [ %inc, %for.body.4 ], [ 0, %for.cond.1.preheader ]
> > >     %cmp2 = icmp slt i32 %j.0, 1600, !dbg !36
> > >     br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !39
> > >
> > >   for.cond.cleanup.3:             ; preds = %for.cond.1
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !40
> > >     %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !40
> > >     tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> > >     store volatile i32 %inc10, i32* %i, align 4, !dbg !40
> > >     br label %for.cond, !dbg !41
> > >
> > >   for.body.4:                     ; preds = %for.cond.1
> > >     %idxprom = sext i32 %j.0 to i64, !dbg !42
> > >     %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !42
> > >     %0 = load float, float* %arrayidx, align 4, !dbg !42, !tbaa !44
> > >     %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !48
> > >     %1 = load float, float* %arrayidx6, align 4, !dbg !48, !tbaa !44
> > >     %add = fadd float %0, %1, !dbg !49
> > >     %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !50
> > >     store float %add, float* %arrayidx8, align 4, !dbg !51, !tbaa !44
> > >     %inc = add nsw i32 %j.0, 1, !dbg !52
> > >     tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !53
> > >     br label %for.cond.1, !dbg !54
> > >   }
> > >
> > > IR after loop rotation: After loop rotation, a rotated preheader
> > > (for.cond.1.preheader.lr.ph) is created. A test for (i < 1000) is
> > > added at the end of the "entry" block. If true, control jumps
> > > unconditionally to "for.body.4" through
> > > "for.cond.1.preheader.lr.ph" and "for.cond.1.preheader". You can
> > > see that these two blocks ("for.cond.1.preheader.lr.ph" and
> > > "for.cond.1.preheader") are practically empty, and they will get
> > > eliminated later by Jump Threading and/or Simplify-the-CFG. *IF*
> > > the outer loop has a non-volatile induction variable, the loop
> > > will not be rotated in the first place as "for.cond.1.preheader"
> > > has a PHI node for "i", and these blocks will not be eliminated.
> > >
> > >   ; Function Attrs: nounwind
> > >   define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
> > >   entry:
> > >     %i = alloca i32, align 4
> > >     tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
> > >     tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
> > >     tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
> > >     %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
> > >     call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
> > >     tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
> > >     store volatile i32 0, i32* %i, align 4, !dbg !29
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
> > >     %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
> > >     br i1 %cmp.22, label %for.cond.1.preheader.lr.ph, label %for.cond.cleanup, !dbg !34
> > >
> > >   for.cond.1.preheader.lr.ph:     ; preds = %entry
> > >     br label %for.cond.1.preheader, !dbg !34
> > >
> > >   for.cond.1.preheader:           ; preds = %for.cond.1.preheader.lr.ph, %for.cond.cleanup.3
> > >     br label %for.body.4, !dbg !35
> > >
> > >   for.cond.for.cond.cleanup_crit_edge: ; preds = %for.cond.cleanup.3
> > >     br label %for.cond.cleanup, !dbg !34
> > >
> > >   for.cond.cleanup:               ; preds = %for.cond.for.cond.cleanup_crit_edge, %entry
> > >     call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
> > >     ret void, !dbg !36
> > >
> > >   for.cond.cleanup.3:             ; preds = %for.body.4
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !37
> > >     %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !37
> > >     tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> > >     store volatile i32 %inc10, i32* %i, align 4, !dbg !37
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
> > >     %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
> > >     br i1 %cmp, label %for.cond.1.preheader, label %for.cond.for.cond.cleanup_crit_edge, !dbg !34
> > >
> > >   for.body.4:                     ; preds = %for.cond.1.preheader, %for.body.4
> > >     %j.020 = phi i32 [ 0, %for.cond.1.preheader ], [ %inc, %for.body.4 ]
> > >     %idxprom = sext i32 %j.020 to i64, !dbg !38
> > >     %arrayidx = getelementptr inbounds float, float* %y, i64 %idxprom, !dbg !38
> > >     %0 = load float, float* %arrayidx, align 4, !dbg !38, !tbaa !41
> > >     %arrayidx6 = getelementptr inbounds float, float* %z, i64 %idxprom, !dbg !45
> > >     %1 = load float, float* %arrayidx6, align 4, !dbg !45, !tbaa !41
> > >     %add = fadd float %0, %1, !dbg !46
> > >     %arrayidx8 = getelementptr inbounds float, float* %x, i64 %idxprom, !dbg !47
> > >     store float %add, float* %arrayidx8, align 4, !dbg !48, !tbaa !41
> > >     %inc = add nsw i32 %j.020, 1, !dbg !49
> > >     tail call void @llvm.dbg.value(metadata i32 %inc, i64 0, metadata !18, metadata !25), !dbg !50
> > >     %cmp2 = icmp slt i32 %inc, 1600, !dbg !51
> > >     br i1 %cmp2, label %for.body.4, label %for.cond.cleanup.3, !dbg !35
> > >
> > > After Jump Threading: "for.cond.1.preheader.lr.ph" and
> > > "for.cond.1.preheader" are merged into "for.body.4" by
> > > TryToSimplifyUnconditionalBranchFromEmptyBlock() in
> > > Transforms/Utils/Local.cpp. Now "for.body.4" has three incoming
> > > edges (two backedges).
> > >   ; Function Attrs: nounwind
> > >   define void @foo2(float* noalias nocapture %x, float* noalias nocapture readonly %y, float* noalias nocapture readonly %z) #0 {
> > >   entry:
> > >     %i = alloca i32, align 4
> > >     tail call void @llvm.dbg.value(metadata float* %x, i64 0, metadata !11, metadata !25), !dbg !26
> > >     tail call void @llvm.dbg.value(metadata float* %y, i64 0, metadata !12, metadata !25), !dbg !27
> > >     tail call void @llvm.dbg.value(metadata float* %z, i64 0, metadata !13, metadata !25), !dbg !28
> > >     %i.0.i.0..sroa_cast = bitcast i32* %i to i8*
> > >     call void @llvm.lifetime.start(i64 4, i8* %i.0.i.0..sroa_cast)
> > >     tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !14, metadata !25), !dbg !29
> > >     store volatile i32 0, i32* %i, align 4, !dbg !29
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0..21 = load volatile i32, i32* %i, align 4, !dbg !30
> > >     %cmp.22 = icmp slt i32 %i.0.i.0..21, 1000, !dbg !33
> > >     br i1 %cmp.22, label %for.body.4, label %for.cond.cleanup, !dbg !34
> > >
> > >   for.cond.cleanup:               ; preds = %for.cond.cleanup.3, %entry
> > >     call void @llvm.lifetime.end(i64 4, i8* %i.0.i.0..sroa_cast)
> > >     ret void, !dbg !35
> > >
> > >   for.cond.cleanup.3:             ; preds = %for.body.4
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !36
> > >     %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !36
> > >     tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> > >     store volatile i32 %inc10, i32* %i, align 4, !dbg !36
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
> > >     %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
> > >     br i1 %cmp, label %for.body.4, label %for.cond.cleanup, !dbg !34
> > >
> > >   for.body.4:                     ; preds = %for.cond.cleanup.3, %entry, %for.body.4
> > >     %indvars.iv = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %entry ], [ 0, %for.cond.cleanup.3 ]
> > >     %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !37
> > >     %0 = load float, float* %arrayidx, align 4, !dbg !37, !tbaa !40
> > >     %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44
> > >     %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
> > >     %add = fadd float %0, %1, !dbg !45
> > >     %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46
> > >     store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
> > >     %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
> > >     %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
> > >     br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4, !dbg !48
> > >
> > > After another loop simplify: Loop simplify tries to separate out
> > > nested loops but fails to do so with this loop since it does not
> > > have a PHI node for the outer loop variable. Instead, it creates
> > > a backedge block.
> > >   for.cond.cleanup.3:             ; preds = %for.body.4
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0.17 = load volatile i32, i32* %i, align 4, !dbg !39
> > >     %inc10 = add nsw i32 %i.0.i.0.17, 1, !dbg !39
> > >     tail call void @llvm.dbg.value(metadata i32 %inc10, i64 0, metadata !14, metadata !25), !dbg !29
> > >     store volatile i32 %inc10, i32* %i, align 4, !dbg !39
> > >     tail call void @llvm.dbg.value(metadata i32* %i, i64 0, metadata !14, metadata !25), !dbg !29
> > >     %i.0.i.0. = load volatile i32, i32* %i, align 4, !dbg !30
> > >     %cmp = icmp slt i32 %i.0.i.0., 1000, !dbg !33
> > >     br i1 %cmp, label %for.body.4.backedge, label %for.cond.cleanup.loopexit, !dbg !34
> > >
> > >   for.body.4:                     ; preds = %for.body.4.backedge, %for.body.4.preheader
> > >     %indvars.iv = phi i64 [ 0, %for.body.4.preheader ], [ %indvars.iv.be, %for.body.4.backedge ]
> > >     %arrayidx = getelementptr inbounds float, float* %y, i64 %indvars.iv, !dbg !35
> > >     %0 = load float, float* %arrayidx, align 4, !dbg !35, !tbaa !40
> > >     %arrayidx6 = getelementptr inbounds float, float* %z, i64 %indvars.iv, !dbg !44
> > >     %1 = load float, float* %arrayidx6, align 4, !dbg !44, !tbaa !40
> > >     %add = fadd float %0, %1, !dbg !45
> > >     %arrayidx8 = getelementptr inbounds float, float* %x, i64 %indvars.iv, !dbg !46
> > >     store float %add, float* %arrayidx8, align 4, !dbg !47, !tbaa !40
> > >     %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !48
> > >     %exitcond = icmp eq i64 %indvars.iv.next, 1600, !dbg !48
> > >     br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4.backedge, !dbg !48
> > >
> > >   for.body.4.backedge:            ; preds = %for.body.4, %for.cond.cleanup.3
> > >     %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body.4 ], [ 0, %for.cond.cleanup.3 ]
> > >     br label %for.body.4
> > >
> > > The LLVM loop vectorizer refuses to vectorize any loop whose loop
> > > latch (for.body.4.backedge) is different from its exiting block
> > > (for.cond.cleanup.3). With this test, the loop vectorizer can
> > > assume that all instructions in the loop are executed the same
> > > number of times. I believe a fix is in order in one way or
> > > another because the example is simple and common enough, and it
> > > is vectorized by other compilers. We may approach it by either
> > > (1) preventing loops from being collapsed in the first place or
> > > (2) teaching the loop vectorizer to handle collapsed loops. For
> > > (2), we may need to allow the loop vectorizer to forego the
> > > assumption and handle the loop as it is. The assumption seems
> > > fundamental to many of the vectorization algorithms, so it will
> > > require extensive updates or may end up with reverting the loop
> > > back to a properly nested form. The downside of (1) is that it
> > > may slow down common optimization passes that are repeatedly
> > > executed before vectorization. My patch
> > > (http://reviews.llvm.org/D11728) is a prototype fix for (1) that
> > > modifies Jump Threading and Simplify-the-CFG to not eliminate an
> > > empty loop header BB even when the loop does not have a PHI node
> > > for its induction variable. The details can be found at
> > > http://reviews.llvm.org/D11728. I would welcome and appreciate
> > > any comments or feedback.
Best,
Hyojin

----- Original Message -----
From: Hal Finkel <hfinkel at anl.gov>
To: Chandler Carruth <chandlerc at google.com>
Cc: llvmdev at cs.uiuc.edu
Date: 07/16/2015 03:19 AM
Subject: Re: [LLVMdev] Improving loop vectorizer support for loops with a volatile iteration variable
Sent by: llvmdev-bounces at cs.uiuc.edu

On Wed, Jul 15, 2015 at 12:55 PM, Hyojin Sung <hsung at us.ibm.com> wrote:

Hi all,

I would like to propose an improvement to the "almost dead" block elimination in Transforms/Local.cpp so that it preserves the canonical loop form for loops with a volatile iteration variable.

*** Problem statement

Nested loops in LCALS Subset B (https://codesign.llnl.gov/LCALS.php) are not vectorized with LLVM -O3 because the LLVM loop vectorizer fails the test of whether the loop latch and the exiting block of a loop are the same. The loops are vectorizable, and get vectorized with LLVM -O2

Hal Finkel wrote:

I would be interested to know why -O2 succeeds here.

Hyojin Sung continued:

and also with other commercial compilers (icc, xlc).

*** Details

These loops end up with different loop latch and exiting blocks after a series of optimizations including loop unswitching, jump threading, simplify-the-CFG, and loop simplify. The fundamental problem here is that the above optimizations cannot recognize a loop with a volatile iteration variable and do not preserve its canonical loop structure.

On Wed, Jul 15, 2015 at 7:34 PM, Chandler Carruth <chandlerc at google.com> wrote:

Ok, meta-level question first: Why do we care about the performance of loops with a volatile iteration variable?

Hal Finkel wrote:

I don't think we do; however, I think that misses the point. In this case, the volatile iteration variable is just an easy way to expose this problem that we have with nested loop canonicalization and the vectorizer.
Hal Finkel continued:

To be specific: This we vectorize just fine:

void foo1(float * restrict x, float * restrict y, float * restrict z) {
  for (int i = 0; i < 1000; ++i) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}

And, indeed, this we don't (the only change is adding volatile on i):

void foo2(float * restrict x, float * restrict y, float * restrict z) {
  for (volatile int i = 0; i < 1000; ++i) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}

However, this we don't either, and that's a big problem:

int done(float *x, float *y, float *z);

void foo3(float * restrict x, float * restrict y, float * restrict z) {
  while (!done(x, y, z)) {
    for (int j = 0; j < 1600; ++j) {
      x[j] = y[j] + z[j];
    }
  }
}

And the underlying reason is the same. The IR at the point in time when the loop vectorizer runs looks like this:

define void @foo3(float* noalias %x, float* noalias %y, float* noalias %z) #0 {
entry:
  %call.14 = tail call i32 @done(float* %x, float* %y, float* %z) #1
  %lnot.15 = icmp eq i32 %call.14, 0
  br i1 %lnot.15, label %for.body.preheader, label %while.end

for.body.preheader:                       ; preds = %entry
  br label %for.body

while.cond.loopexit:                      ; preds = %for.body
  %call = tail call i32 @done(float* %x, float* %y, float* %z) #1
  %lnot = icmp eq i32 %call, 0
  br i1 %lnot, label %for.body.backedge, label %while.end.loopexit

for.body:                                 ; preds = %for.body.backedge, %for.body.preheader
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
  ...
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1600
  br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge

for.body.backedge:                        ; preds = %for.body, %while.cond.loopexit
  %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
  br label %for.body

while.end.loopexit:                       ; preds = %while.cond.loopexit
  br label %while.end

while.end:                                ; preds = %while.end.loopexit, %entry
  ret void
}

and we can currently only vectorize loops where the loop latch is also the loop's exiting block. In this case, as in the case with the volatile variable, vectorization is blocked by this constraint (here the backedge is from the terminator of %for.body.backedge, but the loop's exiting block is %for.body). The check in the vectorizer is explicit:

  // We only handle bottom-tested loops, i.e. loop in which the condition is
  // checked at the end of each iteration. With that we can assume that all
  // instructions in the loop are executed the same number of times.
  if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
  ...

Chandler Carruth wrote (07/16/2015, 1:06 AM):

Thanks for the detailed explanation. It makes much more sense now why we need to handle it. I think it's much better to look at nested loops of this form than anything to do with volatile -- the latter is too prone to other random optimizations turning it off. Regarding this problem, it would be interesting to know, based on this explanation, what the desired fix would be. I see at least these options:

1) Canonicalize loops harder to make them look the way the vectorizer wants. If this can be done without causing significant problems, it seems likely the best approach.

Hal Finkel wrote (07/16/2015, 1:46 AM):

I agree. In this case, we could certainly fold the trivial %for.body.backedge block into %for.body, meaning transforming this:

for.body.backedge:                        ; preds = %while.cond.loopexit, %for.body
  %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
  br label %for.body

for.body:                                 ; preds = %for.body.backedge, %for.body.preheader
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body.backedge ]
  ...
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1600
  br i1 %exitcond, label %while.cond.loopexit, label %for.body.backedge

into this:

for.body:                                 ; preds = %for.body.backedge, %for.body.preheader
  %indvars.iv.be = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %while.cond.loopexit ]
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.be, %for.body ]
  ...
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1600
  br i1 %exitcond, label %while.cond.loopexit, label %for.body

and this seems pretty trivial when %for.body.backedge is completely empty (as in this case), but if it had non-PHI instructions in it on which the existing PHIs in %for.body depended, then maybe this becomes less trivial? Although, based on what I said below, the case with instructions there is one we can't currently vectorize anyway for more-fundamental reasons.

-Hal

Hal Finkel wrote (07/16/2015, 1:58 AM):

Also worth pointing out that SimplifyCFG does this exact transformation.
The vectorizer never sees it, however, because LoopSimplify prefers this form with the separate backedge block that the vectorizer can't handle.

-Hal

Chandler Carruth continued (07/16/2015, 1:06 AM):

2) Teach the vectorizer to vectorize without this constraint by instead establishing the actual invariant it cares about. It really cares that there's no code between the latch and the exit (such code is not really part of the loop; it only runs once), and no code between the exit and the latch (such code runs one fewer iteration than the code before the exit).

Hal Finkel wrote:

At least nothing with side effects, I presume.

-Hal

Maybe there is another strategy?

-Hal

Chandler Carruth wrote, in his reply to Hyojin's original post:

That seems both counter-intuitive and unlikely to be a useful goal. We simply don't optimize volatile operations well in *any* part of the optimizer, and I'm not sure why we need to start trying to fix that. This seems like an irreparably broken benchmark, but perhaps there is a motivation I don't yet see. Assuming that sufficient motivation arises to try to fix this, see my comments below:

Hyojin Sung wrote:

(1) Loop unswitching generates several empty placeholder BBs containing only PHI nodes after separating out a shorter path with no inner-loop execution from the standard path.

(2) The jump threading and simplify-the-CFG passes independently call TryToSimplifyUnconditionalBranchFromEmptyBlock() in Transforms/Utils/Local.cpp to get rid of almost-empty BBs.

(3) TryToSimplifyUnconditionalBranchFromEmptyBlock() eliminates the placeholder BBs left after loop unswitching and merges them into subsequent blocks, including the header of the inner loop. Before eliminating a block, the function checks whether the block is a loop header by looking at its PHI nodes so that it can be saved, but this test fails for loops with a volatile iteration variable.

Chandler Carruth wrote:

Why does this fail for a volatile iteration variable but not for a non-volatile one? I think understanding that will be key to understanding how it should be fixed.
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev