Gerolf Hoflehner via llvm-dev
2015-Sep-11 02:51 UTC
[llvm-dev] [RFC] New pass: LoopExitValues
Hi Steve,

It seems the general consensus is that the patch feels like a work-around
for a problem with LSR (and possibly other loop transformations) that
introduces redundant instructions. It is probably best to file a bug and
attach a few of your test cases.

Thanks
Gerolf

> On Sep 10, 2015, at 4:37 PM, Steve King via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Thu, Sep 10, 2015 at 3:43 PM, Jake VanAdrighem <jvanadrighem at gmail.com> wrote:
>> Which cases does this pass handle which aren't otherwise optimized out
>> by passes like GlobalValueNumbering or DeadCodeElimination?
>
> The case uniquely handled here is one in which a computation after a
> loop happens to be redundant with a value left over from loop execution.
> At each nesting level of N nested loops, the pass compares computable
> SCEVs looking for matches. This sounds specialized to me, but I'm not
> knowledgeable enough in LLVM's many passes to know which one comes
> closest to this sort of optimization. My colleague who originally wrote
> the pass noted that redundancies found by this pass pop up after loop
> transforms like LSR.
>
> Previously, I showed examples of the type of optimization performed by
> this pass -- they're bulky so I won't cut and paste them here. In
> particular, check out the before-and-after IR example James asked for.
>
> Regards,
> -steve
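To make the comparison Steve describes in the quoted message concrete, here
is a rough sketch of the idea against LLVM's C++ API -- not the posted
patch, just an illustration under my own simplifying assumptions
(single-exit loop, LCSSA form, and a helper name I made up):

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/Analysis/ScalarEvolution.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // For one loop: compare the SCEVs of computations that run after the
  // loop against the SCEVs of values the loop leaves behind, and reuse
  // the live-out value when the expressions match.
  static bool reuseLoopExitValues(Loop *L, ScalarEvolution &SE) {
    BasicBlock *Exit = L->getExitBlock(); // single-exit loops only, for brevity
    if (!Exit)
      return false;

    // In LCSSA form, every value computed in the loop and used afterwards
    // is funneled through a phi in the exit block; those phis are the
    // candidate "values left over from loop execution".
    SmallVector<std::pair<const SCEV *, PHINode *>, 8> LiveOuts;
    for (Instruction &I : *Exit) {
      auto *PN = dyn_cast<PHINode>(&I);
      if (!PN)
        break; // phis are grouped at the top of the block
      if (SE.isSCEVable(PN->getType()))
        LiveOuts.push_back({SE.getSCEV(PN), PN});
    }

    // A later computation whose SCEV matches a live-out value recomputes
    // something the loop already produced.  SCEV nodes are uniqued, so
    // pointer equality means the expressions are identical.
    bool Changed = false;
    for (Instruction &I : *Exit) {
      if (isa<PHINode>(&I) || !SE.isSCEVable(I.getType()))
        continue;
      const SCEV *S = SE.getSCEV(&I);
      for (auto &LiveOut : LiveOuts) {
        if (LiveOut.first == S && LiveOut.second->getType() == I.getType()) {
          I.replaceAllUsesWith(LiveOut.second); // leave the dead instr for DCE
          Changed = true;
          break;
        }
      }
    }
    return Changed;
  }

The real pass would of course have to walk every nesting level and every
exit, and weigh profitability; this only shows the shape of the SCEV
comparison.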
Hi Steve,

I believe that others have been looking recently at a very similar set of
problems. For example, Wei Mi posted this patch:

http://reviews.llvm.org/D12090

and I wonder if this would also address your use cases. If not, I think
we'd like to understand why.

Also, if these redundant expressions involve induction variables, then
that's something that IndVarSimplify is already supposed to do, and if it
is missing cases, then we should improve that pass (and, thus, folding what
you've done into IndVarSimplify might be the right way to go).

 -Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
On Fri, Sep 11, 2015 at 10:06 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Also, if these redundant expressions involve induction variables, then
> that's something that IndVarSimplify is already supposed to do, and if it
> is missing cases, then we should improve that pass (and, thus, folding
> what you've done into IndVarSimplify might be the right way to go).

Hal, thanks for the info. It turns out IndVarSimplify intentionally creates
redundant calculations of loop exit values! From the pass's header comment:

// 2. Any use outside of the loop of an expression derived from the indvar
//    is changed to compute the derived value outside of the loop,
//    eliminating the dependence on the exit value of the induction
//    variable. If the only purpose of the loop is to compute the exit
//    value of some derived expression, this transformation will make the
//    loop dead.

I understand the stated purpose for the redundancy, but there is no mention
of undoing the damage. The LoopExitValues pass seems to be exactly the
cleanup that was missing. Do passes that come after IndVarSimplify, like
LSR, need the redundant calculations in place to identify dead loops, or
should IndVarSimplify have cleaned up after itself immediately?

Regards,
-steve
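For anyone who has not looked at that transformation before, the rewrite
the comment describes looks roughly like this (a toy example of my own, not
taken from the pass's tests):

  /* Before: the value used after the loop is an expression derived from
     the induction variable i.                                           */
  unsigned before(unsigned n) {
    unsigned p = 0;
    for (unsigned i = 0; i < n; ++i)
      p = 4 * i + 4;      /* derived from the indvar                     */
    return p;             /* use of p outside the loop                   */
  }

  /* After exit-value rewriting: the use outside the loop is computed
     directly from the trip count.  Nothing the loop computes is used any
     more, so the loop itself becomes dead and can be deleted.           */
  unsigned after(unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      ;                   /* now dead                                    */
    return 4 * n;         /* recomputed outside the loop                 */
  }

When the loop cannot be deleted (because it has other side effects), the
same rewrite leaves the value computed twice -- once incrementally inside
the loop and once from the trip count outside it -- which is exactly the
kind of redundancy being discussed here.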
On Fri, Sep 11, 2015 at 10:06 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> I believe that others have been looking recently at a very similar set
> of problems. For example, Wei Mi posted this patch:
>
> http://reviews.llvm.org/D12090
>
> and I wonder if this would also address your use cases. If not, I think
> we'd like to understand why.

I tried the patch on the matrix_mul test case and saw where the redundancy
was removed. My patch in http://reviews.llvm.org/D12090 cannot handle this
case; I feel like it is a different problem.

void matrix_mul(unsigned int Size, unsigned int *Dst,
                unsigned int *Src, unsigned int Val) {
  for (int Outer = 0; Outer < Size; ++Outer)
    for (int Inner = 0; Inner < Size; ++Inner)
      Dst[Outer * Size + Inner] = Src[Outer * Size + Inner] * Val;
}

"Outer * Size + Inner" is an induction variable of the inner loop. Its SCEV
is {{0,+,%Size}<outerloop>,+,1}<innerloop>, so the initial value of the
induction variable for the inner loop is {0,+,%Size}<outerloop>, i.e., an
"lsr.iv.next = lsr.iv + %Size" will be generated at the end of every
iteration of the outer loop to prepare the initial value of the induction
variable for the next outer-loop iteration.

However, the trip count of the inner loop happens to be "Size" too, so the
exit value of "Outer * Size + Inner" from the inner loop happens to be the
same as the initial value of "Outer * Size + Inner" in the next iteration
of the outer loop. The exit value of the induction variable can therefore
be reused, and the "lsr.iv.next = lsr.iv + %Size" that computes the initial
value of "Outer * Size + Inner" can be omitted.

If I change "Inner < Size" to "Inner < Size1", the compiler generates the
same code with or without the patches. I tried gcc, and gcc couldn't remove
the redundancy either.

It looks like the optimization doesn't fit the current LSR very well,
because it needs to consider the relationship between the initial and exit
values of the induction variable across iterations of the outer loop. This
is interesting work, because the pattern of implementing a multi-dimensional
array as a single linear array is common.

Thanks,
Wei.
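In source form, the reuse Wei describes amounts to something like the
following hand-written equivalent (for illustration only -- my own
rendering, not compiler output, and the names are made up):

  void matrix_mul_reuse(unsigned int Size, unsigned int *Dst,
                        unsigned int *Src, unsigned int Val) {
    unsigned int Idx = 0;      /* plays the role of "Outer * Size + Inner" */
    for (unsigned int Outer = 0; Outer < Size; ++Outer) {
      /* No "Idx = Outer * Size" and no "Idx += Size" per outer iteration:
         the value Idx holds when the inner loop exits is already
         (Outer + 1) * Size, which is exactly the initial value the next
         outer iteration needs -- but only because the inner trip count is
         also Size.                                                       */
      for (unsigned int End = Idx + Size; Idx != End; ++Idx)
        Dst[Idx] = Src[Idx] * Val;
    }
  }

If the inner bound were some unrelated Size1 instead of Size, as in Wei's
modified test, the exit value and the next initial value would no longer
coincide and the extra add would genuinely be needed.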
On Mon, Sep 14, 2015 at 2:10 PM, Wei Mi <wmi at google.com> wrote:
> However, the trip count of the inner loop happens to be "Size" too, so the
> exit value of "Outer * Size + Inner" from the inner loop happens to be the
> same as the initial value of "Outer * Size + Inner" in the next iteration
> of the outer loop. The exit value of the induction variable can therefore
> be reused, and the "lsr.iv.next = lsr.iv + %Size" that computes the
> initial value of "Outer * Size + Inner" can be omitted.

Thanks for experimenting with the patch! If I understand correctly, you
show that the redundancy is inherent in the loop and not a byproduct of
other passes.

Following this insight, changing just one line of the IR produces assembly
*identical* to the effect of the new LEV pass. Specifically, in the outer
loop, the multiply can be replaced with a PHI node that recycles the %add
value coming from the inner loop. Should the front end be responsible for
this sort of optimization?

Original:

  %mul = mul i32 %Outer.026, %Size

Replacing it with this line has the same effect as the LEV pass:

  %mul = phi i32 [ %add, %for.cond.cleanup.3 ], [ 0, %entry ]

The original IR for the matrix_mul.c test case:

define void @matrix_mul(i32 %Size, i32* nocapture %Dst, i32* nocapture readonly %Src, i32 %Val) #0 {
entry:
  %cmp.25 = icmp eq i32 %Size, 0
  br i1 %cmp.25, label %for.cond.cleanup, label %for.body.4.lr.ph

for.body.4.lr.ph:                       ; preds = %entry, %for.cond.cleanup.3
  %Outer.026 = phi i32 [ %inc10, %for.cond.cleanup.3 ], [ 0, %entry ]
;****************
  %mul = mul i32 %Outer.026, %Size
;****************
  br label %for.body.4

for.cond.cleanup:                       ; preds = %for.cond.cleanup.3, %entry
  ret void

for.cond.cleanup.3:                     ; preds = %for.body.4
  %inc10 = add nuw nsw i32 %Outer.026, 1
  %exitcond28 = icmp eq i32 %inc10, %Size
  br i1 %exitcond28, label %for.cond.cleanup, label %for.body.4.lr.ph

for.body.4:                             ; preds = %for.body.4, %for.body.4.lr.ph
  %Inner.024 = phi i32 [ 0, %for.body.4.lr.ph ], [ %inc, %for.body.4 ]
  %add = add i32 %Inner.024, %mul
  %arrayidx = getelementptr inbounds i32, i32* %Src, i32 %add
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !1
  %mul5 = mul i32 %0, %Val
  %arrayidx8 = getelementptr inbounds i32, i32* %Dst, i32 %add
  store i32 %mul5, i32* %arrayidx8, align 4, !tbaa !1
  %inc = add nuw nsw i32 %Inner.024, 1
  %exitcond = icmp eq i32 %inc, %Size
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4
}
Hi Folks,

Let's keep this optimization alive. To summarize:

- Several folks voiced general support, but with questions about why
  existing optimizations do not already catch this case.
- A deep dive by Wei Mi showed that the optimization is most likely not a
  clean-up of LSR sloppiness, but something new.
- My follow-up confirmed that the redundancy eliminated by the
  LoopExitValues pass exists in the original IR from Clang.

As a next step, can we come to consensus on where the LoopExitValues
optimization belongs?

Regards,
-steve