The loop vectorizer seems to be not able to vectorize the following code: void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) { const std::uint64_t inner = 4; for (std::uint64_t i = start ; i < end ; ++i ) { const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4; const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4; c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; c[ ir1 ] = a[ ir1 ] + b[ ir1 ]; } } LV: Found a loop: for.body LV: Found an induction variable. LV: We need to do 0 pointer comparisons. LV: Checking memory dependencies LV: Bad stride - Not an AddRecExpr pointer %arrayidx11 = getelementptr inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw> LV: Bad stride - Not an AddRecExpr pointer %arrayidx15 = getelementptr inbounds float* %c, i64 %add8 SCEV: ((4 * %add8)<nsw> + %c)<nsw> LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 * %add8)<nsw> + %c)<nsw>(Induction step: 0) LV: Distance for store float %add10, float* %arrayidx11, align 4 to store float %add14, float* %arrayidx15, align 4: ((4 * %add8)<nsw> + (-4 * %add2)) Non-consecutive pointer access LV: We don't need a runtime memory check. LV: Can't vectorize due to memory conflicts LV: Not vectorizing. Here the code: entry: %cmp14 = icmp ult i64 %start, %end br i1 %cmp14, label %for.body, label %for.end for.body: ; preds = %entry, %for.body %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ] %div = lshr i64 %i.015, 2 %mul = shl i64 %div, 3 %rem = and i64 %i.015, 3 %add2 = or i64 %mul, %rem %add8 = or i64 %add2, 4 %arrayidx = getelementptr inbounds float* %a, i64 %add2 %0 = load float* %arrayidx, align 4 %arrayidx9 = getelementptr inbounds float* %b, i64 %add2 %1 = load float* %arrayidx9, align 4 %add10 = fadd float %0, %1 %arrayidx11 = getelementptr inbounds float* %c, i64 %add2 store float %add10, float* %arrayidx11, align 4 %arrayidx12 = getelementptr inbounds float* %a, i64 %add8 %2 = load float* %arrayidx12, align 4 %arrayidx13 = getelementptr inbounds float* %b, i64 %add8 %3 = load float* %arrayidx13, align 4 %add14 = fadd float %2, %3 %arrayidx15 = getelementptr inbounds float* %c, i64 %add8 store float %add14, float* %arrayidx15, align 4 %inc = add i64 %i.015, 1 %exitcond = icmp eq i64 %inc, %end br i1 %exitcond, label %for.end, label %for.body for.end: ; preds = %for.body, %entry ret void Why is it saying Bad stride?Are the 'rem' and 'div' instruction asking too much from it? The code should be vectorizable. Here the index access for start=0, end=16: loop count i = 0 index_0 = 0 index_1 = 4 loop count i = 1 index_0 = 1 index_1 = 5 loop count i = 2 index_0 = 2 index_1 = 6 loop count i = 3 index_0 = 3 index_1 = 7 loop count i = 4 index_0 = 8 index_1 = 12 loop count i = 5 index_0 = 9 index_1 = 13 loop count i = 6 index_0 = 10 index_1 = 14 loop count i = 7 index_0 = 11 index_1 = 15 loop count i = 8 index_0 = 16 index_1 = 20 loop count i = 9 index_0 = 17 index_1 = 21 loop count i = 10 index_0 = 18 index_1 = 22 loop count i = 11 index_0 = 19 index_1 = 23 loop count i = 12 index_0 = 24 index_1 = 28 loop count i = 13 index_0 = 25 index_1 = 29 loop count i = 14 index_0 = 26 index_1 = 30 loop count i = 15 index_0 = 27 index_1 = 31 Any ideas? Frank
Hi Frank, The access pattern to arrays a and b is non-linear. Unrolled loops are usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all values for i ? Thanks, Nadav On Oct 30, 2013, at 9:05 AM, Frank Winter <fwinter at jlab.org> wrote:> The loop vectorizer seems to be not able to vectorize the following code: > > void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) > { > const std::uint64_t inner = 4; > for (std::uint64_t i = start ; i < end ; ++i ) { > const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4; > const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4; > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > c[ ir1 ] = a[ ir1 ] + b[ ir1 ]; > } > } > > LV: Found a loop: for.body > LV: Found an induction variable. > LV: We need to do 0 pointer comparisons. > LV: Checking memory dependencies > LV: Bad stride - Not an AddRecExpr pointer %arrayidx11 = getelementptr inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw> > LV: Bad stride - Not an AddRecExpr pointer %arrayidx15 = getelementptr inbounds float* %c, i64 %add8 SCEV: ((4 * %add8)<nsw> + %c)<nsw> > LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 * %add8)<nsw> + %c)<nsw>(Induction step: 0) > LV: Distance for store float %add10, float* %arrayidx11, align 4 to store float %add14, float* %arrayidx15, align 4: ((4 * %add8)<nsw> + (-4 * %add2)) > Non-consecutive pointer access > LV: We don't need a runtime memory check. > LV: Can't vectorize due to memory conflicts > LV: Not vectorizing. > > Here the code: > > entry: > %cmp14 = icmp ult i64 %start, %end > br i1 %cmp14, label %for.body, label %for.end > > for.body: ; preds = %entry, %for.body > %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ] > %div = lshr i64 %i.015, 2 > %mul = shl i64 %div, 3 > %rem = and i64 %i.015, 3 > %add2 = or i64 %mul, %rem > %add8 = or i64 %add2, 4 > %arrayidx = getelementptr inbounds float* %a, i64 %add2 > %0 = load float* %arrayidx, align 4 > %arrayidx9 = getelementptr inbounds float* %b, i64 %add2 > %1 = load float* %arrayidx9, align 4 > %add10 = fadd float %0, %1 > %arrayidx11 = getelementptr inbounds float* %c, i64 %add2 > store float %add10, float* %arrayidx11, align 4 > %arrayidx12 = getelementptr inbounds float* %a, i64 %add8 > %2 = load float* %arrayidx12, align 4 > %arrayidx13 = getelementptr inbounds float* %b, i64 %add8 > %3 = load float* %arrayidx13, align 4 > %add14 = fadd float %2, %3 > %arrayidx15 = getelementptr inbounds float* %c, i64 %add8 > store float %add14, float* %arrayidx15, align 4 > %inc = add i64 %i.015, 1 > %exitcond = icmp eq i64 %inc, %end > br i1 %exitcond, label %for.end, label %for.body > > for.end: ; preds = %for.body, %entry > ret void > > Why is it saying Bad stride?Are the 'rem' and 'div' instruction asking too much from it? > > The code should be vectorizable. Here the index access for start=0, end=16: > > loop count i = 0 index_0 = 0 index_1 = 4 > loop count i = 1 index_0 = 1 index_1 = 5 > loop count i = 2 index_0 = 2 index_1 = 6 > loop count i = 3 index_0 = 3 index_1 = 7 > loop count i = 4 index_0 = 8 index_1 = 12 > loop count i = 5 index_0 = 9 index_1 = 13 > loop count i = 6 index_0 = 10 index_1 = 14 > loop count i = 7 index_0 = 11 index_1 = 15 > loop count i = 8 index_0 = 16 index_1 = 20 > loop count i = 9 index_0 = 17 index_1 = 21 > loop count i = 10 index_0 = 18 index_1 = 22 > loop count i = 11 index_0 = 19 index_1 = 23 > loop count i = 12 index_0 = 24 index_1 = 28 > loop count i = 13 index_0 = 25 index_1 = 29 > loop count i = 14 index_0 = 26 index_1 = 30 > loop count i = 15 index_0 = 27 index_1 = 31 > > Any ideas? > > Frank > > > > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On 30 October 2013 09:25, Nadav Rotem <nrotem at apple.com> wrote:> The access pattern to arrays a and b is non-linear. Unrolled loops are > usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all > values for i ? >Based on his list of values, it seems that the induction stride is linear within each block of 4 iterations, but it's not a clear relationship. As you say, it should be possible to spot that once the loop is unrolled, and get the SLP to vectorize if the relationship becomes clear. Maybe I'm wrong, but this looks like a problem of missed opportunities, not technically hard to implement. --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131030/bd2d9130/attachment.html>