The loop vectorizer seems to be not able to vectorize the following code:
void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
c, float * __restrict__ a, float * __restrict__ b)
{
const std::uint64_t inner = 4;
for (std::uint64_t i = start ; i < end ; ++i ) {
const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
}
}
LV: Found a loop: for.body
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer %arrayidx11 = getelementptr
inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw>
LV: Bad stride - Not an AddRecExpr pointer %arrayidx15 = getelementptr
inbounds float* %c, i64 %add8 SCEV: ((4 * %add8)<nsw> + %c)<nsw>
LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 *
%add8)<nsw> +
%c)<nsw>(Induction step: 0)
LV: Distance for store float %add10, float* %arrayidx11, align 4 to
store float %add14, float* %arrayidx15, align 4: ((4 * %add8)<nsw> + (-4
* %add2))
Non-consecutive pointer access
LV: We don't need a runtime memory check.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.
Here the code:
entry:
%cmp14 = icmp ult i64 %start, %end
br i1 %cmp14, label %for.body, label %for.end
for.body: ; preds = %entry,
%for.body
%i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
%div = lshr i64 %i.015, 2
%mul = shl i64 %div, 3
%rem = and i64 %i.015, 3
%add2 = or i64 %mul, %rem
%add8 = or i64 %add2, 4
%arrayidx = getelementptr inbounds float* %a, i64 %add2
%0 = load float* %arrayidx, align 4
%arrayidx9 = getelementptr inbounds float* %b, i64 %add2
%1 = load float* %arrayidx9, align 4
%add10 = fadd float %0, %1
%arrayidx11 = getelementptr inbounds float* %c, i64 %add2
store float %add10, float* %arrayidx11, align 4
%arrayidx12 = getelementptr inbounds float* %a, i64 %add8
%2 = load float* %arrayidx12, align 4
%arrayidx13 = getelementptr inbounds float* %b, i64 %add8
%3 = load float* %arrayidx13, align 4
%add14 = fadd float %2, %3
%arrayidx15 = getelementptr inbounds float* %c, i64 %add8
store float %add14, float* %arrayidx15, align 4
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %end
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body,
%entry
ret void
Why is it saying Bad stride?Are the 'rem' and 'div' instruction
asking
too much from it?
The code should be vectorizable. Here the index access for start=0, end=16:
loop count i = 0 index_0 = 0 index_1 = 4
loop count i = 1 index_0 = 1 index_1 = 5
loop count i = 2 index_0 = 2 index_1 = 6
loop count i = 3 index_0 = 3 index_1 = 7
loop count i = 4 index_0 = 8 index_1 = 12
loop count i = 5 index_0 = 9 index_1 = 13
loop count i = 6 index_0 = 10 index_1 = 14
loop count i = 7 index_0 = 11 index_1 = 15
loop count i = 8 index_0 = 16 index_1 = 20
loop count i = 9 index_0 = 17 index_1 = 21
loop count i = 10 index_0 = 18 index_1 = 22
loop count i = 11 index_0 = 19 index_1 = 23
loop count i = 12 index_0 = 24 index_1 = 28
loop count i = 13 index_0 = 25 index_1 = 29
loop count i = 14 index_0 = 26 index_1 = 30
loop count i = 15 index_0 = 27 index_1 = 31
Any ideas?
Frank
Hi Frank, The access pattern to arrays a and b is non-linear. Unrolled loops are usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all values for i ? Thanks, Nadav On Oct 30, 2013, at 9:05 AM, Frank Winter <fwinter at jlab.org> wrote:> The loop vectorizer seems to be not able to vectorize the following code: > > void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) > { > const std::uint64_t inner = 4; > for (std::uint64_t i = start ; i < end ; ++i ) { > const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4; > const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4; > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > c[ ir1 ] = a[ ir1 ] + b[ ir1 ]; > } > } > > LV: Found a loop: for.body > LV: Found an induction variable. > LV: We need to do 0 pointer comparisons. > LV: Checking memory dependencies > LV: Bad stride - Not an AddRecExpr pointer %arrayidx11 = getelementptr inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw> > LV: Bad stride - Not an AddRecExpr pointer %arrayidx15 = getelementptr inbounds float* %c, i64 %add8 SCEV: ((4 * %add8)<nsw> + %c)<nsw> > LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 * %add8)<nsw> + %c)<nsw>(Induction step: 0) > LV: Distance for store float %add10, float* %arrayidx11, align 4 to store float %add14, float* %arrayidx15, align 4: ((4 * %add8)<nsw> + (-4 * %add2)) > Non-consecutive pointer access > LV: We don't need a runtime memory check. > LV: Can't vectorize due to memory conflicts > LV: Not vectorizing. > > Here the code: > > entry: > %cmp14 = icmp ult i64 %start, %end > br i1 %cmp14, label %for.body, label %for.end > > for.body: ; preds = %entry, %for.body > %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ] > %div = lshr i64 %i.015, 2 > %mul = shl i64 %div, 3 > %rem = and i64 %i.015, 3 > %add2 = or i64 %mul, %rem > %add8 = or i64 %add2, 4 > %arrayidx = getelementptr inbounds float* %a, i64 %add2 > %0 = load float* %arrayidx, align 4 > %arrayidx9 = getelementptr inbounds float* %b, i64 %add2 > %1 = load float* %arrayidx9, align 4 > %add10 = fadd float %0, %1 > %arrayidx11 = getelementptr inbounds float* %c, i64 %add2 > store float %add10, float* %arrayidx11, align 4 > %arrayidx12 = getelementptr inbounds float* %a, i64 %add8 > %2 = load float* %arrayidx12, align 4 > %arrayidx13 = getelementptr inbounds float* %b, i64 %add8 > %3 = load float* %arrayidx13, align 4 > %add14 = fadd float %2, %3 > %arrayidx15 = getelementptr inbounds float* %c, i64 %add8 > store float %add14, float* %arrayidx15, align 4 > %inc = add i64 %i.015, 1 > %exitcond = icmp eq i64 %inc, %end > br i1 %exitcond, label %for.end, label %for.body > > for.end: ; preds = %for.body, %entry > ret void > > Why is it saying Bad stride?Are the 'rem' and 'div' instruction asking too much from it? > > The code should be vectorizable. Here the index access for start=0, end=16: > > loop count i = 0 index_0 = 0 index_1 = 4 > loop count i = 1 index_0 = 1 index_1 = 5 > loop count i = 2 index_0 = 2 index_1 = 6 > loop count i = 3 index_0 = 3 index_1 = 7 > loop count i = 4 index_0 = 8 index_1 = 12 > loop count i = 5 index_0 = 9 index_1 = 13 > loop count i = 6 index_0 = 10 index_1 = 14 > loop count i = 7 index_0 = 11 index_1 = 15 > loop count i = 8 index_0 = 16 index_1 = 20 > loop count i = 9 index_0 = 17 index_1 = 21 > loop count i = 10 index_0 = 18 index_1 = 22 > loop count i = 11 index_0 = 19 index_1 = 23 > loop count i = 12 index_0 = 24 index_1 = 28 > loop count i = 13 index_0 = 25 index_1 = 29 > loop count i = 14 index_0 = 26 index_1 = 30 > loop count i = 15 index_0 = 27 index_1 = 31 > > Any ideas? > > Frank > > > > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On 30 October 2013 09:25, Nadav Rotem <nrotem at apple.com> wrote:> The access pattern to arrays a and b is non-linear. Unrolled loops are > usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all > values for i ? >Based on his list of values, it seems that the induction stride is linear within each block of 4 iterations, but it's not a clear relationship. As you say, it should be possible to spot that once the loop is unrolled, and get the SLP to vectorize if the relationship becomes clear. Maybe I'm wrong, but this looks like a problem of missed opportunities, not technically hard to implement. --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131030/bd2d9130/attachment.html>