thr3ads.net - llvm dev - [LLVMdev] loop vectorizer [Oct 2013]

If this information is useful, please help other people find it:
Share via:

Frank Winter

2013-Oct-30 16:05 UTC

[LLVMdev] loop vectorizer

The loop vectorizer seems to be not able to vectorize the following code:

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  
c, float * __restrict__ a, float * __restrict__ b)
{
   const std::uint64_t inner = 4;
   for (std::uint64_t i = start ; i < end ; ++i ) {
     const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
     const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
     c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
     c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
   }
}

LV: Found a loop: for.body
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer   %arrayidx11 = getelementptr 
inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw>
LV: Bad stride - Not an AddRecExpr pointer   %arrayidx15 = getelementptr 
inbounds float* %c, i64 %add8 SCEV: ((4 * %add8)<nsw> + %c)<nsw>
LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 *
%add8)<nsw> +
%c)<nsw>(Induction step: 0)
LV: Distance for   store float %add10, float* %arrayidx11, align 4 to   
store float %add14, float* %arrayidx15, align 4: ((4 * %add8)<nsw> + (-4 
* %add2))
Non-consecutive pointer access
LV: We don't need a runtime memory check.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

Here the code:

entry:
   %cmp14 = icmp ult i64 %start, %end
   br i1 %cmp14, label %for.body, label %for.end

for.body:                                         ; preds = %entry, 
%for.body
   %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
   %div = lshr i64 %i.015, 2
   %mul = shl i64 %div, 3
   %rem = and i64 %i.015, 3
   %add2 = or i64 %mul, %rem
   %add8 = or i64 %add2, 4
   %arrayidx = getelementptr inbounds float* %a, i64 %add2
   %0 = load float* %arrayidx, align 4
   %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
   %1 = load float* %arrayidx9, align 4
   %add10 = fadd float %0, %1
   %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
   store float %add10, float* %arrayidx11, align 4
   %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
   %2 = load float* %arrayidx12, align 4
   %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
   %3 = load float* %arrayidx13, align 4
   %add14 = fadd float %2, %3
   %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
   store float %add14, float* %arrayidx15, align 4
   %inc = add i64 %i.015, 1
   %exitcond = icmp eq i64 %inc, %end
   br i1 %exitcond, label %for.end, label %for.body

for.end:                                          ; preds = %for.body, 
%entry
   ret void

Why is it saying Bad stride?Are the 'rem' and 'div' instruction
asking
too much from it?

The code should be vectorizable. Here the index access for start=0, end=16:

loop count i = 0   index_0 = 0   index_1 = 4
loop count i = 1   index_0 = 1   index_1 = 5
loop count i = 2   index_0 = 2   index_1 = 6
loop count i = 3   index_0 = 3   index_1 = 7
loop count i = 4   index_0 = 8   index_1 = 12
loop count i = 5   index_0 = 9   index_1 = 13
loop count i = 6   index_0 = 10   index_1 = 14
loop count i = 7   index_0 = 11   index_1 = 15
loop count i = 8   index_0 = 16   index_1 = 20
loop count i = 9   index_0 = 17   index_1 = 21
loop count i = 10   index_0 = 18   index_1 = 22
loop count i = 11   index_0 = 19   index_1 = 23
loop count i = 12   index_0 = 24   index_1 = 28
loop count i = 13   index_0 = 25   index_1 = 29
loop count i = 14   index_0 = 26   index_1 = 30
loop count i = 15   index_0 = 27   index_1 = 31

Any ideas?

Frank

Nadav Rotem

2013-Oct-30 16:25 UTC

head link

[LLVMdev] loop vectorizer

Hi Frank, 

The access pattern to arrays a and b is non-linear.  Unrolled loops are usually
handled by the SLP-vectorizer.  Are ir0 and ir1 consecutive for all values for i
?
 
Thanks,
Nadav

On Oct 30, 2013, at 9:05 AM, Frank Winter <fwinter at jlab.org> wrote:
> The loop vectorizer seems to be not able to vectorize the following code:
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c,
float * __restrict__ a, float * __restrict__ b)
> {
>  const std::uint64_t inner = 4;
>  for (std::uint64_t i = start ; i < end ; ++i ) {
>    const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
>    const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
>    c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>    c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
>  }
> }
> 
> LV: Found a loop: for.body
> LV: Found an induction variable.
> LV: We need to do 0 pointer comparisons.
> LV: Checking memory dependencies
> LV: Bad stride - Not an AddRecExpr pointer   %arrayidx11 = getelementptr
inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw>
> LV: Bad stride - Not an AddRecExpr pointer   %arrayidx15 = getelementptr
inbounds float* %c, i64 %add8 SCEV: ((4 * %add8)<nsw> + %c)<nsw>
> LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 *
%add8)<nsw> + %c)<nsw>(Induction step: 0)
> LV: Distance for   store float %add10, float* %arrayidx11, align 4 to  
store float %add14, float* %arrayidx15, align 4: ((4 * %add8)<nsw> + (-4 *
%add2))
> Non-consecutive pointer access
> LV: We don't need a runtime memory check.
> LV: Can't vectorize due to memory conflicts
> LV: Not vectorizing.
> 
> Here the code:
> 
> entry:
>  %cmp14 = icmp ult i64 %start, %end
>  br i1 %cmp14, label %for.body, label %for.end
> 
> for.body:                                         ; preds = %entry,
%for.body
>  %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
>  %div = lshr i64 %i.015, 2
>  %mul = shl i64 %div, 3
>  %rem = and i64 %i.015, 3
>  %add2 = or i64 %mul, %rem
>  %add8 = or i64 %add2, 4
>  %arrayidx = getelementptr inbounds float* %a, i64 %add2
>  %0 = load float* %arrayidx, align 4
>  %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
>  %1 = load float* %arrayidx9, align 4
>  %add10 = fadd float %0, %1
>  %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
>  store float %add10, float* %arrayidx11, align 4
>  %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
>  %2 = load float* %arrayidx12, align 4
>  %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
>  %3 = load float* %arrayidx13, align 4
>  %add14 = fadd float %2, %3
>  %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
>  store float %add14, float* %arrayidx15, align 4
>  %inc = add i64 %i.015, 1
>  %exitcond = icmp eq i64 %inc, %end
>  br i1 %exitcond, label %for.end, label %for.body
> 
> for.end:                                          ; preds = %for.body,
%entry
>  ret void
> 
> Why is it saying Bad stride?Are the 'rem' and 'div'
instruction asking too much from it?
> 
> The code should be vectorizable. Here the index access for start=0, end=16:
> 
> loop count i = 0   index_0 = 0   index_1 = 4
> loop count i = 1   index_0 = 1   index_1 = 5
> loop count i = 2   index_0 = 2   index_1 = 6
> loop count i = 3   index_0 = 3   index_1 = 7
> loop count i = 4   index_0 = 8   index_1 = 12
> loop count i = 5   index_0 = 9   index_1 = 13
> loop count i = 6   index_0 = 10   index_1 = 14
> loop count i = 7   index_0 = 11   index_1 = 15
> loop count i = 8   index_0 = 16   index_1 = 20
> loop count i = 9   index_0 = 17   index_1 = 21
> loop count i = 10   index_0 = 18   index_1 = 22
> loop count i = 11   index_0 = 19   index_1 = 23
> loop count i = 12   index_0 = 24   index_1 = 28
> loop count i = 13   index_0 = 25   index_1 = 29
> loop count i = 14   index_0 = 26   index_1 = 30
> loop count i = 15   index_0 = 27   index_1 = 31
> 
> Any ideas?
> 
> Frank
> 
> 
> 
> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Renato Golin

2013-Oct-30 17:28 UTC

head link

[LLVMdev] loop vectorizer

On 30 October 2013 09:25, Nadav Rotem <nrotem at apple.com> wrote:
> The access pattern to arrays a and b is non-linear.  Unrolled loops are
> usually handled by the SLP-vectorizer.  Are ir0 and ir1 consecutive for all
> values for i ?
>
Based on his list of values, it seems that the induction stride is linear
within each block of 4 iterations, but it's not a clear relationship.

As you say, it should be possible to spot that once the loop is unrolled,
and get the SLP to vectorize if the relationship becomes clear.

Maybe I'm wrong, but this looks like a problem of missed opportunities, not
technically hard to implement.

--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131030/bd2d9130/attachment.html>

Seemingly Similar Threads

Search for more seemingly similar threads

llvm dev - Oct 2013 - [LLVMdev] loop vectorizer

[LLVMdev] loop vectorizer

[LLVMdev] loop vectorizer

[LLVMdev] loop vectorizer

Seemingly Similar Threads