thr3ads.net - llvm dev - [LLVMdev] Why is the loop vectorizer not working on my function? [Oct 2013]

Frank Winter

2013-Oct-26 15:03 UTC

[LLVMdev] Why is the loop vectorizer not working on my function?

My function implements a simple loop:

void bar( int start, int end, float* A, float* B, float* C)
{
     for (int i=start; i<end;++i)
        A[i] = B[i] * C[i];
}

This looks pretty much like the standard example. However, I built the
function with the IRBuilder rather than compiling it from C with clang,
and I changed the function's signature slightly:

define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
   %0 = bitcast [8 x i8]* %arg_ptr to i32*
   %1 = load i32* %0
   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
   %3 = bitcast [8 x i8]* %2 to i32*
   %4 = load i32* %3
   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
   %6 = bitcast [8 x i8]* %5 to float**
   %7 = load float** %6
   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
   %9 = bitcast [8 x i8]* %8 to float**
   %10 = load float** %9
   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
   %12 = bitcast [8 x i8]* %11 to float**
   %13 = load float** %12
   br label %L0

L0:                                               ; preds = %L0, %entrypoint
   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
   %15 = getelementptr float* %10, i32 %14
   %16 = load float* %15
   %17 = getelementptr float* %13, i32 %14
   %18 = load float* %17
   %19 = fmul float %18, %16
   %20 = getelementptr float* %7, i32 %14
   store float %19, float* %20
   %21 = add i32 %14, 1
   %22 = icmp sge i32 %21, %4
   br i1 %22, label %L1, label %L0

L1:                                               ; preds = %L0
   ret void
}
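
For comparison, here is a rough C rendering of the IR above. It is a sketch under assumptions: the names slot_t, pack_args and bar_packed are hypothetical, and the mapping of slots 2, 3 and 4 to A, B and C is inferred from which loaded pointer is stored through. One detail it makes visible is that the posted loop tests its exit condition at the bottom, so the body runs once even when start >= end, unlike the C for loop.

```c
#include <assert.h>
#include <string.h>

/* One argument slot, mirroring the [8 x i8] element type in the IR. */
typedef unsigned char slot_t[8];

/* Hypothetical helper (not in the original post): packs the five
   arguments into consecutive 8-byte slots. */
void pack_args(slot_t *args, int start, int end,
               float *A, float *B, float *C)
{
    assert(sizeof(float *) <= sizeof(slot_t)); /* a pointer must fit a slot */
    memset(args, 0, 5 * sizeof(slot_t));
    memcpy(args[0], &start, sizeof start);
    memcpy(args[1], &end, sizeof end);
    memcpy(args[2], &A, sizeof A);
    memcpy(args[3], &B, sizeof B);
    memcpy(args[4], &C, sizeof C);
}

/* C rendering of @bar: unpack the slots, then run a bottom-tested loop. */
void bar_packed(slot_t *arg_ptr)
{
    int start, end;
    float *A, *B, *C;
    memcpy(&start, arg_ptr[0], sizeof start); /* %1  */
    memcpy(&end,   arg_ptr[1], sizeof end);   /* %4  */
    memcpy(&A,     arg_ptr[2], sizeof A);     /* %7  (stored through) */
    memcpy(&B,     arg_ptr[3], sizeof B);     /* %10 (loaded from)    */
    memcpy(&C,     arg_ptr[4], sizeof C);     /* %13 (loaded from)    */

    int i = start;                            /* %14, the phi         */
    do {
        A[i] = C[i] * B[i];                   /* %19 = fmul %18, %16  */
        ++i;                                  /* %21                  */
    } while (!(i >= end));                    /* %22 = icmp sge       */
}
```

Calling pack_args and then bar_packed with start = 1 and end = 3 writes A[1] and A[2] only; with start = end = 3 it still writes A[3] once.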


As you can see, I use a phi instruction for the loop index, whereas I
notice that clang prefers stack allocation. I am not sure why the loop
vectorizer does not work here. I tried many things, such as specifying
an architecture with vector units and forcing the vector width, without
success.

opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll

The only explanation I have is the use of the phi instruction. Does it
prevent the loop from being vectorized?

Frank

Arnold

2013-Oct-26 17:03 UTC

[LLVMdev] Why is the loop vectorizer not working on my function?

Hi Frank,

> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
> [...]
>  %21 = add i32 %14, 1

Try
%21 = add nsw i32 %14, 1
instead for no-signed-wrap arithmetic.

If that does not work, please post the output of
opt ... -debug-only=loop-vectorize ...
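
To illustrate why the wrap flag matters, here is a small C model of the loop's trip count. It is a sketch under assumptions: the function names are hypothetical, and 8-bit integers stand in for i32 so that wraparound is easy to reach. With a plain add, a start value at the type's maximum wraps to the minimum and the loop keeps running. With nsw that input is ruled out, and the trip count reduces to max(end - start, 1).

```c
#include <assert.h>
#include <stdint.h>

/* A plain `add` wraps on overflow; modelled here on 8 bits. */
int8_t wrapping_inc(int8_t i)
{
    uint8_t u = (uint8_t)((uint8_t)i + 1u);          /* modulo-256 increment */
    return (u >= 128u) ? (int8_t)((int)u - 256) : (int8_t)u;
}

/* Iterations of the bottom-tested loop when the increment may wrap. */
int trip_count_wrapping(int8_t start, int8_t end)
{
    int trips = 0;
    int8_t i = start;
    do {
        ++trips;
        i = wrapping_inc(i);
    } while (!(i >= end));
    return trips;
}

/* Iterations when the increment is promised not to wrap (nsw). */
int trip_count_nsw(int8_t start, int8_t end)
{
    assert(start < INT8_MAX);      /* the nsw promise: start + 1 fits */
    int diff = (int)end - (int)start;
    return diff > 1 ? diff : 1;
}
```

For start = 0 and end = 4 both functions return 4. For start = 127 and end = 0 the wrapping loop runs 129 times, which no simple closed form in start and end describes.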



Frank Winter

2013-Oct-26 18:40 UTC

[LLVMdev] Why is the loop vectorizer not working on my function?

Hi Arnold,

adding '-debug-only=loop-vectorize' to the command gives:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr:   %7 = load float** %6
LV: Found an unidentified read ptr:   %10 = load float** %9
LV: Found an unidentified read ptr:   %13 = load float** %12
LV: We need to do 2 pointer comparisons.
LV: We can't vectorize because we can't find the array bounds.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

It can't find the array bounds when the add is allowed to wrap.
That's a good point. I should mark this addition as non-wrapping (nsw).

With the nsw version of the add I get:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr:   %7 = load float** %6
LV: Found an unidentified read ptr:   %10 = load float** %9
LV: Found an unidentified read ptr:   %13 = load float** %12
LV: Found a runtime check ptr:  %20 = getelementptr float* %7, i32 %14
LV: Found a runtime check ptr:  %15 = getelementptr float* %10, i32 %14
LV: Found a runtime check ptr:  %17 = getelementptr float* %13, i32 %14
LV: We need to do 2 pointer comparisons.
LV: We can perform a memory runtime check if needed.
LV: We need a runtime memory check.
LV: We can vectorize this loop (with a runtime bound check)!
LV: Found trip count: 0
LV: The Widest type: 32 bits.
LV: The Widest register is: 32 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %15 = getelementptr float* %10, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction:   %16 = load float* %15
LV: Found an estimated cost of 0 for VF 1 For instruction:   %17 = getelementptr float* %13, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction:   %18 = load float* %17
LV: Found an estimated cost of 1 for VF 1 For instruction:   %19 = fmul float %18, %16
LV: Found an estimated cost of 0 for VF 1 For instruction:   %20 = getelementptr float* %7, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction:   store float %19, float* %20
LV: Found an estimated cost of 1 for VF 1 For instruction:   %21 = add nsw i32 %14, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %22 = icmp sge i32 %21, %4
LV: Found an estimated cost of 1 for VF 1 For instruction:   br i1 %22, label %L1, label %L0
LV: Scalar loop costs: 7.
LV: Selecting VF = : 1.
LV: The target has 8 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 2
LV(REG): At #4 Interval # 3
LV(REG): At #5 Interval # 3
LV(REG): At #6 Interval # 2
LV(REG): At #8 Interval # 1
LV(REG): At #9 Interval # 1
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 5
LV(REG): LoopSize: 11
LV: Vectorization is possible but not beneficial.
LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
LV: Unroll Factor is 1
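
The "2 pointer comparisons" and "runtime bound check" lines in this log can be pictured with a C sketch. It assumes the check amounts to testing the written range against each read range, and the names ranges_overlap and no_conflict are hypothetical. It does not reproduce the code the vectorizer actually emits. The two reads are not compared with each other, since two loads cannot conflict.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when [a, a+n) and [b, b+n) share at least one element. */
bool ranges_overlap(const float *a, const float *b, size_t n)
{
    uintptr_t a0 = (uintptr_t)a, a1 = (uintptr_t)(a + n);
    uintptr_t b0 = (uintptr_t)b, b1 = (uintptr_t)(b + n);
    return a0 < b1 && b0 < a1;
}

/* Two comparisons: the store through A against each of the loads. */
bool no_conflict(const float *A, const float *B, const float *C,
                 int start, int end)
{
    assert(start < end);
    size_t n = (size_t)(end - start);
    return !ranges_overlap(A + start, B + start, n) &&
           !ranges_overlap(A + start, C + start, n);
}
```

When no_conflict returns true the vector version of the loop could run; otherwise execution would fall back to the scalar loop.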

It's not beneficial? I didn't expect that. Do you have a descriptive 
explanation why it's not beneficial?
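
One possible reading of the numbers in the log is sketched below in C. It is an assumption, not something confirmed in this thread. The per-instruction costs sum to the reported scalar cost of 7, and a widest register of 32 bits holding a widest type of 32 bits leaves room for a single float per register. The rule that the largest vector factor is register width divided by type width is assumed here, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the per-instruction cost estimates for one loop iteration. */
unsigned scalar_loop_cost(const unsigned *costs, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; ++i)
        total += costs[i];
    return total;
}

/* Assumed rule: how many elements of the widest type fit in a register. */
unsigned max_vector_factor(unsigned widest_register_bits,
                           unsigned widest_type_bits)
{
    assert(widest_type_bits > 0);
    unsigned vf = widest_register_bits / widest_type_bits;
    return vf ? vf : 1;
}
```

With the values from the log, max_vector_factor(32, 32) is 1, whereas a hypothetical 256-bit register would give max_vector_factor(256, 32) = 8.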

Frank



