Frank Winter
2013-Oct-26  18:40 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Arnold,

adding '-debug-only=loop-vectorize' to the command gives:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr:   %7 = load float** %6
LV: Found an unidentified read ptr:   %10 = load float** %9
LV: Found an unidentified read ptr:   %13 = load float** %12
LV: We need to do 2 pointer comparisons.
LV: We can't vectorize because we can't find the array bounds.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

It can't find the loop bounds if we use the overflow version of add.
That's a good point. I should mark this addition to not overflow.

When using the non-overflow version I get:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr:   %7 = load float** %6
LV: Found an unidentified read ptr:   %10 = load float** %9
LV: Found an unidentified read ptr:   %13 = load float** %12
LV: Found a runtime check ptr:  %20 = getelementptr float* %7, i32 %14
LV: Found a runtime check ptr:  %15 = getelementptr float* %10, i32 %14
LV: Found a runtime check ptr:  %17 = getelementptr float* %13, i32 %14
LV: We need to do 2 pointer comparisons.
LV: We can perform a memory runtime check if needed.
LV: We need a runtime memory check.
LV: We can vectorize this loop (with a runtime bound check)!
LV: Found trip count: 0
LV: The Widest type: 32 bits.
LV: The Widest register is: 32 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %15 = getelementptr float* %10, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction:   %16 = load float* %15
LV: Found an estimated cost of 0 for VF 1 For instruction:   %17 = getelementptr float* %13, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction:   %18 = load float* %17
LV: Found an estimated cost of 1 for VF 1 For instruction:   %19 = fmul float %18, %16
LV: Found an estimated cost of 0 for VF 1 For instruction:   %20 = getelementptr float* %7, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction:   store float %19, float* %20
LV: Found an estimated cost of 1 for VF 1 For instruction:   %21 = add nsw i32 %14, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %22 = icmp sge i32 %21, %4
LV: Found an estimated cost of 1 for VF 1 For instruction:   br i1 %22, label %L1, label %L0
LV: Scalar loop costs: 7.
LV: Selecting VF = : 1.
LV: The target has 8 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 2
LV(REG): At #4 Interval # 3
LV(REG): At #5 Interval # 3
LV(REG): At #6 Interval # 2
LV(REG): At #8 Interval # 1
LV(REG): At #9 Interval # 1
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 5
LV(REG): LoopSize: 11
LV: Vectorization is possible but not beneficial.
LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
LV: Unroll Factor is 1

It's not beneficial? I didn't expect that. Do you have a descriptive
explanation why it's not beneficial?

Frank

On 26/10/13 13:03, Arnold wrote:
> Hi Frank,
>
> Sent from my iPhone
>
>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
>>
>> My function implements a simple loop:
>>
>> void bar( int start, int end, float* A, float* B, float* C)
>> {
>>      for (int i=start; i<end;++i)
>>         A[i] = B[i] * C[i];
>> }
>>
>> This looks pretty much like the standard example. However, I built the function
>> with the IRBuilder, thus not coming from C and clang. Also I changed slightly
>> the function's signature:
>>
>> define void @bar([8 x i8]* %arg_ptr) {
>> entrypoint:
>>    %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>    %1 = load i32* %0
>>    %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>    %3 = bitcast [8 x i8]* %2 to i32*
>>    %4 = load i32* %3
>>    %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>    %6 = bitcast [8 x i8]* %5 to float**
>>    %7 = load float** %6
>>    %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>    %9 = bitcast [8 x i8]* %8 to float**
>>    %10 = load float** %9
>>    %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>    %12 = bitcast [8 x i8]* %11 to float**
>>    %13 = load float** %12
>>    br label %L0
>>
>> L0:                                               ; preds = %L0, %entrypoint
>>    %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>    %15 = getelementptr float* %10, i32 %14
>>    %16 = load float* %15
>>    %17 = getelementptr float* %13, i32 %14
>>    %18 = load float* %17
>>    %19 = fmul float %18, %16
>>    %20 = getelementptr float* %7, i32 %14
>>    store float %19, float* %20
>>    %21 = add i32 %14, 1
> Try
> %21 = add nsw i32 %14, 1
> instead for no-signed wrapping arithmetic.
>
> If that is not working please post the output of opt ... -debug-only=loop-vectorize ...
>
>>    %22 = icmp sge i32 %21, %4
>>    br i1 %22, label %L1, label %L0
>>
>> L1:                                               ; preds = %L0
>>    ret void
>> }
>>
>>
>> As you can see, I use the phi instruction for the loop index. I notice
>> that clang prefers stack allocation. So, I am not sure what's the
>> problem that the loop vectorizer is not working here.
>> I tried many things, like specifying an architecture with vector
>> units, enforcing the vector width. No success.
>>
>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>
>> The only explanation I have is the use of the phi instruction. Is this
>> preventing to vectorize the loop?
>>
>> Frank
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
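
The log lines "We need to do 2 pointer comparisons" and "We can vectorize this loop (with a runtime bound check)" refer to an overlap test between the store range and each of the two load ranges. A minimal C sketch of that idea follows. The names `ranges_overlap` and `bar_checked` are hypothetical, not LLVM APIs, and the check LLVM actually emits works on IR and may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the byte ranges [p, p+n) and [q, q+n), each holding n floats,
   overlap. Addresses are compared as integers to avoid relational
   comparison of pointers into unrelated objects. */
static bool ranges_overlap(const float *p, const float *q, size_t n)
{
    uintptr_t p_lo = (uintptr_t)p, p_hi = p_lo + n * sizeof(float);
    uintptr_t q_lo = (uintptr_t)q, q_hi = q_lo + n * sizeof(float);
    return p_lo < q_hi && q_lo < p_hi;
}

/* Hypothetical guarded form of bar(): the store range (A) is tested
   against each load range (B and C), i.e. two comparisons. */
void bar_checked(int start, int end, float *A, const float *B, const float *C)
{
    if (end <= start)
        return;
    size_t n = (size_t)((long long)end - (long long)start);

    if (!ranges_overlap(A + start, B + start, n) &&
        !ranges_overlap(A + start, C + start, n)) {
        /* No overlap: iterations are independent, so this copy of the
           loop is the one a vectorizer could safely widen. */
        for (int i = start; i < end; ++i)
            A[i] = B[i] * C[i];
    } else {
        /* Possible overlap: keep strict scalar order. */
        for (int i = start; i < end; ++i)
            A[i] = B[i] * C[i];
    }
}
```

Computing the range ends requires knowing that the index does not wrap between `start` and `end`, which is presumably why the first log (plain `add`) reports that the array bounds cannot be found while the `add nsw` version proceeds to the runtime check.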
Hal Finkel
2013-Oct-26  19:08 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
----- Original Message -----
> It's not beneficial? I didn't expect that. Do you have a descriptive
> explanation why it's not beneficial?

It looks like the vectorizer is not picking up a TTI implementation from a target with vector registers (likely, you're just seeing the basic cost model). For what target is this?

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
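
One way to read the log in light of this reply: it reports a widest type of 32 bits and a widest register of 32 bits, and only VF 1 is ever costed. The sketch below assumes the vectorizer bounds the vectorization factor by register width divided by element width, which is a simplification of what LLVM does. `max_vector_factor`, `vf1_costs`, and `scalar_loop_cost` are illustrative names, and the 128-bit and 256-bit widths mentioned afterwards are the usual SSE and AVX register sizes, assumed here for comparison.

```c
#include <stddef.h>

/* Simplified upper bound on the vectorization factor: how many elements
   of the widest type fit in the widest register, and never less than 1. */
static unsigned max_vector_factor(unsigned register_bits, unsigned type_bits)
{
    unsigned vf = register_bits / type_bits;
    return vf == 0 ? 1 : vf;
}

/* Per-instruction costs reported in the log for VF 1, in loop order:
   phi, gep, load, gep, load, fmul, gep, store, add, icmp, br. */
static const unsigned vf1_costs[] = {0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1};

/* Sum of the per-instruction costs above. */
static unsigned scalar_loop_cost(void)
{
    unsigned total = 0;
    for (size_t i = 0; i < sizeof vf1_costs / sizeof vf1_costs[0]; ++i)
        total += vf1_costs[i];
    return total;
}
```

With the values from the log, `max_vector_factor(32, 32)` is 1, which is consistent with "Selecting VF = : 1", and the cost table sums to the reported scalar loop cost of 7. Under the same assumption, a 128-bit register would allow a factor of 4 and a 256-bit register a factor of 8 for `float`.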
Frank Winter
2013-Oct-26  19:16 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Hal!
I am using the 'x86_64' target. Below is the complete module dump, and
here is the command line:
opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
Frank
; ModuleID = 'test.ll'
target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-elf"
define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
   %0 = bitcast [8 x i8]* %arg_ptr to i32*
   %1 = load i32* %0
   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
   %3 = bitcast [8 x i8]* %2 to i32*
   %4 = load i32* %3
   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
   %6 = bitcast [8 x i8]* %5 to float**
   %7 = load float** %6
   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
   %9 = bitcast [8 x i8]* %8 to float**
   %10 = load float** %9
   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
   %12 = bitcast [8 x i8]* %11 to float**
   %13 = load float** %12
   br label %L0
L0:                                               ; preds = %L0, %entrypoint
   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
   %15 = getelementptr float* %10, i32 %14
   %16 = load float* %15
   %17 = getelementptr float* %13, i32 %14
   %18 = load float* %17
   %19 = fmul float %18, %16
   %20 = getelementptr float* %7, i32 %14
   store float %19, float* %20
   %21 = add nsw i32 %14, 1
   %22 = icmp sge i32 %21, %4
   br i1 %22, label %L1, label %L0
L1:                                               ; preds = %L0
   ret void
}
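
For readers more used to C than IR, the module above can be paraphrased roughly as follows. This is a sketch under assumptions: `slot_t` and `bar_packed` are hypothetical names, pointers are assumed to fit in one 8-byte slot (as on x86_64), and which of the two loaded arrays is called `B` or `C` is arbitrary.

```c
#include <string.h>

/* One 8-byte argument slot, mirroring the [8 x i8] element type. */
typedef struct { unsigned char bytes[8]; } slot_t;

_Static_assert(sizeof(float *) <= sizeof(slot_t), "pointer must fit in a slot");

void bar_packed(const slot_t *args)
{
    int start, end;
    float *A, *B, *C;

    /* Slots 0 and 1 hold the i32 bounds (%1 and %4 in the IR). */
    memcpy(&start, &args[0], sizeof start);
    memcpy(&end,   &args[1], sizeof end);
    /* Slots 2, 3, 4 hold the store pointer and the two load pointers
       (%7, %10, %13 in the IR). */
    memcpy(&A, &args[2], sizeof A);
    memcpy(&B, &args[3], sizeof B);
    memcpy(&C, &args[4], sizeof C);

    /* The IR tests the exit condition after the increment, so the body
       runs at least once, like a do-while loop. */
    int i = start;
    do {
        A[i] = C[i] * B[i];
        ++i;
    } while (i < end);
}
```

Because the three pointers are loaded from memory rather than passed as function arguments, nothing in the IR tells the vectorizer that they are distinct, which fits the "unidentified write ptr" and "unidentified read ptr" lines in the log.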