Frank Winter
2013-Oct-26 19:16 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Hal!
I am using the 'x86_64' target. Below is the complete module dump, and here
is the command line:
opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
Frank
; ModuleID = 'test.ll'
target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-elf"
define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
%0 = bitcast [8 x i8]* %arg_ptr to i32*
%1 = load i32* %0
%2 = getelementptr [8 x i8]* %arg_ptr, i32 1
%3 = bitcast [8 x i8]* %2 to i32*
%4 = load i32* %3
%5 = getelementptr [8 x i8]* %arg_ptr, i32 2
%6 = bitcast [8 x i8]* %5 to float**
%7 = load float** %6
%8 = getelementptr [8 x i8]* %arg_ptr, i32 3
%9 = bitcast [8 x i8]* %8 to float**
%10 = load float** %9
%11 = getelementptr [8 x i8]* %arg_ptr, i32 4
%12 = bitcast [8 x i8]* %11 to float**
%13 = load float** %12
br label %L0
L0: ; preds = %L0, %entrypoint
%14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
%15 = getelementptr float* %10, i32 %14
%16 = load float* %15
%17 = getelementptr float* %13, i32 %14
%18 = load float* %17
%19 = fmul float %18, %16
%20 = getelementptr float* %7, i32 %14
store float %19, float* %20
%21 = add nsw i32 %14, 1
%22 = icmp sge i32 %21, %4
br i1 %22, label %L1, label %L0
L1: ; preds = %L0
ret void
}
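For readers mapping the module dump back to the original C loop, here is a minimal sketch of the calling convention the IR appears to use: every argument sits in its own 8-byte slot of `%arg_ptr` (slot 0 = start, 1 = end, 2 = A, 3 = B, 4 = C). The names `arg_slot`, `put_arg` and `bar_packed` are made up for illustration and do not come from the thread; the sketch also assumes pointers are at most 8 bytes wide.

```c
#include <stdint.h>
#include <string.h>

/* One 8-byte slot per argument, mirroring the [8 x i8] elements in the IR. */
typedef struct { unsigned char bytes[8]; } arg_slot;

/* Hypothetical helper: copy a value into the start of a zeroed slot. */
static void put_arg(arg_slot *slot, const void *value, size_t size) {
    memset(slot->bytes, 0, sizeof slot->bytes);
    memcpy(slot->bytes, value, size);
}

/* Scalar equivalent of @bar: reads start, end, A, B, C from slots 0..4. */
void bar_packed(const arg_slot *args) {
    int32_t start, end;
    float *A, *B, *C;
    memcpy(&start, args[0].bytes, sizeof start);
    memcpy(&end,   args[1].bytes, sizeof end);
    memcpy(&A,     args[2].bytes, sizeof A);
    memcpy(&B,     args[3].bytes, sizeof B);
    memcpy(&C,     args[4].bytes, sizeof C);

    /* The IR tests the exit condition at the bottom of the loop,
       so the body runs at least once, unlike the C for-loop. */
    int32_t i = start;
    do {
        A[i] = C[i] * B[i];
        ++i;
    } while (i < end);
}
```

A caller would fill five slots with `put_arg` and pass the array to `bar_packed`. Note the do-while shape: it matches the branch at the end of block `L0`, not the `i<end` test at the top of the C source.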
On 26/10/13 15:08, Hal Finkel wrote:
> ----- Original Message -----
>> Hi Arnold,
>>
>> adding '-debug-only=loop-vectorize' to the command gives:
>>
>> LV: Checking a loop in "bar"
>> LV: Found a loop: L0
>> LV: Found an induction variable.
>> LV: Found an unidentified write ptr: %7 = load float** %6
>> LV: Found an unidentified read ptr: %10 = load float** %9
>> LV: Found an unidentified read ptr: %13 = load float** %12
>> LV: We need to do 2 pointer comparisons.
>> LV: We can't vectorize because we can't find the array bounds.
>> LV: Can't vectorize due to memory conflicts
>> LV: Not vectorizing.
>>
>> It can't find the loop bounds if we use the overflow version of add.
>> That's a good point. I should mark this addition to not overflow.
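One likely reason the flag matters, sketched in C under the assumption that the 32-bit index is sign-extended to 64 bits before it is used as an address offset on this target: if the 32-bit add is allowed to wrap, "next index" is no longer simply "current index plus one" once widened, so the accessed range cannot be described as a plain start-to-end interval. The function names below are illustrative only.

```c
#include <stdint.h>

/* Emulates a plain `add i32` (two's-complement wraparound) without
   relying on signed overflow, which is undefined in C. */
int32_t wrapping_add_i32(int32_t a, int32_t b) {
    uint32_t u = (uint32_t)a + (uint32_t)b;
    if (u <= (uint32_t)INT32_MAX)
        return (int32_t)u;
    /* Map the upper half of the unsigned range onto negative values. */
    return (int32_t)(u - 0x80000000u) + INT32_MIN;
}

/* Next element index if the increment happens in 32 bits
   and the result is sign-extended afterwards. */
int64_t next_index_narrow(int32_t i) {
    return (int64_t)wrapping_add_i32(i, 1);
}

/* Next element index if the value is widened first and then incremented. */
int64_t next_index_wide(int32_t i) {
    return (int64_t)i + 1;
}
```

The two functions agree for every index except `INT32_MAX`, where the narrow version jumps to `INT32_MIN`. Marking the add as `nsw` tells the optimizer that this case never happens, so it may treat both forms as the same.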
>>
>> When using the non-overflow version I get:
>>
>> LV: Checking a loop in "bar"
>> LV: Found a loop: L0
>> LV: Found an induction variable.
>> LV: Found an unidentified write ptr: %7 = load float** %6
>> LV: Found an unidentified read ptr: %10 = load float** %9
>> LV: Found an unidentified read ptr: %13 = load float** %12
>> LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32 %14
>> LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32 %14
>> LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32 %14
>> LV: We need to do 2 pointer comparisons.
>> LV: We can perform a memory runtime check if needed.
>> LV: We need a runtime memory check.
>> LV: We can vectorize this loop (with a runtime bound check)!
>> LV: Found trip count: 0
>> LV: The Widest type: 32 bits.
>> LV: The Widest register is: 32 bits.
>> LV: Found an estimated cost of 0 for VF 1 For instruction: %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>> LV: Found an estimated cost of 0 for VF 1 For instruction: %15 = getelementptr float* %10, i32 %14
>> LV: Found an estimated cost of 1 for VF 1 For instruction: %16 = load float* %15
>> LV: Found an estimated cost of 0 for VF 1 For instruction: %17 = getelementptr float* %13, i32 %14
>> LV: Found an estimated cost of 1 for VF 1 For instruction: %18 = load float* %17
>> LV: Found an estimated cost of 1 for VF 1 For instruction: %19 = fmul float %18, %16
>> LV: Found an estimated cost of 0 for VF 1 For instruction: %20 = getelementptr float* %7, i32 %14
>> LV: Found an estimated cost of 1 for VF 1 For instruction: store float %19, float* %20
>> LV: Found an estimated cost of 1 for VF 1 For instruction: %21 = add nsw i32 %14, 1
>> LV: Found an estimated cost of 1 for VF 1 For instruction: %22 = icmp sge i32 %21, %4
>> LV: Found an estimated cost of 1 for VF 1 For instruction: br i1 %22, label %L1, label %L0
>> LV: Scalar loop costs: 7.
>> LV: Selecting VF = : 1.
>> LV: The target has 8 vector registers
>> LV(REG): Calculating max register usage:
>> LV(REG): At #0 Interval # 0
>> LV(REG): At #1 Interval # 1
>> LV(REG): At #2 Interval # 2
>> LV(REG): At #3 Interval # 2
>> LV(REG): At #4 Interval # 3
>> LV(REG): At #5 Interval # 3
>> LV(REG): At #6 Interval # 2
>> LV(REG): At #8 Interval # 1
>> LV(REG): At #9 Interval # 1
>> LV(REG): Found max usage: 3
>> LV(REG): Found invariant usage: 5
>> LV(REG): LoopSize: 11
>> LV: Vectorization is possible but not beneficial.
>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>> LV: Unroll Factor is 1
>>
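The "runtime memory check" and the "2 pointer comparisons" in the log above can be pictured roughly as follows. This is a hand-written C sketch of the idea, not the code the vectorizer emits: the written range is compared against each of the two read ranges, and only if neither overlaps is the fast path taken. `ranges_overlap` and `bar_checked` are hypothetical names, and the sketch assumes `end - start` fits in an `int`.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if the n-element float ranges starting at p and q overlap. */
static int ranges_overlap(const float *p, const float *q, size_t n_elems) {
    uintptr_t ps = (uintptr_t)p, qs = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n_elems * sizeof(float));
    return ps < qs + bytes && qs < ps + bytes;
}

/* Returns 1 if the fast path was taken, 0 if the scalar fallback ran. */
int bar_checked(int start, int end, float *A, const float *B, const float *C) {
    size_t n = end > start ? (size_t)(end - start) : 0;

    /* Two comparisons: the written range against each read range. */
    int independent = !ranges_overlap(A + start, B + start, n) &&
                      !ranges_overlap(A + start, C + start, n);

    if (independent) {
        /* No aliasing, so several elements may be processed per step;
           a vectorizer would emit SIMD instructions for this block. */
        int i = start;
        for (; end - i >= 4; i += 4)
            for (int k = 0; k < 4; ++k)
                A[i + k] = B[i + k] * C[i + k];
        for (; i < end; ++i)   /* leftover elements */
            A[i] = B[i] * C[i];
        return 1;
    }

    /* Possible aliasing: keep the original element-by-element order. */
    for (int i = start; i < end; ++i)
        A[i] = B[i] * C[i];
    return 0;
}
```

The check is conservative: an in-place call such as `bar_checked(0, n, A, A, C)` takes the scalar path even though it would be safe to vectorize.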
>> It's not beneficial? I didn't expect that. Do you have a descriptive
>> explanation why it's not beneficial?
> It looks like the vectorizer is not picking up a TTI implementation from a
> target with vector registers (likely, you're just seeing the basic cost
> model). For what target is this?
>
> -Hal
>
>> Frank
>>
>>
>>
>> On 26/10/13 13:03, Arnold wrote:
>>> Hi Frank,
>>>
>>> Sent from my iPhone
>>>
>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
>>>>
>>>> My function implements a simple loop:
>>>>
>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>> {
>>>> for (int i=start; i<end;++i)
>>>> A[i] = B[i] * C[i];
>>>> }
>>>>
>>>> This looks pretty much like the standard example. However, I built
>>>> the function with the IRBuilder, thus not coming from C and clang.
>>>> Also I changed slightly the function's signature:
>>>>
>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>> entrypoint:
>>>> %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>> %1 = load i32* %0
>>>> %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>> %3 = bitcast [8 x i8]* %2 to i32*
>>>> %4 = load i32* %3
>>>> %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>> %6 = bitcast [8 x i8]* %5 to float**
>>>> %7 = load float** %6
>>>> %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>> %9 = bitcast [8 x i8]* %8 to float**
>>>> %10 = load float** %9
>>>> %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>> %12 = bitcast [8 x i8]* %11 to float**
>>>> %13 = load float** %12
>>>> br label %L0
>>>>
>>>> L0: ; preds = %L0, %entrypoint
>>>> %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>> %15 = getelementptr float* %10, i32 %14
>>>> %16 = load float* %15
>>>> %17 = getelementptr float* %13, i32 %14
>>>> %18 = load float* %17
>>>> %19 = fmul float %18, %16
>>>> %20 = getelementptr float* %7, i32 %14
>>>> store float %19, float* %20
>>>> %21 = add i32 %14, 1
>>> Try
>>> %21 = add nsw i32 %14, 1
>>> instead for no-signed wrapping arithmetic.
>>>
>>> If that is not working please post the output of opt ...
>>> -debug-only=loop-vectorize ...
>>>
>>>
>>>
>>>> %22 = icmp sge i32 %21, %4
>>>> br i1 %22, label %L1, label %L0
>>>>
>>>> L1: ; preds = %L0
>>>> ret void
>>>> }
>>>>
>>>>
>>>> As you can see, I use the phi instruction for the loop index. I notice
>>>> that clang prefers stack allocation. So, I am not sure what's the
>>>> problem that the loop vectorizer is not working here.
>>>> I tried many things, like specifying an architecture with vector
>>>> units, enforcing the vector width. No success.
>>>>
>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>>>
>>>> The only explanation I have is the use of the phi instruction. Is this
>>>> preventing to vectorize the loop?
>>>>
>>>> Frank
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Arnold Schwaighofer
2013-Oct-26 19:47 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.

Yep, we don’t pick up the right TTI.

Try -march=x86-64 (or leave it out); you already have this info in the triple.

Then it should work (it does for me with your example).
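A rough way to read the two "Widest" log lines, assuming the vectorizer caps the vector factor at register width divided by the widest element type (an assumption about the cost model, not something spelled out in this thread): with a 32-bit "register" and 32-bit floats only one element fits, so VF = 1 is the only candidate. `max_vector_factor` is a hypothetical helper name.

```c
/* Hypothetical helper: upper bound on the vector factor, given the two
   values the log reports as "The Widest register" and "The Widest type". */
unsigned max_vector_factor(unsigned widest_register_bits,
                           unsigned widest_type_bits) {
    /* Guard against division by zero and registers narrower than the type. */
    if (widest_type_bits == 0 || widest_register_bits < widest_type_bits)
        return 1;
    return widest_register_bits / widest_type_bits;
}
```

With the values from the log, `max_vector_factor(32, 32)` gives 1. Once the x86-64 target information is picked up, a 128-bit SSE register would allow 4 floats per vector and a 256-bit AVX register 8.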
Hal Finkel
2013-Oct-26 19:54 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
----- Original Message -----
> Yep, we don’t pick up the right TTI.
>
> Try -march=x86-64 (or leave it out) you already have this info in the
> triple.
>
> Then it should work (does for me with your example below).

That may depend on what CPU it picks by default; Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).

 -Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory