thr3ads.net - llvm dev - [LLVMdev] Why is the loop vectorizer not working on my function? [Oct 2013]

Frank Winter

2013-Oct-26 19:16 UTC

[LLVMdev] Why is the loop vectorizer not working on my function?

Hi Hal!

I am using the 'x86_64' target. The complete module dump is below, and here is
the command line:

opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll

Frank


; ModuleID = 'test.ll'

target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"

target triple = "x86_64-unknown-linux-elf"

define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
   %0 = bitcast [8 x i8]* %arg_ptr to i32*
   %1 = load i32* %0
   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
   %3 = bitcast [8 x i8]* %2 to i32*
   %4 = load i32* %3
   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
   %6 = bitcast [8 x i8]* %5 to float**
   %7 = load float** %6
   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
   %9 = bitcast [8 x i8]* %8 to float**
   %10 = load float** %9
   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
   %12 = bitcast [8 x i8]* %11 to float**
   %13 = load float** %12
   br label %L0

L0:                                               ; preds = %L0, %entrypoint
   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
   %15 = getelementptr float* %10, i32 %14
   %16 = load float* %15
   %17 = getelementptr float* %13, i32 %14
   %18 = load float* %17
   %19 = fmul float %18, %16
   %20 = getelementptr float* %7, i32 %14
   store float %19, float* %20
   %21 = add nsw i32 %14, 1
   %22 = icmp sge i32 %21, %4
   br i1 %22, label %L1, label %L0

L1:                                               ; preds = %L0
   ret void
}
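
For readers who find the packed-argument IR hard to follow, here is a minimal C sketch of what @bar appears to do, written directly from the module dump above. The names slot_t, pack_args and bar_packed are hypothetical and do not come from the thread, and the sketch assumes 8-byte pointers as on x86-64. Note that the IR loop is bottom-tested, so the body runs at least once even when start >= end, unlike the `for` loop in the original C function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 8-byte argument slot, mirroring the IR type [8 x i8]. */
typedef struct { unsigned char bytes[8]; } slot_t;

/* Hypothetical helper (not in the thread): fills the five slots in the
   order the IR reads them: start, end, A, B, C. */
void pack_args(slot_t slots[5], int32_t start, int32_t end,
               float *A, const float *B, const float *C)
{
    /* Assumption: a pointer fits in one slot, as on x86-64. */
    assert(sizeof(float *) <= sizeof(slot_t));
    memset(slots, 0, 5 * sizeof(slot_t));
    memcpy(slots[0].bytes, &start, sizeof start);
    memcpy(slots[1].bytes, &end, sizeof end);
    memcpy(slots[2].bytes, &A, sizeof A);
    memcpy(slots[3].bytes, &B, sizeof B);
    memcpy(slots[4].bytes, &C, sizeof C);
}

/* C rendering of @bar: unpack the slots, then run the bottom-tested loop. */
void bar_packed(const slot_t *arg_ptr)
{
    int32_t start, end;
    float *A;
    const float *B, *C;
    memcpy(&start, arg_ptr[0].bytes, sizeof start);  /* %1  */
    memcpy(&end,   arg_ptr[1].bytes, sizeof end);    /* %4  */
    memcpy(&A,     arg_ptr[2].bytes, sizeof A);      /* %7  */
    memcpy(&B,     arg_ptr[3].bytes, sizeof B);      /* %10 */
    memcpy(&C,     arg_ptr[4].bytes, sizeof C);      /* %13 */

    int32_t i = start;                /* %14 = phi */
    do {
        A[i] = C[i] * B[i];           /* %19 = fmul %18, %16 */
        i = i + 1;                    /* %21 = add nsw */
    } while (!(i >= end));            /* leave when %22 (icmp sge) is true */
}
```

Calling pack_args with start = 1 and end = 3 and then bar_packed writes only A[1] and A[2]. Calling it with start = end = 0 still writes A[0] once, which is the bottom-tested behaviour mentioned above.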



On 26/10/13 15:08, Hal Finkel wrote:
> ----- Original Message -----
>> Hi Arnold,
>>
>> adding '-debug-only=loop-vectorize' to the command gives:
>>
>> LV: Checking a loop in "bar"
>> LV: Found a loop: L0
>> LV: Found an induction variable.
>> LV: Found an unidentified write ptr:   %7 = load float** %6
>> LV: Found an unidentified read ptr:   %10 = load float** %9
>> LV: Found an unidentified read ptr:   %13 = load float** %12
>> LV: We need to do 2 pointer comparisons.
>> LV: We can't vectorize because we can't find the array bounds.
>> LV: Can't vectorize due to memory conflicts
>> LV: Not vectorizing.
>>
>> It can't find the loop bounds if we use the overflow version of add.
>> That's a good point. I should mark this addition to not overflow.
>>
>> When using the non-overflow version I get:
>>
>> LV: Checking a loop in "bar"
>> LV: Found a loop: L0
>> LV: Found an induction variable.
>> LV: Found an unidentified write ptr:   %7 = load float** %6
>> LV: Found an unidentified read ptr:   %10 = load float** %9
>> LV: Found an unidentified read ptr:   %13 = load float** %12
>> LV: Found a runtime check ptr:  %20 = getelementptr float* %7, i32 %14
>> LV: Found a runtime check ptr:  %15 = getelementptr float* %10, i32 %14
>> LV: Found a runtime check ptr:  %17 = getelementptr float* %13, i32 %14
>> LV: We need to do 2 pointer comparisons.
>> LV: We can perform a memory runtime check if needed.
>> LV: We need a runtime memory check.
>> LV: We can vectorize this loop (with a runtime bound check)!
>> LV: Found trip count: 0
>> LV: The Widest type: 32 bits.
>> LV: The Widest register is: 32 bits.
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %15 = getelementptr float* %10, i32 %14
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %16 = load float* %15
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %17 = getelementptr float* %13, i32 %14
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %18 = load float* %17
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %19 = fmul float %18, %16
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %20 = getelementptr float* %7, i32 %14
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   store float %19, float* %20
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %21 = add nsw i32 %14, 1
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %22 = icmp sge i32 %21, %4
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   br i1 %22, label %L1, label %L0
>> LV: Scalar loop costs: 7.
>> LV: Selecting VF = : 1.
>> LV: The target has 8 vector registers
>> LV(REG): Calculating max register usage:
>> LV(REG): At #0 Interval # 0
>> LV(REG): At #1 Interval # 1
>> LV(REG): At #2 Interval # 2
>> LV(REG): At #3 Interval # 2
>> LV(REG): At #4 Interval # 3
>> LV(REG): At #5 Interval # 3
>> LV(REG): At #6 Interval # 2
>> LV(REG): At #8 Interval # 1
>> LV(REG): At #9 Interval # 1
>> LV(REG): Found max usage: 3
>> LV(REG): Found invariant usage: 5
>> LV(REG): LoopSize: 11
>> LV: Vectorization is possible but not beneficial.
>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>> LV: Unroll Factor is 1
>>
>> It's not beneficial? I didn't expect that. Do you have a descriptive
>> explanation why it's not beneficial?
> It looks like the vectorizer is not picking up a TTI implementation from a
> target with vector registers (likely, you're just seeing the basic cost
> model). For what target is this?
>
>   -Hal
>
>> Frank
>>
>>
>>
>> On 26/10/13 13:03, Arnold wrote:
>>> Hi Frank,
>>>
>>> Sent from my iPhone
>>>
>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
>>>>
>>>> My function implements a simple loop:
>>>>
>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>> {
>>>>      for (int i=start; i<end;++i)
>>>>         A[i] = B[i] * C[i];
>>>> }
>>>>
>>>> This looks pretty much like the standard example. However, I built
>>>> the function with the IRBuilder, so it does not come from C and clang.
>>>> I also slightly changed the function's signature:
>>>>
>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>> entrypoint:
>>>>    %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>    %1 = load i32* %0
>>>>    %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>    %3 = bitcast [8 x i8]* %2 to i32*
>>>>    %4 = load i32* %3
>>>>    %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>    %6 = bitcast [8 x i8]* %5 to float**
>>>>    %7 = load float** %6
>>>>    %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>    %9 = bitcast [8 x i8]* %8 to float**
>>>>    %10 = load float** %9
>>>>    %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>    %12 = bitcast [8 x i8]* %11 to float**
>>>>    %13 = load float** %12
>>>>    br label %L0
>>>>
>>>> L0:                                               ; preds = %L0, %entrypoint
>>>>    %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>    %15 = getelementptr float* %10, i32 %14
>>>>    %16 = load float* %15
>>>>    %17 = getelementptr float* %13, i32 %14
>>>>    %18 = load float* %17
>>>>    %19 = fmul float %18, %16
>>>>    %20 = getelementptr float* %7, i32 %14
>>>>    store float %19, float* %20
>>>>    %21 = add i32 %14, 1
>>> Try
>>> %21 = add nsw i32 %14, 1
>>> instead for no-signed wrapping arithmetic.
>>>
>>> If that is not working please post the output of opt ...
>>> -debug-only=loop-vectorize ...
>>>
>>>
>>>
>>>>    %22 = icmp sge i32 %21, %4
>>>>    br i1 %22, label %L1, label %L0
>>>>
>>>> L1:                                               ; preds = %L0
>>>>    ret void
>>>> }
>>>>
>>>>
>>>> As you can see, I use the phi instruction for the loop index. I notice
>>>> that clang prefers stack allocation. So I am not sure what is
>>>> preventing the loop vectorizer from working here.
>>>> I tried many things, like specifying an architecture with vector
>>>> units and forcing the vector width. No success.
>>>>
>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>>>
>>>> The only explanation I have is the use of the phi instruction. Is
>>>> this preventing the loop from being vectorized?
>>>>
>>>> Frank
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>

Arnold Schwaighofer

2013-Oct-26 19:47 UTC

[LLVMdev] Why is the loop vectorizer not working on my function?

>>> LV: The Widest type: 32 bits.
>>> LV: The Widest register is: 32 bits.
Yep, we don’t pick up the right TTI.

Try -march=x86-64 (or leave it out; you already have this information in the triple).

Then it should work (it does for me with your example).
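
The two log lines quoted at the top of this message suggest why the cost model settled on VF = 1. A minimal sketch of the arithmetic, assuming the vectorizer caps the vectorization factor at the widest register width divided by the widest element type in the loop. The helper name max_vector_factor is hypothetical and is not an LLVM API, and 128 and 256 bits are used here only as the usual SSE and AVX register widths.

```c
#include <assert.h>

/* Hypothetical helper: upper bound on the vectorization factor, given the
   widest vector register and the widest scalar type used in the loop. */
unsigned max_vector_factor(unsigned widest_register_bits,
                           unsigned widest_type_bits)
{
    assert(widest_type_bits != 0);
    unsigned vf = widest_register_bits / widest_type_bits;
    /* A register narrower than the type still allows the scalar loop. */
    return vf == 0 ? 1 : vf;
}
```

With the values from the log (32-bit register, 32-bit float) this gives 1, so only the scalar loop is considered. With a 128-bit register it gives 4, and with a 256-bit register it gives 8.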



Hal Finkel

2013-Oct-26 19:54 UTC

[LLVMdev] Why is the loop vectorizer not working on my function?

----- Original Message -----
> >>> LV: The Widest type: 32 bits.
> >>> LV: The Widest register is: 32 bits.
> 
> Yep, we don’t pick up the right TTI.
> 
> Try -march=x86-64 (or leave it out) you already have this info in the
> triple.
> 
> Then it should work (does for me with your example below).
That may depend on what CPU it picks by default; Frank, if it does not work for
you, try specifying a target CPU (-mcpu=whatever).

 -Hal
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
