Hal Finkel
2013-Oct-26 19:54 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
----- Original Message -----
>>>> LV: The Widest type: 32 bits.
>>>> LV: The Widest register is: 32 bits.
>
> Yep, we don’t pick up the right TTI.
>
> Try -march=x86-64 (or leave it out) you already have this info in the
> triple.
>
> Then it should work (does for me with your example below).

That may depend on what CPU it picks by default; Frank, if it does not
work for you, try specifying a target CPU (-mcpu=whatever).

 -Hal

> On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>
>> Hi Hal!
>>
>> I am using the 'x86_64' target. Below the complete module dump and
>> here the command line:
>>
>> opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
>>
>> Frank
>>
>>
>> ; ModuleID = 'test.ll'
>>
>> target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
>>
>> target triple = "x86_64-unknown-linux-elf"
>>
>> define void @bar([8 x i8]* %arg_ptr) {
>> entrypoint:
>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>   %1 = load i32* %0
>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>   %3 = bitcast [8 x i8]* %2 to i32*
>>   %4 = load i32* %3
>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>   %6 = bitcast [8 x i8]* %5 to float**
>>   %7 = load float** %6
>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>   %9 = bitcast [8 x i8]* %8 to float**
>>   %10 = load float** %9
>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>   %12 = bitcast [8 x i8]* %11 to float**
>>   %13 = load float** %12
>>   br label %L0
>>
>> L0:                                       ; preds = %L0, %entrypoint
>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>   %15 = getelementptr float* %10, i32 %14
>>   %16 = load float* %15
>>   %17 = getelementptr float* %13, i32 %14
>>   %18 = load float* %17
>>   %19 = fmul float %18, %16
>>   %20 = getelementptr float* %7, i32 %14
>>   store float %19, float* %20
>>   %21 = add nsw i32 %14, 1
>>   %22 = icmp sge i32 %21, %4
>>   br i1 %22, label %L1, label %L0
>>
>> L1:                                       ; preds = %L0
>>   ret void
>> }
>>
>>
>> On 26/10/13 15:08, Hal Finkel wrote:
>>> ----- Original Message -----
>>>> Hi Arnold,
>>>>
>>>> adding '-debug-only=loop-vectorize' to the command gives:
>>>>
>>>> LV: Checking a loop in "bar"
>>>> LV: Found a loop: L0
>>>> LV: Found an induction variable.
>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>> LV: We need to do 2 pointer comparisons.
>>>> LV: We can't vectorize because we can't find the array bounds.
>>>> LV: Can't vectorize due to memory conflicts
>>>> LV: Not vectorizing.
>>>>
>>>> It can't find the loop bounds if we use the overflow version of add.
>>>> That's a good point. I should mark this addition to not overflow.
>>>>
>>>> When using the non-overflow version I get:
>>>>
>>>> LV: Checking a loop in "bar"
>>>> LV: Found a loop: L0
>>>> LV: Found an induction variable.
>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>> LV: Found a runtime check ptr:   %20 = getelementptr float* %7, i32 %14
>>>> LV: Found a runtime check ptr:   %15 = getelementptr float* %10, i32 %14
>>>> LV: Found a runtime check ptr:   %17 = getelementptr float* %13, i32 %14
>>>> LV: We need to do 2 pointer comparisons.
>>>> LV: We can perform a memory runtime check if needed.
>>>> LV: We need a runtime memory check.
>>>> LV: We can vectorize this loop (with a runtime bound check)!
>>>> LV: Found trip count: 0
>>>> LV: The Widest type: 32 bits.
>>>> LV: The Widest register is: 32 bits.
>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %15 = getelementptr float* %10, i32 %14
>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %16 = load float* %15
>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %17 = getelementptr float* %13, i32 %14
>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %18 = load float* %17
>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %19 = fmul float %18, %16
>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %20 = getelementptr float* %7, i32 %14
>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   store float %19, float* %20
>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %21 = add nsw i32 %14, 1
>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %22 = icmp sge i32 %21, %4
>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   br i1 %22, label %L1, label %L0
>>>> LV: Scalar loop costs: 7.
>>>> LV: Selecting VF = : 1.
>>>> LV: The target has 8 vector registers
>>>> LV(REG): Calculating max register usage:
>>>> LV(REG): At #0 Interval # 0
>>>> LV(REG): At #1 Interval # 1
>>>> LV(REG): At #2 Interval # 2
>>>> LV(REG): At #3 Interval # 2
>>>> LV(REG): At #4 Interval # 3
>>>> LV(REG): At #5 Interval # 3
>>>> LV(REG): At #6 Interval # 2
>>>> LV(REG): At #8 Interval # 1
>>>> LV(REG): At #9 Interval # 1
>>>> LV(REG): Found max usage: 3
>>>> LV(REG): Found invariant usage: 5
>>>> LV(REG): LoopSize: 11
>>>> LV: Vectorization is possible but not beneficial.
>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>>>> LV: Unroll Factor is 1
>>>>
>>>> It's not beneficial? I didn't expect that. Do you have a descriptive
>>>> explanation why it's not beneficial?
>>> It looks like the vectorizer is not picking up a TTI implementation from
>>> a target with vector registers (likely, you're just seeing the basic
>>> cost model). For what target is this?
>>>
>>> -Hal
>>>
>>>> Frank
>>>>
>>>> On 26/10/13 13:03, Arnold wrote:
>>>>> Hi Frank,
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
>>>>>>
>>>>>> My function implements a simple loop:
>>>>>>
>>>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>>>> {
>>>>>>   for (int i=start; i<end;++i)
>>>>>>     A[i] = B[i] * C[i];
>>>>>> }
>>>>>>
>>>>>> This looks pretty much like the standard example. However, I built
>>>>>> the function with the IRBuilder, thus not coming from C and clang.
>>>>>> Also I changed slightly the function's signature:
>>>>>>
>>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>>> entrypoint:
>>>>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>>   %1 = load i32* %0
>>>>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>>>>   %4 = load i32* %3
>>>>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>>>>   %7 = load float** %6
>>>>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>>>>   %10 = load float** %9
>>>>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>>>>   %13 = load float** %12
>>>>>>   br label %L0
>>>>>>
>>>>>> L0:                                     ; preds = %L0, %entrypoint
>>>>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>   %15 = getelementptr float* %10, i32 %14
>>>>>>   %16 = load float* %15
>>>>>>   %17 = getelementptr float* %13, i32 %14
>>>>>>   %18 = load float* %17
>>>>>>   %19 = fmul float %18, %16
>>>>>>   %20 = getelementptr float* %7, i32 %14
>>>>>>   store float %19, float* %20
>>>>>>   %21 = add i32 %14, 1
>>>>> Try
>>>>>   %21 = add nsw i32 %14, 1
>>>>> instead for no-signed wrapping arithmetic.
>>>>>
>>>>> If that is not working please post the output of opt ...
>>>>> -debug-only=loop-vectorize ...
>>>>>
>>>>>>   %22 = icmp sge i32 %21, %4
>>>>>>   br i1 %22, label %L1, label %L0
>>>>>>
>>>>>> L1:                                     ; preds = %L0
>>>>>>   ret void
>>>>>> }
>>>>>>
>>>>>>
>>>>>> As you can see, I use the phi instruction for the loop index. I notice
>>>>>> that clang prefers stack allocation. So, I am not sure what's the
>>>>>> problem that the loop vectorizer is not working here.
>>>>>> I tried many things, like specifying an architecture with vector
>>>>>> units, enforcing the vector width. No success.
>>>>>>
>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>>>>>
>>>>>> The only explanation I have is the use of the phi instruction. Is this
>>>>>> preventing to vectorize the loop?
>>>>>>
>>>>>> Frank

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
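For reference, Arnold's suggestion above (emitting the increment as "add nsw") looks roughly like this when the loop is built with IRBuilder. A minimal sketch; the names (emitIncrement, Builder, IndVar, "next") are illustrative and not taken from the original code generator:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Emit the induction-variable increment with the no-signed-wrap flag so
// that scalar evolution can reason about the loop bound.
static Value *emitIncrement(IRBuilder<> &Builder, PHINode *IndVar) {
  // Produces "%next = add nsw i32 %..., 1" instead of a plain add.
  return Builder.CreateNSWAdd(IndVar, Builder.getInt32(1), "next");
}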
Frank Winter
2013-Oct-26 20:10 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
If I leave '-march=x86-64' out, it works for me too.

Regarding this message:

LV: We need a runtime memory check.
LV: We can vectorize this loop (with a runtime bound check)!

I know my pointers are neither aliasing each other nor do they point to
overlapping memory regions. Is there a way to mark my pointers 'noalias'
after loading, e.g. after

  %7 = load float** %6

in order to avoid the runtime check?

Frank


On 26/10/13 15:54, Hal Finkel wrote:
> ----- Original Message -----
>>>>> LV: The Widest type: 32 bits.
>>>>> LV: The Widest register is: 32 bits.
>> Yep, we don’t pick up the right TTI.
>>
>> Try -march=x86-64 (or leave it out) you already have this info in the
>> triple.
>>
>> Then it should work (does for me with your example below).
> That may depend on what CPU it picks by default; Frank, if it does not
> work for you, try specifying a target CPU (-mcpu=whatever).
>
> -Hal

-- 
-----------------------------------------------------------
Dr Frank Winter
Scientific Computing Group
Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216
Newport News, VA 23606, USA
Tel: +1-757-269-6448
EMail: fwinter at jlab.org
-----------------------------------------------------------
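On the 'noalias' question: in the IR, noalias is a parameter (and return value) attribute, not something that can be attached to a loaded pointer value, so one way to make the runtime check unnecessary is to pass the three arrays as separate noalias parameters instead of loading them out of the packed [8 x i8]* block. A hedged sketch only; the helper name is made up, the attribute API spelling follows a newer LLVM tree and may need adjusting for a 3.4-era build:

#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Mark every pointer parameter of F as noalias, the IR-level analogue of
// declaring the C parameters as 'float *restrict'. With the arrays passed
// directly (rather than loaded from %arg_ptr), the vectorizer's alias
// checks can usually be resolved without a runtime comparison.
static void markPointerArgsNoAlias(Function &F) {
  for (Argument &Arg : F.args())
    if (Arg.getType()->isPointerTy())
      Arg.addAttr(Attribute::NoAlias);
}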
Frank Winter
2013-Oct-26 23:29 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
I would need this to work when calling the vectorizer through the function
pass manager. Unfortunately I am having the same problem there:

LV: The Widest type: 32 bits.
LV: The Widest register is: 32 bits.

It's not picking up the target information, although I tried with and
without the target triple in the module.

Any idea what could be wrong?

Frank


On 26/10/13 15:54, Hal Finkel wrote:
> ----- Original Message -----
>>>>> LV: The Widest type: 32 bits.
>>>>> LV: The Widest register is: 32 bits.
>> Yep, we don’t pick up the right TTI.
>>
>> Try -march=x86-64 (or leave it out) you already have this info in the
>> triple.
>>
>> Then it should work (does for me with your example below).
> That may depend on what CPU it picks by default; Frank, if it does not
> work for you, try specifying a target CPU (-mcpu=whatever).
>
> -Hal
Arnold Schwaighofer
2013-Oct-27 00:43 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Frank,

On Oct 26, 2013, at 6:29 PM, Frank Winter <fwinter at jlab.org> wrote:

> I would need this to work when calling the vectorizer through
> the function pass manager. Unfortunately I am having the same
> problem there:

I am not sure which function pass manager you are referring to here. I
assume you create your own (you are not using opt but configure your own
pass manager)?

Here is what opt does when it sets up its pass pipeline. You need to have
your/the target add the target’s analysis passes:

opt.cpp:

  PassManager Passes;
  …
  // Add target library info and data layout.
  Triple ModuleTriple(M->getTargetTriple());
  TargetMachine *Machine = 0;
  if (ModuleTriple.getArch())
    Machine = GetTargetMachine(Triple(ModuleTriple));
  OwningPtr<TargetMachine> TM(Machine);

  // Add internal analysis passes from the target machine.
  if (TM.get())
    TM->addAnalysisPasses(Passes);   // <<<
  …
  TM->addAnalysisPasses(FPManager);

Here is what the target does:

void X86TargetMachine::addAnalysisPasses(PassManagerBase &PM) {
  // Add first the target-independent BasicTTI pass, then our X86 pass. This
  // allows the X86 pass to delegate to the target independent layer when
  // appropriate.
  PM.add(createBasicTargetTransformInfoPass(this));
  PM.add(createX86TargetTransformInfoPass(this));
}

> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.

This strongly looks like no target has added a target transform info pass,
so we default to NoTTI. But it could also be that you don’t have the right
subtarget (in which case you need to set the right cpu, “-mcpu” in opt,
when the target machine is created):

unsigned X86TTI::getRegisterBitWidth(bool Vector) const {
  if (Vector) {
    if (ST->hasAVX()) return 256;
    if (ST->hasSSE1()) return 128;
    return 0;
  }

  if (ST->is64Bit())
    return 64;
  return 32;
}

> It's not picking up the target information, although I tried with and
> without the target triple in the module.
>
> Any idea what could be wrong?
>
> Frank
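Putting Arnold's pieces together for the case Frank describes, a hand-built FunctionPassManager rather than opt, the setup looks roughly like the following. This is a sketch against the 3.4-era legacy pass-manager API under stated assumptions: the function name is made up, the CPU string ("corei7-avx") is only an example, and details such as how the data layout is registered differ between releases.

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Module.h"
#include "llvm/PassManager.h"
#include "llvm/Support/TargetRegistry.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include "llvm/Transforms/Vectorize.h"

using namespace llvm;

static void runLoopVectorizer(Module &M) {
  InitializeAllTargets();
  InitializeAllTargetMCs();

  std::string Err;
  const Target *T = TargetRegistry::lookupTarget(M.getTargetTriple(), Err);
  if (!T)
    return;                        // no target registered for this triple

  // Create a TargetMachine with a CPU that has vector units, so that
  // X86TTI::getRegisterBitWidth() reports 128/256-bit vector registers.
  TargetMachine *TM = T->createTargetMachine(M.getTargetTriple(),
                                             "corei7-avx", "",
                                             TargetOptions());

  FunctionPassManager FPM(&M);
  FPM.add(new DataLayout(&M));     // make the module's data layout available
  TM->addAnalysisPasses(FPM);      // adds BasicTTI and the X86 TTI pass
  FPM.add(createLoopVectorizePass());

  FPM.doInitialization();
  for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
    FPM.run(*F);
  FPM.doFinalization();
}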