Arnold Schwaighofer
2013-Oct-27 00:43 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Frank,

On Oct 26, 2013, at 6:29 PM, Frank Winter <fwinter at jlab.org> wrote:

> I would need this to work when calling the vectorizer through
> the function pass manager. Unfortunately I am having the same
> problem there:

I am not sure which function pass manager you are referring to here. I assume you create your own (you are not using opt but configuring your own pass manager)?

Here is what opt does when it sets up its pass pipeline. You need to have your/the target add the target's analysis passes:

opt.cpp:

  PassManager Passes;
  …
  // Add target library info and data layout.
  Triple ModuleTriple(M->getTargetTriple());
  TargetMachine *Machine = 0;
  if (ModuleTriple.getArch())
    Machine = GetTargetMachine(Triple(ModuleTriple));
  OwningPtr<TargetMachine> TM(Machine);

  // Add internal analysis passes from the target machine.
  if (TM.get())
    TM->addAnalysisPasses(Passes);  // <<<
  …
  TM->addAnalysisPasses(FPManager);

Here is what the target does:

  void X86TargetMachine::addAnalysisPasses(PassManagerBase &PM) {
    // Add first the target-independent BasicTTI pass, then our X86 pass. This
    // allows the X86 pass to delegate to the target independent layer when
    // appropriate.
    PM.add(createBasicTargetTransformInfoPass(this));
    PM.add(createX86TargetTransformInfoPass(this));
  }

> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.

This strongly suggests that no target has added a target transform info pass, so we default to NoTTI.

But it could also be that you don't have the right subtarget (in which case you need to set the right CPU, "-mcpu" in opt, when the target machine is created):

  unsigned X86TTI::getRegisterBitWidth(bool Vector) const {
    if (Vector) {
      if (ST->hasAVX()) return 256;
      if (ST->hasSSE1()) return 128;
      return 0;
    }

    if (ST->is64Bit())
      return 64;
    return 32;
  }

> It's not picking up the target information, although I tried with and
> without the target triple in the module.
>
> Any idea what could be wrong?
> Frank
>
> On 26/10/13 15:54, Hal Finkel wrote:
>> ----- Original Message -----
>>>>>> LV: The Widest type: 32 bits.
>>>>>> LV: The Widest register is: 32 bits.
>>> Yep, we don’t pick up the right TTI.
>>>
>>> Try -march=x86-64 (or leave it out); you already have this info in the
>>> triple.
>>>
>>> Then it should work (does for me with your example below).
>> That may depend on what CPU it picks by default; Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).
>>
>> -Hal
>>
>>> On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>>>
>>>> Hi Hal!
>>>>
>>>> I am using the 'x86_64' target. Below the complete module dump and
>>>> here the command line:
>>>>
>>>> opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
>>>>
>>>> Frank
>>>>
>>>> ; ModuleID = 'test.ll'
>>>>
>>>> target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
>>>>
>>>> target triple = "x86_64-unknown-linux-elf"
>>>>
>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>> entrypoint:
>>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>   %1 = load i32* %0
>>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>>   %4 = load i32* %3
>>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>>   %7 = load float** %6
>>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>>   %10 = load float** %9
>>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>>   %13 = load float** %12
>>>>   br label %L0
>>>>
>>>> L0:                                        ; preds = %L0, %entrypoint
>>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>   %15 = getelementptr float* %10, i32 %14
>>>>   %16 = load float* %15
>>>>   %17 = getelementptr float* %13, i32 %14
>>>>   %18 = load float* %17
>>>>   %19 = fmul float %18, %16
>>>>   %20 = getelementptr float* %7, i32 %14
>>>>   store float %19, float* %20
>>>>   %21 = add nsw i32 %14, 1
>>>>   %22 = icmp sge i32 %21, %4
>>>>   br i1 %22, label %L1, label %L0
>>>>
>>>> L1:                                        ; preds = %L0
>>>>   ret void
>>>> }
>>>>
>>>> On 26/10/13 15:08, Hal Finkel wrote:
>>>>> ----- Original Message -----
>>>>>> Hi Arnold,
>>>>>>
>>>>>> adding '-debug-only=loop-vectorize' to the command gives:
>>>>>>
>>>>>> LV: Checking a loop in "bar"
>>>>>> LV: Found a loop: L0
>>>>>> LV: Found an induction variable.
>>>>>> LV: Found an unidentified write ptr: %7 = load float** %6
>>>>>> LV: Found an unidentified read ptr: %10 = load float** %9
>>>>>> LV: Found an unidentified read ptr: %13 = load float** %12
>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>> LV: We can't vectorize because we can't find the array bounds.
>>>>>> LV: Can't vectorize due to memory conflicts
>>>>>> LV: Not vectorizing.
>>>>>>
>>>>>> It can't find the loop bounds if we use the overflowing version of add.
>>>>>> That's a good point. I should mark this addition as non-overflowing.
>>>>>>
>>>>>> When using the non-overflowing version I get:
>>>>>>
>>>>>> LV: Checking a loop in "bar"
>>>>>> LV: Found a loop: L0
>>>>>> LV: Found an induction variable.
>>>>>> LV: Found an unidentified write ptr: %7 = load float** %6
>>>>>> LV: Found an unidentified read ptr: %10 = load float** %9
>>>>>> LV: Found an unidentified read ptr: %13 = load float** %12
>>>>>> LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32 %14
>>>>>> LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32 %14
>>>>>> LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32 %14
>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>> LV: We can perform a memory runtime check if needed.
>>>>>> LV: We need a runtime memory check.
>>>>>> LV: We can vectorize this loop (with a runtime bound check)!
>>>>>> LV: Found trip count: 0
>>>>>> LV: The Widest type: 32 bits.
>>>>>> LV: The Widest register is: 32 bits.
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %15 = getelementptr float* %10, i32 %14
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %16 = load float* %15
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %17 = getelementptr float* %13, i32 %14
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %18 = load float* %17
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %19 = fmul float %18, %16
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %20 = getelementptr float* %7, i32 %14
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: store float %19, float* %20
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %21 = add nsw i32 %14, 1
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %22 = icmp sge i32 %21, %4
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: br i1 %22, label %L1, label %L0
>>>>>> LV: Scalar loop costs: 7.
>>>>>> LV: Selecting VF = : 1.
>>>>>> LV: The target has 8 vector registers
>>>>>> LV(REG): Calculating max register usage:
>>>>>> LV(REG): At #0 Interval # 0
>>>>>> LV(REG): At #1 Interval # 1
>>>>>> LV(REG): At #2 Interval # 2
>>>>>> LV(REG): At #3 Interval # 2
>>>>>> LV(REG): At #4 Interval # 3
>>>>>> LV(REG): At #5 Interval # 3
>>>>>> LV(REG): At #6 Interval # 2
>>>>>> LV(REG): At #8 Interval # 1
>>>>>> LV(REG): At #9 Interval # 1
>>>>>> LV(REG): Found max usage: 3
>>>>>> LV(REG): Found invariant usage: 5
>>>>>> LV(REG): LoopSize: 11
>>>>>> LV: Vectorization is possible but not beneficial.
>>>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>>>>>> LV: Unroll Factor is 1
>>>>>>
>>>>>> It's not beneficial? I didn't expect that. Do you have a descriptive
>>>>>> explanation why it's not beneficial?
>>>>> It looks like the vectorizer is not picking up a TTI
>>>>> implementation from a target with vector registers (likely,
>>>>> you're just seeing the basic cost model). For what target is this?
>>>>>
>>>>> -Hal
>>>>>
>>>>>> Frank
>>>>>>
>>>>>> On 26/10/13 13:03, Arnold wrote:
>>>>>>> Hi Frank,
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
>>>>>>>>
>>>>>>>> My function implements a simple loop:
>>>>>>>>
>>>>>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>>>>>> {
>>>>>>>>   for (int i=start; i<end; ++i)
>>>>>>>>     A[i] = B[i] * C[i];
>>>>>>>> }
>>>>>>>>
>>>>>>>> This looks pretty much like the standard example. However, I built
>>>>>>>> the function with the IRBuilder, thus not coming from C and clang.
>>>>>>>> Also I changed the function's signature slightly:
>>>>>>>>
>>>>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>>>>> entrypoint:
>>>>>>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>>>>   %1 = load i32* %0
>>>>>>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>>>>>>   %4 = load i32* %3
>>>>>>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>>>>>>   %7 = load float** %6
>>>>>>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>>>>>>   %10 = load float** %9
>>>>>>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>>>>>>   %13 = load float** %12
>>>>>>>>   br label %L0
>>>>>>>>
>>>>>>>> L0:                                     ; preds = %L0, %entrypoint
>>>>>>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>>>   %15 = getelementptr float* %10, i32 %14
>>>>>>>>   %16 = load float* %15
>>>>>>>>   %17 = getelementptr float* %13, i32 %14
>>>>>>>>   %18 = load float* %17
>>>>>>>>   %19 = fmul float %18, %16
>>>>>>>>   %20 = getelementptr float* %7, i32 %14
>>>>>>>>   store float %19, float* %20
>>>>>>>>   %21 = add i32 %14, 1
>>>>>>> Try
>>>>>>>   %21 = add nsw i32 %14, 1
>>>>>>> instead for no-signed-wrap arithmetic.
>>>>>>>
>>>>>>> If that is not working please post the output of opt ... -debug-only=loop-vectorize ...
>>>>>>>
>>>>>>>>   %22 = icmp sge i32 %21, %4
>>>>>>>>   br i1 %22, label %L1, label %L0
>>>>>>>>
>>>>>>>> L1:                                     ; preds = %L0
>>>>>>>>   ret void
>>>>>>>> }
>>>>>>>>
>>>>>>>> As you can see, I use the phi instruction for the loop index. I notice
>>>>>>>> that clang prefers stack allocation. So, I am not sure what the
>>>>>>>> problem is that keeps the loop vectorizer from working here.
>>>>>>>> I tried many things, like specifying an architecture with vector
>>>>>>>> units, and enforcing the vector width. No success.
>>>>>>>>
>>>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>>>>>>>
>>>>>>>> The only explanation I have is the use of the phi instruction. Is this
>>>>>>>> preventing the loop from being vectorized?
>>>>>>>>
>>>>>>>> Frank
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> LLVMdev at cs.uiuc.edu        http://llvm.cs.uiuc.edu
>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
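[Editor's note: the opt.cpp recipe Arnold quotes above can be condensed into a standalone pass-manager setup. The following is only a sketch against the LLVM 3.3/3.4-era C++ API used in this thread (PassManager, OwningPtr, addAnalysisPasses); the helper name buildVectorizePipeline and the "corei7" CPU string are illustrative choices, not anything from the thread.]

```cpp
#include "llvm/ADT/OwningPtr.h"
#include "llvm/ADT/Triple.h"
#include "llvm/IR/Module.h"
#include "llvm/PassManager.h"
#include "llvm/Support/TargetRegistry.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include "llvm/Transforms/Vectorize.h"

using namespace llvm;

// Hypothetical helper: wire the target's TTI analysis passes into a pass
// manager before adding the loop vectorizer, mirroring what opt.cpp does.
static void buildVectorizePipeline(Module &M, PassManager &PM) {
  Triple ModuleTriple(M.getTargetTriple());
  std::string Error;
  const Target *TheTarget =
      TargetRegistry::lookupTarget(ModuleTriple.getTriple(), Error);

  OwningPtr<TargetMachine> TM;
  if (TheTarget)
    TM.reset(TheTarget->createTargetMachine(ModuleTriple.getTriple(),
                                            /*CPU=*/"corei7",  // cf. -mcpu
                                            /*Features=*/"",
                                            TargetOptions()));

  // Without this call the vectorizer sees NoTTI and reports a widest
  // register of 32 bits, exactly as in Frank's log.
  if (TM.get())
    TM->addAnalysisPasses(PM);

  PM.add(createLoopVectorizePass());
}
```

The subtarget matters as much as the target: the quoted X86TTI::getRegisterBitWidth shows that a CPU without SSE/AVX features reports no vector registers at all.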
Frank Winter
2013-Oct-27 01:42 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Arnold,

thanks for the detailed setup. Still, I haven't figured out the right thing to do.

I would need only the native target since all generated code will execute on the JIT execution machine (right now, the old JIT interface). There is no need for other targets.

Maybe it would be good to ask specific questions:

How do I get the triple for the native target?
How do I set up the target transform info pass?

I guess I need the target machine for all of this. 'opt' defines the GetTargetMachine() function which takes a Triple (my native triple). It calls GetTargetOptions(). This function, as implemented in 'opt', looks very static to me and surely doesn't fit all machines. Do I need this for my setup?

Frank

-- 
-----------------------------------------------------------
Dr Frank Winter
Scientific Computing Group
Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216
Newport News, VA 23606, USA
Tel: +1-757-269-6448
EMail: fwinter at jlab.org
-----------------------------------------------------------
Arnold Schwaighofer
2013-Oct-27 03:26 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Frank,

I am not familiar with the JITs. Basically, somehow you need to create the TargetMachine.

There is the TargetRegistry that can be queried for the Target (for example, using TargetRegistry::lookupTarget) using the triple/march. The Target can then create a target machine for you.

The Triple is typically picked up from the IR file (Module->getTargetTriple()) and can be overridden using -march/-mtriple. I think the (old) JIT engine just uses the triple of the host llvm was compiled on (sys::getProcessTriple(), see lib/ExecutionEngine/TargetSelect.cpp).

After you have the target, you use it to create the target machine:

  Target *Target = … // look up the target
  TargetMachine *TM = Target->createTargetMachine(…);

Using this target machine you can then add the target transform info pass to a pass manager by calling:

  TM->addAnalysisPasses(FunPassManager); // adds the target's "TargetTransformInfo" implementation if it has one

On Oct 26, 2013, at 8:42 PM, Frank Winter <fwinter at jlab.org> wrote:

> Hi Arnold,
>
> thanks for the detailed setup. Still, I haven't figured out the right thing to do.
>
> I would need only the native target since all generated code will execute on the JIT execution machine (right now, the old JIT interface). There is no need for other targets.

I am not familiar with the JIT infrastructure, I am afraid. I think the old JIT's native target is just the host that llvm was compiled on.

> Maybe it would be good to ask specific questions:
>
> How do I get the triple for the native target?
> How do I set up the target transform info pass?
>
> I guess I need the target machine for all of this. 'opt' defines the GetTargetMachine() function which takes a Triple (my native triple). It calls GetTargetOptions(). This function, as implemented in 'opt', looks very static to me and surely doesn't fit all machines. Do I need this for my setup?

Yes, you need the target machine. There ought to be a JIT API that returns the target machine (since the JIT needs to create it when it creates machine code). But since I am not familiar with the JITs, I am afraid I can't help you further than this.
>>>>>>>> LV: The target has 8 vector registers >>>>>>>> LV(REG): Calculating max register usage: >>>>>>>> LV(REG): At #0 Interval # 0 >>>>>>>> LV(REG): At #1 Interval # 1 >>>>>>>> LV(REG): At #2 Interval # 2 >>>>>>>> LV(REG): At #3 Interval # 2 >>>>>>>> LV(REG): At #4 Interval # 3 >>>>>>>> LV(REG): At #5 Interval # 3 >>>>>>>> LV(REG): At #6 Interval # 2 >>>>>>>> LV(REG): At #8 Interval # 1 >>>>>>>> LV(REG): At #9 Interval # 1 >>>>>>>> LV(REG): Found max usage: 3 >>>>>>>> LV(REG): Found invariant usage: 5 >>>>>>>> LV(REG): LoopSize: 11 >>>>>>>> LV: Vectorization is possible but not beneficial. >>>>>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll >>>>>>>> LV: Unroll Factor is 1 >>>>>>>> >>>>>>>> It's not beneficial? I didn't expect that. Do you have a >>>>>>>> descriptive >>>>>>>> explanation why it's not beneficial? >>>>>>> It looks like the vectorizer is not picking up a TTI >>>>>>> implementation from a target with vector registers (likely, >>>>>>> you're just seeing the basic cost model). For what target is >>>>>>> this? >>>>>>> >>>>>>> -Hal >>>>>>> >>>>>>>> Frank >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 26/10/13 13:03, Arnold wrote: >>>>>>>>> Hi Frank, >>>>>>>>> >>>>>>>>> Sent from my iPhone >>>>>>>>> >>>>>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> My function implements a simple loop: >>>>>>>>>> >>>>>>>>>> void bar( int start, int end, float* A, float* B, float* C) >>>>>>>>>> { >>>>>>>>>> for (int i=start; i<end;++i) >>>>>>>>>> A[i] = B[i] * C[i]; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> This looks pretty much like the standard example. However, I >>>>>>>>>> built >>>>>>>>>> the function >>>>>>>>>> with the IRBuilder, thus not coming from C and clang. 
Also I >>>>>>>>>> changed slightly >>>>>>>>>> the function's signature: >>>>>>>>>> >>>>>>>>>> define void @bar([8 x i8]* %arg_ptr) { >>>>>>>>>> entrypoint: >>>>>>>>>> %0 = bitcast [8 x i8]* %arg_ptr to i32* >>>>>>>>>> %1 = load i32* %0 >>>>>>>>>> %2 = getelementptr [8 x i8]* %arg_ptr, i32 1 >>>>>>>>>> %3 = bitcast [8 x i8]* %2 to i32* >>>>>>>>>> %4 = load i32* %3 >>>>>>>>>> %5 = getelementptr [8 x i8]* %arg_ptr, i32 2 >>>>>>>>>> %6 = bitcast [8 x i8]* %5 to float** >>>>>>>>>> %7 = load float** %6 >>>>>>>>>> %8 = getelementptr [8 x i8]* %arg_ptr, i32 3 >>>>>>>>>> %9 = bitcast [8 x i8]* %8 to float** >>>>>>>>>> %10 = load float** %9 >>>>>>>>>> %11 = getelementptr [8 x i8]* %arg_ptr, i32 4 >>>>>>>>>> %12 = bitcast [8 x i8]* %11 to float** >>>>>>>>>> %13 = load float** %12 >>>>>>>>>> br label %L0 >>>>>>>>>> >>>>>>>>>> L0: ; preds >>>>>>>>>> %L0, >>>>>>>>>> %entrypoint >>>>>>>>>> %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ] >>>>>>>>>> %15 = getelementptr float* %10, i32 %14 >>>>>>>>>> %16 = load float* %15 >>>>>>>>>> %17 = getelementptr float* %13, i32 %14 >>>>>>>>>> %18 = load float* %17 >>>>>>>>>> %19 = fmul float %18, %16 >>>>>>>>>> %20 = getelementptr float* %7, i32 %14 >>>>>>>>>> store float %19, float* %20 >>>>>>>>>> %21 = add i32 %14, 1 >>>>>>>>> Try >>>>>>>>> %21 = add nsw i32 %14, 1 >>>>>>>>> instead for no-signed wrapping arithmetic. >>>>>>>>> >>>>>>>>> If that is not working please post the output of opt ... >>>>>>>>> -debug-only=loop-vectorize ... >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> %22 = icmp sge i32 %21, %4 >>>>>>>>>> br i1 %22, label %L1, label %L0 >>>>>>>>>> >>>>>>>>>> L1: ; preds = %L0 >>>>>>>>>> ret void >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> As you can see, I use the phi instruction for the loop index. I >>>>>>>>>> notice >>>>>>>>>> that clang prefers stack allocation. So, I am not sure what's >>>>>>>>>> the >>>>>>>>>> problem that the loop vectorizer is not working here. 
>>>>>>>>>> I tried many things, like specifying an architecture with >>>>>>>>>> vector >>>>>>>>>> units, enforcing the vector width. No success. >>>>>>>>>> >>>>>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S >>>>>>>>>> loop.ll >>>>>>>>>> >>>>>>>>>> The only explanation I have is the use of the phi instruction. >>>>>>>>>> Is >>>>>>>>>> this >>>>>>>>>> preventing to vectorize the loop? >>>>>>>>>> >>>>>>>>>> Frank >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> LLVM Developers mailing list >>>>>>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>>>>> _______________________________________________ >>>>>>>> LLVM Developers mailing list >>>>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>>>>> > > > -- > ----------------------------------------------------------- > Dr Frank Winter > Scientific Computing Group > Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216 > Newport News, VA 23606, USA > Tel: +1-757-269-6448 > EMail: fwinter at jlab.org > ----------------------------------------------------------- >
Josh Klontz
2013-Oct-28 14:55 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Frank, I ran into the same issue getting started with the JIT execution
engine and vectorization. You might find this email thread useful:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-May/062039.html

My code for this looks like:

InitializeNativeTarget();
Module *module = new Module("foo", getGlobalContext());
module->setTargetTriple(sys::getProcessTriple());
EngineBuilder engineBuilder(module);
engineBuilder.setMCPU(sys::getHostCPUName());
engineBuilder.setEngineKind(EngineKind::JIT);
engineBuilder.setOptLevel(CodeGenOpt::Aggressive);
ExecutionEngine *executionEngine = engineBuilder.create();

For a more complete example, starting at line 1175:
https://github.com/biometrics/likely/blob/gh-pages/likely.cpp

Hope that helps!

-Josh

--
View this message in context: http://llvm.1065342.n5.nabble.com/Why-is-the-loop-vectorizer-not-working-on-my-function-tp62459p62511.html
Sent from the LLVM - Dev mailing list archive at Nabble.com.