Arnold Schwaighofer
2013-Oct-27 00:43 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Frank,

On Oct 26, 2013, at 6:29 PM, Frank Winter <fwinter at jlab.org> wrote:

> I would need this to work when calling the vectorizer through
> the function pass manager. Unfortunately I am having the same
> problem there:

I am not sure which function pass manager you are referring to here. I assume you create your own (you are not using opt but configuring your own pass manager)?

Here is what opt does when it sets up its pass pipeline. You need to have your/the target add the target's analysis passes:

opt.cpp:

  PassManager Passes;
  …
  // Add target library info and data layout.
  Triple ModuleTriple(M->getTargetTriple());
  TargetMachine *Machine = 0;
  if (ModuleTriple.getArch())
    Machine = GetTargetMachine(Triple(ModuleTriple));
  OwningPtr<TargetMachine> TM(Machine);

  // Add internal analysis passes from the target machine.
  if (TM.get())
    TM->addAnalysisPasses(Passes);  // <<<
  …
  TM->addAnalysisPasses(FPManager);

Here is what the target does:

  void X86TargetMachine::addAnalysisPasses(PassManagerBase &PM) {
    // Add first the target-independent BasicTTI pass, then our X86 pass. This
    // allows the X86 pass to delegate to the target independent layer when
    // appropriate.
    PM.add(createBasicTargetTransformInfoPass(this));
    PM.add(createX86TargetTransformInfoPass(this));
  }

> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.

This strongly suggests that no target has added a target transform info pass, so we default to NoTTI.

But it could also be that you don't have the right subtarget (in which case you need to set the right CPU, "-mcpu" in opt, when the target machine is created):

  unsigned X86TTI::getRegisterBitWidth(bool Vector) const {
    if (Vector) {
      if (ST->hasAVX()) return 256;
      if (ST->hasSSE1()) return 128;
      return 0;
    }

    if (ST->is64Bit())
      return 64;
    return 32;
  }

> It's not picking up the target information, although I tried with and
> without the target triple in the module.
>
> Any idea what could be wrong?
> Frank
>
> On 26/10/13 15:54, Hal Finkel wrote:
>> ----- Original Message -----
>>>>>> LV: The Widest type: 32 bits.
>>>>>> LV: The Widest register is: 32 bits.
>>> Yep, we don’t pick up the right TTI.
>>>
>>> Try -march=x86-64 (or leave it out); you already have this info in the
>>> triple.
>>>
>>> Then it should work (does for me with your example below).
>> That may depend on what CPU it picks by default; Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).
>>
>> -Hal
>>
>>> On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>>>
>>>> Hi Hal!
>>>>
>>>> I am using the 'x86_64' target. Below the complete module dump and
>>>> here the command line:
>>>>
>>>> opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
>>>>
>>>> Frank
>>>>
>>>> ; ModuleID = 'test.ll'
>>>>
>>>> target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
>>>>
>>>> target triple = "x86_64-unknown-linux-elf"
>>>>
>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>> entrypoint:
>>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>   %1 = load i32* %0
>>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>>   %4 = load i32* %3
>>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>>   %7 = load float** %6
>>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>>   %10 = load float** %9
>>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>>   %13 = load float** %12
>>>>   br label %L0
>>>>
>>>> L0:                                        ; preds = %L0, %entrypoint
>>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>   %15 = getelementptr float* %10, i32 %14
>>>>   %16 = load float* %15
>>>>   %17 = getelementptr float* %13, i32 %14
>>>>   %18 = load float* %17
>>>>   %19 = fmul float %18, %16
>>>>   %20 = getelementptr float* %7, i32 %14
>>>>   store float %19, float* %20
>>>>   %21 = add nsw i32 %14, 1
>>>>   %22 = icmp sge i32 %21, %4
>>>>   br i1 %22, label %L1, label %L0
>>>>
>>>> L1:                                        ; preds = %L0
>>>>   ret void
>>>> }
>>>>
>>>> On 26/10/13 15:08, Hal Finkel wrote:
>>>>> ----- Original Message -----
>>>>>> Hi Arnold,
>>>>>>
>>>>>> adding '-debug-only=loop-vectorize' to the command gives:
>>>>>>
>>>>>> LV: Checking a loop in "bar"
>>>>>> LV: Found a loop: L0
>>>>>> LV: Found an induction variable.
>>>>>> LV: Found an unidentified write ptr: %7 = load float** %6
>>>>>> LV: Found an unidentified read ptr: %10 = load float** %9
>>>>>> LV: Found an unidentified read ptr: %13 = load float** %12
>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>> LV: We can't vectorize because we can't find the array bounds.
>>>>>> LV: Can't vectorize due to memory conflicts
>>>>>> LV: Not vectorizing.
>>>>>>
>>>>>> It can't find the loop bounds if we use the overflowing version of add.
>>>>>> That's a good point. I should mark this addition as non-overflowing.
>>>>>>
>>>>>> When using the non-overflowing version I get:
>>>>>>
>>>>>> LV: Checking a loop in "bar"
>>>>>> LV: Found a loop: L0
>>>>>> LV: Found an induction variable.
>>>>>> LV: Found an unidentified write ptr: %7 = load float** %6
>>>>>> LV: Found an unidentified read ptr: %10 = load float** %9
>>>>>> LV: Found an unidentified read ptr: %13 = load float** %12
>>>>>> LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32 %14
>>>>>> LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32 %14
>>>>>> LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32 %14
>>>>>> LV: We need to do 2 pointer comparisons.
>>>>>> LV: We can perform a memory runtime check if needed.
>>>>>> LV: We need a runtime memory check.
>>>>>> LV: We can vectorize this loop (with a runtime bound check)!
>>>>>> LV: Found trip count: 0
>>>>>> LV: The Widest type: 32 bits.
>>>>>> LV: The Widest register is: 32 bits.
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %15 = getelementptr float* %10, i32 %14
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %16 = load float* %15
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %17 = getelementptr float* %13, i32 %14
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %18 = load float* %17
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %19 = fmul float %18, %16
>>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %20 = getelementptr float* %7, i32 %14
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: store float %19, float* %20
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %21 = add nsw i32 %14, 1
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %22 = icmp sge i32 %21, %4
>>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction: br i1 %22, label %L1, label %L0
>>>>>> LV: Scalar loop costs: 7.
>>>>>> LV: Selecting VF = : 1.
>>>>>> LV: The target has 8 vector registers
>>>>>> LV(REG): Calculating max register usage:
>>>>>> LV(REG): At #0 Interval # 0
>>>>>> LV(REG): At #1 Interval # 1
>>>>>> LV(REG): At #2 Interval # 2
>>>>>> LV(REG): At #3 Interval # 2
>>>>>> LV(REG): At #4 Interval # 3
>>>>>> LV(REG): At #5 Interval # 3
>>>>>> LV(REG): At #6 Interval # 2
>>>>>> LV(REG): At #8 Interval # 1
>>>>>> LV(REG): At #9 Interval # 1
>>>>>> LV(REG): Found max usage: 3
>>>>>> LV(REG): Found invariant usage: 5
>>>>>> LV(REG): LoopSize: 11
>>>>>> LV: Vectorization is possible but not beneficial.
>>>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>>>>>> LV: Unroll Factor is 1
>>>>>>
>>>>>> It's not beneficial? I didn't expect that. Do you have a descriptive
>>>>>> explanation why it's not beneficial?
>>>>> It looks like the vectorizer is not picking up a TTI
>>>>> implementation from a target with vector registers (likely,
>>>>> you're just seeing the basic cost model). For what target is this?
>>>>>
>>>>> -Hal
>>>>>
>>>>>> Frank
>>>>>>
>>>>>> On 26/10/13 13:03, Arnold wrote:
>>>>>>> Hi Frank,
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
>>>>>>>>
>>>>>>>> My function implements a simple loop:
>>>>>>>>
>>>>>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>>>>>> {
>>>>>>>>   for (int i=start; i<end; ++i)
>>>>>>>>     A[i] = B[i] * C[i];
>>>>>>>> }
>>>>>>>>
>>>>>>>> This looks pretty much like the standard example. However, I built
>>>>>>>> the function with the IRBuilder, thus not coming from C and clang.
>>>>>>>> Also I changed the function's signature slightly:
>>>>>>>>
>>>>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>>>>> entrypoint:
>>>>>>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>>>>   %1 = load i32* %0
>>>>>>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>>>>>>   %4 = load i32* %3
>>>>>>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>>>>>>   %7 = load float** %6
>>>>>>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>>>>>>   %10 = load float** %9
>>>>>>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>>>>>>   %13 = load float** %12
>>>>>>>>   br label %L0
>>>>>>>>
>>>>>>>> L0:                                     ; preds = %L0, %entrypoint
>>>>>>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>>>   %15 = getelementptr float* %10, i32 %14
>>>>>>>>   %16 = load float* %15
>>>>>>>>   %17 = getelementptr float* %13, i32 %14
>>>>>>>>   %18 = load float* %17
>>>>>>>>   %19 = fmul float %18, %16
>>>>>>>>   %20 = getelementptr float* %7, i32 %14
>>>>>>>>   store float %19, float* %20
>>>>>>>>   %21 = add i32 %14, 1
>>>>>>> Try
>>>>>>>   %21 = add nsw i32 %14, 1
>>>>>>> instead for no-signed-wrap arithmetic.
>>>>>>>
>>>>>>> If that is not working please post the output of opt ... -debug-only=loop-vectorize ...
>>>>>>>
>>>>>>>>   %22 = icmp sge i32 %21, %4
>>>>>>>>   br i1 %22, label %L1, label %L0
>>>>>>>>
>>>>>>>> L1:                                     ; preds = %L0
>>>>>>>>   ret void
>>>>>>>> }
>>>>>>>>
>>>>>>>> As you can see, I use the phi instruction for the loop index. I notice
>>>>>>>> that clang prefers stack allocation. So, I am not sure what the
>>>>>>>> problem is that keeps the loop vectorizer from working here.
>>>>>>>> I tried many things, like specifying an architecture with vector
>>>>>>>> units, and enforcing the vector width. No success.
>>>>>>>>
>>>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>>>>>>>
>>>>>>>> The only explanation I have is the use of the phi instruction. Is this
>>>>>>>> preventing the loop from being vectorized?
>>>>>>>>
>>>>>>>> Frank
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> LLVMdev at cs.uiuc.edu        http://llvm.cs.uiuc.edu
>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
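[Editor's note: the opt.cpp recipe Arnold quotes above can be condensed into a standalone pass-manager setup. The following is only a sketch against the LLVM 3.3/3.4-era C++ API used in this thread (PassManager, OwningPtr, addAnalysisPasses); the helper name buildVectorizePipeline and the "corei7" CPU string are illustrative choices, not anything from the thread.]

```cpp
#include "llvm/ADT/OwningPtr.h"
#include "llvm/ADT/Triple.h"
#include "llvm/IR/Module.h"
#include "llvm/PassManager.h"
#include "llvm/Support/TargetRegistry.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include "llvm/Transforms/Vectorize.h"

using namespace llvm;

// Hypothetical helper: wire the target's TTI analysis passes into a pass
// manager before adding the loop vectorizer, mirroring what opt.cpp does.
static void buildVectorizePipeline(Module &M, PassManager &PM) {
  Triple ModuleTriple(M.getTargetTriple());
  std::string Error;
  const Target *TheTarget =
      TargetRegistry::lookupTarget(ModuleTriple.getTriple(), Error);

  OwningPtr<TargetMachine> TM;
  if (TheTarget)
    TM.reset(TheTarget->createTargetMachine(ModuleTriple.getTriple(),
                                            /*CPU=*/"corei7",  // cf. -mcpu
                                            /*Features=*/"",
                                            TargetOptions()));

  // Without this call the vectorizer sees NoTTI and reports a widest
  // register of 32 bits, exactly as in Frank's log.
  if (TM.get())
    TM->addAnalysisPasses(PM);

  PM.add(createLoopVectorizePass());
}
```

The subtarget matters as much as the target: the quoted X86TTI::getRegisterBitWidth shows that a CPU without SSE/AVX features reports no vector registers at all.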
Frank Winter
2013-Oct-27 01:42 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Arnold,

thanks for the detailed setup. Still, I haven't figured out the right thing to do.

I would need only the native target since all generated code will execute on the JIT execution machine (right now, the old JIT interface). There is no need for other targets.

Maybe it would be good to ask specific questions:

How do I get the triple for the native target?
How do I set up the target transform info pass?

I guess I need the target machine for all of this. 'opt' defines the GetTargetMachine() function which takes a Triple (my native triple). It calls GetTargetOptions(). This function, as implemented in 'opt', looks very static to me and surely doesn't fit all machines. Do I need this for my setup?

Frank

-- 
-----------------------------------------------------------
Dr Frank Winter
Scientific Computing Group
Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216
Newport News, VA 23606, USA
Tel: +1-757-269-6448
EMail: fwinter at jlab.org
-----------------------------------------------------------
Arnold Schwaighofer
2013-Oct-27 03:26 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Frank,

I am not familiar with the JITs. Basically, somehow you need to create the TargetMachine.

There is the TargetRegistry that can be queried for the Target (for example, using TargetRegistry::lookupTarget) using the triple/march. The Target can then create a target machine for you.

The Triple is typically picked up from the IR file (Module->getTargetTriple()) and can be overridden using -march/-mtriple. I think the (old) JIT engine just uses the triple of the host llvm was compiled on (sys::getProcessTriple(), see lib/ExecutionEngine/TargetSelect.cpp).

After you have the target, you use it to create the target machine:

  Target *Target = … // look up the target
  TargetMachine *TM = Target->createTargetMachine(…);

Using this target machine you can then add the target transform info pass to a pass manager by calling:

  TM->addAnalysisPasses(FunPassManager); // adds the target's "TargetTransformInfo" implementation if it has one

On Oct 26, 2013, at 8:42 PM, Frank Winter <fwinter at jlab.org> wrote:

> Hi Arnold,
>
> thanks for the detailed setup. Still, I haven't figured out the right thing to do.
>
> I would need only the native target since all generated code will execute on the JIT execution machine (right now, the old JIT interface). There is no need for other targets.

I am not familiar with the JIT infrastructure, I am afraid. I think the old JIT's native target is just the host that llvm was compiled on.

> Maybe it would be good to ask specific questions:
>
> How do I get the triple for the native target?
> How do I set up the target transform info pass?
>
> I guess I need the target machine for all of this. 'opt' defines the GetTargetMachine() function which takes a Triple (my native triple). It calls GetTargetOptions(). This function, as implemented in 'opt', looks very static to me and surely doesn't fit all machines. Do I need this for my setup?

Yes, you need the target machine. There ought to be a JIT API that returns the target machine (since the JIT needs to create it when it creates machine code). But since I am not familiar with the JITs, I am afraid I can't help you further than this.
>>>>>>>> LV: The target has 8 vector registers >>>>>>>> LV(REG): Calculating max register usage: >>>>>>>> LV(REG): At #0 Interval # 0 >>>>>>>> LV(REG): At #1 Interval # 1 >>>>>>>> LV(REG): At #2 Interval # 2 >>>>>>>> LV(REG): At #3 Interval # 2 >>>>>>>> LV(REG): At #4 Interval # 3 >>>>>>>> LV(REG): At #5 Interval # 3 >>>>>>>> LV(REG): At #6 Interval # 2 >>>>>>>> LV(REG): At #8 Interval # 1 >>>>>>>> LV(REG): At #9 Interval # 1 >>>>>>>> LV(REG): Found max usage: 3 >>>>>>>> LV(REG): Found invariant usage: 5 >>>>>>>> LV(REG): LoopSize: 11 >>>>>>>> LV: Vectorization is possible but not beneficial. >>>>>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll >>>>>>>> LV: Unroll Factor is 1 >>>>>>>> >>>>>>>> It's not beneficial? I didn't expect that. Do you have a >>>>>>>> descriptive >>>>>>>> explanation why it's not beneficial? >>>>>>> It looks like the vectorizer is not picking up a TTI >>>>>>> implementation from a target with vector registers (likely, >>>>>>> you're just seeing the basic cost model). For what target is >>>>>>> this? >>>>>>> >>>>>>> -Hal >>>>>>> >>>>>>>> Frank >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 26/10/13 13:03, Arnold wrote: >>>>>>>>> Hi Frank, >>>>>>>>> >>>>>>>>> Sent from my iPhone >>>>>>>>> >>>>>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> My function implements a simple loop: >>>>>>>>>> >>>>>>>>>> void bar( int start, int end, float* A, float* B, float* C) >>>>>>>>>> { >>>>>>>>>> for (int i=start; i<end;++i) >>>>>>>>>> A[i] = B[i] * C[i]; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> This looks pretty much like the standard example. However, I >>>>>>>>>> built >>>>>>>>>> the function >>>>>>>>>> with the IRBuilder, thus not coming from C and clang. 
Also I >>>>>>>>>> changed slightly >>>>>>>>>> the function's signature: >>>>>>>>>> >>>>>>>>>> define void @bar([8 x i8]* %arg_ptr) { >>>>>>>>>> entrypoint: >>>>>>>>>> %0 = bitcast [8 x i8]* %arg_ptr to i32* >>>>>>>>>> %1 = load i32* %0 >>>>>>>>>> %2 = getelementptr [8 x i8]* %arg_ptr, i32 1 >>>>>>>>>> %3 = bitcast [8 x i8]* %2 to i32* >>>>>>>>>> %4 = load i32* %3 >>>>>>>>>> %5 = getelementptr [8 x i8]* %arg_ptr, i32 2 >>>>>>>>>> %6 = bitcast [8 x i8]* %5 to float** >>>>>>>>>> %7 = load float** %6 >>>>>>>>>> %8 = getelementptr [8 x i8]* %arg_ptr, i32 3 >>>>>>>>>> %9 = bitcast [8 x i8]* %8 to float** >>>>>>>>>> %10 = load float** %9 >>>>>>>>>> %11 = getelementptr [8 x i8]* %arg_ptr, i32 4 >>>>>>>>>> %12 = bitcast [8 x i8]* %11 to float** >>>>>>>>>> %13 = load float** %12 >>>>>>>>>> br label %L0 >>>>>>>>>> >>>>>>>>>> L0: ; preds >>>>>>>>>> %L0, >>>>>>>>>> %entrypoint >>>>>>>>>> %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ] >>>>>>>>>> %15 = getelementptr float* %10, i32 %14 >>>>>>>>>> %16 = load float* %15 >>>>>>>>>> %17 = getelementptr float* %13, i32 %14 >>>>>>>>>> %18 = load float* %17 >>>>>>>>>> %19 = fmul float %18, %16 >>>>>>>>>> %20 = getelementptr float* %7, i32 %14 >>>>>>>>>> store float %19, float* %20 >>>>>>>>>> %21 = add i32 %14, 1 >>>>>>>>> Try >>>>>>>>> %21 = add nsw i32 %14, 1 >>>>>>>>> instead for no-signed wrapping arithmetic. >>>>>>>>> >>>>>>>>> If that is not working please post the output of opt ... >>>>>>>>> -debug-only=loop-vectorize ... >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> %22 = icmp sge i32 %21, %4 >>>>>>>>>> br i1 %22, label %L1, label %L0 >>>>>>>>>> >>>>>>>>>> L1: ; preds = %L0 >>>>>>>>>> ret void >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> As you can see, I use the phi instruction for the loop index. I >>>>>>>>>> notice >>>>>>>>>> that clang prefers stack allocation. So, I am not sure what's >>>>>>>>>> the >>>>>>>>>> problem that the loop vectorizer is not working here. 
>>>>>>>>>> I tried many things, like specifying an architecture with >>>>>>>>>> vector >>>>>>>>>> units, enforcing the vector width. No success. >>>>>>>>>> >>>>>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S >>>>>>>>>> loop.ll >>>>>>>>>> >>>>>>>>>> The only explanation I have is the use of the phi instruction. >>>>>>>>>> Is >>>>>>>>>> this >>>>>>>>>> preventing to vectorize the loop? >>>>>>>>>> >>>>>>>>>> Frank >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> LLVM Developers mailing list >>>>>>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>>>>> _______________________________________________ >>>>>>>> LLVM Developers mailing list >>>>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >>>>>>>> > > > -- > ----------------------------------------------------------- > Dr Frank Winter > Scientific Computing Group > Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216 > Newport News, VA 23606, USA > Tel: +1-757-269-6448 > EMail: fwinter at jlab.org > ----------------------------------------------------------- >
Josh Klontz
2013-Oct-28 14:55 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Frank, I ran into the same issue getting started with the JIT execution
engine and vectorization. You might find this email thread useful:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-May/062039.html

My code for this looks like:

InitializeNativeTarget();
Module *module = new Module("foo", getGlobalContext());
module->setTargetTriple(sys::getProcessTriple());
EngineBuilder engineBuilder(module);
engineBuilder.setMCPU(sys::getHostCPUName());
engineBuilder.setEngineKind(EngineKind::JIT);
engineBuilder.setOptLevel(CodeGenOpt::Aggressive);
ExecutionEngine *executionEngine = engineBuilder.create();

For a more complete example, starting at line 1175:
https://github.com/biometrics/likely/blob/gh-pages/likely.cpp

Hope that helps!

-Josh

--
View this message in context: http://llvm.1065342.n5.nabble.com/Why-is-the-loop-vectorizer-not-working-on-my-function-tp62459p62511.html
Sent from the LLVM - Dev mailing list archive at Nabble.com.