Frank Winter
2013-Oct-26 15:03 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
My function implements a simple loop:

void bar( int start, int end, float* A, float* B, float* C)
{
    for (int i=start; i<end;++i)
        A[i] = B[i] * C[i];
}

This looks pretty much like the standard example. However, I built the function with the IRBuilder, so it does not come from C and clang. I also changed the function's signature slightly:

define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
  %0 = bitcast [8 x i8]* %arg_ptr to i32*
  %1 = load i32* %0
  %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
  %3 = bitcast [8 x i8]* %2 to i32*
  %4 = load i32* %3
  %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
  %6 = bitcast [8 x i8]* %5 to float**
  %7 = load float** %6
  %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
  %9 = bitcast [8 x i8]* %8 to float**
  %10 = load float** %9
  %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
  %12 = bitcast [8 x i8]* %11 to float**
  %13 = load float** %12
  br label %L0

L0:                                     ; preds = %L0, %entrypoint
  %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
  %15 = getelementptr float* %10, i32 %14
  %16 = load float* %15
  %17 = getelementptr float* %13, i32 %14
  %18 = load float* %17
  %19 = fmul float %18, %16
  %20 = getelementptr float* %7, i32 %14
  store float %19, float* %20
  %21 = add i32 %14, 1
  %22 = icmp sge i32 %21, %4
  br i1 %22, label %L1, label %L0

L1:                                     ; preds = %L0
  ret void
}

As you can see, I use the phi instruction for the loop index. I notice that clang prefers stack allocation. So I am not sure why the loop vectorizer is not working here. I tried many things, like specifying an architecture with vector units and forcing the vector width. No success.

opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll

The only explanation I have is the use of the phi instruction. Is this preventing the loop from being vectorized?

Frank
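For readers more comfortable with C than with IR, the hand-built function above can be read roughly as the sketch below. The names `slot_t` and `bar_packed` are hypothetical, and the sketch assumes that each argument occupies its own 8-byte slot, as the `getelementptr [8 x i8]*` indexing suggests. One difference from the C source is worth noting: the IR loop is tested at the bottom, so the body runs once even when start >= end.

```c
#include <assert.h>
#include <stddef.h>

/* One 8-byte argument slot, mirroring the [8 x i8] element type in the IR.
   ASSUMPTION: ints and pointers each fit in a single slot. */
typedef union {
    int i;
    float *p;
    unsigned char raw[8];
} slot_t;

/* Hypothetical C rendering of @bar: unpack five slots, then run the loop. */
static void bar_packed(slot_t *args) {
    assert(args != NULL);
    int start = args[0].i;   /* %1  */
    int end   = args[1].i;   /* %4  */
    float *A  = args[2].p;   /* %7,  the pointer that is written */
    float *B  = args[3].p;   /* %10, read */
    float *C  = args[4].p;   /* %13, read */

    int i = start;           /* the phi node %14 */
    do {
        A[i] = C[i] * B[i];  /* fmul float %18, %16 */
        ++i;                 /* %21 = add i32 %14, 1 */
    } while (i < end);       /* leave the loop when %21 >= %4 */
}
```

Calling `bar_packed` with start = 1 and end = 3 writes A[1] and A[2] only. Calling it with start = end = 0 still writes A[0], which the original `for` loop would not do.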
Arnold
2013-Oct-26 17:03 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Frank,

On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
> %20 = getelementptr float* %7, i32 %14
> store float %19, float* %20
> %21 = add i32 %14, 1

Try

%21 = add nsw i32 %14, 1

instead for no-signed wrapping arithmetic.

If that is not working please post the output of opt ... -debug-only=loop-vectorize ...
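A small C sketch may help show what the `nsw` flag rules out. A plain `add i32` wraps around on overflow, so widening the index to 64 bits after the increment does not always give the same value as incrementing the widened index. The helper names `add1_wrapping` and `widening_commutes` are hypothetical. The link to the vectorizer's bounds analysis is an assumption about why the flag matters here; the thread does not spell it out.

```c
#include <stdint.h>

/* i + 1 with two's-complement wraparound, as a plain `add i32` computes it.
   The unsigned detour keeps the C code free of undefined behaviour. */
static int32_t add1_wrapping(int32_t i) {
    uint32_t u = (uint32_t)i + 1u;
    /* Reinterpret the 32-bit pattern as a signed value, portably. */
    return (u <= (uint32_t)INT32_MAX)
               ? (int32_t)u
               : (int32_t)((int64_t)u - 4294967296LL);
}

/* Returns 1 when sign-extending after the increment matches
   incrementing after sign-extension, and 0 otherwise. */
static int widening_commutes(int32_t i) {
    int64_t widen_after  = (int64_t)add1_wrapping(i);
    int64_t widen_before = (int64_t)i + 1;
    return widen_after == widen_before;
}
```

The two orders agree for every index except INT32_MAX, where the wrapping add jumps to INT32_MIN. With `nsw`, the optimizer may assume that case never happens, so the index can be treated as a simple counter that grows by one per iteration.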
Frank Winter
2013-Oct-26 18:40 UTC
[LLVMdev] Why is the loop vectorizer not working on my function?
Hi Arnold,

adding '-debug-only=loop-vectorize' to the command gives:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr: %7 = load float** %6
LV: Found an unidentified read ptr: %10 = load float** %9
LV: Found an unidentified read ptr: %13 = load float** %12
LV: We need to do 2 pointer comparisons.
LV: We can't vectorize because we can't find the array bounds.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

It can't find the loop bounds if we use the overflow version of add. That's a good point. I should mark this addition to not overflow.

When using the non-overflow version I get:

LV: Checking a loop in "bar"
LV: Found a loop: L0
LV: Found an induction variable.
LV: Found an unidentified write ptr: %7 = load float** %6
LV: Found an unidentified read ptr: %10 = load float** %9
LV: Found an unidentified read ptr: %13 = load float** %12
LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32 %14
LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32 %14
LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32 %14
LV: We need to do 2 pointer comparisons.
LV: We can perform a memory runtime check if needed.
LV: We need a runtime memory check.
LV: We can vectorize this loop (with a runtime bound check)!
LV: Found trip count: 0
LV: The Widest type: 32 bits.
LV: The Widest register is: 32 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %15 = getelementptr float* %10, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction: %16 = load float* %15
LV: Found an estimated cost of 0 for VF 1 For instruction: %17 = getelementptr float* %13, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction: %18 = load float* %17
LV: Found an estimated cost of 1 for VF 1 For instruction: %19 = fmul float %18, %16
LV: Found an estimated cost of 0 for VF 1 For instruction: %20 = getelementptr float* %7, i32 %14
LV: Found an estimated cost of 1 for VF 1 For instruction: store float %19, float* %20
LV: Found an estimated cost of 1 for VF 1 For instruction: %21 = add nsw i32 %14, 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %22 = icmp sge i32 %21, %4
LV: Found an estimated cost of 1 for VF 1 For instruction: br i1 %22, label %L1, label %L0
LV: Scalar loop costs: 7.
LV: Selecting VF = : 1.
LV: The target has 8 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 2
LV(REG): At #4 Interval # 3
LV(REG): At #5 Interval # 3
LV(REG): At #6 Interval # 2
LV(REG): At #8 Interval # 1
LV(REG): At #9 Interval # 1
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 5
LV(REG): LoopSize: 11
LV: Vectorization is possible but not beneficial.
LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
LV: Unroll Factor is 1

It's not beneficial? I didn't expect that. Do you have a descriptive explanation why it's not beneficial?
Frank
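The numbers in the log above can be checked by hand, and the C sketch below does that arithmetic. The first helper adds up the eleven per-instruction estimates, which should give the reported scalar loop cost. The second assumes that the largest vectorization factor considered is the widest register width divided by the widest element type. That formula is an assumption about the cost model, not something the thread states, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the per-instruction estimates, as in "LV: Scalar loop costs". */
static unsigned sum_costs(const unsigned *costs, size_t n) {
    unsigned total = 0;
    for (size_t k = 0; k < n; ++k)
        total += costs[k];
    return total;
}

/* ASSUMPTION: the largest vectorization factor tried is the register
   width divided by the widest element type, both in bits. */
static unsigned max_vector_factor(unsigned widest_register_bits,
                                  unsigned widest_type_bits) {
    assert(widest_type_bits > 0);
    return widest_register_bits / widest_type_bits;
}
```

Feeding in the costs from the log (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1) gives 7, matching "Scalar loop costs: 7". With the reported 32-bit widest register and 32-bit widest type the factor is 1, which is consistent with "Selecting VF = : 1". A register of 256 bits, used here purely as an illustration, would give 8. If this reading is right, the "not beneficial" verdict comes from the register width the cost model reported rather than from the loop body, but the thread does not confirm the cause.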