----- Original Message -----> > Hi Nadav, > > that's the whole point of it. I can't in general make the index > calculation simpler. The example given is the simplest non-trivial > index function that is needed. It might well be that it's that > simple that the index calculation in this case can be thrown aways > altogether and - as you say - be replaced by the simple loop you > mentioned. However, this cannot be done in the general case. In the > general case the index calculation requires the 'rem' and 'div' > instruction. The OR instruction must be the result of an arithmetic > transformation of one of the previous passes (due to -O3?). > > I don't see a way around to make the vectorizers recognize such > arithmetic operations. > > Do you think this can be done relatively quickly or does this involve > a huge effort? > > What does 'stepping through' the loop vectorizer mean? Using the > debugger and step through the program? Probably not. Is the way to > go, to alter the 'canVectorize' function print debug output to see > what's going on? > > Hal, you seem to know the loop vectorizer.Heh ;) Nadav wrote it. I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how difficult it would be to make SCEV understand the indexing calculations. -Hal> Is this the place to look > at or is the SLP vectorizer the more promising place? > > Frank > > > On 31/10/13 12:50, Nadav Rotem wrote: > > > > Hi Frank, > > > This loop should be vectorized by the SLP-vectorizer. It has several > scalars (C[0], C[1] … ) that can be merged into a vector. The SLP > vectorizer can’t figure out that the stores are consecutive because > SCEV can’t analyze the OR in the index calculation: > > %2 = and i64 %i.04, 3 > %3 = lshr i64 %i.04, 2 > %4 = shl i64 %3, 3 > %5 = or i64 %4, %2 > %11 = getelementptr inbounds float* %c, i64 %5 > store float %10, float* %11, align 4, !tbaa !0 > > > You wrote that you want each iteration to look like this: > > > > > > > > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27 > > > > Why can’t you just write a small loop like this: for (i=0; i<4; i++) > C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the > SLP-vectorizer will vectorize the unrolled iterations, or the > loop-vectorizer would catch it. > > > Thanks, > Nadav > > > On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > wrote: > > > > > A quite small but yet complete example function which all > vectorization passes fail to optimize: > > #include <cstdint> > #include <iostream> > > void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ > c, float * __restrict__ a, float * __restrict__ b) > { > for ( std::uint64_t i = start ; i < end ; i += 4 ) { > { > const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > } > } > > The loop index and array accesses for the first 4 iterations: > > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27 > > For example on an x86 processor with SSE (128 bit SIMD vectors) the > loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 > SIMD store. > > With current trunk I tried the following on the above example: > > clang++ -emit-llvm -S loop_minimal.cc -std=c++11 > opt -O3 -vectorize-slp -S loop_minimal.ll > opt -O3 -loop-vectorize -S loop_minimal.ll > opt -O3 -bb-vectorize -S loop_minimal.ll > > All optimization passes miss the opportunity. It seems the SCEV AA > pass doesn't understand modulo arithmetic. > > How can the SCEV AA pass be extended to handle this type of > arithmetic? Frank > > On 31/10/13 02:21, Renato Golin wrote: > > > On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote: > > > > > > > const std::uint64_t ir0 = (i+0)%4; // not working > > > > I thought this would be the case when I saw the original expression. > Maybe we need to teach module arithmetic to SCEV? > > --renato > > > > >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
On 31/10/13 13:34, Hal Finkel wrote:> ----- Original Message ----- >> Hi Nadav, >> >> that's the whole point of it. I can't in general make the index >> calculation simpler. The example given is the simplest non-trivial >> index function that is needed. It might well be that it's that >> simple that the index calculation in this case can be thrown aways >> altogether and - as you say - be replaced by the simple loop you >> mentioned. However, this cannot be done in the general case. In the >> general case the index calculation requires the 'rem' and 'div' >> instruction. The OR instruction must be the result of an arithmetic >> transformation of one of the previous passes (due to -O3?). >> >> I don't see a way around to make the vectorizers recognize such >> arithmetic operations. >> >> Do you think this can be done relatively quickly or does this involve >> a huge effort? >> >> What does 'stepping through' the loop vectorizer mean? Using the >> debugger and step through the program? Probably not. Is the way to >> go, to alter the 'canVectorize' function print debug output to see >> what's going on? >> >> Hal, you seem to know the loop vectorizer. > Heh ;) Nadav wrote it.I know. But you said that it doesn't use the SCEV AA information but uses its interal SCEV. Being able to say this requires knowledge of the loop vectorizer. That's why I said it.> > I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how difficult it would be to make SCEV understand the indexing calculations. > > -Hal > >> Is this the place to look >> at or is the SLP vectorizer the more promising place? >> >> Frank >> >> >> On 31/10/13 12:50, Nadav Rotem wrote: >> >> >> >> Hi Frank, >> >> >> This loop should be vectorized by the SLP-vectorizer. It has several >> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP >> vectorizer can’t figure out that the stores are consecutive because >> SCEV can’t analyze the OR in the index calculation: >> >> %2 = and i64 %i.04, 3 >> %3 = lshr i64 %i.04, 2 >> %4 = shl i64 %3, 3 >> %5 = or i64 %4, %2 >> %11 = getelementptr inbounds float* %c, i64 %5 >> store float %10, float* %11, align 4, !tbaa !0 >> >> >> You wrote that you want each iteration to look like this: >> >> >> >> >> >> >> >> i 0: 0 1 2 3 >> i 4: 8 9 10 11 >> i 8: 16 17 18 19 >> i 12: 24 25 26 27 >> >> >> >> Why can’t you just write a small loop like this: for (i=0; i<4; i++) >> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the >> SLP-vectorizer will vectorize the unrolled iterations, or the >> loop-vectorizer would catch it. >> >> >> Thanks, >> Nadav >> >> >> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > wrote: >> >> >> >> >> A quite small but yet complete example function which all >> vectorization passes fail to optimize: >> >> #include <cstdint> >> #include <iostream> >> >> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ >> c, float * __restrict__ a, float * __restrict__ b) >> { >> for ( std::uint64_t i = start ; i < end ; i += 4 ) { >> { >> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> } >> } >> >> The loop index and array accesses for the first 4 iterations: >> >> i 0: 0 1 2 3 >> i 4: 8 9 10 11 >> i 8: 16 17 18 19 >> i 12: 24 25 26 27 >> >> For example on an x86 processor with SSE (128 bit SIMD vectors) the >> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 >> SIMD store. >> >> With current trunk I tried the following on the above example: >> >> clang++ -emit-llvm -S loop_minimal.cc -std=c++11 >> opt -O3 -vectorize-slp -S loop_minimal.ll >> opt -O3 -loop-vectorize -S loop_minimal.ll >> opt -O3 -bb-vectorize -S loop_minimal.ll >> >> All optimization passes miss the opportunity. It seems the SCEV AA >> pass doesn't understand modulo arithmetic. >> >> How can the SCEV AA pass be extended to handle this type of >> arithmetic? Frank >> >> On 31/10/13 02:21, Renato Golin wrote: >> >> >> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote: >> >> >> >> >> >> >> const std::uint64_t ir0 = (i+0)%4; // not working >> >> >> >> I thought this would be the case when I saw the original expression. >> Maybe we need to teach module arithmetic to SCEV? >> >> --renato >> >> >> >> >>
Here's a real example of IR (as opposed to the C example provided earlier) which includes the type of index calculation that lets the loop vectorizer and SLP vectorizer (after unrolling) fail: define void @main(i64 %arg0, i64 %arg1, i1 %arg2, i64 %arg3, float* noalias %arg4, float* noalias %arg5, float* noalias %arg6) { entrypoint: br i1 %arg2, label %L0, label %L1 L0: ; preds = %entrypoint %0 = add nsw i64 %arg0, %arg3 %1 = add nsw i64 %arg1, %arg3 br label %L2 L1: ; preds = %entrypoint br label %L2 L2: ; preds = %L0, %L1 %2 = phi i64 [ %arg0, %L1 ], [ %0, %L0 ] %3 = phi i64 [ %arg1, %L1 ], [ %1, %L0 ] br label %L3 L3: ; preds = %L3, %L2 %4 = phi i64 [ %25, %L3 ], [ %2, %L2 ] %5 = sdiv i64 %4, 4 %6 = srem i64 %4, 4 %7 = mul i64 %5, 2 %8 = mul i64 %7, 4 %9 = add nsw i64 %8, %6 %10 = add nsw i64 %7, 1 %11 = mul i64 %10, 4 %12 = add nsw i64 %11, %6 %13 = getelementptr float* %arg5, i64 %9 %14 = load float* %13 %15 = getelementptr float* %arg5, i64 %12 %16 = load float* %15 %17 = getelementptr float* %arg6, i64 %9 %18 = load float* %17 %19 = getelementptr float* %arg6, i64 %12 %20 = load float* %19 %21 = fadd float %20, %16 %22 = fadd float %18, %14 %23 = getelementptr float* %arg4, i64 %9 store float %22, float* %23 %24 = getelementptr float* %arg4, i64 %12 store float %21, float* %24 %25 = add nsw i64 %4, 1 %26 = icmp sge i64 %25, %3 br i1 %26, label %L4, label %L3 L4: ; preds = %L3 ret void } Right now we have: LV: Checking a loop in "main" LV: Found a loop: L3 LV: Found an induction variable. LV: We need to do 0 pointer comparisons. LV: Checking memory dependencies LV: Bad stride - Not an AddRecExpr pointer %23 = getelementptr float* %arg4, i64 %9 SCEV: ((4 * ((8 * %5) + %6)) + %arg4) LV: Bad stride - Not an AddRecExpr pointer %24 = getelementptr float* %arg4, i64 %12 SCEV: (16 + (4 * %6) + (32 * %5) + %arg4) LV: Src Scev: ((4 * ((8 * %5) + %6)) + %arg4)Sink Scev: (16 + (4 * %6) + (32 * %5) + %arg4)(Induction step: 0) LV: Distance for store float %22, float* %23 to store float %21, float* %24: 16 Non-consecutive pointer access LV: We don't need a runtime memory check. LV: Can't vectorize due to memory conflicts LV: Not vectorizing. However, 4 consecutive iterations could be combined (for an x86 SSE SIMD). (With a SIMD width of 8 floats, it would be that 8 iterations could be combined. In that case the literal constant in %8 = mul i64 %7, 4 would be 8 of course). Even after unrolling this loop by hand (factor 4), the SLP vectorizer fails. Frank On 31/10/13 13:34, Hal Finkel wrote:> ----- Original Message ----- >> Hi Nadav, >> >> that's the whole point of it. I can't in general make the index >> calculation simpler. The example given is the simplest non-trivial >> index function that is needed. It might well be that it's that >> simple that the index calculation in this case can be thrown aways >> altogether and - as you say - be replaced by the simple loop you >> mentioned. However, this cannot be done in the general case. In the >> general case the index calculation requires the 'rem' and 'div' >> instruction. The OR instruction must be the result of an arithmetic >> transformation of one of the previous passes (due to -O3?). >> >> I don't see a way around to make the vectorizers recognize such >> arithmetic operations. >> >> Do you think this can be done relatively quickly or does this involve >> a huge effort? >> >> What does 'stepping through' the loop vectorizer mean? Using the >> debugger and step through the program? Probably not. Is the way to >> go, to alter the 'canVectorize' function print debug output to see >> what's going on? >> >> Hal, you seem to know the loop vectorizer. > Heh ;) Nadav wrote it. > > I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how difficult it would be to make SCEV understand the indexing calculations. > > -Hal > >> Is this the place to look >> at or is the SLP vectorizer the more promising place? >> >> Frank >> >> >> On 31/10/13 12:50, Nadav Rotem wrote: >> >> >> >> Hi Frank, >> >> >> This loop should be vectorized by the SLP-vectorizer. It has several >> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP >> vectorizer can’t figure out that the stores are consecutive because >> SCEV can’t analyze the OR in the index calculation: >> >> %2 = and i64 %i.04, 3 >> %3 = lshr i64 %i.04, 2 >> %4 = shl i64 %3, 3 >> %5 = or i64 %4, %2 >> %11 = getelementptr inbounds float* %c, i64 %5 >> store float %10, float* %11, align 4, !tbaa !0 >> >> >> You wrote that you want each iteration to look like this: >> >> >> >> >> >> >> >> i 0: 0 1 2 3 >> i 4: 8 9 10 11 >> i 8: 16 17 18 19 >> i 12: 24 25 26 27 >> >> >> >> Why can’t you just write a small loop like this: for (i=0; i<4; i++) >> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the >> SLP-vectorizer will vectorize the unrolled iterations, or the >> loop-vectorizer would catch it. >> >> >> Thanks, >> Nadav >> >> >> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > wrote: >> >> >> >> >> A quite small but yet complete example function which all >> vectorization passes fail to optimize: >> >> #include <cstdint> >> #include <iostream> >> >> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ >> c, float * __restrict__ a, float * __restrict__ b) >> { >> for ( std::uint64_t i = start ; i < end ; i += 4 ) { >> { >> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> } >> } >> >> The loop index and array accesses for the first 4 iterations: >> >> i 0: 0 1 2 3 >> i 4: 8 9 10 11 >> i 8: 16 17 18 19 >> i 12: 24 25 26 27 >> >> For example on an x86 processor with SSE (128 bit SIMD vectors) the >> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 >> SIMD store. >> >> With current trunk I tried the following on the above example: >> >> clang++ -emit-llvm -S loop_minimal.cc -std=c++11 >> opt -O3 -vectorize-slp -S loop_minimal.ll >> opt -O3 -loop-vectorize -S loop_minimal.ll >> opt -O3 -bb-vectorize -S loop_minimal.ll >> >> All optimization passes miss the opportunity. It seems the SCEV AA >> pass doesn't understand modulo arithmetic. >> >> How can the SCEV AA pass be extended to handle this type of >> arithmetic? Frank >> >> On 31/10/13 02:21, Renato Golin wrote: >> >> >> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote: >> >> >> >> >> >> >> const std::uint64_t ir0 = (i+0)%4; // not working >> >> >> >> I thought this would be the case when I saw the original expression. >> Maybe we need to teach module arithmetic to SCEV? >> >> --renato >> >> >> >> >>
----- Original Message -----> On 31/10/13 13:34, Hal Finkel wrote: > > ----- Original Message ----- > >> Hi Nadav, > >> > >> that's the whole point of it. I can't in general make the index > >> calculation simpler. The example given is the simplest non-trivial > >> index function that is needed. It might well be that it's that > >> simple that the index calculation in this case can be thrown aways > >> altogether and - as you say - be replaced by the simple loop you > >> mentioned. However, this cannot be done in the general case. In > >> the > >> general case the index calculation requires the 'rem' and 'div' > >> instruction. The OR instruction must be the result of an > >> arithmetic > >> transformation of one of the previous passes (due to -O3?). > >> > >> I don't see a way around to make the vectorizers recognize such > >> arithmetic operations. > >> > >> Do you think this can be done relatively quickly or does this > >> involve > >> a huge effort? > >> > >> What does 'stepping through' the loop vectorizer mean? Using the > >> debugger and step through the program? Probably not. Is the way to > >> go, to alter the 'canVectorize' function print debug output to see > >> what's going on? > >> > >> Hal, you seem to know the loop vectorizer. > > Heh ;) Nadav wrote it. > I know. But you said that it doesn't use the SCEV AA information but > uses its interal SCEV. Being able to say this requires knowledge of > the > loop vectorizer. That's why I said it.Fair enough; I'm not sure how well I know it, but I've looked over the code a few times. The memory dependence code is in MemoryDepChecker::isDependent in LoopVectorize.cpp. Also, Sebastian Pop and I have been ping-ponging on working on delinearization analysis. Sebastian's latest code (http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20131021/192120.html) has some SCEV-based GCD and division code, and that could be useful for this as well. -Hal> > > > I've CC'd Andy, who is our SCEV expert; hopefully he can comment on > > how difficult it would be to make SCEV understand the indexing > > calculations. > > > > -Hal > > > >> Is this the place to look > >> at or is the SLP vectorizer the more promising place? > >> > >> Frank > >> > >> > >> On 31/10/13 12:50, Nadav Rotem wrote: > >> > >> > >> > >> Hi Frank, > >> > >> > >> This loop should be vectorized by the SLP-vectorizer. It has > >> several > >> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP > >> vectorizer can’t figure out that the stores are consecutive > >> because > >> SCEV can’t analyze the OR in the index calculation: > >> > >> %2 = and i64 %i.04, 3 > >> %3 = lshr i64 %i.04, 2 > >> %4 = shl i64 %3, 3 > >> %5 = or i64 %4, %2 > >> %11 = getelementptr inbounds float* %c, i64 %5 > >> store float %10, float* %11, align 4, !tbaa !0 > >> > >> > >> You wrote that you want each iteration to look like this: > >> > >> > >> > >> > >> > >> > >> > >> i 0: 0 1 2 3 > >> i 4: 8 9 10 11 > >> i 8: 16 17 18 19 > >> i 12: 24 25 26 27 > >> > >> > >> > >> Why can’t you just write a small loop like this: for (i=0; i<4; > >> i++) > >> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the > >> SLP-vectorizer will vectorize the unrolled iterations, or the > >> loop-vectorizer would catch it. > >> > >> > >> Thanks, > >> Nadav > >> > >> > >> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > > >> wrote: > >> > >> > >> > >> > >> A quite small but yet complete example function which all > >> vectorization passes fail to optimize: > >> > >> #include <cstdint> > >> #include <iostream> > >> > >> void bar(std::uint64_t start, std::uint64_t end, float * > >> __restrict__ > >> c, float * __restrict__ a, float * __restrict__ b) > >> { > >> for ( std::uint64_t i = start ; i < end ; i += 4 ) { > >> { > >> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); > >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > >> } > >> { > >> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); > >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > >> } > >> { > >> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); > >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > >> } > >> { > >> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); > >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > >> } > >> } > >> } > >> > >> The loop index and array accesses for the first 4 iterations: > >> > >> i 0: 0 1 2 3 > >> i 4: 8 9 10 11 > >> i 8: 16 17 18 19 > >> i 12: 24 25 26 27 > >> > >> For example on an x86 processor with SSE (128 bit SIMD vectors) > >> the > >> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 > >> SIMD store. > >> > >> With current trunk I tried the following on the above example: > >> > >> clang++ -emit-llvm -S loop_minimal.cc -std=c++11 > >> opt -O3 -vectorize-slp -S loop_minimal.ll > >> opt -O3 -loop-vectorize -S loop_minimal.ll > >> opt -O3 -bb-vectorize -S loop_minimal.ll > >> > >> All optimization passes miss the opportunity. It seems the SCEV AA > >> pass doesn't understand modulo arithmetic. > >> > >> How can the SCEV AA pass be extended to handle this type of > >> arithmetic? Frank > >> > >> On 31/10/13 02:21, Renato Golin wrote: > >> > >> > >> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote: > >> > >> > >> > >> > >> > >> > >> const std::uint64_t ir0 = (i+0)%4; // not working > >> > >> > >> > >> I thought this would be the case when I saw the original > >> expression. > >> Maybe we need to teach module arithmetic to SCEV? > >> > >> --renato > >> > >> > >> > >> > >> > >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory