Hi Frank, This loop should be vectorized by the SLP-vectorizer. It has several scalars (C[0], C[1] … ) that can be merged into a vector. The SLP vectorizer can’t figure out that the stores are consecutive because SCEV can’t analyze the OR in the index calculation: %2 = and i64 %i.04, 3 %3 = lshr i64 %i.04, 2 %4 = shl i64 %3, 3 %5 = or i64 %4, %2 %11 = getelementptr inbounds float* %c, i64 %5 store float %10, float* %11, align 4, !tbaa !0 You wrote that you want each iteration to look like this:> > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27Why can’t you just write a small loop like this: for (i=0; i<4; i++) C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the SLP-vectorizer will vectorize the unrolled iterations, or the loop-vectorizer would catch it. Thanks, Nadav On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org> wrote:> A quite small but yet complete example function which all vectorization passes fail to optimize: > > #include <cstdint> > #include <iostream> > > void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) > { > for ( std::uint64_t i = start ; i < end ; i += 4 ) { > { > const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > } > } > > The loop index and array accesses for the first 4 iterations: > > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27 > > For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store. > > With current trunk I tried the following on the above example: > > clang++ -emit-llvm -S loop_minimal.cc -std=c++11 > opt -O3 -vectorize-slp -S loop_minimal.ll > opt -O3 -loop-vectorize -S loop_minimal.ll > opt -O3 -bb-vectorize -S loop_minimal.ll > > All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic. > > How can the SCEV AA pass be extended to handle this type of arithmetic? > Frank > > On 31/10/13 02:21, Renato Golin wrote: >> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote: >> const std::uint64_t ir0 = (i+0)%4; // not working >> >> I thought this would be the case when I saw the original expression. Maybe we need to teach module arithmetic to SCEV? >> >> --renato > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/797e3f56/attachment.html>
Hi Nadav, that's the whole point of it. I can't in general make the index calculation simpler. The example given is the simplest non-trivial index function that is needed. It might well be that it's that simple that the index calculation in this case can be thrown aways altogether and - as you say - be replaced by the simple loop you mentioned. However, this cannot be done in the general case. In the general case the index calculation requires the 'rem' and 'div' instruction. The OR instruction must be the result of an arithmetic transformation of one of the previous passes (due to -O3?). I don't see a way around to make the vectorizers recognize such arithmetic operations. Do you think this can be done relatively quickly or does this involve a huge effort? What does 'stepping through' the loop vectorizer mean? Using the debugger and step through the program? Probably not. Is the way to go, to alter the 'canVectorize' function print debug output to see what's going on? Hal, you seem to know the loop vectorizer. Is this the place to look at or is the SLP vectorizer the more promising place? Frank On 31/10/13 12:50, Nadav Rotem wrote:> Hi Frank, > > This loop should be vectorized by the SLP-vectorizer. It has several > scalars (C[0], C[1] … ) that can be merged into a vector. The SLP > vectorizer can’t figure out that the stores are consecutive because > SCEV can’t analyze the OR in the index calculation: > %2 = and i64 %i.04, 3 > %3 = lshr i64 %i.04, 2 > %4 = shl i64 %3, 3 > %5 = or i64 %4, %2 > %11 = getelementptr inbounds float* %c, i64 %5 > store float %10, float* %11, align 4, !tbaa !0 > > You wrote that you want each iteration to look like this: > >> >> i 0: 0 1 2 3 >> i 4: 8 9 10 11 >> i 8: 16 17 18 19 >> i 12: 24 25 26 27 > > Why can’t you just write a small loop like this: for (i=0; i<4; i++) > C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the > SLP-vectorizer will vectorize the unrolled iterations, or the > loop-vectorizer would catch it. > > Thanks, > Nadav > > On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org > <mailto:fwinter at jlab.org>> wrote: > >> A quite small but yet complete example function which all >> vectorization passes fail to optimize: >> >> #include <cstdint> >> #include <iostream> >> >> void bar(std::uint64_t start, std::uint64_t end, float * >> __restrict__ c, float * __restrict__ a, float * __restrict__ b) >> { >> for ( std::uint64_t i = start ; i < end ; i += 4 ) { >> { >> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> { >> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; >> } >> } >> } >> >> The loop index and array accesses for the first 4 iterations: >> >> i 0: 0 1 2 3 >> i 4: 8 9 10 11 >> i 8: 16 17 18 19 >> i 12: 24 25 26 27 >> >> For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store. >> >> With current trunk I tried the following on the above example: >> >> clang++ -emit-llvm -S loop_minimal.cc -std=c++11 >> opt -O3 -vectorize-slp -S loop_minimal.ll >> opt -O3 -loop-vectorize -S loop_minimal.ll >> opt -O3 -bb-vectorize -S loop_minimal.ll >> >> All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic. >> >> How can the SCEV AA pass be extended to handle this type of arithmetic? >> Frank >> >> On 31/10/13 02:21, Renato Golin wrote: >>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org >>> <mailto:fwinter at jlab.org>> wrote: >>> >>> const std::uint64_t ir0 = (i+0)%4; // not working >>> >>> >>> I thought this would be the case when I saw the original expression. >>> Maybe we need to teach module arithmetic to SCEV? >>> >>> --renato >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/cea7d8a9/attachment.html>
On 31 October 2013 10:24, Frank Winter <fwinter at jlab.org> wrote:> What does 'stepping through' the loop vectorizer mean? Using the > debugger and step through the program? Probably not. Is the way to go, to > alter the 'canVectorize' function print debug output to see what's going on? >Stepping through 'opt -O3' with a debugger, yes. The debug messages are enabled by using '-debug-only=loop-vectorizer', but they's only half the story. cheers, --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/b4b9f11f/attachment.html>
----- Original Message -----> > Hi Nadav, > > that's the whole point of it. I can't in general make the index > calculation simpler. The example given is the simplest non-trivial > index function that is needed. It might well be that it's that > simple that the index calculation in this case can be thrown aways > altogether and - as you say - be replaced by the simple loop you > mentioned. However, this cannot be done in the general case. In the > general case the index calculation requires the 'rem' and 'div' > instruction. The OR instruction must be the result of an arithmetic > transformation of one of the previous passes (due to -O3?). > > I don't see a way around to make the vectorizers recognize such > arithmetic operations. > > Do you think this can be done relatively quickly or does this involve > a huge effort? > > What does 'stepping through' the loop vectorizer mean? Using the > debugger and step through the program? Probably not. Is the way to > go, to alter the 'canVectorize' function print debug output to see > what's going on? > > Hal, you seem to know the loop vectorizer.Heh ;) Nadav wrote it. I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how difficult it would be to make SCEV understand the indexing calculations. -Hal> Is this the place to look > at or is the SLP vectorizer the more promising place? > > Frank > > > On 31/10/13 12:50, Nadav Rotem wrote: > > > > Hi Frank, > > > This loop should be vectorized by the SLP-vectorizer. It has several > scalars (C[0], C[1] … ) that can be merged into a vector. The SLP > vectorizer can’t figure out that the stores are consecutive because > SCEV can’t analyze the OR in the index calculation: > > %2 = and i64 %i.04, 3 > %3 = lshr i64 %i.04, 2 > %4 = shl i64 %3, 3 > %5 = or i64 %4, %2 > %11 = getelementptr inbounds float* %c, i64 %5 > store float %10, float* %11, align 4, !tbaa !0 > > > You wrote that you want each iteration to look like this: > > > > > > > > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27 > > > > Why can’t you just write a small loop like this: for (i=0; i<4; i++) > C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the > SLP-vectorizer will vectorize the unrolled iterations, or the > loop-vectorizer would catch it. > > > Thanks, > Nadav > > > On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > wrote: > > > > > A quite small but yet complete example function which all > vectorization passes fail to optimize: > > #include <cstdint> > #include <iostream> > > void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ > c, float * __restrict__ a, float * __restrict__ b) > { > for ( std::uint64_t i = start ; i < end ; i += 4 ) { > { > const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > } > } > > The loop index and array accesses for the first 4 iterations: > > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27 > > For example on an x86 processor with SSE (128 bit SIMD vectors) the > loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 > SIMD store. > > With current trunk I tried the following on the above example: > > clang++ -emit-llvm -S loop_minimal.cc -std=c++11 > opt -O3 -vectorize-slp -S loop_minimal.ll > opt -O3 -loop-vectorize -S loop_minimal.ll > opt -O3 -bb-vectorize -S loop_minimal.ll > > All optimization passes miss the opportunity. It seems the SCEV AA > pass doesn't understand modulo arithmetic. > > How can the SCEV AA pass be extended to handle this type of > arithmetic? Frank > > On 31/10/13 02:21, Renato Golin wrote: > > > On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote: > > > > > > > const std::uint64_t ir0 = (i+0)%4; // not working > > > > I thought this would be the case when I saw the original expression. > Maybe we need to teach module arithmetic to SCEV? > > --renato > > > > >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory