On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:> const std::uint64_t ir0 = (i+0)%4; // not working >I thought this would be the case when I saw the original expression. Maybe we need to teach module arithmetic to SCEV? --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131030/b68fbb14/attachment.html>
A quite small but yet complete example function which all vectorization passes fail to optimize: #include <cstdint> #include <iostream> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) { for ( std::uint64_t i = start ; i < end ; i += 4 ) { { const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; } { const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; } { const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; } { const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; } } } The loop index and array accesses for the first 4 iterations: i 0: 0 1 2 3 i 4: 8 9 10 11 i 8: 16 17 18 19 i 12: 24 25 26 27 For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store. With current trunk I tried the following on the above example: clang++ -emit-llvm -S loop_minimal.cc -std=c++11 opt -O3 -vectorize-slp -S loop_minimal.ll opt -O3 -loop-vectorize -S loop_minimal.ll opt -O3 -bb-vectorize -S loop_minimal.ll All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic. How can the SCEV AA pass be extended to handle this type of arithmetic? Frank On 31/10/13 02:21, Renato Golin wrote:> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org > <mailto:fwinter at jlab.org>> wrote: > > const std::uint64_t ir0 = (i+0)%4; // not working > > > I thought this would be the case when I saw the original expression. > Maybe we need to teach module arithmetic to SCEV? > > --renato-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/602f127b/attachment.html>
On 31 October 2013 08:01, Frank Winter <fwinter at jlab.org> wrote:> With current trunk I tried the following on the above example: > > clang++ -emit-llvm -S loop_minimal.cc -std=c++11 > opt -O3 -vectorize-slp -S loop_minimal.ll > opt -O3 -loop-vectorize -S loop_minimal.ll > opt -O3 -bb-vectorize -S loop_minimal.ll > > All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic. > >Hi Frank, IIRC, opt -O3 will already pass all three, so you don't have to add it to the command line. How can the SCEV AA pass be extended to handle this type of arithmetic?> >I'm not an SCEV expert, so it might be possible that it does understand, but the loop vectorizer is not understanding the evolution info, or whatever. What I recommend is to step through the loop vectorizer (canVectorize) and see what the SCEV returns to the vectorizer. If the info is there, but not accounted for, it's a vectorizer bug. If the SCEV gets lost, than you need to add it into the SCEV logic. cheers, --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/0427103a/attachment.html>
Hi Frank, This loop should be vectorized by the SLP-vectorizer. It has several scalars (C[0], C[1] … ) that can be merged into a vector. The SLP vectorizer can’t figure out that the stores are consecutive because SCEV can’t analyze the OR in the index calculation: %2 = and i64 %i.04, 3 %3 = lshr i64 %i.04, 2 %4 = shl i64 %3, 3 %5 = or i64 %4, %2 %11 = getelementptr inbounds float* %c, i64 %5 store float %10, float* %11, align 4, !tbaa !0 You wrote that you want each iteration to look like this:> > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27Why can’t you just write a small loop like this: for (i=0; i<4; i++) C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the SLP-vectorizer will vectorize the unrolled iterations, or the loop-vectorizer would catch it. Thanks, Nadav On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org> wrote:> A quite small but yet complete example function which all vectorization passes fail to optimize: > > #include <cstdint> > #include <iostream> > > void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) > { > for ( std::uint64_t i = start ; i < end ; i += 4 ) { > { > const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > { > const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > } > } > } > > The loop index and array accesses for the first 4 iterations: > > i 0: 0 1 2 3 > i 4: 8 9 10 11 > i 8: 16 17 18 19 > i 12: 24 25 26 27 > > For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store. > > With current trunk I tried the following on the above example: > > clang++ -emit-llvm -S loop_minimal.cc -std=c++11 > opt -O3 -vectorize-slp -S loop_minimal.ll > opt -O3 -loop-vectorize -S loop_minimal.ll > opt -O3 -bb-vectorize -S loop_minimal.ll > > All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic. > > How can the SCEV AA pass be extended to handle this type of arithmetic? > Frank > > On 31/10/13 02:21, Renato Golin wrote: >> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote: >> const std::uint64_t ir0 = (i+0)%4; // not working >> >> I thought this would be the case when I saw the original expression. Maybe we need to teach module arithmetic to SCEV? >> >> --renato > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/797e3f56/attachment.html>
On Oct 30, 2013, at 11:21 PM, Renato Golin <renato.golin at linaro.org> wrote:> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote: > const std::uint64_t ir0 = (i+0)%4; // not working > > I thought this would be the case when I saw the original expression. Maybe we need to teach module arithmetic to SCEV?I let this thread get stale, so here’s the background again: source: const std::uint64_t ir0 = i%4 + 8*(i/4); c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; before instcombine: %4 = urem i64 %i.0, 4 %5 = udiv i64 %i.0, 4 %6 = mul i64 8, %5 %7 = add i64 %4, %6 %8 = getelementptr inbounds float* %a, i64 %7 after instcombine: %2 = and i64 %i.04, 3 %3 = lshr i64 %i.04, 2 %4 = shl i64 %3, 3 %5 = or i64 %4, %2 %11 = getelementptr inbounds float* %c, i64 %5 store float %10, float* %11, align 4, !tbaa !0 Honestly, I don't understand why InstCombine "anti-canonicalizes" add->or. I think that transformation should be deferred into we begin target-specific lower (e.g. InstOptimize pass). Given, that we aren't going to change that any time soon, SCEV could probably be taught to recognize the specific pattern: Instructions (or (and %a, C1), (shl %b, C2)) -> SCEV (add %a, %b) -Andy -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131105/09cdb551/attachment.html>
Good that you bring this up. I still have no solution to this vectorization problem. However, I can rewrite the code and insert a second loop which eliminates the 'urem' and 'div' instructions in the index calculations. In this case, the inner loop's trip count would be equal to the SIMD length and the loop vectorizer ignores the loop. Unrolling the loop and SLP is not an option, since the loop body can get lengthy. What would be a quicker to implement: a) Teach the loop vectorizer the 'urem' and 'div' instructions, or b) have the loop vectorizer process loops with trip count equal to the vector length ? One of both solutions will be needed, I guess. Frank On 05/11/13 22:12, Andrew Trick wrote:> > On Oct 30, 2013, at 11:21 PM, Renato Golin <renato.golin at linaro.org > <mailto:renato.golin at linaro.org>> wrote: > >> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org >> <mailto:fwinter at jlab.org>> wrote: >> >> const std::uint64_t ir0 = (i+0)%4; // not working >> >> >> I thought this would be the case when I saw the original expression. >> Maybe we need to teach module arithmetic to SCEV? > > I let this thread get stale, so here’s the background again: > > source: > > const std::uint64_t ir0 = i%4 + 8*(i/4); > c[ ir0 ] = a[ ir0 ] + b[ ir0 ]; > > before instcombine: > > %4 = urem i64 %i.0, 4 > %5 = udiv i64 %i.0, 4 > %6 = mul i64 8, %5 > %7 = add i64 %4, %6 > %8 = getelementptr inbounds float* %a, i64 %7 > > after instcombine: > > %2 = and i64 %i.04, 3 > %3 = lshr i64 %i.04, 2 > %4 = shl i64 %3, 3 > %5 = or i64 %4, %2 > %11 = getelementptr inbounds float* %c, i64 %5 > store float %10, float* %11, align 4, !tbaa !0 > > Honestly, I don't understand why InstCombine "anti-canonicalizes" > add->or. I think that transformation should be deferred into we begin > target-specific lower (e.g. InstOptimize pass). > > Given, that we aren't going to change that any time soon, SCEV could > probably be taught to recognize the specific pattern: > > Instructions (or (and %a, C1), (shl %b, C2)) -> SCEV (add %a, %b) > > -Andy-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131105/b4f36ea9/attachment.html>
Possibly Parallel Threads
- [LLVMdev] loop vectorizer misses opportunity, exploit
- [LLVMdev] loop vectorizer misses opportunity, exploit
- [LLVMdev] loop vectorizer misses opportunity, exploit
- [LLVMdev] loop vectorizer misses opportunity, exploit
- [LLVMdev] loop vectorizer misses opportunity, exploit