thr3ads.net - llvm dev - [LLVMdev] loop vectorizer misses opportunity, exploit [Oct 2013]

If this information is useful, please help other people find it:
Share via:

Hal Finkel

2013-Oct-31 17:34 UTC

[LLVMdev] loop vectorizer misses opportunity, exploit

----- Original Message -----> 
> Hi Nadav,
> 
> that's the whole point of it. I can't in general make the index
> calculation simpler. The example given is the simplest non-trivial
> index function that is needed. It might well be that it's that
> simple that the index calculation in this case can be thrown aways
> altogether and - as you say - be replaced by the simple loop you
> mentioned. However, this cannot be done in the general case. In the
> general case the index calculation requires the 'rem' and
'div'
> instruction. The OR instruction must be the result of an arithmetic
> transformation of one of the previous passes (due to -O3?).
> 
> I don't see a way around to make the vectorizers recognize such
> arithmetic operations.
> 
> Do you think this can be done relatively quickly or does this involve
> a huge effort?
> 
> What does 'stepping through' the loop vectorizer mean? Using the
> debugger and step through the program? Probably not. Is the way to
> go, to alter the 'canVectorize' function print debug output to see
> what's going on?
> 
> Hal, you seem to know the loop vectorizer.
Heh ;) Nadav wrote it.

I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how
difficult it would be to make SCEV understand the indexing calculations.

 -Hal
> Is this the place to look
> at or is the SLP vectorizer the more promising place?
> 
> Frank
> 
> 
> On 31/10/13 12:50, Nadav Rotem wrote:
> 
> 
> 
> Hi Frank,
> 
> 
> This loop should be vectorized by the SLP-vectorizer. It has several
> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
> vectorizer can’t figure out that the stores are consecutive because
> SCEV can’t analyze the OR in the index calculation:
> 
> %2 = and i64 %i.04, 3
> %3 = lshr i64 %i.04, 2
> %4 = shl i64 %3, 3
> %5 = or i64 %4, %2
> %11 = getelementptr inbounds float* %c, i64 %5
> store float %10, float* %11, align 4, !tbaa !0
> 
> 
> You wrote that you want each iteration to look like this:
> 
> 
> 
> 
> 
> 
> 
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
> 
> 
> 
> Why can’t you just write a small loop like this: for (i=0; i<4; i++)
> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
> SLP-vectorizer will vectorize the unrolled iterations, or the
> loop-vectorizer would catch it.
> 
> 
> Thanks,
> Nadav
> 
> 
> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org >
wrote:
> 
> 
> 
> 
> A quite small but yet complete example function which all
> vectorization passes fail to optimize:
> 
> #include <cstdint>
> #include <iostream>
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
> c, float * __restrict__ a, float * __restrict__ b)
> {
> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
> {
> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> }
> }
> 
> The loop index and array accesses for the first 4 iterations:
> 
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
> 
> For example on an x86 processor with SSE (128 bit SIMD vectors) the
> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
> SIMD store.
> 
> With current trunk I tried the following on the above example:
> 
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
> 
> All optimization passes miss the opportunity. It seems the SCEV AA
> pass doesn't understand modulo arithmetic.
> 
> How can the SCEV AA pass be extended to handle this type of
> arithmetic? Frank
> 
> On 31/10/13 02:21, Renato Golin wrote:
> 
> 
> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote:
> 
> 
> 
> 
> 
> 
> const std::uint64_t ir0 = (i+0)%4; // not working
> 
> 
> 
> I thought this would be the case when I saw the original expression.
> Maybe we need to teach module arithmetic to SCEV?
> 
> --renato
> 
> 
> 
> 
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Frank Winter

2013-Oct-31 17:38 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

On 31/10/13 13:34, Hal Finkel wrote:> ----- Original Message -----
>> Hi Nadav,
>>
>> that's the whole point of it. I can't in general make the index
>> calculation simpler. The example given is the simplest non-trivial
>> index function that is needed. It might well be that it's that
>> simple that the index calculation in this case can be thrown aways
>> altogether and - as you say - be replaced by the simple loop you
>> mentioned. However, this cannot be done in the general case. In the
>> general case the index calculation requires the 'rem' and
'div'
>> instruction. The OR instruction must be the result of an arithmetic
>> transformation of one of the previous passes (due to -O3?).
>>
>> I don't see a way around to make the vectorizers recognize such
>> arithmetic operations.
>>
>> Do you think this can be done relatively quickly or does this involve
>> a huge effort?
>>
>> What does 'stepping through' the loop vectorizer mean? Using
the
>> debugger and step through the program? Probably not. Is the way to
>> go, to alter the 'canVectorize' function print debug output to
see
>> what's going on?
>>
>> Hal, you seem to know the loop vectorizer.
> Heh ;) Nadav wrote it.I know. But you said that it doesn't use the SCEV AA information but 
uses its interal SCEV. Being able to say this requires knowledge of the 
loop vectorizer. That's why I said it.>
> I've CC'd Andy, who is our SCEV expert; hopefully he can comment on
how difficult it would be to make SCEV understand the indexing calculations.
>
>   -Hal
>
>> Is this the place to look
>> at or is the SLP vectorizer the more promising place?
>>
>> Frank
>>
>>
>> On 31/10/13 12:50, Nadav Rotem wrote:
>>
>>
>>
>> Hi Frank,
>>
>>
>> This loop should be vectorized by the SLP-vectorizer. It has several
>> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
>> vectorizer can’t figure out that the stores are consecutive because
>> SCEV can’t analyze the OR in the index calculation:
>>
>> %2 = and i64 %i.04, 3
>> %3 = lshr i64 %i.04, 2
>> %4 = shl i64 %3, 3
>> %5 = or i64 %4, %2
>> %11 = getelementptr inbounds float* %c, i64 %5
>> store float %10, float* %11, align 4, !tbaa !0
>>
>>
>> You wrote that you want each iteration to look like this:
>>
>>
>>
>>
>>
>>
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>>
>>
>> Why can’t you just write a small loop like this: for (i=0; i<4; i++)
>> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
>> SLP-vectorizer will vectorize the unrolled iterations, or the
>> loop-vectorizer would catch it.
>>
>>
>> Thanks,
>> Nadav
>>
>>
>> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org >
wrote:
>>
>>
>>
>>
>> A quite small but yet complete example function which all
>> vectorization passes fail to optimize:
>>
>> #include <cstdint>
>> #include <iostream>
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
>> c, float * __restrict__ a, float * __restrict__ b)
>> {
>> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>> {
>> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> }
>> }
>>
>> The loop index and array accesses for the first 4 iterations:
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>> For example on an x86 processor with SSE (128 bit SIMD vectors) the
>> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
>> SIMD store.
>>
>> With current trunk I tried the following on the above example:
>>
>> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
>> opt -O3 -vectorize-slp -S loop_minimal.ll
>> opt -O3 -loop-vectorize -S loop_minimal.ll
>> opt -O3 -bb-vectorize -S loop_minimal.ll
>>
>> All optimization passes miss the opportunity. It seems the SCEV AA
>> pass doesn't understand modulo arithmetic.
>>
>> How can the SCEV AA pass be extended to handle this type of
>> arithmetic? Frank
>>
>> On 31/10/13 02:21, Renato Golin wrote:
>>
>>
>> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org >
wrote:
>>
>>
>>
>>
>>
>>
>> const std::uint64_t ir0 = (i+0)%4; // not working
>>
>>
>>
>> I thought this would be the case when I saw the original expression.
>> Maybe we need to teach module arithmetic to SCEV?
>>
>> --renato
>>
>>
>>
>>
>>

Frank Winter

2013-Oct-31 17:49 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

Here's a real example of IR (as opposed to the C example provided 
earlier) which includes the type of index calculation that lets the loop 
vectorizer and SLP vectorizer (after unrolling) fail:


define void @main(i64 %arg0, i64 %arg1, i1 %arg2, i64 %arg3, float* 
noalias %arg4, float* noalias %arg5, float* noalias %arg6) {
entrypoint:
   br i1 %arg2, label %L0, label %L1

L0:                                               ; preds = %entrypoint
   %0 = add nsw i64 %arg0, %arg3
   %1 = add nsw i64 %arg1, %arg3
   br label %L2

L1:                                               ; preds = %entrypoint
   br label %L2

L2:                                               ; preds = %L0, %L1
   %2 = phi i64 [ %arg0, %L1 ], [ %0, %L0 ]
   %3 = phi i64 [ %arg1, %L1 ], [ %1, %L0 ]
   br label %L3

L3:                                               ; preds = %L3, %L2
   %4 = phi i64 [ %25, %L3 ], [ %2, %L2 ]
   %5 = sdiv i64 %4, 4
   %6 = srem i64 %4, 4
   %7 = mul i64 %5, 2
   %8 = mul i64 %7, 4
   %9 = add nsw i64 %8, %6
   %10 = add nsw i64 %7, 1
   %11 = mul i64 %10, 4
   %12 = add nsw i64 %11, %6
   %13 = getelementptr float* %arg5, i64 %9
   %14 = load float* %13
   %15 = getelementptr float* %arg5, i64 %12
   %16 = load float* %15
   %17 = getelementptr float* %arg6, i64 %9
   %18 = load float* %17
   %19 = getelementptr float* %arg6, i64 %12
   %20 = load float* %19
   %21 = fadd float %20, %16
   %22 = fadd float %18, %14
   %23 = getelementptr float* %arg4, i64 %9
   store float %22, float* %23
   %24 = getelementptr float* %arg4, i64 %12
   store float %21, float* %24
   %25 = add nsw i64 %4, 1
   %26 = icmp sge i64 %25, %3
   br i1 %26, label %L4, label %L3

L4:                                               ; preds = %L3
   ret void
}


Right now we have:

LV: Checking a loop in "main"
LV: Found a loop: L3
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer   %23 = getelementptr float* 
%arg4, i64 %9 SCEV: ((4 * ((8 * %5) + %6)) + %arg4)
LV: Bad stride - Not an AddRecExpr pointer   %24 = getelementptr float* 
%arg4, i64 %12 SCEV: (16 + (4 * %6) + (32 * %5) + %arg4)
LV: Src Scev: ((4 * ((8 * %5) + %6)) + %arg4)Sink Scev: (16 + (4 * %6) + 
(32 * %5) + %arg4)(Induction step: 0)
LV: Distance for   store float %22, float* %23 to   store float %21, 
float* %24: 16
Non-consecutive pointer access
LV: We don't need a runtime memory check.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

However, 4 consecutive iterations could be combined (for an x86 SSE 
SIMD). (With a SIMD width of 8 floats, it would be that 8 iterations 
could be combined. In that case the literal constant in %8 = mul i64 %7, 
4 would be 8 of course).

Even after unrolling this loop by hand (factor 4), the SLP vectorizer fails.

Frank


On 31/10/13 13:34, Hal Finkel wrote:> ----- Original Message -----
>> Hi Nadav,
>>
>> that's the whole point of it. I can't in general make the index
>> calculation simpler. The example given is the simplest non-trivial
>> index function that is needed. It might well be that it's that
>> simple that the index calculation in this case can be thrown aways
>> altogether and - as you say - be replaced by the simple loop you
>> mentioned. However, this cannot be done in the general case. In the
>> general case the index calculation requires the 'rem' and
'div'
>> instruction. The OR instruction must be the result of an arithmetic
>> transformation of one of the previous passes (due to -O3?).
>>
>> I don't see a way around to make the vectorizers recognize such
>> arithmetic operations.
>>
>> Do you think this can be done relatively quickly or does this involve
>> a huge effort?
>>
>> What does 'stepping through' the loop vectorizer mean? Using
the
>> debugger and step through the program? Probably not. Is the way to
>> go, to alter the 'canVectorize' function print debug output to
see
>> what's going on?
>>
>> Hal, you seem to know the loop vectorizer.
> Heh ;) Nadav wrote it.
>
> I've CC'd Andy, who is our SCEV expert; hopefully he can comment on
how difficult it would be to make SCEV understand the indexing calculations.
>
>   -Hal
>
>> Is this the place to look
>> at or is the SLP vectorizer the more promising place?
>>
>> Frank
>>
>>
>> On 31/10/13 12:50, Nadav Rotem wrote:
>>
>>
>>
>> Hi Frank,
>>
>>
>> This loop should be vectorized by the SLP-vectorizer. It has several
>> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
>> vectorizer can’t figure out that the stores are consecutive because
>> SCEV can’t analyze the OR in the index calculation:
>>
>> %2 = and i64 %i.04, 3
>> %3 = lshr i64 %i.04, 2
>> %4 = shl i64 %3, 3
>> %5 = or i64 %4, %2
>> %11 = getelementptr inbounds float* %c, i64 %5
>> store float %10, float* %11, align 4, !tbaa !0
>>
>>
>> You wrote that you want each iteration to look like this:
>>
>>
>>
>>
>>
>>
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>>
>>
>> Why can’t you just write a small loop like this: for (i=0; i<4; i++)
>> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
>> SLP-vectorizer will vectorize the unrolled iterations, or the
>> loop-vectorizer would catch it.
>>
>>
>> Thanks,
>> Nadav
>>
>>
>> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org >
wrote:
>>
>>
>>
>>
>> A quite small but yet complete example function which all
>> vectorization passes fail to optimize:
>>
>> #include <cstdint>
>> #include <iostream>
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
>> c, float * __restrict__ a, float * __restrict__ b)
>> {
>> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>> {
>> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> }
>> }
>>
>> The loop index and array accesses for the first 4 iterations:
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>> For example on an x86 processor with SSE (128 bit SIMD vectors) the
>> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
>> SIMD store.
>>
>> With current trunk I tried the following on the above example:
>>
>> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
>> opt -O3 -vectorize-slp -S loop_minimal.ll
>> opt -O3 -loop-vectorize -S loop_minimal.ll
>> opt -O3 -bb-vectorize -S loop_minimal.ll
>>
>> All optimization passes miss the opportunity. It seems the SCEV AA
>> pass doesn't understand modulo arithmetic.
>>
>> How can the SCEV AA pass be extended to handle this type of
>> arithmetic? Frank
>>
>> On 31/10/13 02:21, Renato Golin wrote:
>>
>>
>> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org >
wrote:
>>
>>
>>
>>
>>
>>
>> const std::uint64_t ir0 = (i+0)%4; // not working
>>
>>
>>
>> I thought this would be the case when I saw the original expression.
>> Maybe we need to teach module arithmetic to SCEV?
>>
>> --renato
>>
>>
>>
>>
>>

Hal Finkel

2013-Oct-31 18:39 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

----- Original Message -----> On 31/10/13 13:34, Hal Finkel wrote:
> > ----- Original Message -----
> >> Hi Nadav,
> >>
> >> that's the whole point of it. I can't in general make the
index
> >> calculation simpler. The example given is the simplest non-trivial
> >> index function that is needed. It might well be that it's that
> >> simple that the index calculation in this case can be thrown aways
> >> altogether and - as you say - be replaced by the simple loop you
> >> mentioned. However, this cannot be done in the general case. In
> >> the
> >> general case the index calculation requires the 'rem' and
'div'
> >> instruction. The OR instruction must be the result of an
> >> arithmetic
> >> transformation of one of the previous passes (due to -O3?).
> >>
> >> I don't see a way around to make the vectorizers recognize
such
> >> arithmetic operations.
> >>
> >> Do you think this can be done relatively quickly or does this
> >> involve
> >> a huge effort?
> >>
> >> What does 'stepping through' the loop vectorizer mean?
Using the
> >> debugger and step through the program? Probably not. Is the way to
> >> go, to alter the 'canVectorize' function print debug
output to see
> >> what's going on?
> >>
> >> Hal, you seem to know the loop vectorizer.
> > Heh ;) Nadav wrote it.
> I know. But you said that it doesn't use the SCEV AA information but
> uses its interal SCEV. Being able to say this requires knowledge of
> the
> loop vectorizer. That's why I said it.
Fair enough; I'm not sure how well I know it, but I've looked over the
code a few times. The memory dependence code is in MemoryDepChecker::isDependent
in LoopVectorize.cpp.

Also, Sebastian Pop and I have been ping-ponging on working on delinearization
analysis. Sebastian's latest code
(http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20131021/192120.html)
has some SCEV-based GCD and division code, and that could be useful for this as
well.

 -Hal
> >
> > I've CC'd Andy, who is our SCEV expert; hopefully he can
comment on
> > how difficult it would be to make SCEV understand the indexing
> > calculations.
> >
> >   -Hal
> >
> >> Is this the place to look
> >> at or is the SLP vectorizer the more promising place?
> >>
> >> Frank
> >>
> >>
> >> On 31/10/13 12:50, Nadav Rotem wrote:
> >>
> >>
> >>
> >> Hi Frank,
> >>
> >>
> >> This loop should be vectorized by the SLP-vectorizer. It has
> >> several
> >> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
> >> vectorizer can’t figure out that the stores are consecutive
> >> because
> >> SCEV can’t analyze the OR in the index calculation:
> >>
> >> %2 = and i64 %i.04, 3
> >> %3 = lshr i64 %i.04, 2
> >> %4 = shl i64 %3, 3
> >> %5 = or i64 %4, %2
> >> %11 = getelementptr inbounds float* %c, i64 %5
> >> store float %10, float* %11, align 4, !tbaa !0
> >>
> >>
> >> You wrote that you want each iteration to look like this:
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> i 0: 0 1 2 3
> >> i 4: 8 9 10 11
> >> i 8: 16 17 18 19
> >> i 12: 24 25 26 27
> >>
> >>
> >>
> >> Why can’t you just write a small loop like this: for (i=0; i<4;
> >> i++)
> >> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
> >> SLP-vectorizer will vectorize the unrolled iterations, or the
> >> loop-vectorizer would catch it.
> >>
> >>
> >> Thanks,
> >> Nadav
> >>
> >>
> >> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org
>
> >> wrote:
> >>
> >>
> >>
> >>
> >> A quite small but yet complete example function which all
> >> vectorization passes fail to optimize:
> >>
> >> #include <cstdint>
> >> #include <iostream>
> >>
> >> void bar(std::uint64_t start, std::uint64_t end, float *
> >> __restrict__
> >> c, float * __restrict__ a, float * __restrict__ b)
> >> {
> >> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
> >> {
> >> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> {
> >> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> {
> >> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> {
> >> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> }
> >> }
> >>
> >> The loop index and array accesses for the first 4 iterations:
> >>
> >> i 0: 0 1 2 3
> >> i 4: 8 9 10 11
> >> i 8: 16 17 18 19
> >> i 12: 24 25 26 27
> >>
> >> For example on an x86 processor with SSE (128 bit SIMD vectors)
> >> the
> >> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
> >> SIMD store.
> >>
> >> With current trunk I tried the following on the above example:
> >>
> >> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> >> opt -O3 -vectorize-slp -S loop_minimal.ll
> >> opt -O3 -loop-vectorize -S loop_minimal.ll
> >> opt -O3 -bb-vectorize -S loop_minimal.ll
> >>
> >> All optimization passes miss the opportunity. It seems the SCEV AA
> >> pass doesn't understand modulo arithmetic.
> >>
> >> How can the SCEV AA pass be extended to handle this type of
> >> arithmetic? Frank
> >>
> >> On 31/10/13 02:21, Renato Golin wrote:
> >>
> >>
> >> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org
> wrote:
> >>
> >>
> >>
> >>
> >>
> >>
> >> const std::uint64_t ir0 = (i+0)%4; // not working
> >>
> >>
> >>
> >> I thought this would be the case when I saw the original
> >> expression.
> >> Maybe we need to teach module arithmetic to SCEV?
> >>
> >> --renato
> >>
> >>
> >>
> >>
> >>
> 
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Possibly Parallel Threads

Search for more apparently analagous threads

llvm dev - Oct 2013 - [LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

Possibly Parallel Threads