thr3ads.net - llvm dev - [LLVMdev] loop vectorizer misses opportunity, exploit [Oct 2013]

If this information is useful, please help other people find it:
Share via:

Nadav Rotem

2013-Oct-31 16:50 UTC

[LLVMdev] loop vectorizer misses opportunity, exploit

Hi Frank, 

This loop should be vectorized by the SLP-vectorizer.  It has several scalars
(C[0], C[1] … ) that can be merged into a vector.  The SLP vectorizer can’t
figure out that the stores are consecutive because SCEV can’t analyze the OR in
the index calculation:
 
  %2 = and i64 %i.04, 3
  %3 = lshr i64 %i.04, 2
  %4 = shl i64 %3, 3
  %5 = or i64 %4, %2
  %11 = getelementptr inbounds float* %c, i64 %5
  store float %10, float* %11, align 4, !tbaa !0

You wrote that you want each iteration to look like this:
> 
> i 0:     0 1 2 3 
> i 4:     8 9 10 11 
> i 8:     16 17 18 19 
> i 12:     24 25 26 27

Why can’t you just write a small loop like this:   for (i=0; i<4; i++) C[i] =
A[i] + B[i]  ??    Either the unroller will unroll it and the SLP-vectorizer
will vectorize the unrolled iterations, or the loop-vectorizer would catch it.

Thanks,
Nadav

On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org> wrote:
> A quite small but yet complete example function which all vectorization
passes fail to optimize:
> 
> #include <cstdint>
> #include <iostream>
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c,
float * __restrict__ a, float * __restrict__ b)
> {
>   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>     {
>       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>   }
> } 
> 
> The loop index and array accesses for the first 4 iterations:
> 
> i 0:     0 1 2 3 
> i 4:     8 9 10 11 
> i 8:     16 17 18 19 
> i 12:     24 25 26 27
> 
> For example on an x86 processor with SSE (128 bit SIMD vectors) the loop
body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.
> 
> With current trunk I tried the following on the above example:
> 
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
> 
> All optimization passes miss the opportunity. It seems the SCEV AA pass
doesn't understand modulo arithmetic.
> 
> How can the SCEV AA pass be extended to handle this type of arithmetic?
> Frank
> 
> On 31/10/13 02:21, Renato Golin wrote:
>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org>
wrote:
>>       const std::uint64_t ir0 = (i+0)%4;  // not working
>> 
>> I thought this would be the case when I saw the original expression.
Maybe we need to teach module arithmetic to SCEV?
>> 
>> --renato
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/797e3f56/attachment.html>

Frank Winter

2013-Oct-31 17:24 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

Hi Nadav,

that's the whole point of it. I can't in general make the index 
calculation simpler. The example given is the simplest non-trivial index 
function that is needed. It might well be that it's that simple that the 
index calculation in this case can be thrown aways altogether and - as 
you say - be replaced by the simple loop you mentioned. However, this 
cannot be done in the general case. In the general case the index 
calculation requires the 'rem' and 'div' instruction. The OR
instruction
must be the result of an arithmetic transformation of one of the 
previous passes (due to -O3?).

I don't see a way around to make the vectorizers recognize such 
arithmetic operations.

Do you think this can be done relatively quickly or does this involve a 
huge effort?

What does 'stepping through' the loop vectorizer mean? Using the 
debugger and step through the program? Probably not. Is the way to go, 
to alter the 'canVectorize' function print debug output to see
what's
going on?

Hal, you seem to know the loop vectorizer. Is this the place to look at 
or is the SLP vectorizer the more promising place?

Frank


On 31/10/13 12:50, Nadav Rotem wrote:> Hi Frank,
>
> This loop should be vectorized by the SLP-vectorizer.  It has several 
> scalars (C[0], C[1] … ) that can be merged into a vector.  The SLP 
> vectorizer can’t figure out that the stores are consecutive because 
> SCEV can’t analyze the OR in the index calculation:
>   %2 = and i64 %i.04, 3
>   %3 = lshr i64 %i.04, 2
>   %4 = shl i64 %3, 3
>   %5 = or i64 %4, %2
>   %11 = getelementptr inbounds float* %c, i64 %5
>   store float %10, float* %11, align 4, !tbaa !0
>
> You wrote that you want each iteration to look like this:
>
>>
>> i 0:     0 1 2 3
>> i 4:     8 9 10 11
>> i 8:     16 17 18 19
>> i 12:     24 25 26 27
>
> Why can’t you just write a small loop like this:   for (i=0; i<4; i++) 
> C[i] = A[i] + B[i]  ??    Either the unroller will unroll it and the 
> SLP-vectorizer will vectorize the unrolled iterations, or the 
> loop-vectorizer would catch it.
>
> Thanks,
> Nadav
>
> On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org 
> <mailto:fwinter at jlab.org>> wrote:
>
>> A quite small but yet complete example function which all 
>> vectorization passes fail to optimize:
>>
>> #include <cstdint>
>> #include <iostream>
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * 
>> __restrict__  c, float * __restrict__ a, float * __restrict__ b)
>> {
>>   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>>     {
>>       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>   }
>> }
>>
>> The loop index and array accesses for the first 4 iterations:
>>
>> i 0:     0 1 2 3
>> i 4:     8 9 10 11
>> i 8:     16 17 18 19
>> i 12:     24 25 26 27
>>
>> For example on an x86 processor with SSE (128 bit SIMD vectors) the
loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.
>>
>> With current trunk I tried the following on the above example:
>>
>> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
>> opt -O3 -vectorize-slp -S loop_minimal.ll
>> opt -O3 -loop-vectorize -S loop_minimal.ll
>> opt -O3 -bb-vectorize -S loop_minimal.ll
>>
>> All optimization passes miss the opportunity. It seems the SCEV AA pass
doesn't understand modulo arithmetic.
>>
>> How can the SCEV AA pass be extended to handle this type of arithmetic?
>> Frank
>>
>> On 31/10/13 02:21, Renato Golin wrote:
>>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org 
>>> <mailto:fwinter at jlab.org>> wrote:
>>>
>>>           const std::uint64_t ir0 = (i+0)%4; // not working
>>>
>>>
>>> I thought this would be the case when I saw the original
expression.
>>> Maybe we need to teach module arithmetic to SCEV?
>>>
>>> --renato
>>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/cea7d8a9/attachment.html>

Renato Golin

2013-Oct-31 17:29 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

On 31 October 2013 10:24, Frank Winter <fwinter at jlab.org> wrote:
>  What does 'stepping through' the loop vectorizer mean? Using the
> debugger and step through the program? Probably not. Is the way to go, to
> alter the 'canVectorize' function print debug output to see
what's going on?
>
Stepping through 'opt -O3' with a debugger, yes. The debug messages are
enabled by using '-debug-only=loop-vectorizer', but they's only half
the
story.

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/b4b9f11f/attachment.html>

Hal Finkel

2013-Oct-31 17:34 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

----- Original Message -----> 
> Hi Nadav,
> 
> that's the whole point of it. I can't in general make the index
> calculation simpler. The example given is the simplest non-trivial
> index function that is needed. It might well be that it's that
> simple that the index calculation in this case can be thrown aways
> altogether and - as you say - be replaced by the simple loop you
> mentioned. However, this cannot be done in the general case. In the
> general case the index calculation requires the 'rem' and
'div'
> instruction. The OR instruction must be the result of an arithmetic
> transformation of one of the previous passes (due to -O3?).
> 
> I don't see a way around to make the vectorizers recognize such
> arithmetic operations.
> 
> Do you think this can be done relatively quickly or does this involve
> a huge effort?
> 
> What does 'stepping through' the loop vectorizer mean? Using the
> debugger and step through the program? Probably not. Is the way to
> go, to alter the 'canVectorize' function print debug output to see
> what's going on?
> 
> Hal, you seem to know the loop vectorizer.
Heh ;) Nadav wrote it.

I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how
difficult it would be to make SCEV understand the indexing calculations.

 -Hal
> Is this the place to look
> at or is the SLP vectorizer the more promising place?
> 
> Frank
> 
> 
> On 31/10/13 12:50, Nadav Rotem wrote:
> 
> 
> 
> Hi Frank,
> 
> 
> This loop should be vectorized by the SLP-vectorizer. It has several
> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
> vectorizer can’t figure out that the stores are consecutive because
> SCEV can’t analyze the OR in the index calculation:
> 
> %2 = and i64 %i.04, 3
> %3 = lshr i64 %i.04, 2
> %4 = shl i64 %3, 3
> %5 = or i64 %4, %2
> %11 = getelementptr inbounds float* %c, i64 %5
> store float %10, float* %11, align 4, !tbaa !0
> 
> 
> You wrote that you want each iteration to look like this:
> 
> 
> 
> 
> 
> 
> 
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
> 
> 
> 
> Why can’t you just write a small loop like this: for (i=0; i<4; i++)
> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
> SLP-vectorizer will vectorize the unrolled iterations, or the
> loop-vectorizer would catch it.
> 
> 
> Thanks,
> Nadav
> 
> 
> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org >
wrote:
> 
> 
> 
> 
> A quite small but yet complete example function which all
> vectorization passes fail to optimize:
> 
> #include <cstdint>
> #include <iostream>
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
> c, float * __restrict__ a, float * __restrict__ b)
> {
> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
> {
> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> }
> }
> 
> The loop index and array accesses for the first 4 iterations:
> 
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
> 
> For example on an x86 processor with SSE (128 bit SIMD vectors) the
> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
> SIMD store.
> 
> With current trunk I tried the following on the above example:
> 
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
> 
> All optimization passes miss the opportunity. It seems the SCEV AA
> pass doesn't understand modulo arithmetic.
> 
> How can the SCEV AA pass be extended to handle this type of
> arithmetic? Frank
> 
> On 31/10/13 02:21, Renato Golin wrote:
> 
> 
> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote:
> 
> 
> 
> 
> 
> 
> const std::uint64_t ir0 = (i+0)%4; // not working
> 
> 
> 
> I thought this would be the case when I saw the original expression.
> Maybe we need to teach module arithmetic to SCEV?
> 
> --renato
> 
> 
> 
> 
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Seemingly Similar Threads

Search for more seemingly similar threads

llvm dev - Oct 2013 - [LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

Seemingly Similar Threads