thr3ads.net - llvm dev - [LLVMdev] loop vectorizer misses opportunity, exploit [Oct 2013]

If this information is useful, please help other people find it:
Share via:

Renato Golin

2013-Oct-31 06:21 UTC

[LLVMdev] loop vectorizer

On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:
>        const std::uint64_t ir0 = (i+0)%4;  // not working
>
I thought this would be the case when I saw the original expression. Maybe
we need to teach module arithmetic to SCEV?

--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131030/b68fbb14/attachment.html>

Frank Winter

2013-Oct-31 15:01 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

A quite small but yet complete example function which all vectorization 
passes fail to optimize:

#include <cstdint>
#include <iostream>

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  
c, float * __restrict__ a, float * __restrict__ b)
{
   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
     {
       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
     }
     {
       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
     }
     {
       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
     }
     {
       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
     }
   }
}

The loop index and array accesses for the first 4 iterations:

i 0:     0 1 2 3
i 4:     8 9 10 11
i 8:     16 17 18 19
i 12:     24 25 26 27

For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body
could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.

With current trunk I tried the following on the above example:

clang++ -emit-llvm -S loop_minimal.cc -std=c++11
opt -O3 -vectorize-slp -S loop_minimal.ll
opt -O3 -loop-vectorize -S loop_minimal.ll
opt -O3 -bb-vectorize -S loop_minimal.ll

All optimization passes miss the opportunity. It seems the SCEV AA pass
doesn't understand modulo arithmetic.

How can the SCEV AA pass be extended to handle this type of arithmetic?

Frank

On 31/10/13 02:21, Renato Golin wrote:
> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org 
> <mailto:fwinter at jlab.org>> wrote:
>
>           const std::uint64_t ir0 = (i+0)%4;  // not working
>
>
> I thought this would be the case when I saw the original expression. 
> Maybe we need to teach module arithmetic to SCEV?
>
> --renato

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/602f127b/attachment.html>

Renato Golin

2013-Oct-31 16:35 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

On 31 October 2013 08:01, Frank Winter <fwinter at jlab.org> wrote:
>  With current trunk I tried the following on the above example:
>
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
>
> All optimization passes miss the opportunity. It seems the SCEV AA pass
doesn't understand modulo arithmetic.
>
>Hi Frank,

IIRC, opt -O3 will already pass all three, so you don't have to add it to
the command line.


How can the SCEV AA pass be extended to handle this type of
arithmetic?>
>I'm not an SCEV expert, so it might be possible that it does understand,
but the loop vectorizer is not understanding the evolution info, or
whatever.

What I recommend is to step through the loop vectorizer (canVectorize) and
see what the SCEV returns to the vectorizer. If the info is there, but not
accounted for, it's a vectorizer bug. If the SCEV gets lost, than you need
to add it into the SCEV logic.

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/0427103a/attachment.html>

Nadav Rotem

2013-Oct-31 16:50 UTC

head link

[LLVMdev] loop vectorizer misses opportunity, exploit

Hi Frank, 

This loop should be vectorized by the SLP-vectorizer.  It has several scalars
(C[0], C[1] … ) that can be merged into a vector.  The SLP vectorizer can’t
figure out that the stores are consecutive because SCEV can’t analyze the OR in
the index calculation:
 
  %2 = and i64 %i.04, 3
  %3 = lshr i64 %i.04, 2
  %4 = shl i64 %3, 3
  %5 = or i64 %4, %2
  %11 = getelementptr inbounds float* %c, i64 %5
  store float %10, float* %11, align 4, !tbaa !0

You wrote that you want each iteration to look like this:
> 
> i 0:     0 1 2 3 
> i 4:     8 9 10 11 
> i 8:     16 17 18 19 
> i 12:     24 25 26 27

Why can’t you just write a small loop like this:   for (i=0; i<4; i++) C[i] =
A[i] + B[i]  ??    Either the unroller will unroll it and the SLP-vectorizer
will vectorize the unrolled iterations, or the loop-vectorizer would catch it.

Thanks,
Nadav

On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org> wrote:
> A quite small but yet complete example function which all vectorization
passes fail to optimize:
> 
> #include <cstdint>
> #include <iostream>
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c,
float * __restrict__ a, float * __restrict__ b)
> {
>   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>     {
>       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>   }
> } 
> 
> The loop index and array accesses for the first 4 iterations:
> 
> i 0:     0 1 2 3 
> i 4:     8 9 10 11 
> i 8:     16 17 18 19 
> i 12:     24 25 26 27
> 
> For example on an x86 processor with SSE (128 bit SIMD vectors) the loop
body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.
> 
> With current trunk I tried the following on the above example:
> 
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
> 
> All optimization passes miss the opportunity. It seems the SCEV AA pass
doesn't understand modulo arithmetic.
> 
> How can the SCEV AA pass be extended to handle this type of arithmetic?
> Frank
> 
> On 31/10/13 02:21, Renato Golin wrote:
>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org>
wrote:
>>       const std::uint64_t ir0 = (i+0)%4;  // not working
>> 
>> I thought this would be the case when I saw the original expression.
Maybe we need to teach module arithmetic to SCEV?
>> 
>> --renato
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/797e3f56/attachment.html>

Andrew Trick

2013-Nov-06 03:12 UTC

head link

[LLVMdev] loop vectorizer

On Oct 30, 2013, at 11:21 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:
>       const std::uint64_t ir0 = (i+0)%4;  // not working
> 
> I thought this would be the case when I saw the original expression. Maybe
we need to teach module arithmetic to SCEV?
I let this thread get stale, so here’s the background again:

source:

      const std::uint64_t ir0 = i%4 + 8*(i/4);
      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];

before instcombine:

  %4 = urem i64 %i.0, 4
  %5 = udiv i64 %i.0, 4
  %6 = mul i64 8, %5
  %7 = add i64 %4, %6
  %8 = getelementptr inbounds float* %a, i64 %7

after instcombine:

  %2 = and i64 %i.04, 3
  %3 = lshr i64 %i.04, 2
  %4 = shl i64 %3, 3
  %5 = or i64 %4, %2
  %11 = getelementptr inbounds float* %c, i64 %5
  store float %10, float* %11, align 4, !tbaa !0

Honestly, I don't understand why InstCombine "anti-canonicalizes"
add->or. I think that transformation should be deferred into we begin
target-specific lower (e.g. InstOptimize pass).

Given, that we aren't going to change that any time soon, SCEV could
probably be taught to recognize the specific pattern:

Instructions (or (and %a, C1), (shl %b, C2)) -> SCEV (add %a, %b)

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131105/09cdb551/attachment.html>

Frank Winter

2013-Nov-06 03:39 UTC

head link

[LLVMdev] loop vectorizer

Good that you bring this up. I still have no solution to this 
vectorization problem.

However, I can rewrite the code and insert a second loop which 
eliminates the 'urem' and 'div' instructions in the index
calculations.
In this case, the inner loop's trip count would be equal to the SIMD 
length and the loop vectorizer ignores the loop. Unrolling the loop and 
SLP is not an option, since the loop body can get lengthy.

What would be a quicker to implement:

a) Teach the loop vectorizer the 'urem' and 'div' instructions,
or
b) have the loop vectorizer process loops with trip count equal to the 
vector length ?

One of both solutions will be needed, I guess.

Frank



On 05/11/13 22:12, Andrew Trick wrote:>
> On Oct 30, 2013, at 11:21 PM, Renato Golin <renato.golin at linaro.org 
> <mailto:renato.golin at linaro.org>> wrote:
>
>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org 
>> <mailto:fwinter at jlab.org>> wrote:
>>
>>           const std::uint64_t ir0 = (i+0)%4;  // not working
>>
>>
>> I thought this would be the case when I saw the original expression. 
>> Maybe we need to teach module arithmetic to SCEV?
>
> I let this thread get stale, so here’s the background again:
>
> source:
>
>       const std::uint64_t ir0 = i%4 + 8*(i/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>
> before instcombine:
>
>   %4 = urem i64 %i.0, 4
>   %5 = udiv i64 %i.0, 4
>   %6 = mul i64 8, %5
>   %7 = add i64 %4, %6
>   %8 = getelementptr inbounds float* %a, i64 %7
>
> after instcombine:
>
>   %2 = and i64 %i.04, 3
>   %3 = lshr i64 %i.04, 2
>   %4 = shl i64 %3, 3
>   %5 = or i64 %4, %2
>   %11 = getelementptr inbounds float* %c, i64 %5
>   store float %10, float* %11, align 4, !tbaa !0
>
> Honestly, I don't understand why InstCombine
"anti-canonicalizes"
> add->or. I think that transformation should be deferred into we begin 
> target-specific lower (e.g. InstOptimize pass).
>
> Given, that we aren't going to change that any time soon, SCEV could 
> probably be taught to recognize the specific pattern:
>
> Instructions (or (and %a, C1), (shl %b, C2)) -> SCEV (add %a, %b)
>
> -Andy

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131105/b4f36ea9/attachment.html>

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Oct 2013 - [LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer misses opportunity, exploit

[LLVMdev] loop vectorizer

[LLVMdev] loop vectorizer

Possibly Parallel Threads