thr3ads.net - llvm dev - [LLVMdev] loop vectorizer [Oct 2013]

Nadav Rotem

2013-Oct-31 01:16 UTC

[LLVMdev] loop vectorizer

On Oct 30, 2013, at 6:10 PM, Frank Winter <fwinter at jlab.org> wrote:
> the only option I see is to unroll the loop by hand. Since the array access
> is consecutive over 4 loop iterations I gave it a try and unrolled the loop by a
> factor of 4.  Which gives the following array accesses:
> 
> loop iter 0:
> index_0 = 0   index_1 = 4
> index_0 = 1   index_1 = 5
> index_0 = 2   index_1 = 6
> index_0 = 3   index_1 = 7
> 
> loop iter 1:
> index_0 = 8   index_1 = 12
> index_0 = 9   index_1 = 13
> index_0 = 10   index_1 = 14
> index_0 = 11   index_1 = 15

The SLP-vectorizer detects 8 stores, but it can't prove that they are
consecutive, so it moves on. Can you simplify the address expression? Can you
write "index0 = i*8 + 0" and give it a try?

> 
> For completeness, here the code:
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c,
>          float * __restrict__ a, float * __restrict__ b)
> {
>  const std::uint64_t inner = 4;
>  for (std::uint64_t i = start ; i < end ; i+=4 ) {
>    {
>      const std::uint64_t ir0 = ( ((i+0)/inner) * 2 + 0 ) * inner + (i+0)%4;
>      const std::uint64_t ir1 = ( ((i+0)/inner) * 2 + 1 ) * inner + (i+0)%4;
>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>      c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
>    }
>    {
>      const std::uint64_t ir0 = ( ((i+1)/inner) * 2 + 0 ) * inner + (i+1)%4;
>      const std::uint64_t ir1 = ( ((i+1)/inner) * 2 + 1 ) * inner + (i+1)%4;
>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>      c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
>    }
>    {
>      const std::uint64_t ir0 = ( ((i+2)/inner) * 2 + 0 ) * inner + (i+2)%4;
>      const std::uint64_t ir1 = ( ((i+2)/inner) * 2 + 1 ) * inner + (i+2)%4;
>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>      c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
>    }
>    {
>      const std::uint64_t ir0 = ( ((i+3)/inner) * 2 + 0 ) * inner + (i+3)%4;
>      const std::uint64_t ir1 = ( ((i+3)/inner) * 2 + 1 ) * inner + (i+3)%4;
>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>      c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
>    }
>  }
> }
> 
> 
> This should be an ideal test case for the SLP vectorizer, right?
> 
> It seems, I am out of luck:
> 
> opt -O3 -vectorize-slp -debug loop.ll -S
> 
> SLP: Analyzing blocks in _Z3barmmPfS_S_.
> SLP: Found 8 stores to vectorize.
> SLP: Analyzing a store chain of length 8.
> SLP: Trying to vectorize starting at PHIs (1)
> SLP: Vectorizing a list of length = 2.
> SLP: Vectorizing a list of length = 2.
> SLP: Vectorizing a list of length = 2.

Frank Winter

2013-Oct-31 01:40 UTC

[LLVMdev] loop vectorizer

I tried the following on the hand-unrolled loop:

       const std::uint64_t ir0 = i*8+0; // working

       const std::uint64_t ir0 = i%4+0; // working

       const std::uint64_t ir0 = (i+0)%4;  // not working

'+0' stands for +1, +2, +3 in the other unrolled iterations.

'Working' means the SLP vectorizer succeeded.

Thus, when working 'towards' the correct index function, auto 
vectorization fails. However, there is no option to use a simpler index 
function.
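
A small sketch may help show how the second and third variants relate. The helper names (`lane_sum`, `lane_mod`, `forms_agree`) are hypothetical, and the thread does not say why the analysis gives up on the third form. What can be checked directly is that `i%4 + k` and `(i+k)%4` agree for every lane only when `i` is a multiple of 4, a fact that follows from the loop stride and start value rather than from the expression itself.

```cpp
#include <cstdint>

// Form the SLP vectorizer handled in the experiment above: i%4 + k.
inline std::uint64_t lane_sum(std::uint64_t i, std::uint64_t k) {
  return i % 4 + k;
}

// Form it did not handle: (i + k) % 4.
inline std::uint64_t lane_mod(std::uint64_t i, std::uint64_t k) {
  return (i + k) % 4;
}

// True when both forms produce the same index for every lane k in 0..3.
inline bool forms_agree(std::uint64_t i) {
  for (std::uint64_t k = 0; k < 4; ++k) {
    if (lane_sum(i, k) != lane_mod(i, k)) return false;
  }
  return true;
}
```

For example, `forms_agree(8)` holds, while for `i = 1` and `k = 3` the first form gives 4 and the second wraps around to 0, so a rewrite from one form to the other needs the knowledge that `i` is 4-aligned.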

Is it possible to make the SCEV pass smarter? Or would you strongly 
advise against such an endeavor?

Frank



Renato Golin

2013-Oct-31 06:21 UTC

[LLVMdev] loop vectorizer

On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:
>        const std::uint64_t ir0 = (i+0)%4;  // not working
>
I thought this would be the case when I saw the original expression. Maybe
we need to teach modular arithmetic to SCEV?

--renato
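
As an illustration of the kind of modular reasoning discussed here, the derivation below works through the index expressions from the posted code for the k-th unrolled body. It assumes that i is a multiple of 4 (written i = 4q) and that nothing overflows; neither assumption is stated in the thread, and this is not a description of how SCEV works internally.

```latex
\begin{align*}
i &= 4q, \qquad 0 \le k \le 3 \\
(i + k) \bmod 4 &= k, \qquad \left\lfloor \frac{i + k}{4} \right\rfloor = q \\
\mathrm{ir0} &= (2q + 0) \cdot 4 + k = 8q + k = 2i + k \\
\mathrm{ir1} &= (2q + 1) \cdot 4 + k = 8q + 4 + k = 2i + 4 + k
\end{align*}
```

With k running over 0..3, ir0 and ir1 together cover the eight consecutive indices 2i through 2i + 7, which matches the access pattern listed at the top of the thread (0..7 for i = 0, 8..15 for i = 4).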
