I tried the following on the hand-unrolled loop:
const std::uint64_t ir0 = i*8+0; // working
const std::uint64_t ir0 = i%4+0; // working
const std::uint64_t ir0 = (i+0)%4; // not working
The '+0' is the offset in the first unrolled iteration; the other three use +1, +2 and +3.
'Working' means the SLP vectorizer succeeded.
Thus, auto-vectorization fails as soon as I work 'towards' the correct
index function. However, using a simpler index function is not an option
for me.
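For reference, the index arithmetic can be checked in isolation. The sketch below assumes that `start` is a multiple of 4, which the thread does not state. Under that assumption (i+k)/4 == i/4 and (i+k)%4 == k for k = 0..3, so the full index function collapses to 2*i + 4*half + k. The names `index_full`, `index_simple` and `indices_agree` are made up for this illustration. Whether the SLP vectorizer accepts the collapsed form was not tested here.

```cpp
#include <cstdint>

// Index function as written in the hand-unrolled loop:
// half == 0 gives ir0, half == 1 gives ir1.
constexpr std::uint64_t index_full(std::uint64_t i, std::uint64_t half) {
    constexpr std::uint64_t inner = 4;
    return ((i / inner) * 2 + half) * inner + i % inner;
}

// Hypothetical collapsed form. Only valid when i is a multiple of 4
// and 0 <= k < 4 (an assumption, not something the thread guarantees).
constexpr std::uint64_t index_simple(std::uint64_t i, std::uint64_t k,
                                     std::uint64_t half) {
    return 2 * i + 4 * half + k;
}

// Returns true if both forms produce the same indices for every
// unrolled iteration of a loop running from start to end in steps of 4.
bool indices_agree(std::uint64_t start, std::uint64_t end) {
    for (std::uint64_t i = start; i < end; i += 4)
        for (std::uint64_t k = 0; k < 4; ++k)
            for (std::uint64_t half = 0; half < 2; ++half)
                if (index_full(i + k, half) != index_simple(i, k, half))
                    return false;
    return true;
}
```

If `start` is not a multiple of 4 the two forms diverge, so the collapsed form is only usable when that alignment can be guaranteed by the caller.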
Is it possible to make the SCEV pass smarter? Or would you strongly
advise against such an endeavor?
Frank
On 30/10/13 21:16, Nadav Rotem wrote:
> On Oct 30, 2013, at 6:10 PM, Frank Winter <fwinter at jlab.org> wrote:
>
>> the only option I see is to unroll the loop by hand. Since the array
>> access is consecutive over 4 loop iterations, I gave it a try and
>> unrolled the loop by a factor of 4, which gives the following array
>> accesses:
>>
>> loop iter 0:
>> index_0 = 0 index_1 = 4
>> index_0 = 1 index_1 = 5
>> index_0 = 2 index_1 = 6
>> index_0 = 3 index_1 = 7
>>
>> loop iter 1:
>> index_0 = 8 index_1 = 12
>> index_0 = 9 index_1 = 13
>> index_0 = 10 index_1 = 14
>> index_0 = 11 index_1 = 15
>
> The SLP-vectorizer detects 8 stores, but it can’t prove that they are
> consecutive, so it moves on. Can you simplify the address expression?
> Can you write "index0 = i*8 + 0" and give it a try?
>
>>
>> For completeness, here is the code:
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c,
>>          float * __restrict__ a, float * __restrict__ b)
>> {
>>   const std::uint64_t inner = 4;
>>   for (std::uint64_t i = start; i < end; i += 4) {
>>     {
>>       const std::uint64_t ir0 = ( ((i+0)/inner) * 2 + 0 ) * inner + (i+0)%4;
>>       const std::uint64_t ir1 = ( ((i+0)/inner) * 2 + 1 ) * inner + (i+0)%4;
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = ( ((i+1)/inner) * 2 + 0 ) * inner + (i+1)%4;
>>       const std::uint64_t ir1 = ( ((i+1)/inner) * 2 + 1 ) * inner + (i+1)%4;
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = ( ((i+2)/inner) * 2 + 0 ) * inner + (i+2)%4;
>>       const std::uint64_t ir1 = ( ((i+2)/inner) * 2 + 1 ) * inner + (i+2)%4;
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = ( ((i+3)/inner) * 2 + 0 ) * inner + (i+3)%4;
>>       const std::uint64_t ir1 = ( ((i+3)/inner) * 2 + 1 ) * inner + (i+3)%4;
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
>>     }
>>   }
>> }
>>
>>
>> This should be an ideal test case for the SLP vectorizer, right?
>>
>> It seems I am out of luck:
>>
>> opt -O3 -vectorize-slp -debug loop.ll -S
>>
>> SLP: Analyzing blocks in _Z3barmmPfS_S_.
>> SLP: Found 8 stores to vectorize.
>> SLP: Analyzing a store chain of length 8.
>> SLP: Trying to vectorize starting at PHIs (1)
>> SLP: Vectorizing a list of length = 2.
>> SLP: Vectorizing a list of length = 2.
>> SLP: Vectorizing a list of length = 2.
>