thr3ads.net - llvm dev - [LLVMdev] loop vectorizer [Nov 2013]


Frank Winter

2013-Nov-06 03:39 UTC

[LLVMdev] loop vectorizer

Good that you bring this up. I still have no solution to this 
vectorization problem.

However, I can rewrite the code and insert a second loop which
eliminates the 'urem' and 'udiv' instructions in the index calculations.
In this case, the inner loop's trip count would be equal to the SIMD
length, and the loop vectorizer ignores the loop. Unrolling the loop and
using SLP is not an option, since the loop body can get lengthy.
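
For illustration, the rewrite described above might look roughly like the following. This is only a sketch under stated assumptions, not the actual code from this thread: the function names are made up, the block width of 4 is taken from the `i%4` in the index expression, and the element count is assumed to be a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Original form: one loop whose index needs a remainder and a division.
void add_single_loop(const float* a, const float* b, float* c, std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i) {
        const std::uint64_t ir0 = i % 4 + 8 * (i / 4);
        c[ir0] = a[ir0] + b[ir0];
    }
}

// Rewritten form: the outer loop walks blocks, the inner loop walks the
// 4 consecutive elements of a block. No remainder or division is needed,
// but the inner trip count is now exactly 4.
void add_two_loops(const float* a, const float* b, float* c, std::uint64_t n) {
    assert(n % 4 == 0);  // assumption of this sketch
    for (std::uint64_t blk = 0; blk < n / 4; ++blk) {
        for (std::uint64_t lane = 0; lane < 4; ++lane) {
            const std::uint64_t ir0 = lane + 8 * blk;
            c[ir0] = a[ir0] + b[ir0];
        }
    }
}

// Hypothetical helper: runs one of the two variants on small test arrays
// and returns the output array. Untouched output elements stay at -1.
std::vector<float> run_kernel(bool two_loops, std::uint64_t n) {
    const std::size_t size = static_cast<std::size_t>(2 * n);  // indices stay below 8*(n/4)
    std::vector<float> a(size), b(size), c(size, -1.0f);
    for (std::size_t k = 0; k < size; ++k) {
        a[k] = static_cast<float>(k);
        b[k] = 2.0f * static_cast<float>(k);
    }
    if (two_loops)
        add_two_loops(a.data(), b.data(), c.data(), n);
    else
        add_single_loop(a.data(), b.data(), c.data(), n);
    return c;
}
```

Both variants write the same elements, so `run_kernel(false, 16)` and `run_kernel(true, 16)` should compare equal.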

Which would be quicker to implement:

a) teach the loop vectorizer the 'urem' and 'udiv' instructions,
or
b) have the loop vectorizer process loops with trip count equal to the
vector length?

One of the two solutions will be needed, I guess.

Frank



On 05/11/13 22:12, Andrew Trick wrote:
>
> On Oct 30, 2013, at 11:21 PM, Renato Golin <renato.golin at linaro.org> wrote:
>
>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:
>>
>>           const std::uint64_t ir0 = (i+0)%4;  // not working
>>
>>
>> I thought this would be the case when I saw the original expression.
>> Maybe we need to teach modular arithmetic to SCEV?
>
> I let this thread get stale, so here’s the background again:
>
> source:
>
>       const std::uint64_t ir0 = i%4 + 8*(i/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>
> before instcombine:
>
>   %4 = urem i64 %i.0, 4
>   %5 = udiv i64 %i.0, 4
>   %6 = mul i64 8, %5
>   %7 = add i64 %4, %6
>   %8 = getelementptr inbounds float* %a, i64 %7
>
> after instcombine:
>
>   %2 = and i64 %i.04, 3
>   %3 = lshr i64 %i.04, 2
>   %4 = shl i64 %3, 3
>   %5 = or i64 %4, %2
>   %11 = getelementptr inbounds float* %c, i64 %5
>   store float %10, float* %11, align 4, !tbaa !0
>
> Honestly, I don't understand why InstCombine "anti-canonicalizes"
> add->or. I think that transformation should be deferred until we begin
> target-specific lowering (e.g. InstOptimize pass).
>
> Given that we aren't going to change that any time soon, SCEV could
> probably be taught to recognize the specific pattern:
>
> Instructions (or (and %a, C1), (shl %b, C2)) -> SCEV (add %a, %b)
>
> -Andy
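
To see why the `or` in the instcombine output computes the same value as the `add` in the source, here is a small sketch (the function names are hypothetical, chosen only for this illustration). The `and` keeps bits 0 and 1, while the `shl` by 3 leaves bits 0 to 2 zero, so the two operands never share a set bit and `or` behaves like `add`.

```cpp
#include <cassert>
#include <cstdint>

// Index as written in the source: remainder plus scaled quotient.
std::uint64_t index_arith(std::uint64_t i) {
    return i % 4 + 8 * (i / 4);
}

// Index as it appears after instcombine: and / lshr / shl / or.
std::uint64_t index_bits(std::uint64_t i) {
    const std::uint64_t low  = i & 3;          // and  i, 3
    const std::uint64_t high = (i >> 2) << 3;  // lshr i, 2 followed by shl 3
    assert((high & low) == 0);                 // operands have no common bits
    return high | low;                         // or
}

// Returns true if both forms agree for every i below limit.
bool forms_agree(std::uint64_t limit) {
    for (std::uint64_t i = 0; i < limit; ++i) {
        if (index_arith(i) != index_bits(i))
            return false;
    }
    return true;
}
```

A call such as `forms_agree(1024)` checks the first 1024 induction values; the disjoint-bits argument is what makes the equivalence hold in general.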


Andrew Trick

2013-Nov-06 04:14 UTC

On Nov 5, 2013, at 7:39 PM, Frank Winter <fwinter at jlab.org> wrote:
> Good that you bring this up. I still have no solution to this
> vectorization problem.
>
> However, I can rewrite the code and insert a second loop which eliminates
> the 'urem' and 'udiv' instructions in the index calculations. In this case,
> the inner loop's trip count would be equal to the SIMD length and the loop
> vectorizer ignores the loop. Unrolling the loop and SLP is not an option,
> since the loop body can get lengthy.
>
> What would be quicker to implement:
>
> a) Teach the loop vectorizer the 'urem' and 'udiv' instructions, or

If I’m correct in assuming that this means teaching SCEV about
(or (and…), (shl…)), then it seems like a worthwhile thing to do, but it is a
fundamental improvement that is not easy and could have unknown impact.

> b) have the loop vectorizer process loops with trip count equal to the
> vector length?

Seems to me like we should handle this, and it’s just a matter of fixing the
driver/heuristics. I would file a PR with this test case.

-Andy

Arnold

2013-Nov-06 13:54 UTC

> On Nov 5, 2013, at 7:39 PM, Frank Winter <fwinter at jlab.org> wrote:
>
> Good that you bring this up. I still have no solution to this
> vectorization problem.
>
> However, I can rewrite the code and insert a second loop which eliminates
> the 'urem' and 'udiv' instructions in the index calculations. In this case,
> the inner loop's trip count would be equal to the SIMD length and the loop
> vectorizer ignores the loop. Unrolling the loop and SLP is not an option,
> since the loop body can get lengthy.
>
> What would be quicker to implement:
>
> a) Teach the loop vectorizer the 'urem' and 'udiv' instructions, or

This would probably be harder because your individual accesses are consecutive
within a stride:

a[0] a[1] a[2] a[3]  a[8] a[9] a[10] a[11]

This is not something the loop vectorizer currently understands.

> b) have the loop vectorizer process loops with trip count equal to the
> vector length?

You should be able to change "TinyTripCountVectorThreshold" in
LoopVectorize.cpp.
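
As a quick check of the access pattern discussed above, the following sketch enumerates the element indices produced by `i%4 + 8*(i/4)` and groups them into runs of consecutive elements. The helper names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element indices touched by the first n loop iterations.
std::vector<std::uint64_t> touched_indices(std::uint64_t n) {
    std::vector<std::uint64_t> idx;
    idx.reserve(static_cast<std::size_t>(n));
    for (std::uint64_t i = 0; i < n; ++i)
        idx.push_back(i % 4 + 8 * (i / 4));
    return idx;
}

// Lengths of the maximal runs of consecutive indices, in order.
std::vector<std::uint64_t> run_lengths(const std::vector<std::uint64_t>& idx) {
    std::vector<std::uint64_t> runs;
    for (std::size_t k = 0; k < idx.size(); ++k) {
        if (k > 0 && idx[k] == idx[k - 1] + 1) {
            assert(!runs.empty());  // k > 0, so a run is already open
            ++runs.back();
        } else {
            runs.push_back(1);      // gap (or first element): start a new run
        }
    }
    return runs;
}
```

For eight iterations this gives the indices 0 1 2 3 8 9 10 11, that is, two runs of four consecutive elements with a gap of four between them, which is neither a unit-stride nor a plain strided access.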

Frank Winter

2013-Nov-06 15:42 UTC

On 06/11/13 08:54, Arnold wrote:
>
> On Nov 5, 2013, at 7:39 PM, Frank Winter <fwinter at jlab.org> wrote:
>
>> Good that you bring this up. I still have no solution to this
>> vectorization problem.
>>
>> However, I can rewrite the code and insert a second loop which
>> eliminates the 'urem' and 'udiv' instructions in the index
>> calculations. In this case, the inner loop's trip count would be
>> equal to the SIMD length and the loop vectorizer ignores the loop.
>> Unrolling the loop and SLP is not an option, since the loop body can
>> get lengthy.
>>
>> What would be quicker to implement:
>>
>> a) Teach the loop vectorizer the 'urem' and 'udiv' instructions, or
>
> This would probably be harder because your individual accesses are
> consecutive within a stride.
>
> a[0] a[1] a[2] a[3]  a[8] a[9] a[10] a[11]
>
> Not something the loop vectorizer currently understands.
>
>> b) have the loop vectorizer process loops with trip count equal to
>> the vector length?
>
> You should be able to change "TinyTripCountVectorThreshold" in
> LoopVectorize.cpp

I managed to set this option when using the 'opt' tool. Is there a way to
set it when using the API without changing the default value in the
source code and recompiling LLVM?
