----- Original Message -----
>
> I ran the BB vectorizer as I guess this is the SLP vectorizer.

No, while the BB vectorizer is doing a form of SLP vectorization, there is a separate SLP vectorization pass which uses a different algorithm. You can pass -vectorize-slp to opt.

 -Hal

> BBV: using target information
> BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
> BBV: found 2 instructions with candidate pairs
> BBV: found 0 pair connections.
> BBV: done!
>
> However, this was run on the unrolled loop (I guess).
>
> Here is the IR printed by 'opt':
>
> entry:
>   %cmp9 = icmp ult i64 %start, %end
>   br i1 %cmp9, label %for.body, label %for.end
>
> for.body:                                  ; preds = %entry, %for.body
>   %storemerge10 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
>   %div = lshr i64 %storemerge10, 2
>   %mul1 = shl i64 %div, 3
>   %rem = and i64 %storemerge10, 3
>   %add2 = or i64 %mul1, %rem
>   %0 = lshr i64 %storemerge10, 1
>   %add51 = shl i64 %0, 2
>   %mul6 = or i64 %rem, %add51
>   %add8 = or i64 %mul6, 4
>   %arrayidx = getelementptr inbounds float* %a, i64 %add2
>   %1 = load float* %arrayidx, align 4
>   %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
>   %2 = load float* %arrayidx9, align 4
>   %add10 = fadd float %1, %2
>   %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
>   store float %add10, float* %arrayidx11, align 4
>   %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
>   %3 = load float* %arrayidx12, align 4
>   %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
>   %4 = load float* %arrayidx13, align 4
>   %add14 = fadd float %3, %4
>   %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
>   store float %add14, float* %arrayidx15, align 4
>   %inc = add i64 %storemerge10, 1
>   %exitcond = icmp eq i64 %inc, %end
>   br i1 %exitcond, label %for.end, label %for.body
>
> for.end:                                   ; preds = %for.body, %entry
>   ret void
>
>
> Are you saying that I should unroll the loop first by a given
> factor and then run SLP again? How would I do that, say, for a factor
> of 2?
>
> Frank
>
>
> On 30/10/13 13:28, Renato Golin wrote:
>
> On 30 October 2013 09:25, Nadav Rotem < nrotem at apple.com > wrote:
>
> The access pattern to arrays a and b is non-linear. Unrolled loops
> are usually handled by the SLP-vectorizer. Are ir0 and ir1
> consecutive for all values of i?
>
> Based on his list of values, it seems that the induction stride is
> linear within each block of 4 iterations, but it's not a clear
> relationship.
>
> As you say, it should be possible to spot that once the loop is
> unrolled, and get the SLP to vectorize if the relationship becomes
> clear.
>
> Maybe I'm wrong, but this looks like a problem of missed
> opportunities, not technically hard to implement.
>
> --renato

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
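For reference, a minimal opt invocation along the lines Hal describes could look like the following. This is a sketch only: loop.ll and the -O1/-debug flags are taken from later messages in this thread, and -debug output is only available from an assertions-enabled build of opt.

  # run the separate SLP pass (not the BB vectorizer)
  opt -O1 -vectorize-slp -S loop.ll -o loop.slp.ll

  # same, with the pass's debug output (assertions-enabled build required)
  opt -O1 -vectorize-slp -debug -S loop.ll -o /dev/null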
The SLP vectorizer apparently did something in the prologue of the function (where storing of arguments on the stack happens) which then got eliminated later on (since I don't see any vector instructions in the final IR). Below is the debug output of the SLP pass:

Args: opt -O1 -vectorize-slp -debug loop.ll -S

SLP: Analyzing blocks in _Z3barmmPfS_S_.
SLP: Found 2 stores to vectorize.
SLP: Analyzing a store chain of length 2.
SLP: Trying to vectorize starting at PHIs (1)
SLP: Vectorizing a list of length = 2.
SLP: Vectorizing a list of length = 2.
SLP: Vectorizing a list of length = 2.

IR produced:

define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias nocapture %c, float* noalias nocapture readonly %a, float* noalias nocapture readonly %b) #3 {
entry:
  %cmp14 = icmp ult i64 %start, %end
  br i1 %cmp14, label %for.body, label %for.end

for.body:                                  ; preds = %entry, %for.body
  %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
  %div = lshr i64 %i.015, 2
  %mul = shl i64 %div, 3
  %rem = and i64 %i.015, 3
  %add2 = or i64 %mul, %rem
  %add8 = or i64 %add2, 4
  %arrayidx = getelementptr inbounds float* %a, i64 %add2
  %0 = load float* %arrayidx, align 4
  %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
  %1 = load float* %arrayidx9, align 4
  %add10 = fadd float %0, %1
  %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
  store float %add10, float* %arrayidx11, align 4
  %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
  %2 = load float* %arrayidx12, align 4
  %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
  %3 = load float* %arrayidx13, align 4
  %add14 = fadd float %2, %3
  %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
  store float %add14, float* %arrayidx15, align 4
  %inc = add i64 %i.015, 1
  %exitcond = icmp eq i64 %inc, %end
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                   ; preds = %for.body, %entry
  ret void
}

Does this finally mean that current LLVM cannot vectorize the function?

void bar(std::uint64_t start, std::uint64_t end,
         float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
  const std::uint64_t inner = 4;
  for (std::uint64_t i = start ; i < end ; ++i ) {
    const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
    const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
    c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
    c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
  }
}

I was trying the following:

clang++ -emit-llvm -S loop.cc -std=c++11
(this writes loop.ll)

opt -O1 -bb-vectorize -debug loop.ll -S
(I found the -O1 useful to prepare the IR for the loop vectorizer, so I hoped it would work here as well)

opt -O1 -loop-vectorize -debug-only=loop-vectorize loop.ll -S

opt -loop-unroll -unroll-allow-partial -debug loop.ll -S
(this didn't unroll the loop.) Here is the output relevant to loop-unroll:

Loop Unroll: F[_Z3barmmPfS_S_] Loop %for.cond

Are there any other combinations of passes that I might try?

Frank


On 30/10/13 13:50, Hal Finkel wrote:
> ----- Original Message -----
>>
>> I ran the BB vectorizer as I guess this is the SLP vectorizer.
> No, while the BB vectorizer is doing a form of SLP vectorization, there is a separate SLP vectorization pass which uses a different algorithm. You can pass -vectorize-slp to opt.
>
> -Hal
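Following up on the question above about other pass combinations, one thing worth trying is to force the unroll explicitly and only then hand the result to the SLP pass. This is a sketch only: -loop-unroll, -unroll-allow-partial and -vectorize-slp appear elsewhere in this thread, but -unroll-count and -unroll-runtime are assumptions on my part (the trip count end - start is not known at compile time, so a runtime or partial unroll would be needed for the unroll to fire at all):

  # hypothetical pipeline: canonicalize, unroll by 2, then SLP-vectorize
  opt -O1 -loop-unroll -unroll-allow-partial -unroll-runtime -unroll-count=2 \
      -vectorize-slp -S loop.ll -o loop.unroll-slp.ll

Whether the unroll actually happened can be checked with -debug as in the earlier commands, or simply by inspecting the output IR.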
The debug messages are misleading. They should read “trying to vectorize a list of …”. The problem is that the SCEV analysis is unable to detect that C[ir0] and C[ir1] are consecutive.

Is this loop from an important benchmark?

Thanks,
Nadav

On Oct 30, 2013, at 11:13 AM, Frank Winter <fwinter at jlab.org> wrote:

> The SLP vectorizer apparently did something in the prologue of the function (where storing of arguments on the stack happens) which then got eliminated later on (since I don't see any vector instructions in the final IR). Below is the debug output of the SLP pass:
>
> Args: opt -O1 -vectorize-slp -debug loop.ll -S
> SLP: Analyzing blocks in _Z3barmmPfS_S_.
> SLP: Found 2 stores to vectorize.
> SLP: Analyzing a store chain of length 2.
> SLP: Trying to vectorize starting at PHIs (1)
> SLP: Vectorizing a list of length = 2.
> SLP: Vectorizing a list of length = 2.
> SLP: Vectorizing a list of length = 2.
>
> Does this finally mean that current LLVM cannot vectorize the function?
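For reference only, here is the index pattern written out from the C++ source quoted above; this is plain arithmetic and makes no claim about what SCEV can or cannot prove. With inner = 4, ir0 = (i/4)*8 + i%4 and ir1 = ir0 + 4. A small standalone sketch (my own, not part of the thread) that prints the first two blocks:

// Hypothetical sketch: enumerate the indices used by bar() for i = 0..7,
// using the ir0/ir1 formulas copied from the loop body quoted in this thread.
#include <cstdint>
#include <iostream>

int main()
{
  const std::uint64_t inner = 4;
  for (std::uint64_t i = 0; i < 8; ++i) {
    const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
    const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
    std::cout << "i=" << i << "  ir0=" << ir0 << "  ir1=" << ir1 << "\n";
  }
  return 0;
}
// prints ir0 = 0 1 2 3  8  9 10 11
//        ir1 = 4 5 6 7 12 13 14 15

So within each block of four iterations both indices advance by one (matching Renato's observation earlier in the thread), while the two accesses made in a single iteration are always four elements apart.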
On 30 October 2013 11:13, Frank Winter <fwinter at jlab.org> wrote:

> opt -O1 -bb-vectorize -debug loop.ll -S
> (I found the -O1 useful to prepare the IR for the loop vectorizer, so I
> hoped it would work here as well)

O1 might not run all of the more aggressive passes that prepare the loop for the vectorizers. Not all loops will vectorize at all optimization levels, even though the vectorizers are enabled at most of them.

Still, this doesn't mean LLVM will vectorize this (or any) specific loop at a specific optimization level. Using "-O3 -debug-only=..." will help you understand the issues, reduce test cases, and report bugs and improvement requests.

cheers,
--renato
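A concrete form of that suggestion might look as follows. This is a sketch only: -debug-only=loop-vectorize appears earlier in this thread and, like -debug, requires an assertions-enabled build of opt; I have not verified the SLP pass's -debug-only name, so the generic -debug is used for it.

  # let the full -O3 pipeline prepare the IR, then see why the loop vectorizer gives up
  opt -O3 -debug-only=loop-vectorize -S loop.ll -o /dev/null

  # very verbose: all debug output, with the SLP pass added explicitly
  opt -O3 -vectorize-slp -debug -S loop.ll -o /dev/null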