Well, they are not directly consecutive. They are consecutive with a
constant offset or stride:
ir1 = ir0 + 4
If I rewrite the function in this form
void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
c, float * __restrict__ a, float * __restrict__ b)
{
const std::uint64_t inner = 4;
for (std::uint64_t i = start ; i < end ; ++i ) {
const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
const std::uint64_t ir1 = ir0 + 4;
c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
}
}
still neither the SLP nor the loop vectorizer do anything of effect:
SLP: Analyzing blocks in _Z3barmmPfS_S_.
SLP: Found 2 stores to vectorize.
SLP: Analyzing a store chain of length 2.
SLP: Trying to vectorize starting at PHIs (1)
SLP: Vectorizing a list of length = 2.
SLP: Vectorizing a list of length = 2.
SLP: Vectorizing a list of length = 2.
LV: Checking a loop in "_Z3barmmPfS_S_"
LV: Found a loop: for.body
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer %arrayidx6 = getelementptr
inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw>
LV: Bad stride - Not an AddRecExpr pointer %arrayidx10 = getelementptr
inbounds float* %c, i64 %add312 SCEV: ((4 * %add312)<nsw> + %c)<nsw>
LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 *
%add312)<nsw>
+ %c)<nsw>(Induction step: 0)
LV: Distance for store float %add5, float* %arrayidx6, align 4 to
store float %add9, float* %arrayidx10, align 4: ((4 * %add312)<nsw> +
(-4 * %add2))
Non-consecutive pointer access
LV: We don't need a runtime memory check.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.
To answer Nadav's question. This kind of loop is generated by a
scientific library and we are in the process of evaluating whether LLVM
can be used for this research project. The target architectures will
have (very wide) vector instructions and these loops are
performance-critical to the application. Thus it would be important that
these loops can make use of the vector units.
Right now as it seems LLVM cannot vectorize these loops. We might have
some time to look into this, but it's not sure yet. However, high-level
guidance from LLVM pros would be very useful.
What is this usual way of requesting an improvement feature? Is this
mailing list the central pace to communicate?
Thanks,
Frank
On 30/10/13 14:15, Nadav Rotem wrote:> The debug messages are misleading. They should read “trying to vectorize a
list of …”; The problem is that the SCEV analysis is unable to detect that
C[ir0] and C[ir1] are consecutive. Is this loop from an important benchmark ?
>
> Thanks,
> Nadav
>
>
> On Oct 30, 2013, at 11:13 AM, Frank Winter <fwinter at jlab.org>
wrote:
>
>> The SLP vectorizer apparently did something in the prologue of the
function (where storing of arguments on the stack happens) which then got
eliminated later on (since I don't see any vector instructions in the final
IR). Below the debug output of the SLP pass:
>>
>> Args: opt -O1 -vectorize-slp -debug loop.ll -S
>> SLP: Analyzing blocks in _Z3barmmPfS_S_.
>> SLP: Found 2 stores to vectorize.
>> SLP: Analyzing a store chain of length 2.
>> SLP: Trying to vectorize starting at PHIs (1)
>> SLP: Vectorizing a list of length = 2.
>> SLP: Vectorizing a list of length = 2.
>> SLP: Vectorizing a list of length = 2.
>>
>> IR produced:
>>
>> define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias
nocapture %c, float* noalias nocapture readonly %a, float* noalias nocapture
readonly %b) #3 {
>> entry:
>> %cmp14 = icmp ult i64 %start, %end
>> br i1 %cmp14, label %for.body, label %for.end
>>
>> for.body: ; preds = %entry,
%for.body
>> %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
>> %div = lshr i64 %i.015, 2
>> %mul = shl i64 %div, 3
>> %rem = and i64 %i.015, 3
>> %add2 = or i64 %mul, %rem
>> %add8 = or i64 %add2, 4
>> %arrayidx = getelementptr inbounds float* %a, i64 %add2
>> %0 = load float* %arrayidx, align 4
>> %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
>> %1 = load float* %arrayidx9, align 4
>> %add10 = fadd float %0, %1
>> %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
>> store float %add10, float* %arrayidx11, align 4
>> %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
>> %2 = load float* %arrayidx12, align 4
>> %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
>> %3 = load float* %arrayidx13, align 4
>> %add14 = fadd float %2, %3
>> %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
>> store float %add14, float* %arrayidx15, align 4
>> %inc = add i64 %i.015, 1
>> %exitcond = icmp eq i64 %inc, %end
>> br i1 %exitcond, label %for.end, label %for.body
>>
>> for.end: ; preds = %for.body,
%entry
>> ret void
>> }
>>
>>
>>
>> Does this finally mean the current LLVM cannot vectorize the function?:
>>
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
c, float * __restrict__ a, float * __restrict__ b)
>> {
>> const std::uint64_t inner = 4;
>> for (std::uint64_t i = start ; i < end ; ++i ) {
>> const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
>> const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
>> }
>> }
>>
>> I was trying the following:
>>
>> clang++ -emit-llvm -S loop.cc -std=c++11
>> (this writes loop.ll)
>>
>> opt -O1 -bb-vectorize -debug loop.ll -S
>> (I found the -O1 useful to prepare the IR for the loop vectorizer, so I
hoped this would work here as well)
>>
>> opt -O1 -loop-vectorize -debug-only=loop-vectorize loop.ll -S
>>
>> opt -loop-unroll -unroll-allow-partial -debug loop.ll -S
>> (this didn't unroll the loop.) Here the output (relevant for
loop-unroll)
>> Loop Unroll: F[_Z3barmmPfS_S_] Loop %for.cond
>>
>> Are there any other combinations of passes that I might try?
>>
>> Frank
>>
>>
>>
>> On 30/10/13 13:50, Hal Finkel wrote:
>>> ----- Original Message -----
>>>> I ran the BB vectorizer as I guess this is the SLP vectorizer.
>>> No, while the BB vectorizer is doing a form of SLP vectorization,
there is a separate SLP vectorization pass which uses a different algorithm. You
can pass -vectorize-slp to opt.
>>>
>>> -Hal
>>>
>>>> BBV: using target information
>>>> BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
>>>> BBV: found 2 instructions with candidate pairs
>>>> BBV: found 0 pair connections.
>>>> BBV: done!
>>>>
>>>> However, this was run on the unrolled loop (I guess).
>>>>
>>>> Here is the IR printed by 'opt':
>>>>
>>>> entry:
>>>> %cmp9 = icmp ult i64 %start, %end
>>>> br i1 %cmp9, label %for.body, label %for.end
>>>>
>>>> for.body: ; preds = %entry, %for.body
>>>> %storemerge10 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
>>>> %div = lshr i64 %storemerge10, 2
>>>> %mul1 = shl i64 %div, 3
>>>> %rem = and i64 %storemerge10, 3
>>>> %add2 = or i64 %mul1, %rem
>>>> %0 = lshr i64 %storemerge10, 1
>>>> %add51 = shl i64 %0, 2
>>>> %mul6 = or i64 %rem, %add51
>>>> %add8 = or i64 %mul6, 4
>>>> %arrayidx = getelementptr inbounds float* %a, i64 %add2
>>>> %1 = load float* %arrayidx, align 4
>>>> %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
>>>> %2 = load float* %arrayidx9, align 4
>>>> %add10 = fadd float %1, %2
>>>> %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
>>>> store float %add10, float* %arrayidx11, align 4
>>>> %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
>>>> %3 = load float* %arrayidx12, align 4
>>>> %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
>>>> %4 = load float* %arrayidx13, align 4
>>>> %add14 = fadd float %3, %4
>>>> %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
>>>> store float %add14, float* %arrayidx15, align 4
>>>> %inc = add i64 %storemerge10, 1
>>>> %exitcond = icmp eq i64 %inc, %end
>>>> br i1 %exitcond, label %for.end, label %for.body
>>>>
>>>> for.end: ; preds = %for.body, %entry
>>>> ret void
>>>>
>>>>
>>>> Is what you're saying that I should unroll the loop first
by a given
>>>> factor and then run SLP again? How would I do that say for a
factor
>>>> of 2?
>>>>
>>>> Frank
>>>>
>>>>
>>>>
>>>> On 30/10/13 13:28, Renato Golin wrote:
>>>>
>>>>
>>>>
>>>>
>>>> On 30 October 2013 09:25, Nadav Rotem < nrotem at apple.com
> wrote:
>>>>
>>>>
>>>> The access pattern to arrays a and b is non-linear. Unrolled
loops
>>>> are usually handled by the SLP-vectorizer. Are ir0 and ir1
>>>> consecutive for all values for i ?
>>>>
>>>>
>>>> Based on his list of values, it seems that the induction stride
is
>>>> linear within each block of 4 iterations, but it's not a
clear
>>>> relationship.
>>>>
>>>>
>>>> As you say, it should be possible to spot that once the loop is
>>>> unrolled, and get the SLP to vectorize if the relationship
becomes
>>>> clear.
>>>>
>>>>
>>>> Maybe I'm wrong, but this looks like a problem of missed
>>>> opportunities, not technically hard to implement.
>>>>
>>>>
>>>> --renato
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>
>>