thr3ads.net - llvm dev - [LLVMdev] [Vectorization] Mis match in code generated [Sep 2014]

If this information is useful, please help other people find it:
Share via:

suyog sarda

2014-Sep-18 19:24 UTC

[LLVMdev] [Vectorization] Mis match in code generated

Hi Nadav,

Thanks for the quick reply !!

Ok, so as of now we are lacking capability to handle flat large reductions.

I did go through function vectorizeChainsInBlock() (line number 2862). In
this function,
we try to vectorize if we have phi nodes in the IR (several if's check for
phi nodes) i.e we try to
construct tree that starts at chains.

Any pointers on how to join multiple trees? I will also try to dig more
into it.

Comments/Suggestions are most welcomed !!

On Fri, Sep 19, 2014 at 12:23 AM, Nadav Rotem <nrotem at apple.com> wrote:
>
> On Sep 18, 2014, at 11:48 AM, suyog sarda <sardask01 at gmail.com>
wrote:
>
> Hi,
>
> I am trying to understand LLVM vectorization implementation and was
> looking into both loop and SLP vectorization.
>
> test case 1:
>
>
>
>
>
>
> *int foo(int *a) {int sum = 0,i;for(i=0; i<16; i++)  sum += a[i];return
> sum;}*
>
> This code is vectorized by loop vectorizer where we calculate scalar loop
> cost as 4 and vector loop cost as 2.
> Since vector loop cost is less and above reduction is legal to vectorize,
> loop vectorizer produces vectorized code :
>
> *IR without vectorization:*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  br label
> %for.bodyfor.body:                                         ; preds >
%for.body, %entry  %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> %sum.04 = phi i32 [ 0, %entry ], [ %add, %for.body ]  %arrayidx >
getelementptr inbounds i32* %a, i32 %i.05  %0 = load i32* %arrayidx, align
> 4, !tbaa !1  %add = add nsw i32 %0, %sum.04  %inc = add nsw i32 %i.05, 1
> %exitcond = icmp eq i32 %i.05, 15  br i1 %exitcond, label %for.end, label
> %for.bodyfor.end:                                          ; preds >
%for.body  ret i32 %add}*
>
>
> *IR after vectorization:*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %backedge.overflow
> icmp eq i32 15, -1  %overflow.check.anchor = add i32 0, 0  br i1
> %backedge.overflow, label %scalar.ph <http://scalar.ph/>, label
> %overflow.checkedoverflow.checked:                                 ; preds
> = %entry  br i1 false, label %middle.block, label %vector.ph
> <http://vector.ph/>vector.ph
> <http://vector.ph/>:                                        ; preds
> %overflow.checked  br label
> %vector.bodyvector.body:                                      ; preds >
%vector.body, %vector.ph <http://vector.ph/>  %index = phi i32 [ 0,
> %vector.ph <http://vector.ph/> ], [ %index.next, %vector.body ] 
%vec.phi > phi <4 x i32> [ zeroinitializer, %vector.ph
<http://vector.ph/> ], [ %14,
> %vector.body ]  %broadcast.splatinsert = insertelement <4 x i32>
undef, i32
> %index, i32 0  %broadcast.splat = shufflevector <4 x i32>
> %broadcast.splatinsert, <4 x i32> undef, <4 x i32>
zeroinitializer
> %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2,
i32 3>
> %0 = extractelement <4 x i32> %induction, i32 0  %1 = getelementptr
> inbounds i32* %a, i32 %0  %2 = insertelement <4 x i32*> undef, i32*
%1, i32
> 0  %3 = extractelement <4 x i32> %induction, i32 1  %4 =
getelementptr
> inbounds i32* %a, i32 %3  %5 = insertelement <4 x i32*> %2, i32* %4,
i32 1
> %6 = extractelement <4 x i32> %induction, i32 2  %7 = getelementptr
> inbounds i32* %a, i32 %6  %8 = insertelement <4 x i32*> %5, i32* %7,
i32 2
> %9 = extractelement <4 x i32> %induction, i32 3  %10 = getelementptr
> inbounds i32* %a, i32 %9  %11 = insertelement <4 x i32*> %8, i32*
%10, i32
> 3  %12 = getelementptr i32* %1, i32 0  %13 = bitcast i32* %12 to <4 x
> i32>*  %wide.load = load <4 x i32>* %13, align 4, !tbaa !1  %14 =
add nsw
> <4 x i32> %wide.load, %vec.phi  %15 = add nsw <4 x i32>
%induction, <i32 1,
> i32 1, i32 1, i32 1>  %16 = icmp eq <4 x i32> %induction, <i32
15, i32 15,
> i32 15, i32 15>  %index.next = add i32 %index, 4  %17 = icmp eq i32
> %index.next, 16  br i1 %17, label %middle.block, label %vector.body,
> !llvm.loop !5middle.block:                                     ; preds >
%vector.body, %overflow.checked  %resume.val = phi i32 [ 0,
> %overflow.checked ], [ 16, %vector.body ]  %trunc.resume.val = phi i32 [ 0,
> %overflow.checked ], [ 16, %vector.body ]  %rdx.vec.exit.phi = phi <4 x
> i32> [ zeroinitializer, %overflow.checked ], [ %14, %vector.body ]
> %rdx.shuf = shufflevector <4 x i32> %rdx.vec.exit.phi, <4 x
i32> undef, <4
> x i32> <i32 2, i32 3, i32 undef, i32 undef>  %bin.rdx = add <4
x i32>
> %rdx.vec.exit.phi, %rdx.shuf  %rdx.shuf1 = shufflevector <4 x i32>
> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32
undef, i32
> undef>  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1  %18 >
extractelement <4 x i32> %bin.rdx2, i32 0  %cmp.n = icmp eq i32 16,
> %resume.val  br i1 %cmp.n, label %for.end, label %scalar.ph
> <http://scalar.ph/>scalar.ph
> <http://scalar.ph/>:                                        ; preds
> %middle.block, %entry  %bc.resume.val = phi i32 [ %resume.val,
> %middle.block ], [ 0, %entry ]  %bc.trunc.resume.val = phi i32 [
> %trunc.resume.val, %middle.block ], [ 0, %entry ]  %bc.merge.rdx = phi i32
> [ 0, %entry ], [ %18, %middle.block ]  br label
> %for.bodyfor.body:                                         ; preds >
%for.body, %scalar.ph <http://scalar.ph/>  %i.05 = phi i32 [
> %bc.trunc.resume.val, %scalar.ph <http://scalar.ph/> ], [ %inc,
%for.body
> ]  %sum.04 = phi i32 [ %bc.merge.rdx, %scalar.ph <http://scalar.ph/>
], [
> %add, %for.body ]  %arrayidx = getelementptr inbounds i32* %a, i32 %i.05
> %19 = load i32* %arrayidx, align 4, !tbaa !1  %add = add nsw i32 %19,
> %sum.04  %inc = add nsw i32 %i.05, 1  %exitcond = icmp eq i32 %i.05, 15  br
> i1 %exitcond, label %for.end, label %for.body, !llvm.loop
> !8for.end:                                          ; preds >
%middle.block, %for.body  %add.lcssa = phi i32 [ %add, %for.body ], [ %18,
> %middle.block ]  ret i32 %add.lcssa}*
>
>
> Now for test case 2, where we unroll the loop totally :
>
>
>
>
>
> *int foo(int *a) {int sum = 0;sum >
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];return
> sum;}*
>
> This code is not vectorized by SLP vectorization.
>
> *IR before vectorization: *
>
> test.ll
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %0 = load i32* %a,
> align 4, !tbaa !1  %arrayidx1 = getelementptr inbounds i32* %a, i32 1  %1
> load i32* %arrayidx1, align 4, !tbaa !1  %add = add nsw i32 %1, %0
> %arrayidx2 = getelementptr inbounds i32* %a, i32 2  %2 = load i32*
> %arrayidx2, align 4, !tbaa !1  %add3 = add nsw i32 %add, %2  %arrayidx4
> getelementptr inbounds i32* %a, i32 3  %3 = load i32* %arrayidx4, align 4,
> !tbaa !1  %add5 = add nsw i32 %add3, %3  %arrayidx6 = getelementptr
> inbounds i32* %a, i32 4  %4 = load i32* %arrayidx6, align 4, !tbaa !1
> %add7 = add nsw i32 %add5, %4  %arrayidx8 = getelementptr inbounds i32* %a,
> i32 5  %5 = load i32* %arrayidx8, align 4, !tbaa !1  %add9 = add nsw i32
> %add7, %5  %arrayidx10 = getelementptr inbounds i32* %a, i32 6  %6 = load
> i32* %arrayidx10, align 4, !tbaa !1  %add11 = add nsw i32 %add9, %6
> %arrayidx12 = getelementptr inbounds i32* %a, i32 7  %7 = load i32*
> %arrayidx12, align 4, !tbaa !1  %add13 = add nsw i32 %add11, %7
> %arrayidx14 = getelementptr inbounds i32* %a, i32 8  %8 = load i32*
> %arrayidx14, align 4, !tbaa !1  %add15 = add nsw i32 %add13, %8
> %arrayidx16 = getelementptr inbounds i32* %a, i32 9  %9 = load i32*
> %arrayidx16, align 4, !tbaa !1  %add17 = add nsw i32 %add15, %9
> %arrayidx18 = getelementptr inbounds i32* %a, i32 10  %10 = load i32*
> %arrayidx18, align 4, !tbaa !1  %add19 = add nsw i32 %add17, %10
> %arrayidx20 = getelementptr inbounds i32* %a, i32 11  %11 = load i32*
> %arrayidx20, align 4, !tbaa !1  %add21 = add nsw i32 %add19, %11
> %arrayidx22 = getelementptr inbounds i32* %a, i32 12  %12 = load i32*
> %arrayidx22, align 4, !tbaa !1  %add23 = add nsw i32 %add21, %12
> %arrayidx24 = getelementptr inbounds i32* %a, i32 13  %13 = load i32*
> %arrayidx24, align 4, !tbaa !1  %add25 = add nsw i32 %add23, %13
> %arrayidx26 = getelementptr inbounds i32* %a, i32 14  %14 = load i32*
> %arrayidx26, align 4, !tbaa !1  %add27 = add nsw i32 %add25, %14
> %arrayidx28 = getelementptr inbounds i32* %a, i32 15  %15 = load i32*
> %arrayidx28, align 4, !tbaa !1  %add29 = add nsw i32 %add27, %15  ret i32
> %add29}*
> $ opt -S -slp-vectorizer -slp-vectorize-hor test.ll -debug -o test2.ll
>
> Features:+64bit,+sse2
> CPU:generic
>
> Subtarget features: SSELevel 3, 3DNowLevel 0, 64bit 1
> SLP: Analyzing blocks in foo.
>
> *test2.ll (IR after SLP vectorization) :*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %0 = load i32* %a,
> align 4, !tbaa !1  %arrayidx1 = getelementptr inbounds i32* %a, i32 1  %1
> load i32* %arrayidx1, align 4, !tbaa !1  %add = add nsw i32 %1, %0
> %arrayidx2 = getelementptr inbounds i32* %a, i32 2  %2 = load i32*
> %arrayidx2, align 4, !tbaa !1  %add3 = add nsw i32 %add, %2  %arrayidx4
> getelementptr inbounds i32* %a, i32 3  %3 = load i32* %arrayidx4, align 4,
> !tbaa !1  %add5 = add nsw i32 %add3, %3  %arrayidx6 = getelementptr
> inbounds i32* %a, i32 4  %4 = load i32* %arrayidx6, align 4, !tbaa !1
> %add7 = add nsw i32 %add5, %4  %arrayidx8 = getelementptr inbounds i32* %a,
> i32 5  %5 = load i32* %arrayidx8, align 4, !tbaa !1  %add9 = add nsw i32
> %add7, %5  %arrayidx10 = getelementptr inbounds i32* %a, i32 6  %6 = load
> i32* %arrayidx10, align 4, !tbaa !1  %add11 = add nsw i32 %add9, %6
> %arrayidx12 = getelementptr inbounds i32* %a, i32 7  %7 = load i32*
> %arrayidx12, align 4, !tbaa !1  %add13 = add nsw i32 %add11, %7
> %arrayidx14 = getelementptr inbounds i32* %a, i32 8  %8 = load i32*
> %arrayidx14, align 4, !tbaa !1  %add15 = add nsw i32 %add13, %8
> %arrayidx16 = getelementptr inbounds i32* %a, i32 9  %9 = load i32*
> %arrayidx16, align 4, !tbaa !1  %add17 = add nsw i32 %add15, %9
> %arrayidx18 = getelementptr inbounds i32* %a, i32 10  %10 = load i32*
> %arrayidx18, align 4, !tbaa !1  %add19 = add nsw i32 %add17, %10
> %arrayidx20 = getelementptr inbounds i32* %a, i32 11  %11 = load i32*
> %arrayidx20, align 4, !tbaa !1  %add21 = add nsw i32 %add19, %11
> %arrayidx22 = getelementptr inbounds i32* %a, i32 12  %12 = load i32*
> %arrayidx22, align 4, !tbaa !1  %add23 = add nsw i32 %add21, %12
> %arrayidx24 = getelementptr inbounds i32* %a, i32 13  %13 = load i32*
> %arrayidx24, align 4, !tbaa !1  %add25 = add nsw i32 %add23, %13
> %arrayidx26 = getelementptr inbounds i32* %a, i32 14  %14 = load i32*
> %arrayidx26, align 4, !tbaa !1  %add27 = add nsw i32 %add25, %14
> %arrayidx28 = getelementptr inbounds i32* %a, i32 15  %15 = load i32*
> %arrayidx28, align 4, !tbaa !1  %add29 = add nsw i32 %add27, %15  ret i32
> %add29}*
> Shouldn't the SLP vectorizer vectorize above code? Am I missing out
> something?
> How does loop unrolling affect the final vectorization of code?
>
>
> Hi Suyog,
>
> Thank you for taking the time to investigate the performance of the
> vectorizers. You are correct that the SLP-vectorizer should be able to do
> something about the second case (the unrolled one).  Please look into line
> 2861 in the SLP-vectorizer. This is where we try to analyze reductions and
> vectorize them. At the moment we don't have support for a flat yet
large
> reduction that is the result of unrolled loops because the current
> algorithm tries to construct trees that start at the chains, and it does
> not support joining multiple trees.
>
> http://llvm.org/docs/doxygen/html/SLPVectorizer_8cpp_source.html#l02861
>
> Thanks,
> Nadav
>
>
> I tried to debug above IR. Since there is no store in above IR, SLP
> vectorization doesn't call
> vectorizeStoreChains(). In vectorizeChainsInBlock(), as there is no phi
> node in above IR, it
> also fails to vectorize the IR.
>
> If it is perfectly legal to vectorize above IR, are we lacking code for it
> right now? Or its not
> possible to vectorize above code (as far as i know, it is perfectly
> possible to vectorize above code)?
> If its possible, can someone help me/point out specifics how to start for
> implementing vectorization for such code?
> (We do have function calls isConsecutiveAccess() to identify it. Is it a
> case for horizontal reduction, though the term
> 'reduction' is associated with loops/phi nodes.)
>
> This brings to me another question - if we perform aggressive loop
> unrolling, we might loose opportunity for
> vectorization as visible above !!
>
> (I was looking into bug 20035, when i encountered above issue. Though both
> are not related.)
>
> Your comments/suggestions are most awaited.
>
>
> --
> With regards,
> Suyog Sarda
>
>
>

-- 
With regards,
Suyog Sarda
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140919/f78831b0/attachment.html>

Arnold

2014-Sep-18 20:19 UTC

head link

[LLVMdev] [Vectorization] Mis match in code generated

There is experimental code to handle horizontal reductions hidden behind a flag
"slp-vectorize-hor" you could try to enable that. It won't
currently work for your example though because we only start to look for
reductions at phis and stores "slp-vectorize-hor-stores".

You could start off by rewriting your example with a store and see whether the
tree match works after using both flags.

F(int *sum ...){
*sum =
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];
}

If it works we would only need to look for horizontal reductions at returns.

Sent from my iPhone
> On Sep 18, 2014, at 12:24 PM, suyog sarda <sardask01 at gmail.com>
wrote:
> 
> Hi Nadav, 
> 
> Thanks for the quick reply !!
> 
> Ok, so as of now we are lacking capability to handle flat large reductions.
> 
> I did go through function vectorizeChainsInBlock() (line number 2862). In
this function,
> we try to vectorize if we have phi nodes in the IR (several if's check
for phi nodes) i.e we try to
> construct tree that starts at chains. 
> 
> Any pointers on how to join multiple trees? I will also try to dig more
into it.
> 
> Comments/Suggestions are most welcomed !! 
> 
>> On Fri, Sep 19, 2014 at 12:23 AM, Nadav Rotem <nrotem at
apple.com> wrote:
>> 
>>> On Sep 18, 2014, at 11:48 AM, suyog sarda <sardask01 at
gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I am trying to understand LLVM vectorization implementation and was
looking into both loop and SLP vectorization.
>>> 
>>> test case 1:
>>> 
>>> int foo(int *a) {
>>> int sum = 0,i;
>>> for(i=0; i<16; i++)
>>>   sum += a[i];
>>> return sum;
>>> }
>>> 
>>> This code is vectorized by loop vectorizer where we calculate
scalar loop cost as 4 and vector loop cost as 2.
>>> Since vector loop cost is less and above reduction is legal to
vectorize, loop vectorizer produces vectorized code :
>>> 
>>> IR without vectorization:
>>> 
>>> target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
>>> target triple = "x86_64-pc-linux-gnu"
>>> 
>>> ; Function Attrs: nounwind readonly
>>> define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
>>> entry:
>>>   br label %for.body
>>> 
>>> for.body:                                         ; preds =
%for.body, %entry
>>>   %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
>>>   %sum.04 = phi i32 [ 0, %entry ], [ %add, %for.body ]
>>>   %arrayidx = getelementptr inbounds i32* %a, i32 %i.05
>>>   %0 = load i32* %arrayidx, align 4, !tbaa !1
>>>   %add = add nsw i32 %0, %sum.04
>>>   %inc = add nsw i32 %i.05, 1
>>>   %exitcond = icmp eq i32 %i.05, 15
>>>   br i1 %exitcond, label %for.end, label %for.body
>>> 
>>> for.end:                                          ; preds =
%for.body
>>>   ret i32 %add
>>> }
>>> 
>>> IR after vectorization:
>>> 
>>> target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
>>> target triple = "x86_64-pc-linux-gnu"
>>> 
>>> ; Function Attrs: nounwind readonly
>>> define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
>>> entry:
>>>   %backedge.overflow = icmp eq i32 15, -1
>>>   %overflow.check.anchor = add i32 0, 0
>>>   br i1 %backedge.overflow, label %scalar.ph, label
%overflow.checked
>>> 
>>> overflow.checked:                                 ; preds = %entry
>>>   br i1 false, label %middle.block, label %vector.ph
>>> 
>>> vector.ph:                                        ; preds =
%overflow.checked
>>>   br label %vector.body
>>> 
>>> vector.body:                                      ; preds =
%vector.body, %vector.ph
>>>   %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
>>>   %vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [
%14, %vector.body ]
>>>   %broadcast.splatinsert = insertelement <4 x i32> undef, i32
%index, i32 0
>>>   %broadcast.splat = shufflevector <4 x i32>
%broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
>>>   %induction = add <4 x i32> %broadcast.splat, <i32 0, i32
1, i32 2, i32 3>
>>>   %0 = extractelement <4 x i32> %induction, i32 0
>>>   %1 = getelementptr inbounds i32* %a, i32 %0
>>>   %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
>>>   %3 = extractelement <4 x i32> %induction, i32 1
>>>   %4 = getelementptr inbounds i32* %a, i32 %3
>>>   %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
>>>   %6 = extractelement <4 x i32> %induction, i32 2
>>>   %7 = getelementptr inbounds i32* %a, i32 %6
>>>   %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
>>>   %9 = extractelement <4 x i32> %induction, i32 3
>>>   %10 = getelementptr inbounds i32* %a, i32 %9
>>>   %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
>>>   %12 = getelementptr i32* %1, i32 0
>>>   %13 = bitcast i32* %12 to <4 x i32>*
>>>   %wide.load = load <4 x i32>* %13, align 4, !tbaa !1
>>>   %14 = add nsw <4 x i32> %wide.load, %vec.phi
>>>   %15 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32
1, i32 1>
>>>   %16 = icmp eq <4 x i32> %induction, <i32 15, i32 15, i32
15, i32 15>
>>>   %index.next = add i32 %index, 4
>>>   %17 = icmp eq i32 %index.next, 16
>>>   br i1 %17, label %middle.block, label %vector.body, !llvm.loop !5
>>> 
>>> middle.block:                                     ; preds =
%vector.body, %overflow.checked
>>>   %resume.val = phi i32 [ 0, %overflow.checked ], [ 16,
%vector.body ]
>>>   %trunc.resume.val = phi i32 [ 0, %overflow.checked ], [ 16,
%vector.body ]
>>>   %rdx.vec.exit.phi = phi <4 x i32> [ zeroinitializer,
%overflow.checked ], [ %14, %vector.body ]
>>>   %rdx.shuf = shufflevector <4 x i32> %rdx.vec.exit.phi,
<4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32
undef>
>>>   %bin.rdx = add <4 x i32> %rdx.vec.exit.phi, %rdx.shuf
>>>   %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x
i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>>>   %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
>>>   %18 = extractelement <4 x i32> %bin.rdx2, i32 0
>>>   %cmp.n = icmp eq i32 16, %resume.val
>>>   br i1 %cmp.n, label %for.end, label %scalar.ph
>>> 
>>> scalar.ph:                                        ; preds =
%middle.block, %entry
>>>   %bc.resume.val = phi i32 [ %resume.val, %middle.block ], [ 0,
%entry ]
>>>   %bc.trunc.resume.val = phi i32 [ %trunc.resume.val, %middle.block
], [ 0, %entry ]
>>>   %bc.merge.rdx = phi i32 [ 0, %entry ], [ %18, %middle.block ]
>>>   br label %for.body
>>> 
>>> for.body:                                         ; preds =
%for.body, %scalar.ph
>>>   %i.05 = phi i32 [ %bc.trunc.resume.val, %scalar.ph ], [ %inc,
%for.body ]
>>>   %sum.04 = phi i32 [ %bc.merge.rdx, %scalar.ph ], [ %add,
%for.body ]
>>>   %arrayidx = getelementptr inbounds i32* %a, i32 %i.05
>>>   %19 = load i32* %arrayidx, align 4, !tbaa !1
>>>   %add = add nsw i32 %19, %sum.04
>>>   %inc = add nsw i32 %i.05, 1
>>>   %exitcond = icmp eq i32 %i.05, 15
>>>   br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !8
>>> 
>>> for.end:                                          ; preds =
%middle.block, %for.body
>>>   %add.lcssa = phi i32 [ %add, %for.body ], [ %18, %middle.block ]
>>>   ret i32 %add.lcssa
>>> }  
>>> 
>>> 
>>> Now for test case 2, where we unroll the loop totally : 
>>> 
>>> int foo(int *a) {
>>> int sum = 0;
>>> sum =
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];
>>> return sum;
>>> }
>>> 
>>> This code is not vectorized by SLP vectorization.
>>> 
>>> IR before vectorization: 
>>> 
>>> test.ll
>>> 
>>> target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
>>> target triple = "x86_64-pc-linux-gnu"
>>> 
>>> ; Function Attrs: nounwind readonly
>>> define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
>>> entry:
>>>   %0 = load i32* %a, align 4, !tbaa !1
>>>   %arrayidx1 = getelementptr inbounds i32* %a, i32 1
>>>   %1 = load i32* %arrayidx1, align 4, !tbaa !1
>>>   %add = add nsw i32 %1, %0
>>>   %arrayidx2 = getelementptr inbounds i32* %a, i32 2
>>>   %2 = load i32* %arrayidx2, align 4, !tbaa !1
>>>   %add3 = add nsw i32 %add, %2
>>>   %arrayidx4 = getelementptr inbounds i32* %a, i32 3
>>>   %3 = load i32* %arrayidx4, align 4, !tbaa !1
>>>   %add5 = add nsw i32 %add3, %3
>>>   %arrayidx6 = getelementptr inbounds i32* %a, i32 4
>>>   %4 = load i32* %arrayidx6, align 4, !tbaa !1
>>>   %add7 = add nsw i32 %add5, %4
>>>   %arrayidx8 = getelementptr inbounds i32* %a, i32 5
>>>   %5 = load i32* %arrayidx8, align 4, !tbaa !1
>>>   %add9 = add nsw i32 %add7, %5
>>>   %arrayidx10 = getelementptr inbounds i32* %a, i32 6
>>>   %6 = load i32* %arrayidx10, align 4, !tbaa !1
>>>   %add11 = add nsw i32 %add9, %6
>>>   %arrayidx12 = getelementptr inbounds i32* %a, i32 7
>>>   %7 = load i32* %arrayidx12, align 4, !tbaa !1
>>>   %add13 = add nsw i32 %add11, %7
>>>   %arrayidx14 = getelementptr inbounds i32* %a, i32 8
>>>   %8 = load i32* %arrayidx14, align 4, !tbaa !1
>>>   %add15 = add nsw i32 %add13, %8
>>>   %arrayidx16 = getelementptr inbounds i32* %a, i32 9
>>>   %9 = load i32* %arrayidx16, align 4, !tbaa !1
>>>   %add17 = add nsw i32 %add15, %9
>>>   %arrayidx18 = getelementptr inbounds i32* %a, i32 10
>>>   %10 = load i32* %arrayidx18, align 4, !tbaa !1
>>>   %add19 = add nsw i32 %add17, %10
>>>   %arrayidx20 = getelementptr inbounds i32* %a, i32 11
>>>   %11 = load i32* %arrayidx20, align 4, !tbaa !1
>>>   %add21 = add nsw i32 %add19, %11
>>>   %arrayidx22 = getelementptr inbounds i32* %a, i32 12
>>>   %12 = load i32* %arrayidx22, align 4, !tbaa !1
>>>   %add23 = add nsw i32 %add21, %12
>>>   %arrayidx24 = getelementptr inbounds i32* %a, i32 13
>>>   %13 = load i32* %arrayidx24, align 4, !tbaa !1
>>>   %add25 = add nsw i32 %add23, %13
>>>   %arrayidx26 = getelementptr inbounds i32* %a, i32 14
>>>   %14 = load i32* %arrayidx26, align 4, !tbaa !1
>>>   %add27 = add nsw i32 %add25, %14
>>>   %arrayidx28 = getelementptr inbounds i32* %a, i32 15
>>>   %15 = load i32* %arrayidx28, align 4, !tbaa !1
>>>   %add29 = add nsw i32 %add27, %15
>>>   ret i32 %add29
>>> }
>>> 
>>> $ opt -S -slp-vectorizer -slp-vectorize-hor test.ll -debug -o
test2.ll
>>> 
>>> Features:+64bit,+sse2
>>> CPU:generic
>>> 
>>> Subtarget features: SSELevel 3, 3DNowLevel 0, 64bit 1
>>> SLP: Analyzing blocks in foo.
>>> 
>>> test2.ll (IR after SLP vectorization) :
>>> 
>>> target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
>>> target triple = "x86_64-pc-linux-gnu"
>>> 
>>> ; Function Attrs: nounwind readonly
>>> define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
>>> entry:
>>>   %0 = load i32* %a, align 4, !tbaa !1
>>>   %arrayidx1 = getelementptr inbounds i32* %a, i32 1
>>>   %1 = load i32* %arrayidx1, align 4, !tbaa !1
>>>   %add = add nsw i32 %1, %0
>>>   %arrayidx2 = getelementptr inbounds i32* %a, i32 2
>>>   %2 = load i32* %arrayidx2, align 4, !tbaa !1
>>>   %add3 = add nsw i32 %add, %2
>>>   %arrayidx4 = getelementptr inbounds i32* %a, i32 3
>>>   %3 = load i32* %arrayidx4, align 4, !tbaa !1
>>>   %add5 = add nsw i32 %add3, %3
>>>   %arrayidx6 = getelementptr inbounds i32* %a, i32 4
>>>   %4 = load i32* %arrayidx6, align 4, !tbaa !1
>>>   %add7 = add nsw i32 %add5, %4
>>>   %arrayidx8 = getelementptr inbounds i32* %a, i32 5
>>>   %5 = load i32* %arrayidx8, align 4, !tbaa !1
>>>   %add9 = add nsw i32 %add7, %5
>>>   %arrayidx10 = getelementptr inbounds i32* %a, i32 6
>>>   %6 = load i32* %arrayidx10, align 4, !tbaa !1
>>>   %add11 = add nsw i32 %add9, %6
>>>   %arrayidx12 = getelementptr inbounds i32* %a, i32 7
>>>   %7 = load i32* %arrayidx12, align 4, !tbaa !1
>>>   %add13 = add nsw i32 %add11, %7
>>>   %arrayidx14 = getelementptr inbounds i32* %a, i32 8
>>>   %8 = load i32* %arrayidx14, align 4, !tbaa !1
>>>   %add15 = add nsw i32 %add13, %8
>>>   %arrayidx16 = getelementptr inbounds i32* %a, i32 9
>>>   %9 = load i32* %arrayidx16, align 4, !tbaa !1
>>>   %add17 = add nsw i32 %add15, %9
>>>   %arrayidx18 = getelementptr inbounds i32* %a, i32 10
>>>   %10 = load i32* %arrayidx18, align 4, !tbaa !1
>>>   %add19 = add nsw i32 %add17, %10
>>>   %arrayidx20 = getelementptr inbounds i32* %a, i32 11
>>>   %11 = load i32* %arrayidx20, align 4, !tbaa !1
>>>   %add21 = add nsw i32 %add19, %11
>>>   %arrayidx22 = getelementptr inbounds i32* %a, i32 12
>>>   %12 = load i32* %arrayidx22, align 4, !tbaa !1
>>>   %add23 = add nsw i32 %add21, %12
>>>   %arrayidx24 = getelementptr inbounds i32* %a, i32 13
>>>   %13 = load i32* %arrayidx24, align 4, !tbaa !1
>>>   %add25 = add nsw i32 %add23, %13
>>>   %arrayidx26 = getelementptr inbounds i32* %a, i32 14
>>>   %14 = load i32* %arrayidx26, align 4, !tbaa !1
>>>   %add27 = add nsw i32 %add25, %14
>>>   %arrayidx28 = getelementptr inbounds i32* %a, i32 15
>>>   %15 = load i32* %arrayidx28, align 4, !tbaa !1
>>>   %add29 = add nsw i32 %add27, %15
>>>   ret i32 %add29
>>> }
>>> 
>>> Shouldn't the SLP vectorizer vectorize above code? Am I missing
out something?
>>> How does loop unrolling affect the final vectorization of code?
>> 
>> Hi Suyog, 
>> 
>> Thank you for taking the time to investigate the performance of the
vectorizers. You are correct that the SLP-vectorizer should be able to do
something about the second case (the unrolled one).  Please look into line 2861
in the SLP-vectorizer. This is where we try to analyze reductions and vectorize
them. At the moment we don’t have support for a flat yet large reduction that is
the result of unrolled loops because the current algorithm tries to construct
trees that start at the chains, and it does not support joining multiple trees.
>> 
>> http://llvm.org/docs/doxygen/html/SLPVectorizer_8cpp_source.html#l02861
>> 
>> Thanks,
>> Nadav
>> 
>> 
>>> I tried to debug above IR. Since there is no store in above IR, SLP
vectorization doesn't call
>>> vectorizeStoreChains(). In vectorizeChainsInBlock(), as there is no
phi node in above IR, it
>>> also fails to vectorize the IR.
>>> 
>>> If it is perfectly legal to vectorize above IR, are we lacking code
for it right now? Or its not
>>> possible to vectorize above code (as far as i know, it is perfectly
possible to vectorize above code)?
>>> If its possible, can someone help me/point out specifics how to
start for implementing vectorization for such code?
>>> (We do have function calls isConsecutiveAccess() to identify it. Is
it a case for horizontal reduction, though the term
>>> 'reduction' is associated with loops/phi nodes.)
>>> 
>>> This brings to me another question - if we perform aggressive loop
unrolling, we might loose opportunity for
>>> vectorization as visible above !!
>>> 
>>> (I was looking into bug 20035, when i encountered above issue.
Though both are not related.)
>>> 
>>> Your comments/suggestions are most awaited.
>>> 
>>>  
>>> -- 
>>> With regards,
>>> Suyog Sarda
> 
> 
> 
> -- 
> With regards,
> Suyog Sarda-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140918/34e45590/attachment.html>

suyog sarda

2014-Sep-19 18:35 UTC

head link

[LLVMdev] [Vectorization] Mis match in code generated

Hi Arnold,

Thanks for your reply.

I tried test case as suggested by you.





*void foo(int *a, int *sum) {*sum
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];}*
so that it has a 'store' in its IR.



*IR before vectorization :*target datalayout
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
target triple = "x86_64-pc-linux-gnu"

; Function Attrs: nounwind
define void @foo(i32* nocapture readonly %a, i32* nocapture %sum) #0 {
entry:
  %0 = load i32* %a, align 4, !tbaa !1
  %arrayidx1 = getelementptr inbounds i32* %a, i32 1
  %1 = load i32* %arrayidx1, align 4, !tbaa !1
  %add = add nsw i32 %1, %0
  %arrayidx2 = getelementptr inbounds i32* %a, i32 2
  %2 = load i32* %arrayidx2, align 4, !tbaa !1
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i32 3
  %3 = load i32* %arrayidx4, align 4, !tbaa !1
  %add5 = add nsw i32 %add3, %3
  %arrayidx6 = getelementptr inbounds i32* %a, i32 4
  %4 = load i32* %arrayidx6, align 4, !tbaa !1
  %add7 = add nsw i32 %add5, %4
  %arrayidx8 = getelementptr inbounds i32* %a, i32 5
  %5 = load i32* %arrayidx8, align 4, !tbaa !1
  %add9 = add nsw i32 %add7, %5
  %arrayidx10 = getelementptr inbounds i32* %a, i32 6
  %6 = load i32* %arrayidx10, align 4, !tbaa !1
  %add11 = add nsw i32 %add9, %6
  %arrayidx12 = getelementptr inbounds i32* %a, i32 7
  %7 = load i32* %arrayidx12, align 4, !tbaa !1
  %add13 = add nsw i32 %add11, %7
  %arrayidx14 = getelementptr inbounds i32* %a, i32 8
  %8 = load i32* %arrayidx14, align 4, !tbaa !1
  %add15 = add nsw i32 %add13, %8
  %arrayidx16 = getelementptr inbounds i32* %a, i32 9
  %9 = load i32* %arrayidx16, align 4, !tbaa !1
  %add17 = add nsw i32 %add15, %9
  %arrayidx18 = getelementptr inbounds i32* %a, i32 10
  %10 = load i32* %arrayidx18, align 4, !tbaa !1
  %add19 = add nsw i32 %add17, %10
  %arrayidx20 = getelementptr inbounds i32* %a, i32 11
  %11 = load i32* %arrayidx20, align 4, !tbaa !1
  %add21 = add nsw i32 %add19, %11
  %arrayidx22 = getelementptr inbounds i32* %a, i32 12
  %12 = load i32* %arrayidx22, align 4, !tbaa !1
  %add23 = add nsw i32 %add21, %12
  %arrayidx24 = getelementptr inbounds i32* %a, i32 13
  %13 = load i32* %arrayidx24, align 4, !tbaa !1
  %add25 = add nsw i32 %add23, %13
  %arrayidx26 = getelementptr inbounds i32* %a, i32 14
  %14 = load i32* %arrayidx26, align 4, !tbaa !1
  %add27 = add nsw i32 %add25, %14
  %arrayidx28 = getelementptr inbounds i32* %a, i32 15
  %15 = load i32* %arrayidx28, align 4, !tbaa !1
  %add29 = add nsw i32 %add27, %15
  store i32 %add29, i32* %sum, align 4, !tbaa !1
  ret void
}



*IR after SLP vectorization with appropriate flags :*
$ opt -S -slp-vectorizer -slp-vectorize-hor=1 -slp-vectorize-hor-store=1
test.ll -debug

(I hope i am passing the args correctly to opt)

Subtarget features: SSELevel 3, 3DNowLevel 0, 64bit 1
SLP: Analyzing blocks in foo.
SLP: Found 1 stores to vectorize.
SLP: Vectorizing a list of length = 2.
; ModuleID = 'test.ll'
target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
target triple = "x86_64-pc-linux-gnu"

; Function Attrs: nounwind
define void @foo(i32* nocapture readonly %a, i32* nocapture %sum) #0 {
entry:
  %0 = load i32* %a, align 4, !tbaa !1
  %arrayidx1 = getelementptr inbounds i32* %a, i32 1
  %1 = load i32* %arrayidx1, align 4, !tbaa !1
  %add = add nsw i32 %1, %0
  %arrayidx2 = getelementptr inbounds i32* %a, i32 2
  %2 = load i32* %arrayidx2, align 4, !tbaa !1
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i32 3
  %3 = load i32* %arrayidx4, align 4, !tbaa !1
  %add5 = add nsw i32 %add3, %3
  %arrayidx6 = getelementptr inbounds i32* %a, i32 4
  %4 = load i32* %arrayidx6, align 4, !tbaa !1
  %add7 = add nsw i32 %add5, %4
  %arrayidx8 = getelementptr inbounds i32* %a, i32 5
  %5 = load i32* %arrayidx8, align 4, !tbaa !1
  %add9 = add nsw i32 %add7, %5
  %arrayidx10 = getelementptr inbounds i32* %a, i32 6
  %6 = load i32* %arrayidx10, align 4, !tbaa !1
  %add11 = add nsw i32 %add9, %6
  %arrayidx12 = getelementptr inbounds i32* %a, i32 7
  %7 = load i32* %arrayidx12, align 4, !tbaa !1
  %add13 = add nsw i32 %add11, %7
  %arrayidx14 = getelementptr inbounds i32* %a, i32 8
  %8 = load i32* %arrayidx14, align 4, !tbaa !1
  %add15 = add nsw i32 %add13, %8
  %arrayidx16 = getelementptr inbounds i32* %a, i32 9
  %9 = load i32* %arrayidx16, align 4, !tbaa !1
  %add17 = add nsw i32 %add15, %9
  %arrayidx18 = getelementptr inbounds i32* %a, i32 10
  %10 = load i32* %arrayidx18, align 4, !tbaa !1
  %add19 = add nsw i32 %add17, %10
  %arrayidx20 = getelementptr inbounds i32* %a, i32 11
  %11 = load i32* %arrayidx20, align 4, !tbaa !1
  %add21 = add nsw i32 %add19, %11
  %arrayidx22 = getelementptr inbounds i32* %a, i32 12
  %12 = load i32* %arrayidx22, align 4, !tbaa !1
  %add23 = add nsw i32 %add21, %12
  %arrayidx24 = getelementptr inbounds i32* %a, i32 13
  %13 = load i32* %arrayidx24, align 4, !tbaa !1
  %add25 = add nsw i32 %add23, %13
  %arrayidx26 = getelementptr inbounds i32* %a, i32 14
  %14 = load i32* %arrayidx26, align 4, !tbaa !1
  %add27 = add nsw i32 %add25, %14
  %arrayidx28 = getelementptr inbounds i32* %a, i32 15
  %15 = load i32* %arrayidx28, align 4, !tbaa !1
  %add29 = add nsw i32 %add27, %15
  store i32 %add29, i32* %sum, align 4, !tbaa !1
  ret void
}

As observed above, SLP vectorization doesn't vectorize this code.

I tried debugging, just mentioning what i found -
the code hits vectorizeChainsInBlock() ->
if(ShouldStartVectorizeHorAtStore) ->
tryToVectorize() -> tryToVectorizePair() -> tryToVectorizeList() - inside
this function, it iterates in the
for loop for checking if scalar instructions are of same type. It exits for
2nd iteration in the loop as
OpCode does not match , one opcode is 'add' and second is
'load'. So it
exits from function
'tryToVectorizeList()', and comes back to function
'tryToVectorize()', but
fails to satisfy any of the
further 'if' in that function and returns back without vectorizing any
of
the code.

I will try to dig more into it and try to understand where the code can be
implemented.

Comments/suggestions/explanations are always welcomed !!

On Fri, Sep 19, 2014 at 1:49 AM, Arnold <aschwaighofer at apple.com>
wrote:
> There is experimental code to handle horizontal reductions hidden behind a
> flag "slp-vectorize-hor" you could try to enable that. It
won't currently
> work for your example though because we only start to look for reductions
> at phis and stores "slp-vectorize-hor-stores".
>
> You could start off by rewriting your example with a store and see whether
> the tree match works after using both flags.
>
>
>
> *F(int *sum ...){*sum >
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];}*
>
> If it works we would only need to look for horizontal reductions at
> returns.
>
> Sent from my iPhone
>
> On Sep 18, 2014, at 12:24 PM, suyog sarda <sardask01 at gmail.com>
wrote:
>
> Hi Nadav,
>
> Thanks for the quick reply !!
>
> Ok, so as of now we are lacking capability to handle flat large reductions.
>
> I did go through function vectorizeChainsInBlock() (line number 2862). In
> this function,
> we try to vectorize if we have phi nodes in the IR (several if's check
for
> phi nodes) i.e we try to
> construct tree that starts at chains.
>
> Any pointers on how to join multiple trees? I will also try to dig more
> into it.
>
> Comments/Suggestions are most welcomed !!
>
> On Fri, Sep 19, 2014 at 12:23 AM, Nadav Rotem <nrotem at apple.com>
wrote:
>
>>
>> On Sep 18, 2014, at 11:48 AM, suyog sarda <sardask01 at
gmail.com> wrote:
>>
>> Hi,
>>
>> I am trying to understand LLVM vectorization implementation and was
>> looking into both loop and SLP vectorization.
>>
>> test case 1:
>>
>>
>>
>>
>>
>>
>> *int foo(int *a) {int sum = 0,i;for(i=0; i<16; i++)  sum +=
a[i];return
>> sum;}*
>>
>> This code is vectorized by loop vectorizer where we calculate scalar
loop
>> cost as 4 and vector loop cost as 2.
>> Since vector loop cost is less and above reduction is legal to
vectorize,
>> loop vectorizer produces vectorized code :
>>
>> *IR without vectorization:*
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
>> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
>> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  br label
>> %for.bodyfor.body:                                         ; preds
>> %for.body, %entry  %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
>> %sum.04 = phi i32 [ 0, %entry ], [ %add, %for.body ]  %arrayidx
>> getelementptr inbounds i32* %a, i32 %i.05  %0 = load i32* %arrayidx,
align
>> 4, !tbaa !1  %add = add nsw i32 %0, %sum.04  %inc = add nsw i32 %i.05,
1
>> %exitcond = icmp eq i32 %i.05, 15  br i1 %exitcond, label %for.end,
label
>> %for.bodyfor.end:                                          ; preds
>> %for.body  ret i32 %add}*
>>
>>
>> *IR after vectorization:*
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
>> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
>> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %backedge.overflow
>> icmp eq i32 15, -1  %overflow.check.anchor = add i32 0, 0  br i1
>> %backedge.overflow, label %scalar.ph <http://scalar.ph/>, label
>> %overflow.checkedoverflow.checked:                                 ;
preds
>> = %entry  br i1 false, label %middle.block, label %vector.ph
>> <http://vector.ph/>vector.ph
>> <http://vector.ph/>:                                        ;
preds >> %overflow.checked  br label
>> %vector.bodyvector.body:                                      ; preds
>> %vector.body, %vector.ph <http://vector.ph/>  %index = phi i32 [
0,
>> %vector.ph <http://vector.ph/> ], [ %index.next, %vector.body ] 
%vec.phi >> phi <4 x i32> [ zeroinitializer, %vector.ph
<http://vector.ph/> ], [ %14,
>> %vector.body ]  %broadcast.splatinsert = insertelement <4 x i32>
undef, i32
>> %index, i32 0  %broadcast.splat = shufflevector <4 x i32>
>> %broadcast.splatinsert, <4 x i32> undef, <4 x i32>
zeroinitializer
>> %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1,
i32 2, i32 3>
>> %0 = extractelement <4 x i32> %induction, i32 0  %1 =
getelementptr
>> inbounds i32* %a, i32 %0  %2 = insertelement <4 x i32*> undef,
i32* %1, i32
>> 0  %3 = extractelement <4 x i32> %induction, i32 1  %4 =
getelementptr
>> inbounds i32* %a, i32 %3  %5 = insertelement <4 x i32*> %2, i32*
%4, i32 1
>> %6 = extractelement <4 x i32> %induction, i32 2  %7 =
getelementptr
>> inbounds i32* %a, i32 %6  %8 = insertelement <4 x i32*> %5, i32*
%7, i32 2
>> %9 = extractelement <4 x i32> %induction, i32 3  %10 =
getelementptr
>> inbounds i32* %a, i32 %9  %11 = insertelement <4 x i32*> %8, i32*
%10, i32
>> 3  %12 = getelementptr i32* %1, i32 0  %13 = bitcast i32* %12 to <4
x
>> i32>*  %wide.load = load <4 x i32>* %13, align 4, !tbaa !1 
%14 = add nsw
>> <4 x i32> %wide.load, %vec.phi  %15 = add nsw <4 x i32>
%induction, <i32 1,
>> i32 1, i32 1, i32 1>  %16 = icmp eq <4 x i32> %induction,
<i32 15, i32 15,
>> i32 15, i32 15>  %index.next = add i32 %index, 4  %17 = icmp eq i32
>> %index.next, 16  br i1 %17, label %middle.block, label %vector.body,
>> !llvm.loop !5middle.block:                                     ; preds
>> %vector.body, %overflow.checked  %resume.val = phi i32 [ 0,
>> %overflow.checked ], [ 16, %vector.body ]  %trunc.resume.val = phi i32
[ 0,
>> %overflow.checked ], [ 16, %vector.body ]  %rdx.vec.exit.phi = phi
<4 x
>> i32> [ zeroinitializer, %overflow.checked ], [ %14, %vector.body ]
>> %rdx.shuf = shufflevector <4 x i32> %rdx.vec.exit.phi, <4 x
i32> undef, <4
>> x i32> <i32 2, i32 3, i32 undef, i32 undef>  %bin.rdx = add
<4 x i32>
>> %rdx.vec.exit.phi, %rdx.shuf  %rdx.shuf1 = shufflevector <4 x
i32>
>> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef,
i32 undef, i32
>> undef>  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1  %18
>> extractelement <4 x i32> %bin.rdx2, i32 0  %cmp.n = icmp eq i32
16,
>> %resume.val  br i1 %cmp.n, label %for.end, label %scalar.ph
>> <http://scalar.ph/>scalar.ph
>> <http://scalar.ph/>:                                        ;
preds >> %middle.block, %entry  %bc.resume.val = phi i32 [ %resume.val,
>> %middle.block ], [ 0, %entry ]  %bc.trunc.resume.val = phi i32 [
>> %trunc.resume.val, %middle.block ], [ 0, %entry ]  %bc.merge.rdx = phi
i32
>> [ 0, %entry ], [ %18, %middle.block ]  br label
>> %for.bodyfor.body:                                         ; preds
>> %for.body, %scalar.ph <http://scalar.ph/>  %i.05 = phi i32 [
>> %bc.trunc.resume.val, %scalar.ph <http://scalar.ph/> ], [ %inc,
%for.body
>> ]  %sum.04 = phi i32 [ %bc.merge.rdx, %scalar.ph
<http://scalar.ph/> ], [
>> %add, %for.body ]  %arrayidx = getelementptr inbounds i32* %a, i32
%i.05
>> %19 = load i32* %arrayidx, align 4, !tbaa !1  %add = add nsw i32 %19,
>> %sum.04  %inc = add nsw i32 %i.05, 1  %exitcond = icmp eq i32 %i.05, 15
br
>> i1 %exitcond, label %for.end, label %for.body, !llvm.loop
>> !8for.end:                                          ; preds >>
%middle.block, %for.body  %add.lcssa = phi i32 [ %add, %for.body ], [ %18,
>> %middle.block ]  ret i32 %add.lcssa}*
>>
>>
>> Now for test case 2, where we unroll the loop totally :
>>
>>
>>
>>
>>
>> *int foo(int *a) {int sum = 0;sum >>
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];return
>> sum;}*
>>
>> This code is not vectorized by SLP vectorization.
>>
>> *IR before vectorization: *
>>
>> test.ll
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
>> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
>> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %0 = load i32* %a,
>> align 4, !tbaa !1  %arrayidx1 = getelementptr inbounds i32* %a, i32 1 
%1 >> load i32* %arrayidx1, align 4, !tbaa !1  %add = add nsw i32 %1, %0
>> %arrayidx2 = getelementptr inbounds i32* %a, i32 2  %2 = load i32*
>> %arrayidx2, align 4, !tbaa !1  %add3 = add nsw i32 %add, %2  %arrayidx4
>> getelementptr inbounds i32* %a, i32 3  %3 = load i32* %arrayidx4, align
4,
>> !tbaa !1  %add5 = add nsw i32 %add3, %3  %arrayidx6 = getelementptr
>> inbounds i32* %a, i32 4  %4 = load i32* %arrayidx6, align 4, !tbaa !1
>> %add7 = add nsw i32 %add5, %4  %arrayidx8 = getelementptr inbounds i32*
%a,
>> i32 5  %5 = load i32* %arrayidx8, align 4, !tbaa !1  %add9 = add nsw
i32
>> %add7, %5  %arrayidx10 = getelementptr inbounds i32* %a, i32 6  %6 =
load
>> i32* %arrayidx10, align 4, !tbaa !1  %add11 = add nsw i32 %add9, %6
>> %arrayidx12 = getelementptr inbounds i32* %a, i32 7  %7 = load i32*
>> %arrayidx12, align 4, !tbaa !1  %add13 = add nsw i32 %add11, %7
>> %arrayidx14 = getelementptr inbounds i32* %a, i32 8  %8 = load i32*
>> %arrayidx14, align 4, !tbaa !1  %add15 = add nsw i32 %add13, %8
>> %arrayidx16 = getelementptr inbounds i32* %a, i32 9  %9 = load i32*
>> %arrayidx16, align 4, !tbaa !1  %add17 = add nsw i32 %add15, %9
>> %arrayidx18 = getelementptr inbounds i32* %a, i32 10  %10 = load i32*
>> %arrayidx18, align 4, !tbaa !1  %add19 = add nsw i32 %add17, %10
>> %arrayidx20 = getelementptr inbounds i32* %a, i32 11  %11 = load i32*
>> %arrayidx20, align 4, !tbaa !1  %add21 = add nsw i32 %add19, %11
>> %arrayidx22 = getelementptr inbounds i32* %a, i32 12  %12 = load i32*
>> %arrayidx22, align 4, !tbaa !1  %add23 = add nsw i32 %add21, %12
>> %arrayidx24 = getelementptr inbounds i32* %a, i32 13  %13 = load i32*
>> %arrayidx24, align 4, !tbaa !1  %add25 = add nsw i32 %add23, %13
>> %arrayidx26 = getelementptr inbounds i32* %a, i32 14  %14 = load i32*
>> %arrayidx26, align 4, !tbaa !1  %add27 = add nsw i32 %add25, %14
>> %arrayidx28 = getelementptr inbounds i32* %a, i32 15  %15 = load i32*
>> %arrayidx28, align 4, !tbaa !1  %add29 = add nsw i32 %add27, %15  ret
i32
>> %add29}*
>> $ opt -S -slp-vectorizer -slp-vectorize-hor test.ll -debug -o test2.ll
>>
>> Features:+64bit,+sse2
>> CPU:generic
>>
>> Subtarget features: SSELevel 3, 3DNowLevel 0, 64bit 1
>> SLP: Analyzing blocks in foo.
>>
>> *test2.ll (IR after SLP vectorization) :*
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *target datalayout =
"e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
>> triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind
readonlydefine i32
>> @foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %0 = load i32* %a,
>> align 4, !tbaa !1  %arrayidx1 = getelementptr inbounds i32* %a, i32 1 
%1 >> load i32* %arrayidx1, align 4, !tbaa !1  %add = add nsw i32 %1, %0
>> %arrayidx2 = getelementptr inbounds i32* %a, i32 2  %2 = load i32*
>> %arrayidx2, align 4, !tbaa !1  %add3 = add nsw i32 %add, %2  %arrayidx4
>> getelementptr inbounds i32* %a, i32 3  %3 = load i32* %arrayidx4, align
4,
>> !tbaa !1  %add5 = add nsw i32 %add3, %3  %arrayidx6 = getelementptr
>> inbounds i32* %a, i32 4  %4 = load i32* %arrayidx6, align 4, !tbaa !1
>> %add7 = add nsw i32 %add5, %4  %arrayidx8 = getelementptr inbounds i32*
%a,
>> i32 5  %5 = load i32* %arrayidx8, align 4, !tbaa !1  %add9 = add nsw
i32
>> %add7, %5  %arrayidx10 = getelementptr inbounds i32* %a, i32 6  %6 =
load
>> i32* %arrayidx10, align 4, !tbaa !1  %add11 = add nsw i32 %add9, %6
>> %arrayidx12 = getelementptr inbounds i32* %a, i32 7  %7 = load i32*
>> %arrayidx12, align 4, !tbaa !1  %add13 = add nsw i32 %add11, %7
>> %arrayidx14 = getelementptr inbounds i32* %a, i32 8  %8 = load i32*
>> %arrayidx14, align 4, !tbaa !1  %add15 = add nsw i32 %add13, %8
>> %arrayidx16 = getelementptr inbounds i32* %a, i32 9  %9 = load i32*
>> %arrayidx16, align 4, !tbaa !1  %add17 = add nsw i32 %add15, %9
>> %arrayidx18 = getelementptr inbounds i32* %a, i32 10  %10 = load i32*
>> %arrayidx18, align 4, !tbaa !1  %add19 = add nsw i32 %add17, %10
>> %arrayidx20 = getelementptr inbounds i32* %a, i32 11  %11 = load i32*
>> %arrayidx20, align 4, !tbaa !1  %add21 = add nsw i32 %add19, %11
>> %arrayidx22 = getelementptr inbounds i32* %a, i32 12  %12 = load i32*
>> %arrayidx22, align 4, !tbaa !1  %add23 = add nsw i32 %add21, %12
>> %arrayidx24 = getelementptr inbounds i32* %a, i32 13  %13 = load i32*
>> %arrayidx24, align 4, !tbaa !1  %add25 = add nsw i32 %add23, %13
>> %arrayidx26 = getelementptr inbounds i32* %a, i32 14  %14 = load i32*
>> %arrayidx26, align 4, !tbaa !1  %add27 = add nsw i32 %add25, %14
>> %arrayidx28 = getelementptr inbounds i32* %a, i32 15  %15 = load i32*
>> %arrayidx28, align 4, !tbaa !1  %add29 = add nsw i32 %add27, %15  ret
i32
>> %add29}*
>> Shouldn't the SLP vectorizer vectorize above code? Am I missing out
>> something?
>> How does loop unrolling affect the final vectorization of code?
>>
>>
>> Hi Suyog,
>>
>> Thank you for taking the time to investigate the performance of the
>> vectorizers. You are correct that the SLP-vectorizer should be able to
do
>> something about the second case (the unrolled one).  Please look into
line
>> 2861 in the SLP-vectorizer. This is where we try to analyze reductions
and
>> vectorize them. At the moment we don't have support for a flat yet
large
>> reduction that is the result of unrolled loops because the current
>> algorithm tries to construct trees that start at the chains, and it
does
>> not support joining multiple trees.
>>
>> http://llvm.org/docs/doxygen/html/SLPVectorizer_8cpp_source.html#l02861
>>
>> Thanks,
>> Nadav
>>
>>
>> I tried to debug above IR. Since there is no store in above IR, SLP
>> vectorization doesn't call
>> vectorizeStoreChains(). In vectorizeChainsInBlock(), as there is no phi
>> node in above IR, it
>> also fails to vectorize the IR.
>>
>> If it is perfectly legal to vectorize above IR, are we lacking code for
>> it right now? Or its not
>> possible to vectorize above code (as far as i know, it is perfectly
>> possible to vectorize above code)?
>> If its possible, can someone help me/point out specifics how to start
for
>> implementing vectorization for such code?
>> (We do have function calls isConsecutiveAccess() to identify it. Is it
a
>> case for horizontal reduction, though the term
>> 'reduction' is associated with loops/phi nodes.)
>>
>> This brings to me another question - if we perform aggressive loop
>> unrolling, we might loose opportunity for
>> vectorization as visible above !!
>>
>> (I was looking into bug 20035, when i encountered above issue. Though
>> both are not related.)
>>
>> Your comments/suggestions are most awaited.
>>
>>
>> --
>> With regards,
>> Suyog Sarda
>>
>>
>>
>
>
> --
> With regards,
> Suyog Sarda
>
>

-- 
With regards,
Suyog Sarda
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140920/6394e2d0/attachment.html>

llvm dev - Sep 2014 - [LLVMdev] [Vectorization] Mis match in code generated

[LLVMdev] [Vectorization] Mis match in code generated

[LLVMdev] [Vectorization] Mis match in code generated

[LLVMdev] [Vectorization] Mis match in code generated