thr3ads.net - llvm dev - [LLVMdev] Vectorizing alloca instructions [Oct 2013]

Tom Stellard

2013-Oct-24 21:04 UTC

[LLVMdev] Vectorizing alloca instructions

Hi,

I've been playing around with the SLPVectorizer trying to get it to
vectorize this simple program:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  store i32 0, i32* %x
  store i32 1, i32* %y
  store i32 2, i32* %z
  store i32 3, i32* %w
  %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index
  %2 = load i32* %1
  store i32 %2, i32 addrspace(1)* %out
  ret void
}

My goal is to have this program transformed to the following:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index
  store i32 %0, i32 addrspace(1)* %out
  ret void
}

I've slightly modified the SLPVectorizer (see the attached patch) so
that it will vectorize small trees, and I've also fixed a crash in the
BoUpSLP::Gather() function when it is passed a list of store
instructions.  With this patch, the command:

opt -slp-vectorizer -debug -march=r600 -mcpu=redwood -o - vector-alloca.ll -S -slp-threshold=-20

Produces the following output and the program remains unchanged:

===
SLP: Analyzing blocks in vector.
SLP: Found 5 stores to vectorize.
SLP: Analyzing a store chain of length 4.
SLP: Analyzing a store chain of length 4
SLP: Analyzing 4 stores at offset 0
SLP: Checking users of    store i32 0, i32* %x. 
SLP: Checking users of    store i32 1, i32* %y. 
SLP: Checking users of    store i32 2, i32* %z. 
SLP: Checking users of    store i32 3, i32* %w. 
SLP: We are able to schedule this bundle.
SLP: Can't sink   store i32 0, i32* %x
 down to   store i32 3, i32* %w
 because of   store i32 1, i32* %y.  Gathering.
SLP: Calculating cost for tree of size 1.
SLP: Check whether the tree with height 1 is fully vectorizable .
SLP: Adding cost 4 for bundle that starts with   store i32 0, i32* %x .
SLP: Total Cost 4.
SLP: Found cost=4 for VF=4
SLP: Decided to vectorize cost=4
SLP: Extracting 0 values .
SLP: Optimizing 0 gather sequences instructions.
SLP: vectorized "vector"

===
I'm having a little trouble figuring out why the stores do not end up
being vectorized.  Does anyone have any insight into this?  Should this
pass be able to perform the desired transformation?

Thanks,
Tom

-------------- next part --------------
A non-text attachment was scrubbed...
Name: slp-vectorize-alloc.patch
Type: text/x-diff
Size: 1121 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131024/8accc7cc/attachment.patch>

Nadav Rotem

2013-Oct-24 21:15 UTC

Hi Tom, 

Thanks for working on this.  The SLP-vectorizer thinks that %x, %y, %z and %w
alias, so it tries to perform 4 scalar store operations (which is a bad idea).
We need to figure out why AA thinks that %x and %y may alias.  Maybe there is a
problem with the code that uses AA.
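
For reference, the fact AA is expected to prove here is small: %x, %y, %z and
%w are constant-offset GEPs off the same alloca, so two of them can overlap only
if their byte ranges intersect. The Python sketch below is a toy model of that
check, not LLVM's implementation; the Access type and the may_alias name are
made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    base: str    # underlying object, e.g. the alloca %0
    offset: int  # constant byte offset from the base
    size: int    # number of bytes accessed


def may_alias(a: Access, b: Access) -> bool:
    """Return False only when the two accesses provably do not overlap."""
    if a.base != b.base:
        # This toy model cannot reason about different objects,
        # so it answers conservatively.
        return True
    # Same base: half-open byte ranges [offset, offset + size) must intersect.
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


# The four i32 slots of "alloca [4 x i32]", 4 bytes each.
x, y, z, w = (Access("%0", 4 * i, 4) for i in range(4))

print(may_alias(x, y))  # False: adjacent slots do not overlap
print(may_alias(x, x))  # True: same slot
```

An analysis that answers "may alias" for every pair would block any store from
moving past another, which is consistent with the "Can't sink" line in the log.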

Thanks,
Nadav



Chandler Carruth

2013-Oct-24 22:00 UTC

Just a note: I don't think you should, or need to, vectorize the actual
alloca stuff. If you can simply transform the dynamically indexed load into a
vector load plus an extractelement:

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  store i32 0, i32* %x
  store i32 1, i32* %y
  store i32 2, i32* %z
  store i32 3, i32* %w
  %1 = bitcast [4 x i32]* %0 to <4 x i32>*
  %2 = load <4 x i32>* %1
  %3 = extractelement <4 x i32> %2, i32 %index
  store i32 %3, i32 addrspace(1)* %out
  ret void
}

Then running SROA and InstCombine will mop up the rest. So it's mostly about
getting the SLPVectorizer to handle the dynamic GEP. As soon as it does
that, everything else will fall away.
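
One way to sanity-check that the rewritten load is equivalent is to model the
alloca as four slots and compare the forms for every in-range index. This is a
minimal Python sketch, not LLVM code; the function names are made up for
illustration, out-of-range indices are not modeled, and folded_form assumes
that SROA and InstCombine forward the stores as described above.

```python
def scalar_form(index: int) -> int:
    """Original IR: four scalar stores, then a load through a dynamic GEP."""
    slots = [None] * 4            # %0 = alloca [4 x i32]
    for i, value in enumerate((0, 1, 2, 3)):
        slots[i] = value          # store i32 <value>, i32* %x / %y / %z / %w
    return slots[index]           # load through GEP %0, 0, %index


def vector_form(index: int) -> int:
    """Rewritten IR: same stores, then a <4 x i32> load and extractelement."""
    slots = [None] * 4
    for i, value in enumerate((0, 1, 2, 3)):
        slots[i] = value
    vec = tuple(slots)            # load <4 x i32> through the bitcast pointer
    return vec[index]             # extractelement <4 x i32> %2, i32 %index


def folded_form(index: int) -> int:
    """What is left once the stores are forwarded into the vector load."""
    return (0, 1, 2, 3)[index]    # extractelement on a constant vector


for i in range(4):
    print(i, scalar_form(i), vector_form(i), folded_form(i))
```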

Not sure how much this helps, just wanted to point it out.


Nadav Rotem

2013-Oct-24 22:16 UTC

On Oct 24, 2013, at 3:00 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Just a note, I don't think you should or need to vectorize the actual
> alloca stuff. If you can simply transform the dynamically indexed load:
>
> Then running SROA and InstCombine will mop up the rest. So it's mostly about
> getting the SLPVectorizer to handle the dynamic GEP. As soon as it does that,
> everything else will fall away.
>
I don't think that Tom wants the SLP-vectorizer to handle dynamic GEPs.  What he
wants is for the SLP-vectorizer to convert the first part of the code:

>> define void @vector(i32 addrspace(1)* %out, i32 %index) {
>> entry:
>>   %0 = alloca [4 x i32]
>>   %x = getelementptr [4 x i32]* %0, i32 0, i32 0
>>   %y = getelementptr [4 x i32]* %0, i32 0, i32 1
>>   %z = getelementptr [4 x i32]* %0, i32 0, i32 2
>>   %w = getelementptr [4 x i32]* %0, i32 0, i32 3
>>   store i32 0, i32* %x
>>   store i32 1, i32* %y
>>   store i32 2, i32* %z
>>   store i32 3, i32* %w

Into this:   store <i32 0, i32 1, i32 2, i32 3> ...

>>   %1 = bitcast [4 x i32]* %0 to <4 x i32>*
>>   %2 = load <4 x i32>* %1
>>   %3 = extractelement <4 x i32> %2, i32 %index
>>   store i32 %3, i32 addrspace(1)* %out
>>   ret void
>> }
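
The store merge described above is only possible when the scalar stores write
adjacent slots. A minimal Python sketch of that check follows, with made-up
names (Store, merge_stores) and byte offsets assumed to be known constants. It
illustrates the idea only and is not how the SLP vectorizer is implemented.

```python
from typing import List, NamedTuple, Optional, Tuple


class Store(NamedTuple):
    value: int   # constant being stored
    offset: int  # constant byte offset from the alloca
    size: int    # store size in bytes


def merge_stores(stores: List[Store]) -> Optional[Tuple[int, ...]]:
    """Return the lane values of one vector store, or None if not mergeable."""
    if not stores:
        return None
    ordered = sorted(stores, key=lambda s: s.offset)
    for prev, cur in zip(ordered, ordered[1:]):
        # Each store must begin exactly where the previous one ends.
        if cur.size != prev.size or cur.offset != prev.offset + prev.size:
            return None
    return tuple(s.value for s in ordered)


chain = [Store(0, 0, 4), Store(1, 4, 4), Store(2, 8, 4), Store(3, 12, 4)]
print(merge_stores(chain))  # (0, 1, 2, 3)
```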

Tom Stellard

2013-Oct-24 23:08 UTC

On Thu, Oct 24, 2013 at 02:15:00PM -0700, Nadav Rotem wrote:
> Hi Tom,
>
> Thanks for working on this.  The SLP-vectorizer thinks that %X %Y %Z and %W
> alias, so it tries to perform 4 scalar store operations (which is a bad idea).
> We need to figure out why AA thinks that X and Y may alias.  Maybe there is a
> problem with the code that uses AA.
>
Thanks for the tip.  opt was using NoAliasAnalysis by default, so passing
-basicaa to opt now gives me this:

SLP: Analyzing blocks in vector.
SLP: Found 5 stores to vectorize.
SLP: Analyzing a store chain of length 4.
SLP: Analyzing a store chain of length 4
SLP: Analyzing 4 stores at offset 0
SLP: Checking users of    store i32 0, i32* %x. 
SLP: Checking users of    store i32 1, i32* %y. 
SLP: Checking users of    store i32 2, i32* %z. 
SLP: Checking users of    store i32 3, i32* %w. 
SLP: We are able to schedule this bundle.
SLP: added a vector of stores.
SLP: Gathering due to C,S,B,O. 
SLP: Calculating cost for tree of size 2.
SLP: Check whether the tree with height 2 is fully vectorizable .
SLP: Adding cost -3 for bundle that starts with   store i32 0, i32* %x .
SLP: Adding cost 0 for bundle that starts with i32 0 .
SLP: Total Cost -3.
SLP: Found cost=-3 for VF=4
SLP: Decided to vectorize cost=-3
SLP: Extracting 0 values .
SLP:    Erasing scalar:  store i32 0, i32* %x.
SLP:    Erasing scalar:  store i32 1, i32* %y.
SLP:    Erasing scalar:  store i32 2, i32* %z.
SLP:    Erasing scalar:  store i32 3, i32* %w.
SLP: Optimizing 0 gather sequences instructions.
SLP: vectorized "vector"
; ModuleID = 'vector-alloca.ll'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-v2048:2048:2048-n32:64"
target triple = "r600--"

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  %1 = bitcast i32* %x to <4 x i32>*
  store <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32>* %1
  %2 = getelementptr [4 x i32]* %0, i32 0, i32 %index
  %3 = load i32* %2
  store i32 %3, i32 addrspace(1)* %out
  ret void
}

The next step for me is to figure out how to replace the GEP + load with an
extractelement.
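
As a starting point, that rewrite can be viewed as a pattern match on the
load's pointer operand: a GEP into an alloca of [N x T] whose first index is 0.
The Python sketch below works on a made-up tuple encoding of instructions (none
of these names exist in LLVM) and only shows which conditions would need
checking. It assumes %index is in range and does not handle anything else.

```python
from typing import Dict, List, Optional, Tuple

# Toy encoding (hypothetical):
#   ("alloca", count, elem_ty) | ("gep", base, first_idx, idx) | ("load", ty, ptr)
Inst = Tuple


def rewrite_dynamic_load(defs: Dict[str, Inst], load_name: str) -> Optional[List[str]]:
    """Return replacement IR text for a load through a GEP into an array alloca."""
    load = defs.get(load_name)
    if load is None or load[0] != "load":
        return None
    _, load_ty, ptr = load
    gep = defs.get(ptr)
    if gep is None or gep[0] != "gep":
        return None
    _, base, first_idx, idx = gep
    alloca = defs.get(base)
    if alloca is None or alloca[0] != "alloca":
        return None
    _, count, elem_ty = alloca
    # The GEP must step into the array itself and the load must read one element.
    if first_idx != 0 or load_ty != elem_ty:
        return None
    vec_ty = f"<{count} x {elem_ty}>"
    return [
        f"%cast = bitcast [{count} x {elem_ty}]* {base} to {vec_ty}*",
        f"%vec = load {vec_ty}* %cast",
        f"{load_name} = extractelement {vec_ty} %vec, i32 {idx}",
    ]


defs = {
    "%0": ("alloca", 4, "i32"),
    "%1": ("gep", "%0", 0, "%index"),
    "%2": ("load", "i32", "%1"),
}
print("\n".join(rewrite_dynamic_load(defs, "%2")))
```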

-Tom

