Hi, I've been playing around with the SLPVectorizer trying to get it to vectorize this simple program: define void @vector(i32 addrspace(1)* %out, i32 %index) { entry: %0 = alloca [4 x i32] %x = getelementptr [4 x i32]* %0, i32 0, i32 0 %y = getelementptr [4 x i32]* %0, i32 0, i32 1 %z = getelementptr [4 x i32]* %0, i32 0, i32 2 %w = getelementptr [4 x i32]* %0, i32 0, i32 3 store i32 0, i32* %x store i32 1, i32* %y store i32 2, i32* %z store i32 3, i32* %w %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index %2 = load i32* %1 store i32 %2, i32 addrspace(1)* %out ret void } My goal is to have this program transformed to the following: define void @vector(i32 addrspace(1)* %out, i32 %index) { entry: %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index store i32 %0, i32 addrspace(1)* %out } I've slightly modified the SLPVectorizer (see the attached patch) so that it will vectorize small trees, and I've also fixed a crash in the BoUpSLP::Gather() function when it is passed a list of store instructions. With this patch, the command: opt -slp-vectorizer -debug -march=r600 -mcpu=redwood -o - vector-alloca.ll -S -slp-threshold=-20 Produces the following output and the program remains unchanged: === SLP: Analyzing blocks in vector. SLP: Found 5 stores to vectorize. SLP: Analyzing a store chain of length 4. SLP: Analyzing a store chain of length 4 SLP: Analyzing 4 stores at offset 0 SLP: Checking users of store i32 0, i32* %x. SLP: Checking users of store i32 1, i32* %y. SLP: Checking users of store i32 2, i32* %z. SLP: Checking users of store i32 3, i32* %w. SLP: We are able to schedule this bundle. SLP: Can't sink store i32 0, i32* %x down to store i32 3, i32* %w because of store i32 1, i32* %y. Gathering. SLP: Calculating cost for tree of size 1. SLP: Check whether the tree with height 1 is fully vectorizable . SLP: Adding cost 4 for bundle that starts with store i32 0, i32* %x . SLP: Total Cost 4. SLP: Found cost=4 for VF=4 SLP: Decided to vectorize cost=4 SLP: Extracting 0 values . SLP: Optimizing 0 gather sequences instructions. SLP: vectorized "vector" === I'm having a little trouble figuring out why the stores do not end up being vectorized. Does anyone have any insight into this? Should this pass be able to perform the desired transformation? Thanks, Tom -------------- next part -------------- A non-text attachment was scrubbed... Name: slp-vectorize-alloc.patch Type: text/x-diff Size: 1121 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131024/8accc7cc/attachment.patch>
Hi Tom, Thanks for working on this. The SLP-vectorizer thinks that %X %Y %Z and %W alias, so it tries to perform 4 scalar store operations (which is a bad idea). We need to figure out why AA thinks that X and Y may alias. Maybe there is a problem with the code that uses AA. Thanks, Nadav On Oct 24, 2013, at 2:04 PM, Tom Stellard <tom at stellard.net> wrote:> Hi, > > I've been playing around with the SLPVectorizer trying to get it to > vectorize this simple program: > > define void @vector(i32 addrspace(1)* %out, i32 %index) { > entry: > %0 = alloca [4 x i32] > %x = getelementptr [4 x i32]* %0, i32 0, i32 0 > %y = getelementptr [4 x i32]* %0, i32 0, i32 1 > %z = getelementptr [4 x i32]* %0, i32 0, i32 2 > %w = getelementptr [4 x i32]* %0, i32 0, i32 3 > store i32 0, i32* %x > store i32 1, i32* %y > store i32 2, i32* %z > store i32 3, i32* %w > %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index > %2 = load i32* %1 > store i32 %2, i32 addrspace(1)* %out > ret void > } > > My goal is to have this program transformed to the following: > > define void @vector(i32 addrspace(1)* %out, i32 %index) { > entry: > %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index > store i32 %0, i32 addrspace(1)* %out > } > > I've slightly modified the SLPVectorizer (see the attached patch) so > that it will vectorize small trees, and I've also fixed a crash in the > BoUpSLP::Gather() function when it is passed a list of store > instructions. With this patch, the command: > > opt -slp-vectorizer -debug -march=r600 -mcpu=redwood -o - vector-alloca.ll -S -slp-threshold=-20 > > Produces the following output and the program remains unchanged: > > ===> > SLP: Analyzing blocks in vector. > SLP: Found 5 stores to vectorize. > SLP: Analyzing a store chain of length 4. > SLP: Analyzing a store chain of length 4 > SLP: Analyzing 4 stores at offset 0 > SLP: Checking users of store i32 0, i32* %x. > SLP: Checking users of store i32 1, i32* %y. > SLP: Checking users of store i32 2, i32* %z. > SLP: Checking users of store i32 3, i32* %w. > SLP: We are able to schedule this bundle. > SLP: Can't sink store i32 0, i32* %x > down to store i32 3, i32* %w > because of store i32 1, i32* %y. Gathering. > SLP: Calculating cost for tree of size 1. > SLP: Check whether the tree with height 1 is fully vectorizable . > SLP: Adding cost 4 for bundle that starts with store i32 0, i32* %x . > SLP: Total Cost 4. > SLP: Found cost=4 for VF=4 > SLP: Decided to vectorize cost=4 > SLP: Extracting 0 values . > SLP: Optimizing 0 gather sequences instructions. > SLP: vectorized "vector" > > ===> > I'm having a little trouble figuring out why the stores do not end up > being vectorized. Does anyone have any insight into this? Should this > pass be able to perform the desired transformation? > > Thanks, > Tom > > <slp-vectorize-alloc.patch>_______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On Thu, Oct 24, 2013 at 2:04 PM, Tom Stellard <tom at stellard.net> wrote:> Hi, > > I've been playing around with the SLPVectorizer trying to get it to > vectorize this simple program: > > define void @vector(i32 addrspace(1)* %out, i32 %index) { > entry: > %0 = alloca [4 x i32] > %x = getelementptr [4 x i32]* %0, i32 0, i32 0 > %y = getelementptr [4 x i32]* %0, i32 0, i32 1 > %z = getelementptr [4 x i32]* %0, i32 0, i32 2 > %w = getelementptr [4 x i32]* %0, i32 0, i32 3 > store i32 0, i32* %x > store i32 1, i32* %y > store i32 2, i32* %z > store i32 3, i32* %w > %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index > %2 = load i32* %1 > store i32 %2, i32 addrspace(1)* %out > ret void > } > > My goal is to have this program transformed to the following: > > define void @vector(i32 addrspace(1)* %out, i32 %index) { > entry: > %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index > store i32 %0, i32 addrspace(1)* %out > } > > I've slightly modified the SLPVectorizerJust a note, I don't think you should or need to vectorize the actual alloca stuff. If you can simply transform the dynamically indexed load: define void @vector(i32 addrspace(1)* %out, i32 %index) { entry: %0 = alloca [4 x i32] %x = getelementptr [4 x i32]* %0, i32 0, i32 0 %y = getelementptr [4 x i32]* %0, i32 0, i32 1 %z = getelementptr [4 x i32]* %0, i32 0, i32 2 %w = getelementptr [4 x i32]* %0, i32 0, i32 3 store i32 0, i32* %x store i32 1, i32* %y store i32 2, i32* %z store i32 3, i32* %w %1 = bitcast [4 x i32]* %0 to <4 x i32>* %2 = load <4 x i32>* %1 %3 = extractelement <4 x i32> %2, i32 %index store i32 %3, i32 addrspace(1)* %out ret void } Then running SROA and InstCombine will mop up the rest. So its mostly about getting the SLPVectorizer to handle the dynamic GEP. As soon as it does that, everything else will fall away. Not sure how much this helps, just wanted to point it out.> (see the attached patch) so > that it will vectorize small trees, and I've also fixed a crash in the > BoUpSLP::Gather() function when it is passed a list of store > instructions. With this patch, the command: > > opt -slp-vectorizer -debug -march=r600 -mcpu=redwood -o - vector-alloca.ll > -S -slp-threshold=-20 > > Produces the following output and the program remains unchanged: > > ===> > SLP: Analyzing blocks in vector. > SLP: Found 5 stores to vectorize. > SLP: Analyzing a store chain of length 4. > SLP: Analyzing a store chain of length 4 > SLP: Analyzing 4 stores at offset 0 > SLP: Checking users of store i32 0, i32* %x. > SLP: Checking users of store i32 1, i32* %y. > SLP: Checking users of store i32 2, i32* %z. > SLP: Checking users of store i32 3, i32* %w. > SLP: We are able to schedule this bundle. > SLP: Can't sink store i32 0, i32* %x > down to store i32 3, i32* %w > because of store i32 1, i32* %y. Gathering. > SLP: Calculating cost for tree of size 1. > SLP: Check whether the tree with height 1 is fully vectorizable . > SLP: Adding cost 4 for bundle that starts with store i32 0, i32* %x . > SLP: Total Cost 4. > SLP: Found cost=4 for VF=4 > SLP: Decided to vectorize cost=4 > SLP: Extracting 0 values . > SLP: Optimizing 0 gather sequences instructions. > SLP: vectorized "vector" > > ===> > I'm having a little trouble figuring out why the stores do not end up > being vectorized. Does anyone have any insight into this? Should this > pass be able to perform the desired transformation? > > Thanks, > Tom > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131024/8ab12e9b/attachment.html>
On Oct 24, 2013, at 3:00 PM, Chandler Carruth <chandlerc at google.com> wrote:> Just a note, I don't think you should or need to vectorize the actual alloca stuff. If you can simply transform the dynamically indexed load: > > Then running SROA and InstCombine will mop up the rest. So its mostly about getting the SLPVectorizer to handle the dynamic GEP. As soon as it does that, everything else will fall away. >I don’t think that Tom wants the SLP-vectorizer to handle dynamic GEPS. What he wants is for the SLP-vectorizer to convert the first part of the code:>> define void @vector(i32 addrspace(1)* %out, i32 %index) { >> entry: >> %0 = alloca [4 x i32] >> %x = getelementptr [4 x i32]* %0, i32 0, i32 0 >> %y = getelementptr [4 x i32]* %0, i32 0, i32 1 >> %z = getelementptr [4 x i32]* %0, i32 0, i32 2 >> %w = getelementptr [4 x i32]* %0, i32 0, i32 3 >> store i32 0, i32* %x >> store i32 1, i32* %y >> store i32 2, i32* %z >> store i32 3, i32* %w >Into this: Store <i32 0, i32 1, i32 2, i32 3> ....> >> %1 = bitcast [4 x i32]* %0 to <4 x i32>* >> %2 = load <4 x i32>* %1 >> %3 = extractelement <4 x i32> %2, i32 %index >> store i32 %3, i32 addrspace(1)* %out >> ret void >> } >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131024/00c771b8/attachment.html>
On Thu, Oct 24, 2013 at 02:15:00PM -0700, Nadav Rotem wrote:> Hi Tom, > > Thanks for working on this. The SLP-vectorizer thinks that %X %Y %Z and %W alias, so it tries to perform 4 scalar store operations (which is a bad idea). We need to figure out why AA thinks that X and Y may alias. Maybe there is a problem with the code that uses AA. >Thanks for the tip. opt was using NoAliasAnalysis by default, so passing -basicaa to opt gives me this now: SLP: Analyzing blocks in vector. SLP: Found 5 stores to vectorize. SLP: Analyzing a store chain of length 4. SLP: Analyzing a store chain of length 4 SLP: Analyzing 4 stores at offset 0 SLP: Checking users of store i32 0, i32* %x. SLP: Checking users of store i32 1, i32* %y. SLP: Checking users of store i32 2, i32* %z. SLP: Checking users of store i32 3, i32* %w. SLP: We are able to schedule this bundle. SLP: added a vector of stores. SLP: Gathering due to C,S,B,O. SLP: Calculating cost for tree of size 2. SLP: Check whether the tree with height 2 is fully vectorizable . SLP: Adding cost -3 for bundle that starts with store i32 0, i32* %x . SLP: Adding cost 0 for bundle that starts with i32 0 . SLP: Total Cost -3. SLP: Found cost=-3 for VF=4 SLP: Decided to vectorize cost=-3 SLP: Extracting 0 values . SLP: Erasing scalar: store i32 0, i32* %x. SLP: Erasing scalar: store i32 1, i32* %y. SLP: Erasing scalar: store i32 2, i32* %z. SLP: Erasing scalar: store i32 3, i32* %w. SLP: Optimizing 0 gather sequences instructions. SLP: vectorized "vector" ; ModuleID = 'vector-alloca.ll' target datalayout "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-v2048:2048:2048-n32:64" target triple = "r600--" define void @vector(i32 addrspace(1)* %out, i32 %index) { entry: %0 = alloca [4 x i32] %x = getelementptr [4 x i32]* %0, i32 0, i32 0 %y = getelementptr [4 x i32]* %0, i32 0, i32 1 %z = getelementptr [4 x i32]* %0, i32 0, i32 2 %w = getelementptr [4 x i32]* %0, i32 0, i32 3 %1 = bitcast i32* %x to <4 x i32>* store <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32>* %1 %2 = getelementptr [4 x i32]* %0, i32 0, i32 %index %3 = load i32* %2 store i32 %3, i32 addrspace(1)* %out ret void } The next step for me is to figure out how to replace the GEP + load with an extractelement. -Tom> Thanks, > Nadav > > > On Oct 24, 2013, at 2:04 PM, Tom Stellard <tom at stellard.net> wrote: > > > Hi, > > > > I've been playing around with the SLPVectorizer trying to get it to > > vectorize this simple program: > > > > define void @vector(i32 addrspace(1)* %out, i32 %index) { > > entry: > > %0 = alloca [4 x i32] > > %x = getelementptr [4 x i32]* %0, i32 0, i32 0 > > %y = getelementptr [4 x i32]* %0, i32 0, i32 1 > > %z = getelementptr [4 x i32]* %0, i32 0, i32 2 > > %w = getelementptr [4 x i32]* %0, i32 0, i32 3 > > store i32 0, i32* %x > > store i32 1, i32* %y > > store i32 2, i32* %z > > store i32 3, i32* %w > > %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index > > %2 = load i32* %1 > > store i32 %2, i32 addrspace(1)* %out > > ret void > > } > > > > My goal is to have this program transformed to the following: > > > > define void @vector(i32 addrspace(1)* %out, i32 %index) { > > entry: > > %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index > > store i32 %0, i32 addrspace(1)* %out > > } > > > > I've slightly modified the SLPVectorizer (see the attached patch) so > > that it will vectorize small trees, and I've also fixed a crash in the > > BoUpSLP::Gather() function when it is passed a list of store > > instructions. With this patch, the command: > > > > opt -slp-vectorizer -debug -march=r600 -mcpu=redwood -o - vector-alloca.ll -S -slp-threshold=-20 > > > > Produces the following output and the program remains unchanged: > > > > ===> > > > SLP: Analyzing blocks in vector. > > SLP: Found 5 stores to vectorize. > > SLP: Analyzing a store chain of length 4. > > SLP: Analyzing a store chain of length 4 > > SLP: Analyzing 4 stores at offset 0 > > SLP: Checking users of store i32 0, i32* %x. > > SLP: Checking users of store i32 1, i32* %y. > > SLP: Checking users of store i32 2, i32* %z. > > SLP: Checking users of store i32 3, i32* %w. > > SLP: We are able to schedule this bundle. > > SLP: Can't sink store i32 0, i32* %x > > down to store i32 3, i32* %w > > because of store i32 1, i32* %y. Gathering. > > SLP: Calculating cost for tree of size 1. > > SLP: Check whether the tree with height 1 is fully vectorizable . > > SLP: Adding cost 4 for bundle that starts with store i32 0, i32* %x . > > SLP: Total Cost 4. > > SLP: Found cost=4 for VF=4 > > SLP: Decided to vectorize cost=4 > > SLP: Extracting 0 values . > > SLP: Optimizing 0 gather sequences instructions. > > SLP: vectorized "vector" > > > > ===> > > > I'm having a little trouble figuring out why the stores do not end up > > being vectorized. Does anyone have any insight into this? Should this > > pass be able to perform the desired transformation? > > > > Thanks, > > Tom > > > > <slp-vectorize-alloc.patch>_______________________________________________ > > LLVM Developers mailing list > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >