Tobias Grosser
2012-Feb-03 09:28 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
Hi Hal,

this is one of the first test cases I would love to have improved
vectorizer support for. I sent it out earlier, but I think it is a good
time to look into it again, now that the vectorizer has been committed.

The basic example is a set of scalar loads that load four consecutive
elements and store them back right away. For me this is an obvious case
where vectorization is beneficial (scalar.ll):

  define i32 @main() nounwind {
    %V1 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0), align 16
    %V2 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 1), align 4
    %V3 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 2), align 8
    %V4 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 3), align 4
    store float %V1, float* getelementptr ([1024 x float]* @B, i64 0, i64 0), align 16
    store float %V2, float* getelementptr ([1024 x float]* @B, i64 0, i64 1), align 4
    store float %V3, float* getelementptr ([1024 x float]* @B, i64 0, i64 2), align 8
    store float %V4, float* getelementptr ([1024 x float]* @B, i64 0, i64 3), align 4
    ret i32 0
  }

opt -O3 -vectorize cannot optimize this out of the box, as the required
chain is too short.

Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the code:

  define i32 @main() nounwind {
    %V1 = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*), align 16
    store <4 x float> %V1, <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*), align 16
    ret i32 0
  }

Is there any way we can make this case work by default? Maybe we can
decrease the required chain depth to 2 and increase the cost of
non-stride-one loads and stores?

Another, probably unrelated, point: I also tried a run with
-bb-vectorize-req-chain-depth=1. The generated code is full of
shufflevector instructions and eight-element vectors. To me this is
entirely unexpected. Do you have any idea what is going on here?

Tobi

-------------- next part --------------
Name: scalar.ll
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120203/b797055f/attachment.ksh>
-------------- next part --------------
Name: vector.ll
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120203/b797055f/attachment-0001.ksh>
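As quoted, scalar.ll is not self-contained: it assumes two global float
arrays @A and @B. Since the actual attachment was scrubbed, the
following definitions are a reconstruction (the zeroinitializer bodies
are an assumption) that makes the example runnable as-is:

  @A = global [1024 x float] zeroinitializer, align 16
  @B = global [1024 x float] zeroinitializer, align 16

With those globals prepended, the behaviour described above can be
reproduced with a command along these lines:

  opt -O3 -vectorize -bb-vectorize-req-chain-depth=2 -S scalar.ll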
Hal Finkel
2012-Feb-03 13:50 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
On Fri, 2012-02-03 at 10:28 +0100, Tobias Grosser wrote:
> Hi Hal,
>
> this is one of the first test cases I would love to have improved
> vectorizer support for. [...]
>
> opt -O3 -vectorize cannot optimize this out of the box, as the
> required chain is too short.
>
> Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the
> code. [...]
>
> Is there any way we can make this case work by default? Maybe we can
> decrease the required chain depth to 2 and increase the cost of
> non-stride-one loads and stores?

Making the default chain length 2 will lead to a lot of unprofitable
vectorization. I think we'll probably want to do something like make
getDepthFactor return 3 for loads and stores (or make the default chain
length 4 and make getDepthFactor return 2 for loads and stores). We
should experiment with this [this was already on my post-commit TODO
list].

> Another, probably unrelated, point: I also tried a run with
> -bb-vectorize-req-chain-depth=1. The generated code is full of
> shufflevector instructions and eight-element vectors. To me this is
> entirely unexpected. Do you have any idea what is going on here?

A chain length of 1 means "vectorize any pairs that you possibly can",
and it will do this iteratively until it cannot do it any more. As the
iteration continues it will pair the previously-paired instructions,
until the requested bit limit is reached, and so you'll end up with
long vectors (of short types).

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
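For concreteness, the change Hal describes would live in BBVectorize's
depth-factor helper. The following C++ is a sketch only: the real
function in BBVectorize.cpp and the change eventually committed
(r149761) may differ in details, and the constant 3 comes from Hal's
first suggestion above, not from the tree.

  // Sketch: weight memory operations more heavily when measuring a
  // dependency chain's effective depth (constants are illustrative).
  #include "llvm/Instructions.h"   // LLVM 3.0-era header layout
  using namespace llvm;

  static inline size_t getDepthFactor(Value *V) {
    // Insert/extract element are likely free after vectorization and
    // contribute nothing to the effective chain depth.
    if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V))
      return 0;
    // Hypothetical tweak: count loads and stores as 3, so a bare
    // load -> store pair already has an effective depth of 6.
    if (isa<LoadInst>(V) || isa<StoreInst>(V))
      return 3;
    return 1;
  }

With a factor of 3 for memory operations, Tobias's load -> store chain
would have an effective depth of 6, enough to clear the default
-bb-vectorize-req-chain-depth threshold (6, if I read the cl::opt
default of the time correctly) without any extra flags.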
Hal Finkel
2012-Feb-04 04:21 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
On Fri, 2012-02-03 at 10:28 +0100, Tobias Grosser wrote:
> [...]
>
> Is there any way we can make this case work by default? Maybe we can
> decrease the required chain depth to 2 and increase the cost of
> non-stride-one loads and stores?

Try it now (after r149761). If this "solution" causes other problems,
then we may need to think of something more sophisticated.

 -Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
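With r149761 applied, the expectation is that the default settings now
handle this pattern. One way to check, assuming the reconstructed
scalar.ll from the start of the thread, is a run like:

  opt -O3 -vectorize -S scalar.ll

and confirming that the output contains the single <4 x float>
load/store pair rather than the four scalar pairs.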
Pekka Jääskeläinen
2012-Feb-04 14:32 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
Hello,

Thanks for your work on the bb-vectorizer. It looks like a promising
pass to be used for multi-work-item vectorization in pocl.

On 02/04/2012 06:21 AM, Hal Finkel wrote:
> Try it now (after r149761). If this "solution" causes other problems,
> then we may need to think of something more sophisticated.

I wonder if the case where a store is the last user of a value could be
treated as a special case. Code that reads, computes, and writes values
in a fully parallelizable (unrolled) loop is an optimal case for
vectorization, as it might not incur any unpack/pack overhead at all.

In the case of the bb-vectorizer (if I understood the parameters
correctly), if the final store (or actually, any final consumer of a
value) were weighed more heavily in the chain-length computation, it
could allow using a large chain-length threshold while still picking up
these "embarrassingly parallel loop" cases, where there are potentially
many updates to different variables in memory but only short preceding
computation chains.

Such embarrassingly parallel loops are the basic case when vectorizing
multiple instances of OpenCL C kernels, which are parallel by
definition. E.g., a case where the kernel does something like:

  A = read mem
  B = read mem
  C = add A, B
  write C to mem
  D = read mem
  E = read mem
  F = mul D, E
  write F to mem

When this is parallelized N times in the work-group, the vectorizer
might fail to vectorize multiple "kernel iterations" properly, as the
independent computation chains/live ranges (e.g. from D to F) are quite
short. Still, vectorization is very beneficial here because, as we
know, fully parallel loops vectorize perfectly without unpack/pack
overheads when all the operations can be vectorized (as is the case
here, where one can scale the work-group size to match the vector
width).

BR,
-- Pekka
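To make the work-group scenario concrete, Pekka's pseudo-kernel,
unrolled across work-items, would reach the vectorizer roughly as IR of
the following shape. This is a sketch under assumptions: 2012-era
typed-pointer syntax, and hypothetical globals @A, @B, @C standing in
for the kernel's buffers (pocl's real output would differ):

  ; work-item 0 of "C = A + B": the chain is just load -> load -> fadd -> store
  %a0 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0), align 16
  %b0 = load float* getelementptr ([1024 x float]* @B, i64 0, i64 0), align 16
  %c0 = fadd float %a0, %b0
  store float %c0, float* getelementptr ([1024 x float]* @C, i64 0, i64 0), align 16
  ; work-items 1..3 repeat the same pattern at indices 1..3

The ideal packed form, with no unpack/pack overhead, is then simply:

  %va = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*), align 16
  %vb = load <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*), align 16
  %vc = fadd <4 x float> %va, %vb
  store <4 x float> %vc, <4 x float>* bitcast ([1024 x float]* @C to <4 x float>*), align 16

Each scalar chain here is only a few instructions deep, which is
exactly why a threshold keyed to chain length alone can miss the case
even though the block as a whole is perfectly parallel.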