Tobias Grosser
2012-Feb-03 09:28 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
Hi Hal,

this is one of the first test cases I would love to have improved
vectorizer support for. I sent it out earlier, but I think it is a good
time to look into it again, now that the vectorizer has been committed.

The basic example is a set of scalar loads that load four consecutive
elements and store them back right away. For me this is an obvious case
where vectorization is beneficial (scalar.ll):

  define i32 @main() nounwind {
    %V1 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0), align 16
    %V2 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 1), align 4
    %V3 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 2), align 8
    %V4 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 3), align 4
    store float %V1, float* getelementptr ([1024 x float]* @B, i64 0, i64 0), align 16
    store float %V2, float* getelementptr ([1024 x float]* @B, i64 0, i64 1), align 4
    store float %V3, float* getelementptr ([1024 x float]* @B, i64 0, i64 2), align 8
    store float %V4, float* getelementptr ([1024 x float]* @B, i64 0, i64 3), align 4
    ret i32 0
  }

opt -O3 -vectorize cannot optimize this out of the box, as the required
chain is too short.

Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the code:

  define i32 @main() nounwind {
    %V1 = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*), align 16
    store <4 x float> %V1, <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*), align 16
    ret i32 0
  }

Is there any way we can make this case work by default? Maybe we can
decrease the required chain depth to 2 and increase the cost of
non-stride-one loads and stores?

Another, probably unrelated, point: I also tried a run with
-bb-vectorize-req-chain-depth=1. The generated code is full of
shufflevector instructions and eight-element vectors. To me this is
entirely unexpected. Do you have any idea what is going on here?

Tobi

-------------- next part --------------
Name: scalar.ll
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120203/b797055f/attachment.ksh>
-------------- next part --------------
Name: vector.ll
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120203/b797055f/attachment-0001.ksh>
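As quoted, scalar.ll is not self-contained: it assumes two global float
arrays @A and @B. Since the actual attachment was scrubbed, the
following definitions are a reconstruction (the zeroinitializer bodies
are an assumption) that makes the example runnable as-is:

  @A = global [1024 x float] zeroinitializer, align 16
  @B = global [1024 x float] zeroinitializer, align 16

With those globals prepended, the behaviour described above can be
reproduced with a command along these lines:

  opt -O3 -vectorize -bb-vectorize-req-chain-depth=2 -S scalar.ll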
Hal Finkel
2012-Feb-03 13:50 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
On Fri, 2012-02-03 at 10:28 +0100, Tobias Grosser wrote:
> Hi Hal,
>
> this is one of the first test cases I would love to have improved
> vectorizer support for. [...]
>
> opt -O3 -vectorize cannot optimize this out of the box, as the
> required chain is too short.
>
> Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the
> code. [...]
>
> Is there any way we can make this case work by default? Maybe we can
> decrease the required chain depth to 2 and increase the cost of
> non-stride-one loads and stores?

Making the default chain length 2 will lead to a lot of unprofitable
vectorization. I think we'll probably want to do something like make
getDepthFactor return 3 for loads and stores (or make the default chain
length 4 and make getDepthFactor return 2 for loads and stores). We
should experiment with this [this was already on my post-commit TODO
list].

> Another, probably unrelated, point: I also tried a run with
> -bb-vectorize-req-chain-depth=1. The generated code is full of
> shufflevector instructions and eight-element vectors. To me this is
> entirely unexpected. Do you have any idea what is going on here?

A chain length of 1 means "vectorize any pairs that you possibly can",
and it will do this iteratively until it cannot do it any more. As the
iteration continues it will pair the previously-paired instructions,
until the requested bit limit is reached, and so you'll end up with
long vectors (of short types).

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
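For concreteness, the change Hal describes would live in BBVectorize's
depth-factor helper. The following C++ is a sketch only: the real
function in BBVectorize.cpp and the change eventually committed
(r149761) may differ in details, and the constant 3 comes from Hal's
first suggestion above, not from the tree.

  // Sketch: weight memory operations more heavily when measuring a
  // dependency chain's effective depth (constants are illustrative).
  #include "llvm/Instructions.h"   // LLVM 3.0-era header layout
  using namespace llvm;

  static inline size_t getDepthFactor(Value *V) {
    // Insert/extract element are likely free after vectorization and
    // contribute nothing to the effective chain depth.
    if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V))
      return 0;
    // Hypothetical tweak: count loads and stores as 3, so a bare
    // load -> store pair already has an effective depth of 6.
    if (isa<LoadInst>(V) || isa<StoreInst>(V))
      return 3;
    return 1;
  }

With a factor of 3 for memory operations, Tobias's load -> store chain
would have an effective depth of 6, enough to clear the default
-bb-vectorize-req-chain-depth threshold (6, if I read the cl::opt
default of the time correctly) without any extra flags.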
Hal Finkel
2012-Feb-04 04:21 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
On Fri, 2012-02-03 at 10:28 +0100, Tobias Grosser wrote:
> [...]
>
> Is there any way we can make this case work by default? Maybe we can
> decrease the required chain depth to 2 and increase the cost of
> non-stride-one loads and stores?

Try it now (after r149761). If this "solution" causes other problems,
then we may need to think of something more sophisticated.

 -Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
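With r149761 applied, the expectation is that the default settings now
handle this pattern. One way to check, assuming the reconstructed
scalar.ll from the start of the thread, is a run like:

  opt -O3 -vectorize -S scalar.ll

and confirming that the output contains the single <4 x float>
load/store pair rather than the four scalar pairs.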
Pekka Jääskeläinen
2012-Feb-04 14:32 UTC
[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short
Hello,

Thanks for your work on the bb-vectorizer. It looks like a promising
pass to be used for multi-work-item vectorization in pocl.

On 02/04/2012 06:21 AM, Hal Finkel wrote:
> Try it now (after r149761). If this "solution" causes other problems,
> then we may need to think of something more sophisticated.

I wonder if the case where a store is the last user of a value could be
treated as a special case. Code that reads, computes, and writes values
in a fully parallelizable (unrolled) loop is an optimal case for
vectorization, as it might not incur any unpack/pack overhead at all.

In the case of the bb-vectorizer (if I understood the parameters
correctly), if the final store (or actually, any final consumer of a
value) were weighed more heavily in the chain-length computation, it
could allow using a large chain-length threshold while still picking up
these "embarrassingly parallel loop" cases, where there are potentially
many updates to different variables in memory but only short preceding
computation chains.

Such embarrassingly parallel loops are the basic case when vectorizing
multiple instances of OpenCL C kernels, which are parallel by
definition. E.g., a case where the kernel does something like:

  A = read mem
  B = read mem
  C = add A, B
  write C to mem
  D = read mem
  E = read mem
  F = mul D, E
  write F to mem

When this is parallelized N times in the work-group, the vectorizer
might fail to vectorize multiple "kernel iterations" properly, as the
independent computation chains/live ranges (e.g. from D to F) are quite
short. Still, vectorization is very beneficial here because, as we
know, fully parallel loops vectorize perfectly without unpack/pack
overheads when all the operations can be vectorized (as is the case
here, where one can scale the work-group size to match the vector
width).

BR,
-- Pekka
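To make the work-group scenario concrete, Pekka's pseudo-kernel,
unrolled across work-items, would reach the vectorizer roughly as IR of
the following shape. This is a sketch under assumptions: 2012-era
typed-pointer syntax, and hypothetical globals @A, @B, @C standing in
for the kernel's buffers (pocl's real output would differ):

  ; work-item 0 of "C = A + B": the chain is just load -> load -> fadd -> store
  %a0 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0), align 16
  %b0 = load float* getelementptr ([1024 x float]* @B, i64 0, i64 0), align 16
  %c0 = fadd float %a0, %b0
  store float %c0, float* getelementptr ([1024 x float]* @C, i64 0, i64 0), align 16
  ; work-items 1..3 repeat the same pattern at indices 1..3

The ideal packed form, with no unpack/pack overhead, is then simply:

  %va = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*), align 16
  %vb = load <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*), align 16
  %vc = fadd <4 x float> %va, %vb
  store <4 x float> %vc, <4 x float>* bitcast ([1024 x float]* @C to <4 x float>*), align 16

Each scalar chain here is only a few instructions deep, which is
exactly why a threshold keyed to chain length alone can miss the case
even though the block as a whole is perfectly parallel.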