Arnold Schwaighofer
2014-Nov-28 20:43 UTC
[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization
On 11/28/14, suyog sarda wrote:> The IR will probably have something like : > > > 1. Extract a[0] and put it in vec1 <2 x i32>, 0 > 2. Extract a[1] and put it in vec1 <2 x i32>, 1 > 2. Extract a[2] and put it in vec2 <2 x i32>, 0 > 3. Extract a[3] and put it in vec2 <2 x i32>, 1 > 4. Add vec1 and vec2, sum in vec3 <2 x i32> > 5. Extract vec3[0] in sum1 > 6. Extract vec3[1] in sum2 > 7 add sum1 and sum2 in sum3 > 8. return sum3 > > > So overall instructions - 6 'extractlement', 4 'insertelement', 1 vector add, 1 scalar add and 1 return statement. We have vectorized add operation.Hi Suyog, Have a look at the code in HorizontalReduction::getReductionCost and HorizontalReduction::emitReduction. You don't need 4 extracts. This can be modeled at the IR level as a combination of shufflevector and vector add instruction on a <4 x i32> vector. TargetTransformInfo::getReductionCost can return the appropriate cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if codegen can implement this sequence of instructions more efficiently. For a <4 x i32> reduction you need only need two vector shuffles, two vector adds and one vector extract to get the scalar result. vadd <0, 1, 2, 3> <2, 3, x, x> // shuffled => <0+2, 1+3, x, x> vadd <0+2, 1+3, x x> <1+3, x, x x> // shuffled => <0+2+1+3, x, x, x> What it takes to get your example working in the SLPVectorizer is: * Get the matching code up to snuff. I think, we should replace the depth first search matcher by explicitly matching the trees we expect in HorizontalReduction::matchReduction. The code should just look for: (+ (+ (+ v1 v2) v3) v4) and maybe (+ ( + v1 v2) (+ v3 v4)) explicitly for v1, .., vn identical operations. * Allow a tree of size of one (the vector loads) if the tree feeds a reduction. * Adjust the cost model AArch64::getReductionCost * AArch64 CodeGen would have to recognize the shuffle reduction if it does not do so already Best, Arnold
James Molloy
2014-Nov-28 21:06 UTC
[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization
Hi Arnold, Thanks for cc'ing me on this. As we discussed at the devmtg, my personal view on this is that the reductions might be better represented as an intrinsic - the matching code is rather complex for the system of shuffles, is duplicated in all backends and is not particularly robust due to the complexity of the pattern. Intrinsics could lower to this pattern if there is no ISA support for a target- in the meantime it keeps the semantics without allowing later passes to muck up the matchable pattern. I have a patch mostly implementing this but it's stuck in my copious post-devmtg queue (notably with the LNT improvments I promised...) What's your opinion on this? Cheers, James On Fri, 28 Nov 2014 at 21:00, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:> On 11/28/14, suyog sarda wrote: > > The IR will probably have something like : > > > > > > 1. Extract a[0] and put it in vec1 <2 x i32>, 0 > > 2. Extract a[1] and put it in vec1 <2 x i32>, 1 > > 2. Extract a[2] and put it in vec2 <2 x i32>, 0 > > 3. Extract a[3] and put it in vec2 <2 x i32>, 1 > > 4. Add vec1 and vec2, sum in vec3 <2 x i32> > > 5. Extract vec3[0] in sum1 > > 6. Extract vec3[1] in sum2 > > 7 add sum1 and sum2 in sum3 > > 8. return sum3 > > > > > > So overall instructions - 6 'extractlement', 4 'insertelement', 1 vector > add, 1 scalar add and 1 return statement. We have vectorized add operation. > > Hi Suyog, > > Have a look at the code in HorizontalReduction::getReductionCost and > HorizontalReduction::emitReduction. > > You don't need 4 extracts. This can be modeled at the IR level as a > combination of shufflevector and vector add instruction on a <4 x i32> > vector. TargetTransformInfo::getReductionCost can return the appropriate > cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if > codegen can implement this sequence of instructions more efficiently. > > For a <4 x i32> reduction you need only need two vector shuffles, two > vector adds and one vector extract to get the scalar result. > > vadd <0, 1, 2, 3> > <2, 3, x, x> // shuffled > => > > <0+2, 1+3, x, x> > > > vadd <0+2, 1+3, x x> > <1+3, x, x x> // shuffled > => > > <0+2+1+3, x, x, x> > > What it takes to get your example working in the SLPVectorizer is: > > * Get the matching code up to snuff. I think, we should replace the depth > first search matcher by explicitly matching the trees we expect in > HorizontalReduction::matchReduction. The code should just look for: > > (+ (+ (+ v1 v2) v3) v4) > and maybe > (+ ( + v1 v2) (+ v3 v4)) > > explicitly for v1, .., vn identical operations. > > * Allow a tree of size of one (the vector loads) if the tree feeds a > reduction. > > * Adjust the cost model AArch64::getReductionCost > > * AArch64 CodeGen would have to recognize the shuffle reduction if it does > not do so already > > > > Best, > Arnold >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141128/1afd60b6/attachment.html>
suyog sarda
2014-Nov-29 07:59 UTC
[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization
> > > Have a look at the code in HorizontalReduction::getReductionCost and > HorizontalReduction::emitReduction. > > You don't need 4 extracts. This can be modeled at the IR level as a > combination of shufflevector and vector add instruction on a <4 x i32> > vector. TargetTransformInfo::getReductionCost can return the appropriate > cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if > codegen can implement this sequence of instructions more efficiently. > > For a <4 x i32> reduction you need only need two vector shuffles, two > vector adds and one vector extract to get the scalar result. > > vadd <0, 1, 2, 3> > <2, 3, x, x> // shuffled > => > > <0+2, 1+3, x, x> > > > vadd <0+2, 1+3, x x> > <1+3, x, x x> // shuffled > => > > <0+2+1+3, x, x, x> >Ahh!! Shuffle vector comes to the rescue. Thanks Arnold for pointing out. I ignored it completely in above explanation.> > What it takes to get your example working in the SLPVectorizer is: > > * Get the matching code up to snuff. I think, we should replace the depth > first search matcher by explicitly matching the trees we expect in > HorizontalReduction::matchReduction. The code should just look for: > > (+ (+ (+ v1 v2) v3) v4) > and maybe > (+ ( + v1 v2) (+ v3 v4)) > > explicitly for v1, .., vn identical operations. > > * Allow a tree of size of one (the vector loads) if the tree feeds a > reduction. > > * Adjust the cost model AArch64::getReductionCost > > * AArch64 CodeGen would have to recognize the shuffle reduction if it does > not do so already > > >Seems everything boils down to properly identifying the reduction chain. I did look at your patch provided in earlier thread (similar discussion), it was working for reduction chain of 4 elements (+(+(+( v1, v2) v3) v4). However, when i tried it for 8 elements, it was asserting. I will look into it. Thanks for getting back on this. -- With regards, Suyog Sarda -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141129/a014405e/attachment.html>