thr3ads.net - llvm dev - [LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization [Nov 2014]

If this information is useful, please help other people find it:
Share via:

Arnold Schwaighofer

2014-Nov-28 20:43 UTC

[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

On 11/28/14, suyog sarda   wrote:> The IR will probably have something like :
> 
> 
> 1. Extract a[0] and put it in vec1 <2 x i32>, 0
> 2. Extract a[1] and put it in vec1 <2 x i32>, 1
> 2. Extract a[2] and put it in vec2 <2 x i32>, 0
> 3. Extract a[3] and put it in vec2 <2 x i32>, 1
> 4. Add vec1 and vec2, sum in vec3 <2 x i32>
> 5. Extract vec3[0] in sum1 
> 6. Extract vec3[1] in sum2
> 7 add sum1 and sum2 in sum3
> 8. return sum3
> 
> 
> So overall instructions - 6 'extractlement', 4
'insertelement', 1 vector add, 1 scalar add and 1 return statement. We
have vectorized add operation.
Hi Suyog,

Have a look at the code in HorizontalReduction::getReductionCost and
HorizontalReduction::emitReduction.

You don't need 4 extracts. This can be modeled at the IR level as a
combination of shufflevector and vector add instruction on a <4 x i32>
vector. TargetTransformInfo::getReductionCost can return the appropriate cost
(for example, one for AArch64::getReductionCost(add, <4 x i32>)) if
codegen can implement this sequence of instructions more efficiently.

For a <4 x i32> reduction you need only need two vector shuffles, two
vector adds and one vector extract to get the scalar result.

vadd <0, 1, 2, 3>
         <2, 3, x, x> // shuffled
=>

<0+2, 1+3, x, x>

vadd <0+2, 1+3, x x>
         <1+3, x, x x> // shuffled
=> 

<0+2+1+3, x, x, x>

What it takes to get your example working in the SLPVectorizer is:

* Get the matching code up to snuff. I think, we should replace the depth first
search matcher by explicitly matching the trees we expect in
HorizontalReduction::matchReduction. The code should just look for:

   (+ (+ (+ v1 v2) v3) v4)
    and maybe
    (+ ( + v1 v2) (+ v3 v4))

    explicitly for v1, .., vn identical operations.

* Allow a tree of size of one (the vector loads) if the tree feeds a reduction.

* Adjust the cost model AArch64::getReductionCost

* AArch64 CodeGen would have to recognize the shuffle reduction if it does not
do so already

Best,
Arnold

James Molloy

2014-Nov-28 21:06 UTC

head link

[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

Hi Arnold,

Thanks for cc'ing me on this. As we discussed at the devmtg, my personal
view on this is that the reductions might be better represented as an
intrinsic - the matching code is rather complex for the system of shuffles,
is duplicated in all backends and is not particularly robust due to the
complexity of the pattern.

Intrinsics could lower to this pattern if there is no ISA support for a
target- in the meantime it keeps the semantics without allowing later
passes to muck up the matchable pattern.

I have a patch mostly implementing this but it's stuck in my copious
post-devmtg queue (notably with the LNT improvments I promised...)

What's your opinion on this?

Cheers,

James

On Fri, 28 Nov 2014 at 21:00, Arnold Schwaighofer <aschwaighofer at
apple.com>
wrote:
> On 11/28/14, suyog sarda   wrote:
> > The IR will probably have something like :
> >
> >
> > 1. Extract a[0] and put it in vec1 <2 x i32>, 0
> > 2. Extract a[1] and put it in vec1 <2 x i32>, 1
> > 2. Extract a[2] and put it in vec2 <2 x i32>, 0
> > 3. Extract a[3] and put it in vec2 <2 x i32>, 1
> > 4. Add vec1 and vec2, sum in vec3 <2 x i32>
> > 5. Extract vec3[0] in sum1
> > 6. Extract vec3[1] in sum2
> > 7 add sum1 and sum2 in sum3
> > 8. return sum3
> >
> >
> > So overall instructions - 6 'extractlement', 4
'insertelement', 1 vector
> add, 1 scalar add and 1 return statement. We have vectorized add operation.
>
> Hi Suyog,
>
> Have a look at the code in HorizontalReduction::getReductionCost and
> HorizontalReduction::emitReduction.
>
> You don't need 4 extracts. This can be modeled at the IR level as a
> combination of shufflevector and vector add instruction on a <4 x
i32>
> vector. TargetTransformInfo::getReductionCost can return the appropriate
> cost (for example, one for AArch64::getReductionCost(add, <4 x i32>))
if
> codegen can implement this sequence of instructions more efficiently.
>
> For a <4 x i32> reduction you need only need two vector shuffles, two
> vector adds and one vector extract to get the scalar result.
>
> vadd <0, 1, 2, 3>
>          <2, 3, x, x> // shuffled
> =>
>
> <0+2, 1+3, x, x>
>
>
> vadd <0+2, 1+3, x x>
>          <1+3, x, x x> // shuffled
> =>
>
> <0+2+1+3, x, x, x>
>
> What it takes to get your example working in the SLPVectorizer is:
>
> * Get the matching code up to snuff. I think, we should replace the depth
> first search matcher by explicitly matching the trees we expect in
> HorizontalReduction::matchReduction. The code should just look for:
>
>    (+ (+ (+ v1 v2) v3) v4)
>     and maybe
>     (+ ( + v1 v2) (+ v3 v4))
>
>     explicitly for v1, .., vn identical operations.
>
> * Allow a tree of size of one (the vector loads) if the tree feeds a
> reduction.
>
> * Adjust the cost model AArch64::getReductionCost
>
> * AArch64 CodeGen would have to recognize the shuffle reduction if it does
> not do so already
>
>
>
> Best,
> Arnold
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141128/1afd60b6/attachment.html>

suyog sarda

2014-Nov-29 07:59 UTC

head link

[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

>
>
> Have a look at the code in HorizontalReduction::getReductionCost and
> HorizontalReduction::emitReduction.
>
> You don't need 4 extracts. This can be modeled at the IR level as a
> combination of shufflevector and vector add instruction on a <4 x
i32>
> vector. TargetTransformInfo::getReductionCost can return the appropriate
> cost (for example, one for AArch64::getReductionCost(add, <4 x i32>))
if
> codegen can implement this sequence of instructions more efficiently.
>
> For a <4 x i32> reduction you need only need two vector shuffles, two
> vector adds and one vector extract to get the scalar result.
>
> vadd <0, 1, 2, 3>
>          <2, 3, x, x> // shuffled
> =>
>
> <0+2, 1+3, x, x>
>
>
> vadd <0+2, 1+3, x x>
>          <1+3, x, x x> // shuffled
> =>
>
> <0+2+1+3, x, x, x>
>
Ahh!! Shuffle vector comes to the rescue. Thanks Arnold for pointing out. I
ignored it completely in above explanation.

>
> What it takes to get your example working in the SLPVectorizer is:
>
> * Get the matching code up to snuff. I think, we should replace the depth
> first search matcher by explicitly matching the trees we expect in
> HorizontalReduction::matchReduction. The code should just look for:
>
>    (+ (+ (+ v1 v2) v3) v4)
>     and maybe
>     (+ ( + v1 v2) (+ v3 v4))
>
>     explicitly for v1, .., vn identical operations.
>
> * Allow a tree of size of one (the vector loads) if the tree feeds a
> reduction.
>
> * Adjust the cost model AArch64::getReductionCost
>
> * AArch64 CodeGen would have to recognize the shuffle reduction if it does
> not do so already
>
>
>Seems everything boils down to properly identifying the reduction chain. I
did look at your patch provided in earlier thread (similar discussion), it
was working for reduction chain of 4 elements (+(+(+( v1, v2) v3) v4).
However, when i tried it for 8 elements, it was asserting. I will look into
it. Thanks for getting back on this.

-- 
With regards,
Suyog Sarda
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141129/a014405e/attachment.html>

llvm dev - Nov 2014 - [LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization