thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a vector reduction add instruction. [Nov 2015]

If this information is useful, please help other people find it:
Share via:

Cong Hou via llvm-dev

2015-Nov-13 00:16 UTC

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Hi

When a reduction instruction is vectorized in a loop, it will be
turned into an instruction with vector operands of the same operation
type. This new instruction has a special property that can give us
more flexibility during instruction selection later: this operation is
valid as long as the reduction of all elements of the result vector is
identical to the reduction of all elements of its operands.

One example that can benefit this property is SAD (sum of absolute
differences) pattern detection in SSE2, which provides a psadbw
instruction whose description is shown below:

'''
psadbw: Compute the absolute differences of packed unsigned 8-bit
integers in a and b, then horizontally sum each consecutive 8
differences to produce two unsigned 16-bit integers, and pack these
unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
'''

In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one
v4i32 as output. However, psadbw will actually produce one i32 result
for four pairs of 8-bit integers (an already reduced result), and the
result is stored in the first element in v4i32. If we properly zero
out the other three elements in v4i32, and with the information that
we have a reduction add that is performed on this result, then we can
safely use psadbw here for much better performance. This can be done
during DAG combine. Another similar example is dot product. And I
think there may be many other scenarios that can benefit from this
property like eliminating redundant shuffles.

The question is, how to let DAG combiner know that a vector operation
is a reduction one?

Here I propose to introduce a "reduction add" instruction for vectors.
This will be a new instruction with vector operands only. Normally it
is treated as a normal ADD operation, but the selection DAG combiner
can make use of this new operation to generate better instructions.
This new instruction is generated when vectorizing reduction add in
loop vectorizer.

I would like to hear more comments on this proposal or suggestions of
better alternative implementations.


thanks,
Cong

Xinliang David Li via llvm-dev

2015-Nov-13 18:03 UTC

head link

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Reductions can be implemented in two ways with SIMD
1) partial reduction in the loop followed by a full reduction outside the
loop
2) use full SIMD reduction instruction in the loop

Having a way for the vectorizer to express patterns for 2) sounds useful.

The hsum intrinsic proposed by Shahid looks like a full reduction for
'ADD'
operation. The question is 1) do we need to add new instruction for this?
2) do we need to generalize vector reduction to cover other scalar opcodes
such as min/max etc?

As regards SAD implementation: the IR idiom for SAD involves integer
promotion/widening, so it seems to me that what is missing is a widening
diff operation/intrinsic. With that, the SAD pattern can be expressed by
vectorizor as :

for (... ) {
   v16i8  v1 = vect_load (...)
   v16i8  v2 = vect_load (...)
   v16i16  vdiff = vect_widen_diff(v1, v2)
   v16i16  vabsdiff = vect_abs(vdiff)
   i16 ss = vect_hsum(vabsdiff)
   r += ss
   ..
}

David


On Thu, Nov 12, 2015 at 4:16 PM, Cong Hou <congh at google.com> wrote:
> Hi
>
> When a reduction instruction is vectorized in a loop, it will be
> turned into an instruction with vector operands of the same operation
> type. This new instruction has a special property that can give us
> more flexibility during instruction selection later: this operation is
> valid as long as the reduction of all elements of the result vector is
> identical to the reduction of all elements of its operands.
>
> One example that can benefit this property is SAD (sum of absolute
> differences) pattern detection in SSE2, which provides a psadbw
> instruction whose description is shown below:
>
> '''
> psadbw: Compute the absolute differences of packed unsigned 8-bit
> integers in a and b, then horizontally sum each consecutive 8
> differences to produce two unsigned 16-bit integers, and pack these
> unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
> '''
>
> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one
> v4i32 as output. However, psadbw will actually produce one i32 result
> for four pairs of 8-bit integers (an already reduced result), and the
> result is stored in the first element in v4i32. If we properly zero
> out the other three elements in v4i32, and with the information that
> we have a reduction add that is performed on this result, then we can
> safely use psadbw here for much better performance. This can be done
> during DAG combine. Another similar example is dot product. And I
> think there may be many other scenarios that can benefit from this
> property like eliminating redundant shuffles.
>
> The question is, how to let DAG combiner know that a vector operation
> is a reduction one?
>
> Here I propose to introduce a "reduction add" instruction for
vectors.
> This will be a new instruction with vector operands only. Normally it
> is treated as a normal ADD operation, but the selection DAG combiner
> can make use of this new operation to generate better instructions.
> This new instruction is generated when vectorizing reduction add in
> loop vectorizer.
>
> I would like to hear more comments on this proposal or suggestions of
> better alternative implementations.
>
>
> thanks,
> Cong
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20151113/7870dc3e/attachment.html>

Cong Hou via llvm-dev

2015-Nov-16 19:23 UTC

head link

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

On Fri, Nov 13, 2015 at 10:03 AM, Xinliang David Li <davidxl at
google.com> wrote:> Reductions can be implemented in two ways with SIMD
> 1) partial reduction in the loop followed by a full reduction outside the
> loop
> 2) use full SIMD reduction instruction in the loop
>
> Having a way for the vectorizer to express patterns for 2) sounds useful.
>
> The hsum intrinsic proposed by Shahid looks like a full reduction for
'ADD'
> operation. The question is 1) do we need to add new instruction for this?
2)
> do we need to generalize vector reduction to cover other scalar opcodes
such
> as min/max etc?
>
> As regards SAD implementation: the IR idiom for SAD involves integer
> promotion/widening, so it seems to me that what is missing is a widening
> diff operation/intrinsic. With that, the SAD pattern can be expressed by
> vectorizor as :
>
> for (... ) {
>    v16i8  v1 = vect_load (...)
>    v16i8  v2 = vect_load (...)
>    v16i16  vdiff = vect_widen_diff(v1, v2)
>    v16i16  vabsdiff = vect_abs(vdiff)
>    i16 ss = vect_hsum(vabsdiff)
>    r += ss
>    ..
> }
I think doing full reduction with hadd is unnecessary in loop: in most
cases it is inefficient than the partial reduction in loop + full
reduction outside of loop. But SSE's SAD and dot-product instructions
do horizontal adds in nature and if we want to combine instructions
into them, we need the information that we are doing reductions. That
is why I am proposing a reduction add IR operation.

Cong
>
> David
>
>
> On Thu, Nov 12, 2015 at 4:16 PM, Cong Hou <congh at google.com>
wrote:
>>
>> Hi
>>
>> When a reduction instruction is vectorized in a loop, it will be
>> turned into an instruction with vector operands of the same operation
>> type. This new instruction has a special property that can give us
>> more flexibility during instruction selection later: this operation is
>> valid as long as the reduction of all elements of the result vector is
>> identical to the reduction of all elements of its operands.
>>
>> One example that can benefit this property is SAD (sum of absolute
>> differences) pattern detection in SSE2, which provides a psadbw
>> instruction whose description is shown below:
>>
>> '''
>> psadbw: Compute the absolute differences of packed unsigned 8-bit
>> integers in a and b, then horizontally sum each consecutive 8
>> differences to produce two unsigned 16-bit integers, and pack these
>> unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
>> '''
>>
>> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and
one
>> v4i32 as output. However, psadbw will actually produce one i32 result
>> for four pairs of 8-bit integers (an already reduced result), and the
>> result is stored in the first element in v4i32. If we properly zero
>> out the other three elements in v4i32, and with the information that
>> we have a reduction add that is performed on this result, then we can
>> safely use psadbw here for much better performance. This can be done
>> during DAG combine. Another similar example is dot product. And I
>> think there may be many other scenarios that can benefit from this
>> property like eliminating redundant shuffles.
>>
>> The question is, how to let DAG combiner know that a vector operation
>> is a reduction one?
>>
>> Here I propose to introduce a "reduction add" instruction for
vectors.
>> This will be a new instruction with vector operands only. Normally it
>> is treated as a normal ADD operation, but the selection DAG combiner
>> can make use of this new operation to generate better instructions.
>> This new instruction is generated when vectorizing reduction add in
>> loop vectorizer.
>>
>> I would like to hear more comments on this proposal or suggestions of
>> better alternative implementations.
>>
>>
>> thanks,
>> Cong
>
>

Seemingly Similar Threads

Search for more reasonably related threads

llvm dev - Nov 2015 - [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Seemingly Similar Threads