Cong Hou via llvm-dev
2015-Nov-13 00:16 UTC
[llvm-dev] [RFC] Introducing a vector reduction add instruction.
Hi When a reduction instruction is vectorized in a loop, it will be turned into an instruction with vector operands of the same operation type. This new instruction has a special property that can give us more flexibility during instruction selection later: this operation is valid as long as the reduction of all elements of the result vector is identical to the reduction of all elements of its operands. One example that can benefit this property is SAD (sum of absolute differences) pattern detection in SSE2, which provides a psadbw instruction whose description is shown below: ''' psadbw: Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce two unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst. ''' In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one v4i32 as output. However, psadbw will actually produce one i32 result for four pairs of 8-bit integers (an already reduced result), and the result is stored in the first element in v4i32. If we properly zero out the other three elements in v4i32, and with the information that we have a reduction add that is performed on this result, then we can safely use psadbw here for much better performance. This can be done during DAG combine. Another similar example is dot product. And I think there may be many other scenarios that can benefit from this property like eliminating redundant shuffles. The question is, how to let DAG combiner know that a vector operation is a reduction one? Here I propose to introduce a "reduction add" instruction for vectors. This will be a new instruction with vector operands only. Normally it is treated as a normal ADD operation, but the selection DAG combiner can make use of this new operation to generate better instructions. This new instruction is generated when vectorizing reduction add in loop vectorizer. I would like to hear more comments on this proposal or suggestions of better alternative implementations. thanks, Cong
Xinliang David Li via llvm-dev
2015-Nov-13 18:03 UTC
[llvm-dev] [RFC] Introducing a vector reduction add instruction.
Reductions can be implemented in two ways with SIMD 1) partial reduction in the loop followed by a full reduction outside the loop 2) use full SIMD reduction instruction in the loop Having a way for the vectorizer to express patterns for 2) sounds useful. The hsum intrinsic proposed by Shahid looks like a full reduction for 'ADD' operation. The question is 1) do we need to add new instruction for this? 2) do we need to generalize vector reduction to cover other scalar opcodes such as min/max etc? As regards SAD implementation: the IR idiom for SAD involves integer promotion/widening, so it seems to me that what is missing is a widening diff operation/intrinsic. With that, the SAD pattern can be expressed by vectorizor as : for (... ) { v16i8 v1 = vect_load (...) v16i8 v2 = vect_load (...) v16i16 vdiff = vect_widen_diff(v1, v2) v16i16 vabsdiff = vect_abs(vdiff) i16 ss = vect_hsum(vabsdiff) r += ss .. } David On Thu, Nov 12, 2015 at 4:16 PM, Cong Hou <congh at google.com> wrote:> Hi > > When a reduction instruction is vectorized in a loop, it will be > turned into an instruction with vector operands of the same operation > type. This new instruction has a special property that can give us > more flexibility during instruction selection later: this operation is > valid as long as the reduction of all elements of the result vector is > identical to the reduction of all elements of its operands. > > One example that can benefit this property is SAD (sum of absolute > differences) pattern detection in SSE2, which provides a psadbw > instruction whose description is shown below: > > ''' > psadbw: Compute the absolute differences of packed unsigned 8-bit > integers in a and b, then horizontally sum each consecutive 8 > differences to produce two unsigned 16-bit integers, and pack these > unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst. > ''' > > In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one > v4i32 as output. However, psadbw will actually produce one i32 result > for four pairs of 8-bit integers (an already reduced result), and the > result is stored in the first element in v4i32. If we properly zero > out the other three elements in v4i32, and with the information that > we have a reduction add that is performed on this result, then we can > safely use psadbw here for much better performance. This can be done > during DAG combine. Another similar example is dot product. And I > think there may be many other scenarios that can benefit from this > property like eliminating redundant shuffles. > > The question is, how to let DAG combiner know that a vector operation > is a reduction one? > > Here I propose to introduce a "reduction add" instruction for vectors. > This will be a new instruction with vector operands only. Normally it > is treated as a normal ADD operation, but the selection DAG combiner > can make use of this new operation to generate better instructions. > This new instruction is generated when vectorizing reduction add in > loop vectorizer. > > I would like to hear more comments on this proposal or suggestions of > better alternative implementations. > > > thanks, > Cong >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20151113/7870dc3e/attachment.html>
Cong Hou via llvm-dev
2015-Nov-16 19:23 UTC
[llvm-dev] [RFC] Introducing a vector reduction add instruction.
On Fri, Nov 13, 2015 at 10:03 AM, Xinliang David Li <davidxl at google.com> wrote:> Reductions can be implemented in two ways with SIMD > 1) partial reduction in the loop followed by a full reduction outside the > loop > 2) use full SIMD reduction instruction in the loop > > Having a way for the vectorizer to express patterns for 2) sounds useful. > > The hsum intrinsic proposed by Shahid looks like a full reduction for 'ADD' > operation. The question is 1) do we need to add new instruction for this? 2) > do we need to generalize vector reduction to cover other scalar opcodes such > as min/max etc? > > As regards SAD implementation: the IR idiom for SAD involves integer > promotion/widening, so it seems to me that what is missing is a widening > diff operation/intrinsic. With that, the SAD pattern can be expressed by > vectorizor as : > > for (... ) { > v16i8 v1 = vect_load (...) > v16i8 v2 = vect_load (...) > v16i16 vdiff = vect_widen_diff(v1, v2) > v16i16 vabsdiff = vect_abs(vdiff) > i16 ss = vect_hsum(vabsdiff) > r += ss > .. > }I think doing full reduction with hadd is unnecessary in loop: in most cases it is inefficient than the partial reduction in loop + full reduction outside of loop. But SSE's SAD and dot-product instructions do horizontal adds in nature and if we want to combine instructions into them, we need the information that we are doing reductions. That is why I am proposing a reduction add IR operation. Cong> > David > > > On Thu, Nov 12, 2015 at 4:16 PM, Cong Hou <congh at google.com> wrote: >> >> Hi >> >> When a reduction instruction is vectorized in a loop, it will be >> turned into an instruction with vector operands of the same operation >> type. This new instruction has a special property that can give us >> more flexibility during instruction selection later: this operation is >> valid as long as the reduction of all elements of the result vector is >> identical to the reduction of all elements of its operands. >> >> One example that can benefit this property is SAD (sum of absolute >> differences) pattern detection in SSE2, which provides a psadbw >> instruction whose description is shown below: >> >> ''' >> psadbw: Compute the absolute differences of packed unsigned 8-bit >> integers in a and b, then horizontally sum each consecutive 8 >> differences to produce two unsigned 16-bit integers, and pack these >> unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst. >> ''' >> >> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one >> v4i32 as output. However, psadbw will actually produce one i32 result >> for four pairs of 8-bit integers (an already reduced result), and the >> result is stored in the first element in v4i32. If we properly zero >> out the other three elements in v4i32, and with the information that >> we have a reduction add that is performed on this result, then we can >> safely use psadbw here for much better performance. This can be done >> during DAG combine. Another similar example is dot product. And I >> think there may be many other scenarios that can benefit from this >> property like eliminating redundant shuffles. >> >> The question is, how to let DAG combiner know that a vector operation >> is a reduction one? >> >> Here I propose to introduce a "reduction add" instruction for vectors. >> This will be a new instruction with vector operands only. Normally it >> is treated as a normal ADD operation, but the selection DAG combiner >> can make use of this new operation to generate better instructions. >> This new instruction is generated when vectorizing reduction add in >> loop vectorizer. >> >> I would like to hear more comments on this proposal or suggestions of >> better alternative implementations. >> >> >> thanks, >> Cong > >