thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a vector reduction add instruction. [Nov 2015]

If this information is useful, please help other people find it:
Share via:

Cong Hou via llvm-dev

2015-Nov-19 21:12 UTC

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

After some attempt to implement reduce-add in LLVM, I found out a
easier way to detect reduce-add without introducing new IR operations.
The basic idea is annotating phi node instead of add (so that it is
easier to handle other reduction operations). In PHINode class, we can
add a flag indicating if the phi node is a reduction one (the flag can
be set in loop vectorizer for vectorized phi nodes). Then when we
build SDNode for instruction selection, we detect those reduction phi
nodes and then annotate reduction operations. This requires an
additional flag in SDNodeFlags. We can then check this flag when
combining instructions to detect reduction operations.

In this approach, I have managed to let LLVM compile a SAD loop into
psadbw instructions.

Source code:


const int N = 1024;
unsigned char a[N], b[N];

int sad() {
  int s = 0;
  for (int i = 0; i < N; ++i) {
    int res = a[i] - b[i];
    s += (res > 0) ? res : -res;
  }
  return s;
}


Emitted instructions on X86:



# BB#0:                                 # %entry
pxor %xmm0, %xmm0
movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
pxor %xmm1, %xmm1
.align 16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
psadbw %xmm2, %xmm3
paddd %xmm3, %xmm0
movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
psadbw %xmm2, %xmm3
paddd %xmm3, %xmm1
addq $8, %rax
jne .LBB0_1
# BB#2:                                 # %middle.block
paddd %xmm0, %xmm1
pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
paddd %xmm1, %xmm0
pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
paddd %xmm0, %xmm1
movd %xmm1, %eax
retq


Note that due to smaller VF we are using now (currently 4), we could
not explore the most benefit of psadbw. The patch in
http://reviews.llvm.org/D8943 has enables us to use bigger VFs based
on the smallest type in a loop. The follow-up work is refining the
cost model to let bigger VFs have less cost. For the example above the
best result is from VF >=16.

The draft of the patch is here: http://reviews.llvm.org/D14840

I will refine the patch later and submit it for review.


thanks,
Cong


On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com>
wrote:> On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> <Asghar-ahmad.Shahid at amd.com> wrote:
>> Hi Cong,
>>
>>> -----Original Message-----
>>> From: Cong Hou [mailto:congh at google.com]
>>> Sent: Tuesday, November 17, 2015 12:47 AM
>>> To: Shahid, Asghar-ahmad
>>> Cc: David Li
>>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
instruction.
>>>
>>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad <Asghar-
>>> ahmad.Shahid at amd.com> wrote:
>>> > Hi Cong,
>>> >
>>> > We had proposed an intrinsic approach to do this. However the
>>> > discussion reached to a point where it was asked that
"Why do we need
>>> > another approach if "reduction add" can be pattern
matched in
>>> DAGCombine?"
>>> > However I feel if we have strong enough rationale for
introduction of
>>> > this instruction, it would be great. The 1st link below has
the complete
>>> discussion about the intrinsic approach.
>>>
>>> Yes, I think introducing such a reduction add can let us do pattern
recognition
>>> of either SAD or dot production (or more?) without introducing any
>>> additional intrinsics.
>> I agree. Another use case could be POPCOUNT operation. Moreover, as
'reduction add'
>> Is being adopted by more targets now a days, reflecting that in LLVM IR
as an instruction
>> Is a good idea.
>> BTW, what is the idea of the syntax and semantic of this operation you
have?
>
> We can introduce a reduce-add for vectors only, or make it general so
> that it could also accept scale operands. Normally it is identical to
> normal add, but during instruction selection we can do pattern
> recognition based on more information provided by this new operations.
> For vectors, this means the result of this operation only guarantee
> that the sum of all elements in the result is identical to the sum of
> all elements of its operands. This gives us enough freedom to do
> aggressive transformations, such as SAD or dot-product.
>
> Do you think if this is enough for us to get there?
>
>
> Cong
>
>>
>> The main concern may be cost
>>> model: we could not guarantee that a SAD loop is unrolled 16 times
on SSE to
>>> make use the most benefit of SAD. After the patch
>>> http://reviews.llvm.org/D8943 is landed, I am now working on
improving cost
>>> models of type conversions on X86. I believe even without SAD
instruction
>>> we can still get better performance with unrolling a SAD loop 16
times. This
>>> seems tricky but it works. What do you think?
>> I also think same as we can increase the bandwidth with proper cost
modeling.
>>
>> Regards,
>> Shahid
>>>
>>> Cong
>>>
>>> >
>>> > http://reviews.llvm.org/D10964
>>> > http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
>>> >
>>> > Regards,
>>> > Shahid
>>> >
>>> >
>>> >
>>> >> -----Original Message-----
>>> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org]
On Behalf Of
>>> >> Cong Hou via llvm-dev
>>> >> Sent: Friday, November 13, 2015 5:47 AM
>>> >> To: llvm-dev at lists.llvm.org
>>> >> Cc: David Li
>>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction
add instruction.
>>> >>
>>> >> Hi
>>> >>
>>> >> When a reduction instruction is vectorized in a loop, it
will be
>>> >> turned into an instruction with vector operands of the
same operation
>>> >> type. This new instruction has a special property that can
give us
>>> >> more flexibility during instruction selection later: this
operation
>>> >> is valid as long as the reduction of all elements of the
result
>>> >> vector is identical to the reduction of all elements of
its operands.
>>> >>
>>> >> One example that can benefit this property is SAD (sum of
absolute
>>> >> differences) pattern detection in SSE2, which provides a
psadbw
>>> >> instruction whose description is shown below:
>>> >>
>>> >> '''
>>> >> psadbw: Compute the absolute differences of packed
unsigned 8-bit
>>> >> integers in a and b, then horizontally sum each
consecutive 8
>>> >> differences to produce two unsigned 16-bit integers, and
pack these
>>> >> unsigned 16-bit integers in the low 16 bits of 64-bit
elements in dst.
>>> >> '''
>>> >>
>>> >> In LLVM's IR, for a SAD loop we will have two v4i8 as
inputs and one
>>> >> v4i32 as output. However, psadbw will actually produce one
i32 result
>>> >> for four pairs of 8-bit integers (an already reduced
result), and the
>>> >> result is stored in the first element in v4i32. If we
properly zero
>>> >> out the other three elements in v4i32, and with the
information that
>>> >> we have a reduction add that is performed on this result,
then we can
>>> >> safely use psadbw here for much better performance. This
can be done
>>> >> during DAG combine. Another similar example is dot
product. And I
>>> >> think there may be many other scenarios that can benefit
from this
>>> >> property like eliminating redundant shuffles.
>>> >>
>>> >> The question is, how to let DAG combiner know that a
vector operation
>>> >> is a reduction one?
>>> >>
>>> >> Here I propose to introduce a "reduction add"
instruction for vectors.
>>> >> This will be a new instruction with vector operands only.
Normally it
>>> >> is treated as a normal ADD operation, but the selection
DAG combiner
>>> >> can make use of this new operation to generate better
instructions.
>>> >> This new instruction is generated when vectorizing
reduction add in
>>> >> loop vectorizer.
>>> >>
>>> >> I would like to hear more comments on this proposal or
suggestions of
>>> >> better alternative implementations.
>>> >>
>>> >>
>>> >> thanks,
>>> >> Cong
>>> >> _______________________________________________
>>> >> LLVM Developers mailing list
>>> >> llvm-dev at lists.llvm.org
>>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Shahid, Asghar-ahmad via llvm-dev

2015-Nov-20 05:20 UTC

head link

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Hi Cong,

You hit the right place.Seems I could not get your question regarding this
earlier.
This is what we did to find reduction pattern at IR level in loop vectorizer.
Adding SDNode flag seems to be the right thing to do.
>For the example above the best result is from VF >=16.
That is right. I also observed that for constant loop bound <16, the pattern
gets unrolled
To straight line code. This may be due to the cost-model inefficiency. So after
the improvement
Of cost-model this case would also be realized.

Regards,
Shahid
> -----Original Message-----
> From: Cong Hou [mailto:congh at google.com]
> Sent: Friday, November 20, 2015 2:42 AM
> To: llvm-dev
> Cc: David Li; Shahid, Asghar-ahmad
> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
instruction.
> 
> After some attempt to implement reduce-add in LLVM, I found out a easier
> way to detect reduce-add without introducing new IR operations.
> The basic idea is annotating phi node instead of add (so that it is easier
to
> handle other reduction operations). In PHINode class, we can add a flag
> indicating if the phi node is a reduction one (the flag can be set in loop
> vectorizer for vectorized phi nodes). Then when we build SDNode for
> instruction selection, we detect those reduction phi nodes and then
> annotate reduction operations. This requires an additional flag in
> SDNodeFlags. We can then check this flag when combining instructions to
> detect reduction operations.
> 
> In this approach, I have managed to let LLVM compile a SAD loop into psadbw
> instructions.
> 
> Source code:
> 
> 
> const int N = 1024;
> unsigned char a[N], b[N];
> 
> int sad() {
>   int s = 0;
>   for (int i = 0; i < N; ++i) {
>     int res = a[i] - b[i];
>     s += (res > 0) ? res : -res;
>   }
>   return s;
> }
> 
> 
> Emitted instructions on X86:
> 
> 
> 
> # BB#0:                                 # %entry
> pxor %xmm0, %xmm0
> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
> pxor %xmm1, %xmm1
> .align 16, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header:
Depth=1
> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm0
> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm1
> addq $8, %rax
> jne .LBB0_1
> # BB#2:                                 # %middle.block
> paddd %xmm0, %xmm1
> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
> paddd %xmm1, %xmm0
> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
> paddd %xmm0, %xmm1
> movd %xmm1, %eax
> retq
> 
> 
> Note that due to smaller VF we are using now (currently 4), we could not
> explore the most benefit of psadbw. The patch in
> http://reviews.llvm.org/D8943 has enables us to use bigger VFs based on the
> smallest type in a loop. The follow-up work is refining the cost model to
let
> bigger VFs have less cost. For the example above the best result is from VF
> >=16.
> 
> The draft of the patch is here: http://reviews.llvm.org/D14840
> 
> I will refine the patch later and submit it for review.
> 
> 
> thanks,
> Cong
> 
> 
> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com>
wrote:
> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> > <Asghar-ahmad.Shahid at amd.com> wrote:
> >> Hi Cong,
> >>
> >>> -----Original Message-----
> >>> From: Cong Hou [mailto:congh at google.com]
> >>> Sent: Tuesday, November 17, 2015 12:47 AM
> >>> To: Shahid, Asghar-ahmad
> >>> Cc: David Li
> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction
add
> instruction.
> >>>
> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
<Asghar-
> >>> ahmad.Shahid at amd.com> wrote:
> >>> > Hi Cong,
> >>> >
> >>> > We had proposed an intrinsic approach to do this. However
the
> >>> > discussion reached to a point where it was asked that
"Why do we
> >>> > need another approach if "reduction add" can be
pattern matched in
> >>> DAGCombine?"
> >>> > However I feel if we have strong enough rationale for
introduction
> >>> > of this instruction, it would be great. The 1st link
below has the
> >>> > complete
> >>> discussion about the intrinsic approach.
> >>>
> >>> Yes, I think introducing such a reduction add can let us do
pattern
> >>> recognition of either SAD or dot production (or more?) without
> >>> introducing any additional intrinsics.
> >> I agree. Another use case could be POPCOUNT operation. Moreover,
as
> 'reduction add'
> >> Is being adopted by more targets now a days, reflecting that in
LLVM
> >> IR as an instruction Is a good idea.
> >> BTW, what is the idea of the syntax and semantic of this operation
you
> have?
> >
> > We can introduce a reduce-add for vectors only, or make it general so
> > that it could also accept scale operands. Normally it is identical to
> > normal add, but during instruction selection we can do pattern
> > recognition based on more information provided by this new operations.
> > For vectors, this means the result of this operation only guarantee
> > that the sum of all elements in the result is identical to the sum of
> > all elements of its operands. This gives us enough freedom to do
> > aggressive transformations, such as SAD or dot-product.
> >
> > Do you think if this is enough for us to get there?
> >
> >
> > Cong
> >
> >>
> >> The main concern may be cost
> >>> model: we could not guarantee that a SAD loop is unrolled 16
times
> >>> on SSE to make use the most benefit of SAD. After the patch
> >>> http://reviews.llvm.org/D8943 is landed, I am now working on
> >>> improving cost models of type conversions on X86. I believe
even
> >>> without SAD instruction we can still get better performance
with
> >>> unrolling a SAD loop 16 times. This seems tricky but it works.
What do
> you think?
> >> I also think same as we can increase the bandwidth with proper
cost
> modeling.
> >>
> >> Regards,
> >> Shahid
> >>>
> >>> Cong
> >>>
> >>> >
> >>> > http://reviews.llvm.org/D10964
> >>> >
http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
> >>> >
> >>> > Regards,
> >>> > Shahid
> >>> >
> >>> >
> >>> >
> >>> >> -----Original Message-----
> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org] On Behalf
> >>> >> Of Cong Hou via llvm-dev
> >>> >> Sent: Friday, November 13, 2015 5:47 AM
> >>> >> To: llvm-dev at lists.llvm.org
> >>> >> Cc: David Li
> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector
reduction add
> instruction.
> >>> >>
> >>> >> Hi
> >>> >>
> >>> >> When a reduction instruction is vectorized in a loop,
it will be
> >>> >> turned into an instruction with vector operands of
the same
> >>> >> operation type. This new instruction has a special
property that
> >>> >> can give us more flexibility during instruction
selection later:
> >>> >> this operation is valid as long as the reduction of
all elements
> >>> >> of the result vector is identical to the reduction of
all elements of its
> operands.
> >>> >>
> >>> >> One example that can benefit this property is SAD
(sum of
> >>> >> absolute
> >>> >> differences) pattern detection in SSE2, which
provides a psadbw
> >>> >> instruction whose description is shown below:
> >>> >>
> >>> >> '''
> >>> >> psadbw: Compute the absolute differences of packed
unsigned 8-bit
> >>> >> integers in a and b, then horizontally sum each
consecutive 8
> >>> >> differences to produce two unsigned 16-bit integers,
and pack
> >>> >> these unsigned 16-bit integers in the low 16 bits of
64-bit elements in
> dst.
> >>> >> '''
> >>> >>
> >>> >> In LLVM's IR, for a SAD loop we will have two
v4i8 as inputs and
> >>> >> one
> >>> >> v4i32 as output. However, psadbw will actually
produce one i32
> >>> >> result for four pairs of 8-bit integers (an already
reduced
> >>> >> result), and the result is stored in the first
element in v4i32.
> >>> >> If we properly zero out the other three elements in
v4i32, and
> >>> >> with the information that we have a reduction add
that is
> >>> >> performed on this result, then we can safely use
psadbw here for
> >>> >> much better performance. This can be done during DAG
combine.
> >>> >> Another similar example is dot product. And I think
there may be
> >>> >> many other scenarios that can benefit from this
property like
> eliminating redundant shuffles.
> >>> >>
> >>> >> The question is, how to let DAG combiner know that a
vector
> >>> >> operation is a reduction one?
> >>> >>
> >>> >> Here I propose to introduce a "reduction
add" instruction for vectors.
> >>> >> This will be a new instruction with vector operands
only.
> >>> >> Normally it is treated as a normal ADD operation, but
the
> >>> >> selection DAG combiner can make use of this new
operation to
> generate better instructions.
> >>> >> This new instruction is generated when vectorizing
reduction add
> >>> >> in loop vectorizer.
> >>> >>
> >>> >> I would like to hear more comments on this proposal
or
> >>> >> suggestions of better alternative implementations.
> >>> >>
> >>> >>
> >>> >> thanks,
> >>> >> Cong
> >>> >> _______________________________________________
> >>> >> LLVM Developers mailing list
> >>> >> llvm-dev at lists.llvm.org
> >>> >>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Xinliang David Li via llvm-dev

2015-Nov-20 06:09 UTC

head link

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

This sounds like the right approach and the result is very promising. The
VF tuning patch is already in but just not enabled by default, right?

David

On Thu, Nov 19, 2015 at 1:12 PM, Cong Hou <congh at google.com> wrote:
> After some attempt to implement reduce-add in LLVM, I found out a
> easier way to detect reduce-add without introducing new IR operations.
> The basic idea is annotating phi node instead of add (so that it is
> easier to handle other reduction operations). In PHINode class, we can
> add a flag indicating if the phi node is a reduction one (the flag can
> be set in loop vectorizer for vectorized phi nodes). Then when we
> build SDNode for instruction selection, we detect those reduction phi
> nodes and then annotate reduction operations. This requires an
> additional flag in SDNodeFlags. We can then check this flag when
> combining instructions to detect reduction operations.
>
> In this approach, I have managed to let LLVM compile a SAD loop into
> psadbw instructions.
>
> Source code:
>
>
> const int N = 1024;
> unsigned char a[N], b[N];
>
> int sad() {
>   int s = 0;
>   for (int i = 0; i < N; ++i) {
>     int res = a[i] - b[i];
>     s += (res > 0) ? res : -res;
>   }
>   return s;
> }
>
>
> Emitted instructions on X86:
>
>
>
> # BB#0:                                 # %entry
> pxor %xmm0, %xmm0
> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
> pxor %xmm1, %xmm1
> .align 16, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header:
Depth=1
> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm0
> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm1
> addq $8, %rax
> jne .LBB0_1
> # BB#2:                                 # %middle.block
> paddd %xmm0, %xmm1
> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
> paddd %xmm1, %xmm0
> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
> paddd %xmm0, %xmm1
> movd %xmm1, %eax
> retq
>
>
> Note that due to smaller VF we are using now (currently 4), we could
> not explore the most benefit of psadbw. The patch in
> http://reviews.llvm.org/D8943 has enables us to use bigger VFs based
> on the smallest type in a loop. The follow-up work is refining the
> cost model to let bigger VFs have less cost. For the example above the
> best result is from VF >=16.
>
> The draft of the patch is here: http://reviews.llvm.org/D14840
>
> I will refine the patch later and submit it for review.
>
>
> thanks,
> Cong
>
>
> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com>
wrote:
> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> > <Asghar-ahmad.Shahid at amd.com> wrote:
> >> Hi Cong,
> >>
> >>> -----Original Message-----
> >>> From: Cong Hou [mailto:congh at google.com]
> >>> Sent: Tuesday, November 17, 2015 12:47 AM
> >>> To: Shahid, Asghar-ahmad
> >>> Cc: David Li
> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction
add
> instruction.
> >>>
> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
<Asghar-
> >>> ahmad.Shahid at amd.com> wrote:
> >>> > Hi Cong,
> >>> >
> >>> > We had proposed an intrinsic approach to do this. However
the
> >>> > discussion reached to a point where it was asked that
"Why do we need
> >>> > another approach if "reduction add" can be
pattern matched in
> >>> DAGCombine?"
> >>> > However I feel if we have strong enough rationale for
introduction of
> >>> > this instruction, it would be great. The 1st link below
has the
> complete
> >>> discussion about the intrinsic approach.
> >>>
> >>> Yes, I think introducing such a reduction add can let us do
pattern
> recognition
> >>> of either SAD or dot production (or more?) without introducing
any
> >>> additional intrinsics.
> >> I agree. Another use case could be POPCOUNT operation. Moreover,
as
> 'reduction add'
> >> Is being adopted by more targets now a days, reflecting that in
LLVM IR
> as an instruction
> >> Is a good idea.
> >> BTW, what is the idea of the syntax and semantic of this operation
you
> have?
> >
> > We can introduce a reduce-add for vectors only, or make it general so
> > that it could also accept scale operands. Normally it is identical to
> > normal add, but during instruction selection we can do pattern
> > recognition based on more information provided by this new operations.
> > For vectors, this means the result of this operation only guarantee
> > that the sum of all elements in the result is identical to the sum of
> > all elements of its operands. This gives us enough freedom to do
> > aggressive transformations, such as SAD or dot-product.
> >
> > Do you think if this is enough for us to get there?
> >
> >
> > Cong
> >
> >>
> >> The main concern may be cost
> >>> model: we could not guarantee that a SAD loop is unrolled 16
times on
> SSE to
> >>> make use the most benefit of SAD. After the patch
> >>> http://reviews.llvm.org/D8943 is landed, I am now working on
> improving cost
> >>> models of type conversions on X86. I believe even without SAD
> instruction
> >>> we can still get better performance with unrolling a SAD loop
16
> times. This
> >>> seems tricky but it works. What do you think?
> >> I also think same as we can increase the bandwidth with proper
cost
> modeling.
> >>
> >> Regards,
> >> Shahid
> >>>
> >>> Cong
> >>>
> >>> >
> >>> > http://reviews.llvm.org/D10964
> >>> >
http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
> >>> >
> >>> > Regards,
> >>> > Shahid
> >>> >
> >>> >
> >>> >
> >>> >> -----Original Message-----
> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org] On Behalf
> Of
> >>> >> Cong Hou via llvm-dev
> >>> >> Sent: Friday, November 13, 2015 5:47 AM
> >>> >> To: llvm-dev at lists.llvm.org
> >>> >> Cc: David Li
> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector
reduction add
> instruction.
> >>> >>
> >>> >> Hi
> >>> >>
> >>> >> When a reduction instruction is vectorized in a loop,
it will be
> >>> >> turned into an instruction with vector operands of
the same
> operation
> >>> >> type. This new instruction has a special property
that can give us
> >>> >> more flexibility during instruction selection later:
this operation
> >>> >> is valid as long as the reduction of all elements of
the result
> >>> >> vector is identical to the reduction of all elements
of its
> operands.
> >>> >>
> >>> >> One example that can benefit this property is SAD
(sum of absolute
> >>> >> differences) pattern detection in SSE2, which
provides a psadbw
> >>> >> instruction whose description is shown below:
> >>> >>
> >>> >> '''
> >>> >> psadbw: Compute the absolute differences of packed
unsigned 8-bit
> >>> >> integers in a and b, then horizontally sum each
consecutive 8
> >>> >> differences to produce two unsigned 16-bit integers,
and pack these
> >>> >> unsigned 16-bit integers in the low 16 bits of 64-bit
elements in
> dst.
> >>> >> '''
> >>> >>
> >>> >> In LLVM's IR, for a SAD loop we will have two
v4i8 as inputs and one
> >>> >> v4i32 as output. However, psadbw will actually
produce one i32
> result
> >>> >> for four pairs of 8-bit integers (an already reduced
result), and
> the
> >>> >> result is stored in the first element in v4i32. If we
properly zero
> >>> >> out the other three elements in v4i32, and with the
information that
> >>> >> we have a reduction add that is performed on this
result, then we
> can
> >>> >> safely use psadbw here for much better performance.
This can be done
> >>> >> during DAG combine. Another similar example is dot
product. And I
> >>> >> think there may be many other scenarios that can
benefit from this
> >>> >> property like eliminating redundant shuffles.
> >>> >>
> >>> >> The question is, how to let DAG combiner know that a
vector
> operation
> >>> >> is a reduction one?
> >>> >>
> >>> >> Here I propose to introduce a "reduction
add" instruction for
> vectors.
> >>> >> This will be a new instruction with vector operands
only. Normally
> it
> >>> >> is treated as a normal ADD operation, but the
selection DAG combiner
> >>> >> can make use of this new operation to generate better
instructions.
> >>> >> This new instruction is generated when vectorizing
reduction add in
> >>> >> loop vectorizer.
> >>> >>
> >>> >> I would like to hear more comments on this proposal
or suggestions
> of
> >>> >> better alternative implementations.
> >>> >>
> >>> >>
> >>> >> thanks,
> >>> >> Cong
> >>> >> _______________________________________________
> >>> >> LLVM Developers mailing list
> >>> >> llvm-dev at lists.llvm.org
> >>> >>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20151119/0529dff4/attachment.html>

Cong Hou via llvm-dev

2015-Nov-20 06:21 UTC

head link

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

On Thu, Nov 19, 2015 at 10:09 PM, Xinliang David Li <davidxl at
google.com> wrote:> This sounds like the right approach and the result is very promising. The
VF
> tuning patch is already in but just not enabled by default, right?
Yes, but the cost model is not ready yet, which depends on the patch
that optimizes promotion/demotion between vectors of i8 and i32
(http://reviews.llvm.org/D14588). Once this patch is checked in, I
will update the cost model to let bigger VF be chosen.


Cong
>
> David
>
> On Thu, Nov 19, 2015 at 1:12 PM, Cong Hou <congh at google.com>
wrote:
>>
>> After some attempt to implement reduce-add in LLVM, I found out a
>> easier way to detect reduce-add without introducing new IR operations.
>> The basic idea is annotating phi node instead of add (so that it is
>> easier to handle other reduction operations). In PHINode class, we can
>> add a flag indicating if the phi node is a reduction one (the flag can
>> be set in loop vectorizer for vectorized phi nodes). Then when we
>> build SDNode for instruction selection, we detect those reduction phi
>> nodes and then annotate reduction operations. This requires an
>> additional flag in SDNodeFlags. We can then check this flag when
>> combining instructions to detect reduction operations.
>>
>> In this approach, I have managed to let LLVM compile a SAD loop into
>> psadbw instructions.
>>
>> Source code:
>>
>>
>> const int N = 1024;
>> unsigned char a[N], b[N];
>>
>> int sad() {
>>   int s = 0;
>>   for (int i = 0; i < N; ++i) {
>>     int res = a[i] - b[i];
>>     s += (res > 0) ? res : -res;
>>   }
>>   return s;
>> }
>>
>>
>> Emitted instructions on X86:
>>
>>
>>
>> # BB#0:                                 # %entry
>> pxor %xmm0, %xmm0
>> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
>> pxor %xmm1, %xmm1
>> .align 16, 0x90
>> .LBB0_1:                                # %vector.body
>>                                         # =>This Inner Loop Header:
>> Depth=1
>> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
>> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm0
>> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
>> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm1
>> addq $8, %rax
>> jne .LBB0_1
>> # BB#2:                                 # %middle.block
>> paddd %xmm0, %xmm1
>> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
>> paddd %xmm1, %xmm0
>> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
>> paddd %xmm0, %xmm1
>> movd %xmm1, %eax
>> retq
>>
>>
>> Note that due to smaller VF we are using now (currently 4), we could
>> not explore the most benefit of psadbw. The patch in
>> http://reviews.llvm.org/D8943 has enables us to use bigger VFs based
>> on the smallest type in a loop. The follow-up work is refining the
>> cost model to let bigger VFs have less cost. For the example above the
>> best result is from VF >=16.
>>
>> The draft of the patch is here: http://reviews.llvm.org/D14840
>>
>> I will refine the patch later and submit it for review.
>>
>>
>> thanks,
>> Cong
>>
>>
>> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com>
wrote:
>> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
>> > <Asghar-ahmad.Shahid at amd.com> wrote:
>> >> Hi Cong,
>> >>
>> >>> -----Original Message-----
>> >>> From: Cong Hou [mailto:congh at google.com]
>> >>> Sent: Tuesday, November 17, 2015 12:47 AM
>> >>> To: Shahid, Asghar-ahmad
>> >>> Cc: David Li
>> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector
reduction add
>> >>> instruction.
>> >>>
>> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
<Asghar-
>> >>> ahmad.Shahid at amd.com> wrote:
>> >>> > Hi Cong,
>> >>> >
>> >>> > We had proposed an intrinsic approach to do this.
However the
>> >>> > discussion reached to a point where it was asked that
"Why do we
>> >>> > need
>> >>> > another approach if "reduction add" can be
pattern matched in
>> >>> DAGCombine?"
>> >>> > However I feel if we have strong enough rationale for
introduction
>> >>> > of
>> >>> > this instruction, it would be great. The 1st link
below has the
>> >>> > complete
>> >>> discussion about the intrinsic approach.
>> >>>
>> >>> Yes, I think introducing such a reduction add can let us
do pattern
>> >>> recognition
>> >>> of either SAD or dot production (or more?) without
introducing any
>> >>> additional intrinsics.
>> >> I agree. Another use case could be POPCOUNT operation.
Moreover, as
>> >> 'reduction add'
>> >> Is being adopted by more targets now a days, reflecting that
in LLVM IR
>> >> as an instruction
>> >> Is a good idea.
>> >> BTW, what is the idea of the syntax and semantic of this
operation you
>> >> have?
>> >
>> > We can introduce a reduce-add for vectors only, or make it general
so
>> > that it could also accept scale operands. Normally it is identical
to
>> > normal add, but during instruction selection we can do pattern
>> > recognition based on more information provided by this new
operations.
>> > For vectors, this means the result of this operation only
guarantee
>> > that the sum of all elements in the result is identical to the sum
of
>> > all elements of its operands. This gives us enough freedom to do
>> > aggressive transformations, such as SAD or dot-product.
>> >
>> > Do you think if this is enough for us to get there?
>> >
>> >
>> > Cong
>> >
>> >>
>> >> The main concern may be cost
>> >>> model: we could not guarantee that a SAD loop is unrolled
16 times on
>> >>> SSE to
>> >>> make use the most benefit of SAD. After the patch
>> >>> http://reviews.llvm.org/D8943 is landed, I am now working
on improving
>> >>> cost
>> >>> models of type conversions on X86. I believe even without
SAD
>> >>> instruction
>> >>> we can still get better performance with unrolling a SAD
loop 16
>> >>> times. This
>> >>> seems tricky but it works. What do you think?
>> >> I also think same as we can increase the bandwidth with proper
cost
>> >> modeling.
>> >>
>> >> Regards,
>> >> Shahid
>> >>>
>> >>> Cong
>> >>>
>> >>> >
>> >>> > http://reviews.llvm.org/D10964
>> >>> >
http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
>> >>> >
>> >>> > Regards,
>> >>> > Shahid
>> >>> >
>> >>> >
>> >>> >
>> >>> >> -----Original Message-----
>> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org] On Behalf
>> >>> >> Of
>> >>> >> Cong Hou via llvm-dev
>> >>> >> Sent: Friday, November 13, 2015 5:47 AM
>> >>> >> To: llvm-dev at lists.llvm.org
>> >>> >> Cc: David Li
>> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector
reduction add
>> >>> >> instruction.
>> >>> >>
>> >>> >> Hi
>> >>> >>
>> >>> >> When a reduction instruction is vectorized in a
loop, it will be
>> >>> >> turned into an instruction with vector operands
of the same
>> >>> >> operation
>> >>> >> type. This new instruction has a special property
that can give us
>> >>> >> more flexibility during instruction selection
later: this operation
>> >>> >> is valid as long as the reduction of all elements
of the result
>> >>> >> vector is identical to the reduction of all
elements of its
>> >>> >> operands.
>> >>> >>
>> >>> >> One example that can benefit this property is SAD
(sum of absolute
>> >>> >> differences) pattern detection in SSE2, which
provides a psadbw
>> >>> >> instruction whose description is shown below:
>> >>> >>
>> >>> >> '''
>> >>> >> psadbw: Compute the absolute differences of
packed unsigned 8-bit
>> >>> >> integers in a and b, then horizontally sum each
consecutive 8
>> >>> >> differences to produce two unsigned 16-bit
integers, and pack these
>> >>> >> unsigned 16-bit integers in the low 16 bits of
64-bit elements in
>> >>> >> dst.
>> >>> >> '''
>> >>> >>
>> >>> >> In LLVM's IR, for a SAD loop we will have two
v4i8 as inputs and
>> >>> >> one
>> >>> >> v4i32 as output. However, psadbw will actually
produce one i32
>> >>> >> result
>> >>> >> for four pairs of 8-bit integers (an already
reduced result), and
>> >>> >> the
>> >>> >> result is stored in the first element in v4i32.
If we properly zero
>> >>> >> out the other three elements in v4i32, and with
the information
>> >>> >> that
>> >>> >> we have a reduction add that is performed on this
result, then we
>> >>> >> can
>> >>> >> safely use psadbw here for much better
performance. This can be
>> >>> >> done
>> >>> >> during DAG combine. Another similar example is
dot product. And I
>> >>> >> think there may be many other scenarios that can
benefit from this
>> >>> >> property like eliminating redundant shuffles.
>> >>> >>
>> >>> >> The question is, how to let DAG combiner know
that a vector
>> >>> >> operation
>> >>> >> is a reduction one?
>> >>> >>
>> >>> >> Here I propose to introduce a "reduction
add" instruction for
>> >>> >> vectors.
>> >>> >> This will be a new instruction with vector
operands only. Normally
>> >>> >> it
>> >>> >> is treated as a normal ADD operation, but the
selection DAG
>> >>> >> combiner
>> >>> >> can make use of this new operation to generate
better instructions.
>> >>> >> This new instruction is generated when
vectorizing reduction add in
>> >>> >> loop vectorizer.
>> >>> >>
>> >>> >> I would like to hear more comments on this
proposal or suggestions
>> >>> >> of
>> >>> >> better alternative implementations.
>> >>> >>
>> >>> >>
>> >>> >> thanks,
>> >>> >> Cong
>> >>> >> _______________________________________________
>> >>> >> LLVM Developers mailing list
>> >>> >> llvm-dev at lists.llvm.org
>> >>> >>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>

Hal Finkel via llvm-dev

2015-Nov-25 22:32 UTC

head link

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Hi Cong,

After reading the original RFC and this update, I'm still not entirely sure
I understand the semantics of the flag you're proposing to add. Does it
having something to do with the ordering of the reduction operations?

Thanks again,
Hal

----- Original Message -----> From: "Cong Hou via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "llvm-dev" <llvm-dev at lists.llvm.org>
> Cc: "David Li" <davidxl at google.com>
> Sent: Thursday, November 19, 2015 3:12:10 PM
> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
instruction.
> 
> After some attempt to implement reduce-add in LLVM, I found out a
> easier way to detect reduce-add without introducing new IR
> operations.
> The basic idea is annotating phi node instead of add (so that it is
> easier to handle other reduction operations). In PHINode class, we
> can
> add a flag indicating if the phi node is a reduction one (the flag
> can
> be set in loop vectorizer for vectorized phi nodes). Then when we
> build SDNode for instruction selection, we detect those reduction phi
> nodes and then annotate reduction operations. This requires an
> additional flag in SDNodeFlags. We can then check this flag when
> combining instructions to detect reduction operations.
> 
> In this approach, I have managed to let LLVM compile a SAD loop into
> psadbw instructions.
> 
> Source code:
> 
> 
> const int N = 1024;
> unsigned char a[N], b[N];
> 
> int sad() {
>   int s = 0;
>   for (int i = 0; i < N; ++i) {
>     int res = a[i] - b[i];
>     s += (res > 0) ? res : -res;
>   }
>   return s;
> }
> 
> 
> Emitted instructions on X86:
> 
> 
> 
> # BB#0:                                 # %entry
> pxor %xmm0, %xmm0
> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
> pxor %xmm1, %xmm1
> .align 16, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header:
>                                         Depth=1
> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm0
> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm1
> addq $8, %rax
> jne .LBB0_1
> # BB#2:                                 # %middle.block
> paddd %xmm0, %xmm1
> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
> paddd %xmm1, %xmm0
> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
> paddd %xmm0, %xmm1
> movd %xmm1, %eax
> retq
> 
> 
> Note that due to smaller VF we are using now (currently 4), we could
> not explore the most benefit of psadbw. The patch in
> http://reviews.llvm.org/D8943 has enables us to use bigger VFs based
> on the smallest type in a loop. The follow-up work is refining the
> cost model to let bigger VFs have less cost. For the example above
> the
> best result is from VF >=16.
> 
> The draft of the patch is here: http://reviews.llvm.org/D14840
> 
> I will refine the patch later and submit it for review.
> 
> 
> thanks,
> Cong
> 
> 
> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com>
wrote:
> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> > <Asghar-ahmad.Shahid at amd.com> wrote:
> >> Hi Cong,
> >>
> >>> -----Original Message-----
> >>> From: Cong Hou [mailto:congh at google.com]
> >>> Sent: Tuesday, November 17, 2015 12:47 AM
> >>> To: Shahid, Asghar-ahmad
> >>> Cc: David Li
> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction
add
> >>> instruction.
> >>>
> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
<Asghar-
> >>> ahmad.Shahid at amd.com> wrote:
> >>> > Hi Cong,
> >>> >
> >>> > We had proposed an intrinsic approach to do this. However
the
> >>> > discussion reached to a point where it was asked that
"Why do
> >>> > we need
> >>> > another approach if "reduction add" can be
pattern matched in
> >>> DAGCombine?"
> >>> > However I feel if we have strong enough rationale for
> >>> > introduction of
> >>> > this instruction, it would be great. The 1st link below
has the
> >>> > complete
> >>> discussion about the intrinsic approach.
> >>>
> >>> Yes, I think introducing such a reduction add can let us do
> >>> pattern recognition
> >>> of either SAD or dot production (or more?) without introducing
> >>> any
> >>> additional intrinsics.
> >> I agree. Another use case could be POPCOUNT operation. Moreover,
> >> as 'reduction add'
> >> Is being adopted by more targets now a days, reflecting that in
> >> LLVM IR as an instruction
> >> Is a good idea.
> >> BTW, what is the idea of the syntax and semantic of this operation
> >> you have?
> >
> > We can introduce a reduce-add for vectors only, or make it general
> > so
> > that it could also accept scale operands. Normally it is identical
> > to
> > normal add, but during instruction selection we can do pattern
> > recognition based on more information provided by this new
> > operations.
> > For vectors, this means the result of this operation only guarantee
> > that the sum of all elements in the result is identical to the sum
> > of
> > all elements of its operands. This gives us enough freedom to do
> > aggressive transformations, such as SAD or dot-product.
> >
> > Do you think if this is enough for us to get there?
> >
> >
> > Cong
> >
> >>
> >> The main concern may be cost
> >>> model: we could not guarantee that a SAD loop is unrolled 16
> >>> times on SSE to
> >>> make use the most benefit of SAD. After the patch
> >>> http://reviews.llvm.org/D8943 is landed, I am now working on
> >>> improving cost
> >>> models of type conversions on X86. I believe even without SAD
> >>> instruction
> >>> we can still get better performance with unrolling a SAD loop
16
> >>> times. This
> >>> seems tricky but it works. What do you think?
> >> I also think same as we can increase the bandwidth with proper
> >> cost modeling.
> >>
> >> Regards,
> >> Shahid
> >>>
> >>> Cong
> >>>
> >>> >
> >>> > http://reviews.llvm.org/D10964
> >>> >
http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
> >>> >
> >>> > Regards,
> >>> > Shahid
> >>> >
> >>> >
> >>> >
> >>> >> -----Original Message-----
> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org] On
> >>> >> Behalf Of
> >>> >> Cong Hou via llvm-dev
> >>> >> Sent: Friday, November 13, 2015 5:47 AM
> >>> >> To: llvm-dev at lists.llvm.org
> >>> >> Cc: David Li
> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector
reduction add
> >>> >> instruction.
> >>> >>
> >>> >> Hi
> >>> >>
> >>> >> When a reduction instruction is vectorized in a loop,
it will
> >>> >> be
> >>> >> turned into an instruction with vector operands of
the same
> >>> >> operation
> >>> >> type. This new instruction has a special property
that can
> >>> >> give us
> >>> >> more flexibility during instruction selection later:
this
> >>> >> operation
> >>> >> is valid as long as the reduction of all elements of
the
> >>> >> result
> >>> >> vector is identical to the reduction of all elements
of its
> >>> >> operands.
> >>> >>
> >>> >> One example that can benefit this property is SAD
(sum of
> >>> >> absolute
> >>> >> differences) pattern detection in SSE2, which
provides a
> >>> >> psadbw
> >>> >> instruction whose description is shown below:
> >>> >>
> >>> >> '''
> >>> >> psadbw: Compute the absolute differences of packed
unsigned
> >>> >> 8-bit
> >>> >> integers in a and b, then horizontally sum each
consecutive 8
> >>> >> differences to produce two unsigned 16-bit integers,
and pack
> >>> >> these
> >>> >> unsigned 16-bit integers in the low 16 bits of 64-bit
elements
> >>> >> in dst.
> >>> >> '''
> >>> >>
> >>> >> In LLVM's IR, for a SAD loop we will have two
v4i8 as inputs
> >>> >> and one
> >>> >> v4i32 as output. However, psadbw will actually
produce one i32
> >>> >> result
> >>> >> for four pairs of 8-bit integers (an already reduced
result),
> >>> >> and the
> >>> >> result is stored in the first element in v4i32. If we
properly
> >>> >> zero
> >>> >> out the other three elements in v4i32, and with the
> >>> >> information that
> >>> >> we have a reduction add that is performed on this
result, then
> >>> >> we can
> >>> >> safely use psadbw here for much better performance.
This can
> >>> >> be done
> >>> >> during DAG combine. Another similar example is dot
product.
> >>> >> And I
> >>> >> think there may be many other scenarios that can
benefit from
> >>> >> this
> >>> >> property like eliminating redundant shuffles.
> >>> >>
> >>> >> The question is, how to let DAG combiner know that a
vector
> >>> >> operation
> >>> >> is a reduction one?
> >>> >>
> >>> >> Here I propose to introduce a "reduction
add" instruction for
> >>> >> vectors.
> >>> >> This will be a new instruction with vector operands
only.
> >>> >> Normally it
> >>> >> is treated as a normal ADD operation, but the
selection DAG
> >>> >> combiner
> >>> >> can make use of this new operation to generate better
> >>> >> instructions.
> >>> >> This new instruction is generated when vectorizing
reduction
> >>> >> add in
> >>> >> loop vectorizer.
> >>> >>
> >>> >> I would like to hear more comments on this proposal
or
> >>> >> suggestions of
> >>> >> better alternative implementations.
> >>> >>
> >>> >>
> >>> >> thanks,
> >>> >> Cong
> >>> >> _______________________________________________
> >>> >> LLVM Developers mailing list
> >>> >> llvm-dev at lists.llvm.org
> >>> >>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Cong Hou via llvm-dev

2015-Nov-25 23:07 UTC

head link

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel <hfinkel at anl.gov>
wrote:> Hi Cong,
>
> After reading the original RFC and this update, I'm still not entirely
sure I understand the semantics of the flag you're proposing to add. Does it
having something to do with the ordering of the reduction operations?
The flag is only useful for vectorized reduction for now. I'll give
you an example how this flag is used.

Given the following reduction loop:

int a[N];
int s = 0;
for (int i = 0; i < N; ++i)
  s += a[i];

After it is vectorized, we have a reduction operation add whose
operands and the result are vectors. Suppose it is represented as:

[s0, s1, s2, s3] = [s0, s1, s2, s3] + [a0, a1, a2, a3].

If we know this operation is a reduction one, then we could have some
flexibility on how elements in the result are organized, as long as
the sum of them stays the same (the precondition is that the only use
of the reduction result is the reduction phi node, and this is usually
true as long as a reduction loop can be vectorized). For example, if
we let the result be [s0+s1, 0, s2+s3, 0] or [0, 0, s0+s1+s2+s3, 0],
the reduction result won't change. This enable us to detect SAD or
dot-product patterns and use SSE's psadbw and pmaddwd instructions.
Please see my respond to your another email for more details.

Thanks!

Cong
>
> Thanks again,
> Hal
>
> ----- Original Message -----
>> From: "Cong Hou via llvm-dev" <llvm-dev at
lists.llvm.org>
>> To: "llvm-dev" <llvm-dev at lists.llvm.org>
>> Cc: "David Li" <davidxl at google.com>
>> Sent: Thursday, November 19, 2015 3:12:10 PM
>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
instruction.
>>
>> After some attempt to implement reduce-add in LLVM, I found out a
>> easier way to detect reduce-add without introducing new IR
>> operations.
>> The basic idea is annotating phi node instead of add (so that it is
>> easier to handle other reduction operations). In PHINode class, we
>> can
>> add a flag indicating if the phi node is a reduction one (the flag
>> can
>> be set in loop vectorizer for vectorized phi nodes). Then when we
>> build SDNode for instruction selection, we detect those reduction phi
>> nodes and then annotate reduction operations. This requires an
>> additional flag in SDNodeFlags. We can then check this flag when
>> combining instructions to detect reduction operations.
>>
>> In this approach, I have managed to let LLVM compile a SAD loop into
>> psadbw instructions.
>>
>> Source code:
>>
>>
>> const int N = 1024;
>> unsigned char a[N], b[N];
>>
>> int sad() {
>>   int s = 0;
>>   for (int i = 0; i < N; ++i) {
>>     int res = a[i] - b[i];
>>     s += (res > 0) ? res : -res;
>>   }
>>   return s;
>> }
>>
>>
>> Emitted instructions on X86:
>>
>>
>>
>> # BB#0:                                 # %entry
>> pxor %xmm0, %xmm0
>> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
>> pxor %xmm1, %xmm1
>> .align 16, 0x90
>> .LBB0_1:                                # %vector.body
>>                                         # =>This Inner Loop Header:
>>                                         Depth=1
>> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
>> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm0
>> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
>> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm1
>> addq $8, %rax
>> jne .LBB0_1
>> # BB#2:                                 # %middle.block
>> paddd %xmm0, %xmm1
>> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
>> paddd %xmm1, %xmm0
>> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
>> paddd %xmm0, %xmm1
>> movd %xmm1, %eax
>> retq
>>
>>
>> Note that due to smaller VF we are using now (currently 4), we could
>> not explore the most benefit of psadbw. The patch in
>> http://reviews.llvm.org/D8943 has enables us to use bigger VFs based
>> on the smallest type in a loop. The follow-up work is refining the
>> cost model to let bigger VFs have less cost. For the example above
>> the
>> best result is from VF >=16.
>>
>> The draft of the patch is here: http://reviews.llvm.org/D14840
>>
>> I will refine the patch later and submit it for review.
>>
>>
>> thanks,
>> Cong
>>
>>
>> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com>
wrote:
>> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
>> > <Asghar-ahmad.Shahid at amd.com> wrote:
>> >> Hi Cong,
>> >>
>> >>> -----Original Message-----
>> >>> From: Cong Hou [mailto:congh at google.com]
>> >>> Sent: Tuesday, November 17, 2015 12:47 AM
>> >>> To: Shahid, Asghar-ahmad
>> >>> Cc: David Li
>> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector
reduction add
>> >>> instruction.
>> >>>
>> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
<Asghar-
>> >>> ahmad.Shahid at amd.com> wrote:
>> >>> > Hi Cong,
>> >>> >
>> >>> > We had proposed an intrinsic approach to do this.
However the
>> >>> > discussion reached to a point where it was asked that
"Why do
>> >>> > we need
>> >>> > another approach if "reduction add" can be
pattern matched in
>> >>> DAGCombine?"
>> >>> > However I feel if we have strong enough rationale for
>> >>> > introduction of
>> >>> > this instruction, it would be great. The 1st link
below has the
>> >>> > complete
>> >>> discussion about the intrinsic approach.
>> >>>
>> >>> Yes, I think introducing such a reduction add can let us
do
>> >>> pattern recognition
>> >>> of either SAD or dot production (or more?) without
introducing
>> >>> any
>> >>> additional intrinsics.
>> >> I agree. Another use case could be POPCOUNT operation.
Moreover,
>> >> as 'reduction add'
>> >> Is being adopted by more targets now a days, reflecting that
in
>> >> LLVM IR as an instruction
>> >> Is a good idea.
>> >> BTW, what is the idea of the syntax and semantic of this
operation
>> >> you have?
>> >
>> > We can introduce a reduce-add for vectors only, or make it general
>> > so
>> > that it could also accept scale operands. Normally it is identical
>> > to
>> > normal add, but during instruction selection we can do pattern
>> > recognition based on more information provided by this new
>> > operations.
>> > For vectors, this means the result of this operation only
guarantee
>> > that the sum of all elements in the result is identical to the sum
>> > of
>> > all elements of its operands. This gives us enough freedom to do
>> > aggressive transformations, such as SAD or dot-product.
>> >
>> > Do you think if this is enough for us to get there?
>> >
>> >
>> > Cong
>> >
>> >>
>> >> The main concern may be cost
>> >>> model: we could not guarantee that a SAD loop is unrolled
16
>> >>> times on SSE to
>> >>> make use the most benefit of SAD. After the patch
>> >>> http://reviews.llvm.org/D8943 is landed, I am now working
on
>> >>> improving cost
>> >>> models of type conversions on X86. I believe even without
SAD
>> >>> instruction
>> >>> we can still get better performance with unrolling a SAD
loop 16
>> >>> times. This
>> >>> seems tricky but it works. What do you think?
>> >> I also think same as we can increase the bandwidth with proper
>> >> cost modeling.
>> >>
>> >> Regards,
>> >> Shahid
>> >>>
>> >>> Cong
>> >>>
>> >>> >
>> >>> > http://reviews.llvm.org/D10964
>> >>> >
http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
>> >>> >
>> >>> > Regards,
>> >>> > Shahid
>> >>> >
>> >>> >
>> >>> >
>> >>> >> -----Original Message-----
>> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at
lists.llvm.org] On
>> >>> >> Behalf Of
>> >>> >> Cong Hou via llvm-dev
>> >>> >> Sent: Friday, November 13, 2015 5:47 AM
>> >>> >> To: llvm-dev at lists.llvm.org
>> >>> >> Cc: David Li
>> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector
reduction add
>> >>> >> instruction.
>> >>> >>
>> >>> >> Hi
>> >>> >>
>> >>> >> When a reduction instruction is vectorized in a
loop, it will
>> >>> >> be
>> >>> >> turned into an instruction with vector operands
of the same
>> >>> >> operation
>> >>> >> type. This new instruction has a special property
that can
>> >>> >> give us
>> >>> >> more flexibility during instruction selection
later: this
>> >>> >> operation
>> >>> >> is valid as long as the reduction of all elements
of the
>> >>> >> result
>> >>> >> vector is identical to the reduction of all
elements of its
>> >>> >> operands.
>> >>> >>
>> >>> >> One example that can benefit this property is SAD
(sum of
>> >>> >> absolute
>> >>> >> differences) pattern detection in SSE2, which
provides a
>> >>> >> psadbw
>> >>> >> instruction whose description is shown below:
>> >>> >>
>> >>> >> '''
>> >>> >> psadbw: Compute the absolute differences of
packed unsigned
>> >>> >> 8-bit
>> >>> >> integers in a and b, then horizontally sum each
consecutive 8
>> >>> >> differences to produce two unsigned 16-bit
integers, and pack
>> >>> >> these
>> >>> >> unsigned 16-bit integers in the low 16 bits of
64-bit elements
>> >>> >> in dst.
>> >>> >> '''
>> >>> >>
>> >>> >> In LLVM's IR, for a SAD loop we will have two
v4i8 as inputs
>> >>> >> and one
>> >>> >> v4i32 as output. However, psadbw will actually
produce one i32
>> >>> >> result
>> >>> >> for four pairs of 8-bit integers (an already
reduced result),
>> >>> >> and the
>> >>> >> result is stored in the first element in v4i32.
If we properly
>> >>> >> zero
>> >>> >> out the other three elements in v4i32, and with
the
>> >>> >> information that
>> >>> >> we have a reduction add that is performed on this
result, then
>> >>> >> we can
>> >>> >> safely use psadbw here for much better
performance. This can
>> >>> >> be done
>> >>> >> during DAG combine. Another similar example is
dot product.
>> >>> >> And I
>> >>> >> think there may be many other scenarios that can
benefit from
>> >>> >> this
>> >>> >> property like eliminating redundant shuffles.
>> >>> >>
>> >>> >> The question is, how to let DAG combiner know
that a vector
>> >>> >> operation
>> >>> >> is a reduction one?
>> >>> >>
>> >>> >> Here I propose to introduce a "reduction
add" instruction for
>> >>> >> vectors.
>> >>> >> This will be a new instruction with vector
operands only.
>> >>> >> Normally it
>> >>> >> is treated as a normal ADD operation, but the
selection DAG
>> >>> >> combiner
>> >>> >> can make use of this new operation to generate
better
>> >>> >> instructions.
>> >>> >> This new instruction is generated when
vectorizing reduction
>> >>> >> add in
>> >>> >> loop vectorizer.
>> >>> >>
>> >>> >> I would like to hear more comments on this
proposal or
>> >>> >> suggestions of
>> >>> >> better alternative implementations.
>> >>> >>
>> >>> >>
>> >>> >> thanks,
>> >>> >> Cong
>> >>> >> _______________________________________________
>> >>> >> LLVM Developers mailing list
>> >>> >> llvm-dev at lists.llvm.org
>> >>> >>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

Reasonably Related Threads

Search for more seemingly similar threads

llvm dev - Nov 2015 - [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Reasonably Related Threads