thr3ads.net - llvm dev - [llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop [Jul 2018]

Craig Topper via llvm-dev

2018-Jul-24 17:04 UTC

[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop

On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/23/2018 06:37 PM, Craig Topper wrote:
>
>
> ~Craig
>
>
> On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> On 07/23/2018 05:22 PM, Craig Topper wrote:
>>
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
>> multiplying sign extended 16-bit values to produce a 32-bit accumulated
>> result. The x86 backend is currently not able to optimize it as well as
>> gcc and icc. The IR we are getting from the loop vectorizer has several
>> v8i32 adds and muls inside the loop. These are fed by v8i16 loads and
>> sexts from v8i16 to v8i32. The x86 backend recognizes that these are
>> addition reductions of multiplication so we use the vpmaddwd instruction
>> which calculates 32-bit products from 16-bit inputs and does a horizontal
>> add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a
>> v4i32 result.
>>
>>
> That godbolt link seems wrong. It wasn't supposed to be clang IR. This
> should be right.
>
>
>>
>> In the example code, because we are reducing the number of elements from
>> 8->4 in the vpmaddwd step we are left with a width mismatch between
>> vpmaddwd and the vpaddd instruction that we use to sum with the results
>> from the previous loop iterations. We rely on the fact that a 128-bit
>> vpmaddwd zeros the upper bits of the register so that we can use a
>> 256-bit vpaddd instruction so that the upper elements can keep going
>> around the loop without being disturbed in case they weren't initialized
>> to 0. But this still means the vpmaddwd instruction is doing half the
>> amount of work the CPU is capable of if we had been able to use a 256-bit
>> vpmaddwd instruction. Additionally, future x86 CPUs will be gaining an
>> instruction that can do VPMADDWD and VPADDD in one instruction, but that
>> width mismatch makes that instruction difficult to utilize.
>>
>> In order for the backend to handle this better it would be great if we
>> could have something like two v32i8 loads, two shufflevectors to extract
>> the even elements and the odd elements to create four v16i8 pieces.
>>
>>
>> Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?
>>
>
> Oops that should have been v16i16. Mixed up my 256-bit types.
>
>
>>
>> Sign extend each of those pieces. Multiply the two even pieces and the
>> two odd pieces separately, sum those results with a v8i32 add. Then
>> another v8i32 add to accumulate the previous loop iterations.
>>
>>
> I'm still missing something. Why do you want to separate out the even and
> odd parts instead of just adding up the first half of the numbers and the
> second half?
>
Doing even/odd matches up with a pattern I already have to support for the
code in https://reviews.llvm.org/D49636. I wouldn't even need to detect it
as a reduction to do the reassociation since even/odd exactly matches the
behavior of the instruction. But you're right we could also just detect the
reduction and add two halves.
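
As a rough illustration of the two groupings discussed here, the scalar C
sketch below models one vpmaddwd-style step both ways: summing adjacent
(even/odd) products, and summing products from the first and second halves.
The function names (madd_adjacent_pairs, madd_halves, hsum32) are made up for
this example and do not come from D49636. The individual lanes differ between
the two groupings, but the final reduction is the same, which is why either
grouping would be usable for a dot product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a vpmaddwd-style step (even/odd grouping):
 * out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1].
 * n 16-bit inputs produce n/2 32-bit results. */
void madd_adjacent_pairs(const int16_t *a, const int16_t *b,
                         int32_t *out, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; ++i) {
        int32_t even = (int32_t)a[2 * i] * (int32_t)b[2 * i];
        int32_t odd = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        /* Add as unsigned so the sum wraps instead of being undefined. */
        out[i] = (int32_t)((uint32_t)even + (uint32_t)odd);
    }
}

/* Alternative grouping (first half + second half):
 * out[i] = a[i]*b[i] + a[i + n/2]*b[i + n/2]. */
void madd_halves(const int16_t *a, const int16_t *b,
                 int32_t *out, size_t n)
{
    assert(n % 2 == 0);
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        int32_t lo = (int32_t)a[i] * (int32_t)b[i];
        int32_t hi = (int32_t)a[i + half] * (int32_t)b[i + half];
        out[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}

/* Horizontal sum of n 32-bit lanes, i.e. the reduction after the loop. */
int32_t hsum32(const int32_t *v, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += (uint32_t)v[i];
    return (int32_t)total;
}
```

Calling either helper on sixteen 16-bit inputs yields eight 32-bit lanes, so
the result has the same lane count as the v8i32 accumulator described above.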


>
> Thanks again,
> Hal
>
>> This ensures that no pieces exceed the target vector width and the final
>> operation is correctly sized to go around the loop in one register. All
>> but the last add can then be pattern matched to vpmaddwd as proposed in
>> https://reviews.llvm.org/D49636. And for the future CPU the whole thing
>> can be matched to the new instruction.
>>
>> Do other targets have a similar instruction or a similar issue to this?
>> Is this something we can solve in the loop vectorizer? Or should we have
>> a separate IR transformation that can recognize this pattern and generate
>> the new sequence? As a separate pass we would need to pair two vector
>> loads together, remove a reduction step outside the loop and remove half
>> the phis assuming the loop was partially unrolled. Or if there was only
>> one add/mul inside the loop we'd have to reduce its width and the width
>> of the phi.
>>
>>
>> Can you explain how the desired code from the vectorizer differs from the
>> code that the vectorizer produces if you add '#pragma clang loop
>> vectorize(enable) vectorize_width(16)' above the loop? I tried it in your
>> godbolt example and the generated code looks very similar to the
>> icc-generated code.
>>
>
> It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily
> carried across the loop. It's then redundantly added twice in the
> reduction after the loop despite it being 0. This happens because we
> basically tricked the backend into generating a 256-bit vpmaddwd
> concatenated with a 256-bit zero vector going into a 512-bit vpaddd before
> type legalization. The 512-bit concat and vpaddd get split during type
> legalization, and the high half of the add gets constant folded away. I'm
> guessing we probably finished with 4 vpxors before the loop but MachineCSE
> (or some other pass?) combined two of them when it figured out the loop
> didn't modify them.
>
>
>>
>> Thanks again,
>> Hal
>>
>>
>> Thanks,
>> ~Craig
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>

Craig Topper via llvm-dev

2018-Jul-24 17:07 UTC

With maximize-bandwidth I'd still end up with the extra vpxors above the
loop and the extra addition reduction steps at the end that we get from
forcing the VF to 16, right?

~Craig



Saito, Hideki via llvm-dev

2018-Jul-24 17:08 UTC

Right. Nothing changes there by purely going with a wider vector.

From: Craig Topper [mailto:craig.topper at gmail.com]
Sent: Tuesday, July 24, 2018 10:07 AM
To: Hal Finkel <hfinkel at anl.gov>
Cc: Saito, Hideki <hideki.saito at intel.com>; estotzer at ti.com; Nemanja
Ivanovic <nemanja.i.ibm at gmail.com>; Adam Nemet <anemet at
apple.com>; graham.hunter at arm.com; Michael Kuperstein <mkuper at
google.com>; Sanjay Patel <spatel at rotateright.com>; Simon Pilgrim
<llvm-dev at redking.me.uk>; ashutosh.nema at amd.com; llvm-dev
<llvm-dev at lists.llvm.org>
Subject: Re: [LoopVectorizer] Improving the performance of dot product reduction
loop

With maximize-bandwidth I'd still end up with the extra vpxors above the
loop and the extra addition reduction steps at the end that we get from forcing
the vf to 16 right?

~Craig


Saito, Hideki via llvm-dev

2018-Jul-24 17:11 UTC

> But you're right we could also just detect the reduction and add two
> halves.

If we take this approach, we also need to teach the vector type legalizer
that the last add in the following sequence is a reduction sum, so that it
adds into the same temp instead of creating a second temp. Should not be a
big deal, just calling it out.

full vector unit stride load of A[],
full vector unit stride load of B[],
sign extend both, (this makes it 2x full vector, on the surface)
multiply
add

Hideki
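
One way to picture the sequence listed above is the scalar C sketch below. It
is only a model under stated assumptions: the names (VF, ACC_LANES,
dot_step_vf16, dot_vf16) are invented for illustration, and the split of the
16 widened products into a low and a high half stands in for what type
legalization would do on a target whose registers hold 8 32-bit lanes. Both
halves are added into the same 8-lane accumulator, which is the "same temp"
behavior described in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VF = 16, ACC_LANES = 8 }; /* hypothetical widths for this sketch */

/* One vector iteration: load VF elements of a[] and b[], sign extend,
 * multiply, then add. The VF widened products span two registers' worth
 * of 32-bit lanes; both halves go into the single accumulator. */
void dot_step_vf16(const int16_t *a, const int16_t *b,
                   int32_t acc[ACC_LANES])
{
    int32_t wide[VF];
    for (int i = 0; i < VF; ++i)
        wide[i] = (int32_t)a[i] * (int32_t)b[i]; /* sext, then multiply */

    for (int i = 0; i < ACC_LANES; ++i) {
        /* Unsigned arithmetic so the accumulation wraps on overflow. */
        uint32_t sum = (uint32_t)acc[i] + (uint32_t)wide[i]
                     + (uint32_t)wide[i + ACC_LANES];
        acc[i] = (int32_t)sum;
    }
}

/* Whole reduction: n must be a multiple of VF in this simplified model
 * (no scalar remainder loop). */
int32_t dot_vf16(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % VF == 0);
    int32_t acc[ACC_LANES] = {0};
    for (size_t i = 0; i < n; i += VF)
        dot_step_vf16(a + i, b + i, acc);

    uint32_t total = 0;
    for (int i = 0; i < ACC_LANES; ++i)
        total += (uint32_t)acc[i]; /* final horizontal reduction */
    return (int32_t)total;
}
```

With a second temp, the loop would instead carry two 8-lane accumulators and
need one extra add after the loop to combine them.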


Hal Finkel via llvm-dev

2018-Jul-24 18:41 UTC

On 07/24/2018 12:07 PM, Craig Topper wrote:
> With maximize-bandwidth I'd still end up with the extra vpxors above
> the loop and the extra addition reduction steps at the end that we get
> from forcing the vf to 16 right?

Yea, I think we'd need something else to help with that. My underlying
thought here is that, based on your description, we really do want a VF
of 16 (because we want to use 256-bit loads, etc.). And so, when
thinking about how to fix things, we should start with looking at the VF
= 16 output, and not the VF = 8 output, as the right starting point for
the backend.

 -Hal
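
For readers who want to reproduce the VF = 16 starting point, a minimal C
sketch of such a loop follows. The godbolt source is not reproduced in this
thread, so the function name dot_i16 and its exact signature are assumptions;
only the pragma text is taken from the discussion. The pragma is guarded so
that compilers other than clang simply build the plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the dot product reduction: 16-bit inputs are sign
 * extended to 32 bits, multiplied, and accumulated into a 32-bit sum.
 * Overflow of the 32-bit accumulator is not handled in this sketch. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    assert(n == 0 || (a != NULL && b != NULL));
#if defined(__clang__)
#pragma clang loop vectorize(enable) vectorize_width(16)
#endif
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

Compiling this with clang at an optimization level that enables the loop
vectorizer, for an AVX2-capable target, should give the VF = 16 output to
compare against the default; the exact flags used in the original godbolt
link are not stated in the thread.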
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
