thr3ads.net - llvm dev - [LLVMdev] Commutability of X86 FMA3 instructions. [Dec 2013]

If this information is useful, please help other people find it:
Share via:

Lang Hames

2013-Dec-23 20:07 UTC

[LLVMdev] Commutability of X86 FMA3 instructions.

Hi Elena,

Thank you very much for looking in to that.

I'll go ahead and remove the isCommutable flag from those
instructions, since it sounds like that's the right thing to do. I
would still like to change the default from the 231 variant to 213
too, as this will reduce code-size for accumulator-style loops. I have
at least one benchmark that shows significant speedups when this
change is made (presumably the smaller loop fits within the decode
buffer, though I haven't verified this). Please let me know if there's
any reason to keep the 231 variant.

Cheers,
Lang.


On Sat, Dec 21, 2013 at 11:42 PM, Demikhovsky, Elena
<elena.demikhovsky at intel.com> wrote:> 1) Our architects say that vmovaps %reg, %reg causes only register renaming
in HW and does not  affect performance.
>
> 2) The non-numeric result of FMA is not commutative for multiplicand and
the multiplier. You are right. This property should be removed.
>
>
> -  Elena
>
> -----Original Message-----
> From: Lang Hames [mailto:lhames at gmail.com]
> Sent: Friday, December 20, 2013 23:03
> To: Kay Tiong Khoo
> Cc: LLVM Developers Mailing List; Demikhovsky, Elena; Craig Topper
> Subject: Re: [LLVMdev] Commutability of X86 FMA3 instructions.
>
> Hi Kay,
>
> My patch will partially address your bug. For now I'm just looking to
switch the default FMA from vfmadd213xx to vfmadd231xx. That will cause the code
in PR17229 to compile as desired, but would regress code like:
>
> double foo(double a, double b, double c) {
>   return a * b + c;
> }
>
> Which will now require a vmovaps + vfmadd231.
>
> If this impacts real benchmarks we could add an optimization to change the
FMA variant based on how it's used.
>
> - Lang.
>
> On Fri, Dec 20, 2013 at 8:29 AM, Kay Tiong Khoo <kkhoo at
perfwizard.com> wrote:
>> Hi Lang,
>>
>> Unfortunately, I don't have an answer on the commutability
question,
>> but I wanted to let you know that I filed a bug on this:
>> http://llvm.org/bugs/show_bug.cgi?id=17229
>>
>> This also shows a memory operand variant of the fma that you may want
>> to consider in your patch and testcases.
>>
>> Thanks!
>>
>>
>> On Thu, Dec 19, 2013 at 10:45 PM, Lang Hames <lhames at
gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> The 213 variant of the FMA3 instructions is currently marked
>>> commutable (see X86InstrFMA.td). Is that safe? According to the ISA
>>> the FMA3 instructions aren't commutable for non-numeric
results, so
>>> I'd have thought commuting this would only be valid in
fast-math mode?
>>>
>>> For the curious, the reason that I'm asking is that we
currently
>>> always select the 213 variant, but this introduces an extra copies
in
>>> accumulator-style loops. Something like:
>>>
>>> while (...)
>>>   accumulator = x * y + accumulator;
>>>
>>> yields:
>>>
>>> loop:
>>>   vfmadd.213 y, x, acc
>>>   vmovaps acc, x
>>>   decl count
>>>   jne loop
>>>
>>> instead of
>>>
>>> loop:
>>>   vfmadd.231 acc, x, y
>>>   decl count
>>>   jne loop
>>>
>>> I have started writing a patch to generate the 231 variant by
>>> default, and I want to know whether I need to go to the trouble of
>>> adding custom commute logic. If these things aren't commutable
then I
>>> don't need to worry at all. If they are commutable, but only in
>>> fast-math mode, then I can support that too.
>>>
>>> Thanks for the help!
>>>
>>> - Lang.
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>

Demikhovsky, Elena

2013-Dec-24 07:18 UTC

head link

[LLVMdev] Commutability of X86 FMA3 instructions.

I remember that I tried all variants and checked many benchmarks.
I looked for a variant where more memory operands can be folded into
instruction.
I, probably, found that the current solution is better than others.

I also should say that FMA does not promise speed up at all. We measure many
benchmarks (~30) and saw performance regression for many of them. We went after
solution where the average performance was higher.
So, I propose to check many benchmarks before switch and see what happens with
memory folding.

-  Elena


-----Original Message-----
From: Lang Hames [mailto:lhames at gmail.com] 
Sent: Monday, December 23, 2013 22:08
To: Demikhovsky, Elena
Cc: Kay Tiong Khoo; LLVM Developers Mailing List; Craig Topper
Subject: Re: [LLVMdev] Commutability of X86 FMA3 instructions.

Hi Elena,

Thank you very much for looking in to that.

I'll go ahead and remove the isCommutable flag from those instructions,
since it sounds like that's the right thing to do. I would still like to
change the default from the 231 variant to 213 too, as this will reduce
code-size for accumulator-style loops. I have at least one benchmark that shows
significant speedups when this change is made (presumably the smaller loop fits
within the decode buffer, though I haven't verified this). Please let me
know if there's any reason to keep the 231 variant.

Cheers,
Lang.


On Sat, Dec 21, 2013 at 11:42 PM, Demikhovsky, Elena <elena.demikhovsky at
intel.com> wrote:> 1) Our architects say that vmovaps %reg, %reg causes only register renaming
in HW and does not  affect performance.
>
> 2) The non-numeric result of FMA is not commutative for multiplicand and
the multiplier. You are right. This property should be removed.
>
>
> -  Elena
>
> -----Original Message-----
> From: Lang Hames [mailto:lhames at gmail.com]
> Sent: Friday, December 20, 2013 23:03
> To: Kay Tiong Khoo
> Cc: LLVM Developers Mailing List; Demikhovsky, Elena; Craig Topper
> Subject: Re: [LLVMdev] Commutability of X86 FMA3 instructions.
>
> Hi Kay,
>
> My patch will partially address your bug. For now I'm just looking to
switch the default FMA from vfmadd213xx to vfmadd231xx. That will cause the code
in PR17229 to compile as desired, but would regress code like:
>
> double foo(double a, double b, double c) {
>   return a * b + c;
> }
>
> Which will now require a vmovaps + vfmadd231.
>
> If this impacts real benchmarks we could add an optimization to change the
FMA variant based on how it's used.
>
> - Lang.
>
> On Fri, Dec 20, 2013 at 8:29 AM, Kay Tiong Khoo <kkhoo at
perfwizard.com> wrote:
>> Hi Lang,
>>
>> Unfortunately, I don't have an answer on the commutability
question,
>> but I wanted to let you know that I filed a bug on this:
>> http://llvm.org/bugs/show_bug.cgi?id=17229
>>
>> This also shows a memory operand variant of the fma that you may want 
>> to consider in your patch and testcases.
>>
>> Thanks!
>>
>>
>> On Thu, Dec 19, 2013 at 10:45 PM, Lang Hames <lhames at
gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> The 213 variant of the FMA3 instructions is currently marked 
>>> commutable (see X86InstrFMA.td). Is that safe? According to the ISA
>>> the FMA3 instructions aren't commutable for non-numeric
results, so
>>> I'd have thought commuting this would only be valid in
fast-math mode?
>>>
>>> For the curious, the reason that I'm asking is that we
currently
>>> always select the 213 variant, but this introduces an extra copies 
>>> in accumulator-style loops. Something like:
>>>
>>> while (...)
>>>   accumulator = x * y + accumulator;
>>>
>>> yields:
>>>
>>> loop:
>>>   vfmadd.213 y, x, acc
>>>   vmovaps acc, x
>>>   decl count
>>>   jne loop
>>>
>>> instead of
>>>
>>> loop:
>>>   vfmadd.231 acc, x, y
>>>   decl count
>>>   jne loop
>>>
>>> I have started writing a patch to generate the 231 variant by 
>>> default, and I want to know whether I need to go to the trouble of 
>>> adding custom commute logic. If these things aren't commutable
then
>>> I don't need to worry at all. If they are commutable, but only
in
>>> fast-math mode, then I can support that too.
>>>
>>> Thanks for the help!
>>>
>>> - Lang.
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for 
> the sole use of the intended recipient(s). Any review or distribution 
> by others is strictly prohibited. If you are not the intended 
> recipient, please contact the sender and delete all copies.
>---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Lang Hames

2014-Jan-03 19:05 UTC

head link

[LLVMdev] Commutability of X86 FMA3 instructions.

Hi Elena,

Ok. I will do some wider benchmarking before considering any change to
the default.
>From your observations it sounds like right solution is to add apost-isel peephole optimization to chose the most appropriate variant.
I'll give that a try.

- Lang.

On Mon, Dec 23, 2013 at 11:18 PM, Demikhovsky, Elena
<elena.demikhovsky at intel.com> wrote:> I remember that I tried all variants and checked many benchmarks.
> I looked for a variant where more memory operands can be folded into
instruction.
> I, probably, found that the current solution is better than others.
>
> I also should say that FMA does not promise speed up at all. We measure
many benchmarks (~30) and saw performance regression for many of them. We went
after solution where the average performance was higher.
> So, I propose to check many benchmarks before switch and see what happens
with memory folding.
>
> -  Elena
>
>
> -----Original Message-----
> From: Lang Hames [mailto:lhames at gmail.com]
> Sent: Monday, December 23, 2013 22:08
> To: Demikhovsky, Elena
> Cc: Kay Tiong Khoo; LLVM Developers Mailing List; Craig Topper
> Subject: Re: [LLVMdev] Commutability of X86 FMA3 instructions.
>
> Hi Elena,
>
> Thank you very much for looking in to that.
>
> I'll go ahead and remove the isCommutable flag from those instructions,
since it sounds like that's the right thing to do. I would still like to
change the default from the 231 variant to 213 too, as this will reduce
code-size for accumulator-style loops. I have at least one benchmark that shows
significant speedups when this change is made (presumably the smaller loop fits
within the decode buffer, though I haven't verified this). Please let me
know if there's any reason to keep the 231 variant.
>
> Cheers,
> Lang.
>
>
> On Sat, Dec 21, 2013 at 11:42 PM, Demikhovsky, Elena <elena.demikhovsky
at intel.com> wrote:
>> 1) Our architects say that vmovaps %reg, %reg causes only register
renaming in HW and does not  affect performance.
>>
>> 2) The non-numeric result of FMA is not commutative for multiplicand
and the multiplier. You are right. This property should be removed.
>>
>>
>> -  Elena
>>
>> -----Original Message-----
>> From: Lang Hames [mailto:lhames at gmail.com]
>> Sent: Friday, December 20, 2013 23:03
>> To: Kay Tiong Khoo
>> Cc: LLVM Developers Mailing List; Demikhovsky, Elena; Craig Topper
>> Subject: Re: [LLVMdev] Commutability of X86 FMA3 instructions.
>>
>> Hi Kay,
>>
>> My patch will partially address your bug. For now I'm just looking
to switch the default FMA from vfmadd213xx to vfmadd231xx. That will cause the
code in PR17229 to compile as desired, but would regress code like:
>>
>> double foo(double a, double b, double c) {
>>   return a * b + c;
>> }
>>
>> Which will now require a vmovaps + vfmadd231.
>>
>> If this impacts real benchmarks we could add an optimization to change
the FMA variant based on how it's used.
>>
>> - Lang.
>>
>> On Fri, Dec 20, 2013 at 8:29 AM, Kay Tiong Khoo <kkhoo at
perfwizard.com> wrote:
>>> Hi Lang,
>>>
>>> Unfortunately, I don't have an answer on the commutability
question,
>>> but I wanted to let you know that I filed a bug on this:
>>> http://llvm.org/bugs/show_bug.cgi?id=17229
>>>
>>> This also shows a memory operand variant of the fma that you may
want
>>> to consider in your patch and testcases.
>>>
>>> Thanks!
>>>
>>>
>>> On Thu, Dec 19, 2013 at 10:45 PM, Lang Hames <lhames at
gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> The 213 variant of the FMA3 instructions is currently marked
>>>> commutable (see X86InstrFMA.td). Is that safe? According to the
ISA
>>>> the FMA3 instructions aren't commutable for non-numeric
results, so
>>>> I'd have thought commuting this would only be valid in
fast-math mode?
>>>>
>>>> For the curious, the reason that I'm asking is that we
currently
>>>> always select the 213 variant, but this introduces an extra
copies
>>>> in accumulator-style loops. Something like:
>>>>
>>>> while (...)
>>>>   accumulator = x * y + accumulator;
>>>>
>>>> yields:
>>>>
>>>> loop:
>>>>   vfmadd.213 y, x, acc
>>>>   vmovaps acc, x
>>>>   decl count
>>>>   jne loop
>>>>
>>>> instead of
>>>>
>>>> loop:
>>>>   vfmadd.231 acc, x, y
>>>>   decl count
>>>>   jne loop
>>>>
>>>> I have started writing a patch to generate the 231 variant by
>>>> default, and I want to know whether I need to go to the trouble
of
>>>> adding custom commute logic. If these things aren't
commutable then
>>>> I don't need to worry at all. If they are commutable, but
only in
>>>> fast-math mode, then I can support that too.
>>>>
>>>> Thanks for the help!
>>>>
>>>> - Lang.
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>
>>>
>> ---------------------------------------------------------------------
>> Intel Israel (74) Limited
>>
>> This e-mail and any attachments may contain confidential material for
>> the sole use of the intended recipient(s). Any review or distribution
>> by others is strictly prohibited. If you are not the intended
>> recipient, please contact the sender and delete all copies.
>>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>

llvm dev - Dec 2013 - [LLVMdev] Commutability of X86 FMA3 instructions.

[LLVMdev] Commutability of X86 FMA3 instructions.

[LLVMdev] Commutability of X86 FMA3 instructions.

[LLVMdev] Commutability of X86 FMA3 instructions.