I noticed that the operand commuting code in X86InstrInfo.cpp treats scalar FMA intrinsics specially. It prevents operand commuting on these scalar instructions because the scalar FMA instructions preserve the upper bits of the vector. Presumably, the restrictions are there because commuting operands potentially changes the result upper bits. However, AFAIK the Intel and GNU FMA intrinsics don't actually specify which FMA (213, 132, 231) is going to be used and so the user can't rely on knowing which operand is tied to the destination. Thus the user can't rely on knowing what the upper bits will be. Is there some other reason these scalar FMA commuting restrictions are in place? Thanks! -David
Michael Kuperstein via llvm-dev
2016-Sep-12 19:07 UTC
[llvm-dev] [X86] FMA transformation restrictions
Hi David,

Assuming I understood the question correctly - Intel doesn't specify which FMA instruction is going to be used, but it does specify what it expects the result to be. E.g. for

  __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c)

the specified semantics are:

  dst[31:0]   := (a[31:0] * b[31:0]) + c[31:0]
  dst[127:32] := a[127:32]
  dst[MAX:128] := 0

The user is allowed to rely on the upper bits of the result being the upper bits of a, and the compiler is required to choose an appropriate instruction form that will make this happen.

Thanks,
Michael
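[Editor's note: the lane-by-lane semantics above can be sketched as a plain-C model. The helper name `model_fmadd_ss` and the four-float struct standing in for `__m128` are hypothetical, and the multiply-add here is unfused (a real FMA rounds once, not twice); the sketch only illustrates which lanes of the result come from which input.]

```c
#include <assert.h>

/* Hypothetical scalar model of the _mm_fmadd_ss semantics quoted above.
   A struct of four floats stands in for an SSE register. */
typedef struct { float f[4]; } m128_model;

static m128_model model_fmadd_ss(m128_model a, m128_model b, m128_model c) {
    m128_model dst = a;                    /* dst[127:32] := a[127:32]        */
    dst.f[0] = a.f[0] * b.f[0] + c.f[0];   /* dst[31:0] := a[31:0]*b[31:0]+c  */
    return dst;
}
```

The key point of the sketch is that every lane except the lowest is copied from `a`, which is why the compiler must keep `a` tied to the destination (or otherwise fix up the upper lanes) when it picks an instruction form.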
David A. Greene via llvm-dev
2016-Sep-12 20:17 UTC
[llvm-dev] [X86] FMA transformation restrictions
Michael Kuperstein <mkuper at google.com> writes:

> E.g. for
>   __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c)
> the specified semantics are:
>
>   dst[31:0]   := (a[31:0] * b[31:0]) + c[31:0]
>   dst[127:32] := a[127:32]
>   dst[MAX:128] := 0
>
> The user is allowed to rely on the upper bits of the result being the
> upper bits of a, and the compiler is required to choose an appropriate
> instruction form that will make this happen.

Ah, thank you. I missed that line about grabbing the upper bits from a. Not a big deal, I was just wondering. Carry on. :)

-David
Vyacheslav Klochkov via llvm-dev
2016-Sep-12 21:41 UTC
[llvm-dev] [X86] FMA transformation restrictions
Hi David,

Commuting the 1st<->2nd and 1st<->3rd operands is _usually_ prohibited for the scalar FMA *_Int opcodes because it would change the values passed through from the first operand of the intrinsic.

I would challenge your statement that the "user cannot rely on knowing which operand is tied to the destination". It is common practice for all intrinsics with *_ss() and *_sd() suffixes that the first operand of the intrinsic is tied to the destination. For example:

  // https://software.intel.com/sites/default/files/a6/22/18072-347603.pdf
  __m128 _mm_add_ss(__m128 a, __m128 b)
  Adds the lower single-precision, floating-point (SP FP) values of a and b;
  the upper 3 SP FP values are passed through from a.

This point was probably not mentioned explicitly for the FMA intrinsics here: https://software.intel.com/en-us/node/582845 That is really a documentation problem (actually, my fault, as I did not add a special notice when I created/added those new _mm_fmadd_ss/sd() intrinsics). The intention was to maintain the existing assumption regarding the 1st intrinsic operand and to give users (including some math library folks) a tool with well-defined input/output behavior.

It is important to mention that the compiler's selection of FMA form (132/213/231) does not change the precision of the result. It is always correct for vector opcodes and conditionally correct for *_Int opcodes.

The *_Int opcodes may need some additional correctness analysis. Commuting the 2nd and 3rd operands is always correct, while commuting the 1st and 2nd or the 1st and 3rd requires use-def analysis: it is OK to commute the 1st operand if it is known that the upper bits of the intrinsic result are not used. For example:

  __m128 res = _mm_fmadd_ss(a, b, c);
  _mm_store_ss(ptr, res); // this is the ONLY user of 'res'.

I did not see such use-def analysis in LLVM, but it surely exists in some other compilers. Perhaps such analysis will be implemented in LLVM eventually/soon.
Thank you,
Vyacheslav Klochkov
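[Editor's note: the tied-operand issue described above can be illustrated with a small model of two of the scalar forms. The names `fmadd213ss`/`fmadd231ss` and the four-float struct are hypothetical stand-ins for the real instructions and `__m128`, and the arithmetic is unfused; the sketch only shows that both forms compute the same low-lane value a*b + c while passing through the upper lanes of whichever register is tied to the destination, which is exactly why commuting the 1st operand can change the result's upper bits.]

```c
#include <assert.h>

typedef struct { float f[4]; } m128_model;

/* Model of vfmadd213ss d, s2, s3: low lane d = s2*d + s3;
   d's upper lanes are passed through unchanged. */
static m128_model fmadd213ss(m128_model d, m128_model s2, m128_model s3) {
    d.f[0] = s2.f[0] * d.f[0] + s3.f[0];
    return d;
}

/* Model of vfmadd231ss d, s2, s3: low lane d = s2*s3 + d;
   d's upper lanes are passed through unchanged. */
static m128_model fmadd231ss(m128_model d, m128_model s2, m128_model s3) {
    d.f[0] = s2.f[0] * s3.f[0] + d.f[0];
    return d;
}
```

For _mm_fmadd_ss(a, b, c), the 213 form with a tied to the destination yields a's upper lanes, while the 231 form with c tied yields c's upper lanes; the low lanes agree, so the form choice is only free when those upper bits are known to be dead.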
David A. Greene via llvm-dev
2016-Sep-13 01:24 UTC
[llvm-dev] [X86] FMA transformation restrictions
Vyacheslav Klochkov <vyacheslav.n.klochkov at gmail.com> writes:

> Probably, this moment was not mentioned explicitly for FMA intrinsics
> here: https://software.intel.com/en-us/node/582845
> That is rather a documentation problem (actually, my fault, as I did
> not add a special notice when created/added those new
> _mm_fmadd_ss/sd() intrinsics).

I didn't reference that document, but I think I just missed the passthrough from a in the description in another document.

> The intention was to maintain the existing assumption regarding the
> 1st intrinsic operand and to give users (including some math library
> folks) a tool with well-defined input/output behavior.

Makes sense.

> It is important to mention that the FMA form selection (132/213/231)
> by the compiler does not change the precision of the result. It is
> always correct for vector opcodes and conditionally correct for *_Int
> opcodes.
>
> *_Int opcodes may need some additional correctness analysis.
> Commuting the 2nd and 3rd operands is always correct, while commuting
> the 1st and 2nd or the 1st and 3rd requires use-def analysis.
> It is OK to commute the 1st operand if it is known that the upper bits
> of the intrinsic result are not used. For example:
>   __m128 res = _mm_fmadd_ss(a, b, c);
>   _mm_store_ss(ptr, res); // this is the ONLY user of 'res'.

Yes, of course.

> I did not see such use-def analysis in LLVM, but surely such exists in
> some other compilers. Perhaps such analysis will be implemented in
> LLVM eventually/soon.

It would be nice. :)

Thanks for your help!

-David