I noticed that the operand commuting code in X86InstrInfo.cpp treats scalar FMA intrinsics specially. It prevents operand commuting on these scalar instructions because the scalar FMA instructions preserve the upper bits of the vector. Presumably, the restrictions are there because commuting operands potentially changes the result upper bits. However, AFAIK the Intel and GNU FMA intrinsics don't actually specify which FMA (213, 132, 231) is going to be used and so the user can't rely on knowing which operand is tied to the destination. Thus the user can't rely on knowing what the upper bits will be. Is there some other reason these scalar FMA commuting restrictions are in place? Thanks! -David
Michael Kuperstein via llvm-dev
2016-Sep-12 19:07 UTC
[llvm-dev] [X86] FMA transformation restrictions
Hi David,

Assuming I understood the question correctly - Intel doesn't specify which FMA instruction is going to be used, but it does specify what it expects the result to be. E.g. for

  __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c)

the specified semantics are:

  dst[31:0]   := (a[31:0] * b[31:0]) + c[31:0]
  dst[127:32] := a[127:32]
  dst[MAX:128] := 0

The user is allowed to rely on the upper bits of the result being the upper bits of a, and the compiler is required to choose an appropriate instruction form that will make this happen.

Thanks,
Michael
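[Editor's note: the lane-by-lane semantics above can be sketched as a plain-C model. The helper name `model_fmadd_ss` and the four-float struct standing in for `__m128` are hypothetical, and the multiply-add here is unfused (a real FMA rounds once, not twice); the sketch only illustrates which lanes of the result come from which input.]

```c
#include <assert.h>

/* Hypothetical scalar model of the _mm_fmadd_ss semantics quoted above.
   A struct of four floats stands in for an SSE register. */
typedef struct { float f[4]; } m128_model;

static m128_model model_fmadd_ss(m128_model a, m128_model b, m128_model c) {
    m128_model dst = a;                    /* dst[127:32] := a[127:32]        */
    dst.f[0] = a.f[0] * b.f[0] + c.f[0];   /* dst[31:0] := a[31:0]*b[31:0]+c  */
    return dst;
}
```

The key point of the sketch is that every lane except the lowest is copied from `a`, which is why the compiler must keep `a` tied to the destination (or otherwise fix up the upper lanes) when it picks an instruction form.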
David A. Greene via llvm-dev
2016-Sep-12 20:17 UTC
[llvm-dev] [X86] FMA transformation restrictions
Michael Kuperstein <mkuper at google.com> writes:

> E.g. for
>   __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c)
> the specified semantics are:
>
>   dst[31:0]   := (a[31:0] * b[31:0]) + c[31:0]
>   dst[127:32] := a[127:32]
>   dst[MAX:128] := 0
>
> The user is allowed to rely on the upper bits of the result being the
> upper bits of a, and the compiler is required to choose an appropriate
> instruction form that will make this happen.

Ah, thank you. I missed that line about grabbing the upper bits from a. Not a big deal, I was just wondering. Carry on. :)

-David
Vyacheslav Klochkov via llvm-dev
2016-Sep-12 21:41 UTC
[llvm-dev] [X86] FMA transformation restrictions
Hi David,

Commuting the 1st<->2nd and 1st<->3rd operands is _usually_ prohibited for the scalar FMA *_Int opcodes because it would change the values passed through from the first operand of the intrinsic.

I would challenge your statement that the "user cannot rely on knowing which operand is tied to the destination". It is common practice for all intrinsics with *_ss() and *_sd() suffixes that the first operand of the intrinsic is tied to the destination. For example:

  // https://software.intel.com/sites/default/files/a6/22/18072-347603.pdf
  __m128 _mm_add_ss(__m128 a, __m128 b)
  Adds the lower single-precision, floating-point (SP FP) values of a and b;
  the upper 3 SP FP values are passed through from a.

This point was probably not mentioned explicitly for the FMA intrinsics here: https://software.intel.com/en-us/node/582845 That is really a documentation problem (actually, my fault, as I did not add a special notice when I created/added those new _mm_fmadd_ss/sd() intrinsics). The intention was to maintain the existing assumption regarding the 1st intrinsic operand and to give users (including some math library folks) a tool with well-defined input/output behavior.

It is important to mention that the compiler's selection of FMA form (132/213/231) does not change the precision of the result. It is always correct for vector opcodes and conditionally correct for *_Int opcodes.

The *_Int opcodes may need some additional correctness analysis. Commuting the 2nd and 3rd operands is always correct, while commuting the 1st and 2nd or the 1st and 3rd requires use-def analysis: it is OK to commute the 1st operand if it is known that the upper bits of the intrinsic result are not used. For example:

  __m128 res = _mm_fmadd_ss(a, b, c);
  _mm_store_ss(ptr, res); // this is the ONLY user of 'res'.

I did not see such use-def analysis in LLVM, but it surely exists in some other compilers. Perhaps such analysis will be implemented in LLVM eventually/soon.
Thank you,
Vyacheslav Klochkov
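[Editor's note: the tied-operand issue described above can be illustrated with a small model of two of the scalar forms. The names `fmadd213ss`/`fmadd231ss` and the four-float struct are hypothetical stand-ins for the real instructions and `__m128`, and the arithmetic is unfused; the sketch only shows that both forms compute the same low-lane value a*b + c while passing through the upper lanes of whichever register is tied to the destination, which is exactly why commuting the 1st operand can change the result's upper bits.]

```c
#include <assert.h>

typedef struct { float f[4]; } m128_model;

/* Model of vfmadd213ss d, s2, s3: low lane d = s2*d + s3;
   d's upper lanes are passed through unchanged. */
static m128_model fmadd213ss(m128_model d, m128_model s2, m128_model s3) {
    d.f[0] = s2.f[0] * d.f[0] + s3.f[0];
    return d;
}

/* Model of vfmadd231ss d, s2, s3: low lane d = s2*s3 + d;
   d's upper lanes are passed through unchanged. */
static m128_model fmadd231ss(m128_model d, m128_model s2, m128_model s3) {
    d.f[0] = s2.f[0] * s3.f[0] + d.f[0];
    return d;
}
```

For _mm_fmadd_ss(a, b, c), the 213 form with a tied to the destination yields a's upper lanes, while the 231 form with c tied yields c's upper lanes; the low lanes agree, so the form choice is only free when those upper bits are known to be dead.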
David A. Greene via llvm-dev
2016-Sep-13 01:24 UTC
[llvm-dev] [X86] FMA transformation restrictions
Vyacheslav Klochkov <vyacheslav.n.klochkov at gmail.com> writes:

> Probably, this moment was not mentioned explicitly for FMA intrinsics
> here: https://software.intel.com/en-us/node/582845
> That is rather a documentation problem (actually, my fault, as I did
> not add a special notice when created/added those new
> _mm_fmadd_ss/sd() intrinsics).

I didn't reference that document, but I think I just missed the passthrough from a in the description in another document.

> The intention was to maintain the existing assumption regarding the
> 1st intrinsic operand and to give users (including some math library
> folks) a tool with well-defined input/output behavior.

Makes sense.

> It is important to mention that the FMA form selection (132/213/231)
> by the compiler does not change the precision of the result. It is
> always correct for vector opcodes and conditionally correct for *_Int
> opcodes.
>
> *_Int opcodes may need some additional correctness analysis.
> Commuting the 2nd and 3rd operands is always correct, while commuting
> the 1st and 2nd or the 1st and 3rd requires use-def analysis.
> It is OK to commute the 1st operand if it is known that the upper bits
> of the intrinsic result are not used. For example:
>   __m128 res = _mm_fmadd_ss(a, b, c);
>   _mm_store_ss(ptr, res); // this is the ONLY user of 'res'.

Yes, of course.

> I did not see such use-def analysis in LLVM, but surely such exists in
> some other compilers. Perhaps such analysis will be implemented in
> LLVM eventually/soon.

It would be nice. :)

Thanks for your help!

-David