Stephen Lin
2013-Jul-08 22:10 UTC
[LLVMdev] API break for out-of-tree targets implementing TargetLoweringBase::isFMAFasterThanMulAndAdd
Hello,

To any out-of-tree targets, please be aware that I intend to commit a patch that will break the build of any target implementing TargetLoweringBase::isFMAFasterThanMulAndAdd, for the reasons described below. (Basically, the current interface definition is broken and not followed, and no in-tree target was doing the right thing with it, so it is unlikely any out-of-tree target is either...)

To un-break your build after this patch goes through, you will need to rename isFMAFasterThanMulAndAdd to isFMAFasterThanFMulAndFAdd and ensure that it returns true for any type that eventually legalizes to a type for which FMAs are faster than FMul + FAdd (which usually means you have hardware support for the operation). You can look at in-tree target implementations as an example.

Please let me know if there are any objections before tomorrow morning.

Stephen

---------- Forwarded message ----------
From: Stephen Lin <swlin at post.harvard.edu>
Date: Sun, Jul 7, 2013 at 9:25 PM
Subject: [PATCH] Resolve issues with fmuladd intrinsic handling across multiple backends
To: llvm-commits at cs.uiuc.edu

Hi,

While working on another patch, I discovered multiple related issues with fmuladd intrinsic handling which I believe the attached patch resolves.

Currently, lowering of the fmuladd intrinsic depends on the target's implementation of the virtual function TargetLoweringBase::isFMAFasterThanMulAndAdd, the comments of which currently claim:

- /// isFMAFasterThanMulAndAdd - Return true if an FMA operation is faster than
- /// a pair of mul and add instructions. fmuladd intrinsics will be expanded to
- /// FMAs when this method returns true (and FMAs are legal), otherwise fmuladd
- /// is expanded to mul + add.

The "and FMAs are legal" portion of the above comment is simply a lie; the legality of FMA operations is not checked before lowering fmuladd to ISD::FMA. However, the AArch64, SystemZ, and X86 implementations of this function are coded assuming that legality is checked and thus simply return true. This results in the following issues:

1. On X86(-64) targets, ISD::FMA nodes are formed when lowering fmuladd intrinsics even if the subtarget does not support FMA instructions, leading to laughably bad code generation in some situations (for example, compiling a call to "@llvm.fmuladd.v16f32(<16 x float> %a, <16 x float> %b, <16 x float> %c)" without "-mattr=+fma" or "-mattr=+fma4" results in 16 calls to the fmaf libm function instead of AVX muls and adds).

2. On AArch64 targets, ISD::FMA nodes are formed for operations on fp128, resulting in a call to a software fp128 FMA implementation rather than a software fp128 multiply and a software fp128 add. This does not seem to be the intended behavior given the comment above; however, I am not sure if this is actually better or worse, since neither set of operations is supported in hardware... does anyone know which is likely to be faster?

However, on further investigation, it seems that not checking the legality of an FMA operation when lowering fmuladd intrinsics is a feature, not a bug, since it allows formation of FMAs with types like v16f32, as long as they legalize (via splitting, scalarization, promotion, etc.) to types that support FMAs; this case is explicitly tested for in test/CodeGen/X86/wide-fma-contraction.ll.

So the proper solution, it seems, is for TargetLoweringBase::isFMAFasterThanMulAndAdd implementations to check types and preconditions themselves rather than depending on the caller to do so.
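For concreteness, here is a minimal sketch of what such an implementation could look like after the rename, for a hypothetical out-of-tree target "XYZ" (the Subtarget->hasFMA() feature query and the set of supported scalar types are placeholders, not something taken from the patch; the in-tree implementations are the authoritative examples):

    // XYZISelLowering.cpp (hypothetical target): override of the renamed hook.
    // The caller no longer guarantees that ISD::FMA is legal for VT, so the
    // target must check its own preconditions here.
    bool XYZTargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
      // Precondition: the subtarget actually has FMA instructions
      // (hasFMA() is an assumed feature query for this sketch).
      if (!Subtarget->hasFMA())
        return false;

      // Check only the scalar type, so that wide vectors such as v16f32,
      // which legalize by splitting/scalarizing into FMA-capable types,
      // still report FMA as faster.
      VT = VT.getScalarType();
      if (!VT.isSimple())
        return false;

      switch (VT.getSimpleVT().SimpleTy) {
      case MVT::f32:
      case MVT::f64:
        return true;   // hardware FMA beats separate FMul + FAdd
      default:
        return false;  // e.g. f128: no hardware FMA, expand to FMul + FAdd
      }
    }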
The last in-tree target to implement this function, PowerPC, actually does check types; however, it only checks for the specific legal types, and therefore the following occurs:

3. On PowerPC targets, FMAs are not generated from fmuladd intrinsics on types like v2f32, v8f32, v4f64, etc., even though they promote, split, scalarize, etc. to types that support hardware FMAs.

This patch resolves all these issues by modifying the implementations of this virtual function to check for subtarget features and for EVTs that legalize to MVTs that support hardware FMAs. (If I've understood the legalization rules correctly, it turns out the latter can always be done in these targets just by checking the scalar type, but this is not the correct solution in the general case.) Comments are adjusted accordingly.

(For the AArch64 backend in particular, I made the assumption, for now, that FMAs should not be formed on fp128, but this can be changed with a one-line fix.)

Since all current implementations of this virtual function were buggy and the comment describing its usage was incorrect, I've also decided to rename it from "TargetLoweringBase::isFMAFasterThanMulAndAdd" to "TargetLoweringBase::isFMAFasterThanFMulAndFAdd" to force a merge conflict for any out-of-tree target implementing it; the new name is more accurate and consistent with the names of other functions in this interface anyway*. I will send a message to LLVMdev describing the change and the appropriate fixes required if/when it goes through. (Suggestions for a better solution to this problem are welcome!)

Also, I took this opportunity to modify DAGCombiner to check for type legality when forming ISD::FMA nodes (in unsafe math mode) only after legalization, since this is allowed by the new interface definition (as long as isFMAFasterThanFMulAndFAdd returns true for the type). In practice, with current target implementations, this change seems to be a no-op, since the FMA will be formed after legalization anyway; however, there doesn't seem to be any harm in forming the FMA earlier when possible.

No current tests are affected by this change; I've added tests that the specific symptoms described above are fixed, as well as some other sanity checks, just in case.

Please review and let me know if you have any comments! In particular, if anyone has definite knowledge about what the correct behavior of fmuladd on fp128 for AArch64 should be, please let me know.

Thanks,
Stephen

* Fast integer FMA instructions are at least theoretically possible, I think (?); however, the current comment already indicates that the function is used specifically to lower fmuladd intrinsics, so "FMul" and "FAdd" are more accurate.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fma-faster-check-types.patch
Type: application/octet-stream
Size: 26781 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130708/52adf721/attachment.obj>
Stephen Lin
2013-Jul-09 18:24 UTC
[LLVMdev] API break for out-of-tree targets implementing TargetLoweringBase::isFMAFasterThanMulAndAdd
OK, the patch is committed as r185956, and it is guaranteed to break either the merge or the build of any out-of-tree target implementing TargetLoweringBase::isFMAFasterThanMulAndAdd.

To resolve this, rename isFMAFasterThanMulAndAdd to isFMAFasterThanFMulAndFAdd and ensure that it returns true for any type that eventually legalizes to a type for which FMAs are faster than FMul + FAdd. For all in-tree targets, this simply requires checking subtarget prerequisites (if any) and then checking the scalar type of the EVT passed to the function. However, it may be more complicated in other situations (for example, if FMAs are supported on a scalar type but not on a legal vectorized type, and fmul + fadd on the vector is faster than scalarizing and using the scalar FMA...).

Please let me know if there are any questions or concerns.

Stephen
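To make the more complicated situation above concrete, here is a rough sketch for a hypothetical target where scalar FMA is fast but the legal vector types have no FMA support (the hasScalarFMA() feature query and the type choices are invented for illustration, not taken from any in-tree target):

    // Hypothetical target: report FMA as faster only for scalar f32/f64, so
    // that vector fmuladd stays as fmul + fadd instead of being scalarized
    // into individual FMAs.
    bool XYZTargetLowering::isFMAFasterThanFMulAndFAdd(EVT VT) const {
      if (!Subtarget->hasScalarFMA())   // assumed feature query
        return false;
      if (VT.isVector())
        return false;
      return VT == MVT::f32 || VT == MVT::f64;
    }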