Hal Finkel
2013-Nov-10 15:36 UTC
[LLVMdev] [RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry
Hello everyone,

The particular motivation for this e-mail is my desire for feedback on how to fix PR17758; but there is a core design issue here, so I'd like a wide audience.

The underlying issue is that, because the semantics of llvm.sqrt are purposefully defined to be different from libm sqrt (unlike all of the other llvm.<libm function> intrinsics) (*), and because autovectorization relies on the vector forms of these intrinsics when vectorizing calls to libm math functions, we cannot vectorize a libm sqrt() call into a vector llvm.sqrt call. However, in fast-math mode we'd like to vectorize calls to sqrt, and so I modified Clang to emit calls to llvm.sqrt in fast-math mode for sqrt (and sqrt[fl]). This makes it similar to the libm pow and fma calls, which Clang always transforms into the llvm.pow and llvm.fma intrinsics.

Here's the problem: there is an InstCombine optimization for sqrt (inside visitFPTrunc), and a bunch of optimizations inside SimplifyLibCalls, that apply only to the sqrt libm call and not to the intrinsic. The result, among other things, is PR17758, where fast-math mode actually produces slower code for non-vectorized sqrt calls.

Some questions:

- Is the asymmetry between optimizations performed on libm calls and their corresponding llvm.<libm function> intrinsics intentional, or just due to a lack of motivation?

- Even if unintentional, is this asymmetry in any way desirable (for sqrt in particular, or in general)?

- I can refactor all existing optimizations to be agnostic between libm calls and intrinsics, but is that the desired solution? If so, any advice on a particularly nice way to do this would certainly be appreciated.

For example, an alternative solution for PR17758 in particular would be to revert the Clang change, introduce a new intrinsic for sqrt that does match the libm semantics, and have vectorization use that when available.

Another alternative is to revert the Clang change and make autovectorization of libm sqrt -> llvm.sqrt dependent on the NoNaNsFPMath TargetOptions flag (this requires directly exposing parts of TargetOptions to the IR level).

I believe that both of these alternatives also require fixing the inliner to deal properly with fast-math attributes during LTO (unless I can just ignore this for now). This was the objection raised to the TargetOptions solution when I first brought it up.

(*) According to the language reference, the specific difference is: "Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for negative numbers other than -0.0 (which allows for better optimization, because there is no need to worry about errno being set). llvm.sqrt(-0.0) is defined to return -0.0 like IEEE sqrt."

Thanks again,
Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Nicholas Chapman
2013-Nov-11 21:45 UTC
[LLVMdev] [RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry
Hi Hal, all.

I'm not sure why llvm.sqrt is 'special'. Maybe because there is an SSE packed sqrt instruction (SQRTPS) but not, e.g., a packed sin instruction, AFAIK.

As mentioned in a recent mail to this list, I would like llvm.sqrt to be defined as NaN for argument x < 0. I believe this would bring it more into line with the other intrinsics, and with the libm result, which is NaN for x < 0:
http://pubs.opengroup.org/onlinepubs/007904975/functions/sqrt.html

Cheers,
Nick

On 10/11/2013 3:36 p.m., Hal Finkel wrote:
> Hello everyone,
>
> The particular motivation for this e-mail is my desire for feedback on
> how to fix PR17758; but there is a core design issue here, so I'd like
> a wide audience.
Hal Finkel
2013-Nov-12 05:30 UTC
[LLVMdev] [RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry
----- Original Message -----
> Hi Hal, all.
>
> I'm not sure why llvm.sqrt is 'special'. Maybe because there is an SSE
> packed sqrt instruction (SQRTPS) but not, e.g., a packed sin
> instruction, AFAIK.

This seems relevant:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-August/010248.html

Chris, et al., does the decision on how to treat sqrt predate our current way of handling errno?

 -Hal

> As mentioned in a recent mail to this list, I would like llvm.sqrt to
> be defined as NaN for argument x < 0. I believe this would bring it
> more into line with the other intrinsics, and with the libm result,
> which is NaN for x < 0:
> http://pubs.opengroup.org/onlinepubs/007904975/functions/sqrt.html
>
> Cheers,
> Nick

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory