Hal Finkel
2013-Nov-10 15:36 UTC
[LLVMdev] [RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry
Hello everyone,

The particular motivation for this e-mail is my desire for feedback on how to fix PR17758; but there is a core design issue here, so I'd like a wide audience.

The underlying issue is that, because the semantics of llvm.sqrt are purposefully defined to be different from libm sqrt (unlike all of the other llvm.<libm function> intrinsics) (*), and because autovectorization relies on the vector forms of these intrinsics when vectorizing function calls to libm math functions, we cannot vectorize a libm sqrt() call into a vector llvm.sqrt call. However, in fast-math mode, we'd like to vectorize calls to sqrt, and so I modified Clang to emit calls to llvm.sqrt in fast-math mode for sqrt (and sqrt[fl]). This makes it similar to the libm pow and fma calls, which Clang always transforms into the llvm.pow and llvm.fma intrinsics.

Here's the problem: There is an InstCombine optimization for sqrt (inside visitFPTrunc), and a bunch of optimizations inside SimplifyLibCalls, that apply only to the sqrt libm call and not to the intrinsics. The result, among other things, is PR17758, where fast-math mode actually produces slower code for non-vectorized sqrt calls.

Some questions:

 - Is the asymmetry between optimizations performed on libm calls and their corresponding llvm.<libm function> intrinsics intentional, or just due to a lack of motivation?

 - Even if unintentional, is this asymmetry in any way desirable (for sqrt in particular, or in general)?

 - I can refactor all existing optimizations to be libm-call vs. intrinsics agnostic, but is that the desired solution? If so, any advice on a particularly nice way to do this would certainly be appreciated.

For example, an alternative solution for PR17758 in particular would be to revert the Clang change, introduce a new intrinsic for sqrt that does match the libm semantics, and have vectorization use that when available.

Another alternative is to revert the Clang change and make autovectorization of libm sqrt -> llvm.sqrt dependent on the NoNaNsFPMath TargetOptions flag (this requires directly exposing parts of TargetOptions to the IR level).

I believe that both of these alternatives also require fixing the inliner to deal properly with fast-math attributes during LTO (unless I can just ignore this for now). This was the objection raised to the TargetOptions solution when I first brought it up.

(*) According to the language reference, the specific difference is, "Unlike sqrt in libm, however, llvm.sqrt has undefined behavior for negative numbers other than -0.0 (which allows for better optimization, because there is no need to worry about errno being set). llvm.sqrt(-0.0) is defined to return -0.0 like IEEE sqrt."

Thanks again,
Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
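For concreteness, here is a reduced LLVM IR sketch of the asymmetry described above; the function names are made up and this is not the actual test case from PR17758, it only illustrates the general fpext/call/fptrunc pattern that the libm-only folds handle.

; A float sqrt computed through the double-precision routine, as a frontend
; would emit for "(float) sqrt((double) x)".  With the libm call, the
; visitFPTrunc fold (and SimplifyLibCalls) can, under the right conditions,
; shrink the fpext/call/fptrunc sequence to a single single-precision sqrt.
define float @libm_form(float %x) {
  %ext  = fpext float %x to double
  %call = call double @sqrt(double %ext)
  %res  = fptrunc double %call to float
  ret float %res
}

; The same computation through the intrinsic, as emitted after the Clang
; fast-math change; the libm-only folds above do not currently fire on
; this form, which is the asymmetry at issue.
define float @intrinsic_form(float %x) {
  %ext  = fpext float %x to double
  %call = call double @llvm.sqrt.f64(double %ext)
  %res  = fptrunc double %call to float
  ret float %res
}

declare double @sqrt(double)
declare double @llvm.sqrt.f64(double)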
Nicholas Chapman
2013-Nov-11 21:45 UTC
[LLVMdev] [RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry
Hi Hal, all.

I'm not sure why llvm.sqrt is 'special'. Maybe because there is an SSE packed sqrt instruction (SQRTPS) but not e.g. a packed sin instruction, AFAIK.

As mentioned in a recent mail to this list, I would like llvm.sqrt to be defined as NaN for argument x < 0. I believe this would bring it more into line with the other intrinsics, and with the libm result, which is NaN for x < 0:
http://pubs.opengroup.org/onlinepubs/007904975/functions/sqrt.html

Cheers,
Nick
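For reference, a minimal sketch (not taken from the thread) of the vector form the loop vectorizer needs to emit when it vectorizes a sqrt call:

; Vectorized sqrt has to go through the intrinsic, since there is no
; "vector libm" call to target; the intrinsic's vector form lowers to a
; packed hardware sqrt (e.g. SQRTPS on x86).  Getting here from a scalar
; libm sqrt() call is what requires the fast-math / no-negative-input
; assumption discussed in this thread.
define <4 x float> @vec_sqrt(<4 x float> %v) {
  %r = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %v)
  ret <4 x float> %r
}

declare <4 x float> @llvm.sqrt.v4f32(<4 x float>)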
Hal Finkel
2013-Nov-12 05:30 UTC
[LLVMdev] [RFC] How to fix sqrt vs llvm.sqrt optimization asymmetry
----- Original Message -----
> Hi Hal, all.
>
> I'm not sure why llvm.sqrt is 'special'. Maybe because there is an SSE
> packed sqrt instruction (SQRTPS) but not e.g. a packed sin instruction,
> AFAIK.

This seems relevant: http://lists.cs.uiuc.edu/pipermail/llvmdev/2007-August/010248.html

Chris, et al., does the decision on how to treat sqrt predate our current way of handling errno?

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory