Hal Finkel via llvm-dev
2016-Feb-11 01:15 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
----- Original Message -----
> From: "Renato Golin" <renato.golin at linaro.org>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "James Molloy" <James.Molloy at arm.com>, "Nadav Rotem" <nrotem at apple.com>, "Arnold Schwaighofer" <aschwaighofer at apple.com>, "LLVM Dev" <llvm-dev at lists.llvm.org>, "nd" <nd at arm.com>
> Sent: Wednesday, February 10, 2016 8:30:50 AM
> Subject: Re: Vectorization with fast-math on irregular ISA sub-sets
>
> On 9 February 2016 at 20:29, Hal Finkel <hfinkel at anl.gov> wrote:
> >> If the scalarisation is in IR, then any NEON intrinsic in C code
> >> will get wrongly scalarised. Builtins can be lowered in either IR
> >> operations or builtins, and the back-end has no way of knowing the
> >> origin.
> >>
> >> If the scalarisation is lower down, then we risk also changing
> >> inline ASM snippets, which is even worse.
> >
> > Yes, but we don't do that, so that's not a practical concern.
>
> The IR scalarisation is, though.
>
> > To be fair, our IR language reference does not actually say that
> > our floating-point arithmetic is IEEE compliant, but it is
> > implied, and frontends depend on this fact. We really should not
> > change the IR floating-point semantics contract over this. It
> > might require some user education, but that's much better than
> > producing subtly-wrong results.
>
> But we lower a NEON intrinsic into plain IR instructions.
>
> If I got it right, the current "fast" attribute is "may use non-IEEE
> compliant", emphasis on the *may*.
>
> As a user, I'd be really angry if I used "float32x4_t vaddq_f32
> (float32x4_t, float32x4_t)" and the compiler emitted four VADD.f32
> Sn.
>
> Right now, Clang lowers:
>
>   vaddq_f32 (a, b);
>
> to:
>
>   %add.i = fadd <4 x float> %a, %b
>
> which lowers (correctly) to:
>
>   vadd.f32 q0, q0, q1
>
> If, OTOH, "fast" means "*must* select the fastest", then we may get
> away with using it.
> So, your proposal seems to be that, while lowering NEON intrinsics,
> Clang *always* emit the "fast" attribute for all FP operations, and
> that such a scalarisation phase would split *all* non-fast FP
> operations if the target has non-IEEE-754-compliant SIMD.

To be clear, I'm recommending that you add flags like nnan, ninf and nsz. However, I think that I've changed my mind: this won't work for the intrinsics. The flags are defined as:

  nsz
  No Signed Zeros - Allow optimizations to treat the sign of a zero argument or result as insignificant.

  nnan
  No NaNs - Allow optimizations to assume the arguments and result are not NaN. Such optimizations are required to retain defined behavior over NaNs, but the value of the result is undefined.

  ninf
  No Infs - Allow optimizations to assume the arguments and result are not +/-Inf. Such optimizations are required to retain defined behavior over +/-Inf, but the value of the result is undefined.

and this is not right for the intrinsics-generated IR. The problem is that, for intrinsics, the users get to assume the exact semantics provided by the underlying machine instructions. By using intrinsics, the user is not telling the compiler it can do arbitrary things with the sign bit on zeros and all of the bits when given a NaN/Inf input. Rather, the user expects very specific (non-IEEE) behavior.

I think we have two options here:

 1. Lower these intrinsics into target-level intrinsics.
 2. Add flags (or something like that) that indicate the alternate non-IEEE semantics that ARM actually provides.

I suspect that (1) will cause performance regressions (since we don't optimize the intrinsics as well as the generic IR we previously generated), so we should investigate (2).

> James' proposal is to not vectorise loops if an IEEE-754-compliant
> SIMD is not on, and to only generate VFP instructions in the SLP
> vectoriser.
> If we're not generating the large vector operations in the
> first place, why would we need to scalarise them?

We should indeed let the cost model reflect the scalarization cost in cases where we need IEEE semantics.

> If we do vectorise to SIMD and then later scalarise, wouldn't that
> change the cost model? Wouldn't it be harder to predict performance
> gains, given that our cost model is only approximate and very
> empirical?

We'd need to pass the fast-math flags to the cost model so that we'd get costs back that depended on whether or not we could actually use the vector instructions.

 -Hal

> Other front-ends should produce "valid" (target-specific) IR in the
> first place, no? Hand-generated broken IR is not something we wish to
> support either, I believe.
>
> > We have a pass-feedback mechanism; I think it would be very useful
> > if compiling with -Rpass-missed=loop-vectorize and/or
> > -Rpass-analysis=loop-vectorize helpfully informed users that
> > compiling with -ffast-math and/or -ffinite-math-only and
> > -fno-signed-zeros would allow the loop to be vectorized for the
> > targeted hardware.
>
> That works for optimisations, not for intrinsics. Since we use the
> same intermediate representation for both, we can't assume anything.
>
> cheers,
> --renato

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Renato Golin via llvm-dev
2016-Feb-11 09:53 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
On 11 February 2016 at 01:15, Hal Finkel <hfinkel at anl.gov> wrote:
> Rather, the user expects very specific (non-IEEE) behavior.

Precisely! :)

> I think we have two options here:
>
> 1. Lower these intrinsics into target-level intrinsics

That's not an option for the reasons you outline (performance), but also because this would explode the number of intrinsics we have to deal with, making the IR *very* opaque and hard to deal with.

> 2. Add flags (or something like that) that indicate the alternate non-IEEE semantics that ARM actually provides.

That's my idea, but I want to think about it only when we really need to. Adding new flags always leads us to hard choices, and backwards compatibility will be a problem here.

> We'd need to pass the fast-math flags to the cost model so that we'd get costs back that depended on whether or not we could actually use the vector instructions.

Indeed, that's the only way. But I foresee the cost model at least doubling its complexity for those unfortunate targets.

Right now, we use heuristics to map the costs of casts, shuffles and memory operations that normally disappear, but when loops can use NEON and VFP as well as scalar instructions in the same objects, how the back-end will emit those pseudo-operations will be anyone's guess.

In that sense, James' suggestion to create a flag for strict IEEE semantics, locking SIMD FP out of the question entirely, is an easy intermediate step.

cheers,
--renato
Renato Golin via llvm-dev
2016-Feb-11 10:53 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
Hal,

I had a read of the ARM ARM on VFP and SIMD FP semantics, and my analysis is that NEON's only problem is the flush-to-zero behaviour, which is non-compliant.

NEON deals with NaNs and Infs in the way specified by the standard and should not cause any concern to us. But we don't seem to have a flag specifically for denormals, so I think using UnsafeMath is the safest option for now.

On 11 February 2016 at 01:15, Hal Finkel <hfinkel at anl.gov> wrote:
> nsz
> No Signed Zeros - Allow optimizations to treat the sign of a zero argument or result as insignificant.

In both VFP and NEON, zero signs are significant. In NEON, the flush-to-zero's zero will have the same sign as the input denormal.

> nnan
> No NaNs - Allow optimizations to assume the arguments and result are not NaN. Such optimizations are required to retain defined behavior over NaNs, but the value of the result is undefined.

Both VFP and NEON treat NaNs as the standard requires, i.e. [ NaN op ? ] = NaN.

> ninf
> No Infs - Allow optimizations to assume the arguments and result are not +/-Inf. Such optimizations are required to retain defined behavior over +/-Inf, but the value of the result is undefined.

Same here. Operations with Inf generate Inf or NaNs on both units.

The flush-to-zero behaviour has an effect on both NaNs and Infs, since the flush happens before the operation. So a denormal operation with an Inf in VFP will not generate a NaN, while in NEON the denormal will be flushed to zero first, thus generating a NaN.

James, is that a correct assessment?

cheers,
--renato
Martin J. O'Riordan via llvm-dev
2016-Feb-11 11:23 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
Our processor also has some issues with the handling of denormals - scalar and vector - and we ran into a related problem only a few days ago.

The v3.8 compiler has done a lot of good work on optimisations for floating-point math, but ironically one of them broke our implementation of 'nextafterf'. The desired code fragment (FP32) is:

  float xAbs = fabsf(x);

but since we know our instruction for this does not handle denormals, and the algorithm is sensitive to correct denormals, the code was written to avoid the issue as follows:

  float xAbs = __builtin_astype(__builtin_astype(x, unsigned) & 0x7FFFFFFF, float);

The v3.8 FP optimiser now recognises this pattern and replaces it with an ISD::FABS node, which broke our workaround :-)

It's a great optimisation and I have no problem with its correctness, but I was thinking that perhaps I might look at where I should extend the target information interface to allow a target to say that it does not support denormals, so that this and possibly other optimisations could be suppressed in a target-dependent way.

Overall, the new FP32 optimisation patterns appear to have yielded a small but not insignificant performance advantage over v3.7.1, though it is still early days for my complete measurements.

MartinO
James Molloy via llvm-dev
2016-Feb-15 08:34 UTC
[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets
Hi,

> James, is that a correct assessment?

Yes, it is also my belief that the only way ARMv7 NEON differs from IEEE 754 is its lack of denormal support.

James