thr3ads.net - llvm dev - [llvm-dev] Vectorization with fast-math on irregular ISA sub-sets [Feb 2016]

Hal Finkel via llvm-dev

2016-Feb-09 20:29 UTC

[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets

----- Original Message -----
> From: "Renato Golin" <renato.golin at linaro.org>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "James Molloy" <James.Molloy at arm.com>, "Nadav Rotem" <nrotem at apple.com>, "Arnold Schwaighofer"
> <aschwaighofer at apple.com>, "LLVM Dev" <llvm-dev at lists.llvm.org>, "nd" <nd at arm.com>
> Sent: Tuesday, February 9, 2016 3:38:20 AM
> Subject: Re: Vectorization with fast-math on irregular ISA sub-sets
> 
> On 9 February 2016 at 03:48, Hal Finkel <hfinkel at anl.gov> wrote:
> > Yes, and generically speaking, it does for FP loops as well
> > (except, as has been noted, when there are FP reductions).
> 
> Right, and I think that's the problem, since a series of FP inductions
> could converge to a different value in NEON or VFP, basically acting
> like an n-wise reduction. Since we can't (yet?) prove there isn't a
> series of operations with the same data, we have to treat them as
> unsafe for non-IEEE FP operations.
> 
> 
> > It seems like we need two things here:
> >
> >  1. Use our backend fast-math flags during instruction selection to
> >  scalarize vector instructions that don't have the right
> >  allowances (on targets where that's necessary)
> >  2. Update the TTI cost model interfaces to take fast-math flags so
> >  that all vectorizers can make appropriate decisions
> 
> I think this is exactly the opposite of what James is saying, and I
> have to agree with him, since this would scalarise everything.
No, it just means that the intrinsics need to set the appropriate fast-math
flags on the instructions generated. This might require some frontend enablement
work, so be it.

There might be a slight issue with legacy IR bitcode, but if that's going to
be a problem in practice, we can design some scheme to let auto-upgrade do the
right thing.
> 
> If the scalarisation is in IR, then any NEON intrinsic in C code will
> get wrongly scalarised. Builtins can be lowered in either IR
> operations or builtins, and the back-end has no way of knowing the
> origin.
> 
> If the scalarization is lower down, then we risk also changing inline
> ASM snippets, which is even worse.
Yes, but we don't do that, so that's not a practical concern.
> 
> James' idea on this one is to have an additional flag to *enable* such
> scalarisation when the user cares too much about it, which I also
> think is a better idea than to make that the default behaviour.
The --stop-pretending-to-be-IEEE-compliant-when-not-really flag? ;) I don't
think that's a good idea.

To be fair, our IR language reference does not actually say that our
floating-point arithmetic is IEEE compliant, but it is implied, and frontends
depend on this fact. We really should not change the IR floating-point semantics
contract over this. It might require some user education, but that's much
better than producing subtly-wrong results.

We have a pass-feedback mechanism. I think it would be very useful if compiling
with -Rpass-missed=loop-vectorize and/or -Rpass-analysis=loop-vectorize
helpfully informed users that compiling with -ffast-math and/or
-ffinite-math-only and -fno-signed-zeros would allow the loop to be vectorized
for the targeted hardware.

 -Hal
> 
> cheers,
> --renato
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Renato Golin via llvm-dev

2016-Feb-10 14:30 UTC

[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets

On 9 February 2016 at 20:29, Hal Finkel <hfinkel at anl.gov> wrote:
>> If the scalarisation is in IR, then any NEON intrinsic in C code will
>> get wrongly scalarised. Builtins can be lowered in either IR
>> operations or builtins, and the back-end has no way of knowing the
>> origin.
>>
>> If the scalarization is lower down, then we risk also changing inline
>> ASM snippets, which is even worse.
>
> Yes, but we don't do that, so that's not a practical concern.
The IR scalarisation is, though.

> To be fair, our IR language reference does not actually say that our
> floating-point arithmetic is IEEE compliant, but it is implied, and frontends
> depend on this fact. We really should not change the IR floating-point semantics
> contract over this. It might require some user education, but that's much
> better than producing subtly-wrong results.

But we lower a NEON intrinsic into plain IR instructions.

If I got it right, the current "fast" attribute is "may use non IEEE
compliant", emphasis on the *may*.

As a user, I'd be really angry if I used "float32x4_t vaddq_f32
(float32x4_t, float32x4_t)" and the compiler emitted four VADD.f32 SN.

Right now, Clang lowers:
  vaddq_f32 (a, b);

to:
  %add.i = fadd <4 x float> %a, %b

which lowers (correctly) to:
  vadd.f32 q0, q0, q1

If, OTOH, "fast" means "*must* select the fastest", then we may get
away with using it.

So, your proposal seems to be that, while lowering NEON intrinsics,
Clang *always* emit the "fast" attribute for all FP operations, and
that such scalarisation phase would split *all* non-fast FP operations
if the target has non-IEEE-754 compliant SIMD.

James' proposal is to not vectorise loops if an IEEE-754 compliant SIMD
is not on, and to only generate VFP instructions in the SLP
vectoriser. If we're not generating the large vector operations in the
first place, why would we need to scalarise them?

If we do vectorise to SIMD and then later scalarise, wouldn't that
change the cost model? Wouldn't it be harder to predict performance
gains, given that our cost model is only approximate and very
empirical?

Other front-ends should produce "valid" (target-specific) IR in the
first place, no? Hand generated broken IR is not something we wish to
support either, I believe.

> We have a pass-feedback mechanism. I think it would be very useful if
> compiling with -Rpass-missed=loop-vectorize and/or
> -Rpass-analysis=loop-vectorize helpfully informed users that compiling with
> -ffast-math and/or -ffinite-math-only and -fno-signed-zeros would allow the loop
> to be vectorized for the targeted hardware.

That works for optimisations, not for intrinsics. Since we use the
same intermediate representation for both, we can't assume anything.

cheers,
--renato

Hal Finkel via llvm-dev

2016-Feb-11 01:15 UTC

[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets

----- Original Message -----
> From: "Renato Golin" <renato.golin at linaro.org>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "James Molloy" <James.Molloy at arm.com>, "Nadav Rotem" <nrotem at apple.com>, "Arnold Schwaighofer"
> <aschwaighofer at apple.com>, "LLVM Dev" <llvm-dev at lists.llvm.org>, "nd" <nd at arm.com>
> Sent: Wednesday, February 10, 2016 8:30:50 AM
> Subject: Re: Vectorization with fast-math on irregular ISA sub-sets
> 
> On 9 February 2016 at 20:29, Hal Finkel <hfinkel at anl.gov> wrote:
> >> If the scalarisation is in IR, then any NEON intrinsic in C code
> >> will
> >> get wrongly scalarised. Builtins can be lowered in either IR
> >> operations or builtins, and the back-end has no way of knowing the
> >> origin.
> >>
> >> If the scalarization is lower down, then we risk also changing
> >> inline
> >> ASM snippets, which is even worse.
> >
> > Yes, but we don't do that, so that's not a practical concern.
> 
> The IR scalarisation is, though.
> 
> 
> > To be fair, our IR language reference does not actually say that
> > our floating-point arithmetic is IEEE compliant, but it is
> > implied, and frontends depend on this fact. We really should not
> > change the IR floating-point semantics contract over this. It
> > might require some user education, but that's much better than
> > producing subtly-wrong results.
> 
> But we lower a NEON intrinsic into plain IR instructions.
> 
> If I got it right, the current "fast" attribute is "may use non IEEE
> compliant", emphasis on the *may*.
> 
> As a user, I'd be really angry if I used "float32x4_t vaddq_f32
> (float32x4_t, float32x4_t)" and the compiler emitted four VADD.f32
> SN.
> 
> Right now, Clang lowers:
>   vaddq_f32 (a, b);
> 
> to:
>   %add.i = fadd <4 x float> %a, %b
> 
> which lowers (correctly) to:
>   vadd.f32 q0, q0, q1
> 
> If, OTOH, "fast" means "*must* select the fastest", then we may get
> away with using it.
> 
> So, your proposal seems to be that, while lowering NEON intrinsics,
> Clang *always* emit the "fast" attribute for all FP operations, and
> that such scalarisation phase would split *all* non-fast FP operations
> if the target has non-IEEE-754 compliant SIMD.
To be clear, I'm recommending that you add flags like nnan, ninf and nsz.
However, I think that I've changed my mind: This won't work for the
intrinsics. The flags are defined as:

  nsz
  No Signed Zeros - Allow optimizations to treat the sign of a zero argument or
result as insignificant.

  nnan
  No NaNs - Allow optimizations to assume the arguments and result are not NaN.
Such optimizations are required to retain defined behavior over NaNs, but the
value of the result is undefined.

  ninf
  No Infs - Allow optimizations to assume the arguments and result are not
+/-Inf. Such optimizations are required to retain defined behavior over +/-Inf,
but the value of the result is undefined.

and this is not right for the intrinsics-generated IR. The problem is that, for
intrinsics, users get to assume the exact semantics provided by the
underlying machine instructions. By using intrinsics, the user is not telling
the compiler it can do arbitrary things with the sign bit on zeros and all of
the bits when given a NaN/Inf input. Rather, the user expects very specific
(non-IEEE) behavior.

I think we have two options here:

 1. Lower these intrinsics into target-level intrinsics

 2. Add flags (or something like that) that indicate the alternate non-IEEE
semantics that ARM actually provides.

I suspect that (1) will cause performance regressions (since we don't
optimize the intrinsics as well as the generic IR we previously generated), so
we should investigate (2).
> 
> James' proposal is to not vectorise loops if an IEEE-754 compliant SIMD
> is not on, and to only generate VFP instructions in the SLP
> vectoriser. If we're not generating the large vector operations in the
> first place, why would we need to scalarise them?
We should indeed let the cost model reflect the scalarization cost in cases
where we need IEEE semantics.
> 
> If we do vectorise to SIMD and then later scalarise, wouldn't that
> change the cost model? Wouldn't it be harder to predict performance
> gains, given that our cost model is only approximate and very
> empirical?
We'd need to pass the fast-math flags to the cost model so that we'd get
costs back that depended on whether or not we could actually use the vector
instructions.

 -Hal
> 
> Other front-ends should produce "valid" (target-specific) IR in the
> first place, no? Hand generated broken IR is not something we wish to
> support either, I believe.
> 
> 
> > We have a pass-feedback mechanism. I think it would be very useful
> > if compiling with -Rpass-missed=loop-vectorize and/or
> > -Rpass-analysis=loop-vectorize helpfully informed users that
> > compiling with -ffast-math and/or -ffinite-math-only and
> > -fno-signed-zeros would allow the loop to be vectorized for the
> > targeted hardware.
> 
> That works for optimisations, not for intrinsics. Since we use the
> same intermediate representation for both, we can't assume anything.
> 
> cheers,
> --renato
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
