thr3ads.net - llvm dev - [llvm-dev] Vectorization with fast-math on irregular ISA sub-sets [Feb 2016]

Renato Golin via llvm-dev

2016-Feb-08 15:41 UTC

[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets

Folks,

I'm now looking at https://llvm.org/bugs/show_bug.cgi?id=16274, which
seems to have some support in the vectorizer, but not as we need for
this particular case. I may have missed something obvious, please let
me know if there is a better way.

As you already know, ARM has two FP instruction sets: VFP and NEON.
VFP applies to single FP registers while NEON is a full SIMD. The
problem is that NEON is not IEEE compliant on FP operations, while VFP
is.

Even if the target has NEON and the user has asked for it to be used,
without -ffast-math and related arguments, we simply can't produce
NEON instructions for FP operations. Different operations may have
different non-compliance (inf, denormals, etc) and I haven't yet
investigated the full support, but it's safe to start from blocking
*all* FP operations on NEON when *any* FP restrictions are in place.
We can expand for better support later, when the infrastructure is in
place.

As far as I could see, ffast-math is included in the vectorizer, but
as an all-or-nothing, which is not what we want to do. So, I thought
about two ways we could go about doing this:


1.  The pragmatic way

Add a cost "TCC_Impossible = AllOnes" to TCC and, in ARM's cost model,
check whether fast-math is set on FP ALU operations, returning that
cost if it is not. VFP costs would then always be lower than NEON
costs divided by their widths.

This would make any vectorization beyond VFP instructions impossible
if fast-math is not chosen, while still using VFP instructions in the
loop, making it slightly faster.
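
A minimal sketch of how such a cost could behave, assuming a target
whose SIMD unit is not IEEE compliant. Every name below (FPOp,
getFPArithCost, isVectorWidthProfitable) is hypothetical and only
illustrates the idea; none of it is LLVM's actual TTI interface.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for cost constants; not LLVM's real definitions.
constexpr uint64_t TCC_Basic = 1;
constexpr uint64_t TCC_Impossible =
    std::numeric_limits<uint64_t>::max(); // "AllOnes"

struct FPOp {
  unsigned Width; // number of lanes; 1 means a scalar (VFP) operation
  bool FastMath;  // whether fast-math was requested for this operation
};

// Cost of one FP ALU operation at the given width.
uint64_t getFPArithCost(const FPOp &Op) {
  assert(Op.Width > 0 && "an operation has at least one lane");
  if (Op.Width == 1)
    return TCC_Basic; // scalar VFP is always allowed
  if (!Op.FastMath)
    return TCC_Impossible; // strict FP must not be lowered to NEON
  return TCC_Basic; // one NEON instruction covers all lanes
}

// The vector form wins only if its per-lane cost beats the scalar cost.
bool isVectorWidthProfitable(const FPOp &VecOp) {
  uint64_t ScalarCost = getFPArithCost({1, VecOp.FastMath});
  uint64_t VectorCost = getFPArithCost(VecOp);
  if (VectorCost == TCC_Impossible)
    return false;
  return VectorCost < ScalarCost * VecOp.Width;
}
```

With these assumptions a 4-wide operation is profitable only when
fast-math is set, and the scalar VFP cost is unaffected either way.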

I'm sceptical about introducing the TCC_Impossible cost, as it seems a
dirty trick. I'm open to other, better solutions.


2.  The thorough way

Add a flag on TableGen on vector instructions meaning IEEE compliance
for the different levels of support. Add a "fall-back" VFP instruction
to each of them (either in TableGen or TTI).

In the vectorizer, on FP ALU cost, add a check on fast-math && IEEE
conformance. If failed, check the fall-back instruction's width and
add the cost as that * Width/FallBackWidth.
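
The cost rule above could be sketched roughly as follows. The
VectorInstrInfo record and its fields are hypothetical, standing in
for whatever TableGen or TTI would actually expose; the sketch also
assumes the vector width is a whole multiple of the fall-back width.

```cpp
#include <cassert>

// Hypothetical per-instruction record; not generated by real TableGen.
struct VectorInstrInfo {
  unsigned Width;         // lanes in the NEON instruction
  bool IEEECompliant;     // the proposed IEEE-compliance flag
  unsigned Cost;          // cost of the NEON instruction itself
  unsigned FallBackWidth; // lanes handled by one fall-back VFP instruction
  unsigned FallBackCost;  // cost of one fall-back VFP instruction
};

// Cost of an FP ALU operation, honouring the fast-math && IEEE check.
unsigned getFPCost(const VectorInstrInfo &I, bool FastMath) {
  assert(I.FallBackWidth > 0 && I.Width % I.FallBackWidth == 0);
  if (FastMath || I.IEEECompliant)
    return I.Cost; // the vector instruction may be used directly
  // Otherwise pay for the unrolled fall-back instructions.
  return I.FallBackCost * (I.Width / I.FallBackWidth);
}
```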

In the back-end, when emitting vector instructions, add the same check
and emit (unroll) the NEON instructions into similar VFP ones, by
checking its fall-back instruction.

This approach has the benefit of validating IEEE compliance at the
instruction level, thus working for any other "vectorizer" out there,
including out-of-tree ones (though this benefit is very limited).

But it also can change code that it shouldn't, like inline asm or
intrinsics. I have no solution to this particular problem.

Any thoughts?

cheers,
--renato

James Molloy via llvm-dev

2016-Feb-08 16:33 UTC

Hi Renato,

I think it’s important to distinguish between the loop vectorizer and
already-existing vector IR + the SLP vectorizer here.

The loop vectorizer does indeed require -ffast-math, but the IEEE-nonconformant
transforms it does are far greater than using an ISA which may FTZ. It needs
-ffast-math because any FP reductions necessarily have their execution order
shuffled, due to executing some of them in parallel and reducing to scalar at
the end. Therefore the LV doesn’t need to be changed - it will only work when
“fast” is given and will only emit “fast” vector instructions.
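
To illustrate the reordering, here is a small sketch that sums the same
array once in source order and once the way a 4-lane vectorized
reduction would (per-lane partial sums, reduced to scalar at the end).
The function names and the input data are made up for the example, and
it assumes float arithmetic is evaluated in single precision.

```cpp
#include <cassert>
#include <cstddef>

// Strict source-order reduction, as the scalar loop would compute it.
float sequentialSum(const float *A, std::size_t N) {
  float S = 0.0f;
  for (std::size_t I = 0; I < N; ++I)
    S += A[I];
  return S;
}

// Reduction order of a 4-lane vectorized loop: four independent
// partial sums, combined into a scalar after the loop.
float fourLaneSum(const float *A, std::size_t N) {
  assert(N % 4 == 0 && "this sketch handles whole vectors only");
  float Lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  for (std::size_t I = 0; I < N; I += 4)
    for (std::size_t L = 0; L < 4; ++L)
      Lane[L] += A[I + L];
  return ((Lane[0] + Lane[1]) + Lane[2]) + Lane[3];
}

// Large and small magnitudes interleaved, so rounding depends on order.
const float Input[8] = {1e8f, 1.0f, -1e8f, 1.0f,
                        1e8f, 1.0f, -1e8f, 1.0f};
```

Because float addition is not associative, sequentialSum(Input, 8) and
fourLaneSum(Input, 8) need not agree, which is why the reordered form
is only valid under fast-math.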

The SLP vectoriser however should theoretically take non-fast scalars and
produce non-fast vectors. Similarly people will hand-write vector IR, or
generate it from other frontends.

Because of this, I think it’s important that we shouldn’t change the semantics
of the IR currently. Making vector IR targeting ARM produce scalar instructions
unless a modifier is given will undoubtedly cause problems down the line with
frontends being out of sync or not being updated. Even worse, the symptom of
this would just be “LLVM produces poor code for ARM” / “LLVM’s vector codegen is
terrible for ARM” - performance errata and not conformance. That’s why I think
changing to a full-strict-by-default approach would be bad for the project. It
would also violate the principle of least surprise - I wrote vector instructions
and picked a vector ISA… but they’re being scalarized?

My experience is that the number of people who care about full IEEE
compatibility on ARMv7 hardware is limited, and the set of people who care about
exact ULP constraints even more limited. I think we absolutely should make a
solution that solves PR16274, but I think it would have to be opt-in, not
opt-out.

James

Renato Golin via llvm-dev

2016-Feb-08 19:15 UTC

On 8 February 2016 at 16:33, James Molloy <James.Molloy at arm.com> wrote:
> The loop vectorizer does indeed require -ffast-math, but the
> IEEE-nonconformant transforms it does are far greater than using an ISA
> which may FTZ. It needs -ffast-math because any FP reductions necessarily
> have their execution order shuffled, due to executing some of them in
> parallel and reducing to scalar at the end. Therefore the LV doesn’t need
> to be changed - it will only work when “fast” is given and will only emit
> “fast” vector instructions.

Good point. This seems to be a much more rigorous definition in the
new 2008 standard. Right now, the loop vectorizer produces vector code
without -ffast-math. Are you saying we should disable it altogether
for all architectures that claim to follow the new standard?

Inner loops can be "vectorized" by SLP using only VFP instructions.

The implementation seems to have moved to Inst->hasUnsafeAlgebra(), so
we may need to return false in the legalization phase if the flag is
omitted and any instruction has unsafe algebra.
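
A rough sketch of that legality check, read literally. The Inst struct
and isLegalToVectorize are hypothetical stand-ins and do not reflect
the loop vectorizer's real legality code; HasUnsafeAlgebra merely
models what Inst->hasUnsafeAlgebra() would report.

```cpp
#include <vector>

// Hypothetical model of an instruction in the loop body.
struct Inst {
  bool HasUnsafeAlgebra; // models the result of Inst->hasUnsafeAlgebra()
};

// Refuse to vectorize when the fast-math flag was omitted and any
// instruction in the loop has unsafe algebra.
bool isLegalToVectorize(const std::vector<Inst> &Loop,
                        bool FastMathRequested) {
  if (FastMathRequested)
    return true;
  for (const Inst &I : Loop)
    if (I.HasUnsafeAlgebra)
      return false;
  return true;
}
```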

> The SLP vectoriser however should theoretically take non-fast scalars and
> produce non-fast vectors. Similarly people will hand-write vector IR, or
> generate it from other frontends.

We can't guarantee the semantics of the unsafe-math flag in any IR
that was not generated by a front-end which knows about it. So, it
follows that we'll stop vectorizing their basic blocks, and there
could be some outcry. We need some general consensus if that's what
people want. I don't think we do.

> Because of this, I think it’s important that we shouldn’t change the
> semantics of the IR currently. Making vector IR targeting ARM produce
> scalar instructions unless a modifier is given will undoubtedly cause
> problems down the line with frontends being out of sync or not being
> updated. Even worse, the symptom of this would just be “LLVM produces poor
> code for ARM” / “LLVM’s vector codegen is terrible for ARM” - performance
> errata and not conformance. That’s why I think changing to a
> full-strict-by-default approach would be bad for the project.
> It would also violate the principle of least surprise - I wrote vector
> instructions and picked a vector ISA… but they’re being scalarized?

Right, this argues against marking instructions as unsafe by default
(i.e. my second option). If that's so, I agree with you that it's not
trivial and may create more problems than it solves.

Hand written IR, inline ASM and intrinsics should remain for what they
are. So 16274 is probably a "won't fix"?

> My experience is that the number of people who care about full IEEE
> compatibility on ARMv7 hardware is limited, and the set of people who care
> about exact ULP constraints even more limited. I think we absolutely
> should make a solution that solves PR16274, but I think it would have to
> be opt-in, not opt-out.

And I'm guessing this is related to SLP and others. If so, I agree.

So,

For 16275, the fix is to disable loop vect. for no-fast-math + hasUnsafeAlgebra.

For 16274, disabling NEON emission in SLP would be one way, but we
must avoid any fiddling with inline asm and intrinsics, so I don't
think we should be doing that in any generic way. Certainly not
related to the example, from IR to instruction.

Makes sense?

--renato
