thr3ads.net - llvm dev - [llvm-dev] [RFC] Vector Predication [Feb 2019]

Simon Moll via llvm-dev

2019-Feb-01 09:52 UTC

[llvm-dev] [RFC] Vector Predication

Hi,

On 1/31/19 8:17 PM, Philip Reames wrote:
>
> On 1/31/19 11:03 AM, David Greene wrote:
>> Philip Reames <listmail at philipreames.com> writes:
>>
>>> Question 1 - Why do we need separate mask and lengths? Can't the
>>> length be easily folded into the mask operand?
>>>
>>> e.g. newmask = (<4 x i1>)((i4)%y & ((1 << %L) - 1))
>>> and then pattern matched in the backend if needed
>> I'm a little concerned about how difficult it will be to maintain enough
>> information throughout compilation to be able to match this on a machine
>> with an explicit vector length value.
> Does the hardware *also* have a mask register?  If so, this is a 
> likely minor code quality issue which can be incrementally refined 
> on.  If it doesn't, then I can see your concern.
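
For illustration only (this is not from the original messages): the folding
suggested in Question 1 can be sketched in Python with the mask held as an
integer bit pattern. The function names and the convention that lane 0 is
the least significant bit are assumptions made for this sketch.

```python
def fold_length_into_mask(mask_bits, length, width=4):
    """Combine a lane mask with an explicit vector length.

    Assumption for this sketch: lane i corresponds to bit i of mask_bits,
    and only the first `length` lanes are active.
    """
    if not 0 <= length <= width:
        raise ValueError("length must be between 0 and the vector width")
    # (1 << length) - 1 has exactly the low `length` bits set.
    length_mask = (1 << length) - 1
    return mask_bits & length_mask


def bits_to_lanes(bits, width=4):
    """Expand an integer bit pattern into a per-lane list of booleans."""
    return [bool((bits >> i) & 1) for i in range(width)]


# Lanes 0, 1 and 3 are enabled by the mask, but only 3 lanes are in range.
new_mask = fold_length_into_mask(0b1011, 3)
lanes = bits_to_lanes(new_mask)
```

Note the parentheses: `1 << length - 1` would parse as `1 << (length - 1)`,
which sets a single bit instead of the low `length` bits.
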
>>
>>> Question 2 - Have you explored using selects instead? What practical
>>> problems do you run into which make you believe explicit predication
>>> is required?
>>>
>>> e.g. %sub = fsub <4 x float> %x, %y
>>> %result = select <4 x i1> %M, <4 x float> %sub, undef
>> That is semantically incorrect.  According to IR semantics, the fsub is
>> fully evaluated before the select comes along.  It could trap for
>> elements where %M is 0, whereas a masked intrinsic conveys the proper
>> semantics of masking traps for masked-out elements.  We need intrinsics
>> and eventually (IMHO) fully first-class predication to make this work
>> properly.
>
> If you want specific trap behavior, you need to use the constrained 
> family of intrinsics instead.  In IR, fsub is expected not to trap.
>
> We have an existing solution for modeling FP environment aspects such 
> as rounding and trapping.  The proposed signatures for your EVL 
> proposal do not appear to subsume those, and you've not proposed their 
> retirement.  We definitely don't want *two* ways of describing FP 
> trapping.
>
> In other words, I don't find this reason compelling since my example 
> can simply be rewritten using the appropriate constrained intrinsic.
The existing constrained fp intrinsics do not take a mask or vlen. So, 
you can not have vectorized trapping fp math at the moment (beyond what 
LV can do...).

Masking has advantages even in the default non-trapping fp environment:
It is not uncommon for fp hardware to be slow on denormal values. If you
take the operation + select approach, spurious computation on denormals
could occur, slowing down the program.

If your target has no masked fp ops (SSE, NEON, ...), you can still use
EVL and have the backend lower it to a
"select-safe-inputs-on-masked-off-lanes + fp-operation" pattern. If you
emit that pattern too early, InstCombine etc. might fold it away, not
least because IR optimizations cannot distinguish between a select that
was part of the original program and a select that was inserted to have a
matchable pattern in the backend.
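
As a rough sketch of that lowering pattern (in Python rather than LLVM IR,
and not taken from the EVL patches), assume division as the fp operation so
that a zero divisor plays the role of a problematic input. The helper name
and the choice of 1.0 as the safe value are illustrative assumptions.

```python
def masked_div_via_safe_inputs(xs, ys, mask, passthru=0.0):
    """Emulate a masked element-wise division without a masked divide.

    Hypothetical helper: masked-off lanes receive a safe divisor (1.0)
    before the operation, so the unmasked division never sees the
    original input on those lanes.
    """
    # Step 1: select safe inputs on masked-off lanes.
    safe_ys = [y if m else 1.0 for y, m in zip(ys, mask)]
    # Step 2: run the plain, unmasked operation on every lane.
    quotients = [x / y for x, y in zip(xs, safe_ys)]
    # Step 3: drop the results computed for masked-off lanes.
    return [q if m else passthru for q, m in zip(quotients, mask)]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 0.0, 4.0, 0.0]          # zero divisors sit on masked-off lanes
mask = [True, False, True, False]
result = masked_div_via_safe_inputs(xs, ys, mask)
```

Written out this way, step 1 is just another select. Nothing in it tells an
optimizer that it exists only to protect step 2.
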
>>> My context for these questions is that my experience recently w/o
>>> existing masked intrinsics shows us missing fairly basic
>>> optimizations, precisely because they weren't able to reuse all of the
>>> existing infrastructure. (I've been working on
>>> SimplifyDemandedVectorElts recently for exactly this reason.) My
>>> concern is that your EVL proposal will end up in the same state.
>> I think that's just the nature of the beast.  We need IR-level support
>> for masking and we have to teach LLVM about it.
> I'm solidly of the opinion that we already *have* IR support for
> explicit masking in the form of gather/scatter/etc...  Until someone
> has taken the effort to make masking in this context *actually work
> well*, I'm unconvinced that we should greatly expand the usage in the IR.
What do you mean by "make masking *work well*"? LLVM's vectorization
support is stuck in ~2007 (SSE, ...) with patched-in intrinsics to
support masked load/store and gather/scatter on AVX2.

I think this is a chicken-and-egg problem: LLVM's LoopVectorizer is
rather limited, and that is used to argue that better IR support for
predication is not necessary. However, if we had better IR support, more
aggressive vectorization schemes would be possible. Right now, people who
are serious about exploiting a SIMD ISA use target-specific
intrinsics to get the functionality they need.
>>
>>                             -David
-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll

Philip Reames via llvm-dev

2019-Feb-05 00:38 UTC

[llvm-dev] [RFC] Vector Predication

On 2/1/19 1:52 AM, Simon Moll wrote:
> Hi,
>
> On 1/31/19 8:17 PM, Philip Reames wrote:
>>
>> On 1/31/19 11:03 AM, David Greene wrote:
>>> Philip Reames <listmail at philipreames.com> writes:
>>>
>>>> Question 1 - Why do we need separate mask and lengths? Can't the
>>>> length be easily folded into the mask operand?
>>>>
>>>> e.g. newmask = (<4 x i1>)((i4)%y & ((1 << %L) - 1))
>>>> and then pattern matched in the backend if needed
>>> I'm a little concerned about how difficult it will be to maintain
>>> enough information throughout compilation to be able to match this on
>>> a machine with an explicit vector length value.
>> Does the hardware *also* have a mask register?  If so, this is a 
>> likely minor code quality issue which can be incrementally refined 
>> on.  If it doesn't, then I can see your concern.
>>>
>>>> Question 2 - Have you explored using selects instead? What practical
>>>> problems do you run into which make you believe explicit predication
>>>> is required?
>>>>
>>>> e.g. %sub = fsub <4 x float> %x, %y
>>>> %result = select <4 x i1> %M, <4 x float> %sub, undef
>>> That is semantically incorrect.  According to IR semantics, the fsub is
>>> fully evaluated before the select comes along.  It could trap for
>>> elements where %M is 0, whereas a masked intrinsic conveys the proper
>>> semantics of masking traps for masked-out elements.  We need intrinsics
>>> and eventually (IMHO) fully first-class predication to make this work
>>> properly.
>>
>> If you want specific trap behavior, you need to use the constrained 
>> family of intrinsics instead.  In IR, fsub is expected not to trap.
>>
>> We have an existing solution for modeling FP environment aspects such 
>> as rounding and trapping.  The proposed signatures for your EVL 
>> proposal do not appear to subsume those, and you've not proposed 
>> their retirement.  We definitely don't want *two* ways of describing
>> FP trapping.
>>
>> In other words, I don't find this reason compelling since my example
>> can simply be rewritten using the appropriate constrained intrinsic.
>
> The existing constrained fp intrinsics do not take a mask or vlen. So, 
> you can not have vectorized trapping fp math at the moment (beyond 
> what LV can do...).
>
> Masking has advantages even in the default non-trapping fp
> environment: It is not uncommon for fp hardware to be slow on denormal
> values. If you take the operation + select approach, spurious
> computation on denormals could occur, slowing down the program.

Why?  If your backend has support for predicated fsub, you'd just
pattern match for that.

>
> If your target has no masked fp ops (SSE, NEON, ...), you can still use
> EVL and have the backend lower it to a
> "select-safe-inputs-on-masked-off-lanes + fp-operation" pattern. If
> you emit that pattern too early, InstCombine etc. might fold it away,
> not least because IR optimizations cannot distinguish between a select
> that was part of the original program and a select that was inserted
> to have a matchable pattern in the backend.

"InstCombine might fold it away" is the entire point of a generic
optimizer.  That's what we want to preserve, not limit.

>
>>>> My context for these questions is that my experience recently w/o
>>>> existing masked intrinsics shows us missing fairly basic
>>>> optimizations, precisely because they weren't able to reuse all of
>>>> the existing infrastructure. (I've been working on
>>>> SimplifyDemandedVectorElts recently for exactly this reason.) My
>>>> concern is that your EVL proposal will end up in the same state.
>>> I think that's just the nature of the beast.  We need IR-level support
>>> for masking and we have to teach LLVM about it.
>> I'm solidly of the opinion that we already *have* IR support for
>> explicit masking in the form of gather/scatter/etc...  Until someone
>> has taken the effort to make masking in this context *actually work
>> well*, I'm unconvinced that we should greatly expand the usage in the
>> IR.
>
> What do you mean by "make masking *work well*"?

Well, let's start with basic support in places like InstSimplify,
ConstantFolding, SelectionDAG, InstCombine, DSE, GVN, etc...

If we actually implement EVL, then we'll need all of that work anyways, 
so why not prototype it now with the existing masked instructions?

> LLVM's vectorization support is stuck in ~2007 (SSE, ...) with
> patched-in intrinsics to support masked load/store and gather/scatter
> on AVX2.
>
> I think this is a chicken-and-egg problem: LLVM's LoopVectorizer is
> rather limited, and that is used to argue that better IR support for
> predication is not necessary. However, if we had better IR support,
> more aggressive vectorization schemes would be possible. Right now,
> people who are serious about exploiting a SIMD ISA use target-specific
> intrinsics to get the functionality they need.
This is bordering on a religious war I have no interest in getting
into.  I'll simply say I don't agree w/your perspective and leave it there.

Philip

David Greene via llvm-dev

2019-Feb-07 17:28 UTC

[llvm-dev] [RFC] Vector Predication

Philip Reames <listmail at philipreames.com> writes:
>> Masking has advantages even in the default non-trapping fp
>> environment: It is not uncommon for fp hardware to be slow on
>> denormal values. If you take the operation + select approach,
>> spurious computation on denormals could occur, slowing down the
>> program.
> Why?  If your backend has support for predicated fsub, you'd just
> pattern match for that.
It's not that simple.  Often the IR gets mangled so badly during
optimization that the pattern is no longer recognizable.  I've fixed
bugs in LLVM where use of select to implement predication was causing
traps because instcombine or something else lifted one of the operands
of the select beyond a point where isel could match it.

select is not semantically equivalent to predication and there is no way
to force it to be without drastically changing the IR specification.
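
The difference can be sketched outside of LLVM. The following Python
analogy is an illustration only, not LLVM IR: it uses division, with
ZeroDivisionError standing in for a hardware trap, and both function names
are made up for this sketch. Whether a plain IR fsub may trap at all is the
point under dispute earlier in this thread, so treat the exception purely
as a stand-in.

```python
def div_then_select(xs, ys, mask, passthru=0.0):
    """Operation + select: every lane is computed, then the mask is applied."""
    # The division runs on all lanes, including masked-off ones, so a zero
    # divisor on a masked-off lane still raises ZeroDivisionError.
    quotients = [x / y for x, y in zip(xs, ys)]
    return [q if m else passthru for q, m in zip(quotients, mask)]


def predicated_div(xs, ys, mask, passthru=0.0):
    """Predication: masked-off lanes are never computed at all."""
    return [x / y if m else passthru for x, y, m in zip(xs, ys, mask)]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 0.0, 4.0, 0.0]          # zero divisors sit on masked-off lanes
mask = [True, False, True, False]

try:
    div_then_select(xs, ys, mask)
    select_trapped = False
except ZeroDivisionError:
    select_trapped = True

predicated = predicated_div(xs, ys, mask)
```

With these inputs the select version raises before the mask is ever
consulted, while the predicated version completes and leaves the passthru
value in the masked-off lanes.
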

                            -David
