Philip Reames <listmail at philipreames.com> writes:

> Question 1 - Why do we need separate mask and lengths? Can't the
> length be easily folded into the mask operand?
>
> e.g. newmask = (<4 x i1>)((i4)%y & (1 << %L - 1))
> and then pattern matched in the backend if needed

I'm a little concerned about how difficult it will be to maintain enough
information throughout compilation to be able to match this on a machine
with an explicit vector length value.

> Question 2 - Have you explored using selects instead? What practical
> problems do you run into which make you believe explicit predication
> is required?
>
> e.g. %sub = fsub <4 x float> %x, %y
>      %result = select <4 x i1> %M, <4 x float> %sub, undef

That is semantically incorrect. According to IR semantics, the fsub is
fully evaluated before the select comes along. It could trap for
elements where %M is 0, whereas a masked intrinsic conveys the proper
semantics of masking traps for masked-out elements (see the sketch
below). We need intrinsics and eventually (IMHO) fully first-class
predication to make this work properly.

> My context for these questions is that my experience recently w/
> existing masked intrinsics shows us missing fairly basic
> optimizations, precisely because they weren't able to reuse all of
> the existing infrastructure. (I've been working on
> SimplifyDemandedVectorElts recently for exactly this reason.) My
> concern is that your EVL proposal will end up in the same state.

I think that's just the nature of the beast. We need IR-level support
for masking and we have to teach LLVM about it.

                      -David
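For reference, the masked form being argued for here would look roughly
like the following sketch; the intrinsic name and exact signature are
only illustrative, the real ones are defined in the EVL RFC:

    ; Hypothetical EVL/masked form of the fsub above: lanes where %M is
    ; false (or which lie beyond the vector length %L) are never
    ; evaluated, so they cannot trap.
    %result = call <4 x float> @llvm.evl.fsub.v4f32(<4 x float> %x,
                                                    <4 x float> %y,
                                                    <4 x i1> %M,
                                                    i32 %L)

In the fsub-plus-select form, by contrast, the fsub is an ordinary
instruction that is evaluated on all four lanes before the select picks
the live ones.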
On 1/31/19 11:03 AM, David Greene wrote:
> Philip Reames <listmail at philipreames.com> writes:
>
>> Question 1 - Why do we need separate mask and lengths? Can't the
>> length be easily folded into the mask operand?
>>
>> e.g. newmask = (<4 x i1>)((i4)%y & (1 << %L - 1))
>> and then pattern matched in the backend if needed
> I'm a little concerned about how difficult it will be to maintain enough
> information throughout compilation to be able to match this on a machine
> with an explicit vector length value.

Does the hardware *also* have a mask register? If so, this is likely a
minor code quality issue which can be incrementally refined. If it
doesn't, then I can see your concern.

>> Question 2 - Have you explored using selects instead? What practical
>> problems do you run into which make you believe explicit predication
>> is required?
>>
>> e.g. %sub = fsub <4 x float> %x, %y
>>      %result = select <4 x i1> %M, <4 x float> %sub, undef
> That is semantically incorrect. According to IR semantics, the fsub is
> fully evaluated before the select comes along. It could trap for
> elements where %M is 0, whereas a masked intrinsic conveys the proper
> semantics of masking traps for masked-out elements. We need intrinsics
> and eventually (IMHO) fully first-class predication to make this work
> properly.

If you want specific trap behavior, you need to use the constrained
family of intrinsics instead. In IR, fsub is expected not to trap.

We have an existing solution for modeling FP environment aspects such as
rounding and trapping. The proposed signatures for your EVL proposal do
not appear to subsume those, and you've not proposed their retirement.
We definitely don't want *two* ways of describing FP trapping.

In other words, I don't find this reason compelling, since my example
can simply be rewritten using the appropriate constrained intrinsic (see
the sketch below).

>> My context for these questions is that my experience recently w/
>> existing masked intrinsics shows us missing fairly basic
>> optimizations, precisely because they weren't able to reuse all of
>> the existing infrastructure. (I've been working on
>> SimplifyDemandedVectorElts recently for exactly this reason.) My
>> concern is that your EVL proposal will end up in the same state.
> I think that's just the nature of the beast. We need IR-level support
> for masking and we have to teach LLVM about it.

I'm solidly of the opinion that we already *have* IR support for
explicit masking in the form of gather/scatter/etc... Until someone has
taken the effort to make masking in this context *actually work well*,
I'm unconvinced that we should greatly expand the usage in the IR.

>
> -David
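As an illustration of the rewrite being suggested here, using the
existing constrained intrinsics (the vector overload and the metadata
arguments shown are just one possible choice, not taken from the RFC):

    ; Constrained fsub: the FP-environment interaction (rounding mode,
    ; exception behavior) is explicit, and the operation is not assumed
    ; to be free of side effects. The select still discards the
    ; masked-off lanes afterwards.
    %sub = call <4 x float> @llvm.experimental.constrained.fsub.v4f32(
               <4 x float> %x, <4 x float> %y,
               metadata !"round.dynamic",
               metadata !"fpexcept.strict")
    %result = select <4 x i1> %M, <4 x float> %sub, <4 x float> undef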
>> Question 2 - Have you explored using selects instead? What practical
>> problems do you run into which make you believe explicit predication
>> is required?
>>
>> e.g. %sub = fsub <4 x float> %x, %y
>>      %result = select <4 x i1> %M, <4 x float> %sub, undef
>
> That is semantically incorrect. According to IR semantics, the fsub is
> fully evaluated before the select comes along. It could trap for
> elements where %M is 0, whereas a masked intrinsic conveys the proper
> semantics of masking traps for masked-out elements. We need intrinsics
> and eventually (IMHO) fully first-class predication to make this work
> properly.

The LLVM language reference says, "The default LLVM floating-point
environment assumes that floating-point instructions do not have side
effects." So that's why this situation has been tolerated. As you're
probably aware, we have work in progress to enable a mode where the
default FP environment is not assumed and we properly handle FP status
flags and exception unmasking. This will absolutely require masked
versions of the operations.

I personally like the idea of having masked operations in the general
case, as opposed to using selects and hoping the selection DAG will
pick the right instructions, because it doesn't always work out that
way. But I suppose that needs to be weighed against whatever
optimization opportunities are missed because of the less general
representation. I agree that we should be able to mitigate this by
teaching the optimizer to handle masked operations.

-Andy
On Thu, 31 Jan 2019 at 20:17, Philip Reames via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> On 1/31/19 11:03 AM, David Greene wrote:
> > Philip Reames <listmail at philipreames.com> writes:
> >
> >> Question 1 - Why do we need separate mask and lengths? Can't the
> >> length be easily folded into the mask operand?
> >>
> >> e.g. newmask = (<4 x i1>)((i4)%y & (1 << %L - 1))
> >> and then pattern matched in the backend if needed
> > I'm a little concerned about how difficult it will be to maintain enough
> > information throughout compilation to be able to match this on a machine
> > with an explicit vector length value.
> Does the hardware *also* have a mask register? If so, this is likely a
> minor code quality issue which can be incrementally refined. If it
> doesn't, then I can see your concern.

Masking/predication is supported nearly universally, but I don't think
the code quality issue is minor. It would be on a typical packed-SIMD
machine with 128/256/512-bit registers, but the processors with a vector
length register are usually built with much larger register files and
without a corresponding increase in the number of functional units. For
example, 4096 bits per vector register is really quite modest for this
kind of machine, while the data path can reasonably be "only" 128 or 256
bits. This changes the calculus quite a bit: vector lengths much shorter
than, or only minimally larger than, one full register are suddenly
reasonably common (in application code, not so much in HPC kernels), and
because each vector instruction is split into many data-path-sized uops,
it is trivial and very rewarding to cut processing short halfway through
a vector. The efficiency of "short vector code" then depends on the
ability to finish each operation on those short vectors relatively
quickly, rather than padding everything to a full vector register.

For example, if a loop with a trip count of 20 is vectorized on a
machine with 64 elements per vector (that's 64-bit elements in a 4096-bit
register, so this is lowballing it!), using only masks and not the
vector length register (roughly the IR sketched below) makes your vector
unit do about three times more work than it would have to if you set the
vector length register to 20. That keeps the register file and
functional units busy for no good reason. Some microarchitectures take
on the burden of determining when a whole chunk of the vector is masked
out and can then skip over it quickly, but many others don't. So you're
likely burning a whole bunch of power, and quite possibly taking up
cycles that could be filled with useful work from other instructions
instead.

Cheers,
Robin
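To make this concrete, the mask-only encoding of a vector length that
Question 1 suggests looks roughly like the IR below (shown at 8 elements
for brevity; at 64 elements the pattern is identical, just wider -
illustrative IR, not code from the proposal):

    ; Build a mask enabling the first %L lanes. After this point the
    ; length only exists implicitly in %mask, and a backend targeting a
    ; vector-length register has to pattern-match it back out.
    %L.ins   = insertelement <8 x i32> undef, i32 %L, i32 0
    %L.splat = shufflevector <8 x i32> %L.ins, <8 x i32> undef,
                             <8 x i32> zeroinitializer
    %mask    = icmp ult <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                                   i32 4, i32 5, i32 6, i32 7>,
                        %L.splat

Every instruction that consumes %mask still nominally operates on all
lanes, which is exactly the extra work described above for the
trip-count-20 case.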
Hi,

On 1/31/19 8:17 PM, Philip Reames wrote:
>
> On 1/31/19 11:03 AM, David Greene wrote:
>> Philip Reames <listmail at philipreames.com> writes:
>>
>>> Question 1 - Why do we need separate mask and lengths? Can't the
>>> length be easily folded into the mask operand?
>>>
>>> e.g. newmask = (<4 x i1>)((i4)%y & (1 << %L - 1))
>>> and then pattern matched in the backend if needed
>> I'm a little concerned about how difficult it will be to maintain enough
>> information throughout compilation to be able to match this on a machine
>> with an explicit vector length value.
> Does the hardware *also* have a mask register? If so, this is likely a
> minor code quality issue which can be incrementally refined. If it
> doesn't, then I can see your concern.
>
>>> Question 2 - Have you explored using selects instead? What practical
>>> problems do you run into which make you believe explicit predication
>>> is required?
>>>
>>> e.g. %sub = fsub <4 x float> %x, %y
>>>      %result = select <4 x i1> %M, <4 x float> %sub, undef
>> That is semantically incorrect. According to IR semantics, the fsub is
>> fully evaluated before the select comes along. It could trap for
>> elements where %M is 0, whereas a masked intrinsic conveys the proper
>> semantics of masking traps for masked-out elements. We need intrinsics
>> and eventually (IMHO) fully first-class predication to make this work
>> properly.
>
> If you want specific trap behavior, you need to use the constrained
> family of intrinsics instead. In IR, fsub is expected not to trap.
>
> We have an existing solution for modeling FP environment aspects such
> as rounding and trapping. The proposed signatures for your EVL
> proposal do not appear to subsume those, and you've not proposed their
> retirement. We definitely don't want *two* ways of describing FP
> trapping.
>
> In other words, I don't find this reason compelling since my example
> can simply be rewritten using the appropriate constrained intrinsic.

The existing constrained fp intrinsics do not take a mask or vlen. So
you cannot have vectorized trapping fp math at the moment (beyond what
LV can do...).

Masking has advantages even in the default non-trapping fp environment:
it is not uncommon for fp hardware to be slow on denormal values. If you
take the operation + select approach, spurious computation on denormals
could occur, slowing down the program.

If your target has no masked fp ops (SSE, NEON, ..), you can still use
EVL and have the backend lower it to a
"select-safe-inputs-on-masked-off-lanes + fp-operation" pattern (a rough
sketch is at the end of this message). If you emit that pattern too
early, InstCombine etc. might fold it away, also because IR
optimizations cannot distinguish between a select that was part of the
original program and a select that was inserted to have a matchable
pattern in the backend.

>>> My context for these questions is that my experience recently w/
>>> existing masked intrinsics shows us missing fairly basic
>>> optimizations, precisely because they weren't able to reuse all of
>>> the existing infrastructure. (I've been working on
>>> SimplifyDemandedVectorElts recently for exactly this reason.) My
>>> concern is that your EVL proposal will end up in the same state.
>> I think that's just the nature of the beast. We need IR-level support
>> for masking and we have to teach LLVM about it.
> I'm solidly of the opinion that we already *have* IR support for
> explicit masking in the form of gather/scatter/etc... Until someone
> has taken the effort to make masking in this context *actually work
> well*, I'm unconvinced that we should greatly expand the usage in the
> IR.

What do you mean by "make masking *work well*"? LLVM's vectorization
support is stuck in ~2007 (SSE, ..) with patched-in intrinsics to
support masked load/store and gather/scatter on AVX2.

I think this is a chicken-and-egg problem: LLVM's LoopVectorizer is
rather limited and is used to argue that better IR support for
predication is not necessary. However, if we had better IR support, more
aggressive vectorization schemes would be possible; right now, people
who are serious about exploiting a SIMD ISA use target-specific
intrinsics to get the functionality they need.

>>
>> -David

--
Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
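A minimal sketch of the "select safe inputs on masked-off lanes, then
run the unmasked fp operation" lowering described above, for a target
without masked fp ops; the choice of 0.0 as the safe input and the types
are illustrative only:

    ; Replace masked-off lanes of the inputs with values that cannot
    ; trap or hit the denormal slow path, then run a plain fsub; the
    ; masked-off lanes of %sub are don't-care values.
    %safe.x = select <4 x i1> %M, <4 x float> %x, <4 x float> zeroinitializer
    %safe.y = select <4 x i1> %M, <4 x float> %y, <4 x float> zeroinitializer
    %sub    = fsub <4 x float> %safe.x, %safe.y

As noted above, emitting this form early is fragile: InstCombine cannot
tell these selects apart from selects that were in the original program,
so the pattern may not survive intact to instruction selection.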