thr3ads.net - llvm dev - [llvm-dev] [RFC] Vector Predication [Feb 2019]


Robin Kruppe via llvm-dev

2019-Jan-31 21:14 UTC

[llvm-dev] [RFC] Vector Predication

On Thu, 31 Jan 2019 at 20:17, Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> On 1/31/19 11:03 AM, David Greene wrote:
> > Philip Reames <listmail at philipreames.com> writes:
> >
> >> Question 1 - Why do we need separate mask and lengths? Can't the
> >> length be easily folded into the mask operand?
> >>
> >> e.g. newmask = (<4 x i1>)((i4)%y & (1 << %L -1))
> >> and then pattern matched in the backend if needed
> > I'm a little concerned about how difficult it will be to maintain enough
> > information throughout compilation to be able to match this on a machine
> > with an explicit vector length value.
> Does the hardware *also* have a mask register?  If so, this is a likely
> minor code quality issue which can be incrementally refined on.  If it
> doesn't, then I can see your concern.
>
Masking/predication is supported nearly universally, but I don't think the
code quality issue is minor. It would be on a typical packed-SIMD machine
with 128/256/512 bit registers, but the processors with a vector length
register are usually built with much larger register files and without a
corresponding increase in the number of functional units. For example, 4096
bits per vector register is really quite modest for this kind of machine,
while the data path can reasonably be "only" 128 or 256 bits wide.

This changes the calculus quite a bit: vector lengths much shorter or
minimally larger than one full register are suddenly reasonably common (in
application code, not so much in HPC kernels) and because each vector
instruction is split into many data-path-sized uops, it's trivial and very
rewarding to cut processing short halfway through a vector. The efficiency
of "short vector code" then depends on the ability to finish each operation
on those short vectors relatively quickly rather than padding everything to
a full vector register.

For example, if a loop with a trip count of 20 is vectorized on a machine
with 64 elements per vector (that's 64b elements in a 4096b register, so
this is lowballing it!), using only masks and not the vector length
register makes your vector unit do about three times more work than it
would have to if you set the vector length register to 20. That keeps the
register file and functional units busy for no good reason. Some
microarchitectures take on the burden of determining when a whole chunk of
the vector is masked out and can then skip over it quickly, but many others
don't. So you're likely burning a whole bunch of power and quite possibly
taking up cycles that could be filled with useful work from other
instructions instead.
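
The arithmetic behind that "about three times" figure can be sketched as
follows. This is a back-of-the-envelope model, not a description of any
specific machine: it assumes a 256-bit data path (one of the two widths
mentioned above), that each vector instruction is split into
data-path-sized uops, and that a mask-only implementation walks the whole
register while a length-aware one stops after the last active element. The
name `uops_needed` is made up for illustration.

```python
import math


def uops_needed(active_elements: int, element_bits: int, datapath_bits: int) -> int:
    """Number of data-path-sized uops needed to cover `active_elements` lanes."""
    lanes_per_uop = datapath_bits // element_bits
    return math.ceil(active_elements / lanes_per_uop)


# Figures taken from the example: 64-bit elements, 4096-bit registers, trip count 20.
ELEMENT_BITS = 64
REGISTER_BITS = 4096
TRIP_COUNT = 20
DATAPATH_BITS = 256  # assumption: the text only says "128 or 256 bit"

elements_per_register = REGISTER_BITS // ELEMENT_BITS  # 64 lanes

# Mask only: every lane of the register is processed, active or not.
mask_only = uops_needed(elements_per_register, ELEMENT_BITS, DATAPATH_BITS)
# Vector length register set to 20: processing stops after lane 19.
with_vl = uops_needed(TRIP_COUNT, ELEMENT_BITS, DATAPATH_BITS)

print(mask_only, with_vl, mask_only / with_vl)  # 16 5 3.2
```

With a 128-bit data path the counts become 32 and 10 uops, so the ratio
stays the same in this simple model.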

Cheers,
Robin
> >> Question 2 - Have you explored using selects instead? What practical
> >> problems do you run into which make you believe explicit predication
> >> is required?
> >>
> >> e.g. %sub = fsub <4 x float> %x, %y
> >> %result = select <4 x i1> %M, <4 x float> %sub, undef
> > That is semantically incorrect.  According to IR semantics, the fsub is
> > fully evaluated before the select comes along.  It could trap for
> > elements where %M is 0, whereas a masked intrinsic conveys the proper
> > semantics of masking traps for masked-out elements.  We need intrinsics
> > and eventually (IMHO) fully first-class predication to make this work
> > properly.
>
> If you want specific trap behavior, you need to use the constrained
> family of intrinsics instead.  In IR, fsub is expected not to trap.
>
> We have an existing solution for modeling FP environment aspects such as
> rounding and trapping.  The proposed signatures for your EVL proposal do
> not appear to subsume those, and you've not proposed their retirement.
> We definitely don't want *two* ways of describing FP trapping.
>
> In other words, I don't find this reason compelling since my example can
> simply be rewritten using the appropriate constrained intrinsic.
>
>
> >
> >> My context for these questions is that my experience recently w/o
> >> existing masked intrinsics shows us missing fairly basic
> >> optimizations, precisely because they weren't able to reuse all of the
> >> existing infrastructure. (I've been working on
> >> SimplifyDemandedVectorElts recently for exactly this reason.) My
> >> concern is that your EVL proposal will end up in the same state.
> > I think that's just the nature of the beast.  We need IR-level support
> > for masking and we have to teach LLVM about it.
> I'm solidly of the opinion that we already *have* IR support for
> explicit masking in the form of gather/scatter/etc...  Until someone has
> taken the effort to make masking in this context *actually work well*,
> I'm unconvinced that we should greatly expand the usage in the IR.
> >
> >                             -David
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Philip Reames via llvm-dev

2019-Feb-01 00:04 UTC

[llvm-dev] [RFC] Vector Predication

On 1/31/19 1:14 PM, Robin Kruppe wrote:
> Masking/predication is supported nearly universally, but I don't think
> the code quality issue is minor. It would be on a typical packed-SIMD
> machine with 128/256/512 bit registers, but the processors with a
> vector length register are usually built with much larger register
> files and without a corresponding increase in the number of functional
> units. For example, 4096 bits per vector register is really quite
> modest for this kind of machine, while the data path can reasonably be
> "only" 128 or 256 bits wide.
>
> This changes the calculus quite a bit: vector lengths much shorter or
> minimally larger than one full register are suddenly reasonably common
> (in application code, not so much in HPC kernels) and because each
> vector instruction is split into many data-path-sized uops, it's
> trivial and very rewarding to cut processing short halfway through a
> vector. The efficiency of "short vector code" then depends on the
> ability to finish each operation on those short vectors relatively
> quickly rather than padding everything to a full vector register.
>
> For example, if a loop with a trip count of 20 is vectorized on a 
> machine with 64 elements per vector (that's 64b elements in a 4096b 
> register, so this is lowballing it!), using only masks and not the 
> vector length register makes your vector unit do about three times 
> more work than it would have to if you set the vector length register 
> to 20. That keeps the register file and functional units busy for no 
> good reason. Some microarchitectures take on the burden of determining 
> when a whole chunk of the vector is masked out and can then skip over 
> it quickly, but many others don't. So you're likely burning a whole
> bunch of power and quite possibly taking up cycles that could be 
> filled with useful work from other instructions instead.
Thank you for the explanation.

Do such architectures frequently have arithmetic operations on the mask
registers?  (i.e. can I reasonably compute a conservative length given a
mask register value)  If I can, then having a mask as the canonical form
and re-deriving the length register from a mask for a sequence of
instructions which share a predicate seems fairly reasonable.  Note that
I'm assuming this as a fallback, and that the common case is handled via
the equivalent of ComputeKnownBits on the mask itself at compile time.
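
To make the question concrete, here is one reading of "a conservative
length given a mask register value": the index of the last active lane plus
one. Any vector length at least that large still covers every active lane,
so the mask can keep doing the fine-grained work. The sketch models a mask
as a list of booleans; `conservative_length` is a hypothetical helper name,
not an LLVM or hardware API.

```python
from typing import Sequence


def conservative_length(mask: Sequence[bool]) -> int:
    """Smallest length that still covers every active lane of `mask`."""
    for index in range(len(mask) - 1, -1, -1):
        if mask[index]:
            return index + 1
    return 0  # no active lanes at all


# A prefix mask (first three lanes active): the derived length is exact.
prefix = [True, True, True, False, False, False, False, False]
# A mask with holes: the length over-approximates, the mask handles the rest.
sparse = [True, False, True, False, False, True, False, False]

print(conservative_length(prefix))  # 3
print(conservative_length(sparse))  # 6
```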

The only case that the combination of CKB (ComputeKnownBits) and a dynamic
mask->length fallback wouldn't handle reliably is when we have a mask
loaded from an external source (memory, function call boundary, etc...)
and a short sequence of vector ops.  Are such cases really common enough
that they need to be a first-class element of the design?


p.s. To make sure my tone is coming across correctly, let me spell out
that I'm not convinced, but I'm not actively objecting. I'm playing
devil's advocate for the purposes of fleshing out a design, but if folks
more knowledgeable than I strongly believe the right design requires
both masks and lengths, I'm happy to defer on that point.


Saito, Hideki via llvm-dev

2019-Feb-01 00:31 UTC

[llvm-dev] [RFC] Vector Predication

> when we have a mask loaded from an external source (memory, function call
> boundary, etc...) and a short sequence of vector ops

A mask value passed in as a function call parameter is common. An OpenMP
declare simd function does exactly that for the masked cases.


Bruce Hoult via llvm-dev

2019-Feb-01 00:57 UTC

[llvm-dev] [RFC] Vector Predication

On Thu, Jan 31, 2019 at 4:05 PM Philip Reames via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Do such architectures frequently have arithmetic operations on the mask
> registers? (i.e. can I reasonably compute a conservative length given a mask
> register value) If I can, then having a mask as the canonical form and
> re-deriving the length register from a mask for a sequence of instructions
> which share a predicate seems fairly reasonable. Note that I'm assuming this
> as a fallback, and that the common case is handled via the equivalent of
> ComputeKnownBits on the mask itself at compile time.

If masking is used (which it is usually not for loops without control
flow inside the vectorised loop) then, yes, logical operations on the
mask registers will happen at every basic block boundary.

But it is NOT the case that you can compute the active vector length
VL from an initial mask value. The active vector length is set by the
hardware based on the remaining application vector length. The VL can
change for each loop iteration -- the normal pattern is for VL to
equal VLMAX for initial executions of the loop, and then be less than
VLMAX for the final one or two iterations of the loop. For example if
VLMAX is 16 and there are 19 elements left in the application vector
then the hardware might choose to use 10 elements for the 2nd to last
iteration and 9 elements for the last iteration. Or not. Other
hardware might choose to perform the last three iterations as 12/12/11
instead of 16/10/9. (It is constrained to be monotonic).
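
A small model of that loop structure: the "hardware" is a function that
picks VL from the number of remaining elements, and the loop keeps asking
until nothing is left. Both policies below are illustrative assumptions (a
greedy one, and one that splits the tail evenly once fewer than 2*VLMAX
elements remain); real hardware may choose differently within the rules of
its ISA. The total of 35 elements is inferred from the 16/10/9 example, and
all function names are hypothetical.

```python
import math
from typing import Callable, List


def greedy_vl(remaining: int, vlmax: int) -> int:
    """Always take as many elements as fit in one register."""
    return min(remaining, vlmax)


def balanced_vl(remaining: int, vlmax: int) -> int:
    """Split the tail evenly when between one and two registers' worth remain."""
    if remaining >= 2 * vlmax or remaining <= vlmax:
        return min(remaining, vlmax)
    return math.ceil(remaining / 2)


def strip_mine(total: int, vlmax: int, choose_vl: Callable[[int, int], int]) -> List[int]:
    """Return the VL used by each iteration of a strip-mined loop."""
    lengths = []
    remaining = total
    while remaining > 0:
        vl = choose_vl(remaining, vlmax)  # stand-in for the hardware setting VL
        lengths.append(vl)
        remaining -= vl
    return lengths


print(strip_mine(35, 16, greedy_vl))    # [16, 16, 3]
print(strip_mine(35, 16, balanced_vl))  # [16, 10, 9]
```

In both cases VL depends only on the remaining element count, never on the
contents of a mask.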

VL can also be dynamically shortened in the middle of a loop iteration
by an unaligned vector load that crosses a protection boundary if the
later elements are inaccessible.

I'm curious what SVE will do if there is an if/then/else in the middle
of a vectorised loop with a shorter-than-maximum vector length. You
can't just invert the mask when going from the then-part to the
else-part because that would re-enable elements past the end of the
vector. You'd need to invert the mask and then AND it with the mask
containing the (bitwise representation of the) vector length.
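
The fix-up described in the last sentence, written out on small integers
used as bitmasks (bit i stands for lane i). This only illustrates the bit
manipulation; it makes no claim about what SVE or any other ISA actually
does, and the names are invented for the example.

```python
def length_mask(vl: int) -> int:
    """Bitmask with the low `vl` bits set: lanes 0..vl-1 lie inside the vector."""
    return (1 << vl) - 1


def else_mask(then_mask: int, vl: int, vlmax: int) -> int:
    """Lanes that take the else-branch: not in `then_mask`, but still below `vl`."""
    all_lanes = (1 << vlmax) - 1
    inverted = ~then_mask & all_lanes  # plain inversion, limited to the register width
    return inverted & length_mask(vl)  # drop lanes past the end of the vector


VLMAX = 8
VL = 5
then_mask = 0b00010011  # lanes 0, 1 and 4 take the then-branch

naive = ~then_mask & ((1 << VLMAX) - 1)
fixed = else_mask(then_mask, VL, VLMAX)

print(f"{naive:08b}")  # 11101100 -> wrongly re-enables lanes 5, 6 and 7
print(f"{fixed:08b}")  # 00001100 -> only lanes 2 and 3
```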

Robin Kruppe via llvm-dev

2019-Feb-01 11:57 UTC

[llvm-dev] [RFC] Vector Predication

On Fri, 1 Feb 2019 at 01:04, Philip Reames <listmail at philipreames.com>
wrote:
> Thank you for the explanation.
>
> Do such architectures frequently have arithmetic operations on the mask
> registers?  (i.e. can I reasonably compute a conservative length given a
> mask register value)  If I can, then having a mask as the canonical form
> and re-deriving the length register from a mask for a sequence of
> instructions which share a predicate seems fairly reasonable.
>

A mask is frequently too large to reasonably treat it as an integer, but
"find the index of the first mask bit that is 0" or some variant of it is a
common instruction. However, using it in the way you suggest can be very
expensive. Long vector operations have very high latency (easily tens of
cycles) until all result lanes are available (this doesn't stall later
lane-wise operations thanks to chaining). In the code pattern you describe,
all following vector operations have to stall until the "find first zero in
the mask" operation is complete, plus round-trip latency from the vector unit
to the scalar control logic and back. If the vector length is large, this
means having to stall until the entire mask is computed (and then some),
and so you stall for tens of cycles.

And it might be just as bad for short vector lengths. It's easy to imagine
that a "find first zero in mask" instruction short-circuits and finishes as
soon as any 0 bit is encountered, but my understanding is that this would
require extra control logic, so I can also easily imagine hardware
designers not going there (because they might expect it to not matter on
the workloads they care about). But this is a microarchitectural detail
that I can't make sweeping statements about; I would have to poll designers
to give an answer even just about RISC-V vector units currently in
development.

> The only case where the combination of a CKB and dynamic mask->length
> fallback wouldn't handle reliably is when we have a mask loaded from an
> external source (memory, function call boundary, etc...) and a short
> sequence of vector ops.  Are such really common enough that it needs to be
> a first class element of the design?
>

As Hideki said, masks passed in as parameters are everywhere when "whole
functions" are vectorized rather than just loops in a leaf function, and
while I don't have any hard data, I see no good reason why short functions
should be less common in that setting than in other code. Although I
suppose one could pass the active vector length separately as an integer
and re-compute length->mask in the function body.
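
That alternative can be sketched as follows: the callee takes the caller's
mask plus an integer length, rebuilds a prefix mask from the length, and
ANDs the two. The sketch reads the `(1 << %L -1)` expression from Question 1
as `(1 << L) - 1`, models masks as Python integers, and assumes inactive
lanes simply pass the first operand through. All names are hypothetical.

```python
from typing import List


def length_to_mask(length: int) -> int:
    """Prefix mask with bits 0..length-1 set."""
    return (1 << length) - 1


def effective_mask(param_mask: int, length: int) -> int:
    """Combine the mask passed by the caller with the active vector length."""
    return param_mask & length_to_mask(length)


def masked_add(a: List[int], b: List[int], param_mask: int, length: int) -> List[int]:
    """Lane-wise add; inactive lanes keep the value from `a` (an assumption)."""
    mask = effective_mask(param_mask, length)
    return [x + y if (mask >> lane) & 1 else x
            for lane, (x, y) in enumerate(zip(a, b))]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Caller's mask enables lanes 0, 1 and 3, but only 3 elements are valid,
# so lane 3 is switched off again by the length.
print(masked_add(a, b, 0b1011, 3))  # [11, 22, 3, 4]
```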
> p.s. To make sure my tone is coming across correctly, let me spell out
> that I'm not convinced, but I'm not actively objecting.  I'm playing
> devil's advocate for the purposes of fleshing out a design, but if folks
> more knowledgeable than I strongly believe the right design requires both
> masks and lengths, I'm happy to defer on that point.

Gotcha. I think it's important to make a good case for this and not just
assert on authority what is good codegen for these architectures.

Cheers,
Robin
