thr3ads.net - llvm dev - [llvm-dev] [RFC] Vector Predication [Feb 2019]


Robin Kruppe via llvm-dev

2019-Feb-04 20:18 UTC

[llvm-dev] [RFC] Vector Predication

On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Simon Moll <moll at cs.uni-saarland.de> writes:
>
> > You are referring to the sub-vector sizes, if i am understanding
> > correctly. I'd assume that the mask sub-vector length always has to be
> > either 1 or the same as the data sub-vector length. For example, this
> > is ok:
> >
> > %result = call <scalable 3 x float> @llvm.evl.fsub.v4f32(<scalable 3 x float> %x, <scalable 3 x float> %y, <scalable 1 x i1> %M, i32 %L)
>
> What does <scalable 1 x i1> applied to <scalable 3 x float> mean?  I
> would expect a requirement of <scalable 3 x i1>.  At least that's how I
> understood the SVE proposal [1].  The n's in <scalable n x type> have to
> match.
>
I believe the idea is to allow each single mask bit to control multiple
consecutive lanes at once, effectively interpreting the vector being
operated on as "many short fixed-length vectors, concatenated" rather than
a single long vector of scalars. This is a different interpretation of that
type than usual, but it's not crazy; e.g. a similar reinterpretation of
vector types seems to be the favored approach for adding matrix operations
to LLVM IR. It somewhat obscures the point to discuss this only for
scalable vectors: there's no conceptual reason why one couldn't do the same
with fixed-size vectors.
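
To make the grouping concrete, here is a minimal Python sketch of that reading (a model of the semantics only, not LLVM code). The function name `masked_fadd` is hypothetical, and using `None` to stand in for a masked-off (undefined) lane is an assumption made for illustration; the thread does not say what disabled lanes should contain.

```python
def masked_fadd(x, y, mask):
    """Model an fadd where each mask bit guards a group of consecutive lanes.

    The data vectors are treated as len(mask) short vectors concatenated,
    each of len(x) // len(mask) lanes. None marks a masked-off lane
    (an assumption; it stands in for an undefined value).
    """
    if len(x) != len(y):
        raise ValueError("operand vectors must have the same length")
    if not mask or len(x) % len(mask) != 0:
        raise ValueError("data length must be a multiple of the mask length")

    group = len(x) // len(mask)  # lanes controlled by one mask bit
    result = []
    for lane, (a, b) in enumerate(zip(x, y)):
        if mask[lane // group]:
            result.append(a + b)
        else:
            result.append(None)
    return result
```

With an 8-lane vector and a 2-bit mask, each bit covers four lanes, so `masked_fadd([1.0] * 8, [1.0] * 8, [True, False])` enables lanes 0 to 3 and disables lanes 4 to 7. With a mask as long as the data it degenerates to ordinary per-lane predication.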

In fact, I would recommend against making almost any new feature or
intrinsic exclusive to scalable vectors, including this one: there
shouldn't be much extra code required to allow and support it, and not
doing so makes the IR less orthogonal. For example, if a <scalable 4 x
float> fadd with a <scalable 1 x i1> mask works, then <4 x float> fadd with
a <1 x i1> mask, a <8 x float> fadd with a <2 x i1> mask, etc. should also
be possible overloads of the same intrinsic.

So far, so good. A bit odd, when I think about it, but if hardware out
there has that capability, maybe this is a good way to encode it in IR
(other options might work too, though). The crux, however, is the
interaction with the dynamic vector length: is it in terms of the mask? the
longer data vector? if the latter, what happens if it isn't divisible by
the mask length? There are multiple options and it's not clear to me which
one is "the right one", both for architectures with native support
(hopefully the one brought up here won't be the only one) and for internal
consistency of the IR. If there was an established architecture with this
kind of feature where people have gathered lots of practical experience
with it, we could use that to inform the decision (just as we have for
ordinary predication and dynamic vector length). But I'm not aware of any
architecture that does this other than the one Jacob and lkcl are working
on, and as far as I know their project is still in the early stages.
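
The two candidate readings of the vector length can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the function names are hypothetical, `group` denotes the number of data lanes controlled by one mask bit, and nothing here reflects a decision made in the thread.

```python
def active_lanes_vl_in_mask_units(num_lanes, group, vl):
    """VL counts mask elements: each unit of VL enables `group` data lanes."""
    return [lane // group < vl for lane in range(num_lanes)]


def active_lanes_vl_in_data_lanes(num_lanes, vl):
    """VL counts scalar data lanes, ignoring how the mask groups them."""
    return [lane < vl for lane in range(num_lanes)]


def partially_covered_groups(num_lanes, group, vl):
    """Indices of mask groups that a data-lane VL cuts in the middle.

    A non-empty result is exactly the 'not divisible' case: some group has
    a few lanes inside the vector length and a few outside it.
    """
    active = active_lanes_vl_in_data_lanes(num_lanes, vl)
    cut = []
    for g in range(num_lanes // group):
        lanes = active[g * group:(g + 1) * group]
        if any(lanes) and not all(lanes):
            cut.append(g)
    return cut
```

For 8 data lanes in groups of 4, a VL of 1 in mask units enables the first four lanes, whereas a VL of 6 in data lanes leaves group 1 half enabled, which is the case whose semantics would still have to be defined.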

Cheers,
Robin

Simon Moll via llvm-dev

2019-Feb-04 21:04 UTC


On 2/4/19 9:18 PM, Robin Kruppe wrote:
>
> On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>     Simon Moll <moll at cs.uni-saarland.de> writes:
>
>     > You are referring to the sub-vector sizes, if i am understanding
>     > correctly. I'd assume that the mask sub-vector length always has to be
>     > either 1 or the same as the data sub-vector length. For example, this
>     > is ok:
>     >
>     > %result = call <scalable 3 x float> @llvm.evl.fsub.v4f32(<scalable 3 x float> %x, <scalable 3 x float> %y, <scalable 1 x i1> %M, i32 %L)
>
>     What does <scalable 1 x i1> applied to <scalable 3 x float> mean?  I
>     would expect a requirement of <scalable 3 x i1>.  At least that's
>     how I understood the SVE proposal [1].  The n's in
>     <scalable n x type> have to match.
>
>
> I believe the idea is to allow each single mask bit to control 
> multiple consecutive lanes at once, effectively interpreting the 
> vector being operated on as "many short fixed-length vectors, 
> concatenated" rather than a single long vector of scalars. This is a 
> different interpretation of that type than usual, but it's not crazy, 
> e.g. a similar reinterpretation of vector types seems to be the 
> favored approach for adding matrix operations to LLVM IR. It somewhat 
> obscures the point to discuss this only for scalable vectors, there's 
> no conceptual reason why one couldn't do the same with fixed size
> vectors.
>
> In fact, I would recommend against making almost any new feature or 
> intrinsic exclusive to scalable vectors, including this one: there 
> shouldn't be much extra code required to allow and support it, and not 
> doing so makes the IR less orthogonal. For example, if a <scalable 4 x 
> float> fadd with a <scalable 1 x i1> mask works, then <4 x float> fadd
> with a <1 x i1> mask, a <8 x float> fadd with a <2 x i1> mask, etc.
> should also be possible overloads of the same intrinsic.

Yep. Doing the same for standard vector IR is on the radar:
https://reviews.llvm.org/D57504#1380587.

> So far, so good. A bit odd, when I think about it, but if hardware out
> there has that capability, maybe this is a good way to encode it in IR
> (other options might work too, though). The crux, however, is the
> interaction with the dynamic vector length: is it in terms of the
> mask? the longer data vector? if the latter, what happens if it isn't
> divisible by the mask length? There are multiple options and it's not
> clear to me which one is "the right one", both for architectures with
> native support (hopefully the one brought up here won't be the only
> one) and for internal consistency of the IR. If there was an
> established architecture with this kind of feature where people have
> gathered lots of practical experience with it, we could use that to
> inform the decision (just as we have for ordinary predication and
> dynamic vector length). But I'm not aware of any architecture that
> does this other than the one Jacob and lkcl are working on, and as far
> as I know their project is still in the early stages.

The current understanding is that the dynamic vector length operates in
the granularity of the mask: https://reviews.llvm.org/D57504#1381211

In unscaled IR types, this means VL masks each scalar result; in scaled
types, VL masks sub-vectors. E.g. for %L == 1 the following call produces
a pair of floats as the result:

    <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)

I agree that we should only consider the tied sub-vector case for this 
first version and keep discussing the unconstrained version. It is 
seductively easy to allow this but impossible to take it back.

---

The story is different when we talk only(!) about memory accesses and 
having different vector sizes in the operands and the transferred type 
(result type for loads, value operand type for stores):

E.g. on AVX, this call could turn into a 64-bit gather operation of pairs
of floats:

<16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr, <8 x i1> mask %M, i32 vlen 8)

And there is a native 16 x 16 element load (VLD2D) on SX-Aurora, which
may be represented as:

<scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)

- Simon

-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll


Robin Kruppe via llvm-dev

2019-Feb-04 21:40 UTC


On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
> On 2/4/19 9:18 PM, Robin Kruppe wrote:
>
> On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Simon Moll <moll at cs.uni-saarland.de> writes:
>>
>> > You are referring to the sub-vector sizes, if i am understanding
>> > correctly. I'd assume that the mask sub-vector length always has to be
>> > either 1 or the same as the data sub-vector length. For example, this
>> > is ok:
>> >
>> > %result = call <scalable 3 x float> @llvm.evl.fsub.v4f32(<scalable 3 x float> %x, <scalable 3 x float> %y, <scalable 1 x i1> %M, i32 %L)
>>
>> What does <scalable 1 x i1> applied to <scalable 3 x float> mean?  I
>> would expect a requirement of <scalable 3 x i1>.  At least that's how I
>> understood the SVE proposal [1].  The n's in <scalable n x type> have to
>> match.
>>
>
> I believe the idea is to allow each single mask bit to control multiple
> consecutive lanes at once, effectively interpreting the vector being
> operated on as "many short fixed-length vectors, concatenated" rather than
> a single long vector of scalars. This is a different interpretation of that
> type than usual, but it's not crazy, e.g. a similar reinterpretation of
> vector types seems to be the favored approach for adding matrix operations
> to LLVM IR. It somewhat obscures the point to discuss this only for
> scalable vectors, there's no conceptual reason why one couldn't do the same
> with fixed size vectors.
>
> In fact, I would recommend against making almost any new feature or
> intrinsic exclusive to scalable vectors, including this one: there
> shouldn't be much extra code required to allow and support it, and not
> doing so makes the IR less orthogonal. For example, if a <scalable 4 x
> float> fadd with a <scalable 1 x i1> mask works, then <4 x float> fadd with
> a <1 x i1> mask, a <8 x float> fadd with a <2 x i1> mask, etc. should also
> be possible overloads of the same intrinsic.
>
> Yep. Doing the same for standard vector IR is on the radar:
> https://reviews.llvm.org/D57504#1380587.
>
>
> So far, so good. A bit odd, when I think about it, but if hardware out
> there has that capability, maybe this is a good way to encode it in IR
> (other options might work too, though). The crux, however, is the
> interaction with the dynamic vector length: is it in terms of the mask? the
> longer data vector? if the latter, what happens if it isn't divisible by
> the mask length? There are multiple options and it's not clear to me which
> one is "the right one", both for architectures with native support
> (hopefully the one brought up here won't be the only one) and for internal
> consistency of the IR. If there was an established architecture with this
> kind of feature where people have gathered lots of practical experience
> with it, we could use that to inform the decision (just as we have for
> ordinary predication and dynamic vector length). But I'm not aware of any
> architecture that does this other than the one Jacob and lkcl are working
> on, and as far as I know their project is still in the early stages.
>
> The current understanding is that the dynamic vector length operates in
> the granularity of the mask: https://reviews.llvm.org/D57504#1381211

I do understand that this is what Jacob proposes based on the architecture
he works on. However, it is not yet clear to me whether that is the most
useful option overall, nor that it is the only option that will lead to
reasonable codegen for their architecture. But let's leave discussion of
the details on Phab. I just want to highlight one issue that is not
specific to Jacob's angle, as it relates to the interpretation of scalable
vectors more generally:

> In unscaled IR types, this means VL masks each scalar result, in scaled
> types VL masks sub vectors. E.g. for %L == 1 the following call produces a
> pair of floats as the result:
>
>    <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)

As I wrote on Phab mere minutes before you sent this email, I do not think
this is the right interpretation for any architecture I know about (I do
not know anything about the things Jacob and Luke are working on) nor from
the POV of the scalable vector types proposal. A scalable vector is not
conventionally "a variable-length vector of fixed-size vectors"; it is
simply an ordinary "flat" vector whose length happens to be mostly unknown
at compile time. If some intrinsics want to interpret it differently, that
is fine, but that's a property of those specific intrinsics -- similar to
how proposed matrix intrinsics might interpret a 16 element vector as a 4x4
matrix.

> I agree that we should only consider the tied sub-vector case for this
> first version and keep discussing the unconstrained version. It is
> seductively easy to allow this but impossible to take it back.
>
> ---
>
> The story is different when we talk only(!) about memory accesses and
> having different vector sizes in the operands and the transferred type
> (result type for loads, value operand type for stores):
>
> E.g. on AVX, this call could turn into a 64-bit gather operation of pairs of
> floats:
>
>     <16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr, <8 x i1> mask %M, i32 vlen 8)

Is that IR you'd expect someone to generate (or a backend to consume) for
this operation? It seems like a rather unnatural or "magical" way to
represent the intent (load 64b each from 8 pointers), at least with the way
I'm thinking about it. I'd expect a gather of 8xi64 and a bitcast.
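
The claimed equivalence (gathering pairs of floats versus gathering 64-bit words and reinterpreting the bits) can be checked with a small Python model. This is a sketch under assumptions, not a statement about any backend: memory is a little-endian byte buffer, pointers are byte offsets into it, masked-off elements are filled with zero bits, and both function names are hypothetical.

```python
import struct


def gather_float_pairs(memory, offsets, mask):
    """Gather two consecutive 32-bit floats from each enabled offset."""
    out = []
    for off, enabled in zip(offsets, mask):
        if enabled:
            out.extend(struct.unpack_from("<2f", memory, off))
        else:
            out.extend((0.0, 0.0))  # assumption: disabled lanes read as zero
    return out


def gather_i64_then_bitcast(memory, offsets, mask):
    """Gather one 64-bit word per enabled offset, then reinterpret as floats."""
    words = [
        struct.unpack_from("<Q", memory, off)[0] if enabled else 0
        for off, enabled in zip(offsets, mask)
    ]
    raw = struct.pack("<%dQ" % len(words), *words)  # the "bitcast" step
    return list(struct.unpack("<%df" % (2 * len(words)), raw))
```

On the same buffer, offsets and mask, both functions return the same list of floats, which is the sense in which the wider-result gather adds no expressive power for this particular case.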
> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora, which may
> be represented as:
>
>     <scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)

In contrast to the above I can't very well say one should write this as a
gather of i1024, but it also seems like a rather specialized instruction
(presumably used for blocked processing of matrices?) so I can't say that
this on its own motivates me to complicate a proposed core IR construct.

Cheers,
Robin

David Greene via llvm-dev

2019-Feb-07 16:56 UTC


Simon Moll <moll at cs.uni-saarland.de> writes:
> In unscaled IR types, this means VL masks each scalar result, in
> scaled types VL masks sub vectors. E.g. for %L == 1 the following call
> produces a pair of floats as the result:
>
>    <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)

That seems wrong to me.  In the SVE proposal, <scalable 2 x float> means
a dynamic vector length guaranteed to be a multiple of 2 floats long.
There is no notion of sub-vector.  The vector length parameter should
result in a <scalable 2 x float> result, but where the second float value
is undefined.  It would be surprising to get two full results.  If
sub-vector types were notated as <scalable 1 x <2 x float>> then %L == 1
would result in two full float results.

                            -David
