On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
> On 2/4/19 9:18 PM, Robin Kruppe wrote:
>> On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> Simon Moll <moll at cs.uni-saarland.de> writes:
>>>
>>> > You are referring to the sub-vector sizes, if i am understanding
>>> > correctly. I'd assume that the mask sub-vector length always has to be
>>> > either 1 or the same as the data sub-vector length. For example, this
>>> > is ok:
>>> >
>>> > %result = call <scalable 3 x float> @llvm.evl.fsub.v4f32(<scalable 3 x
>>> > float> %x, <scalable 3 x float> %y, <scalable 1 x i1> %M, i32 %L)
>>>
>>> What does <scalable 1 x i1> applied to <scalable 3 x float> mean?  I
>>> would expect a requirement of <scalable 3 x i1>.  At least that's how I
>>> understood the SVE proposal [1].  The n's in <scalable n x type> have to
>>> match.
>>
>> I believe the idea is to allow each single mask bit to control multiple
>> consecutive lanes at once, effectively interpreting the vector being
>> operated on as "many short fixed-length vectors, concatenated" rather
>> than a single long vector of scalars. This is a different interpretation
>> of that type than usual, but it's not crazy; e.g., a similar
>> reinterpretation of vector types seems to be the favored approach for
>> adding matrix operations to LLVM IR. It somewhat obscures the point to
>> discuss this only for scalable vectors; there's no conceptual reason why
>> one couldn't do the same with fixed-size vectors.
>>
>> In fact, I would recommend against making almost any new feature or
>> intrinsic exclusive to scalable vectors, including this one: there
>> shouldn't be much extra code required to allow and support it, and not
>> doing so makes the IR less orthogonal. For example, if a <scalable 4 x
>> float> fadd with a <scalable 1 x i1> mask works, then a <4 x float> fadd
>> with a <1 x i1> mask, an <8 x float> fadd with a <2 x i1> mask, etc.,
>> should also be possible overloads of the same intrinsic.
>
> Yep. Doing the same for standard vector IR is on the radar:
> https://reviews.llvm.org/D57504#1380587.
>
>> So far, so good. A bit odd, when I think about it, but if hardware out
>> there has that capability, maybe this is a good way to encode it in IR
>> (other options might work too, though). The crux, however, is the
>> interaction with the dynamic vector length: is it in terms of the mask?
>> the longer data vector? If the latter, what happens if it isn't
>> divisible by the mask length? There are multiple options and it's not
>> clear to me which one is "the right one", both for architectures with
>> native support (hopefully the one brought up here won't be the only
>> one) and for internal consistency of the IR. If there were an
>> established architecture with this kind of feature where people have
>> gathered lots of practical experience with it, we could use that to
>> inform the decision (just as we have for ordinary predication and
>> dynamic vector length). But I'm not aware of any architecture that does
>> this other than the one Jacob and lkcl are working on, and as far as I
>> know their project is still in the early stages.
>
> The current understanding is that the dynamic vector length operates at
> the granularity of the mask: https://reviews.llvm.org/D57504#1381211

I do understand that this is what Jacob proposes based on the architecture
he works on. However, it is not yet clear to me whether that is the most
useful option overall, nor that it is the only option that will lead to
reasonable codegen for their architecture. But let's leave discussion of
the details on Phab. I just want to highlight one issue that is not
specific to Jacob's angle, as it relates to the interpretation of scalable
vectors more generally:

> In unscaled IR types, this means VL masks each scalar result; in scaled
> types, VL masks sub-vectors. E.g., for %L == 1 the following call
> produces a pair of floats as the result:
>
> <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)

As I wrote on Phab mere minutes before you sent this email, I do not think
this is the right interpretation for any architecture I know about (I do
not know anything about the things Jacob and Luke are working on), nor
from the POV of the scalable vector types proposal. A scalable vector is
not conventionally "a variable-length vector of fixed-size vectors"; it is
simply an ordinary "flat" vector whose length happens to be mostly unknown
at compile time. If some intrinsics want to interpret it differently, that
is fine, but that's a property of those specific intrinsics -- similar to
how proposed matrix intrinsics might interpret a 16-element vector as a
4x4 matrix.

> I agree that we should only consider the tied sub-vector case for this
> first version and keep discussing the unconstrained version. It is
> seductively easy to allow this but impossible to take it back.
>
> ---
>
> The story is different when we talk only(!) about memory accesses and
> having different vector sizes in the operands and the transferred type
> (result type for loads, value operand type for stores):
>
> E.g. on AVX, this call could turn into a 64-bit gather operation of
> pairs of floats:
>
> <16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr, <8 x i1> mask %M, i32 vlen 8)

Is that IR you'd expect someone to generate (or a backend to consume) for
this operation? It seems like a rather unnatural or "magical" way to
represent the intent (load 64b each from 8 pointers), at least with the
way I'm thinking about it. I'd expect a gather of 8 x i64 and a bitcast.

> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora, which
> may be represented as:
>
> <scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)

In contrast to the above, I can't very well say one should write this as a
gather of i1024, but it also seems like a rather specialized instruction
(presumably used for blocked processing of matrices?), so I can't say that
this on its own motivates me to complicate a proposed core IR construct.

Cheers,
Robin
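[To make the grouped-mask interpretation discussed in this message concrete, here is a scalar reference sketch. This is my own illustration, not semantics from the proposal: it assumes each mask bit governs N/M consecutive data lanes, that %L is counted in mask units (one of the options under debate, not a settled choice), and that disabled lanes produce 0.0 (the real masked-out value is a separate open question).]

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical reference semantics for an evl.fsub whose mask is shorter
// than the data vector: each mask bit controls GROUP consecutive lanes,
// and the dynamic vector length L is counted in mask units.
template <std::size_t N, std::size_t M>
std::array<float, N> evl_fsub(const std::array<float, N>& x,
                              const std::array<float, N>& y,
                              const std::array<bool, M>& mask,
                              std::size_t L) {
    static_assert(N % M == 0, "data length must be divisible by mask length");
    constexpr std::size_t GROUP = N / M;
    std::array<float, N> result{};  // disabled lanes left at 0.0f here
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t group = i / GROUP;  // which mask bit governs lane i
        if (group < L && mask[group])
            result[i] = x[i] - y[i];
    }
    return result;
}
```

[So a call mirroring `evl.fsub(<8 x float>, <8 x float>, <2 x i1>, L=1)` would compute only the first group of four lanes, regardless of the second mask bit.]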
On 2/4/19 10:40 PM, Robin Kruppe wrote:
> A scalable vector is not conventionally "a variable-length vector of
> fixed-size vectors"; it is simply an ordinary "flat" vector whose
> length happens to be mostly unknown at compile time.

On NEC SX-Aurora the vector length is always interpreted in 64-bit data
chunks. That is one example of a real architecture where the vscaled
interpretation of VL makes sense.

> In contrast to the above, I can't very well say one should write this
> as a gather of i1024, but it also seems like a rather specialized
> instruction (presumably used for blocked processing of matrices?), so
> I can't say that this on its own motivates me to complicate a proposed
> core IR construct.

It actually reduces complexity by shifting it from the address
computation into the instruction. This would cover all three cases:
VLD2D, <2 x float> gather on AVX and <W x float> loads for this early
RISC-V based architecture that Jacob and lkcl are working on. However,
this is not a top priority and we can leave it out of the first version.

- Simon

--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll
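[The <2 x float>-gather-on-AVX case mentioned above hinges on an equivalence worth spelling out: gathering a pair of adjacent floats from one pointer moves exactly the same bytes as gathering a single i64 and bitcasting. A minimal scalar sketch, purely illustrative -- real AVX gathers take base-plus-index operands, not pointer vectors:]

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Loading a pair of floats from one pointer is the same memory transfer
// as loading a single i64 and bitcasting the bits back to 2 x float.
struct FloatPair { float lo, hi; };

FloatPair gather_pair_as_i64(const float* p) {
    uint64_t bits;
    std::memcpy(&bits, p, sizeof(bits));    // one 64-bit load...
    FloatPair out;
    std::memcpy(&out, &bits, sizeof(out));  // ...bitcast back to 2 x float
    return out;
}
```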
The architecture Luke and I are working on, assuming it goes the way I
think it will, will have instructions like:

    fmadd.s.vvss rd, rs1, rs2, rs3, len=N*VL, pred=rp

where N is from 1 to 4, which has the following pseudo-code:

    constexpr auto f32_per_reg = 2;
    union FReg
    {
        double f64[1];
        float f32[f32_per_reg];
        _Float16 f16[4];
    };
    union Reg
    {
        uint64_t i64[1];
        uint32_t i32[2];
        uint16_t i16[4];
        uint8_t i8[8];
    };
    // registers
    FReg fregs[128];
    Reg regs[128];
    uint64_t vl;
    // instruction fields
    int rd, rs1, rs2, rs3, rp, N;
    // main code
    for(uint64_t i = 0; i < vl * N; i++)
    {
        if(regs[rp].i64[0] & (1ULL << i / N))
        {
            auto rv = i / f32_per_reg;
            auto sv = i % f32_per_reg;
            auto rs = (i % N) / f32_per_reg;
            auto ss = (i % N) / f32_per_reg;
            // *+ is contracted into fma
            fregs[rd + rv].f32[sv] = fregs[rs1 + rv].f32[sv]
                                     * fregs[rs2 + rs].f32[ss]
                                     + fregs[rs3 + rs].f32[ss];
        }
    }

So it would be handy for the vector length on evl intrinsics to be in
units of the mask length so we don't have to pattern match a division in
the backend. We could have 2 variants of the vector length argument, one
in terms of the data vector and one in terms of the mask vector. We could
legalize the mask vector variant for those architectures that need it by
pulling the multiplication out and switching to the data vector variants.

Jacob
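[The two-variant legalization Jacob proposes above reduces to a single scaling step; a minimal sketch with hypothetical helper names, assuming the data length is a multiple of the mask length:]

```cpp
#include <cassert>
#include <cstdint>

// A vlen given in mask units converts to element units by one multiply
// by the group factor (data lanes per mask bit); on a target that counts
// elements this multiply is the only extra code the legalizer emits.
constexpr uint64_t mask_units_to_elements(uint64_t vlen_mask_units,
                                          uint64_t lanes_per_mask_bit) {
    return vlen_mask_units * lanes_per_mask_bit;
}

// The reverse direction, for a target that counts mask units; assumes
// the element count is divisible by the group factor.
constexpr uint64_t elements_to_mask_units(uint64_t vlen_elements,
                                          uint64_t lanes_per_mask_bit) {
    return vlen_elements / lanes_per_mask_bit;
}
```

[Since the group factor N is a compile-time constant (1 to 4 in Jacob's example), the multiply is at worst a shift by an immediate.]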
On Mon, Feb 4, 2019, 14:23 Jacob Lifshay <programmerjake at gmail.com> wrote:
> auto rs = (i % N) / f32_per_reg;
> auto ss = (i % N) / f32_per_reg;

should have been:

    auto ss = (i % N) % f32_per_reg;
On Mon, 4 Feb 2019 at 23:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
> On NEC SX-Aurora the vector length is always interpreted in 64-bit data
> chunks. That is one example of a real architecture where the vscaled
> interpretation of VL makes sense.

Now this is a problem. Let's leave the details of why RISC-V V needs the
other interpretation to Phab, but we definitely have a conflict in what
these two architectures need. How do we reconcile them?

Picking one option and requiring a multiplication/division of the vlen
argument to get the other meaning is a nice canonical IR form, but it
seems a bit problematic for codegen, because that mul/div is a prime
candidate for being CSE'd across blocks (pure calculation, repeated
everywhere) and consequently being difficult to access for pattern
matching in the backend.

On the other hand, it's a less serious problem than was previously
discussed re: vlen vs. predication. The actual change in codegen is just
omitting one instruction, which one can easily do in an SSA-based MIR
pass if necessary (instead of during ISel). Moreover, the cost of a
missed folding opportunity is relatively minor, since it'll most likely
be just a shift by an immediate, and it'll usually be amortized over
basically the entire loop body in a lot of code.

Still, does anyone have a better idea?

Cheers,
Robin
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-05 00:27 UTC
[llvm-dev] [RFC] Vector Predication
On Mon, Feb 4, 2019 at 9:41 PM Robin Kruppe <robin.kruppe at gmail.com> wrote:
> On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
>> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora, which
>> may be represented as:
>>
>> <scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)
>
> In contrast to the above, I can't very well say one should write this as
> a gather of i1024, but it also seems like a rather specialized
> instruction (presumably used for blocked processing of matrices?), so I
> can't say that this on its own motivates me to complicate a proposed
> core IR construct.

i concur: any architecture that has fixed (SIMD-style) processing widths
is basically hell for any compiler to deal with in a generic way.
architecturally-fixed element widths really do not fit into the
arbitrary-width vector processing paradigm, and i foresee efforts to try
resulting in design conflict and ultimately failure.

can i therefore recommend that variable-length ISAs be given top priority
in the design, so that there is the possibility for various
variable-length architectures to do near-direct, highly optimised
translations, and leave SIMD-style (fixed-width) architectures to
continue the long-standing, well-established practice of creating
"special-case / corner-case / cleanup" code?

(with apologies to proponents of SX-Aurora, MMX, SSE and other
fixed-width SIMD-style architectures...)

l.
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-05 00:54 UTC
[llvm-dev] [RFC] Vector Predication
with apologies for breaking the thread, i wasn't cc'd earlier in the conversation.

http://lists.llvm.org/pipermail/llvm-dev/2019-January/129806.html

david, you wrote:

> I'm solidly of the opinion that we already *have* IR support for
> explicit masking in the form of gather/scatter/etc... Until someone has
> taken the effort to make masking in this context *actually work well*,
> I'm unconvinced that we should greatly expand the usage in the IR.

the problem with gather/scatter is that it requires moving the data (MV or LD/ST).

MV - particularly with quite large data sets - puts pressure on a microarchitecture to increase the size of the register file (otherwise data has to be pushed to the stack).

LD/ST - as shown by Jeff Bush in his work on nyuzi - results in *significant* power consumption increases, due to having to push data through the L1/L2 cache (which is all CAMs).

in SV we deliberately drop the vectorisation onto the *standard* register file *precisely* to avoid the need to exchange data between a special vector register file and a scalar register file.

additionally, the microarchitecture being designed actually happens to effectively use gather/scatter techniques internally when a predicate mask is applied: element operations are pushed into a multi-issue instruction queue, and non-predicated elements are simply skipped. [thus we get 100% ALU utilisation even when there are back-to-back "if then else" inverted predicate masks: the non-inverted predicate issues one set of elements, and the inverted predicate matches up perfectly with that.]

basically i feel that this is the right paradigm. now, if a given ISA doesn't *have* predicate masks, then yes, absolutely, gather/scatter at the *instruction* level (as opposed to the micro-architectural level) is the correct way to *emulate* predication.
instructions may be issued that exclude the non-predicated elements, put them into a group (even a SIMD fixed-width group), and re-extract them on the other side of the group-operation into the required destination registers.

even the previously-mentioned SX-Aurora architecture (and other SIMD architectures) could use this trick to effectively "emulate" predication where the ISA doesn't have predicate masks. it can also be used to emulate variable-length vectors, by simply setting the top elements of a SIMD block to zero (or ignoring them entirely) and only copying out the lower-indexed elements with a scatter operation.

whilst that is not particularly efficient, that's not LLVM's problem: SIMD architectures were designed the way they are because it's seductively simpler at the hardware level.

however, to expect an architecture that *does* support proper predication to have to complexify the way it does predication, by shoe-horning it into gather/scatter... that's sub-optimal, and i'm drawing a mental blank as to how it could be done, let alone done effectively and efficiently.

that's not to say that gather/scatter should be removed entirely: that would be a mistake. there are circumstances where gather/scatter is far better suited than predicate masks.

bottom line: i feel that expecting predication to be implemented in terms of gather/scatter is the wrong way round. the IR should have explicit and proper support for predicate masks, and architectures that don't *have* predicate masks should *use* gather/scatter instructions to emulate it.

l.
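the trick described above - gather the active elements into a dense group, run the plain unpredicated op on the group, then scatter the results back to their original lanes - can be sketched like this (a hand-written python model of the idea, not any real intrinsic or ISA):

```python
def predicated_add_via_gather_scatter(dst, a, b, mask):
    # "gather": collect the indices of the active (predicated-on) lanes.
    active = [i for i, m in enumerate(mask) if m]
    # dense, unpredicated operation on the compressed group - this is
    # the part a fixed-width SIMD unit can execute directly.
    packed = [a[i] + b[i] for i in active]
    # "scatter": write results back to their original lanes only;
    # masked-off lanes keep the destination's old values.
    out = list(dst)
    for j, i in enumerate(active):
        out[i] = packed[j]
    return out

r = predicated_add_via_gather_scatter(
    dst=[0, 0, 0, 0], a=[1, 2, 3, 4], b=[10, 20, 30, 40],
    mask=[True, False, True, False])
```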
Jacob Lifshay <programmerjake at gmail.com> writes:

> So it would be handy for the vector length on evl intrinsics to be in
> units of the mask length so we don't have to pattern match a division
> in the backend. We could have 2 variants of the vector length
> argument, one in terms of the data vector and one in terms of the mask
> vector. we could legalize the mask vector variant for those
> architectures that need it by pulling the multiplication out and
> switching to the data vector variants.

Would it make sense to have two different intrinsics?

  # "Normal" form, L is in terms of the flat vector length.
  <scalable 2 x float> evl.fsub(<scalable 2 x float> %x,
                                <scalable 2 x float> %y,
                                <scalable 2 x i1> %M, i32 %L)

  # "Sub-vector" form, L is in terms of sub-vector elements.
  <scalable 1 x <2 x float>> evl.fsub(<scalable 1 x <2 x float>> %x,
                                      <scalable 1 x <2 x float>> %y,
                                      <scalable 1 x <2 x i1>> %M, i32 %L)

Overloading types to mean two very different things is confusing to me.

-David