On 2/5/19 1:27 AM, Philip Reames via llvm-dev wrote:
> On 1/31/19 4:57 PM, Bruce Hoult wrote:
>> On Thu, Jan 31, 2019 at 4:05 PM Philip Reames via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> Do such architectures frequently have arithmetic operations on the
>>> mask registers? (i.e. can I reasonably compute a conservative
>>> length given a mask register value) If I can, then having a mask as
>>> the canonical form and re-deriving the length register from a mask
>>> for a sequence of instructions which share a predicate seems fairly
>>> reasonable. Note that I'm assuming this as a fallback, and that the
>>> common case is handled via the equivalent of ComputeKnownBits on the
>>> mask itself at compile time.
>> If masking is used (which it is usually not for loops without control
>> flow inside the vectorised loop) then, yes, logical operations on the
>> mask registers will happen at every basic block boundary.
>>
>> But it is NOT the case that you can compute the active vector length
>> VL from an initial mask value. The active vector length is set by the
>> hardware based on the remaining application vector length. The VL can
>> change for each loop iteration -- the normal pattern is for VL to
>> equal VLMAX for initial executions of the loop, and then be less than
>> VLMAX for the final one or two iterations of the loop. For example, if
>> VLMAX is 16 and there are 19 elements left in the application vector,
>> then the hardware might choose to use 10 elements for the 2nd-to-last
>> iteration and 9 elements for the last iteration. Or not. Other
>> hardware might choose to perform the last three iterations as 12/12/11
>> instead of 16/10/9. (It is constrained to be monotonic.)
>>
>> VL can also be dynamically shortened in the middle of a loop iteration
>> by an unaligned vector load that crosses a protection boundary if the
>> later elements are inaccessible.
> I can't reconcile this complexity with either the snippet on RISC-V
> which was shared, or the current EVL proposal. Doesn't this imply
> that the vector length can change between *every* pair of vector
> instructions? If so, how does having it as part of the EVL intrinsics
> work?

I think this is the usual mixup of AVL and MVL.

AVL: is part of the predicate and can change between vector operations
just like a mask can (light weight).

MVL: Is the physical vector register length and can be re-configured per
function (RVV only atm) - (heavy weight, stop-the-world instruction).

The vectorlen parameter in EVL intrinsics is for the AVL.

>> I'm curious what SVE will do if there is an if/then/else in the middle
>> of a vectorised loop with a shorter-than-maximum vector length. You
>> can't just invert the mask when going from the then-part to the
>> else-part because that would re-enable elements past the end of the
>> vector. You'd need to invert the mask and then AND it with the mask
>> containing the (bitwise representation of) the vector length.

If folks have issues with carrying the vlen around even if the target
only supports masking, we can rephrase EVL using higher-order functions
with varargs (basically prefixing):

ARM SVE, AVX512 (mask-only targets):

    llvm.evl.masked(<16 x i1> mask %M, ...)
    llvm.evl.fsub(<16 x float>, <16 x float>)   ; exists only to get a function handle

    call @llvm.evl.masked.v16f32(%M, @llvm.evl.fsub.v16f32, <16 x float>, <16 x float>)

RISC-V V, SX-Aurora:

    llvm.evl.pred(<16 x i1> mask %M, i32 vlen %VL, ...)

    llvm.evl.pred(%M, %vl, @llvm.evl.fsub, %a, %b)

The problem with this is mostly that the operand positions are now off
compared to regular IR, and the API abstractions that accept both will
have to account for that.

- Simon

> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
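For comparison, a minimal sketch of how the same operation reads in the
flat form, with mask and vector length as trailing operands (as I
understand the current proposal; the exact intrinsic name, mangling and
value names here are illustrative, not final):

    ; flat form: data operands stay in their usual positions,
    ; the predicate arguments (mask, vlen) are appended at the end
    %r = call <16 x float> @llvm.evl.fsub.v16f32(<16 x float> %a, <16 x float> %b,
                                                 <16 x i1> %M, i32 %vl)

    ; higher-order form sketched above: mask and vlen are prefixed, so every
    ; data operand shifts by two positions relative to regular IR
    ; call @llvm.evl.pred(<16 x i1> %M, i32 %vl, @llvm.evl.fsub, %a, %b)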
On Tue, Feb 5, 2019 at 1:23 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
> I think this is the usual mixup of AVL and MVL.
>
> AVL: is part of the predicate and can change between vector operations
> just like a mask can (light weight).
>
> MVL: Is the physical vector register length and can be re-configured per
> function (RVV only atm) - (heavy weight, stop-the-world instruction).
>
> The vectorlen parameter in EVL intrinsics is for the AVL.

Unless I misunderstand, this doesn't describe RVV correctly, although
this is understandable as the spec has moved around a bit in the last
six or twelve months as it's gotten closer to being set in stone.

The way it has ended up (very unlikely to change now) is:

- any given RVV vector unit has 32 registers, each with the same and
  fixed length in bits.

- the vector unit is configured by the VSETVL[I] instruction, which has
  two arguments: 1) the requested AVL, and 2) the vtype (vector type).

- The vtype is an integer with several small fields, of which two are
  currently defined (the other bits must be zero). The fields are the
  Standard Element Width (SEW) and VLMul. SEW can be any power of 2 from
  8 bits up to some implementation-defined maximum (1024 bits absolute
  maximum). VLMul says that you don't actually need 32 distinct vector
  variables in your current loop/function and you're willing to trade
  number of registers for a larger MVL. So you can gang together each
  even/odd register pair into 16 longer registers (named 0,2,4...30), or
  you can gang together groups of four or at most eight registers.

- the current MVL -- the maximum number of elements in a vector
  register -- is the hardware register length, multiplied by the VLMul
  field in vtype, divided by the SEW field in vtype.

- the AVL is the smaller of MVL and the requested AVL.

- only two things can change AVL: the VSETVL[I] instruction, and a
  special kind of memory load: "unit-stride first-fault loads", if the
  load crosses a protection boundary and the tail of the vector is
  inaccessible. This kind of load is relatively uncommon and exists so
  you can vectorise things where the end of the application vector is
  data-dependent rather than counted. The canonical example is
  strlen()/strcpy(). For most code you can ignore it and say the AVL
  changes only when you execute VSETVL[I].

- any time the program uses VSETVL[I], *both* the MVL and the AVL can
  change.

- the common case is a loop with the vtype in an immediate VSETVLI at
  the head of the loop. In this case, the AVL potentially changes in
  every iteration of the loop (but usually only in the last one or two
  iterations). As the vtype is an immediate, it can't change from
  iteration to iteration. But it's common for two loops in the same
  function to use different vtypes, and so different MVLs, because the
  loops might either operate on different data types, or need a
  different number of vector variables in the loop, or both.

- VSETVL[I] is *not* heavyweight, even if it changes the MVL. It's
  quite OK to execute it as much as you want -- even before every vector
  instruction if you want. That would be pretty unusual, and I think
  falls more into the "clever hand-written code" area than into anything
  a compiler is likely to want to generate from C loops, although it's
  certainly possible.
Here's an example:

void foo(size_t n, int64_t *dst, int32_t *a, int32_t *b){
  for (size_t i=0; i<n; ++i)
    dst[i] += a[i] * b[i];
}

If 32x32->64 multiplies are cheaper than 64x64->64 multiplies then you
might want to compile this to:

# args n in a0, dst in a1, a in a2, b in a3, AVL in t0
foo:
  vsetvli a4, a0, vsew32,vlmul4   # vtype = 32-bit integer vectors, AVL in a4
  vlw.v v0, (a2)                  # Get 32b vector a into v0-v3
  vlw.v v4, (a3)                  # Get 32b vector b into v4-v7
  slli a5, a4, 2                  # multiply AVL by element size 4 bytes
  add a2, a2, a5                  # Bump pointer a
  add a3, a3, a5                  # Bump pointer b
  vwmul.vv v8, v0, v4             # 64b result in v8-v15

  vsetvli zero, a0, vsew64,vlmul8 # Operate on 64b values, discard new AVL as it's the same
  vld.v v16, (a1)                 # Get 64b vector dst into v16-v23
  vadd.vv v16, v16, v8            # add 64b elements in v8-v15 to v16-v23
  vsd.v v16, (a1)                 # Store vector of 64b
  slli a5, a4, 3                  # multiply AVL by element size 8 bytes
  add a1, a1, a5                  # Bump pointer dst
  sub a0, a0, a4                  # subtract AVL from n to get remaining count
  bnez a0, foo                    # Any more?
  ret

The alternative of course is to set up for 64 bit elements at the
outset, let the two vlw.v's for a and b widen the 32 bit loads into 64
bit elements, then do 64x64->64 multiplies. The code would be two
instructions shorter, saving one of the vsetvli (4 bytes) and one of
the shifts (2 bytes).

Assuming for the moment a 512 bit (64 byte) vector register size
(total vector register file 2 KB), this function initially sets the
MVL to 64 (2048 bits divided into 32-bit elements). The widening
multiply produces 64 64-bit elements. The second half of the loop then
sets the element size to 64 bits and doubles the vlmul, so the MVL is
still 64 (4096 bits divided into 64-bit elements). The load, add, and
store of dst then take place using 64-bit calculations.

Except on the last iteration [1], the AVL will be the same as the MVL.
Both will change (in bits, not in number of elements in this case)
twice in each loop.

[1] If on the 2nd-to-last iteration there are, say, 72 elements left,
the vsetvli instruction might choose to return an AVL of 36 elements,
leaving 36 for the last iteration, rather than doing 64 and then
leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and
32, depending on what suits that particular hardware. Or maybe it will
equalise the last three or four or more iterations. The main rule is
the AVL must decrease monotonically.
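To connect this back to the EVL discussion: a rough sketch of how the
scalar loop above might be strip-mined with EVL-style intrinsics,
assuming a setvl-like intrinsic that returns the granted AVL and the
mask/vlen-as-trailing-operands form of the proposal. All intrinsic
names (llvm.evl.setvl, llvm.evl.load/store, llvm.evl.sext,
llvm.evl.mul/add) are placeholders, fixed-width <8 x ...> types are
used only to keep the sketch readable (an RVV mapping would use
scalable types), and %ones is an all-true mask set up in the elided
preheader:

    ; illustrative sketch: one strip-mined iteration of dst[i] += a[i] * b[i]
    loop:
      %rem  = phi i64  [ %n,   %entry ], [ %rem.next, %loop ]
      %pd   = phi i64* [ %dst, %entry ], [ %pd.next,  %loop ]
      %pa   = phi i32* [ %a,   %entry ], [ %pa.next,  %loop ]
      %pb   = phi i32* [ %b,   %entry ], [ %pb.next,  %loop ]

      ; request the remaining element count, receive the granted AVL (<= MVL)
      %avl  = call i32 @llvm.evl.setvl.i64(i64 %rem)

      %va    = call <8 x i32> @llvm.evl.load.v8i32(i32* %pa, <8 x i1> %ones, i32 %avl)
      %vb    = call <8 x i32> @llvm.evl.load.v8i32(i32* %pb, <8 x i1> %ones, i32 %avl)
      %va64  = call <8 x i64> @llvm.evl.sext.v8i64.v8i32(<8 x i32> %va, <8 x i1> %ones, i32 %avl)
      %vb64  = call <8 x i64> @llvm.evl.sext.v8i64.v8i32(<8 x i32> %vb, <8 x i1> %ones, i32 %avl)
      %prod  = call <8 x i64> @llvm.evl.mul.v8i64(<8 x i64> %va64, <8 x i64> %vb64, <8 x i1> %ones, i32 %avl)
      %vd    = call <8 x i64> @llvm.evl.load.v8i64(i64* %pd, <8 x i1> %ones, i32 %avl)
      %sum   = call <8 x i64> @llvm.evl.add.v8i64(<8 x i64> %vd, <8 x i64> %prod, <8 x i1> %ones, i32 %avl)
      call void @llvm.evl.store.v8i64(<8 x i64> %sum, i64* %pd, <8 x i1> %ones, i32 %avl)

      ; pointer bumps and the remaining count use the *granted* AVL, not the request
      %avl.zext = zext i32 %avl to i64
      %pa.next  = getelementptr i32, i32* %pa, i64 %avl.zext
      %pb.next  = getelementptr i32, i32* %pb, i64 %avl.zext
      %pd.next  = getelementptr i64, i64* %pd, i64 %avl.zext
      %rem.next = sub i64 %rem, %avl.zext
      %done     = icmp eq i64 %rem.next, 0
      br i1 %done, label %exit, label %loop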
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-05 11:49 UTC
[llvm-dev] [RFC] Vector Predication
On Tuesday, February 5, 2019, Simon Moll <moll at cs.uni-saarland.de> wrote:
> I think this is the usual mixup of AVL and MVL.
>
> AVL: is part of the predicate

Mmm that's very confusing to say that AVL is part of the predicate.
It's.... kiinda true?

> and can change between vector operations just like a mask can (light
> weight).

Yes, ok, it's more that it is an "advisory". In RVV the program (the
instruction) *requests* a specific AVL and the processor responds with
an *actual* AVL of between 0 (yes really, zero) and
MIN(MVL, requested_AVL).

To say that it's a predicate, well... a predicate mask, you set it, and
the mask is obeyed, period. AVL, that just doesn't happen.

> MVL: Is the physical vector register length and can be re-configured per
> function (RVV only atm) - (heavy weight, stop-the-world instruction).

My understanding of RVV is that MVL is intended to be more of a
hardcoded parameter that is part of the processor design. Any compiler
should be generating code that really does not need to know what MVL is.

SV is slightly different, due to the fact that we use the *scalar*
regfile as if it was a typecast SRAM. The register number in any given
instruction is just a pointer to the SRAM address at which vector
elements i8/16/32/64 are read/written. So in SV we need to *set* the
MVL, otherwise how can the engine know the point where it has to stop
reading/writing to the register SRAM? However, what is most likely to
happen is that MVL will be set globally to e.g. 4 and be done with it.

SV semantics for AVL are also slightly different from RVV, though not
by much. The engine is not permitted to choose arbitrary values: if AVL
is requested to be set to 4, it must *be* set to MIN(MVL, 4). This can
sometimes avoid the need for a loop entirely (short vectors).

Note also that in SV, neither AVL nor MVL may be set to zero. AVL=1
indicates that the engine is to interpret instructions in SCALAR mode.

> The vectorlen parameter in EVL intrinsics is for the AVL.

Ok so there is a bit of a problem, for both SV and RVV, in that both
can end up with AVL values different from what is requested. If the API
expects that when AVL elements are to be processed, exactly that number
of elements *will* have been processed, that is simply not the case,
and that assumption will result in a catastrophic failure: elements not
being processed.

To deal with that, if it is a hard requirement of the API that exactly
the number of AVL ops are carried out as requested, an otherwise
completely redundant assembly-code for-loop will have to be generated.
Oh, and then outside of that loop would be the IR-level inner loop that
was actually part of the user's program.

Basically what I am saying is that the semantics "request an AVL from
the hardware and get an ACTUAL number of elements to be processed"
really needs to become part of the API.

Now, fascinatingly, for SIMD-only style architectures, that could
hypothetically be used to communicate to the JIT engine converting the
IR to use progressively smaller SIMD widths, on architectures that have
multiple widths. Also to indicate when corner-case cleanup is to be
used. (SIMD already being a mess, this would all not be high priority /
optimised.)

OR... the inner workings of AVL are entirely hidden and opaque to the
IR. The IR sets the total explicit number of elements, and It Gets Done.
However I suspect that doing that will open a can o' worms.

>>> I'm curious what SVE will do if there is an if/then/else in the middle
>>> of a vectorised loop with a shorter-than-maximum vector length. You
>>> can't just invert the mask when going from the then-part to the
>>> else-part because that would re-enable elements past the end of the
>>> vector. You'd need to invert the mask and then AND it with the mask
>>> containing the (bitwise representation of) the vector length.

Yep, that is a workable solution for fixed-width (SIMD) architectures;
it is a good pattern to use.

As I mentioned earlier (about the mistake of using gather/scatter as a
means and method of implementing predication), it would be a mistake to
try to "dumb down" this proposal to cater for fixed-length SIMD engines
to the detriment of dynamic-length engines.

If you try that, then all the advantages of dynamic-length ISAs are
utterly destroyed, as the only way to comply with a dumbed-down
fixed-length proposal is for variable-length ISAs to issue brain-dead
FIXED-length assembly code.

Whereas if the API can cope with variable length, the length that is
returned for a SIMD engine may be one of the multiples of the SIMD
widths that that engine supports, it can use scatter/gather as a
substitute for a (potential) lack of predication masks, and so on.

If as an industry we want to break free of the seductively broken SIMD
paradigm, then variable-length engines need to be given top priority.
Really. And again, I say that with profuse apologies to all engineers
who have to deal with SIMD. I know it's so much easier to implement at
the hardware level; it's just that SIMD has always made the compiler
writer's job absolute hell.

L.

--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
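In IR terms, the else-branch pattern Bruce describes might look roughly
like the following sketch, using fixed-width types; %cond is the branch
condition mask and %loop.mask is assumed to be the mask encoding the
active vector length (both are assumptions for illustration):

    ; then-part predicate: condition AND'ed with the mask that encodes
    ; the active vector length
    %then.mask = and <8 x i1> %cond, %loop.mask

    ; else-part predicate: invert the condition, then AND with the length
    ; mask so that lanes past the end of the vector stay disabled
    %not.cond  = xor <8 x i1> %cond, <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>
    %else.mask = and <8 x i1> %not.cond, %loop.mask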
On 2/5/19 12:49 PM, Luke Kenneth Casson Leighton wrote:
> Basically what I am saying is that the semantics "request an AVL from
> the hardware and get an ACTUAL number of elements to be processed"
> really needs to become part of the API.

Ok. We could add this behavior to the EVL contract with an intrinsic:

    %EffectiveVL = llvm.evl.setvl(<scalable vscale x float>, %RequestedAVL)

where vscale would be interpreted as VLMul on RISC-V.

> the inner workings of AVL are entirely hidden and opaque to the IR.
> The IR sets the total explicit number of elements, and It Gets Done.

That would still be ok, if the following invariant holds on RVV. Given
that

    %effectivevl = setvl <vty>, %reqVL

will the following hold whenever %avl <= %effectivevl?

    %avl == setvl <vty>, %avl

In that case, we could require the VL parameter to be derived from the
evl.setvl intrinsic without violating that part of EVL semantics (all
elements in the range [0, VL) will be processed).

- Simon

> --
> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
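To make the intended use concrete, a small sketch of how the granted
value could feed the vlen operand of later EVL intrinsics, and of the
invariant being asked about. Intrinsic names, mangling, and the
scalable type syntax are illustrative, not proposed final signatures:

    ; hypothetical: request an AVL, receive the effective (granted) VL
    %evl = call i32 @llvm.evl.setvl.nxv4f32(i32 %RequestedAVL)

    ; the granted value is then passed as the vlen operand of every EVL
    ; intrinsic in the strip-mined loop body
    %r = call <vscale x 4 x float> @llvm.evl.fsub.nxv4f32(
             <vscale x 4 x float> %x, <vscale x 4 x float> %y,
             <vscale x 4 x i1> %mask, i32 %evl)

    ; the invariant in question: for any %avl with %avl <= %evl,
    ;     call i32 @llvm.evl.setvl.nxv4f32(i32 %avl)  ==  %avl
    ; i.e. a request at or below an already-granted VL is granted exactly,
    ; so vlen operands derived from evl.setvl preserve "all lanes in
    ; [0, VL) are processed".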
On 2/5/19 12:06 PM, Bruce Hoult wrote:
> Unless I misunderstand, this doesn't describe RVV correctly, although
> this is understandable as the spec has moved around a bit in the last
> six or twelve months as it's gotten closer to being set in stone.
>
> [... detailed explanation of VSETVL[I], vtype, MVL/AVL and the worked
> example, quoted in full above -- snipped ...]
Thank you for the detailed explanation! I wasn't aware of the current
state of RVV in that regard. This seems to imply that enforcing MVL
changes only at the function level is now moot (as in
https://lists.llvm.org/pipermail/llvm-dev/2018-April/122517.html).

--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll