On 2/5/19 12:06 PM, Bruce Hoult wrote:
> On Tue, Feb 5, 2019 at 1:23 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
>> I think this is the usual mixup of AVL and MVL.
>>
>> AVL: is part of the predicate and can change between vector operations
>> just like a mask can (light weight).
>>
>> MVL: is the physical vector register length and can be re-configured per
>> function (RVV only atm) - (heavy weight, stop-the-world instruction).
>>
>> The vectorlen parameter in EVL intrinsics is for the AVL.
>
> Unless I misunderstand, this doesn't describe RVV correctly, although
> this is understandable as the spec has moved around a bit in the last
> six or twelve months as it's gotten closer to being set in stone.
>
> The way it has ended up (very unlikely to change now) is:
>
> - Any given RVV vector unit has 32 registers, each with the same,
>   fixed length in bits.
>
> - The vector unit is configured by the VSETVL[I] instruction, which has
>   two arguments: 1) the requested AVL, and 2) the vtype (vector type).
>
> - The vtype is an integer with several small fields, of which two are
>   currently defined (the other bits must be zero). The fields are the
>   Standard Element Width and VLMul. SEW can be any power of 2 from 8
>   bits up to some implementation-defined maximum (1024 bits absolute
>   maximum). VLMul says that you don't actually need 32 distinct vector
>   variables in your current loop/function and you're willing to trade
>   number of registers for a larger MVL. So you can gang together each
>   even/odd register pair into 16 longer registers (named 0,2,4...30), or
>   you can gang together groups of four or at most eight registers.
>
> - The current MVL -- the maximum number of elements in a vector
>   register -- is the hardware register length, multiplied by the VLMul
>   field in vtype, divided by the SEW field in vtype.
>
> - The AVL is the smaller of MVL and the requested AVL.
>
> - Only two things can change AVL: the VSETVL[I] instruction, and a
>   special kind of memory load ("Unit-stride First-Fault Loads") when the
>   load crosses a protection boundary and the tail of the vector is
>   inaccessible. This kind of load is relatively uncommon and exists so
>   you can vectorise things where the end of the application vector is
>   data-dependent rather than counted. The canonical example is
>   strlen()/strcpy(). For most code you can ignore it and say the AVL
>   changes only when you execute VSETVL[I].
>
> - Any time the program uses VSETVL[I], *both* the MVL and the AVL can change.
>
> - The common case is a loop with the vtype in an immediate VSETVLI at
>   the head of the loop. In this case, the AVL potentially changes in
>   every iteration of the loop (but usually only in the last one or two
>   iterations). As the vtype is an immediate, it can't change from
>   iteration to iteration. But it's common for two loops in the same
>   function to use different vtypes, and so different MVLs, because the
>   loops might either operate on different data types, or need a
>   different number of vector variables in the loop, or both.
>
> - VSETVL[I] is *not* heavyweight, even if it changes the MVL. It's
>   quite OK to execute it as often as you want -- even before every vector
>   instruction if you want. That would be pretty unusual, and I think
>   falls more into the "clever hand-written code" area than into anything
>   a compiler is likely to want to generate from C loops, although it's
>   certainly possible.
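To make the MVL/AVL arithmetic above concrete, here is a minimal C model of
those two rules. The constant and helper names are invented purely for
illustration, and the 512-bit register length is only an assumption (it
matches the example that follows).

#include <stdint.h>
#include <stdio.h>

#define VLEN_BITS 512  /* assumed per-register length in bits; implementation-defined in reality */

/* MVL in elements = (register length in bits * VLMul) / SEW. */
static uint64_t mvl_elems(uint64_t sew_bits, uint64_t vlmul)
{
    return (VLEN_BITS * vlmul) / sew_bits;
}

/* AVL granted by VSETVL[I]: the smaller of the requested AVL and MVL. */
static uint64_t avl_granted(uint64_t requested, uint64_t sew_bits, uint64_t vlmul)
{
    uint64_t m = mvl_elems(sew_bits, vlmul);
    return requested < m ? requested : m;
}

int main(void)
{
    /* SEW=32, VLMul=4: MVL = 512*4/32 = 64 elements. */
    printf("MVL(sew32,vlmul4)   = %llu\n", (unsigned long long)mvl_elems(32, 4));
    /* A request of 1000 elements is capped at MVL; a request of 10 is granted as-is. */
    printf("AVL(1000 requested) = %llu\n", (unsigned long long)avl_granted(1000, 32, 4));
    printf("AVL(10 requested)   = %llu\n", (unsigned long long)avl_granted(10, 32, 4));
    return 0;
}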
> Here's an example:
>
> void foo(size_t n, int64_t *dst, int32_t *a, int32_t *b){
>   for (size_t i=0; i<n; ++i)
>     dst[i] += a[i] * b[i];
> }
>
> If 32x32->64 multiplies are cheaper than 64x64->64 multiplies then you
> might want to compile this to:
>
> # args: n in a0, dst in a1, a in a2, b in a3; AVL will be in a4
> foo:
>   vsetvli  a4, a0, vsew32,vlmul4   # vtype = 32-bit integer vectors, AVL in a4
>   vlw.v    v0, (a2)                # Get 32b vector a into v0-v3
>   vlw.v    v4, (a3)                # Get 32b vector b into v4-v7
>   slli     a5, a4, 2               # Multiply AVL by element size 4 bytes
>   add      a2, a2, a5              # Bump pointer a
>   add      a3, a3, a5              # Bump pointer b
>   vwmul.vv v8, v0, v4              # 64b result in v8-v15
>
>   vsetvli  zero, a0, vsew64,vlmul8 # Operate on 64b values; discard new AVL as it's the same
>   vld.v    v16, (a1)               # Get 64b vector dst into v16-v23
>   vadd.vv  v16, v16, v8            # Add 64b elements in v8-v15 to v16-v23
>   vsd.v    v16, (a1)               # Store vector of 64b
>   slli     a5, a4, 3               # Multiply AVL by element size 8 bytes
>   add      a1, a1, a5              # Bump pointer dst
>   sub      a0, a0, a4              # Subtract AVL from n to get remaining count
>   bnez     a0, foo                 # Any more?
>   ret
>
> The alternative of course is to set up for 64 bit elements at the
> outset, let the two vlw.v's for a and b widen the 32 bit loads into 64
> bit elements, then do 64x64->64 multiplies. The code would be two
> instructions shorter, saving one of the vsetvli (4 bytes) and one of
> the shifts (2 bytes).
>
> Assuming for the moment a 512 bit (64 byte) vector register size
> (total vector register file 2 KB), this function initially sets the
> MVL to 64 (2048 bits divided into 32-bit elements). The widening
> multiply produces 64 64-bit elements. The second half of the loop then
> sets the element size to 64 bits and doubles the vlmul, so the MVL is
> still 64 (4096 bits divided into 64-bit elements). The load, add, and
> store of dst then take place using 64 bit calculations.
>
> Except on the last iteration [1], the AVL will be the same as the MVL.
> Both will change (in bits, not in number of elements in this case)
> twice in each loop iteration.
>
> [1] If on the 2nd-to-last iteration there are, say, 72 elements left,
> the vsetvli instruction might choose to return an AVL of 36 elements,
> leaving 36 for the last iteration, rather than doing 64 and then
> leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and
> 32, depending on what suits that particular hardware. Or maybe it will
> equalise the last three or four or more iterations. The main rule is
> that the AVL must decrease monotonically.

Thank you for the detailed explanation! I wasn't aware of the current
state of RVV in that regard.

This seems to imply that enforcing MVL changes only at the function level
is now moot (as in
https://lists.llvm.org/pipermail/llvm-dev/2018-April/122517.html).

--

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll
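For reference, the strip-mining structure of the assembly example above,
rendered as plain scalar C. grant_avl() is only a stand-in for the value
vsetvli hands back, the fixed MVL of 64 assumes the 512-bit registers from
the example, and the casts make the 32x32->64 widening multiply explicit.

#include <stddef.h>
#include <stdint.h>

enum { MVL = 64 };  /* assumed: 512-bit registers, 64 elements at sew32/vlmul4 and at sew64/vlmul8 */

/* Stand-in for the AVL that vsetvli hands back: min(remaining, MVL).
 * (Real hardware may also hand back less near the end, e.g. splitting
 * 72 remaining elements as 36 + 36, as long as the granted length
 * never increases from one strip to the next.) */
static size_t grant_avl(size_t remaining)
{
    return remaining < (size_t)MVL ? remaining : (size_t)MVL;
}

/* Scalar rendering of the strip-mined loop: each trip processes `vl`
 * elements, bumps the pointers, and subtracts `vl` from the remaining
 * count, like the slli/add/sub/bnez sequence in the assembly. */
void foo(size_t n, int64_t *dst, const int32_t *a, const int32_t *b)
{
    while (n > 0) {
        size_t vl = grant_avl(n);          /* vsetvli a4, a0, ...                */
        for (size_t i = 0; i < vl; ++i)    /* vlw.v / vwmul.vv / vadd.vv / vsd.v */
            dst[i] += (int64_t)a[i] * (int64_t)b[i];
        dst += vl;                         /* bump pointers by vl elements       */
        a += vl;
        b += vl;
        n -= vl;                           /* sub a0, a0, a4                     */
    }
}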
On Tue, Feb 5, 2019 at 4:28 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
> Thank you for the detailed explanation! I wasn't aware of the current
> state of RVV in that regard.
>
> This seems to imply that enforcing MVL changes only at the function level
> is now moot (as in
> https://lists.llvm.org/pipermail/llvm-dev/2018-April/122517.html).

As Robin said even at that time:

=============
# Runtime-varying vector length in the IR

This is achieved by simply declaring "by fiat" that the vector length is
determined on function entry and remains constant for the rest of the
function execution. Other functions and other calls to the same function
may observe a different vector length, but within one call to a given
function, the vector length is fixed. That is not precisely how the
hardware works, but it is a contract the backend can uphold easily.
=============

i.e. that's not a restriction imposed by the hardware, but writing the
compiler that way *might* make the compiler simpler, at the cost of
missing out on some of the flexibility in the hardware.

That's still exactly as applicable or inapplicable now as then.

Robin also said then:

=============
For scenarios like two entirely separate vectorized loops within one
function, it might be useful to drastically change the vector unit
configuration in the middle of a function.
=============

So, no real change or disagreement; it's just a question of how to
expediently shoehorn *something* into LLVM, on which I definitely
defer to Robin and Alex.
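Concretely, the scenario Robin describes can be as simple as the sketch
below (the function and its contents are invented purely for illustration):
the first loop naturally wants a narrow-element configuration and the
second a wide one, so a single per-function vtype/MVL choice has to
compromise where per-loop reconfiguration would not.

#include <stddef.h>
#include <stdint.h>

/* Two entirely separate vectorized loops in one function: the first
 * naturally wants SEW=8, the second SEW=64. Under a "vector length is
 * fixed per function call" contract, one configuration must serve both;
 * letting the configuration change between the loops is the flexibility
 * at stake. */
void scale_then_accumulate(size_t n, uint8_t *bytes,
                           size_t m, double *acc, const double *src)
{
    for (size_t i = 0; i < n; ++i)      /* candidate for SEW=8 vectorization  */
        bytes[i] = (uint8_t)(bytes[i] * 3u);

    for (size_t j = 0; j < m; ++j)      /* candidate for SEW=64 vectorization */
        acc[j] += src[j];
}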
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-05 13:14 UTC
[llvm-dev] [RFC] Vector Predication
On Tue, Feb 5, 2019 at 1:01 PM Bruce Hoult <brucehoult at sifive.com> wrote:
> Robin also said then:
>
> =============
> For scenarios like two entirely separate vectorized loops within one
> function, it might be useful to drastically change the vector unit
> configuration in the middle of a function.
> =============

A workable solution that keeps the simplicity of the proposed contract,
yet also provides flexibility that would otherwise result in missed
optimisation opportunities, is to treat the vectorised loops (plural),
particularly if they are inner loops within outer loops, as "nameless
functions".

By splitting out the (two or more) separate loops into "functions without
names, even though the programmer didn't actually *make* them as
functions", the infrastructure associated *with* functions may push the
required context (such as the current VL) onto the stack, and restore it
on exit from the [nameless] function.

Kinda like the opposite of inlining. Outlining? :)

I do not have enough compiler experience to say whether the overhead of
the associated push/pop of a function call is worthwhile or not.

l.
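At the source level, the "nameless function" idea amounts to something
like the sketch below (all names are invented for illustration; this is a
sketch of the idea, not anything a compiler currently emits): each
vectorised loop is outlined into its own function, so a per-function
vector-length contract applies to each loop separately and the call
boundary is where the VL context would be pushed and popped.

#include <stddef.h>
#include <stdint.h>

/* Outlined body of the first loop: under a per-function VL contract,
 * this "nameless" function is free to pick its own narrow-element
 * configuration on entry. */
static void outlined_loop_sew8(size_t n, uint8_t *bytes)
{
    for (size_t i = 0; i < n; ++i)
        bytes[i] = (uint8_t)(bytes[i] * 3u);
}

/* Outlined body of the second loop, with its own wide-element configuration. */
static void outlined_loop_sew64(size_t m, double *acc, const double *src)
{
    for (size_t j = 0; j < m; ++j)
        acc[j] += src[j];
}

/* The original function now just calls the outlined loops; conceptually
 * the caller's vector-length context is pushed before each call and
 * popped on return, like any other callee-managed state. */
void scale_then_accumulate(size_t n, uint8_t *bytes,
                           size_t m, double *acc, const double *src)
{
    outlined_loop_sew8(n, bytes);
    outlined_loop_sew64(m, acc, src);
}

Whether the save/restore implied at each call boundary is cheaper than
simply re-issuing VSETVL[I] inline is exactly the overhead question
raised above.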
On 2/5/19 2:00 PM, Bruce Hoult wrote:
> On Tue, Feb 5, 2019 at 4:28 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
>> Thank you for the detailed explanation! I wasn't aware of the current
>> state of RVV in that regard.
>>
>> This seems to imply that enforcing MVL changes only at the function level
>> is now moot (as in
>> https://lists.llvm.org/pipermail/llvm-dev/2018-April/122517.html).
>
> As Robin said even at that time:
>
> =============
> # Runtime-varying vector length in the IR
>
> This is achieved by simply declaring "by fiat" that the vector length is
> determined on function entry and remains constant for the rest of the
> function execution. Other functions and other calls to the same function
> may observe a different vector length, but within one call to a given
> function, the vector length is fixed. That is not precisely how the
> hardware works, but it is a contract the backend can uphold easily.
> =============
>
> i.e. that's not a restriction imposed by the hardware, but writing the
> compiler that way *might* make the compiler simpler, at the cost of
> missing out on some of the flexibility in the hardware.
>
> That's still exactly as applicable or inapplicable now as then.
>
> Robin also said then:
>
> =============
> For scenarios like two entirely separate vectorized loops within one
> function, it might be useful to drastically change the vector unit
> configuration in the middle of a function.
> =============
>
> So, no real change or disagreement; it's just a question of how to
> expediently shoehorn *something* into LLVM, on which I definitely
> defer to Robin and Alex.

Ok. I am just poking around a little to see whether something like
llvm.evl.setvl
(https://lists.llvm.org/pipermail/llvm-dev/2019-February/129973.html)
would be compatible with your solution for MVL configuration in RVV.

--

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll