thr3ads.net - llvm dev - [llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM [Apr 2018]

Robin Kruppe via llvm-dev

2018-Apr-11 09:44 UTC

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

RISC-V is an open and free instruction set architecture (ISA) used in
numerous domains in industry and research. The vector extension (short:
'V') supplements the basic ISA with support for data parallel
computations.
This RFC sketches a strategy for targeting this instruction set extension
in LLVM.

Some but not all of what is proposed here has already been implemented out
of tree. It is explicitly not proposed to upstream any of this yet: the
vector extension is still evolving (though the core concepts are reasonably
stable), and the implementation is currently very much prototype quality.
Nevertheless, I want to kick off a discussion about this with the LLVM
community now to make sure I'm on the right track and to make the eventual
upstreaming go more smoothly. In particular, a large and potentially
controversial part of the strategy is a proposal for extending LLVM IR with
a new vector type.

There is also much to be said about how to structure the code generation
for this ISA. However, since that aspect somewhat simpler, largely
orthogonal and affects a smaller subset of the community, the details will
be left to a future RFC.

This RFC is intended to be self-contained, but interested readers can learn
more about the vector extension from Roger Espasa's talk at the 7th RISC-V
workshop (slides [1], recording [2]). The draft specification is also
available as part of the RISC-V Instruction Set Manual [3], but right now
it is unfortunately incomplete and in the process of being updated.

I will also be at EuroLLVM with a lightning talk and poster on this
subject, so if you're there as well, we can discuss in person.

[1]
https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf
[2] https://www.youtube.com/watch?v=GzZ-8bHsD5s
[3] https://github.com/riscv/riscv-isa-manual/


# Summary

First-class support for the RISC-V vector ISA requires representing a
hardware vector length that is not just unknown at compile time, but also
changes during execution. This in turn places some restrictions on code
motion: the vector length must not change while any vector values are live.
This RFC proposes to add a new vector type to LLVM IR for this purpose.
Simply put, its length is tied to the surrounding function and the existing
`token` type is leveraged to tell optimization passes that certain vector
operations must remain together (i.e., in the same function).


# Background

The RISC-V vector extension has many interesting properties. This RFC is
not the right place to talk about it in detail, but this section will
briefly introduce the aspect that is most difficult to support in LLVM IR,
and which is consequently the focus of this RFC: the runtime-variable
vector length.

The number of elements in a vector register is determined by the
microarchitecture. Software uses strip-mined loops to transparently process
as many elements per iteration as the hardware can support. But even beyond
that, the vector length can vary during the execution of a program:
different kernels may *configure* the vector unit differently depending on
their needs, leading to different parts of the program having
differently-sized vector registers.

The vector length being determined by the microarchitecture is similar to
Arm's Scalable Vector Extension (SVE), for which support is being
upstreamed at the moment. However, in SVE the vector length is fixed once a
program starts running, while full use of the RISC-V vector extension will
lead to the vector length changing regularly during execution. It's
possible to maintain the same configuration -- and therefore the same
vector length -- throughout the entire program, but this will often perform
worse than a tailored configuration.

## Maximum vs active vs application vector length

The V ISA has two notions of vector length: the *maximum* vector length
(called MVL), which describes the number of elements in each vector
register, and the *active* vector length (called vl), which limits how many
of those elements are actually processed by each vector instruction.

The latter is used to express loops of any application-specified length
with a single copy of the loop body. Instead of handling the tail
iterations not divisible by MVL separately with scalar code, the active
length length is set up so that the last few iterations process as many
elements as are left to process.

The effect of the active vector length is similar to a mask of the form
`<true, ..., true, false, ..., false>`, aside from the scalar control
logic
that sets and maintains the active vector length. Thus it can be modeled in
IR with judicious use of intrinsics and masking. This still allows also
having a single loop body in IR, without introducing new IR concepts in
addition to those already needed for the variable MVL.

Thus, the rest of this RFC focuses on handling the MVL: all references to
"vector length" from this point on should be taken to refer to the
MVL, not
the active vector length.


# Scope of the support

To preempt misunderstandings, this section outlines what is meant and not
meant by "support for the vector extension".

## Variable vector length

There *is* an option to entirely avoid the concept of the vector length
changing during execution. Keeping the same vector unit configuration
throughout the entire program execution also leads to the vector length
being fixed once the program starts executing. In this case, compiler
support works out rather rather similar to support for Arm SVE, with the
biggest difference being that vectors lengths are not multiples of 128 bit
(which legalization can paper over). Indeed, no IR changes beyond those
proposed for SVE support appear to be necessary to implement this approach
to RISC-V vector extension support.

However, this approach is wasteful, as a tailored configuration can improve
performance and energy efficiency significantly. As one data point, the
Hwacha project reported [4] up to 9.5% fewer cycles taken and up to 11%
less energy consumed on a microarchitecture built to exploit narrower bit
widths of vector elements (comparing "mvp, packed: yes" to "mvp,
packed:
no"). Besides such microarchitectural optimizations, enabling fewer
registers can improve context switch times because fewer registers need to
be saved, and being more flexible in how registers can be used (in
particular, how many are reserved for scalar values) aids register
allocation.

Thus, restricting programs to a single configuration may be a good first
step to get things up and running, but ultimately support for
runtime-varying vector lengths is desired to make the most of hardware
capabilities.

[4] "A Case for MVPs: Mixed-Precision Vector Processors", Albert Ou,
Quan
Nguyen, Yunsup Lee, Krste Asanović,
http://hwacha.org/papers/hwacha-mvp-prism2014.pdf


## Producing vector code

It is intended that vector code is primarily produced via loop
vectorization and other IR-level auto-vectorizers (e.g., the region
vectorizer), not written by hand. Supporting loop vectorization is of
highest priority. The groundwork for loop vectorization should be useful
for other kinds of automatic vectorization as well, but loop vectorization
will be implemented first.

It's not required or expected that the stock loop vectorizer can generate
RISC-V vector code from the start. Considering the many significant
differences to the packed-SIMD architectures the loop vectorizer is
tailored to, it's quite likely that some experimentation in this space is
required (e.g., building on VPlan and writing custom recipes). Of course,
in the long term there should be as much code sharing as possible.

Support for hand-written vector code va source-language-level intrinsics
(as opposed to inline assembly) would be nice to have and probably falls
out for free, but is rather low priority.


## No vector unit configuration in IR

While configuring the vector unit is an essential part of compiling for the
V ISA, it has no place in LLVM IR. Vectors should be regular SSA values
that don't reference any extra state other than (by necessity) the vector
length. Deciding how to configure the vector unit for a given piece of code
is target-dependent and intertwined with register allocation, and will
therefore be left to the backend.

# Challenge: Code motion around vector length changes

When the vector length can change during execution, there are implicit
dependencies between vector operations and points in the program where the
vector length may change. These dependencies must be taken into account
somehow, or else code motion passes could move vector operations across
vector length changes, effectively changing program semantics. For example,
it's nonsensical to compute a vector value `%v1` with one vector length,
change the vector length, and then compute another vector `%v2` with the
new vector length and add it to `%v1`. This makes no more sense than adding
`<4 x i32>` to `<8 x i32>`, yet it could happen if an input program
has a
vector length change *after* these vector calculations and optimization
passes are not aware of the impact of the vector length change on those
calculations.

Crucially, the vector length changes when calling and returning from
functions in most calling conventions. Functions that don't specifically
use a vectorcall ABI configure the vector unit for their own use when
called, rather than using configuration set up by the caller. Therefore,
caller and callee will generally have different vector lengths, and moving
vector operations from the caller into the callee or vice versa tends to
break programs.

However, note that the precise value of the vector length doesn't really
matter -- software is supposed to be *vector length agnostic*. Completely
inlining a function is perfectly fine, for example. What matters is that
the vector length doesn't change *during* vector computations, i.e., while
any vector values are live (either as SSA values, or in memory!). Thus,
there is no need to support and track vector length changes at instruction
granularity. It's enough to coarsely enforce that the vector length remain
constant throughout a larger code region (say, a loop nest, or a function).


# Runtime-varying vector length in the IR

This is achieved by simply declaring "by fiat" that the vector length
is
determined on function entry and remain constant for the rest of the
function execution. Other functions and other calls to the same function
may observe a different vector length, but within one call to a given
function, the vector length is fixed. That is not precisely how the
hardware works, but it is a contract the backend can uphold easily (more on
this later) and it allows piggy-backing on existing IR constructs (the
`token` type) to communicate restrictions to optimization passes.
Nevertheless, some additions to IR are required: a new first class type, a
new instruction, and a new operand for some existing instructions.

## IR semantics

Every time a function is called, a positive integer called the *dynamic
vector length* is determined in an unspecified way. The dynamic vector
length can differ not only between different functions, but also between
different calls to the same function. The exception is that functions with
the `inherits_vlen` attribute get the same dynamic vector length as their
caller (Note: this attribute is a straw man, target-specific calling
conventions may work better for this purpose).

A new instruction `vlentoken` is added, which has no operands and is of
type `token`. This token represents the dynamic vector length of the
function execution it is in. There can only be one such instruction per
function (this is inconsequential to the operational semantics, but it
simplifies IR passes).

A new kind of type is added, the *dynamic-length vector*, written `< vlen x
<element type> >`. It represents a vector with a number of elements
equal
to the dynamic vector length. Like fixed-length and scalable vectors, these
vectors can only contain integer, float and pointer elements.

A use of a `vlentoken` (representing a dynamic vector length) is attached
to all operations that care for the dynamic vector length. That is, *every*
instruction that handles dynamic-length vectors or is impacted by their
length receives the respective function's `vlentoken` as extra operand, and
operates on a number of elements equal to the corresponding dynamic vector
length.

`< vlen x <element type> >` is a first-class type and supports most
common
instructions (details below), but cannot be used as function argument or
function return type unless the `inherits_vlen` attribute is applied to the
callee.

## Rationale / "why this works"

Although the vector length is a property of vector *values*, tracking the
dynamic vector length at the type level would require a "separate
type" for
each call to any function. It's much more feasible to attach the vector
length to the *operations* instead. This works out because SSA values are
function-local (so all operation on them agree on the vector length by
definition) with the exception of function arguments and return values.
Consequently, dynamic-length vectors in function signatures are disallowed
unless the `inherits_vlen` attribute ensured caller and callee have the
same dynamic vector length.

The `vlentoken` token ensures that all operations that start out in the
same function must remain in the same function while the code is
transformed (recall that tokens cannot be passed to or returned from
non-intrinsic functions). That's why it is important that `vlentoken` is a
token, not simply an integer as one might expect. In other words,
`vlentoken` does not give the program access to the dynamic vector length,
it communicates a restriction to the optimizer.

## More details

The `< vlen x <element type> >` type is separate from the existing
vector
types. Instructions for fixed-length vectors (elementwise arithmetic,
`insertelement`, `select` with a vector of `i1`s, etc.) are not extended to
this new type, at least not in this RFC. It's a possible future extension,
but for now, target-specific intrinsics work fine for those operations.

The following operations on dynamic-length vectors *are* supported:

- `phi`
- `load` and `store`
- `alloca` (at least of a single vector; the `alloca <ty>, <ty>
<NumElements>` form ties into the open question about aggregates and GEPs
below)
- `select` (with `i1` condition)
- Argument passing and return values (`call`, `invoke`, `ret`) for
functions with `inherits_vlen`

All of these instructions (including phi) have an additional operand of
type `token` if and only if they operate on a vector of dynamic length. In
textual IR, one appends `, vlen <the token>` to the instruction, for
example:

    %0 = vlentoken
    %ptr = alloca <vlen x i32>, vlen %0
    %v = call <vlen x i32> @foo(), vlen %0
    store <vlen x i32> %v, <vlen x i32>* %ptr, vlen %0

Open question: should GEPs and aggregates involving dynamic-length vectors
work? This RFC errs on the side of simplicity and excludes them (they're
non-trivial to implement and not needed for strip-mined loops) but if
desired, they could be supported.

There are no constants of dynamic-length-vector type except
`zeroinitializer` and `undef` (resp. `poison` once that is adopted). In
particular, there is no equivalent to fixed-length vector constants (`<ty
elem1, ty elem2, ...>`). Dynamic-length vectors also cannot be stored in
globals.

## Impact on optimizations

The semantics imply several restrictions on optimizations, but these are
mostly encoded with existing IR constructs -- chiefly, the `token` type
that ties all vector operations to a `vlentoken`. For example, because
token values cannot be passed to (non-intrinsic) functions or returned from
them, no special pleading is needed to keep an outliner or partial inliner
from spreading vector operations across multiple functions -- correct
passes already don't do that when tokens are involved. Passes do, however,
need to be updated in two respects.

First, the new token operand needs to be respected when comparing two
instructions, creating new instructions, etc. -- this is an inherent
downside of adding new operands, but also rather mechanical. The rule that
there is only one `vlentoken` per function makes this even more mechanical
than usual, because all instruction within one function have the same
vector length token. This means that one does not need to consider them
when comparing instructions from the same function, and it's always clear
which token should be used when creating a new instruction.

Second, the very same rule of only one `vlentoken` per function must be
respected during interprocedural code motion. For example, inlining can't
just copy the `vlentoken` from the callee into the caller.

However, note that it's valid to *merge* the caller's and callee's
`vlentoken` instructions. Because the semantics state that each call to a
function can get a different dynamic vector length, merging `vlentoken`s
*refines* the program's behavior by picking the possible execution where
the callee "happens to" get the same vector length as the caller in
the
inlined calls. So inlining can simply replace all `vlentoken` tokens in the
inlined code with the `vlentoken` token of the callee. Other passes are
likely similarly easy to update (and in the worst case, they can just bail
out when seeing dynamic-length vectors).

## Impact on backends

Unsurprisingly, the IR types `<vlen x <element type> >` come with
associated MVTs. There's also a new SelectionDAG node `VLENTOKEN` to mirror
the `vlentoken` IR instruction (and presumably `G_VLENTOKEN` in GMIR for
GlobalISel).

Backends other than RISC-V can legalize these MVTs and the `VLENTOKEN` node
very easily, even if in practice there currently aren't many useful
operations on these vectors without target-specific intrinsics.
Specifically, `< vlen x <element type> >` can be legalized as `<
n x
<element type> >` or even `< scalable n x <element type> >`
(the vector
type for Arm SVE) for any fixed `n`. All the complications stemming from
the runtime-varying vector length go away, and the `vlentoken` node can
simply be dropped on the floor.

That leaves targets with an actual runtime-varying vector length in
hardware, i.e., RISCV with the V feature enabled. As stated in the
introduction, this RFC does not cover the backend changes in detail, but to
give you a rough idea, here's a sketch. Keep in mind (especially if
you're
familiar with V) that this is glossing over everything not directly related
to the proposed IR type (particularly the "polymorphic instruction
set"
aspect of the register configuration).

As described earlier, the vector length in RISC-V is completely determined
by the vector unit configuration. Therefore, vector operations in Machine
IR have an implicit use of the configuration registers. This is the moral
equivalent of the `vlentoken` token operand, but more precise (and MIR
doesn't have an equivalent of the token type anyway). To complete the
picture, all operations that change the configuration are made explicit.

Because only virtual registers can be live across basic block boundaries
before register allocation, this may require a dummy register class with
only a single physical register, or something similarly inelegant.

With a way to precisely represent vector length changes in hand, the
backend just needs to ensure it implements the semantics of `< vlen x
<element type> >` described earlier. This is achieved by configuring
the
vector unit "in the prologue", and then not doing anything that might
change the vector length inside the function. This setup is effectively in
the prologue (i.e., before any user code) but not actually inserted during
the "prologue/epilogue insertion" pass, which runs far too late for
this
purpose.

For scenarios like two entirely separate vectorized loops within one
function, it might be useful to drastically change the vector unit
configuration in the middle of a function. This could be implemented as an
optimization (e.g., a pre-RA machine function), but it's all hypothetical
so far.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180411/8ba3488b/attachment.html>

Eric Christopher via llvm-dev

2018-Apr-12 21:27 UTC

head link

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

I'm just going to add Kristof here since ARM is looking to add SVE here and
this overlaps quite a bit with their goals.

-eric

On Wed, Apr 11, 2018 at 2:45 AM Robin Kruppe via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> RISC-V is an open and free instruction set architecture (ISA) used in
> numerous domains in industry and research. The vector extension (short:
> 'V') supplements the basic ISA with support for data parallel
computations.
> This RFC sketches a strategy for targeting this instruction set extension
> in LLVM.
>
> Some but not all of what is proposed here has already been implemented out
> of tree. It is explicitly not proposed to upstream any of this yet: the
> vector extension is still evolving (though the core concepts are reasonably
> stable), and the implementation is currently very much prototype quality.
> Nevertheless, I want to kick off a discussion about this with the LLVM
> community now to make sure I'm on the right track and to make the
eventual
> upstreaming go more smoothly. In particular, a large and potentially
> controversial part of the strategy is a proposal for extending LLVM IR with
> a new vector type.
>
> There is also much to be said about how to structure the code generation
> for this ISA. However, since that aspect somewhat simpler, largely
> orthogonal and affects a smaller subset of the community, the details will
> be left to a future RFC.
>
> This RFC is intended to be self-contained, but interested readers can
> learn more about the vector extension from Roger Espasa's talk at the
7th
> RISC-V workshop (slides [1], recording [2]). The draft specification is
> also available as part of the RISC-V Instruction Set Manual [3], but right
> now it is unfortunately incomplete and in the process of being updated.
>
> I will also be at EuroLLVM with a lightning talk and poster on this
> subject, so if you're there as well, we can discuss in person.
>
> [1]
>
https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf
> [2] https://www.youtube.com/watch?v=GzZ-8bHsD5s
> [3] https://github.com/riscv/riscv-isa-manual/
>
>
> # Summary
>
> First-class support for the RISC-V vector ISA requires representing a
> hardware vector length that is not just unknown at compile time, but also
> changes during execution. This in turn places some restrictions on code
> motion: the vector length must not change while any vector values are live.
> This RFC proposes to add a new vector type to LLVM IR for this purpose.
> Simply put, its length is tied to the surrounding function and the existing
> `token` type is leveraged to tell optimization passes that certain vector
> operations must remain together (i.e., in the same function).
>
>
> # Background
>
> The RISC-V vector extension has many interesting properties. This RFC is
> not the right place to talk about it in detail, but this section will
> briefly introduce the aspect that is most difficult to support in LLVM IR,
> and which is consequently the focus of this RFC: the runtime-variable
> vector length.
>
> The number of elements in a vector register is determined by the
> microarchitecture. Software uses strip-mined loops to transparently process
> as many elements per iteration as the hardware can support. But even beyond
> that, the vector length can vary during the execution of a program:
> different kernels may *configure* the vector unit differently depending on
> their needs, leading to different parts of the program having
> differently-sized vector registers.
>
> The vector length being determined by the microarchitecture is similar to
> Arm's Scalable Vector Extension (SVE), for which support is being
> upstreamed at the moment. However, in SVE the vector length is fixed once a
> program starts running, while full use of the RISC-V vector extension will
> lead to the vector length changing regularly during execution. It's
> possible to maintain the same configuration -- and therefore the same
> vector length -- throughout the entire program, but this will often perform
> worse than a tailored configuration.
>
> ## Maximum vs active vs application vector length
>
> The V ISA has two notions of vector length: the *maximum* vector length
> (called MVL), which describes the number of elements in each vector
> register, and the *active* vector length (called vl), which limits how many
> of those elements are actually processed by each vector instruction.
>
> The latter is used to express loops of any application-specified length
> with a single copy of the loop body. Instead of handling the tail
> iterations not divisible by MVL separately with scalar code, the active
> length length is set up so that the last few iterations process as many
> elements as are left to process.
>
> The effect of the active vector length is similar to a mask of the form
> `<true, ..., true, false, ..., false>`, aside from the scalar control
logic
> that sets and maintains the active vector length. Thus it can be modeled in
> IR with judicious use of intrinsics and masking. This still allows also
> having a single loop body in IR, without introducing new IR concepts in
> addition to those already needed for the variable MVL.
>
> Thus, the rest of this RFC focuses on handling the MVL: all references to
> "vector length" from this point on should be taken to refer to
the MVL, not
> the active vector length.
>
>
> # Scope of the support
>
> To preempt misunderstandings, this section outlines what is meant and not
> meant by "support for the vector extension".
>
> ## Variable vector length
>
> There *is* an option to entirely avoid the concept of the vector length
> changing during execution. Keeping the same vector unit configuration
> throughout the entire program execution also leads to the vector length
> being fixed once the program starts executing. In this case, compiler
> support works out rather rather similar to support for Arm SVE, with the
> biggest difference being that vectors lengths are not multiples of 128 bit
> (which legalization can paper over). Indeed, no IR changes beyond those
> proposed for SVE support appear to be necessary to implement this approach
> to RISC-V vector extension support.
>
> However, this approach is wasteful, as a tailored configuration can
> improve performance and energy efficiency significantly. As one data point,
> the Hwacha project reported [4] up to 9.5% fewer cycles taken and up to 11%
> less energy consumed on a microarchitecture built to exploit narrower bit
> widths of vector elements (comparing "mvp, packed: yes" to
"mvp, packed:
> no"). Besides such microarchitectural optimizations, enabling fewer
> registers can improve context switch times because fewer registers need to
> be saved, and being more flexible in how registers can be used (in
> particular, how many are reserved for scalar values) aids register
> allocation.
>
> Thus, restricting programs to a single configuration may be a good first
> step to get things up and running, but ultimately support for
> runtime-varying vector lengths is desired to make the most of hardware
> capabilities.
>
> [4] "A Case for MVPs: Mixed-Precision Vector Processors", Albert
Ou, Quan
> Nguyen, Yunsup Lee, Krste Asanović,
> http://hwacha.org/papers/hwacha-mvp-prism2014.pdf
>
>
> ## Producing vector code
>
> It is intended that vector code is primarily produced via loop
> vectorization and other IR-level auto-vectorizers (e.g., the region
> vectorizer), not written by hand. Supporting loop vectorization is of
> highest priority. The groundwork for loop vectorization should be useful
> for other kinds of automatic vectorization as well, but loop vectorization
> will be implemented first.
>
> It's not required or expected that the stock loop vectorizer can
generate
> RISC-V vector code from the start. Considering the many significant
> differences to the packed-SIMD architectures the loop vectorizer is
> tailored to, it's quite likely that some experimentation in this space
is
> required (e.g., building on VPlan and writing custom recipes). Of course,
> in the long term there should be as much code sharing as possible.
>
> Support for hand-written vector code va source-language-level intrinsics
> (as opposed to inline assembly) would be nice to have and probably falls
> out for free, but is rather low priority.
>
>
> ## No vector unit configuration in IR
>
> While configuring the vector unit is an essential part of compiling for
> the V ISA, it has no place in LLVM IR. Vectors should be regular SSA values
> that don't reference any extra state other than (by necessity) the
vector
> length. Deciding how to configure the vector unit for a given piece of code
> is target-dependent and intertwined with register allocation, and will
> therefore be left to the backend.
>
> # Challenge: Code motion around vector length changes
>
> When the vector length can change during execution, there are implicit
> dependencies between vector operations and points in the program where the
> vector length may change. These dependencies must be taken into account
> somehow, or else code motion passes could move vector operations across
> vector length changes, effectively changing program semantics. For example,
> it's nonsensical to compute a vector value `%v1` with one vector
length,
> change the vector length, and then compute another vector `%v2` with the
> new vector length and add it to `%v1`. This makes no more sense than adding
> `<4 x i32>` to `<8 x i32>`, yet it could happen if an input
program has a
> vector length change *after* these vector calculations and optimization
> passes are not aware of the impact of the vector length change on those
> calculations.
>
> Crucially, the vector length changes when calling and returning from
> functions in most calling conventions. Functions that don't
specifically
> use a vectorcall ABI configure the vector unit for their own use when
> called, rather than using configuration set up by the caller. Therefore,
> caller and callee will generally have different vector lengths, and moving
> vector operations from the caller into the callee or vice versa tends to
> break programs.
>
> However, note that the precise value of the vector length doesn't
really
> matter -- software is supposed to be *vector length agnostic*. Completely
> inlining a function is perfectly fine, for example. What matters is that
> the vector length doesn't change *during* vector computations, i.e.,
while
> any vector values are live (either as SSA values, or in memory!). Thus,
> there is no need to support and track vector length changes at instruction
> granularity. It's enough to coarsely enforce that the vector length
remain
> constant throughout a larger code region (say, a loop nest, or a function).
>
>
> # Runtime-varying vector length in the IR
>
> This is achieved by simply declaring "by fiat" that the vector
length is
> determined on function entry and remain constant for the rest of the
> function execution. Other functions and other calls to the same function
> may observe a different vector length, but within one call to a given
> function, the vector length is fixed. That is not precisely how the
> hardware works, but it is a contract the backend can uphold easily (more on
> this later) and it allows piggy-backing on existing IR constructs (the
> `token` type) to communicate restrictions to optimization passes.
> Nevertheless, some additions to IR are required: a new first class type, a
> new instruction, and a new operand for some existing instructions.
>
> ## IR semantics
>
> Every time a function is called, a positive integer called the *dynamic
> vector length* is determined in an unspecified way. The dynamic vector
> length can differ not only between different functions, but also between
> different calls to the same function. The exception is that functions with
> the `inherits_vlen` attribute get the same dynamic vector length as their
> caller (Note: this attribute is a straw man, target-specific calling
> conventions may work better for this purpose).
>
> A new instruction `vlentoken` is added, which has no operands and is of
> type `token`. This token represents the dynamic vector length of the
> function execution it is in. There can only be one such instruction per
> function (this is inconsequential to the operational semantics, but it
> simplifies IR passes).
>
> A new kind of type is added, the *dynamic-length vector*, written `<
vlen
> x <element type> >`. It represents a vector with a number of
elements equal
> to the dynamic vector length. Like fixed-length and scalable vectors, these
> vectors can only contain integer, float and pointer elements.
>
> A use of a `vlentoken` (representing a dynamic vector length) is attached
> to all operations that care for the dynamic vector length. That is, *every*
> instruction that handles dynamic-length vectors or is impacted by their
> length receives the respective function's `vlentoken` as extra operand,
and
> operates on a number of elements equal to the corresponding dynamic vector
> length.
>
> `< vlen x <element type> >` is a first-class type and supports
most common
> instructions (details below), but cannot be used as function argument or
> function return type unless the `inherits_vlen` attribute is applied to the
> callee.
>
> ## Rationale / "why this works"
>
> Although the vector length is a property of vector *values*, tracking the
> dynamic vector length at the type level would require a "separate
type" for
> each call to any function. It's much more feasible to attach the vector
> length to the *operations* instead. This works out because SSA values are
> function-local (so all operation on them agree on the vector length by
> definition) with the exception of function arguments and return values.
> Consequently, dynamic-length vectors in function signatures are disallowed
> unless the `inherits_vlen` attribute ensured caller and callee have the
> same dynamic vector length.
>
> The `vlentoken` token ensures that all operations that start out in the
> same function must remain in the same function while the code is
> transformed (recall that tokens cannot be passed to or returned from
> non-intrinsic functions). That's why it is important that `vlentoken`
is a
> token, not simply an integer as one might expect. In other words,
> `vlentoken` does not give the program access to the dynamic vector length,
> it communicates a restriction to the optimizer.
>
> ## More details
>
> The `< vlen x <element type> >` type is separate from the
existing vector
> types. Instructions for fixed-length vectors (elementwise arithmetic,
> `insertelement`, `select` with a vector of `i1`s, etc.) are not extended to
> this new type, at least not in this RFC. It's a possible future
extension,
> but for now, target-specific intrinsics work fine for those operations.
>
> The following operations on dynamic-length vectors *are* supported:
>
> - `phi`
> - `load` and `store`
> - `alloca` (at least of a single vector; the `alloca <ty>, <ty>
> <NumElements>` form ties into the open question about aggregates and
GEPs
> below)
> - `select` (with `i1` condition)
> - Argument passing and return values (`call`, `invoke`, `ret`) for
> functions with `inherits_vlen`
>
> All of these instructions (including phi) have an additional operand of
> type `token` if and only if they operate on a vector of dynamic length. In
> textual IR, one appends `, vlen <the token>` to the instruction, for
> example:
>
>     %0 = vlentoken
>     %ptr = alloca <vlen x i32>, vlen %0
>     %v = call <vlen x i32> @foo(), vlen %0
>     store <vlen x i32> %v, <vlen x i32>* %ptr, vlen %0
>
> Open question: should GEPs and aggregates involving dynamic-length vectors
> work? This RFC errs on the side of simplicity and excludes them
(they're
> non-trivial to implement and not needed for strip-mined loops) but if
> desired, they could be supported.
>
> There are no constants of dynamic-length-vector type except
> `zeroinitializer` and `undef` (resp. `poison` once that is adopted). In
> particular, there is no equivalent to fixed-length vector constants
(`<ty
> elem1, ty elem2, ...>`). Dynamic-length vectors also cannot be stored in
> globals.
>
> ## Impact on optimizations
>
> The semantics imply several restrictions on optimizations, but these are
> mostly encoded with existing IR constructs -- chiefly, the `token` type
> that ties all vector operations to a `vlentoken`. For example, because
> token values cannot be passed to (non-intrinsic) functions or returned from
> them, no special pleading is needed to keep an outliner or partial inliner
> from spreading vector operations across multiple functions -- correct
> passes already don't do that when tokens are involved. Passes do,
however,
> need to be updated in two respects.
>
> First, the new token operand needs to be respected when comparing two
> instructions, creating new instructions, etc. -- this is an inherent
> downside of adding new operands, but also rather mechanical. The rule that
> there is only one `vlentoken` per function makes this even more mechanical
> than usual, because all instruction within one function have the same
> vector length token. This means that one does not need to consider them
> when comparing instructions from the same function, and it's always
clear
> which token should be used when creating a new instruction.
>
> Second, the very same rule of only one `vlentoken` per function must be
> respected during interprocedural code motion. For example, inlining
can't
> just copy the `vlentoken` from the callee into the caller.
>
> However, note that it's valid to *merge* the caller's and
callee's
> `vlentoken` instructions. Because the semantics state that each call to a
> function can get a different dynamic vector length, merging `vlentoken`s
> *refines* the program's behavior by picking the possible execution
where
> the callee "happens to" get the same vector length as the caller
in the
> inlined calls. So inlining can simply replace all `vlentoken` tokens in the
> inlined code with the `vlentoken` token of the callee. Other passes are
> likely similarly easy to update (and in the worst case, they can just bail
> out when seeing dynamic-length vectors).
>
> ## Impact on backends
>
> Unsurprisingly, the IR types `<vlen x <element type> >` come
with
> associated MVTs. There's also a new SelectionDAG node `VLENTOKEN` to
mirror
> the `vlentoken` IR instruction (and presumably `G_VLENTOKEN` in GMIR for
> GlobalISel).
>
> Backends other than RISC-V can legalize these MVTs and the `VLENTOKEN`
> node very easily, even if in practice there currently aren't many
useful
> operations on these vectors without target-specific intrinsics.
> Specifically, `< vlen x <element type> >` can be legalized as
`< n x
> <element type> >` or even `< scalable n x <element type>
>` (the vector
> type for Arm SVE) for any fixed `n`. All the complications stemming from
> the runtime-varying vector length go away, and the `vlentoken` node can
> simply be dropped on the floor.
>
> That leaves targets with an actual runtime-varying vector length in
> hardware, i.e., RISCV with the V feature enabled. As stated in the
> introduction, this RFC does not cover the backend changes in detail, but to
> give you a rough idea, here's a sketch. Keep in mind (especially if
you're
> familiar with V) that this is glossing over everything not directly related
> to the proposed IR type (particularly the "polymorphic instruction
set"
> aspect of the register configuration).
>
> As described earlier, the vector length in RISC-V is completely determined
> by the vector unit configuration. Therefore, vector operations in Machine
> IR have an implicit use of the configuration registers. This is the moral
> equivalent of the `vlentoken` token operand, but more precise (and MIR
> doesn't have an equivalent of the token type anyway). To complete the
> picture, all operations that change the configuration are made explicit.
>
> Because only virtual registers can be live across basic block boundaries
> before register allocation, this may require a dummy register class with
> only a single physical register, or something similarly inelegant.
>
> With a way to precisely represent vector length changes in hand, the
> backend just needs to ensure it implements the semantics of `< vlen x
> <element type> >` described earlier. This is achieved by
configuring the
> vector unit "in the prologue", and then not doing anything that
might
> change the vector length inside the function. This setup is effectively in
> the prologue (i.e., before any user code) but not actually inserted during
> the "prologue/epilogue insertion" pass, which runs far too late
for this
> purpose.
>
> For scenarios like two entirely separate vectorized loops within one
> function, it might be useful to drastically change the vector unit
> configuration in the middle of a function. This could be implemented as an
> optimization (e.g., a pre-RA machine function), but it's all
hypothetical
> so far.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180412/de147bde/attachment.html>

Renato Golin via llvm-dev

2018-Apr-12 22:14 UTC

head link

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

Adding also Florian and Sanders, as they're the ones implementing SVE.

On Thu, 12 Apr 2018, 22:28 Eric Christopher via llvm-dev, <
llvm-dev at lists.llvm.org> wrote:
> I'm just going to add Kristof here since ARM is looking to add SVE here
> and this overlaps quite a bit with their goals.
>
> -eric
>
> On Wed, Apr 11, 2018 at 2:45 AM Robin Kruppe via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> RISC-V is an open and free instruction set architecture (ISA) used in
>> numerous domains in industry and research. The vector extension (short:
>> 'V') supplements the basic ISA with support for data parallel
computations.
>> This RFC sketches a strategy for targeting this instruction set
extension
>> in LLVM.
>>
>> Some but not all of what is proposed here has already been implemented
>> out of tree. It is explicitly not proposed to upstream any of this yet:
the
>> vector extension is still evolving (though the core concepts are
reasonably
>> stable), and the implementation is currently very much prototype
quality.
>> Nevertheless, I want to kick off a discussion about this with the LLVM
>> community now to make sure I'm on the right track and to make the
eventual
>> upstreaming go more smoothly. In particular, a large and potentially
>> controversial part of the strategy is a proposal for extending LLVM IR
with
>> a new vector type.
>>
>> There is also much to be said about how to structure the code
generation
>> for this ISA. However, since that aspect somewhat simpler, largely
>> orthogonal and affects a smaller subset of the community, the details
will
>> be left to a future RFC.
>>
>> This RFC is intended to be self-contained, but interested readers can
>> learn more about the vector extension from Roger Espasa's talk at
the 7th
>> RISC-V workshop (slides [1], recording [2]). The draft specification is
>> also available as part of the RISC-V Instruction Set Manual [3], but
right
>> now it is unfortunately incomplete and in the process of being updated.
>>
>> I will also be at EuroLLVM with a lightning talk and poster on this
>> subject, so if you're there as well, we can discuss in person.
>>
>> [1]
>>
https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf
>> [2] https://www.youtube.com/watch?v=GzZ-8bHsD5s
>> [3] https://github.com/riscv/riscv-isa-manual/
>>
>>
>> # Summary
>>
>> First-class support for the RISC-V vector ISA requires representing a
>> hardware vector length that is not just unknown at compile time, but
also
>> changes during execution. This in turn places some restrictions on code
>> motion: the vector length must not change while any vector values are
live.
>> This RFC proposes to add a new vector type to LLVM IR for this purpose.
>> Simply put, its length is tied to the surrounding function and the
existing
>> `token` type is leveraged to tell optimization passes that certain
vector
>> operations must remain together (i.e., in the same function).
>>
>>
>> # Background
>>
>> The RISC-V vector extension has many interesting properties. This RFC
is
>> not the right place to talk about it in detail, but this section will
>> briefly introduce the aspect that is most difficult to support in LLVM
IR,
>> and which is consequently the focus of this RFC: the runtime-variable
>> vector length.
>>
>> The number of elements in a vector register is determined by the
>> microarchitecture. Software uses strip-mined loops to transparently
process
>> as many elements per iteration as the hardware can support. But even
beyond
>> that, the vector length can vary during the execution of a program:
>> different kernels may *configure* the vector unit differently depending
on
>> their needs, leading to different parts of the program having
>> differently-sized vector registers.
>>
>> The vector length being determined by the microarchitecture is similar
to
>> Arm's Scalable Vector Extension (SVE), for which support is being
>> upstreamed at the moment. However, in SVE the vector length is fixed
once a
>> program starts running, while full use of the RISC-V vector extension
will
>> lead to the vector length changing regularly during execution. It's
>> possible to maintain the same configuration -- and therefore the same
>> vector length -- throughout the entire program, but this will often
perform
>> worse than a tailored configuration.
>>
>> ## Maximum vs active vs application vector length
>>
>> The V ISA has two notions of vector length: the *maximum* vector length
>> (called MVL), which describes the number of elements in each vector
>> register, and the *active* vector length (called vl), which limits how
many
>> of those elements are actually processed by each vector instruction.
>>
>> The latter is used to express loops of any application-specified length
>> with a single copy of the loop body. Instead of handling the tail
>> iterations not divisible by MVL separately with scalar code, the active
>> length length is set up so that the last few iterations process as many
>> elements as are left to process.
>>
>> The effect of the active vector length is similar to a mask of the form
>> `<true, ..., true, false, ..., false>`, aside from the scalar
control logic
>> that sets and maintains the active vector length. Thus it can be
modeled in
>> IR with judicious use of intrinsics and masking. This still allows also
>> having a single loop body in IR, without introducing new IR concepts in
>> addition to those already needed for the variable MVL.
>>
>> Thus, the rest of this RFC focuses on handling the MVL: all references
to
>> "vector length" from this point on should be taken to refer
to the MVL, not
>> the active vector length.
>>
>>
>> # Scope of the support
>>
>> To preempt misunderstandings, this section outlines what is meant and
not
>> meant by "support for the vector extension".
>>
>> ## Variable vector length
>>
>> There *is* an option to entirely avoid the concept of the vector length
>> changing during execution. Keeping the same vector unit configuration
>> throughout the entire program execution also leads to the vector length
>> being fixed once the program starts executing. In this case, compiler
>> support works out rather rather similar to support for Arm SVE, with
the
>> biggest difference being that vectors lengths are not multiples of 128
bit
>> (which legalization can paper over). Indeed, no IR changes beyond those
>> proposed for SVE support appear to be necessary to implement this
approach
>> to RISC-V vector extension support.
>>
>> However, this approach is wasteful, as a tailored configuration can
>> improve performance and energy efficiency significantly. As one data
point,
>> the Hwacha project reported [4] up to 9.5% fewer cycles taken and up to
11%
>> less energy consumed on a microarchitecture built to exploit narrower
bit
>> widths of vector elements (comparing "mvp, packed: yes" to
"mvp, packed:
>> no"). Besides such microarchitectural optimizations, enabling
fewer
>> registers can improve context switch times because fewer registers need
to
>> be saved, and being more flexible in how registers can be used (in
>> particular, how many are reserved for scalar values) aids register
>> allocation.
>>
>> Thus, restricting programs to a single configuration may be a good
first
>> step to get things up and running, but ultimately support for
>> runtime-varying vector lengths is desired to make the most of hardware
>> capabilities.
>>
>> [4] "A Case for MVPs: Mixed-Precision Vector Processors",
Albert Ou, Quan
>> Nguyen, Yunsup Lee, Krste Asanović,
>> http://hwacha.org/papers/hwacha-mvp-prism2014.pdf
>>
>>
>> ## Producing vector code
>>
>> It is intended that vector code is primarily produced via loop
>> vectorization and other IR-level auto-vectorizers (e.g., the region
>> vectorizer), not written by hand. Supporting loop vectorization is of
>> highest priority. The groundwork for loop vectorization should be
useful
>> for other kinds of automatic vectorization as well, but loop
vectorization
>> will be implemented first.
>>
>> It's not required or expected that the stock loop vectorizer can
generate
>> RISC-V vector code from the start. Considering the many significant
>> differences to the packed-SIMD architectures the loop vectorizer is
>> tailored to, it's quite likely that some experimentation in this
space is
>> required (e.g., building on VPlan and writing custom recipes). Of
course,
>> in the long term there should be as much code sharing as possible.
>>
>> Support for hand-written vector code va source-language-level
intrinsics
>> (as opposed to inline assembly) would be nice to have and probably
falls
>> out for free, but is rather low priority.
>>
>>
>> ## No vector unit configuration in IR
>>
>> While configuring the vector unit is an essential part of compiling for
>> the V ISA, it has no place in LLVM IR. Vectors should be regular SSA
values
>> that don't reference any extra state other than (by necessity) the
vector
>> length. Deciding how to configure the vector unit for a given piece of
code
>> is target-dependent and intertwined with register allocation, and will
>> therefore be left to the backend.
>>
>> # Challenge: Code motion around vector length changes
>>
>> When the vector length can change during execution, there are implicit
>> dependencies between vector operations and points in the program where
the
>> vector length may change. These dependencies must be taken into account
>> somehow, or else code motion passes could move vector operations across
>> vector length changes, effectively changing program semantics. For
example,
>> it's nonsensical to compute a vector value `%v1` with one vector
length,
>> change the vector length, and then compute another vector `%v2` with
the
>> new vector length and add it to `%v1`. This makes no more sense than
adding
>> `<4 x i32>` to `<8 x i32>`, yet it could happen if an input
program has a
>> vector length change *after* these vector calculations and optimization
>> passes are not aware of the impact of the vector length change on those
>> calculations.
>>
>> Crucially, the vector length changes when calling and returning from
>> functions in most calling conventions. Functions that don't
specifically
>> use a vectorcall ABI configure the vector unit for their own use when
>> called, rather than using configuration set up by the caller.
Therefore,
>> caller and callee will generally have different vector lengths, and
moving
>> vector operations from the caller into the callee or vice versa tends
to
>> break programs.
>>
>> However, note that the precise value of the vector length doesn't
really
>> matter -- software is supposed to be *vector length agnostic*.
Completely
>> inlining a function is perfectly fine, for example. What matters is
that
>> the vector length doesn't change *during* vector computations,
i.e., while
>> any vector values are live (either as SSA values, or in memory!). Thus,
>> there is no need to support and track vector length changes at
instruction
>> granularity. It's enough to coarsely enforce that the vector length
remain
>> constant throughout a larger code region (say, a loop nest, or a
function).
>>
>>
>> # Runtime-varying vector length in the IR
>>
>> This is achieved by simply declaring "by fiat" that the
vector length is
>> determined on function entry and remain constant for the rest of the
>> function execution. Other functions and other calls to the same
function
>> may observe a different vector length, but within one call to a given
>> function, the vector length is fixed. That is not precisely how the
>> hardware works, but it is a contract the backend can uphold easily
(more on
>> this later) and it allows piggy-backing on existing IR constructs (the
>> `token` type) to communicate restrictions to optimization passes.
>> Nevertheless, some additions to IR are required: a new first class
type, a
>> new instruction, and a new operand for some existing instructions.
>>
>> ## IR semantics
>>
>> Every time a function is called, a positive integer called the *dynamic
>> vector length* is determined in an unspecified way. The dynamic vector
>> length can differ not only between different functions, but also
between
>> different calls to the same function. The exception is that functions
with
>> the `inherits_vlen` attribute get the same dynamic vector length as
their
>> caller (Note: this attribute is a straw man, target-specific calling
>> conventions may work better for this purpose).
>>
>> A new instruction `vlentoken` is added, which has no operands and is of
>> type `token`. This token represents the dynamic vector length of the
>> function execution it is in. There can only be one such instruction per
>> function (this is inconsequential to the operational semantics, but it
>> simplifies IR passes).
>>
>> A new kind of type is added, the *dynamic-length vector*, written `<
vlen
>> x <element type> >`. It represents a vector with a number of
elements equal
>> to the dynamic vector length. Like fixed-length and scalable vectors,
these
>> vectors can only contain integer, float and pointer elements.
>>
>> A use of a `vlentoken` (representing a dynamic vector length) is
attached
>> to all operations that care for the dynamic vector length. That is,
*every*
>> instruction that handles dynamic-length vectors or is impacted by their
>> length receives the respective function's `vlentoken` as extra
operand, and
>> operates on a number of elements equal to the corresponding dynamic
vector
>> length.
>>
>> `< vlen x <element type> >` is a first-class type and
supports most
>> common instructions (details below), but cannot be used as function
>> argument or function return type unless the `inherits_vlen` attribute
is
>> applied to the callee.
>>
>> ## Rationale / "why this works"
>>
>> Although the vector length is a property of vector *values*, tracking
the
>> dynamic vector length at the type level would require a "separate
type" for
>> each call to any function. It's much more feasible to attach the
vector
>> length to the *operations* instead. This works out because SSA values
are
>> function-local (so all operation on them agree on the vector length by
>> definition) with the exception of function arguments and return values.
>> Consequently, dynamic-length vectors in function signatures are
disallowed
>> unless the `inherits_vlen` attribute ensured caller and callee have the
>> same dynamic vector length.
>>
>> The `vlentoken` token ensures that all operations that start out in the
>> same function must remain in the same function while the code is
>> transformed (recall that tokens cannot be passed to or returned from
>> non-intrinsic functions). That's why it is important that
`vlentoken` is a
>> token, not simply an integer as one might expect. In other words,
>> `vlentoken` does not give the program access to the dynamic vector
length,
>> it communicates a restriction to the optimizer.
>>
>> ## More details
>>
>> The `< vlen x <element type> >` type is separate from the
existing vector
>> types. Instructions for fixed-length vectors (elementwise arithmetic,
>> `insertelement`, `select` with a vector of `i1`s, etc.) are not
extended to
>> this new type, at least not in this RFC. It's a possible future
extension,
>> but for now, target-specific intrinsics work fine for those operations.
>>
>> The following operations on dynamic-length vectors *are* supported:
>>
>> - `phi`
>> - `load` and `store`
>> - `alloca` (at least of a single vector; the `alloca <ty>,
<ty>
>> <NumElements>` form ties into the open question about aggregates
and GEPs
>> below)
>> - `select` (with `i1` condition)
>> - Argument passing and return values (`call`, `invoke`, `ret`) for
>> functions with `inherits_vlen`
>>
>> All of these instructions (including phi) have an additional operand of
>> type `token` if and only if they operate on a vector of dynamic length.
In
>> textual IR, one appends `, vlen <the token>` to the instruction,
for
>> example:
>>
>>     %0 = vlentoken
>>     %ptr = alloca <vlen x i32>, vlen %0
>>     %v = call <vlen x i32> @foo(), vlen %0
>>     store <vlen x i32> %v, <vlen x i32>* %ptr, vlen %0
>>
>> Open question: should GEPs and aggregates involving dynamic-length
>> vectors work? This RFC errs on the side of simplicity and excludes them
>> (they're non-trivial to implement and not needed for strip-mined
loops) but
>> if desired, they could be supported.
>>
>> There are no constants of dynamic-length-vector type except
>> `zeroinitializer` and `undef` (resp. `poison` once that is adopted). In
>> particular, there is no equivalent to fixed-length vector constants
(`<ty
>> elem1, ty elem2, ...>`). Dynamic-length vectors also cannot be
stored in
>> globals.
>>
>> ## Impact on optimizations
>>
>> The semantics imply several restrictions on optimizations, but these
are
>> mostly encoded with existing IR constructs -- chiefly, the `token` type
>> that ties all vector operations to a `vlentoken`. For example, because
>> token values cannot be passed to (non-intrinsic) functions or returned
from
>> them, no special pleading is needed to keep an outliner or partial
inliner
>> from spreading vector operations across multiple functions -- correct
>> passes already don't do that when tokens are involved. Passes do,
however,
>> need to be updated in two respects.
>>
>> First, the new token operand needs to be respected when comparing two
>> instructions, creating new instructions, etc. -- this is an inherent
>> downside of adding new operands, but also rather mechanical. The rule
that
>> there is only one `vlentoken` per function makes this even more
mechanical
>> than usual, because all instruction within one function have the same
>> vector length token. This means that one does not need to consider them
>> when comparing instructions from the same function, and it's always
clear
>> which token should be used when creating a new instruction.
>>
>> Second, the very same rule of only one `vlentoken` per function must be
>> respected during interprocedural code motion. For example, inlining
can't
>> just copy the `vlentoken` from the callee into the caller.
>>
>> However, note that it's valid to *merge* the caller's and
callee's
>> `vlentoken` instructions. Because the semantics state that each call to
a
>> function can get a different dynamic vector length, merging
`vlentoken`s
>> *refines* the program's behavior by picking the possible execution
where
>> the callee "happens to" get the same vector length as the
caller in the
>> inlined calls. So inlining can simply replace all `vlentoken` tokens in
the
>> inlined code with the `vlentoken` token of the callee. Other passes are
>> likely similarly easy to update (and in the worst case, they can just
bail
>> out when seeing dynamic-length vectors).
>>
>> ## Impact on backends
>>
>> Unsurprisingly, the IR types `<vlen x <element type> >`
come with
>> associated MVTs. There's also a new SelectionDAG node `VLENTOKEN`
to mirror
>> the `vlentoken` IR instruction (and presumably `G_VLENTOKEN` in GMIR
for
>> GlobalISel).
>>
>> Backends other than RISC-V can legalize these MVTs and the `VLENTOKEN`
>> node very easily, even if in practice there currently aren't many
useful
>> operations on these vectors without target-specific intrinsics.
>> Specifically, `< vlen x <element type> >` can be legalized
as `< n x
>> <element type> >` or even `< scalable n x <element
type> >` (the vector
>> type for Arm SVE) for any fixed `n`. All the complications stemming
from
>> the runtime-varying vector length go away, and the `vlentoken` node can
>> simply be dropped on the floor.
>>
>> That leaves targets with an actual runtime-varying vector length in
>> hardware, i.e., RISCV with the V feature enabled. As stated in the
>> introduction, this RFC does not cover the backend changes in detail,
but to
>> give you a rough idea, here's a sketch. Keep in mind (especially if
you're
>> familiar with V) that this is glossing over everything not directly
related
>> to the proposed IR type (particularly the "polymorphic instruction
set"
>> aspect of the register configuration).
>>
>> As described earlier, the vector length in RISC-V is completely
>> determined by the vector unit configuration. Therefore, vector
operations
>> in Machine IR have an implicit use of the configuration registers. This
is
>> the moral equivalent of the `vlentoken` token operand, but more precise
>> (and MIR doesn't have an equivalent of the token type anyway). To
complete
>> the picture, all operations that change the configuration are made
explicit.
>>
>> Because only virtual registers can be live across basic block
boundaries
>> before register allocation, this may require a dummy register class
with
>> only a single physical register, or something similarly inelegant.
>>
>> With a way to precisely represent vector length changes in hand, the
>> backend just needs to ensure it implements the semantics of `< vlen
x
>> <element type> >` described earlier. This is achieved by
configuring the
>> vector unit "in the prologue", and then not doing anything
that might
>> change the vector length inside the function. This setup is effectively
in
>> the prologue (i.e., before any user code) but not actually inserted
during
>> the "prologue/epilogue insertion" pass, which runs far too
late for this
>> purpose.
>>
>> For scenarios like two entirely separate vectorized loops within one
>> function, it might be useful to drastically change the vector unit
>> configuration in the middle of a function. This could be implemented as
an
>> optimization (e.g., a pre-RA machine function), but it's all
hypothetical
>> so far.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180412/01294b74/attachment-0001.html>

Graham Hunter via llvm-dev

2018-Apr-13 12:37 UTC

head link

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

Hi,

Nice to see another group tackling length agnostic vectorization :)

I'm still reading through all the details, but I do have one initial
question related to the vector type; why not use the same mechanism Arm is
proposing for SVE?

We chose the format "<n x m x <ty>>" to represent two
crucial concepts: arbitrary length vectors with the same number of elements, or
with the same number of bits. From what I gather, you're just interested in
the former, so is there a good reason "<n x 1 x <ty>>"
wouldn't work for your use case? The 'n' is just a term to indicate
it's a multiple only known at runtime; you've used 'vlen', and
there are other suggested names, but there's nothing to tie it to an exact
size -- if your dynamic vlen changes with each function, the
'n'/'vlen' just represents the current value.

I'll continue looking through this, and will have more feedback next week.

-Graham

On 11/04/2018, 10:45, "llvm-dev on behalf of Robin Kruppe via
llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of llvm-dev at
lists.llvm.org> wrote:

RISC-V is an open and free instruction set architecture (ISA) used in
numerous domains in industry and research. The vector extension (short:
'V') supplements the basic ISA with support for data parallel
computations. This RFC sketches a strategy
for targeting this instruction set extension in LLVM.

Some but not all of what is proposed here has already been implemented out
of tree. It is explicitly not proposed to upstream any of this yet: the vector
extension is still evolving (though the core concepts are reasonably stable),
and the implementation is
currently very much prototype quality. Nevertheless, I want to kick off a
discussion about this with the LLVM community now to make sure I'm on the
right track and to make the eventual upstreaming go more smoothly. In
particular, a large and potentially controversial
part of the strategy is a proposal for extending LLVM IR with a new vector
type.

There is also much to be said about how to structure the code generation for
this ISA. However, since that aspect somewhat simpler, largely orthogonal and
affects a smaller subset of the community, the details will be left to a future
RFC.

This RFC is intended to be self-contained, but interested readers can learn
more about the vector extension from Roger Espasa's talk at the 7th RISC-V
workshop (slides [1], recording [2]). The draft specification is also available
as part of the RISC-V Instruction
Set Manual [3], but right now it is unfortunately incomplete and in the
process of being updated.

I will also be at EuroLLVM with a lightning talk and poster on this subject,
so if you're there as well, we can discuss in person.

[1]

https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf
<https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
[2] https://www.youtube.com/watch?v=GzZ-8bHsD5s
[3] https://github.com/riscv/riscv-isa-manual/

# Summary

First-class support for the RISC-V vector ISA requires representing a
hardware vector length that is not just unknown at compile time, but also
changes during execution. This in turn places some restrictions on code motion:
the vector length must not change
while any vector values are live. This RFC proposes to add a new vector
type to LLVM IR for this purpose. Simply put, its length is tied to the
surrounding function and the existing `token` type is leveraged to tell
optimization passes that certain vector
operations must remain together (i.e., in the same function).

# Background

The RISC-V vector extension has many interesting properties. This RFC is not
the right place to talk about it in detail, but this section will briefly
introduce the aspect that is most difficult to support in LLVM IR, and which is
consequently the focus of
this RFC: the runtime-variable vector length.

The number of elements in a vector register is determined by the
microarchitecture. Software uses strip-mined loops to transparently process as
many elements per iteration as the hardware can support. But even beyond that,
the vector length can vary during
the execution of a program: different kernels may *configure* the vector
unit differently depending on their needs, leading to different parts of the
program having differently-sized vector registers.

The vector length being determined by the microarchitecture is similar to
Arm's Scalable Vector Extension (SVE), for which support is being upstreamed
at the moment. However, in SVE the vector length is fixed once a program starts
running, while full use of
the RISC-V vector extension will lead to the vector length changing
regularly during execution. It's possible to maintain the same configuration
-- and therefore the same vector length -- throughout the entire program, but
this will often perform worse than
a tailored configuration.

## Maximum vs active vs application vector length

The V ISA has two notions of vector length: the *maximum* vector length
(called MVL), which describes the number of elements in each vector register,
and the *active* vector length (called vl), which limits how many of those
elements are actually processed
by each vector instruction.

The latter is used to express loops of any application-specified length with
a single copy of the loop body. Instead of handling the tail iterations not
divisible by MVL separately with scalar code, the active length length is set up
so that the last few iterations
process as many elements as are left to process.

The effect of the active vector length is similar to a mask of the form
`<true, ..., true, false, ..., false>`, aside from the scalar control
logic that sets and maintains the active vector length. Thus it can be modeled
in IR with judicious use of intrinsics
and masking. This still allows also having a single loop body in IR,
without introducing new IR concepts in addition to those already needed for the
variable MVL.

Thus, the rest of this RFC focuses on handling the MVL: all references to
"vector length" from this point on should be taken to refer to the
MVL, not the active vector length.

# Scope of the support

To preempt misunderstandings, this section outlines what is meant and not
meant by "support for the vector extension".

## Variable vector length

There *is* an option to entirely avoid the concept of the vector length
changing during execution. Keeping the same vector unit configuration throughout
the entire program execution also leads to the vector length being fixed once
the program starts executing.
In this case, compiler support works out rather rather similar to support
for Arm SVE, with the biggest difference being that vectors lengths are not
multiples of 128 bit (which legalization can paper over). Indeed, no IR changes
beyond those proposed for
SVE support appear to be necessary to implement this approach to RISC-V
vector extension support.

However, this approach is wasteful, as a tailored configuration can improve
performance and energy efficiency significantly. As one data point, the Hwacha
project reported [4] up to 9.5% fewer cycles taken and up to 11% less energy
consumed on a microarchitecture
built to exploit narrower bit widths of vector elements (comparing
"mvp, packed: yes" to "mvp, packed: no"). Besides such
microarchitectural optimizations, enabling fewer registers can improve context
switch times because fewer registers need to be saved,
and being more flexible in how registers can be used (in particular, how
many are reserved for scalar values) aids register allocation.

Thus, restricting programs to a single configuration may be a good first
step to get things up and running, but ultimately support for runtime-varying
vector lengths is desired to make the most of hardware capabilities.

[4] "A Case for MVPs: Mixed-Precision Vector Processors", Albert
Ou, Quan Nguyen, Yunsup Lee, Krste Asanović,
http://hwacha.org/papers/hwacha-mvp-prism2014.pdf

## Producing vector code

It is intended that vector code is primarily produced via loop vectorization
and other IR-level auto-vectorizers (e.g., the region vectorizer), not written
by hand. Supporting loop vectorization is of highest priority. The groundwork
for loop vectorization
should be useful for other kinds of automatic vectorization as well, but
loop vectorization will be implemented first.

It's not required or expected that the stock loop vectorizer can
generate RISC-V vector code from the start. Considering the many significant
differences to the packed-SIMD architectures the loop vectorizer is tailored to,
it's quite likely that some experimentation
in this space is required (e.g., building on VPlan and writing custom
recipes). Of course, in the long term there should be as much code sharing as
possible.

Support for hand-written vector code va source-language-level intrinsics (as
opposed to inline assembly) would be nice to have and probably falls out for
free, but is rather low priority.

## No vector unit configuration in IR

While configuring the vector unit is an essential part of compiling for the
V ISA, it has no place in LLVM IR. Vectors should be regular SSA values that
don't reference any extra state other than (by necessity) the vector length.
Deciding how to configure the
vector unit for a given piece of code is target-dependent and intertwined
with register allocation, and will therefore be left to the backend.

# Challenge: Code motion around vector length changes

When the vector length can change during execution, there are implicit
dependencies between vector operations and points in the program where the
vector length may change. These dependencies must be taken into account somehow,
or else code motion passes could
move vector operations across vector length changes, effectively changing
program semantics. For example, it's nonsensical to compute a vector value
`%v1` with one vector length, change the vector length, and then compute another
vector `%v2` with the new
vector length and add it to `%v1`. This makes no more sense than adding
`<4 x i32>` to `<8 x i32>`, yet it could happen if an input program
has a vector length change *after* these vector calculations and optimization
passes are not aware of the impact of
the vector length change on those calculations.

Crucially, the vector length changes when calling and returning from
functions in most calling conventions. Functions that don't specifically use
a vectorcall ABI configure the vector unit for their own use when called, rather
than using configuration set up
by the caller. Therefore, caller and callee will generally have different
vector lengths, and moving vector operations from the caller into the callee or
vice versa tends to break programs.

However, note that the precise value of the vector length doesn't really
matter -- software is supposed to be *vector length agnostic*. Completely
inlining a function is perfectly fine, for example. What matters is that the
vector length doesn't change *during*
vector computations, i.e., while any vector values are live (either as SSA
values, or in memory!). Thus, there is no need to support and track vector
length changes at instruction granularity. It's enough to coarsely enforce
that the vector length remain constant
throughout a larger code region (say, a loop nest, or a function).

# Runtime-varying vector length in the IR

This is achieved by simply declaring "by fiat" that the vector
length is determined on function entry and remain constant for the rest of the
function execution. Other functions and other calls to the same function may
observe a different vector length, but
within one call to a given function, the vector length is fixed. That is
not precisely how the hardware works, but it is a contract the backend can
uphold easily (more on this later) and it allows piggy-backing on existing IR
constructs (the `token` type)
to communicate restrictions to optimization passes. Nevertheless, some
additions to IR are required: a new first class type, a new instruction, and a
new operand for some existing instructions.

## IR semantics

Every time a function is called, a positive integer called the *dynamic
vector length* is determined in an unspecified way. The dynamic vector length
can differ not only between different functions, but also between different
calls to the same function. The
exception is that functions with the `inherits_vlen` attribute get the same
dynamic vector length as their caller (Note: this attribute is a straw man,
target-specific calling conventions may work better for this purpose).

A new instruction `vlentoken` is added, which has no operands and is of type
`token`. This token represents the dynamic vector length of the function
execution it is in. There can only be one such instruction per function (this is
inconsequential to the operational
semantics, but it simplifies IR passes).

A new kind of type is added, the *dynamic-length vector*, written `< vlen
x <element type> >`. It represents a vector with a number of elements
equal to the dynamic vector length. Like fixed-length and scalable vectors,
these vectors can only contain integer,
float and pointer elements.

A use of a `vlentoken` (representing a dynamic vector length) is attached to
all operations that care for the dynamic vector length. That is, *every*
instruction that handles dynamic-length vectors or is impacted by their length
receives the respective function's
`vlentoken` as extra operand, and operates on a number of elements equal to
the corresponding dynamic vector length.

`< vlen x <element type> >` is a first-class type and supports
most common instructions (details below), but cannot be used as function
argument or function return type unless the `inherits_vlen` attribute is applied
to the callee.

## Rationale / "why this works"

Although the vector length is a property of vector *values*, tracking the
dynamic vector length at the type level would require a "separate
type" for each call to any function. It's much more feasible to attach
the vector length to the *operations* instead.
This works out because SSA values are function-local (so all operation on
them agree on the vector length by definition) with the exception of function
arguments and return values. Consequently, dynamic-length vectors in function
signatures are disallowed
unless the `inherits_vlen` attribute ensured caller and callee have the
same dynamic vector length.

The `vlentoken` token ensures that all operations that start out in the same
function must remain in the same function while the code is transformed (recall
that tokens cannot be passed to or returned from non-intrinsic functions).
That's why it is important
that `vlentoken` is a token, not simply an integer as one might expect. In
other words, `vlentoken` does not give the program access to the dynamic vector
length, it communicates a restriction to the optimizer.

## More details

The `< vlen x <element type> >` type is separate from the
existing vector types. Instructions for fixed-length vectors (elementwise
arithmetic, `insertelement`, `select` with a vector of `i1`s, etc.) are not
extended to this new type, at least not in this RFC.
It's a possible future extension, but for now, target-specific
intrinsics work fine for those operations.

The following operations on dynamic-length vectors *are* supported:

- `phi`
- `load` and `store`
- `alloca` (at least of a single vector; the `alloca <ty>, <ty>
<NumElements>` form ties into the open question about aggregates and GEPs
below)
- `select` (with `i1` condition)
- Argument passing and return values (`call`, `invoke`, `ret`) for functions
with `inherits_vlen`

All of these instructions (including phi) have an additional operand of type
`token` if and only if they operate on a vector of dynamic length. In textual
IR, one appends `, vlen <the token>` to the instruction, for example:

%0 = vlentoken
%ptr = alloca <vlen x i32>, vlen %0
%v = call <vlen x i32> @foo(), vlen %0
store <vlen x i32> %v, <vlen x i32>* %ptr, vlen %0

Open question: should GEPs and aggregates involving dynamic-length vectors
work? This RFC errs on the side of simplicity and excludes them (they're
non-trivial to implement and not needed for strip-mined loops) but if desired,
they could be supported.

There are no constants of dynamic-length-vector type except
`zeroinitializer` and `undef` (resp. `poison` once that is adopted). In
particular, there is no equivalent to fixed-length vector constants (`<ty
elem1, ty elem2, ...>`). Dynamic-length vectors also
cannot be stored in globals.

## Impact on optimizations

The semantics imply several restrictions on optimizations, but these are
mostly encoded with existing IR constructs -- chiefly, the `token` type that
ties all vector operations to a `vlentoken`. For example, because token values
cannot be passed to (non-intrinsic)
functions or returned from them, no special pleading is needed to keep an
outliner or partial inliner from spreading vector operations across multiple
functions -- correct passes already don't do that when tokens are involved.
Passes do, however, need to be
updated in two respects.

First, the new token operand needs to be respected when comparing two
instructions, creating new instructions, etc. -- this is an inherent downside of
adding new operands, but also rather mechanical. The rule that there is only one
`vlentoken` per function
makes this even more mechanical than usual, because all instruction within
one function have the same vector length token. This means that one does not
need to consider them when comparing instructions from the same function, and
it's always clear which token
should be used when creating a new instruction.

Second, the very same rule of only one `vlentoken` per function must be
respected during interprocedural code motion. For example, inlining can't
just copy the `vlentoken` from the callee into the caller.

However, note that it's valid to *merge* the caller's and
callee's `vlentoken` instructions. Because the semantics state that each
call to a function can get a different dynamic vector length, merging
`vlentoken`s *refines* the program's behavior by picking
the possible execution where the callee "happens to" get the same
vector length as the caller in the inlined calls. So inlining can simply replace
all `vlentoken` tokens in the inlined code with the `vlentoken` token of the
callee. Other passes are likely
similarly easy to update (and in the worst case, they can just bail out
when seeing dynamic-length vectors).

## Impact on backends

Unsurprisingly, the IR types `<vlen x <element type> >` come
with associated MVTs. There's also a new SelectionDAG node `VLENTOKEN` to
mirror the `vlentoken` IR instruction (and presumably `G_VLENTOKEN` in GMIR for
GlobalISel).

Backends other than RISC-V can legalize these MVTs and the `VLENTOKEN` node
very easily, even if in practice there currently aren't many useful
operations on these vectors without target-specific intrinsics. Specifically,
`< vlen x <element type> >` can be
legalized as `< n x <element type> >` or even `< scalable n
x <element type> >` (the vector type for Arm SVE) for any fixed `n`.
All the complications stemming from the runtime-varying vector length go away,
and the `vlentoken` node can simply be dropped on
the floor.

That leaves targets with an actual runtime-varying vector length in
hardware, i.e., RISCV with the V feature enabled. As stated in the introduction,
this RFC does not cover the backend changes in detail, but to give you a rough
idea, here's a sketch. Keep in
mind (especially if you're familiar with V) that this is glossing over
everything not directly related to the proposed IR type (particularly the
"polymorphic instruction set" aspect of the register configuration).

As described earlier, the vector length in RISC-V is completely determined
by the vector unit configuration. Therefore, vector operations in Machine IR
have an implicit use of the configuration registers. This is the moral
equivalent of the `vlentoken` token
operand, but more precise (and MIR doesn't have an equivalent of the
token type anyway). To complete the picture, all operations that change the
configuration are made explicit.

Because only virtual registers can be live across basic block boundaries
before register allocation, this may require a dummy register class with only a
single physical register, or something similarly inelegant.

With a way to precisely represent vector length changes in hand, the backend
just needs to ensure it implements the semantics of `< vlen x <element
type> >` described earlier. This is achieved by configuring the vector
unit "in the prologue", and then not doing
anything that might change the vector length inside the function. This
setup is effectively in the prologue (i.e., before any user code) but not
actually inserted during the "prologue/epilogue insertion" pass, which
runs far too late for this purpose.

For scenarios like two entirely separate vectorized loops within one
function, it might be useful to drastically change the vector unit configuration
in the middle of a function. This could be implemented as an optimization (e.g.,
a pre-RA machine function),
but it's all hypothetical so far.

Robin Kruppe via llvm-dev

2018-Apr-13 14:53 UTC

head link

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

On 13 April 2018 at 16:52, Robin Kruppe <robin.kruppe at gmail.com> wrote:
> On 13 April 2018 at 14:37, Graham Hunter <Graham.Hunter at arm.com>
wrote:
>
>> Hi,
>>
>> Nice to see another group tackling length agnostic vectorization :)
>>
>> I'm still reading through all the details, but I do have one
initial
>> question related to the vector type; why not use the same mechanism Arm
is
>> proposing for SVE?
>>
>> We chose the format "<n x m x <ty>>" to represent
two crucial concepts:
>> arbitrary length vectors with the same number of elements, or with the
same
>> number of bits. From what I gather, you're just interested in the
former,
>> so is there a good reason "<n x 1 x <ty>>"
wouldn't work for your use case?
>> The 'n' is just a term to indicate it's a multiple only
known at runtime;
>> you've used 'vlen', and there are other suggested names,
but there's
>> nothing to tie it to an exact size -- if your dynamic vlen changes with
>> each function, the 'n'/'vlen' just represents the
current value.
>>
>
> If the vector length changes at run time, some optimizations that
> otherwise make perfect sense are not legal, particularly outlining --
> details are in the RFC text. If loss of those optimizations is OK for SVE,
> we could talk about unifying the two vector types. Alternatively, I believe
> we could generalize the scalable vector type and my proposal, e.g., by
> adding my proposed "vlentoken" to instructions on scalable
vectors and
> encode the "unknown, but fixed at program load time" vectorlength
(i.e.,
> SVE) by using `token none` as vector length token.
>
> For the initial proposal, I did not want to impose either the loss in
> optimization power or the additional complexity on scalable vectors, but if
> you folks are fine with one of those options, I'd be very glad to do so
and
> reduce the number of types added.
>
> Thanks,
>
> Robin
>
> I'll continue looking through this, and will have more feedback next
week.
>>
>> -Graham
>>
>> On 11/04/2018, 10:45, "llvm-dev on behalf of Robin Kruppe via
llvm-dev" <
>> llvm-dev-bounces at lists.llvm.org on behalf of llvm-dev at
lists.llvm.org>
>> wrote:
>>
>>     RISC-V is an open and free instruction set architecture (ISA) used
in
>> numerous domains in industry and research. The vector extension (short:
>> 'V') supplements the basic ISA with support for data parallel
computations.
>> This RFC sketches a strategy
>>      for targeting this instruction set extension in LLVM.
>>
>>     Some but not all of what is proposed here has already been
>> implemented out of tree. It is explicitly not proposed to upstream any
of
>> this yet: the vector extension is still evolving (though the core
concepts
>> are reasonably stable), and the implementation is
>>      currently very much prototype quality. Nevertheless, I want to
kick
>> off a discussion about this with the LLVM community now to make sure
I'm on
>> the right track and to make the eventual upstreaming go more smoothly.
In
>> particular, a large and potentially controversial
>>      part of the strategy is a proposal for extending LLVM IR with a
new
>> vector type.
>>
>>     There is also much to be said about how to structure the code
>> generation for this ISA. However, since that aspect somewhat simpler,
>> largely orthogonal and affects a smaller subset of the community, the
>> details will be left to a future RFC.
>>
>>     This RFC is intended to be self-contained, but interested readers
can
>> learn more about the vector extension from Roger Espasa's talk at
the 7th
>> RISC-V workshop (slides [1], recording [2]). The draft specification is
>> also available as part of the RISC-V Instruction
>>      Set Manual [3], but right now it is unfortunately incomplete and
in
>> the process of being updated.
>>
>>     I will also be at EuroLLVM with a lightning talk and poster on this
>> subject, so if you're there as well, we can discuss in person.
>>
>>     [1]
>>     https://content.riscv.org/wp-content/uploads/2017/12/Wed-133
>> 0-RISCVRogerEspasaVEXT-v4.pdf <https://content.riscv.org/wp-
>> content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
>>     [2] https://www.youtube.com/watch?v=GzZ-8bHsD5s
>>     [3] https://github.com/riscv/riscv-isa-manual/
>>
>>
>>     # Summary
>>
>>     First-class support for the RISC-V vector ISA requires representing
a
>> hardware vector length that is not just unknown at compile time, but
also
>> changes during execution. This in turn places some restrictions on code
>> motion: the vector length must not change
>>      while any vector values are live. This RFC proposes to add a new
>> vector type to LLVM IR for this purpose. Simply put, its length is tied
to
>> the surrounding function and the existing `token` type is leveraged to
tell
>> optimization passes that certain vector
>>      operations must remain together (i.e., in the same function).
>>
>>
>>     # Background
>>
>>     The RISC-V vector extension has many interesting properties. This
RFC
>> is not the right place to talk about it in detail, but this section
will
>> briefly introduce the aspect that is most difficult to support in LLVM
IR,
>> and which is consequently the focus of
>>      this RFC: the runtime-variable vector length.
>>
>>     The number of elements in a vector register is determined by the
>> microarchitecture. Software uses strip-mined loops to transparently
process
>> as many elements per iteration as the hardware can support. But even
beyond
>> that, the vector length can vary during
>>      the execution of a program: different kernels may *configure* the
>> vector unit differently depending on their needs, leading to different
>> parts of the program having differently-sized vector registers.
>>
>>     The vector length being determined by the microarchitecture is
>> similar to Arm's Scalable Vector Extension (SVE), for which support
is
>> being upstreamed at the moment. However, in SVE the vector length is
fixed
>> once a program starts running, while full use of
>>      the RISC-V vector extension will lead to the vector length
changing
>> regularly during execution. It's possible to maintain the same
>> configuration -- and therefore the same vector length -- throughout the
>> entire program, but this will often perform worse than
>>      a tailored configuration.
>>
>>     ## Maximum vs active vs application vector length
>>
>>     The V ISA has two notions of vector length: the *maximum* vector
>> length (called MVL), which describes the number of elements in each
vector
>> register, and the *active* vector length (called vl), which limits how
many
>> of those elements are actually processed
>>      by each vector instruction.
>>
>>     The latter is used to express loops of any application-specified
>> length with a single copy of the loop body. Instead of handling the
tail
>> iterations not divisible by MVL separately with scalar code, the active
>> length length is set up so that the last few iterations
>>      process as many elements as are left to process.
>>
>>     The effect of the active vector length is similar to a mask of the
>> form `<true, ..., true, false, ..., false>`, aside from the
scalar control
>> logic that sets and maintains the active vector length. Thus it can be
>> modeled in IR with judicious use of intrinsics
>>      and masking. This still allows also having a single loop body in
IR,
>> without introducing new IR concepts in addition to those already needed
for
>> the variable MVL.
>>
>>     Thus, the rest of this RFC focuses on handling the MVL: all
>> references to "vector length" from this point on should be
taken to refer
>> to the MVL, not the active vector length.
>>
>>
>>     # Scope of the support
>>
>>     To preempt misunderstandings, this section outlines what is meant
and
>> not meant by "support for the vector extension".
>>
>>     ## Variable vector length
>>
>>     There *is* an option to entirely avoid the concept of the vector
>> length changing during execution. Keeping the same vector unit
>> configuration throughout the entire program execution also leads to the
>> vector length being fixed once the program starts executing.
>>      In this case, compiler support works out rather rather similar to
>> support for Arm SVE, with the biggest difference being that vectors
lengths
>> are not multiples of 128 bit (which legalization can paper over).
Indeed,
>> no IR changes beyond those proposed for
>>      SVE support appear to be necessary to implement this approach to
>> RISC-V vector extension support.
>>
>>     However, this approach is wasteful, as a tailored configuration can
>> improve performance and energy efficiency significantly. As one data
point,
>> the Hwacha project reported [4] up to 9.5% fewer cycles taken and up to
11%
>> less energy consumed on a microarchitecture
>>      built to exploit narrower bit widths of vector elements (comparing
>> "mvp, packed: yes" to "mvp, packed: no"). Besides
such microarchitectural
>> optimizations, enabling fewer registers can improve context switch
times
>> because fewer registers need to be saved,
>>      and being more flexible in how registers can be used (in
particular,
>> how many are reserved for scalar values) aids register allocation.
>>
>>     Thus, restricting programs to a single configuration may be a good
>> first step to get things up and running, but ultimately support for
>> runtime-varying vector lengths is desired to make the most of hardware
>> capabilities.
>>
>>     [4] "A Case for MVPs: Mixed-Precision Vector Processors",
Albert Ou,
>> Quan Nguyen, Yunsup Lee, Krste Asanović,
>>     http://hwacha.org/papers/hwacha-mvp-prism2014.pdf
>>
>>
>>     ## Producing vector code
>>
>>     It is intended that vector code is primarily produced via loop
>> vectorization and other IR-level auto-vectorizers (e.g., the region
>> vectorizer), not written by hand. Supporting loop vectorization is of
>> highest priority. The groundwork for loop vectorization
>>      should be useful for other kinds of automatic vectorization as
well,
>> but loop vectorization will be implemented first.
>>
>>     It's not required or expected that the stock loop vectorizer
can
>> generate RISC-V vector code from the start. Considering the many
>> significant differences to the packed-SIMD architectures the loop
>> vectorizer is tailored to, it's quite likely that some
experimentation
>>      in this space is required (e.g., building on VPlan and writing
>> custom recipes). Of course, in the long term there should be as much
code
>> sharing as possible.
>>
>>     Support for hand-written vector code va source-language-level
>> intrinsics (as opposed to inline assembly) would be nice to have and
>> probably falls out for free, but is rather low priority.
>>
>>
>>     ## No vector unit configuration in IR
>>
>>     While configuring the vector unit is an essential part of compiling
>> for the V ISA, it has no place in LLVM IR. Vectors should be regular
SSA
>> values that don't reference any extra state other than (by
necessity) the
>> vector length. Deciding how to configure the
>>      vector unit for a given piece of code is target-dependent and
>> intertwined with register allocation, and will therefore be left to the
>> backend.
>>
>>     # Challenge: Code motion around vector length changes
>>
>>     When the vector length can change during execution, there are
>> implicit dependencies between vector operations and points in the
program
>> where the vector length may change. These dependencies must be taken
into
>> account somehow, or else code motion passes could
>>      move vector operations across vector length changes, effectively
>> changing program semantics. For example, it's nonsensical to
compute a
>> vector value `%v1` with one vector length, change the vector length,
and
>> then compute another vector `%v2` with the new
>>      vector length and add it to `%v1`. This makes no more sense than
>> adding `<4 x i32>` to `<8 x i32>`, yet it could happen if
an input program
>> has a vector length change *after* these vector calculations and
>> optimization passes are not aware of the impact of
>>      the vector length change on those calculations.
>>
>>     Crucially, the vector length changes when calling and returning
from
>> functions in most calling conventions. Functions that don't
specifically
>> use a vectorcall ABI configure the vector unit for their own use when
>> called, rather than using configuration set up
>>      by the caller. Therefore, caller and callee will generally have
>> different vector lengths, and moving vector operations from the caller
into
>> the callee or vice versa tends to break programs.
>>
>>     However, note that the precise value of the vector length
doesn't
>> really matter -- software is supposed to be *vector length agnostic*.
>> Completely inlining a function is perfectly fine, for example. What
matters
>> is that the vector length doesn't change *during*
>>      vector computations, i.e., while any vector values are live
(either
>> as SSA values, or in memory!). Thus, there is no need to support and
track
>> vector length changes at instruction granularity. It's enough to
coarsely
>> enforce that the vector length remain constant
>>      throughout a larger code region (say, a loop nest, or a function).
>>
>>
>>     # Runtime-varying vector length in the IR
>>
>>     This is achieved by simply declaring "by fiat" that the
vector length
>> is determined on function entry and remain constant for the rest of the
>> function execution. Other functions and other calls to the same
function
>> may observe a different vector length, but
>>      within one call to a given function, the vector length is fixed.
>> That is not precisely how the hardware works, but it is a contract the
>> backend can uphold easily (more on this later) and it allows
piggy-backing
>> on existing IR constructs (the `token` type)
>>      to communicate restrictions to optimization passes. Nevertheless,
>> some additions to IR are required: a new first class type, a new
>> instruction, and a new operand for some existing instructions.
>>
>>     ## IR semantics
>>
>>     Every time a function is called, a positive integer called the
>> *dynamic vector length* is determined in an unspecified way. The
dynamic
>> vector length can differ not only between different functions, but also
>> between different calls to the same function. The
>>      exception is that functions with the `inherits_vlen` attribute get
>> the same dynamic vector length as their caller (Note: this attribute is
a
>> straw man, target-specific calling conventions may work better for this
>> purpose).
>>
>>     A new instruction `vlentoken` is added, which has no operands and
is
>> of type `token`. This token represents the dynamic vector length of the
>> function execution it is in. There can only be one such instruction per
>> function (this is inconsequential to the operational
>>      semantics, but it simplifies IR passes).
>>
>>     A new kind of type is added, the *dynamic-length vector*, written
`<
>> vlen x <element type> >`. It represents a vector with a number
of elements
>> equal to the dynamic vector length. Like fixed-length and scalable
vectors,
>> these vectors can only contain integer,
>>      float and pointer elements.
>>
>>     A use of a `vlentoken` (representing a dynamic vector length) is
>> attached to all operations that care for the dynamic vector length.
That
>> is, *every* instruction that handles dynamic-length vectors or is
impacted
>> by their length receives the respective function's
>>      `vlentoken` as extra operand, and operates on a number of elements
>> equal to the corresponding dynamic vector length.
>>
>>     `< vlen x <element type> >` is a first-class type and
supports most
>> common instructions (details below), but cannot be used as function
>> argument or function return type unless the `inherits_vlen` attribute
is
>> applied to the callee.
>>
>>
>>     ## Rationale / "why this works"
>>
>>     Although the vector length is a property of vector *values*,
tracking
>> the dynamic vector length at the type level would require a
"separate type"
>> for each call to any function. It's much more feasible to attach
the vector
>> length to the *operations* instead.
>>      This works out because SSA values are function-local (so all
>> operation on them agree on the vector length by definition) with the
>> exception of function arguments and return values. Consequently,
>> dynamic-length vectors in function signatures are disallowed
>>      unless the `inherits_vlen` attribute ensured caller and callee
have
>> the same dynamic vector length.
>>
>>     The `vlentoken` token ensures that all operations that start out in
>> the same function must remain in the same function while the code is
>> transformed (recall that tokens cannot be passed to or returned from
>> non-intrinsic functions). That's why it is important
>>      that `vlentoken` is a token, not simply an integer as one might
>> expect. In other words, `vlentoken` does not give the program access to
the
>> dynamic vector length, it communicates a restriction to the optimizer.
>>
>>     ## More details
>>
>>     The `< vlen x <element type> >` type is separate from
the existing
>> vector types. Instructions for fixed-length vectors (elementwise
>> arithmetic, `insertelement`, `select` with a vector of `i1`s, etc.) are
not
>> extended to this new type, at least not in this RFC.
>>      It's a possible future extension, but for now, target-specific
>> intrinsics work fine for those operations.
>>
>>     The following operations on dynamic-length vectors *are* supported:
>>
>>     - `phi`
>>     - `load` and `store`
>>     - `alloca` (at least of a single vector; the `alloca <ty>,
<ty>
>> <NumElements>` form ties into the open question about aggregates
and GEPs
>> below)
>>     - `select` (with `i1` condition)
>>     - Argument passing and return values (`call`, `invoke`, `ret`) for
>> functions with `inherits_vlen`
>>
>>     All of these instructions (including phi) have an additional
operand
>> of type `token` if and only if they operate on a vector of dynamic
length.
>> In textual IR, one appends `, vlen <the token>` to the
instruction, for
>> example:
>>
>>         %0 = vlentoken
>>         %ptr = alloca <vlen x i32>, vlen %0
>>         %v = call <vlen x i32> @foo(), vlen %0
>>         store <vlen x i32> %v, <vlen x i32>* %ptr, vlen %0
>>
>>     Open question: should GEPs and aggregates involving dynamic-length
>> vectors work? This RFC errs on the side of simplicity and excludes them
>> (they're non-trivial to implement and not needed for strip-mined
loops) but
>> if desired, they could be supported.
>>
>>     There are no constants of dynamic-length-vector type except
>> `zeroinitializer` and `undef` (resp. `poison` once that is adopted). In
>> particular, there is no equivalent to fixed-length vector constants
(`<ty
>> elem1, ty elem2, ...>`). Dynamic-length vectors also
>>      cannot be stored in globals.
>>
>>     ## Impact on optimizations
>>
>>     The semantics imply several restrictions on optimizations, but
these
>> are mostly encoded with existing IR constructs -- chiefly, the `token`
type
>> that ties all vector operations to a `vlentoken`. For example, because
>> token values cannot be passed to (non-intrinsic)
>>      functions or returned from them, no special pleading is needed to
>> keep an outliner or partial inliner from spreading vector operations
across
>> multiple functions -- correct passes already don't do that when
tokens are
>> involved. Passes do, however, need to be
>>      updated in two respects.
>>
>>     First, the new token operand needs to be respected when comparing
two
>> instructions, creating new instructions, etc. -- this is an inherent
>> downside of adding new operands, but also rather mechanical. The rule
that
>> there is only one `vlentoken` per function
>>      makes this even more mechanical than usual, because all
instruction
>> within one function have the same vector length token. This means that
one
>> does not need to consider them when comparing instructions from the
same
>> function, and it's always clear which token
>>      should be used when creating a new instruction.
>>
>>     Second, the very same rule of only one `vlentoken` per function
must
>> be respected during interprocedural code motion. For example, inlining
>> can't just copy the `vlentoken` from the callee into the caller.
>>
>>     However, note that it's valid to *merge* the caller's and
callee's
>> `vlentoken` instructions. Because the semantics state that each call to
a
>> function can get a different dynamic vector length, merging
`vlentoken`s
>> *refines* the program's behavior by picking
>>      the possible execution where the callee "happens to" get
the same
>> vector length as the caller in the inlined calls. So inlining can
simply
>> replace all `vlentoken` tokens in the inlined code with the `vlentoken`
>> token of the callee. Other passes are likely
>>      similarly easy to update (and in the worst case, they can just
bail
>> out when seeing dynamic-length vectors).
>>
>>     ## Impact on backends
>>
>>     Unsurprisingly, the IR types `<vlen x <element type> >`
come with
>> associated MVTs. There's also a new SelectionDAG node `VLENTOKEN`
to mirror
>> the `vlentoken` IR instruction (and presumably `G_VLENTOKEN` in GMIR
for
>> GlobalISel).
>>
>>     Backends other than RISC-V can legalize these MVTs and the
>> `VLENTOKEN` node very easily, even if in practice there currently
aren't
>> many useful operations on these vectors without target-specific
intrinsics.
>> Specifically, `< vlen x <element type> >` can be
>>      legalized as `< n x <element type> >` or even `<
scalable n x
>> <element type> >` (the vector type for Arm SVE) for any fixed
`n`. All the
>> complications stemming from the runtime-varying vector length go away,
and
>> the `vlentoken` node can simply be dropped on
>>      the floor.
>>
>>     That leaves targets with an actual runtime-varying vector length in
>> hardware, i.e., RISCV with the V feature enabled. As stated in the
>> introduction, this RFC does not cover the backend changes in detail,
but to
>> give you a rough idea, here's a sketch. Keep in
>>      mind (especially if you're familiar with V) that this is
glossing
>> over everything not directly related to the proposed IR type
(particularly
>> the "polymorphic instruction set" aspect of the register
configuration).
>>
>>     As described earlier, the vector length in RISC-V is completely
>> determined by the vector unit configuration. Therefore, vector
operations
>> in Machine IR have an implicit use of the configuration registers. This
is
>> the moral equivalent of the `vlentoken` token
>>      operand, but more precise (and MIR doesn't have an equivalent
of the
>> token type anyway). To complete the picture, all operations that change
the
>> configuration are made explicit.
>>
>>     Because only virtual registers can be live across basic block
>> boundaries before register allocation, this may require a dummy
register
>> class with only a single physical register, or something similarly
>> inelegant.
>>
>>     With a way to precisely represent vector length changes in hand,
the
>> backend just needs to ensure it implements the semantics of `< vlen
x
>> <element type> >` described earlier. This is achieved by
configuring the
>> vector unit "in the prologue", and then not doing
>>      anything that might change the vector length inside the function.
>> This setup is effectively in the prologue (i.e., before any user code)
but
>> not actually inserted during the "prologue/epilogue
insertion" pass, which
>> runs far too late for this purpose.
>>
>>     For scenarios like two entirely separate vectorized loops within
one
>> function, it might be useful to drastically change the vector unit
>> configuration in the middle of a function. This could be implemented as
an
>> optimization (e.g., a pre-RA machine function),
>>      but it's all hypothetical so far.
>>
>>
>>
>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180413/e04691e1/attachment-0001.html>

Graham Hunter via llvm-dev

2018-Apr-16 16:59 UTC

head link

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

Hi,

Comments below

On 13 April 2018 at 16:52, Robin Kruppe <robin.kruppe at
gmail.com<mailto:robin.kruppe at gmail.com>> wrote:
On 13 April 2018 at 14:37, Graham Hunter <Graham.Hunter at
arm.com<mailto:Graham.Hunter at arm.com>> wrote:
Hi,

Nice to see another group tackling length agnostic vectorization :)

I'm still reading through all the details, but I do have one initial
question related to the vector type; why not use the same mechanism Arm is
proposing for SVE?

We chose the format "<n x m x <ty>>" to represent two
crucial concepts: arbitrary length vectors with the same number of elements, or
with the same number of bits. From what I gather, you're just interested in
the former, so is there a good reason "<n x 1 x <ty>>"
wouldn't work for your use case? The 'n' is just a term to indicate
it's a multiple only known at runtime; you've used 'vlen', and
there are other suggested names, but there's nothing to tie it to an exact
size -- if your dynamic vlen changes with each function, the
'n'/'vlen' just represents the current value.

If the vector length changes at run time, some optimizations that otherwise make
perfect sense are not legal, particularly outlining -- details are in the RFC
text. If loss of those optimizations is OK for SVE, we could talk about unifying
the two vector types. Alternatively, I believe we could generalize the scalable
vector type and my proposal, e.g., by adding my proposed "vlentoken"
to instructions on scalable vectors and encode the "unknown, but fixed at
program load time" vectorlength (i.e., SVE) by using `token none` as vector
length token.

I think your proposed function attribute 'inherits_vlen' would allow for
outlining for SVE without causing problems, possibly using a TTI function to
determine whether that was reasonable based on the backend.

I don't think a token is needed -- you can tell whether special care is
needed when moving/copying an instruction between functions based on the types
of the input/output. I may have missed something in the RFC though.

One question regarding that: since it's just a token, how were you planning
to add the number of elements to a scalar, e.g. to check the remaining number of
iterations against the maximum? For SVE we use a 'vscale' intrinsic to
return the runtime multiple of the base type, which I think would work for you
as well.

For now we can support function calls/returns being a barrier on vector length,
as proposed in your RFC. This was briefly discussed in one of the earlier SVE
threads -- while we don't expect anyone to change the SVE vector length at
runtime, it is possible (via kernel ioctl call) and subject to similar
constraints in that changing it during vector computation will almost certainly
generate an incorrect result. The default AArch64 calling convention doesn't
preserve SVE registers, so this works fine.

I'll be posting a revised patch set soon (hopefully within a couple of
weeks) which should help this discussion -- Sander has upstreamed the
instructions I need for a very simple patch series showing the IR changes and
trivial codegen in isolation.

-Graham

For the initial proposal, I did not want to impose either the loss in
optimization power or the additional complexity on scalable vectors, but if you
folks are fine with one of those options, I'd be very glad to do so and
reduce the number of types added.

Thanks,

Robin

I'll continue looking through this, and will have more feedback next week.

-Graham

On 11/04/2018, 10:45, "llvm-dev on behalf of Robin Kruppe via
llvm-dev" <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces
at lists.llvm.org> on behalf of llvm-dev at lists.llvm.org<mailto:llvm-dev
at lists.llvm.org>> wrote:

RISC-V is an open and free instruction set architecture (ISA) used in
numerous domains in industry and research. The vector extension (short:
'V') supplements the basic ISA with support for data parallel
computations. This RFC sketches a strategy
for targeting this instruction set extension in LLVM.

Some but not all of what is proposed here has already been implemented out
of tree. It is explicitly not proposed to upstream any of this yet: the vector
extension is still evolving (though the core concepts are reasonably stable),
and the implementation is
currently very much prototype quality. Nevertheless, I want to kick off a
discussion about this with the LLVM community now to make sure I'm on the
right track and to make the eventual upstreaming go more smoothly. In
particular, a large and potentially controversial
part of the strategy is a proposal for extending LLVM IR with a new vector
type.

There is also much to be said about how to structure the code generation for
this ISA. However, since that aspect somewhat simpler, largely orthogonal and
affects a smaller subset of the community, the details will be left to a future
RFC.

This RFC is intended to be self-contained, but interested readers can learn
more about the vector extension from Roger Espasa's talk at the 7th RISC-V
workshop (slides [1], recording [2]). The draft specification is also available
as part of the RISC-V Instruction
Set Manual [3], but right now it is unfortunately incomplete and in the
process of being updated.

I will also be at EuroLLVM with a lightning talk and poster on this subject,
so if you're there as well, we can discuss in person.

[1]

https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf
<https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-RISCVRogerEspasaVEXT-v4.pdf>
[2] https://www.youtube.com/watch?v=GzZ-8bHsD5s
[3] https://github.com/riscv/riscv-isa-manual/

# Summary

First-class support for the RISC-V vector ISA requires representing a
hardware vector length that is not just unknown at compile time, but also
changes during execution. This in turn places some restrictions on code motion:
the vector length must not change
while any vector values are live. This RFC proposes to add a new vector
type to LLVM IR for this purpose. Simply put, its length is tied to the
surrounding function and the existing `token` type is leveraged to tell
optimization passes that certain vector
operations must remain together (i.e., in the same function).

# Background

The RISC-V vector extension has many interesting properties. This RFC is not
the right place to talk about it in detail, but this section will briefly
introduce the aspect that is most difficult to support in LLVM IR, and which is
consequently the focus of
this RFC: the runtime-variable vector length.

The number of elements in a vector register is determined by the
microarchitecture. Software uses strip-mined loops to transparently process as
many elements per iteration as the hardware can support. But even beyond that,
the vector length can vary during
the execution of a program: different kernels may *configure* the vector
unit differently depending on their needs, leading to different parts of the
program having differently-sized vector registers.

The vector length being determined by the microarchitecture is similar to
Arm's Scalable Vector Extension (SVE), for which support is being upstreamed
at the moment. However, in SVE the vector length is fixed once a program starts
running, while full use of
the RISC-V vector extension will lead to the vector length changing
regularly during execution. It's possible to maintain the same configuration
-- and therefore the same vector length -- throughout the entire program, but
this will often perform worse than
a tailored configuration.

## Maximum vs active vs application vector length

The V ISA has two notions of vector length: the *maximum* vector length
(called MVL), which describes the number of elements in each vector register,
and the *active* vector length (called vl), which limits how many of those
elements are actually processed
by each vector instruction.

The latter is used to express loops of any application-specified length with
a single copy of the loop body. Instead of handling the tail iterations not
divisible by MVL separately with scalar code, the active length length is set up
so that the last few iterations
process as many elements as are left to process.

The effect of the active vector length is similar to a mask of the form
`<true, ..., true, false, ..., false>`, aside from the scalar control
logic that sets and maintains the active vector length. Thus it can be modeled
in IR with judicious use of intrinsics
and masking. This still allows also having a single loop body in IR,
without introducing new IR concepts in addition to those already needed for the
variable MVL.

Thus, the rest of this RFC focuses on handling the MVL: all references to
"vector length" from this point on should be taken to refer to the
MVL, not the active vector length.

# Scope of the support

To preempt misunderstandings, this section outlines what is meant and not
meant by "support for the vector extension".

## Variable vector length

There *is* an option to entirely avoid the concept of the vector length
changing during execution. Keeping the same vector unit configuration throughout
the entire program execution also leads to the vector length being fixed once
the program starts executing.
In this case, compiler support works out rather rather similar to support
for Arm SVE, with the biggest difference being that vectors lengths are not
multiples of 128 bit (which legalization can paper over). Indeed, no IR changes
beyond those proposed for
SVE support appear to be necessary to implement this approach to RISC-V
vector extension support.

However, this approach is wasteful, as a tailored configuration can improve
performance and energy efficiency significantly. As one data point, the Hwacha
project reported [4] up to 9.5% fewer cycles taken and up to 11% less energy
consumed on a microarchitecture
built to exploit narrower bit widths of vector elements (comparing
"mvp, packed: yes" to "mvp, packed: no"). Besides such
microarchitectural optimizations, enabling fewer registers can improve context
switch times because fewer registers need to be saved,
and being more flexible in how registers can be used (in particular, how
many are reserved for scalar values) aids register allocation.

Thus, restricting programs to a single configuration may be a good first
step to get things up and running, but ultimately support for runtime-varying
vector lengths is desired to make the most of hardware capabilities.

[4] "A Case for MVPs: Mixed-Precision Vector Processors", Albert
Ou, Quan Nguyen, Yunsup Lee, Krste Asanović,
http://hwacha.org/papers/hwacha-mvp-prism2014.pdf

## Producing vector code

It is intended that vector code is primarily produced via loop vectorization
and other IR-level auto-vectorizers (e.g., the region vectorizer), not written
by hand. Supporting loop vectorization is of highest priority. The groundwork
for loop vectorization
should be useful for other kinds of automatic vectorization as well, but
loop vectorization will be implemented first.

It's not required or expected that the stock loop vectorizer can
generate RISC-V vector code from the start. Considering the many significant
differences to the packed-SIMD architectures the loop vectorizer is tailored to,
it's quite likely that some experimentation
in this space is required (e.g., building on VPlan and writing custom
recipes). Of course, in the long term there should be as much code sharing as
possible.

Support for hand-written vector code va source-language-level intrinsics (as
opposed to inline assembly) would be nice to have and probably falls out for
free, but is rather low priority.

## No vector unit configuration in IR

While configuring the vector unit is an essential part of compiling for the
V ISA, it has no place in LLVM IR. Vectors should be regular SSA values that
don't reference any extra state other than (by necessity) the vector length.
Deciding how to configure the
vector unit for a given piece of code is target-dependent and intertwined
with register allocation, and will therefore be left to the backend.

# Challenge: Code motion around vector length changes

When the vector length can change during execution, there are implicit
dependencies between vector operations and points in the program where the
vector length may change. These dependencies must be taken into account somehow,
or else code motion passes could
move vector operations across vector length changes, effectively changing
program semantics. For example, it's nonsensical to compute a vector value
`%v1` with one vector length, change the vector length, and then compute another
vector `%v2` with the new
vector length and add it to `%v1`. This makes no more sense than adding
`<4 x i32>` to `<8 x i32>`, yet it could happen if an input program
has a vector length change *after* these vector calculations and optimization
passes are not aware of the impact of
the vector length change on those calculations.

Crucially, the vector length changes when calling and returning from
functions in most calling conventions. Functions that don't specifically use
a vectorcall ABI configure the vector unit for their own use when called, rather
than using configuration set up
by the caller. Therefore, caller and callee will generally have different
vector lengths, and moving vector operations from the caller into the callee or
vice versa tends to break programs.

However, note that the precise value of the vector length doesn't really
matter -- software is supposed to be *vector length agnostic*. Completely
inlining a function is perfectly fine, for example. What matters is that the
vector length doesn't change *during*
vector computations, i.e., while any vector values are live (either as SSA
values, or in memory!). Thus, there is no need to support and track vector
length changes at instruction granularity. It's enough to coarsely enforce
that the vector length remain constant
throughout a larger code region (say, a loop nest, or a function).

# Runtime-varying vector length in the IR

This is achieved by simply declaring "by fiat" that the vector
length is determined on function entry and remain constant for the rest of the
function execution. Other functions and other calls to the same function may
observe a different vector length, but
within one call to a given function, the vector length is fixed. That is
not precisely how the hardware works, but it is a contract the backend can
uphold easily (more on this later) and it allows piggy-backing on existing IR
constructs (the `token` type)
to communicate restrictions to optimization passes. Nevertheless, some
additions to IR are required: a new first class type, a new instruction, and a
new operand for some existing instructions.

## IR semantics

Every time a function is called, a positive integer called the *dynamic
vector length* is determined in an unspecified way. The dynamic vector length
can differ not only between different functions, but also between different
calls to the same function. The
exception is that functions with the `inherits_vlen` attribute get the same
dynamic vector length as their caller (Note: this attribute is a straw man,
target-specific calling conventions may work better for this purpose).

A new instruction `vlentoken` is added, which has no operands and is of type
`token`. This token represents the dynamic vector length of the function
execution it is in. There can only be one such instruction per function (this is
inconsequential to the operational
semantics, but it simplifies IR passes).

A new kind of type is added, the *dynamic-length vector*, written `< vlen
x <element type> >`. It represents a vector with a number of elements
equal to the dynamic vector length. Like fixed-length and scalable vectors,
these vectors can only contain integer,
float and pointer elements.

A use of a `vlentoken` (representing a dynamic vector length) is attached to
all operations that care for the dynamic vector length. That is, *every*
instruction that handles dynamic-length vectors or is impacted by their length
receives the respective function's
`vlentoken` as extra operand, and operates on a number of elements equal to
the corresponding dynamic vector length.

`< vlen x <element type> >` is a first-class type and supports
most common instructions (details below), but cannot be used as function
argument or function return type unless the `inherits_vlen` attribute is applied
to the callee.

## Rationale / "why this works"

Although the vector length is a property of vector *values*, tracking the
dynamic vector length at the type level would require a "separate
type" for each call to any function. It's much more feasible to attach
the vector length to the *operations* instead.
This works out because SSA values are function-local (so all operation on
them agree on the vector length by definition) with the exception of function
arguments and return values. Consequently, dynamic-length vectors in function
signatures are disallowed
unless the `inherits_vlen` attribute ensured caller and callee have the
same dynamic vector length.

The `vlentoken` token ensures that all operations that start out in the same
function must remain in the same function while the code is transformed (recall
that tokens cannot be passed to or returned from non-intrinsic functions).
That's why it is important
that `vlentoken` is a token, not simply an integer as one might expect. In
other words, `vlentoken` does not give the program access to the dynamic vector
length, it communicates a restriction to the optimizer.

## More details

The `< vlen x <element type> >` type is separate from the
existing vector types. Instructions for fixed-length vectors (elementwise
arithmetic, `insertelement`, `select` with a vector of `i1`s, etc.) are not
extended to this new type, at least not in this RFC.
It's a possible future extension, but for now, target-specific
intrinsics work fine for those operations.

The following operations on dynamic-length vectors *are* supported:

- `phi`
- `load` and `store`
- `alloca` (at least of a single vector; the `alloca <ty>, <ty>
<NumElements>` form ties into the open question about aggregates and GEPs
below)
- `select` (with `i1` condition)
- Argument passing and return values (`call`, `invoke`, `ret`) for functions
with `inherits_vlen`

All of these instructions (including phi) have an additional operand of type
`token` if and only if they operate on a vector of dynamic length. In textual
IR, one appends `, vlen <the token>` to the instruction, for example:

%0 = vlentoken
%ptr = alloca <vlen x i32>, vlen %0
%v = call <vlen x i32> @foo(), vlen %0
store <vlen x i32> %v, <vlen x i32>* %ptr, vlen %0

Open question: should GEPs and aggregates involving dynamic-length vectors
work? This RFC errs on the side of simplicity and excludes them (they're
non-trivial to implement and not needed for strip-mined loops) but if desired,
they could be supported.

There are no constants of dynamic-length-vector type except
`zeroinitializer` and `undef` (resp. `poison` once that is adopted). In
particular, there is no equivalent to fixed-length vector constants (`<ty
elem1, ty elem2, ...>`). Dynamic-length vectors also
cannot be stored in globals.

## Impact on optimizations

The semantics imply several restrictions on optimizations, but these are
mostly encoded with existing IR constructs -- chiefly, the `token` type that
ties all vector operations to a `vlentoken`. For example, because token values
cannot be passed to (non-intrinsic)
functions or returned from them, no special pleading is needed to keep an
outliner or partial inliner from spreading vector operations across multiple
functions -- correct passes already don't do that when tokens are involved.
Passes do, however, need to be
updated in two respects.

First, the new token operand needs to be respected when comparing two
instructions, creating new instructions, etc. -- this is an inherent downside of
adding new operands, but also rather mechanical. The rule that there is only one
`vlentoken` per function
makes this even more mechanical than usual, because all instruction within
one function have the same vector length token. This means that one does not
need to consider them when comparing instructions from the same function, and
it's always clear which token
should be used when creating a new instruction.

Second, the very same rule of only one `vlentoken` per function must be
respected during interprocedural code motion. For example, inlining can't
just copy the `vlentoken` from the callee into the caller.

However, note that it's valid to *merge* the caller's and
callee's `vlentoken` instructions. Because the semantics state that each
call to a function can get a different dynamic vector length, merging
`vlentoken`s *refines* the program's behavior by picking
the possible execution where the callee "happens to" get the same
vector length as the caller in the inlined calls. So inlining can simply replace
all `vlentoken` tokens in the inlined code with the `vlentoken` token of the
callee. Other passes are likely
similarly easy to update (and in the worst case, they can just bail out
when seeing dynamic-length vectors).

## Impact on backends

Unsurprisingly, the IR types `<vlen x <element type> >` come
with associated MVTs. There's also a new SelectionDAG node `VLENTOKEN` to
mirror the `vlentoken` IR instruction (and presumably `G_VLENTOKEN` in GMIR for
GlobalISel).

Backends other than RISC-V can legalize these MVTs and the `VLENTOKEN` node
very easily, even if in practice there currently aren't many useful
operations on these vectors without target-specific intrinsics. Specifically,
`< vlen x <element type> >` can be
legalized as `< n x <element type> >` or even `< scalable n
x <element type> >` (the vector type for Arm SVE) for any fixed `n`.
All the complications stemming from the runtime-varying vector length go away,
and the `vlentoken` node can simply be dropped on
the floor.

That leaves targets with an actual runtime-varying vector length in
hardware, i.e., RISCV with the V feature enabled. As stated in the introduction,
this RFC does not cover the backend changes in detail, but to give you a rough
idea, here's a sketch. Keep in
mind (especially if you're familiar with V) that this is glossing over
everything not directly related to the proposed IR type (particularly the
"polymorphic instruction set" aspect of the register configuration).

As described earlier, the vector length in RISC-V is completely determined
by the vector unit configuration. Therefore, vector operations in Machine IR
have an implicit use of the configuration registers. This is the moral
equivalent of the `vlentoken` token
operand, but more precise (and MIR doesn't have an equivalent of the
token type anyway). To complete the picture, all operations that change the
configuration are made explicit.

Because only virtual registers can be live across basic block boundaries
before register allocation, this may require a dummy register class with only a
single physical register, or something similarly inelegant.

With a way to precisely represent vector length changes in hand, the backend
just needs to ensure it implements the semantics of `< vlen x <element
type> >` described earlier. This is achieved by configuring the vector
unit "in the prologue", and then not doing
anything that might change the vector length inside the function. This
setup is effectively in the prologue (i.e., before any user code) but not
actually inserted during the "prologue/epilogue insertion" pass, which
runs far too late for this purpose.

For scenarios like two entirely separate vectorized loops within one
function, it might be useful to drastically change the vector unit configuration
in the middle of a function. This could be implemented as an optimization (e.g.,
a pre-RA machine function),
but it's all hypothetical so far.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180416/e8bd751e/attachment-0001.html>

llvm dev - Apr 2018 - RFC: Supporting the RISC-V vector extension in LLVM

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

[llvm-dev] RFC: Supporting the RISC-V vector extension in LLVM

Seemingly Similar Threads