thr3ads.net - llvm dev - [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths [Jul 2018]

If this information is useful, please help other people find it:
Share via:

Graham Hunter via llvm-dev

2018-Jun-05 13:15 UTC

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the
end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC
later.

Feedback and questions welcome.

-Graham

============================================================Supporting SIMD
instruction sets with variable vector lengths
============================================================
In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

=========Background
=========
*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

=======Contents
=======
The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size
isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a
runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

=======1. Types
=======
To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
  elements.

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same
number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the
types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

==============2. Size Queries
==============
This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes is also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector
length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.

=======================================3. Representing Vector Length at Runtime
=======================================
With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  ;; <check and branch>
``
==========================4. Generating Vector Values
==========================For constant vector values, we cannot specify all the
elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the
new
start can be added, and changing the step requires multiplying by a splat.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x
i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

=================5. Code Generation
=================
IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary
to
convert a predicated instruction into an unpredicated one.

=========6. Example
=========
The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  br vector.body

vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
  %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align
4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

=========7. Patches
=========
List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

via llvm-dev

2018-Jun-05 15:23 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Graham,

Just a few initial comments.

Graham Hunter <Graham.Hunter at arm.com> writes:
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the
same number of
>   bytes.
"scalable" instead of "scalable x."
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not
need to
> deal with scalable types, another function (getFixedSizeExpressionInBits)
that
> only returns unscaled bits will be provided, with a debug assert that the
type
> isn't scalable.
Can you explain a bit about what the two integers represent?  What's the
"unscaled" part for?

The name "getSizeExpressionInBits" makes me think that a Value
expression will be returned (something like a ConstantExpr that uses
vscale).  I would be surprised to get a pair of integers back.  Do
clients actually need constant integer values or would a ConstantExpr
sufffice?  We could add a ConstantVScale or something to make it work.
> Comparing two of these sizes together is straightforward if only unscaled
sizes
> are used. Comparisons between scaled sizes is also simple when comparing
sizes
> within a function (or across functions with the inherit flag mentioned in
the
> changes to the type), but cannot be compared otherwise. If a mix is
present,
> then any number of unscaled bits will not be considered to have a greater
size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled
bits
> (since the runtime multiple is at least one).
If we went the ConstantExpr route and added ConstantExpr support to
ScalarEvolution, then SCEVs could be compared to do this size
comparison.  We have code here that adds ConstantExpr support to
ScalarEvolution.  We just didn't know if anyone else would be interested
in it since we added it solely for our Fortran frontend.
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
vector.
I think this may be a case where added a full-fledged Instruction might
be worthwhile.  Because vscale is intimately tied to addressing, it
seems like things such as ScalarEvolution support will be important.  I
don't know what's involved in making intrinsics work with
ScalarEvolution but it seems strangely odd that a key component of IR
computation would live outside the IR proper, in the sense that all
other fundamental addressing operations are Instructions.
> For constants consisting of a sequence of values, an experimental
`stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
This is another case where an Instruction might be better, for the same
reasons as with vscale.

Also, "iota" is the name Cray has traditionally used for this
operation
as it is the mathematical name for the concept.  It's also used by C++
and go and so should be familiar to many people.
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
As above, we could add ConstantVScale and also ConstantStepVector (or
ConstantIota).  They won't fold to compile-time values but the
expressions could be simplified.  I haven't really thought through the
implications of this, just brainstorming ideas.  What does your
downstream compiler require in terms of constant support.  What kinds of
queries does it need to do?

                         -David

Renato Golin via llvm-dev

2018-Jun-05 17:02 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

On 5 June 2018 at 16:23,  <dag at cray.com> wrote:> The name "getSizeExpressionInBits" makes me think that a Value
> expression will be returned (something like a ConstantExpr that uses
> vscale).  I would be surprised to get a pair of integers back.
Same here.

> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison.
This sounds like a cleaner solution.

> I think this may be a case where added a full-fledged Instruction might
> be worthwhile.  Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important.  I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.
...> This is another case where an Instruction might be better, for the same
> reasons as with vscale.
This is a side-effect of the original RFC a few years ago. The general
feeling was that we can start with intrinsics and, if they work, we
change the IR.

We can have a work-in-progress implementation before fully committing
SCEV and other more core ideas in, and then slowly, and with more
certainty, move where it makes more sense.

> Also, "iota" is the name Cray has traditionally used for this
operation
> as it is the mathematical name for the concept.  It's also used by C++
> and go and so should be familiar to many people.
That sounds better, but stepvector is more than C++'s iota and it's
just a plain scalar evolution sequence like {start, op, step}. In
C++'s iota (AFAICS), the step is always 1.

Anyway, I don't mind any name, really. Whatever is more mnemonic.

>  ;; Create sequence for scalable vector
>  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
Once stepvetor (or iota) becomes a proper IR instruction, I'd like to
see this restricted to inlined syntax. The sequence { step*vec +
offset } only makes sense in the scalable context and the intermediate
results should not be used elsewhere.


-- 
cheers,
--renato

Graham Hunter via llvm-dev

2018-Jun-05 18:25 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi David,

Thanks for taking a look.
> On 5 Jun 2018, at 16:23, dag at cray.com wrote:
> 
> Hi Graham,
> 
> Just a few initial comments.
> 
> Graham Hunter <Graham.Hunter at arm.com> writes:
> 
>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have
the same number of
>>  bytes.
> 
> "scalable" instead of "scalable x."
Yep, missed that in the conversion from the old <n x m x ty> format.
> 
>> For derived types, a function (getSizeExpressionInBits) to return a
pair of
>> integers (one to indicate unscaled bits, the other for bits that need
to be
>> scaled by the runtime multiple) will be added. For backends that do not
need to
>> deal with scalable types, another function
(getFixedSizeExpressionInBits) that
>> only returns unscaled bits will be provided, with a debug assert that
the type
>> isn't scalable.
> 
> Can you explain a bit about what the two integers represent?  What's
the
> "unscaled" part for?
'Unscaled' just means 'exactly this many bits', whereas
'scaled' is 'this many bits
multiplied by vscale'.
> 
> The name "getSizeExpressionInBits" makes me think that a Value
> expression will be returned (something like a ConstantExpr that uses
> vscale).  I would be surprised to get a pair of integers back.  Do
> clients actually need constant integer values or would a ConstantExpr
> sufffice?  We could add a ConstantVScale or something to make it work.
I agree the name is not ideal and I'm open to suggestions -- I was thinking
of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size
known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64,
256).

When calculating the offset for memory addresses, you just need to multiply the
scaled
part by vscale and add the unscaled as is.
> 
>> Comparing two of these sizes together is straightforward if only
unscaled sizes
>> are used. Comparisons between scaled sizes is also simple when
comparing sizes
>> within a function (or across functions with the inherit flag mentioned
in the
>> changes to the type), but cannot be compared otherwise. If a mix is
present,
>> then any number of unscaled bits will not be considered to have a
greater size
>> than a smaller number of scaled bits, but a smaller number of unscaled
bits
>> will be considered to have a smaller size than a greater number of
scaled bits
>> (since the runtime multiple is at least one).
> 
> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison.  We have code here that adds ConstantExpr support to
> ScalarEvolution.  We just didn't know if anyone else would be
interested
> in it since we added it solely for our Fortran frontend.
We added a dedicated SCEV expression class for vscale instead; I suspect it
works
either way.
> 
>> We have added an experimental `vscale` intrinsic to represent the
runtime
>> multiple. Multiplying the result of this intrinsic by the minimum
number of
>> elements in a vector gives the total number of elements in a scalable
vector.
> 
> I think this may be a case where added a full-fledged Instruction might
> be worthwhile.  Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important.  I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.
We've tried it as both an instruction and as a 'Constant', and both
work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.
> 
>> For constants consisting of a sequence of values, an experimental
`stepvector`
>> intrinsic has been added to represent a simple constant of the form
>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat
of the new
>> start can be added, and changing the step requires multiplying by a
splat.
> 
> This is another case where an Instruction might be better, for the same
> reasons as with vscale.
> 
> Also, "iota" is the name Cray has traditionally used for this
operation
> as it is the mathematical name for the concept.  It's also used by C++
> and go and so should be familiar to many people.
Iota would be fine with me; I forget the reason we didn't go with that
initially. We
also had 'series_vector' in the past, but that was a more generic form
with start
and step parameters instead of requiring additional IR instructions to multiply
and
add for the result as we do for stepvector.
> 
>> Future Work
>> -----------
>> 
>> Intrinsics cannot currently be used for constant folding. Our
downstream
>> compiler (using Constants instead of intrinsics) relies quite heavily
on this
>> for good code generation, so we will need to find new ways to recognize
and
>> fold these values.
> 
> As above, we could add ConstantVScale and also ConstantStepVector (or
> ConstantIota).  They won't fold to compile-time values but the
> expressions could be simplified.  I haven't really thought through the
> implications of this, just brainstorming ideas.  What does your
> downstream compiler require in terms of constant support.  What kinds of
> queries does it need to do?
It makes things a little easier to pattern match (just looking for a constant to
start
instead of having to match multiple different forms of vscale or stepvector
multiplied
and/or added in each place you're looking for them).

The bigger reason we currently depend on them being constant is that code
generation
generally looks at a single block at a time, and there are several expressions
using
vscale that we don't want to be generated in one block and passed around in
a register,
since many of the load/store addressing forms for instructions will already
scale properly.

We've done this downstream by having them be Constants, but if there's a
good way
of doing them with intrinsics we'd be fine with that too.

-Graham

Robin Kruppe via llvm-dev

2018-Jun-08 15:24 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was
curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it's a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but
(IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I'll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

On 5 June 2018 at 15:15, Graham Hunter <Graham.Hunter at arm.com> wrote:
> Hi,
>
> Now that Sander has committed enough MC support for SVE, here's an
updated
> RFC for variable length vector support with a set of 14 patches (listed at
> the end)
> to demonstrate code generation for SVE using the extensions proposed in
> the RFC.
>
> I have some ideas about how to support RISC-V's upcoming extension
> alongside
> SVE; I'll send an email with some additional comments on Robin's
RFC later.
>
> Feedback and questions welcome.
>
> -Graham
>
> ============================================================> Supporting
SIMD instruction sets with variable vector lengths
> ============================================================>
> In this RFC we propose extending LLVM IR to support code-generation for
> variable
> length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
> approach is backwards compatible and should be as non-intrusive as
> possible; the
> only change needed in other backends is how size is queried on vector
> types, and
> it only requires a change in which function is called. We have created a
> set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7
> of
> this rfc) can be found on Phabricator and are intended to illustrate the
> scope
> of changes required by the general approach described in this RFC.
>
> =========> Background
> =========>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
> for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of
> the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in
> which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time
> constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architectu
> re-reference-manual-supplement-the-scalable-vector-
> extension-sve-for-armv8-a
>
> =======> Contents
> =======>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent
> vectors that
>    have a length which is a runtime-determined multiple of a known base
> length.
>
> 2. Size Queries - how to reason about the size of types for which the size
> isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in
> address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a
runtime-determined
>    number of elements.
>
> 5. A brief note on code generation of these new operations for AArch64.
>
> 6. An example of C code and matching IR using the proposed extensions.
>
> 7. A list of patches demonstrating the changes required to emit SVE
> instructions
>    for a loop that has already been vectorized using the extensions
> described
>    in this RFC.
>
> =======> 1. Types
> =======>
> To represent a vector of unknown length a boolean `Scalable` property has
> been
> added to the `VectorType` class, which indicates that the number of
> elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
> field.
> Most code that deals with vectors doesn't need to know the exact
length,
> but
> does need to know relative lengths -- e.g. get a vector with the same
> number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we
> introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and
> avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can
> be used
> to obtain the (potentially scalable) number of elements. Overloaded
> division and
> multiplication operators allow an ElementCount instance to be used in much
> the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason
> about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution
> for SVE,
> but it is possible. The model of scalable vectors presented in this RFC
> assumes
> that the multiple will be constant within a function but not necessarily
> across
> functions. As suggested in the recent RISC-V rfc, a new function attribute
> to
> inherit the multiple across function calls will allow for function calls
> with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number
> of
> elements, and the string literal `scalable` indicates that the total
> number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary
> choice
> for indicating that the vector is scalable, and could be substituted by
> another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change
> in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of
> elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming
> they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
>   elements.
>
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the
same number
> of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to
> the
> type record. If the field is not present the type will default to a
> fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes
> that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable
of
> providing support for C-level intrinsics instead of being used and
> recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that
> creates
> new VectorTypes, much of that code doesn't need to care about whether
the
> types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to
> represent
> the number of elements.
>
> ==============> 2. Size Queries
> ==============>
> This is a proposal for how to deal with querying the size of scalable
> types.
> While it has not been implemented in full, the general approach works well
> for calculating offsets into structures with scalable types in a modified
> version of ComputeValueVTs in our downstream compiler.
>
> Current IR types that have a known size all return a single integer
> constant.
> For scalable types a second integer is needed to indicate the number of
> bytes
> which need to be scaled by the runtime multiple to obtain the actual
> length.
>
> For primitive types, getPrimitiveSizeInBits will function as it does today,
> except that it will no longer return a size for vector types (it will
> return 0,
> as it does for other derived types). The majority of calls to this
> function are
> already for scalar rather than vector types.
>
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not
> need to
> deal with scalable types, another function (getFixedSizeExpressionInBits)
> that
> only returns unscaled bits will be provided, with a debug assert that the
> type
> isn't scalable.
>
> Similar functionality will be added to DataLayout.
>
> Comparing two of these sizes together is straightforward if only unscaled
> sizes
> are used. Comparisons between scaled sizes is also simple when comparing
> sizes
> within a function (or across functions with the inherit flag mentioned in
> the
> changes to the type), but cannot be compared otherwise. If a mix is
> present,
> then any number of unscaled bits will not be considered to have a greater
> size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled
> bits
> (since the runtime multiple is at least one).
>
You mention it's not fully implemented yet, but perhaps you have some
thoughts
on what the APIs for this should look like?

For size comparisons it's concerning that it could silently give misleading
results when operating across function boundaries. One solution could be a
function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but
that extra parameter may turn out to be very viral. OTOH, maybe that's
unavoidable complexity from having type "sizes" vary between
functions.

Alternatively, general size queries could return "incomparable" and
"only if
in the same function" results in addition to smaller/larger/equal. This
might
nudge code to handling all these possibilities as well as it can.

Future Work> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe'
vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns
true
> and make use of the size when calculating offsets.
>
Seconded, I encountered those as well.

We have considered introducing multiple helper functions instead
of> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
>
+1 for clear helpers, but this sentiment is somewhat independent of the
changes to type sizes. For example, `isBitCastableTo` seems appealing to me
not just because it clarifies the intent of the size comparison but also
because it can encapsulate all the cases where types have the same size but
bitcasts still aren't allowed (ptr<->int, aggregates). With a good API
for
the
{scaled, unscaled} pairs, the size comparison should not be much more
complex
than today.

Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.

> =======================================> 3. Representing Vector Length
at Runtime
> =======================================>
> With a scalable vector type defined, we now need a way to represent the
> runtime
> length in IR in order to generate addresses for consecutive vectors in
> memory
> and determine how many elements have been processed in an iteration of a
> loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
> vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
> %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
> %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>
I didn't see anything about this in the text (apologies if I missed
something), but it appears this intrinsic is overloaded to be able to return
any integer width. It's not a big deal either way, but is there a particular
reason for doing that, rather than picking one sufficiently large integer
type
and combining it with trunc/zext as appropriate?

>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>
Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a
constant
expression or something? I would have expected:

```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```

  ;; <check and branch>> ``
> ==========================> 4. Generating Vector Values
> ==========================> For constant vector values, we cannot
specify all the elements as we can
> for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length
> vectors.
>
> For constants consisting of a sequence of values, an experimental
> `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
>
+1 for making this intrinsic minimal and having canonical IR instruction
sequences for things like stride and starting offset.

> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef,
<4 x i32>
> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32>
> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable
4 x i32>
> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.step
> vector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
> this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> =================> 5. Code Generation
> =================>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64
> backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit
> was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not
> be
> necessary in future, generating an all-true 'ptrue' value was
necessary to
> convert a predicated instruction into an unpredicated one.
>
> =========> 6. Example
> =========>
> The following example shows a simple C loop which assigns the array index
> to
> the array elements matching that index. The IR shows how vscale and
> stepvector
> are used to create the needed values and to advance the index variable in
> the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.step
> vector.nxv4i32()
>   br vector.body
>
> vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
> %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader
> ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
> %vscale32, i32 4), i32 0
>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32>
> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =========> 7. Patches
> =========>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D4776
> 8
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180608/db903c7d/attachment.html>

Sander De Smalen via llvm-dev

2018-Jun-11 12:50 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Robin,
> One thing I am missing is a discussion of how stack frame layout will be
> handled. Earlier RFCs mentioned a concept called "Stack Regions"
but (IIRC)
> gave little details and it was a long time ago anyway. What are your
current
> plans here?Good that you're pointing this out. For now, we can be overly conservative
and
assume the in-memory size of a scalable vector to be the architecturally defined
maximum. Naturally this is not very efficient as we allocate much more space
on the stack than needed.

I'll be posting a separate RFC that addresses a more optimal frame-layout
with scalable vectors in the coming weeks.

Cheers,

Sander

From: Robin Kruppe <robin.kruppe at gmail.com>
Date: Friday, 8 June 2018 at 16:24
To: Graham Hunter <Graham.Hunter at arm.com>
Cc: Chris Lattner <clattner at nondot.org>, Hal Finkel <hfinkel at
anl.gov>, "Jones, Joel" <Joel.Jones at cavium.com>, "dag
at cray.com" <dag at cray.com>, Renato Golin <renato.golin at
linaro.org>, Kristof Beyls <Kristof.Beyls at arm.com>, Amara Emerson
<aemerson at apple.com>, Florian Hahn <Florian.Hahn at arm.com>,
Sander De Smalen <Sander.DeSmalen at arm.com>, "llvm-dev at
lists.llvm.org" <llvm-dev at lists.llvm.org>, "mkuper at
google.com" <mkuper at google.com>, Sjoerd Meijer <Sjoerd.Meijer
at arm.com>, Sam Parker <Sam.Parker at arm.com>, nd <nd at
arm.com>
Subject: Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector
lengths

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it's a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but
(IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I'll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

On 5 June 2018 at 15:15, Graham Hunter <mailto:Graham.Hunter at arm.com>
wrote:
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the
end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC
later.

Feedback and questions welcome.

-Graham

============================================================Supporting SIMD
instruction sets with variable vector lengths
============================================================
In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

=========Background
=========
*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

=======Contents
=======
The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size
isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a
runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

=======1. Types
=======
To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
  elements.

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same
number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the
types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

==============2. Size Queries
==============
This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes is also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

You mention it's not fully implemented yet, but perhaps you have some
thoughts
on what the APIs for this should look like?

For size comparisons it's concerning that it could silently give misleading
results when operating across function boundaries. One solution could be a
function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but
that extra parameter may turn out to be very viral. OTOH, maybe that's
unavoidable complexity from having type "sizes" vary between
functions.

Alternatively, general size queries could return "incomparable" and
"only if
in the same function" results in addition to smaller/larger/equal. This
might
nudge code to handling all these possibilities as well as it can.

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector
length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

Seconded, I encountered those as well.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.

+1 for clear helpers, but this sentiment is somewhat independent of the
changes to type sizes. For example, `isBitCastableTo` seems appealing to me
not just because it clarifies the intent of the size comparison but also
because it can encapsulate all the cases where types have the same size but
bitcasts still aren't allowed (ptr<->int, aggregates). With a good API
for the
{scaled, unscaled} pairs, the size comparison should not be much more complex
than today.

Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.

=======================================3. Representing Vector Length at Runtime
=======================================
With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()

I didn't see anything about this in the text (apologies if I missed
something), but it appears this intrinsic is overloaded to be able to return
any integer width. It's not a big deal either way, but is there a particular
reason for doing that, rather than picking one sufficiently large integer type
and combining it with trunc/zext as appropriate?

  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)

Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a constant
expression or something? I would have expected:

```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```
  ;; <check and branch>
``
==========================4. Generating Vector Values
==========================For constant vector values, we cannot specify all the
elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the
new
start can be added, and changing the step requires multiplying by a splat.

+1 for making this intrinsic minimal and having canonical IR instruction
sequences for things like stride and starting offset.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x
i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

=================5. Code Generation
=================
IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary
to
convert a predicated instruction into an unpredicated one.

=========6. Example
=========
The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  br vector.body

vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
  %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align
4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

=========7. Patches
=========
List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

Graham Hunter via llvm-dev

2018-Jun-11 15:48 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Robin,

Thanks for the comments; replies inline (except for the stack regions question;
Sander is handling that side of things).

-Graham

On 8 Jun 2018, at 16:24, Robin Kruppe <robin.kruppe at
gmail.com<mailto:robin.kruppe at gmail.com>> wrote:

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it's a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but
(IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I'll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

On 5 June 2018 at 15:15, Graham Hunter <Graham.Hunter at
arm.com<mailto:Graham.Hunter at arm.com>> wrote:
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the
end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC
later.

Feedback and questions welcome.

-Graham

============================================================Supporting SIMD
instruction sets with variable vector lengths
============================================================
In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

=========Background
=========
*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

=======Contents
=======
The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size
isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a
runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

=======1. Types
=======
To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
  elements.

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same
number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the
types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

==============2. Size Queries
==============
This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes is also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

You mention it's not fully implemented yet, but perhaps you have some
thoughts
on what the APIs for this should look like?

For size comparisons it's concerning that it could silently give misleading
results when operating across function boundaries. One solution could be a
function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but
that extra parameter may turn out to be very viral. OTOH, maybe that's
unavoidable complexity from having type "sizes" vary between
functions.

Alternatively, general size queries could return "incomparable" and
"only if
in the same function" results in addition to smaller/larger/equal. This
might
nudge code to handling all these possibilities as well as it can.

I agree that would be nice to catch invalid comparisons; I've considered a
function that would take two 'Value*'s instead of types so that the
parent
could be determined, but I don't think that works in all cases.

Using an optional 'Function*' argument in a size comparison function
will
work; I think many of the existing size queries are on scalar values in code
that has already explicitly checked that it's not working on aggregate
types.

It may be a better idea to catch misuses when trying to clone instructions
into a different function.


Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector
length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

Seconded, I encountered those as well.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.

+1 for clear helpers, but this sentiment is somewhat independent of the
changes to type sizes. For example, `isBitCastableTo` seems appealing to me
not just because it clarifies the intent of the size comparison but also
because it can encapsulate all the cases where types have the same size but
bitcasts still aren't allowed (ptr<->int, aggregates). With a good API
for the
{scaled, unscaled} pairs, the size comparison should not be much more complex
than today.

Agreed, it's orthogonal to the scalable vector part.


Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.

So it does; there's a few more cases around the codebase that don't use
that function
though. I may create a cleanup patch for them.

Other possibilities might be 'requires[Sign|Zero]Extension',
'requiresTrunc', 'getLargestType',
etc.



=======================================3. Representing Vector Length at Runtime
=======================================
With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()

I didn't see anything about this in the text (apologies if I missed
something), but it appears this intrinsic is overloaded to be able to return
any integer width. It's not a big deal either way, but is there a particular
reason for doing that, rather than picking one sufficiently large integer type
and combining it with trunc/zext as appropriate?

It's a leftover from me converting from the constant, which could be of any
integer type.


  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)

Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a constant
expression or something? I would have expected:

```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```

The latter, though in this case I updated the example IR by hand so didn't
catch
that case ;)

The IR in the example patch was generated by the compiler, and does split out
the
multiply into a separate instruction (which is then strength-reduced to a
shift).


  ;; <check and branch>
``
==========================4. Generating Vector Values
==========================For constant vector values, we cannot specify all the
elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the
new
start can be added, and changing the step requires multiplying by a splat.

+1 for making this intrinsic minimal and having canonical IR instruction
sequences for things like stride and starting offset.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x
i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

=================5. Code Generation
=================
IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary
to
convert a predicated instruction into an unpredicated one.

=========6. Example
=========
The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  br vector.body

vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
  %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align
4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

=========7. Patches
=========
List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180611/d9011d45/attachment.html>

David A. Greene via llvm-dev

2018-Jun-11 21:19 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Graham Hunter <Graham.Hunter at arm.com> writes:
> =======> 1. Types
> =======>
> To represent a vector of unknown length a boolean `Scalable` property has
been
> added to the `VectorType` class, which indicates that the number of
elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
field.
> Most code that deals with vectors doesn't need to know the exact
length, but
> does need to know relative lengths -- e.g. get a vector with the same
number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we
introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and
avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be
used
> to obtain the (potentially scalable) number of elements. Overloaded
division and
> multiplication operators allow an ElementCount instance to be used in much
the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason
about the
> relationship between different scalable vector types without knowing their
> exact length.
How does this work in practice?  Let's say I populate a vector with a
splat.  Presumably, that gives me a "full length" vector.  Let's
say the
type is <scalable 2 x double>.  How do I split the vector and get
something half the width?  What is its type?  How do I split it again
and get something a quarter of the width?  What is its type?  How do I
use half- and quarter-width vectors?  Must I resort to predication?

Ths split question comes into play for backward compatibility.  How
would one take a scalable vector and pass it into a NEON library?  It is
likely that some math functions, for example, will not have SVE versions
available.

Is there a way to represent "double width" vectors?  In
mixed-data-size
loops it is sometimes convenient to reason about double-width vectors
rather than having to split them (to legalize for the target
architecture) and keep track of their parts early on.  I guess the more
fundamental question is about how such loops should be handled.

What do insertelement and extractelement mean for scalable vectors?
Your examples showed insertelement at index zero.  How would I, say,
insertelement into the upper half of the vector?  Or any arbitrary
place?  Does insertelement at index 10 of a <scalable 2 x double> work,
assuming vscale is large enough?  It is sometimes useful to constitute a
vector out of various scalar pieces and insertelement is a convenient
way to do it.

Thanks for your patience.  :)

                          -David

Graham Hunter via llvm-dev

2018-Jun-15 16:19 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi David,

Responses below.

-Graham

On 11 Jun 2018, at 22:19, David A. Greene <dag at cray.com<mailto:dag at
cray.com>> wrote:

Graham Hunter <Graham.Hunter at arm.com<mailto:Graham.Hunter at
arm.com>> writes:

=======1. Types
=======
To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

How does this work in practice? Let's say I populate a vector with a
splat. Presumably, that gives me a "full length" vector. Let's
say the
type is <scalable 2 x double>. How do I split the vector and get
something half the width? What is its type? How do I split it again
and get something a quarter of the width? What is its type? How do I
use half- and quarter-width vectors? Must I resort to predication?

To split a <scalable 2 x double> in half, you'd use a shufflevector in
much the
same way you would for fixed-length vector types.

e.g.
``
%sv = call <scalable 1 x i32>
@llvm.experimental.vector.stepvector.nxv1i32()
%halfvec = shufflevector <scalable 2 x double> %fullvec, <scalable 2
x double> undef, <scalable 1 x i32> %sv
``

You can't split it any further than a <scalable 1 x <ty>>, since
there may only be
one element in the actual hardware vector at runtime. The same restriction
applies to
a <1 x <ty>>. This is why we have a minimum number of lanes in
addition to the
scalable flag so that we can concatenate and split vectors, since SVE registers
have
the same number of bytes and will therefore decrease the number of elements per
register as the element type increases in size.

If you want to extract something other than the first part of a vector, you need
to add
offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2)
allows you
to reach the high half of a larger register).

If you check the patch which introduces splatvector
(https://reviews.llvm.org/D47775),
you can see a line which currently produces an error if changing the size of a
vector
is required, and notes that VECTOR_SHUFFLE_VAR hasn't been implemented yet.

In our downstream compiler, this is an ISD alongside VECTOR_SHUFFLE which
allows a shuffle with a variable mask instead of a constant.

If people feel it would be useful, I can prepare another patch which implements
these
shuffles (as an intrinsic rather than a common ISD) for review now instead of
later;
I tried to keep the initial patch set small so didn't cover all cases.

In terms of using predication, that's generally not required for the legal
integer types;
normal promotion via sign or zero extension work. We've tried to reuse
existing
mechanisms wherever possible.

For floating point types, we do use predication to allow the use of otherwise
illegal
types like <scalable 1 x double>, but that's limited to the AArch64
backend and does
not need to be represented in IR.

Ths split question comes into play for backward compatibility. How
would one take a scalable vector and pass it into a NEON library? It is
likely that some math functions, for example, will not have SVE versions
available.

I don't believe we intend to support this, but instead provide libraries
with
SVE versions of functions instead. The problem is that you don't know how
many NEON-size subvectors exist within an SVE vector at compile time.
While you could create a loop with 'vscale' number of iterations and try
to
extract those subvectors, I suspect the IR would end up being quite messy
and potentially hard to recognize and optimize.

The other problem with calling non-SVE functions is that any live SVE
registers must be spilled to the stack and filled after the call, which is
likely to be quite expensive.

Is there a way to represent "double width" vectors? In
mixed-data-size
loops it is sometimes convenient to reason about double-width vectors
rather than having to split them (to legalize for the target
architecture) and keep track of their parts early on. I guess the more
fundamental question is about how such loops should be handled.

For SVE, it's fine to generate IR with types that are 'double size'
or larger,
and just leave it to legalization at SelectionDAG level to split into multiple
legal size registers.

Again, if people would like me to create a patch to illustrate legalization
sooner
rather than later to help understand what's needed to support these types,
let
me know.

What do insertelement and extractelement mean for scalable vectors?
Your examples showed insertelement at index zero. How would I, say,
insertelement into the upper half of the vector? Or any arbitrary
place? Does insertelement at index 10 of a <scalable 2 x double> work,
assuming vscale is large enough? It is sometimes useful to constitute a
vector out of various scalar pieces and insertelement is a convenient
way to do it.

So you can insert or extract any element known to exist (in other words,
it's
within the minimum number of elements). Using a constant index outside
that range will fail, as we won't know whether the element actually exists
until we're running on a cpu.

Our downstream compiler supports inserting and extracting arbitrary elements
from calculated offsets as part of our experiment on search loop vectorization,
but that generates the offsets based on a count of true bits within partitioned
predicates. I was planning on proposing new intrinsics to improve predicate use
within llvm at a later date.

We have been able to implement various types of known shuffles (like the
high/low
half extract, zip, concatention, etc) with vscale, stepvector, and the existing
IR
instructions.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180615/08e788d9/attachment-0001.html>

Graham Hunter via llvm-dev

2018-Jul-02 09:53 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi,

I've updated the RFC slightly based on the discussion within the thread,
reposted below. Let me know if I've missed anything or if more clarification
is needed.

Thanks,

-Graham

============================================================Supporting SIMD
instruction sets with variable vector lengths
============================================================
In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

=========Background
=========
*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

=======Contents
=======
The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size
isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a
runtime-determined
   number of elements.

5. An explanation of splitting/concatentating scalable vectors.

6. A brief note on code generation of these new operations for AArch64.

7. An example of C code and matching IR using the proposed extensions.

8. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

=======1. Types
=======
To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
  elements.

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same
number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the
types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

==============2. Size Queries
==============
This is a proposal for how to deal with querying the size of scalable types for
analysis of IR. While it has not been implemented in full, the general approach
works well for calculating offsets into structures with scalable types in a
modified version of ComputeValueVTs in our downstream compiler.

For current IR types that have a known size, all query functions return a single
integer constant. For scalable types a second integer is needed to indicate the
number of bytes/bits which need to be scaled by the runtime multiple to obtain
the actual length.

For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function `getScalableSizePairInBits()` will be added, which
returns a pair of integers (one to indicate unscaled bits, the other for bits
that need to be scaled by the runtime multiple). For backends that do not need
to deal with scalable types the existing methods will suffice, but a debug-only
assert will be added to them to ensure they aren't used on scalable types.

Similar functionality will be added to DataLayout.

Comparisons between sizes will use the following methods, assuming that X and
Y are non-zero integers and the form is of { unscaled, scaled }.

{ X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.

{ 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
                         functions that inherit vector length. Cannot be
                         compared across non-inheriting functions.

{ X, 0 } > { 0, Y }: Cannot return true.

{ X, 0 } = { 0, Y }: Cannot return true.

{ X, 0 } < { 0, Y }: Can return true.

{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
                             terms and try the above comparisons; it
                             may not be possible to get a good answer.

It's worth noting that we don't expect the last case (mixed scaled and
unscaled sizes) to occur. Richard Sandiford's proposed C extensions
(http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
prohibits mixing fixed-size types into sizeless struct.

I don't know if we need a 'maybe' or 'unknown' result for
cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.

My current intention is to rely on functions that clone or copy values to
check whether they are being used to copy scalable vectors across function
boundaries without the inherit vlen attribute and raise an error there instead
of requiring passing the Function a type size is from for each comparison. If
there's a strong preference for moving the check to the size comparison
function
let me know; I will be starting work on patches for this later in the year if
there's no major problems with the idea.

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector
length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.

=======================================3. Representing Vector Length at Runtime
=======================================
With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  ;; <check and branch>
``
==========================4. Generating Vector Values
==========================For constant vector values, we cannot specify all the
elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the
new
start can be added, and changing the step requires multiplying by a splat.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x
i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

==========================================5. Splitting and Combining Scalable
Vectors
==========================================
Splitting and combining scalable vectors in IR is done in the same manner as
for fixed-length vectors, but with a non-constant mask for the shufflevector.

The following is an example of splitting a <scalable 4 x double> into two
separate <scalable 2 x double> values.

``
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  ;; Stepvector generates the element ids for first subvector
  %sv1 = call <scalable 2 x i64>
@llvm.experimental.vector.stepvector.nxv2i64()
  ;; Add vscale * 2 to get the starting element for the second subvector
  %ec = mul i64 %vscale64, 2
  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
  %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x
i64> undef, <scalable 2 x i32> zeroinitializer
  %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64
  ;; Perform the extracts
  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv1
  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv2
``

=================6. Code Generation
=================
IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary
to
convert a predicated instruction into an unpredicated one.

=========7. Example
=========
The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  br vector.body

vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
  %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align
4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

=========8. Patches
=========
List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

Simon Moll via llvm-dev

2018-Jul-02 15:08 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi,

i am the main author of RV, the Region Vectorizer 
(github.com/cdl-saarland/rv). I want to share our standpoint as 
potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).

-- support for `llvm.experimental.vector.reduce.*` intrinsics --

RV relies heavily on predicate reductions (`or` and `and` reduction) to 
tame divergent loops and provide a vector-length agnostic programming 
model on LLVM IR. I'd really like to see these adopted early on in the 
new VLA backends so we can fully support these targets from the start. 
Without these generic intrinsics, we would either need to emit target 
specific ones or go through the painful process of VLA-style reduction 
trees with loops or the like.


-- setting the vector length (MVL) --

I really like the idea of the `inherits_vlen` attribute. Absence of this 
attribute in a callee means we can safely stop tracking the vector 
length across the call boundary.

However, i think there are some issues with the `vlen token` approach.

* Why do you need an explicit vlen token if there is a 1 : 1-0 
correspondence between functions and vlen tokens?

* My main concern is that you are navigating towards a local optimum 
here. All is well as long as there is only one vector length per 
function. However, if the architecture supports changing the vector 
length at any point but you explicitly forbid it, programmers will 
complain, well, i will for one ;-) Once you give in to that demand you 
are facing the situation that multiple vector length tokens are live 
within the same function. This means you have to stop transformations 
from mixing vector operations with different vector lengths: these would 
otherwise incur an expensive state change at every vlen transition. 
However, there is no natural way to express that two SSA values (vlen 
tokens) must not be live at the same program point.

On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:> There are some operations that use vl for things other than simple
> masking. To give one example, "speculative" loads (which
silencing
> some exceptions to safely permit vectorization of some loops with
> data-dependent exits, such as strlen) can shrink vl as a side effect.
> I believe this can be handled by modelling all relevant operations
> (including setvl itself) as intrinsics that have side effects or
> read/write inaccessible memory. However, if you want to have the
> "current" vl (or equivalent mask) around as SSA value, you need
to
> "reload" it after any operation that updates vl. That seems like
it
> could get a bit complex if you want to do it efficiently (in the
> limit, it seems equivalent to SSA construction).I think modeling the vector length as state isn't as bad as it may sound 
first. In fact, how about modeling the "hard" vector length as a 
thread_local global variable? That way there is exactly one valid vector 
length value at every point (defined by the value of the thread_local 
global variable of the exact name). There is no need for a "demanded 
vlen" analyses: the global variable yields the value immediately. The 
RISC-V backend can map the global directly to the vlen register. If a 
target does not support a re-configurable vector length (SVE), it is 
safe to run SSA construction during legalization and use explicit 
predication instead. You'd perform SSA construction only at the 
backend/legalization phase.
Vice versa coming from IR targeted at LLVM SVE, you can go the other 
way, run a demanded vlen analysis, and encode it explicitly in the 
program. vlen changes are expensive and should be rare anyway.

; explicit vlen_state modelling in RV could look like this:

@vlen_state=thread_local globaltoken ; this gives AA a fixed point to 
constraint vlen-dependent operations

llvm.vla.setvl(i32 %n)                  ; implicitly writes-only %vlen_state
i32 llvm.vla.getvl()                    ; implicitly reads-only %vlen_state

llvm.vla.fadd.f64(f64, f64)           ; implicitly reads-only %vlen_state
llvm.vla.fdiv.f64(f64, f64)           : .. same

; this implements the "speculative" load mentioned in the quote above 
(writes %vlen_state. I suppose it also reads it first?)
<scalable 1 x f64> llvm.riscv.probe.f64(%ptr)

By relying on memory dependence, this also implies that arithmetic 
operations can be re-ordered freely as long as vlen_state does not 
change between them (SLP, "loop mix (CGO16)", ..).

Regarding function calls, if the callee does not have the 
'inherits_vlen' attribute, the target can use a default value at 
function entry (max width or "undef"). Otherwise, the vector length 
needs to be communicated from caller to callee. However, the 
`vlen_state` variable already achieves that for a first implementation.

Last but not least, thank you all for working on this! I am really 
looking forward to playing around with vla architectures in LLVM.

Regards,

Simon


On 07/02/2018 11:53 AM, Graham Hunter via llvm-dev
wrote:> Hi,
>
> I've updated the RFC slightly based on the discussion within the
thread, reposted below. Let me know if I've missed anything or if more
clarification is needed.
>
> Thanks,
>
> -Graham
>
> ============================================================> Supporting
SIMD instruction sets with variable vector lengths
> ============================================================>
> In this RFC we propose extending LLVM IR to support code-generation for
variable
> length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
> approach is backwards compatible and should be as non-intrusive as
possible; the
> only change needed in other backends is how size is queried on vector
types, and
> it only requires a change in which function is called. We have created a
set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7
of
> this rfc) can be found on Phabricator and are intended to illustrate the
scope
> of changes required by the general approach described in this RFC.
>
> =========> Background
> =========>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of
the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in
which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time
constants.
>
> Documentation for SVE can be found at
>
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> =======> Contents
> =======>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors
that
>     have a length which is a runtime-determined multiple of a known base
length.
>
> 2. Size Queries - how to reason about the size of types for which the size
isn't
>     fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in
address
>     calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a
runtime-determined
>     number of elements.
>
> 5. An explanation of splitting/concatentating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE
instructions
>     for a loop that has already been vectorized using the extensions
described
>     in this RFC.
>
> =======> 1. Types
> =======>
> To represent a vector of unknown length a boolean `Scalable` property has
been
> added to the `VectorType` class, which indicates that the number of
elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
field.
> Most code that deals with vectors doesn't need to know the exact
length, but
> does need to know relative lengths -- e.g. get a vector with the same
number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we
introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and
avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be
used
> to obtain the (potentially scalable) number of elements. Overloaded
division and
> multiplication operators allow an ElementCount instance to be used in much
the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason
about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for
SVE,
> but it is possible. The model of scalable vectors presented in this RFC
assumes
> that the multiple will be constant within a function but not necessarily
across
> functions. As suggested in the recent RISC-V rfc, a new function attribute
to
> inherit the multiple across function calls will allow for function calls
with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number
of
> elements, and the string literal `scalable` indicates that the total number
of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary
choice
> for indicating that the vector is scalable, and could be substituted by
another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change
in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of
elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they
are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
>    elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same
number of
>    bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to
the
> type record. If the field is not present the type will default to a
fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes
that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable
of
> providing support for C-level intrinsics instead of being used and
recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that
creates
> new VectorTypes, much of that code doesn't need to care about whether
the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to
represent
> the number of elements.
>
> ==============> 2. Size Queries
> ==============>
> This is a proposal for how to deal with querying the size of scalable types
for
> analysis of IR. While it has not been implemented in full, the general
approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a
single
> integer constant. For scalable types a second integer is needed to indicate
the
> number of bytes/bits which need to be scaled by the runtime multiple to
obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does
today,
> except that it will no longer return a size for vector types (it will
return 0,
> as it does for other derived types). The majority of calls to this function
are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added,
which
> returns a pair of integers (one to indicate unscaled bits, the other for
bits
> that need to be scaled by the runtime multiple). For backends that do not
need
> to deal with scalable types the existing methods will suffice, but a
debug-only
> assert will be added to them to ensure they aren't used on scalable
types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X
and
> Y are non-zero integers and the form is of { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or
across
>                           functions that inherit vector length. Cannot be
>                           compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract
common
>                               terms and try the above comparisons; it
>                               may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled
and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibits mixing fixed-size types into sizeless struct.
>
> I don't know if we need a 'maybe' or 'unknown' result
for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute and raise an error there
instead
> of requiring passing the Function a type size is from for each comparison.
If
> there's a strong preference for moving the check to the size comparison
function
> let me know; I will be starting work on patches for this later in the year
if
> there's no major problems with the idea.
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe'
vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns
true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> =======================================> 3. Representing Vector Length
at Runtime
> =======================================>
> With a scalable vector type defined, we now need a way to represent the
runtime
> length in IR in order to generate addresses for consecutive vectors in
memory
> and determine how many elements have been processed in an iteration of a
loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>    %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>    ;; <loop body>
>    ;; Increment induction var
>    %index.next = add i64 %index, 4
>    ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>    %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>    ;; <loop body>
>    ;; Increment induction var
>    %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>    %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>    ;; <check and branch>
> ``
> ==========================> 4. Generating Vector Values
> ==========================> For constant vector values, we cannot
specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length
vectors.
>
> For constants consisting of a sequence of values, an experimental
`stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>    ;; Splat a value
>    %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>    %splat = shufflevector <4 x i32> %insert, <4 x i32> undef,
<4 x i32> zeroinitializer
>    ;; Add a constant sequence
>    %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>    ;; Splat a value
>    %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32
0
>    %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4
x i32> undef, <scalable 4 x i32> zeroinitializer
>    ;; Splat offset + stride (the same in this case)
>    %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>    %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable
4 x i32> undef, <scalable 4 x i32> zeroinitializer
>    ;; Create sequence for scalable vector
>    %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>    %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>    %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>    ;; Add the runtime-generated sequence
>    %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ==========================================> 5. Splitting and Combining
Scalable Vectors
> ==========================================>
> Splitting and combining scalable vectors in IR is done in the same manner
as
> for fixed-length vectors, but with a non-constant mask for the
shufflevector.
>
> The following is an example of splitting a <scalable 4 x double> into
two
> separate <scalable 2 x double> values.
>
> ``
>    %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>    ;; Stepvector generates the element ids for first subvector
>    %sv1 = call <scalable 2 x i64>
@llvm.experimental.vector.stepvector.nxv2i64()
>    ;; Add vscale * 2 to get the starting element for the second subvector
>    %ec = mul i64 %vscale64, 2
>    %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>    %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x
i64> undef, <scalable 2 x i32> zeroinitializer
>    %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64
>    ;; Perform the extracts
>    %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv1
>    %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv2
> ``
>
> =================> 6. Code Generation
> =================>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64
backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit
was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not
be
> necessary in future, generating an all-true 'ptrue' value was
necessary to
> convert a predicated instruction into an unpredicated one.
>
> =========> 7. Example
> =========>
> The following example shows a simple C loop which assigns the array index
to
> the array elements matching that index. The IR shows how vscale and
stepvector
> are used to create the needed values and to advance the index variable in
the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>    for (int i = 0; i < count; ++i)
>      a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>    ;; Other setup
>    ;; Stepvector used to create initial identity vector
>    %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>    br vector.body
>
> vector.body
>    %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>    %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>             ;; stepvector used for index identity on entry to loop body ;;
>    %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                       [ %stepvector, %vector.body.preheader
]
>    %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>    %vscale32 = trunc i64 %vscale64 to i32
>    %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>             ;; vscale splat used to increment identity vector ;;
>    %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
>    %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>    %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>    %2 = getelementptr inbounds i32, i32* %a, i64 %0
>    %3 = bitcast i32* %2 to <scalable 4 x i32>*
>    store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4
>
>             ;; vscale used to increment loop index
>    %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>    %4 = icmp eq i64 %index.next, %n.vec
>    br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =========> 8. Patches
> =========>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180702/fe841c6c/attachment.html>

Robin Kruppe via llvm-dev

2018-Jul-07 13:46 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Graham,

thanks again. The changes and additions all make sense to me, I just
have one minor comment about shufflevector.

On 2 July 2018 at 11:53, Graham Hunter <Graham.Hunter at arm.com>
wrote:> Hi,
>
> I've updated the RFC slightly based on the discussion within the
thread, reposted below. Let me know if I've missed anything or if more
clarification is needed.
>
> Thanks,
>
> -Graham
>
> ============================================================> Supporting
SIMD instruction sets with variable vector lengths
> ============================================================>
> In this RFC we propose extending LLVM IR to support code-generation for
variable
> length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
> approach is backwards compatible and should be as non-intrusive as
possible; the
> only change needed in other backends is how size is queried on vector
types, and
> it only requires a change in which function is called. We have created a
set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7
of
> this rfc) can be found on Phabricator and are intended to illustrate the
scope
> of changes required by the general approach described in this RFC.
>
> =========> Background
> =========>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of
the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in
which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time
constants.
>
> Documentation for SVE can be found at
>
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> =======> Contents
> =======>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors
that
>    have a length which is a runtime-determined multiple of a known base
length.
>
> 2. Size Queries - how to reason about the size of types for which the size
isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in
address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a
runtime-determined
>    number of elements.
>
> 5. An explanation of splitting/concatentating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE
instructions
>    for a loop that has already been vectorized using the extensions
described
>    in this RFC.
>
> =======> 1. Types
> =======>
> To represent a vector of unknown length a boolean `Scalable` property has
been
> added to the `VectorType` class, which indicates that the number of
elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
field.
> Most code that deals with vectors doesn't need to know the exact
length, but
> does need to know relative lengths -- e.g. get a vector with the same
number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we
introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and
avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be
used
> to obtain the (potentially scalable) number of elements. Overloaded
division and
> multiplication operators allow an ElementCount instance to be used in much
the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason
about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for
SVE,
> but it is possible. The model of scalable vectors presented in this RFC
assumes
> that the multiple will be constant within a function but not necessarily
across
> functions. As suggested in the recent RISC-V rfc, a new function attribute
to
> inherit the multiple across function calls will allow for function calls
with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number
of
> elements, and the string literal `scalable` indicates that the total number
of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary
choice
> for indicating that the vector is scalable, and could be substituted by
another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change
in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of
elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they
are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
>   elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same
number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to
the
> type record. If the field is not present the type will default to a
fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes
that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable
of
> providing support for C-level intrinsics instead of being used and
recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that
creates
> new VectorTypes, much of that code doesn't need to care about whether
the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to
represent
> the number of elements.
>
> ==============> 2. Size Queries
> ==============>
> This is a proposal for how to deal with querying the size of scalable types
for
> analysis of IR. While it has not been implemented in full, the general
approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a
single
> integer constant. For scalable types a second integer is needed to indicate
the
> number of bytes/bits which need to be scaled by the runtime multiple to
obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does
today,
> except that it will no longer return a size for vector types (it will
return 0,
> as it does for other derived types). The majority of calls to this function
are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added,
which
> returns a pair of integers (one to indicate unscaled bits, the other for
bits
> that need to be scaled by the runtime multiple). For backends that do not
need
> to deal with scalable types the existing methods will suffice, but a
debug-only
> assert will be added to them to ensure they aren't used on scalable
types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X
and
> Y are non-zero integers and the form is of { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or
across
>                          functions that inherit vector length. Cannot be
>                          compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract
common
>                              terms and try the above comparisons; it
>                              may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled
and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibits mixing fixed-size types into sizeless struct.
>
> I don't know if we need a 'maybe' or 'unknown' result
for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute and raise an error there
instead
> of requiring passing the Function a type size is from for each comparison.
If
> there's a strong preference for moving the check to the size comparison
function
> let me know; I will be starting work on patches for this later in the year
if
> there's no major problems with the idea.
Seems reasonable.
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe'
vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns
true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> =======================================> 3. Representing Vector Length
at Runtime
> =======================================>
> With a scalable vector type defined, we now need a way to represent the
runtime
> length in IR in order to generate addresses for consecutive vectors in
memory
> and determine how many elements have been processed in an iteration of a
loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ==========================> 4. Generating Vector Values
> ==========================> For constant vector values, we cannot
specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length
vectors.
>
> For constants consisting of a sequence of values, an experimental
`stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef,
<4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable
4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ==========================================> 5. Splitting and Combining
Scalable Vectors
> ==========================================>
> Splitting and combining scalable vectors in IR is done in the same manner
as
> for fixed-length vectors, but with a non-constant mask for the
shufflevector.
It makes sense to have runtime-computed shuffle masks for some
architectures, especially those with runtime-variable vector lengths,
but lifting the current restriction that the shufflevector mask is a
constant affects all code that inspects the indices. There's lot such
code and as far as I've seen a fair amount of that code crucially
depends on the mask being constant. I'm not opposed to lifting the
restriction, but I want to call attention to it and double-check
everyone's okay with it because it seems like a big step and, unlike
other IR changes in this RFC, it isn't really necessary (we could also
use an intrinsic for these shuffles).
> The following is an example of splitting a <scalable 4 x double> into
two
> separate <scalable 2 x double> values.
>
> ``
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   ;; Stepvector generates the element ids for first subvector
>   %sv1 = call <scalable 2 x i64>
@llvm.experimental.vector.stepvector.nxv2i64()
>   ;; Add vscale * 2 to get the starting element for the second subvector
>   %ec = mul i64 %vscale64, 2
>   %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>   %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x
i64> undef, <scalable 2 x i32> zeroinitializer
>   %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64
>   ;; Perform the extracts
>   %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv1
>   %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv2
> ``
>
> =================> 6. Code Generation
> =================>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64
backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit
was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not
be
> necessary in future, generating an all-true 'ptrue' value was
necessary to
> convert a predicated instruction into an unpredicated one.
>
> =========> 7. Example
> =========>
> The following example shows a simple C loop which assigns the array index
to
> the array elements matching that index. The IR shows how vscale and
stepvector
> are used to create the needed values and to advance the index variable in
the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   br vector.body
>
> vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader
]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =========> 8. Patches
> =========>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>

Graham Hunter via llvm-dev

2018-Jul-30 09:23 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi,

Are there any objections to going ahead with this? If not, we'll try to get
the patches reviewed and committed after the 7.0 branch occurs.

-Graham
> On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com>
wrote:
> 
> Hi,
> 
> I've updated the RFC slightly based on the discussion within the
thread, reposted below. Let me know if I've missed anything or if more
clarification is needed.
> 
> Thanks,
> 
> -Graham
> 
> ============================================================> Supporting
SIMD instruction sets with variable vector lengths
> ============================================================> 
> In this RFC we propose extending LLVM IR to support code-generation for
variable
> length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
> approach is backwards compatible and should be as non-intrusive as
possible; the
> only change needed in other backends is how size is queried on vector
types, and
> it only requires a change in which function is called. We have created a
set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7
of
> this rfc) can be found on Phabricator and are intended to illustrate the
scope
> of changes required by the general approach described in this RFC.
> 
> =========> Background
> =========> 
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of
the
> increased compute power without recompilation.
> 
> As the vector length is no longer a compile-time known value, the way in
which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time
constants.
> 
> Documentation for SVE can be found at
>
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
> 
> =======> Contents
> =======> 
> The rest of this RFC covers the following topics:
> 
> 1. Types -- a proposal to extend VectorType to be able to represent vectors
that
>   have a length which is a runtime-determined multiple of a known base
length.
> 
> 2. Size Queries - how to reason about the size of types for which the size
isn't
>   fully known at compile time.
> 
> 3. Representing the runtime multiple of vector length in IR for use in
address
>   calculations and induction variable comparisons.
> 
> 4. Generating 'constant' values in IR for vectors with a
runtime-determined
>   number of elements.
> 
> 5. An explanation of splitting/concatentating scalable vectors.
> 
> 6. A brief note on code generation of these new operations for AArch64.
> 
> 7. An example of C code and matching IR using the proposed extensions.
> 
> 8. A list of patches demonstrating the changes required to emit SVE
instructions
>   for a loop that has already been vectorized using the extensions
described
>   in this RFC.
> 
> =======> 1. Types
> =======> 
> To represent a vector of unknown length a boolean `Scalable` property has
been
> added to the `VectorType` class, which indicates that the number of
elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
field.
> Most code that deals with vectors doesn't need to know the exact
length, but
> does need to know relative lengths -- e.g. get a vector with the same
number of
> elements but a different element type, or with half or double the number of
> elements.
> 
> In order to allow code to transparently support scalable vectors, we
introduce
> an `ElementCount` class with two members:
> 
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
> 
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
> 
> The intent for code working with vectors is to use convenience methods and
avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be
used
> to obtain the (potentially scalable) number of elements. Overloaded
division and
> multiplication operators allow an ElementCount instance to be used in much
the
> same manner as an integer for most cases.
> 
> This mixture of compile-time and runtime quantities allow us to reason
about the
> relationship between different scalable vector types without knowing their
> exact length.
> 
> The runtime multiple is not expected to change during program execution for
SVE,
> but it is possible. The model of scalable vectors presented in this RFC
assumes
> that the multiple will be constant within a function but not necessarily
across
> functions. As suggested in the recent RISC-V rfc, a new function attribute
to
> inherit the multiple across function calls will allow for function calls
with
> vector arguments/return values and inlining/outlining optimizations.
> 
> IR Textual Form
> ---------------
> 
> The textual form for a scalable vector is:
> 
> ``<scalable <n> x <type>>``
> 
> where `type` is the scalar type of each element, `n` is the minimum number
of
> elements, and the string literal `scalable` indicates that the total number
of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary
choice
> for indicating that the vector is scalable, and could be substituted by
another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change
in
> the format for existing vectors.
> 
> Scalable vectors with the same `Min` value have the same number of
elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they
are
> used within the same function):
> 
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
>  elements.
> 
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same
number of
>  bytes.
> 
> IR Bitcode Form
> ---------------
> 
> To serialize scalable vectors to bitcode, a new boolean field is added to
the
> type record. If the field is not present the type will default to a
fixed-length
> vector type, preserving backwards compatibility.
> 
> Alternatives Considered
> -----------------------
> 
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
> 
> A dedicated target type would either need to extend all existing passes
that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
> 
> This hasn't been done for the x86_mmx type, and so it is only capable
of
> providing support for C-level intrinsics instead of being used and
recognized by
> passes inside llvm.
> 
> Although our current solution will need to change some of the code that
creates
> new VectorTypes, much of that code doesn't need to care about whether
the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to
represent
> the number of elements.
> 
> ==============> 2. Size Queries
> ==============> 
> This is a proposal for how to deal with querying the size of scalable types
for
> analysis of IR. While it has not been implemented in full, the general
approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
> 
> For current IR types that have a known size, all query functions return a
single
> integer constant. For scalable types a second integer is needed to indicate
the
> number of bytes/bits which need to be scaled by the runtime multiple to
obtain
> the actual length.
> 
> For primitive types, `getPrimitiveSizeInBits()` will function as it does
today,
> except that it will no longer return a size for vector types (it will
return 0,
> as it does for other derived types). The majority of calls to this function
are
> already for scalar rather than vector types.
> 
> For derived types, a function `getScalableSizePairInBits()` will be added,
which
> returns a pair of integers (one to indicate unscaled bits, the other for
bits
> that need to be scaled by the runtime multiple). For backends that do not
need
> to deal with scalable types the existing methods will suffice, but a
debug-only
> assert will be added to them to ensure they aren't used on scalable
types.
> 
> Similar functionality will be added to DataLayout.
> 
> Comparisons between sizes will use the following methods, assuming that X
and
> Y are non-zero integers and the form is of { unscaled, scaled }.
> 
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
> 
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or
across
>                         functions that inherit vector length. Cannot be
>                         compared across non-inheriting functions.
> 
> { X, 0 } > { 0, Y }: Cannot return true.
> 
> { X, 0 } = { 0, Y }: Cannot return true.
> 
> { X, 0 } < { 0, Y }: Can return true.
> 
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract
common
>                             terms and try the above comparisons; it
>                             may not be possible to get a good answer.
> 
> It's worth noting that we don't expect the last case (mixed scaled
and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibits mixing fixed-size types into sizeless struct.
> 
> I don't know if we need a 'maybe' or 'unknown' result
for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
> 
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute and raise an error there
instead
> of requiring passing the Function a type size is from for each comparison.
If
> there's a strong preference for moving the check to the size comparison
function
> let me know; I will be starting work on patches for this later in the year
if
> there's no major problems with the idea.
> 
> Future Work
> -----------
> 
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
> 
> However, SVE's predication will mean that a dynamic 'safe'
vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
> 
> Alternatives Considered
> -----------------------
> 
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns
true
> and make use of the size when calculating offsets.
> 
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
> 
> =======================================> 3. Representing Vector Length
at Runtime
> =======================================> 
> With a scalable vector type defined, we now need a way to represent the
runtime
> length in IR in order to generate addresses for consecutive vectors in
memory
> and determine how many elements have been processed in an iteration of a
loop.
> 
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
vector.
> 
> Fixed-Length Code
> -----------------
> 
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %index.next = add i64 %index, 4
>  ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
> 
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>  %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>  ;; <loop body>
>  ;; Increment induction var
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>  ;; <check and branch>
> ``
> ==========================> 4. Generating Vector Values
> ==========================> For constant vector values, we cannot
specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length
vectors.
> 
> For constants consisting of a sequence of values, an experimental
`stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
> 
> Fixed-Length Code
> -----------------
> ``
>  ;; Splat a value
>  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef,
<4 x i32> zeroinitializer
>  ;; Add a constant sequence
>  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>  ;; Splat a value
>  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Splat offset + stride (the same in this case)
>  %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4
x i32> undef, <scalable 4 x i32> zeroinitializer
>  ;; Create sequence for scalable vector
>  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>  ;; Add the runtime-generated sequence
>  %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
> 
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
> 
> ==========================================> 5. Splitting and Combining
Scalable Vectors
> ==========================================> 
> Splitting and combining scalable vectors in IR is done in the same manner
as
> for fixed-length vectors, but with a non-constant mask for the
shufflevector.
> 
> The following is an example of splitting a <scalable 4 x double> into
two
> separate <scalable 2 x double> values.
> 
> ``
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  ;; Stepvector generates the element ids for first subvector
>  %sv1 = call <scalable 2 x i64>
@llvm.experimental.vector.stepvector.nxv2i64()
>  ;; Add vscale * 2 to get the starting element for the second subvector
>  %ec = mul i64 %vscale64, 2
>  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>  %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x
i64> undef, <scalable 2 x i32> zeroinitializer
>  %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64
>  ;; Perform the extracts
>  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv1
>  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x
double> undef, <scalable 2 x i64> %sv2
> ``
> 
> =================> 6. Code Generation
> =================> 
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
> 
> All three intrinsics are custom lowered and legalized in the AArch64
backend.
> 
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
> 
> GlobalISel
> ----------
> 
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit
was
> added to the raw_data representation for vectors and vectors of pointers.
> 
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not
be
> necessary in future, generating an all-true 'ptrue' value was
necessary to
> convert a predicated instruction into an unpredicated one.
> 
> =========> 7. Example
> =========> 
> The following example shows a simple C loop which assigns the array index
to
> the array elements matching that index. The IR shows how vscale and
stepvector
> are used to create the needed values and to advance the index variable in
the
> loop.
> 
> C Code
> ------
> 
> ``
> void IdentityArrayInit(int *a, int count) {
>  for (int i = 0; i < count; ++i)
>    a[i] = i;
> }
> ``
> 
> Scalable IR Vector Body
> -----------------------
> 
> ``
> vector.body.preheader:
>  ;; Other setup
>  ;; Stepvector used to create initial identity vector
>  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>  br vector.body
> 
> vector.body
>  %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
> 
>           ;; stepvector used for index identity on entry to loop body ;;
>  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                     [ %stepvector, %vector.body.preheader ]
>  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>  %vscale32 = trunc i64 %vscale64 to i32
>  %1 = add i64 %0, mul (i64 %vscale64, i64 4)
> 
>           ;; vscale splat used to increment identity vector ;;
>  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
>  %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>  %3 = bitcast i32* %2 to <scalable 4 x i32>*
>  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4
> 
>           ;; vscale used to increment loop index
>  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>  %4 = icmp eq i64 %index.next, %n.vec
>  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
> 
> =========> 8. Patches
> =========> 
> List of patches:
> 
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>

JinGu Kang via llvm-dev

2019-May-24 18:27 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Graham,

I am working on a custom target and it is considering scalable vector type
representation in programming language. While I am collecting the information
about it, I have met your RFC. I have a question. I think the one of fundamental
issues is that we do not know the memory layout of the type at compile time. I
am not sure whether the RFC covers this issue or not. Conservatively, I imagined
the memory layout of biggest type which the scalable vector type can support. I
could miss some discussions about it. If I missed something, please let me know.

Thanks,
JinGu Kang

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Graham
Hunter via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 05 June 2018 14:15
To: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin;
Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe;
llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker
Cc: nd
Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable
vector lengths

Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the
end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC
later.

Feedback and questions welcome.

-Graham

============================================================Supporting SIMD
instruction sets with variable vector lengths
============================================================
In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

=========Background
=========
*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

=======Contents
=======
The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size
isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a
runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

=======1. Types
=======
To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
  elements.

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same
number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the
types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

==============2. Size Queries
==============
This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes is also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector
length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.

=======================================3. Representing Vector Length at Runtime
=======================================
With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  ;; <check and branch>
``
==========================4. Generating Vector Values
==========================For constant vector values, we cannot specify all the
elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the
new
start can be added, and changing the step requires multiplying by a splat.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x
i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

=================5. Code Generation
=================
IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary
to
convert a predicated instruction into an unpredicated one.

=========6. Example
=========
The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
  br vector.body

vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
  %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align
4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

=========7. Patches
=========
List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190524/22671446/attachment.html>

Bruce Hoult via llvm-dev

2019-May-24 19:12 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

In the RISC-V V extension, there is no upper limit to the size vector
registers can be in a future CPU. (Formally, the upper limit is at
least 2^31 bytes)

Generic code can enquire the size, dynamically allocate space, and
transparently save and restore the contents of a vector register or
registers.

On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev
<llvm-dev at lists.llvm.org> wrote:>
> Hi Graham,
>
> I am working on a custom target and it is considering scalable vector type
representation in programming language. While I am collecting the information
about it, I have met your RFC. I have a question. I think the one of fundamental
issues is that we do not know the memory layout of the type at compile time. I
am not sure whether the RFC covers this issue or not. Conservatively, I imagined
the memory layout of biggest type which the scalable vector type can support. I
could miss some discussions about it. If I missed something, please let me know.
>
> Thanks,
> JinGu Kang
>
> ________________________________
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: 05 June 2018 14:15
> To: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin;
Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe;
llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker
> Cc: nd
> Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with
variable vector lengths
>
> Hi,
>
> Now that Sander has committed enough MC support for SVE, here's an
updated
> RFC for variable length vector support with a set of 14 patches (listed at
the end)
> to demonstrate code generation for SVE using the extensions proposed in the
RFC.
>
> I have some ideas about how to support RISC-V's upcoming extension
alongside
> SVE; I'll send an email with some additional comments on Robin's
RFC later.
>
> Feedback and questions welcome.
>
> -Graham
>
> ============================================================> Supporting
SIMD instruction sets with variable vector lengths
> ============================================================>
> In this RFC we propose extending LLVM IR to support code-generation for
variable
> length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
> approach is backwards compatible and should be as non-intrusive as
possible; the
> only change needed in other backends is how size is queried on vector
types, and
> it only requires a change in which function is called. We have created a
set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7
of
> this rfc) can be found on Phabricator and are intended to illustrate the
scope
> of changes required by the general approach described in this RFC.
>
> =========> Background
> =========>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of
the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in
which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time
constants.
>
> Documentation for SVE can be found at
>
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> =======> Contents
> =======>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors
that
>    have a length which is a runtime-determined multiple of a known base
length.
>
> 2. Size Queries - how to reason about the size of types for which the size
isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in
address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a
runtime-determined
>    number of elements.
>
> 5. A brief note on code generation of these new operations for AArch64.
>
> 6. An example of C code and matching IR using the proposed extensions.
>
> 7. A list of patches demonstrating the changes required to emit SVE
instructions
>    for a loop that has already been vectorized using the extensions
described
>    in this RFC.
>
> =======> 1. Types
> =======>
> To represent a vector of unknown length a boolean `Scalable` property has
been
> added to the `VectorType` class, which indicates that the number of
elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
field.
> Most code that deals with vectors doesn't need to know the exact
length, but
> does need to know relative lengths -- e.g. get a vector with the same
number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we
introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and
avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be
used
> to obtain the (potentially scalable) number of elements. Overloaded
division and
> multiplication operators allow an ElementCount instance to be used in much
the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason
about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for
SVE,
> but it is possible. The model of scalable vectors presented in this RFC
assumes
> that the multiple will be constant within a function but not necessarily
across
> functions. As suggested in the recent RISC-V rfc, a new function attribute
to
> inherit the multiple across function calls will allow for function calls
with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number
of
> elements, and the string literal `scalable` indicates that the total number
of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary
choice
> for indicating that the vector is scalable, and could be substituted by
another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change
in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of
elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they
are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
>   elements.
>
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the
same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to
the
> type record. If the field is not present the type will default to a
fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes
that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable
of
> providing support for C-level intrinsics instead of being used and
recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that
creates
> new VectorTypes, much of that code doesn't need to care about whether
the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to
represent
> the number of elements.
>
> ==============> 2. Size Queries
> ==============>
> This is a proposal for how to deal with querying the size of scalable
types.
> While it has not been implemented in full, the general approach works well
> for calculating offsets into structures with scalable types in a modified
> version of ComputeValueVTs in our downstream compiler.
>
> Current IR types that have a known size all return a single integer
constant.
> For scalable types a second integer is needed to indicate the number of
bytes
> which need to be scaled by the runtime multiple to obtain the actual
length.
>
> For primitive types, getPrimitiveSizeInBits will function as it does today,
> except that it will no longer return a size for vector types (it will
return 0,
> as it does for other derived types). The majority of calls to this function
are
> already for scalar rather than vector types.
>
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not
need to
> deal with scalable types, another function (getFixedSizeExpressionInBits)
that
> only returns unscaled bits will be provided, with a debug assert that the
type
> isn't scalable.
>
> Similar functionality will be added to DataLayout.
>
> Comparing two of these sizes together is straightforward if only unscaled
sizes
> are used. Comparisons between scaled sizes is also simple when comparing
sizes
> within a function (or across functions with the inherit flag mentioned in
the
> changes to the type), but cannot be compared otherwise. If a mix is
present,
> then any number of unscaled bits will not be considered to have a greater
size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled
bits
> (since the runtime multiple is at least one).
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe'
vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns
true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
>
> =======================================> 3. Representing Vector Length
at Runtime
> =======================================>
> With a scalable vector type defined, we now need a way to represent the
runtime
> length in IR in order to generate addresses for consecutive vectors in
memory
> and determine how many elements have been processed in an iteration of a
loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ==========================> 4. Generating Vector Values
> ==========================> For constant vector values, we cannot
specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length
vectors.
>
> For constants consisting of a sequence of values, an experimental
`stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef,
<4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable
4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> =================> 5. Code Generation
> =================>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64
backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit
was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not
be
> necessary in future, generating an all-true 'ptrue' value was
necessary to
> convert a predicated instruction into an unpredicated one.
>
> =========> 6. Example
> =========>
> The following example shows a simple C loop which assigns the array index
to
> the array elements matching that index. The IR shows how vscale and
stepvector
> are used to create the needed values and to advance the index variable in
the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   br vector.body
>
> vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader
]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =========> 7. Patches
> =========>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Seemingly Similar Threads

Search for more possibly parallel threads

llvm dev - Jul 2018 - [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Seemingly Similar Threads