thr3ads.net - llvm dev - [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths [May 2019]

If this information is useful, please help other people find it:
Share via:

Bruce Hoult via llvm-dev

2019-May-24 19:12 UTC

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

In the RISC-V V extension, there is no upper limit to the size vector
registers can be in a future CPU. (Formally, the upper limit is at
least 2^31 bytes)

Generic code can enquire the size, dynamically allocate space, and
transparently save and restore the contents of a vector register or
registers.

On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev
<llvm-dev at lists.llvm.org> wrote:>
> Hi Graham,
>
> I am working on a custom target and it is considering scalable vector type
representation in programming language. While I am collecting the information
about it, I have met your RFC. I have a question. I think the one of fundamental
issues is that we do not know the memory layout of the type at compile time. I
am not sure whether the RFC covers this issue or not. Conservatively, I imagined
the memory layout of biggest type which the scalable vector type can support. I
could miss some discussions about it. If I missed something, please let me know.
>
> Thanks,
> JinGu Kang
>
> ________________________________
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: 05 June 2018 14:15
> To: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin;
Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe;
llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker
> Cc: nd
> Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with
variable vector lengths
>
> Hi,
>
> Now that Sander has committed enough MC support for SVE, here's an
updated
> RFC for variable length vector support with a set of 14 patches (listed at
the end)
> to demonstrate code generation for SVE using the extensions proposed in the
RFC.
>
> I have some ideas about how to support RISC-V's upcoming extension
alongside
> SVE; I'll send an email with some additional comments on Robin's
RFC later.
>
> Feedback and questions welcome.
>
> -Graham
>
> ============================================================> Supporting
SIMD instruction sets with variable vector lengths
> ============================================================>
> In this RFC we propose extending LLVM IR to support code-generation for
variable
> length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
> approach is backwards compatible and should be as non-intrusive as
possible; the
> only change needed in other backends is how size is queried on vector
types, and
> it only requires a change in which function is called. We have created a
set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7
of
> this rfc) can be found on Phabricator and are intended to illustrate the
scope
> of changes required by the general approach described in this RFC.
>
> =========> Background
> =========>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of
the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in
which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time
constants.
>
> Documentation for SVE can be found at
>
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> =======> Contents
> =======>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors
that
>    have a length which is a runtime-determined multiple of a known base
length.
>
> 2. Size Queries - how to reason about the size of types for which the size
isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in
address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a
runtime-determined
>    number of elements.
>
> 5. A brief note on code generation of these new operations for AArch64.
>
> 6. An example of C code and matching IR using the proposed extensions.
>
> 7. A list of patches demonstrating the changes required to emit SVE
instructions
>    for a loop that has already been vectorized using the extensions
described
>    in this RFC.
>
> =======> 1. Types
> =======>
> To represent a vector of unknown length a boolean `Scalable` property has
been
> added to the `VectorType` class, which indicates that the number of
elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
field.
> Most code that deals with vectors doesn't need to know the exact
length, but
> does need to know relative lengths -- e.g. get a vector with the same
number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we
introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and
avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be
used
> to obtain the (potentially scalable) number of elements. Overloaded
division and
> multiplication operators allow an ElementCount instance to be used in much
the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason
about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for
SVE,
> but it is possible. The model of scalable vectors presented in this RFC
assumes
> that the multiple will be constant within a function but not necessarily
across
> functions. As suggested in the recent RISC-V rfc, a new function attribute
to
> inherit the multiple across function calls will allow for function calls
with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number
of
> elements, and the string literal `scalable` indicates that the total number
of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary
choice
> for indicating that the vector is scalable, and could be substituted by
another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change
in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of
elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they
are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
>   elements.
>
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the
same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to
the
> type record. If the field is not present the type will default to a
fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes
that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable
of
> providing support for C-level intrinsics instead of being used and
recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that
creates
> new VectorTypes, much of that code doesn't need to care about whether
the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to
represent
> the number of elements.
>
> ==============> 2. Size Queries
> ==============>
> This is a proposal for how to deal with querying the size of scalable
types.
> While it has not been implemented in full, the general approach works well
> for calculating offsets into structures with scalable types in a modified
> version of ComputeValueVTs in our downstream compiler.
>
> Current IR types that have a known size all return a single integer
constant.
> For scalable types a second integer is needed to indicate the number of
bytes
> which need to be scaled by the runtime multiple to obtain the actual
length.
>
> For primitive types, getPrimitiveSizeInBits will function as it does today,
> except that it will no longer return a size for vector types (it will
return 0,
> as it does for other derived types). The majority of calls to this function
are
> already for scalar rather than vector types.
>
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not
need to
> deal with scalable types, another function (getFixedSizeExpressionInBits)
that
> only returns unscaled bits will be provided, with a debug assert that the
type
> isn't scalable.
>
> Similar functionality will be added to DataLayout.
>
> Comparing two of these sizes together is straightforward if only unscaled
sizes
> are used. Comparisons between scaled sizes is also simple when comparing
sizes
> within a function (or across functions with the inherit flag mentioned in
the
> changes to the type), but cannot be compared otherwise. If a mix is
present,
> then any number of unscaled bits will not be considered to have a greater
size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled
bits
> (since the runtime multiple is at least one).
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe'
vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns
true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
>
> =======================================> 3. Representing Vector Length
at Runtime
> =======================================>
> With a scalable vector type defined, we now need a way to represent the
runtime
> length in IR in order to generate addresses for consecutive vectors in
memory
> and determine how many elements have been processed in an iteration of a
loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ==========================> 4. Generating Vector Values
> ==========================> For constant vector values, we cannot
specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length
vectors.
>
> For constants consisting of a sequence of values, an experimental
`stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef,
<4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable
4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> =================> 5. Code Generation
> =================>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64
backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit
was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not
be
> necessary in future, generating an all-true 'ptrue' value was
necessary to
> convert a predicated instruction into an unpredicated one.
>
> =========> 6. Example
> =========>
> The following example shows a simple C loop which assigns the array index
to
> the array elements matching that index. The IR shows how vscale and
stepvector
> are used to create the needed values and to advance the index variable in
the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   br vector.body
>
> vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader
]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =========> 7. Patches
> =========>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

JinGu Kang via llvm-dev

2019-May-24 20:47 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi Bruce,

Thanks for your comment.
> Generic code can enquire the size, dynamically allocate space, andtransparently save and restore the contents of a vector register or
registers.

I am not sure what the Generic code is. It seems it uses similar way as the
implementation of variable length array. If so, I think it could be one way to
support it.

Thanks,
JinGu Kang

________________________________
From: Bruce Hoult <brucehoult at sifive.com>
Sent: 24 May 2019 20:12
To: JinGu Kang
Cc: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin;
Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe;
llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker;
Graham Hunter; nd
Subject: Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with
variable vector lengths

In the RISC-V V extension, there is no upper limit to the size vector
registers can be in a future CPU. (Formally, the upper limit is at
least 2^31 bytes)

Generic code can enquire the size, dynamically allocate space, and
transparently save and restore the contents of a vector register or
registers.

On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev
<llvm-dev at lists.llvm.org> wrote:>
> Hi Graham,
>
> I am working on a custom target and it is considering scalable vector type
representation in programming language. While I am collecting the information
about it, I have met your RFC. I have a question. I think the one of fundamental
issues is that we do not know the memory layout of the type at compile time. I
am not sure whether the RFC covers this issue or not. Conservatively, I imagined
the memory layout of biggest type which the scalable vector type can support. I
could miss some discussions about it. If I missed something, please let me know.
>
> Thanks,
> JinGu Kang
>
> ________________________________
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: 05 June 2018 14:15
> To: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin;
Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe;
llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker
> Cc: nd
> Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with
variable vector lengths
>
> Hi,
>
> Now that Sander has committed enough MC support for SVE, here's an
updated
> RFC for variable length vector support with a set of 14 patches (listed at
the end)
> to demonstrate code generation for SVE using the extensions proposed in the
RFC.
>
> I have some ideas about how to support RISC-V's upcoming extension
alongside
> SVE; I'll send an email with some additional comments on Robin's
RFC later.
>
> Feedback and questions welcome.
>
> -Graham
>
> ============================================================> Supporting
SIMD instruction sets with variable vector lengths
> ============================================================>
> In this RFC we propose extending LLVM IR to support code-generation for
variable
> length vector architectures like Arm's SVE or RISC-V's 'V'
extension. Our
> approach is backwards compatible and should be as non-intrusive as
possible; the
> only change needed in other backends is how size is queried on vector
types, and
> it only requires a change in which function is called. We have created a
set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7
of
> this rfc) can be found on Phabricator and are intended to illustrate the
scope
> of changes required by the general approach described in this RFC.
>
> =========> Background
> =========>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of
the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in
which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time
constants.
>
> Documentation for SVE can be found at
>
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> =======> Contents
> =======>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors
that
>    have a length which is a runtime-determined multiple of a known base
length.
>
> 2. Size Queries - how to reason about the size of types for which the size
isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in
address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a
runtime-determined
>    number of elements.
>
> 5. A brief note on code generation of these new operations for AArch64.
>
> 6. An example of C code and matching IR using the proposed extensions.
>
> 7. A list of patches demonstrating the changes required to emit SVE
instructions
>    for a loop that has already been vectorized using the extensions
described
>    in this RFC.
>
> =======> 1. Types
> =======>
> To represent a vector of unknown length a boolean `Scalable` property has
been
> added to the `VectorType` class, which indicates that the number of
elements in
> the vector is a runtime-determined integer multiple of the `NumElements`
field.
> Most code that deals with vectors doesn't need to know the exact
length, but
> does need to know relative lengths -- e.g. get a vector with the same
number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we
introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and
avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be
used
> to obtain the (potentially scalable) number of elements. Overloaded
division and
> multiplication operators allow an ElementCount instance to be used in much
the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason
about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for
SVE,
> but it is possible. The model of scalable vectors presented in this RFC
assumes
> that the multiple will be constant within a function but not necessarily
across
> functions. As suggested in the recent RISC-V rfc, a new function attribute
to
> inherit the multiple across function calls will allow for function calls
with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number
of
> elements, and the string literal `scalable` indicates that the total number
of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary
choice
> for indicating that the vector is scalable, and could be substituted by
another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change
in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of
elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they
are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same
number of
>   elements.
>
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the
same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to
the
> type record. If the field is not present the type will default to a
fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes
that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable
of
> providing support for C-level intrinsics instead of being used and
recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that
creates
> new VectorTypes, much of that code doesn't need to care about whether
the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to
represent
> the number of elements.
>
> ==============> 2. Size Queries
> ==============>
> This is a proposal for how to deal with querying the size of scalable
types.
> While it has not been implemented in full, the general approach works well
> for calculating offsets into structures with scalable types in a modified
> version of ComputeValueVTs in our downstream compiler.
>
> Current IR types that have a known size all return a single integer
constant.
> For scalable types a second integer is needed to indicate the number of
bytes
> which need to be scaled by the runtime multiple to obtain the actual
length.
>
> For primitive types, getPrimitiveSizeInBits will function as it does today,
> except that it will no longer return a size for vector types (it will
return 0,
> as it does for other derived types). The majority of calls to this function
are
> already for scalar rather than vector types.
>
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not
need to
> deal with scalable types, another function (getFixedSizeExpressionInBits)
that
> only returns unscaled bits will be provided, with a debug assert that the
type
> isn't scalable.
>
> Similar functionality will be added to DataLayout.
>
> Comparing two of these sizes together is straightforward if only unscaled
sizes
> are used. Comparisons between scaled sizes is also simple when comparing
sizes
> within a function (or across functions with the inherit flag mentioned in
the
> changes to the type), but cannot be compared otherwise. If a mix is
present,
> then any number of unscaled bits will not be considered to have a greater
size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled
bits
> (since the runtime multiple is at least one).
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe'
vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns
true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
>
> =======================================> 3. Representing Vector Length
at Runtime
> =======================================>
> With a scalable vector type defined, we now need a way to represent the
runtime
> length in IR in order to generate addresses for consecutive vectors in
memory
> and determine how many elements have been processed in an iteration of a
loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable
vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ==========================> 4. Generating Vector Values
> ==========================> For constant vector values, we cannot
specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length
vectors.
>
> For constants consisting of a sequence of values, an experimental
`stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of
the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef,
<4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable
4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on
this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> =================> 5. Code Generation
> =================>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64
backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit
was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not
be
> necessary in future, generating an all-true 'ptrue' value was
necessary to
> convert a predicated instruction into an unpredicated one.
>
> =========> 6. Example
> =========>
> The following example shows a simple C loop which assigns the array index
to
> the array elements matching that index. The IR shows how vscale and
stepvector
> are used to create the needed values and to advance the index variable in
the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
>   br vector.body
>
> vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader
]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x
i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =========> 7. Patches
> =========>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190524/97c6eaf3/attachment-0001.html>

Bruce Hoult via llvm-dev

2019-May-24 21:07 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

By "generic code" I mean program instructions that is guaranteed to
run correctly on any RISC-V CPU with the V extension, now or in the
far future.

On Fri, May 24, 2019 at 1:47 PM JinGu Kang <jingu at codeplay.com>
wrote:>
> Hi Bruce,
>
> Thanks for your comment.
>
> > Generic code can enquire the size, dynamically allocate space, and
> transparently save and restore the contents of a vector register or
> registers.
>
> I am not sure what the Generic code is. It seems it uses similar way as the
implementation of variable length array. If so, I think it could be one way to
support it.
>
> Thanks,
> JinGu Kang
>
> ________________________________
> From: Bruce Hoult <brucehoult at sifive.com>
> Sent: 24 May 2019 20:12
> To: JinGu Kang
> Cc: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin;
Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe;
llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker;
Graham Hunter; nd
> Subject: Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with
variable vector lengths
>
> In the RISC-V V extension, there is no upper limit to the size vector
> registers can be in a future CPU. (Formally, the upper limit is at
> least 2^31 bytes)
>
> Generic code can enquire the size, dynamically allocate space, and
> transparently save and restore the contents of a vector register or
> registers.
>
> On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> >
> > Hi Graham,
> >
> > I am working on a custom target and it is considering scalable vector
type representation in programming language. While I am collecting the
information about it, I have met your RFC. I have a question. I think the one of
fundamental issues is that we do not know the memory layout of the type at
compile time. I am not sure whether the RFC covers this issue or not.
Conservatively, I imagined the memory layout of biggest type which the scalable
vector type can support. I could miss some discussions about it. If I missed
something, please let me know.
> >
> > Thanks,
> > JinGu Kang
> >
> > ________________________________
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of
Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org>
> > Sent: 05 June 2018 14:15
> > To: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato
Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin
Kruppe; llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam
Parker
> > Cc: nd
> > Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with
variable vector lengths
> >
> > Hi,
> >
> > Now that Sander has committed enough MC support for SVE, here's an
updated
> > RFC for variable length vector support with a set of 14 patches
(listed at the end)
> > to demonstrate code generation for SVE using the extensions proposed
in the RFC.
> >
> > I have some ideas about how to support RISC-V's upcoming extension
alongside
> > SVE; I'll send an email with some additional comments on
Robin's RFC later.
> >
> > Feedback and questions welcome.
> >
> > -Graham
> >
> > ============================================================> >
Supporting SIMD instruction sets with variable vector lengths
> > ============================================================> >
> > In this RFC we propose extending LLVM IR to support code-generation
for variable
> > length vector architectures like Arm's SVE or RISC-V's
'V' extension. Our
> > approach is backwards compatible and should be as non-intrusive as
possible; the
> > only change needed in other backends is how size is queried on vector
types, and
> > it only requires a change in which function is called. We have created
a set of
> > proof-of-concept patches to represent a simple vectorized loop in IR
and
> > generate SVE instructions from that IR. These patches (listed in
section 7 of
> > this rfc) can be found on Phabricator and are intended to illustrate
the scope
> > of changes required by the general approach described in this RFC.
> >
> > =========> > Background
> > =========> >
> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA
extension for
> > AArch64 which is intended to scale with hardware such that the same
binary
> > running on a processor with longer vector registers can take advantage
of the
> > increased compute power without recompilation.
> >
> > As the vector length is no longer a compile-time known value, the way
in which
> > the LLVM vectorizer generates code requires modifications such that
certain
> > values are now runtime evaluated expressions instead of compile-time
constants.
> >
> > Documentation for SVE can be found at
> >
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
> >
> > =======> > Contents
> > =======> >
> > The rest of this RFC covers the following topics:
> >
> > 1. Types -- a proposal to extend VectorType to be able to represent
vectors that
> >    have a length which is a runtime-determined multiple of a known
base length.
> >
> > 2. Size Queries - how to reason about the size of types for which the
size isn't
> >    fully known at compile time.
> >
> > 3. Representing the runtime multiple of vector length in IR for use in
address
> >    calculations and induction variable comparisons.
> >
> > 4. Generating 'constant' values in IR for vectors with a
runtime-determined
> >    number of elements.
> >
> > 5. A brief note on code generation of these new operations for
AArch64.
> >
> > 6. An example of C code and matching IR using the proposed extensions.
> >
> > 7. A list of patches demonstrating the changes required to emit SVE
instructions
> >    for a loop that has already been vectorized using the extensions
described
> >    in this RFC.
> >
> > =======> > 1. Types
> > =======> >
> > To represent a vector of unknown length a boolean `Scalable` property
has been
> > added to the `VectorType` class, which indicates that the number of
elements in
> > the vector is a runtime-determined integer multiple of the
`NumElements` field.
> > Most code that deals with vectors doesn't need to know the exact
length, but
> > does need to know relative lengths -- e.g. get a vector with the same
number of
> > elements but a different element type, or with half or double the
number of
> > elements.
> >
> > In order to allow code to transparently support scalable vectors, we
introduce
> > an `ElementCount` class with two members:
> >
> > - `unsigned Min`: the minimum number of elements.
> > - `bool Scalable`: is the element count an unknown multiple of `Min`?
> >
> > For non-scalable vectors (``Scalable=false``) the scale is considered
to be
> > equal to one and thus `Min` represents the exact number of elements in
the
> > vector.
> >
> > The intent for code working with vectors is to use convenience methods
and avoid
> > directly dealing with the number of elements. If needed, calling
> > `getElementCount` on a vector type instead of `getVectorNumElements`
can be used
> > to obtain the (potentially scalable) number of elements. Overloaded
division and
> > multiplication operators allow an ElementCount instance to be used in
much the
> > same manner as an integer for most cases.
> >
> > This mixture of compile-time and runtime quantities allow us to reason
about the
> > relationship between different scalable vector types without knowing
their
> > exact length.
> >
> > The runtime multiple is not expected to change during program
execution for SVE,
> > but it is possible. The model of scalable vectors presented in this
RFC assumes
> > that the multiple will be constant within a function but not
necessarily across
> > functions. As suggested in the recent RISC-V rfc, a new function
attribute to
> > inherit the multiple across function calls will allow for function
calls with
> > vector arguments/return values and inlining/outlining optimizations.
> >
> > IR Textual Form
> > ---------------
> >
> > The textual form for a scalable vector is:
> >
> > ``<scalable <n> x <type>>``
> >
> > where `type` is the scalar type of each element, `n` is the minimum
number of
> > elements, and the string literal `scalable` indicates that the total
number of
> > elements is an unknown multiple of `n`; `scalable` is just an
arbitrary choice
> > for indicating that the vector is scalable, and could be substituted
by another.
> > For fixed-length vectors, the `scalable` is omitted, so there is no
change in
> > the format for existing vectors.
> >
> > Scalable vectors with the same `Min` value have the same number of
elements, and
> > the same number of bytes if `Min * sizeof(type)` is the same (assuming
they are
> > used within the same function):
> >
> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the
same number of
> >   elements.
> >
> > ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have
the same number of
> >   bytes.
> >
> > IR Bitcode Form
> > ---------------
> >
> > To serialize scalable vectors to bitcode, a new boolean field is added
to the
> > type record. If the field is not present the type will default to a
fixed-length
> > vector type, preserving backwards compatibility.
> >
> > Alternatives Considered
> > -----------------------
> >
> > We did consider one main alternative -- a dedicated target type, like
the
> > x86_mmx type.
> >
> > A dedicated target type would either need to extend all existing
passes that
> > work with vectors to recognize the new type, or to duplicate all that
code
> > in order to get reasonable code generation and autovectorization.
> >
> > This hasn't been done for the x86_mmx type, and so it is only
capable of
> > providing support for C-level intrinsics instead of being used and
recognized by
> > passes inside llvm.
> >
> > Although our current solution will need to change some of the code
that creates
> > new VectorTypes, much of that code doesn't need to care about
whether the types
> > are scalable or not -- they can use preexisting methods like
> > `getHalfElementsVectorType`. If the code is a little more complex,
> > `ElementCount` structs can be used instead of an `unsigned` value to
represent
> > the number of elements.
> >
> > ==============> > 2. Size Queries
> > ==============> >
> > This is a proposal for how to deal with querying the size of scalable
types.
> > While it has not been implemented in full, the general approach works
well
> > for calculating offsets into structures with scalable types in a
modified
> > version of ComputeValueVTs in our downstream compiler.
> >
> > Current IR types that have a known size all return a single integer
constant.
> > For scalable types a second integer is needed to indicate the number
of bytes
> > which need to be scaled by the runtime multiple to obtain the actual
length.
> >
> > For primitive types, getPrimitiveSizeInBits will function as it does
today,
> > except that it will no longer return a size for vector types (it will
return 0,
> > as it does for other derived types). The majority of calls to this
function are
> > already for scalar rather than vector types.
> >
> > For derived types, a function (getSizeExpressionInBits) to return a
pair of
> > integers (one to indicate unscaled bits, the other for bits that need
to be
> > scaled by the runtime multiple) will be added. For backends that do
not need to
> > deal with scalable types, another function
(getFixedSizeExpressionInBits) that
> > only returns unscaled bits will be provided, with a debug assert that
the type
> > isn't scalable.
> >
> > Similar functionality will be added to DataLayout.
> >
> > Comparing two of these sizes together is straightforward if only
unscaled sizes
> > are used. Comparisons between scaled sizes is also simple when
comparing sizes
> > within a function (or across functions with the inherit flag mentioned
in the
> > changes to the type), but cannot be compared otherwise. If a mix is
present,
> > then any number of unscaled bits will not be considered to have a
greater size
> > than a smaller number of scaled bits, but a smaller number of unscaled
bits
> > will be considered to have a smaller size than a greater number of
scaled bits
> > (since the runtime multiple is at least one).
> >
> > Future Work
> > -----------
> >
> > Since we cannot determine the exact size of a scalable vector, the
> > existing logic for alias detection won't work when multiple
accesses
> > share a common base pointer with different offsets.
> >
> > However, SVE's predication will mean that a dynamic 'safe'
vector length
> > can be determined at runtime, so after initial support has been added
we
> > can work on vectorizing loops using runtime predication to avoid
aliasing
> > problems.
> >
> > Alternatives Considered
> > -----------------------
> >
> > Marking scalable vectors as unsized doesn't work well, as many
parts of
> > llvm dealing with loads and stores assert that 'isSized()'
returns true
> > and make use of the size when calculating offsets.
> >
> > We have considered introducing multiple helper functions instead of
> > using direct size queries, but that doesn't cover all cases. It
may
> > still be a good idea to introduce them to make the purpose in a given
> > case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
> >
> > =======================================> > 3. Representing
Vector Length at Runtime
> > =======================================> >
> > With a scalable vector type defined, we now need a way to represent
the runtime
> > length in IR in order to generate addresses for consecutive vectors in
memory
> > and determine how many elements have been processed in an iteration of
a loop.
> >
> > We have added an experimental `vscale` intrinsic to represent the
runtime
> > multiple. Multiplying the result of this intrinsic by the minimum
number of
> > elements in a vector gives the total number of elements in a scalable
vector.
> >
> > Fixed-Length Code
> > -----------------
> >
> > Assuming a vector type of <4 x <ty>>
> > ``
> > vector.body:
> >   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
> >   ;; <loop body>
> >   ;; Increment induction var
> >   %index.next = add i64 %index, 4
> >   ;; <check and branch>
> > ``
> > Scalable Equivalent
> > -------------------
> >
> > Assuming a vector type of <scalable 4 x <ty>>
> > ``
> > vector.body:
> >   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
> >   ;; <loop body>
> >   ;; Increment induction var
> >   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> >   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
> >   ;; <check and branch>
> > ``
> > ==========================> > 4. Generating Vector Values
> > ==========================> > For constant vector values, we
cannot specify all the elements as we can for
> > fixed-length vectors; fortunately only a small number of easily
synthesized
> > patterns are required for autovectorization. The `zeroinitializer`
constant
> > can be used in the same manner as fixed-length vectors for a constant
zero
> > splat. This can then be combined with `insertelement` and
`shufflevector`
> > to create arbitrary value splats in the same manner as fixed-length
vectors.
> >
> > For constants consisting of a sequence of values, an experimental
`stepvector`
> > intrinsic has been added to represent a simple constant of the form
> > `<0, 1, 2... num_elems-1>`. To change the starting value a splat
of the new
> > start can be added, and changing the step requires multiplying by a
splat.
> >
> > Fixed-Length Code
> > -----------------
> > ``
> >   ;; Splat a value
> >   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
> >   %splat = shufflevector <4 x i32> %insert, <4 x i32>
undef, <4 x i32> zeroinitializer
> >   ;; Add a constant sequence
> >   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32
8>
> > ``
> > Scalable Equivalent
> > -------------------
> > ``
> >   ;; Splat a value
> >   %insert = insertelement <scalable 4 x i32> undef, i32 %value,
i32 0
> >   %splat = shufflevector <scalable 4 x i32> %insert,
<scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> >   ;; Splat offset + stride (the same in this case)
> >   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32
0
> >   %str_off = shufflevector <scalable 4 x i32> %insert2,
<scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> >   ;; Create sequence for scalable vector
> >   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
> >   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
> >   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
> >   ;; Add the runtime-generated sequence
> >   %add = add <scalable 4 x i32> %splat, %addoffset
> > ``
> > Future Work
> > -----------
> >
> > Intrinsics cannot currently be used for constant folding. Our
downstream
> > compiler (using Constants instead of intrinsics) relies quite heavily
on this
> > for good code generation, so we will need to find new ways to
recognize and
> > fold these values.
> >
> > =================> > 5. Code Generation
> > =================> >
> > IR splats will be converted to an experimental splatvector intrinsic
in
> > SelectionDAGBuilder.
> >
> > All three intrinsics are custom lowered and legalized in the AArch64
backend.
> >
> > Two new AArch64ISD nodes have been added to represent the same
concepts
> > at the SelectionDAG level, while splatvector maps onto the existing
> > AArch64ISD::DUP.
> >
> > GlobalISel
> > ----------
> >
> > Since GlobalISel was enabled by default on AArch64, it was necessary
to add
> > scalable vector support to the LowLevelType implementation. A single
bit was
> > added to the raw_data representation for vectors and vectors of
pointers.
> >
> > In addition, types that only exist in destination patterns are planted
in
> > the enumeration of available types for generated code. While this may
not be
> > necessary in future, generating an all-true 'ptrue' value was
necessary to
> > convert a predicated instruction into an unpredicated one.
> >
> > =========> > 6. Example
> > =========> >
> > The following example shows a simple C loop which assigns the array
index to
> > the array elements matching that index. The IR shows how vscale and
stepvector
> > are used to create the needed values and to advance the index variable
in the
> > loop.
> >
> > C Code
> > ------
> >
> > ``
> > void IdentityArrayInit(int *a, int count) {
> >   for (int i = 0; i < count; ++i)
> >     a[i] = i;
> > }
> > ``
> >
> > Scalable IR Vector Body
> > -----------------------
> >
> > ``
> > vector.body.preheader:
> >   ;; Other setup
> >   ;; Stepvector used to create initial identity vector
> >   %stepvector = call <scalable 4 x i32>
@llvm.experimental.vector.stepvector.nxv4i32()
> >   br vector.body
> >
> > vector.body
> >   %index = phi i64 [ %index.next, %vector.body ], [ 0,
%vector.body.preheader ]
> >   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
> >
> >            ;; stepvector used for index identity on entry to loop body
;;
> >   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body
],
> >                                      [ %stepvector,
%vector.body.preheader ]
> >   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> >   %vscale32 = trunc i64 %vscale64 to i32
> >   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
> >
> >            ;; vscale splat used to increment identity vector ;;
> >   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
%vscale32, i32 4), i32 0
> >   %splat shufflevector <scalable 4 x i32> %insert, <scalable
4 x i32> undef, <scalable 4 x i32> zeroinitializer
> >   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
> >   %2 = getelementptr inbounds i32, i32* %a, i64 %0
> >   %3 = bitcast i32* %2 to <scalable 4 x i32>*
> >   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>*
%3, align 4
> >
> >            ;; vscale used to increment loop index
> >   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
> >   %4 = icmp eq i64 %index.next, %n.vec
> >   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> > ``
> >
> > =========> > 7. Patches
> > =========> >
> > List of patches:
> >
> > 1. Extend VectorType: https://reviews.llvm.org/D32530
> > 2. Vector element type Tablegen constraint:
https://reviews.llvm.org/D47768
> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> > 10. Initial store patterns: https://reviews.llvm.org/D47776
> > 11. Initial addition patterns: https://reviews.llvm.org/D47777
> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

llvm dev - May 2019 - [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths