Hal Finkel via llvm-dev
2018-Jul-30 19:10 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 07/30/2018 05:34 AM, Chandler Carruth wrote:
> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>
> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>
> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.

At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless. I have two questions:

1.

> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full,

Is this still true? The details here all need to work out, obviously, and we should make sure that any issues are identified.

2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function.
I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.

Thanks again,
Hal

> Put differently: I don't think silence is assent here. You really need some clear signal of consensus.
>
> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com <mailto:Graham.Hunter at arm.com>> wrote:
>
> Hi,
>
> Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.
>
> -Graham
>
> > On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com <mailto:Graham.Hunter at arm.com>> wrote:
> >
> > Hi,
> >
> > I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
> >
> > Thanks,
> >
> > -Graham
> >
> > ============================================================
> > Supporting SIMD instruction sets with variable vector lengths
> > ============================================================
> >
> > In this RFC we propose extending LLVM IR to support code generation for variable-length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our approach is backwards compatible and should be as non-intrusive as possible; the only change needed in other backends is how size is queried on vector types, and it only requires a change in which function is called. We have created a set of proof-of-concept patches to represent a simple vectorized loop in IR and generate SVE instructions from that IR.
> > These patches (listed in section 7 of this RFC) can be found on Phabricator and are intended to illustrate the scope of changes required by the general approach described in this RFC.
> >
> > ==========
> > Background
> > ==========
> >
> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for AArch64 which is intended to scale with hardware such that the same binary running on a processor with longer vector registers can take advantage of the increased compute power without recompilation.
> >
> > As the vector length is no longer a compile-time known value, the way in which the LLVM vectorizer generates code requires modifications such that certain values are now runtime-evaluated expressions instead of compile-time constants.
> >
> > Documentation for SVE can be found at
> > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
> >
> > ========
> > Contents
> > ========
> >
> > The rest of this RFC covers the following topics:
> >
> > 1. Types -- a proposal to extend VectorType to be able to represent vectors that have a length which is a runtime-determined multiple of a known base length.
> >
> > 2. Size Queries -- how to reason about the size of types for which the size isn't fully known at compile time.
> >
> > 3. Representing the runtime multiple of vector length in IR for use in address calculations and induction variable comparisons.
> >
> > 4. Generating 'constant' values in IR for vectors with a runtime-determined number of elements.
> >
> > 5. An explanation of splitting/concatenating scalable vectors.
> >
> > 6. A brief note on code generation of these new operations for AArch64.
> >
> > 7. An example of C code and matching IR using the proposed extensions.
> >
> > 8.
> > A list of patches demonstrating the changes required to emit SVE instructions for a loop that has already been vectorized using the extensions described in this RFC.
> >
> > ========
> > 1. Types
> > ========
> >
> > To represent a vector of unknown length, a boolean `Scalable` property has been added to the `VectorType` class, which indicates that the number of elements in the vector is a runtime-determined integer multiple of the `NumElements` field. Most code that deals with vectors doesn't need to know the exact length, but does need to know relative lengths -- e.g. get a vector with the same number of elements but a different element type, or with half or double the number of elements.
> >
> > In order to allow code to transparently support scalable vectors, we introduce an `ElementCount` class with two members:
> >
> > - `unsigned Min`: the minimum number of elements.
> > - `bool Scalable`: is the element count an unknown multiple of `Min`?
> >
> > For non-scalable vectors (``Scalable=false``) the scale is considered to be equal to one and thus `Min` represents the exact number of elements in the vector.
> >
> > The intent for code working with vectors is to use convenience methods and avoid directly dealing with the number of elements. If needed, calling `getElementCount` on a vector type instead of `getVectorNumElements` can be used to obtain the (potentially scalable) number of elements. Overloaded division and multiplication operators allow an ElementCount instance to be used in much the same manner as an integer for most cases.
> >
> > This mixture of compile-time and runtime quantities allows us to reason about the relationship between different scalable vector types without knowing their exact length.
> >
> > The runtime multiple is not expected to change during program execution for SVE, but it is possible.
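[Editor's sketch] The `ElementCount` class described in the RFC can be modelled with a short standalone C++ sketch. This is an illustrative model based only on the RFC text, not the actual LLVM implementation; the constructor and the exact operator set are assumptions:

```cpp
#include <cassert>

// Hypothetical standalone model of the RFC's ElementCount.
// The real LLVM class may differ in detail.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // is the count an unknown multiple of Min?

  ElementCount(unsigned Min, bool Scalable) : Min(Min), Scalable(Scalable) {}

  // Halve/double the element count while preserving scalability, so code
  // can split or widen vectors without caring whether they are scalable.
  ElementCount operator/(unsigned RHS) const {
    assert(Min % RHS == 0 && "element count not divisible");
    return ElementCount(Min / RHS, Scalable);
  }
  ElementCount operator*(unsigned RHS) const {
    return ElementCount(Min * RHS, Scalable);
  }

  // Two counts are equal only if both Min and Scalable match; a scalable
  // count never compares equal to a fixed-length one.
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

With this shape, dividing an `ElementCount` by 2 gives the count of a half-width vector whether or not the type is scalable, which is the "relative lengths" use case described above.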
> > The model of scalable vectors presented in this RFC assumes that the multiple will be constant within a function but not necessarily across functions. As suggested in the recent RISC-V RFC, a new function attribute to inherit the multiple across function calls will allow for function calls with vector arguments/return values and inlining/outlining optimizations.
> >
> > IR Textual Form
> > ---------------
> >
> > The textual form for a scalable vector is:
> >
> > ``<scalable <n> x <type>>``
> >
> > where `type` is the scalar type of each element, `n` is the minimum number of elements, and the string literal `scalable` indicates that the total number of elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice for indicating that the vector is scalable, and could be substituted by another. For fixed-length vectors, the `scalable` is omitted, so there is no change in the format for existing vectors.
> >
> > Scalable vectors with the same `Min` value have the same number of elements, and the same number of bytes if `Min * sizeof(type)` is the same (assuming they are used within the same function):
> >
> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of elements.
> >
> > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of bytes.
> >
> > IR Bitcode Form
> > ---------------
> >
> > To serialize scalable vectors to bitcode, a new boolean field is added to the type record. If the field is not present the type will default to a fixed-length vector type, preserving backwards compatibility.
> >
> > Alternatives Considered
> > -----------------------
> >
> > We did consider one main alternative -- a dedicated target type, like the x86_mmx type.
> > A dedicated target type would either need to extend all existing passes that work with vectors to recognize the new type, or to duplicate all that code in order to get reasonable code generation and autovectorization.
> >
> > This hasn't been done for the x86_mmx type, and so it is only capable of providing support for C-level intrinsics instead of being used and recognized by passes inside LLVM.
> >
> > Although our current solution will need to change some of the code that creates new VectorTypes, much of that code doesn't need to care about whether the types are scalable or not -- it can use preexisting methods like `getHalfElementsVectorType`. If the code is a little more complex, `ElementCount` structs can be used instead of an `unsigned` value to represent the number of elements.
> >
> > ===============
> > 2. Size Queries
> > ===============
> >
> > This is a proposal for how to deal with querying the size of scalable types for analysis of IR. While it has not been implemented in full, the general approach works well for calculating offsets into structures with scalable types in a modified version of ComputeValueVTs in our downstream compiler.
> >
> > For current IR types that have a known size, all query functions return a single integer constant. For scalable types a second integer is needed to indicate the number of bytes/bits which need to be scaled by the runtime multiple to obtain the actual length.
> >
> > For primitive types, `getPrimitiveSizeInBits()` will function as it does today, except that it will no longer return a size for vector types (it will return 0, as it does for other derived types). The majority of calls to this function are already for scalar rather than vector types.
> > For derived types, a function `getScalableSizePairInBits()` will be added, which returns a pair of integers (one to indicate unscaled bits, the other for bits that need to be scaled by the runtime multiple). For backends that do not need to deal with scalable types the existing methods will suffice, but a debug-only assert will be added to them to ensure they aren't used on scalable types.
> >
> > Similar functionality will be added to DataLayout.
> >
> > Comparisons between sizes will use the following methods, assuming that X and Y are non-zero integers and the form is { unscaled, scaled }.
> >
> > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
> >
> > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across functions that inherit vector length. Cannot be compared across non-inheriting functions.
> >
> > { X, 0 } > { 0, Y }: Cannot return true.
> >
> > { X, 0 } = { 0, Y }: Cannot return true.
> >
> > { X, 0 } < { 0, Y }: Can return true.
> >
> > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated; we need to subtract common terms and try the above comparisons, and it may not be possible to get a good answer.
> >
> > It's worth noting that we don't expect the last case (mixed scaled and unscaled sizes) to occur. Richard Sandiford's proposed C extensions (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly prohibit mixing fixed-size types into sizeless structs.
> >
> > I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled vs. unscaled; I believe the gcc implementation of SVE allows for such results, but that supports a generic polynomial length representation.
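[Editor's sketch] The comparison rules above can be modelled with a small helper. The following C++ is purely illustrative; the tri-state result type, function name, and `{ unscaled, scaled }` struct are assumptions, not the proposed LLVM API. It assumes both sizes come from the same function, so the runtime multiple is shared and is at least 1:

```cpp
#include <cassert>

// Size of a type as { unscaled bits, bits scaled by the runtime multiple }.
struct ScalableSize {
  unsigned Unscaled;
  unsigned Scaled;
};

// A comparison may be provably true, provably false, or undecidable
// without knowing the runtime multiple.
enum class Cmp { True, False, Unknown };

// Hypothetical "is LHS strictly smaller than RHS?" following the RFC's
// comparison table.
Cmp isSmallerThan(ScalableSize LHS, ScalableSize RHS) {
  if (LHS.Scaled == 0 && RHS.Scaled == 0)     // { X, 0 } vs { Y, 0 }
    return LHS.Unscaled < RHS.Unscaled ? Cmp::True : Cmp::False;
  if (LHS.Unscaled == 0 && RHS.Unscaled == 0) // { 0, X } vs { 0, Y }
    return LHS.Scaled < RHS.Scaled ? Cmp::True : Cmp::False;
  if (LHS.Scaled == 0 && RHS.Unscaled == 0)   // { X, 0 } < { 0, Y }?
    // Provably true if X < Y, since the scaled size is at least Y;
    // otherwise it depends on the runtime multiple.
    return LHS.Unscaled < RHS.Scaled ? Cmp::True : Cmp::Unknown;
  // Mixed scaled/unscaled terms: not expected to occur per the RFC.
  return Cmp::Unknown;
}
```

Note how `{ X, 0 } < { 0, Y }` is the only asymmetric case: it "can return true" exactly when the fixed size is below the scaled size's minimum.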
> > My current intention is to rely on functions that clone or copy values to check whether they are being used to copy scalable vectors across function boundaries without the inherit-vlen attribute, and to raise an error there instead of requiring the Function a given type size comes from to be passed in for each comparison. If there's a strong preference for moving the check to the size comparison function let me know; I will be starting work on patches for this later in the year if there are no major problems with the idea.
> >
> > Future Work
> > -----------
> >
> > Since we cannot determine the exact size of a scalable vector, the existing logic for alias detection won't work when multiple accesses share a common base pointer with different offsets.
> >
> > However, SVE's predication will mean that a dynamic 'safe' vector length can be determined at runtime, so after initial support has been added we can work on vectorizing loops using runtime predication to avoid aliasing problems.
> >
> > Alternatives Considered
> > -----------------------
> >
> > Marking scalable vectors as unsized doesn't work well, as many parts of LLVM dealing with loads and stores assert that 'isSized()' returns true and make use of the size when calculating offsets.
> >
> > We have considered introducing multiple helper functions instead of using direct size queries, but that doesn't cover all cases. It may still be a good idea to introduce them to make the purpose in a given case more obvious, e.g. 'requiresSignExtension(Type*, Type*)'.
> >
> > ========================================
> > 3.
> > Representing Vector Length at Runtime
> > ========================================
> >
> > With a scalable vector type defined, we now need a way to represent the runtime length in IR in order to generate addresses for consecutive vectors in memory and determine how many elements have been processed in an iteration of a loop.
> >
> > We have added an experimental `vscale` intrinsic to represent the runtime multiple. Multiplying the result of this intrinsic by the minimum number of elements in a vector gives the total number of elements in a scalable vector.
> >
> > Fixed-Length Code
> > -----------------
> >
> > Assuming a vector type of <4 x <ty>>
> > ``
> > vector.body:
> >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> >   ;; <loop body>
> >   ;; Increment induction var
> >   %index.next = add i64 %index, 4
> >   ;; <check and branch>
> > ``
> >
> > Scalable Equivalent
> > -------------------
> >
> > Assuming a vector type of <scalable 4 x <ty>>
> > ``
> > vector.body:
> >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> >   ;; <loop body>
> >   ;; Increment induction var
> >   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> >   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
> >   ;; <check and branch>
> > ``
> >
> > ===========================
> > 4. Generating Vector Values
> > ===========================
> >
> > For constant vector values, we cannot specify all the elements as we can for fixed-length vectors; fortunately only a small number of easily synthesized patterns are required for autovectorization. The `zeroinitializer` constant can be used in the same manner as fixed-length vectors for a constant zero splat. This can then be combined with `insertelement` and `shufflevector` to create arbitrary value splats in the same manner as fixed-length vectors.
> > For constants consisting of a sequence of values, an experimental `stepvector` intrinsic has been added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new start can be added, and changing the step requires multiplying by a splat.
> >
> > Fixed-Length Code
> > -----------------
> > ``
> > ;; Splat a value
> > %insert = insertelement <4 x i32> undef, i32 %value, i32 0
> > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
> > ;; Add a constant sequence
> > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> > ``
> >
> > Scalable Equivalent
> > -------------------
> > ``
> > ;; Splat a value
> > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
> > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> > ;; Splat offset + stride (the same in this case)
> > %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
> > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> > ;; Create sequence for scalable vector
> > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
> > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
> > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
> > ;; Add the runtime-generated sequence
> > %add = add <scalable 4 x i32> %splat, %addoffset
> > ``
> >
> > Future Work
> > -----------
> >
> > Intrinsics cannot currently be used for constant folding. Our downstream compiler (using Constants instead of intrinsics) relies quite heavily on this for good code generation, so we will need to find new ways to recognize and fold these values.
> >
> > ===========================================
> > 5.
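[Editor's sketch] A scalar model may help make the scalable sequence above concrete. The C++ below assumes a specific runtime multiple `vscale`; the function name and the choice to model a `<scalable 4 x i32>` are illustrative only:

```cpp
#include <cassert>
#include <vector>

// Scalar model of the "Scalable Equivalent" IR sequence above for a
// <scalable 4 x i32>, with a hypothetical runtime multiple `vscale`,
// so the vector has 4 * vscale lanes.
std::vector<int> splatPlusStridedSequence(int splat, int offset, int stride,
                                          unsigned vscale) {
  unsigned NumElems = 4 * vscale;
  std::vector<int> Result(NumElems);
  for (unsigned i = 0; i < NumElems; ++i)
    // stepvector yields <0, 1, ..., NumElems-1>; multiply by the stride
    // splat, add the offset splat, then add the value splat.
    Result[i] = splat + (offset + stride * static_cast<int>(i));
  return Result;
}
```

In the IR example the offset and stride are both 2, so with `%value == 10` and `vscale == 2` each lane i holds `10 + 2 + 2*i`.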
> > Splitting and Combining Scalable Vectors
> > ===========================================
> >
> > Splitting and combining scalable vectors in IR is done in the same manner as for fixed-length vectors, but with a non-constant mask for the shufflevector.
> >
> > The following is an example of splitting a <scalable 4 x double> into two separate <scalable 2 x double> values.
> >
> > ``
> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> > ;; Stepvector generates the element ids for the first subvector
> > %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
> > ;; Add vscale * 2 to get the starting element for the second subvector
> > %ec = mul i64 %vscale64, 2
> > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
> > %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
> > %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
> > ;; Perform the extracts
> > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
> > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> > ``
> >
> > ==================
> > 6. Code Generation
> > ==================
> >
> > IR splats will be converted to an experimental splatvector intrinsic in SelectionDAGBuilder.
> >
> > All three intrinsics are custom lowered and legalized in the AArch64 backend.
> >
> > Two new AArch64ISD nodes have been added to represent the same concepts at the SelectionDAG level, while splatvector maps onto the existing AArch64ISD::DUP.
> >
> > GlobalISel
> > ----------
> >
> > Since GlobalISel was enabled by default on AArch64, it was necessary to add scalable vector support to the LowLevelType implementation. A single bit was added to the raw_data representation for vectors and vectors of pointers.
> > In addition, types that only exist in destination patterns are planted in the enumeration of available types for generated code. While this may not be necessary in future, generating an all-true 'ptrue' value was necessary to convert a predicated instruction into an unpredicated one.
> >
> > ==========
> > 7. Example
> > ==========
> >
> > The following example shows a simple C loop which assigns the array index to the array elements matching that index. The IR shows how vscale and stepvector are used to create the needed values and to advance the index variable in the loop.
> >
> > C Code
> > ------
> >
> > ``
> > void IdentityArrayInit(int *a, int count) {
> >   for (int i = 0; i < count; ++i)
> >     a[i] = i;
> > }
> > ``
> >
> > Scalable IR Vector Body
> > -----------------------
> >
> > ``
> > vector.body.preheader:
> >   ;; Other setup
> >   ;; Stepvector used to create initial identity vector
> >   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
> >   br label %vector.body
> >
> > vector.body:
> >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> >   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
> >
> >   ;; stepvector used for index identity on entry to loop body ;;
> >   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
> >                                      [ %stepvector, %vector.body.preheader ]
> >   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> >   %vscale32 = trunc i64 %vscale64 to i32
> >   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
> >
> >   ;; vscale splat used to increment identity vector ;;
> >   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
> >   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> >   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
> >   %2 = getelementptr inbounds i32, i32* %a, i64 %0
> >   %3 = bitcast i32* %2 to <scalable 4 x i32>*
> >   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
> >
> >   ;; vscale used to increment loop index
> >   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
> >   %4 = icmp eq i64 %index.next, %n.vec
> >   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> > ``
> >
> > ==========
> > 8. Patches
> > ==========
> >
> > List of patches:
> >
> > 1. Extend VectorType: https://reviews.llvm.org/D32530
> > 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> > 10. Initial store patterns: https://reviews.llvm.org/D47776
> > 11. Initial addition patterns: https://reviews.llvm.org/D47777
> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
David A. Greene via llvm-dev
2018-Jul-30 19:57 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> writes:

> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.

Can you elaborate a bit on your concerns? I'm not sure how allowing vector length changes impacts the design of this proposal.

As far as I understand things, this proposal is about dealing with unknown vector lengths, providing types and intrinsics where needed to support necessary operations. It seems to me that building support for changing the vscale at runtime is somewhat orthogonal. That is, anyone doing such a thing will probably have to provide some more intrinsics to capture the dependency chains and prevent miscompilation, but the basic types and other intrinsics would remain the same.

What I'm going to say below is from my (narrow) perspective of machines I've worked on. It's not meant to cover all possibilities, things people might do with RISC-V, etc. I intend it as a (common) example for discussion.

Changing the vector length during execution of a loop (for the last iteration, typically) is very common for architectures without predication. Traditional Cray processors, for example, had a vector length register. The compiler had to manage updates to the vl register just like any other register, and instructions used vl as an implicit operand.

I'm not sure exactly how the SVE proposal would address this kind of operation.
llvm.experimental.vector.vscale is a vector length read. I could imagine a load intrinsic that takes a vscale value as an operand, thus connecting the vector length to the load and the transitive closure of its uses. I could also imagine an intrinsic to change the vscale. The trick would be to disallow reordering vector length reads and writes. None of this seems to require changes to the proposed type system, only the addition of some (target-specific?) intrinsics.

I think it would be unlikely for anyone to need to change the vector length during evaluation of an in-register expression. That is, vector length changes would normally be made only at observable points in the program (loads, stores, etc.) and probably only at certain control-flow boundaries (loop back-edges, function calls/returns and so on). Thus we wouldn't need intrinsics or other new IR for every possible operation in LLVM, only at the boundaries.

                              -David
Renato Golin via llvm-dev
2018-Jul-30 20:12 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On Mon, 30 Jul 2018 at 20:57, David A. Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> I'm not sure exactly how the SVE proposal would address this kind of operation.

SVE uses predication. The physical number of lanes doesn't have to change to have the same effect (alignment, tails).

> I think it would be unlikely for anyone to need to change the vector length during evaluation of an in-register expression.

The worry here is not within each instruction but across instructions. SVE (and I think RISC-V) allow the register size to be dynamically set. For example, on the same machine, it may be 256 bits for one process and 512 bits for another (for example, to save power).

But the change is via a system register, so in theory, anyone can write inline asm at the beginning of a function and change the vector length to whatever they want. Worse still, people can do that inside loops, or in a tail loop, thinking it's a good idea (or that this is a Cray machine :).

AFAIK, the interface for changing the register length will not be exposed programmatically, so in theory, we should not worry about it. Any inline asm hack can be considered out of scope / user error.

However, Hal's concern seems to be that, in the event of anyone planning to add it to their APIs, we need to make sure the proposed semantics can cope with it (do we need to update the predicates again? what will vscale mean, then and when?). If not, we may have to enforce that this will not come to pass in its current form. In this case, changing it later will require *a lot* more effort than doing it now.

So, it would be good to get a clear response from the two fronts (SVE and RISC-V) about the future intention to expose that or not.

--
cheers,
--renato
Robin Kruppe via llvm-dev
2018-Jul-31 18:03 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi all,

I'm starting to feel like a broken record about this, but too much of the discussion has been unclear on this point and I think it causes a fair amount of confusion, so I feel obligated to state it again as clearly as I can: there are TWO independent notions of vector length in this space. Namely:

1. How large are the machine's vector registers?
2. How many elements of a vector register are processed by an instruction?

This RFC addresses only the former, with the vscale concept. We have been and still are discussing the latter in this email thread too, sometimes under names such as "VL" or "active vector length", but unfortunately also often as just plain "vector length". I think this is very unfortunate: having two intermingled discussions about different things which share a name is very confusing, especially since I believe there is no need to discuss them together.

The active vector length can't be larger than the number of elements in a vector register, but apart from that they are entirely separate, and whether an architecture has fixed- or variable-size registers is completely orthogonal to whether it has a VL register. All combinations make sense and exist in real architectures:

- SSE, NEON, etc. have fixed-size vector registers (e.g. 128 bit) without any active vector length mechanism
- Classical Cray-style vector processors have fixed-size vector registers (e.g., the Cray-1 had 64x64 bit) and an active vector length mechanism
- SVE has variable-size vector registers and no active vector length mechanism (loops are instead controlled by predication)
- The vector extension for RISC-V has variable-size vector registers and an active vector length mechanism

More importantly, the two mechanisms are *used* very differently and place very different demands on a compiler. Therefore, any discussion that conflates these two concerns is doomed from the start IMHO.
I have written a bit about these differences, but since I know many people here only have so much time, I moved this to an "appendix" after the end of this email and will now go straight to addressing Hal's second concern with this distinction in mind.

On 30 July 2018 at 21:10, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
> > I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
> >
> > I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
> >
> > Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>
> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless. I have two questions:
>
> 1.
>
> > This is a proposal for how to deal with querying the size of scalable types for
> > analysis of IR. While it has not been implemented in full,
>
> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
>
> 2.
> I know that there has been some discussion around support for changing
> the vector length during program execution (e.g., to account for some
> (proposed?) RISC-V feature), perhaps even during the execution of a single
> function. I'm very concerned about this idea because it is not at all clear
> to me how to limit information transfer contaminated with the vector size
> from propagating between different regions. As a result, I'm concerned about
> trying to add this on later, and so if this is part of the plan, I think
> that we need to think through the details up front because it could have a
> major impact on the design.

Yes, changing vscale during program execution is necessary to some degree for the RISC-V vector extension. And yes, doing this at arbitrary program points is indeed extremely challenging for a compiler to support, for exactly the reason you describe. This is why I proposed a trade-off, which Graham incorporated into this RFC: vscale can only change at function boundaries and is fixed between function entry and exit. This restriction is OK (not ideal, but good enough IMO) for RISC-V, and it makes the problem much more manageable because most code in LLVM operates only within one function at a time, so it never has to encounter vscale changes. I also think this is the most we'll ever be able to support -- the problem you describe isn't going away, and I don't know of any major use cases that would require us to tackle this difficult problem in its entirety. However, I might be unaware of something people want to do with SVE that doesn't fit into this mould.

Despite also being relevant for RISC-V and being discussed extensively in this thread, the active vector length is basically just a very minor twist on predication, and therefore doesn't interact at all with the type system changes proposed here. Like predication, it can just be modelled by regular data flow between IR operations (as David already said).
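Treating the active vector length as "just data flow" might look like the following (a hypothetical C++ model, with invented names, of an IR-level operation that receives the VL as an ordinary extra operand, exactly as a predicate would be passed):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: the active vector length is just another SSA value
// handed to each operation, like a predicate would be. Lanes at or past
// vl are left untouched (zero-initialized here).
std::vector<int> vadd_vl(const std::vector<int> &a, const std::vector<int> &b,
                         std::size_t vl) {
  std::vector<int> out(a.size(), 0);
  for (std::size_t i = 0; i < vl && i < a.size(); ++i)
    out[i] = a[i] + b[i];
  return out;
}
```

Nothing about the *types* involved changes when `vl` changes; only the set of lanes that participate does, which is why this is a data-flow question rather than a type-system one.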
As with predication, a smaller *active* vector length (~= a mask with few elements enabled) doesn't mean vectors suddenly have fewer elements, just that more of them are masked out while doing calculations. While there's an interesting design space for how best to represent this predication in the IR, it has entirely different challenges and constraints from vscale changes. If anything, the "active vector length" discussion has more in common with past discussions about making predication for *fixed-length* vectors more of a first-class citizen in LLVM IR.

So I think this RFC as-is solves the problem of changing vector register sizes about as well as it can and needs to be solved, and in a way that is entirely satisfactory for RISC-V (again, I can't speak for SVE, I don't know the use cases there). While more work is needed to deal with another aspect of the RISC-V vector architecture (the VL register), that can and should be a separate discussion, the results of which won't invalidate anything decided in this RFC.

Cheers,
Robin

## Appendix

The active vector length or VL register is a tool for loop control, ensuring the vectorized loop does not run too far while still maximizing use of the vector unit. As such, it is recomputed frequently (at minimum once per loop iteration, possibly even within a loop as Bruce explained) and can be seen as a particular kind of predication. It applies to a particular operation, prevents it from having unwanted side effects, and operates on a subset of a larger vector. As SVE illustrates, one can use plain old masks in precisely the same way to solve the same problem, constructing and maintaining masks that enable "the first n elements", where n would be the active vector length on a different architecture. Creating a special VL register for this purpose is just an architectural accommodation for this style of predication.
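The "first n elements" mask construction just described can be sketched like this (a hypothetical scalar model loosely mirroring SVE's WHILELT instruction; `LANES` and the function name are invented for illustration):

```cpp
#include <array>
#include <cstddef>

// Hypothetical model: LANES is the number of elements per vector register.
constexpr std::size_t LANES = 4;

// Build a predicate enabling lane i iff (base + i) < limit -- i.e. exactly
// the "first n elements" that an active-vector-length register would
// select. Loosely mirrors SVE's WHILELT.
std::array<bool, LANES> whilelt(std::size_t base, std::size_t limit) {
  std::array<bool, LANES> mask{};
  for (std::size_t i = 0; i < LANES; ++i)
    mask[i] = (base + i) < limit;
  return mask;
}
```

A loop over 7 elements with 4-lane registers would run its last iteration under the mask `whilelt(4, 7)`, with only the first three lanes enabled, which is precisely what setting an active vector length of 3 would accomplish.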
While it may have significant impact on the microarchitecture and suggest a different mental model to programmers, it's basically just predication from a compiler's perspective.

The vector register size, on the other hand, is not something you change just like that. Changing it is, at best, like deciding to switch from AVX exclusively (i.e., no xmm registers) to SSE exclusively. It changes fundamental properties of your register file and vector unit. While you can easily compile one part of your application one way and another part differently if they don't interact directly, once you try to do this e.g. in the middle of a vectorized code region, it gets difficult even conceptually, to say nothing of the compiler implementation. Furthermore, in the AVX->SSE case you know how the vector length changes, and that might also apply to SVE -- e.g. you could halve vscale and split all your existing N-element vectors into two (N/2)-element vectors each -- but on RISC-V, you probably won't be able to control the vector register size directly, so you can't even do that much.

These differences also affect how to approach the two concepts in a compiler. The active vector length -- very much like a mask -- is just a piece of data that is computed and then used in various vector operations as an extra operand, as David suggested in a recent email. For this reason, I agree with his assessment that the active vector length is "just data flow" and doesn't interact with the type system changes discussed in this RFC. vscale, on the other hand, is not easily handled as "just a piece of data". The size of vector registers impacts many things besides individual operations that are explicit in IR, and as such many parts of the compiler have to be acutely aware of what it is and where it might change.
To give just one example, if you increase the size of vector registers in the middle of a function, you need to reserve more stack space for spilling -- if you just reserve stack space in the prologue using the *initial* register size, you won't have enough space to spill the larger vector values later on. There are myriad more problems like this if you sit down and sift through IR transformations and the CodeGen infrastructure (as I have been doing for RISC-V over the last year).

A change of vscale is best considered a massive barrier to all code that is even remotely vector-related. In Hal's terms, you really want to prevent anything contaminated with the vector register size from crossing the point where you change the vector register size. Like Hal, I am very skeptical how, if at all, such a barrier could be added to IR. And I've spent a lot of time trying to come up with a solution as part of my RISC-V work. That is why my RFC back in April proposed a trade-off, which has been incorporated by Graham into this RFC: vscale can change between functions, but does not change within a function. As an analogy, consider how LLVM supports different subtargets (each with different registers, instructions and legal types) on a per-function basis but doesn't allow e.g. making a register class completely unavailable at a certain point in a function.

> Thanks again,
> Hal
>
> Put differently: I don't think silence is assent here. You really need some
> clear signal of consensus.
>
> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com> wrote:
>>
>> Hi,
>>
>> Are there any objections to going ahead with this? If not, we'll try to
>> get the patches reviewed and committed after the 7.0 branch occurs.
>>
>> -Graham
>>
>> > On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com> wrote:
>> >
>> > Hi,
>> >
>> > I've updated the RFC slightly based on the discussion within the thread,
>> > reposted below.
Let me know if I've missed anything or if more clarification >> > is needed. >> > >> > Thanks, >> > >> > -Graham >> > >> > ============================================================>> > Supporting SIMD instruction sets with variable vector lengths >> > ============================================================>> > >> > In this RFC we propose extending LLVM IR to support code-generation for >> > variable >> > length vector architectures like Arm's SVE or RISC-V's 'V' extension. >> > Our >> > approach is backwards compatible and should be as non-intrusive as >> > possible; the >> > only change needed in other backends is how size is queried on vector >> > types, and >> > it only requires a change in which function is called. We have created a >> > set of >> > proof-of-concept patches to represent a simple vectorized loop in IR and >> > generate SVE instructions from that IR. These patches (listed in section >> > 7 of >> > this rfc) can be found on Phabricator and are intended to illustrate the >> > scope >> > of changes required by the general approach described in this RFC. >> > >> > =========>> > Background >> > =========>> > >> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension >> > for >> > AArch64 which is intended to scale with hardware such that the same >> > binary >> > running on a processor with longer vector registers can take advantage >> > of the >> > increased compute power without recompilation. >> > >> > As the vector length is no longer a compile-time known value, the way in >> > which >> > the LLVM vectorizer generates code requires modifications such that >> > certain >> > values are now runtime evaluated expressions instead of compile-time >> > constants. 
>> >
>> > Documentation for SVE can be found at
>> > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>> >
>> > =======
>> > Contents
>> > =======
>> >
>> > The rest of this RFC covers the following topics:
>> >
>> > 1. Types -- a proposal to extend VectorType to be able to represent vectors that have a length which is a runtime-determined multiple of a known base length.
>> >
>> > 2. Size Queries -- how to reason about the size of types for which the size isn't fully known at compile time.
>> >
>> > 3. Representing the runtime multiple of vector length in IR for use in address calculations and induction variable comparisons.
>> >
>> > 4. Generating 'constant' values in IR for vectors with a runtime-determined number of elements.
>> >
>> > 5. An explanation of splitting/concatenating scalable vectors.
>> >
>> > 6. A brief note on code generation of these new operations for AArch64.
>> >
>> > 7. An example of C code and matching IR using the proposed extensions.
>> >
>> > 8. A list of patches demonstrating the changes required to emit SVE instructions for a loop that has already been vectorized using the extensions described in this RFC.
>> >
>> > =======
>> > 1. Types
>> > =======
>> >
>> > To represent a vector of unknown length, a boolean `Scalable` property has been added to the `VectorType` class, which indicates that the number of elements in the vector is a runtime-determined integer multiple of the `NumElements` field. Most code that deals with vectors doesn't need to know the exact length, but does need to know relative lengths -- e.g. get a vector with the same number of elements but a different element type, or with half or double the number of elements.
>> > >> > In order to allow code to transparently support scalable vectors, we >> > introduce >> > an `ElementCount` class with two members: >> > >> > - `unsigned Min`: the minimum number of elements. >> > - `bool Scalable`: is the element count an unknown multiple of `Min`? >> > >> > For non-scalable vectors (``Scalable=false``) the scale is considered to >> > be >> > equal to one and thus `Min` represents the exact number of elements in >> > the >> > vector. >> > >> > The intent for code working with vectors is to use convenience methods >> > and avoid >> > directly dealing with the number of elements. If needed, calling >> > `getElementCount` on a vector type instead of `getVectorNumElements` can >> > be used >> > to obtain the (potentially scalable) number of elements. Overloaded >> > division and >> > multiplication operators allow an ElementCount instance to be used in >> > much the >> > same manner as an integer for most cases. >> > >> > This mixture of compile-time and runtime quantities allow us to reason >> > about the >> > relationship between different scalable vector types without knowing >> > their >> > exact length. >> > >> > The runtime multiple is not expected to change during program execution >> > for SVE, >> > but it is possible. The model of scalable vectors presented in this RFC >> > assumes >> > that the multiple will be constant within a function but not necessarily >> > across >> > functions. As suggested in the recent RISC-V rfc, a new function >> > attribute to >> > inherit the multiple across function calls will allow for function calls >> > with >> > vector arguments/return values and inlining/outlining optimizations. 
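[A minimal sketch of the `ElementCount` helper described above may help; the member names come from the RFC text, but the operators and layout here are assumptions for illustration, not the actual LLVM implementation:]

```cpp
// Simplified sketch of the proposed ElementCount class; the overloaded
// operators are assumptions based on the RFC's description.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // is the real count an unknown multiple of Min?

  ElementCount operator/(unsigned RHS) const { return {Min / RHS, Scalable}; }
  ElementCount operator*(unsigned RHS) const { return {Min * RHS, Scalable}; }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

[Under this model, `<scalable 4 x i32>` would carry {4, true}, and halving it yields the element count of a `<scalable 2 x ...>` type, without the caller ever needing to know the runtime multiple.]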
>> > >> > IR Textual Form >> > --------------- >> > >> > The textual form for a scalable vector is: >> > >> > ``<scalable <n> x <type>>`` >> > >> > where `type` is the scalar type of each element, `n` is the minimum >> > number of >> > elements, and the string literal `scalable` indicates that the total >> > number of >> > elements is an unknown multiple of `n`; `scalable` is just an arbitrary >> > choice >> > for indicating that the vector is scalable, and could be substituted by >> > another. >> > For fixed-length vectors, the `scalable` is omitted, so there is no >> > change in >> > the format for existing vectors. >> > >> > Scalable vectors with the same `Min` value have the same number of >> > elements, and >> > the same number of bytes if `Min * sizeof(type)` is the same (assuming >> > they are >> > used within the same function): >> > >> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of >> > elements. >> > >> > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number >> > of >> > bytes. >> > >> > IR Bitcode Form >> > --------------- >> > >> > To serialize scalable vectors to bitcode, a new boolean field is added >> > to the >> > type record. If the field is not present the type will default to a >> > fixed-length >> > vector type, preserving backwards compatibility. >> > >> > Alternatives Considered >> > ----------------------- >> > >> > We did consider one main alternative -- a dedicated target type, like >> > the >> > x86_mmx type. >> > >> > A dedicated target type would either need to extend all existing passes >> > that >> > work with vectors to recognize the new type, or to duplicate all that >> > code >> > in order to get reasonable code generation and autovectorization. >> > >> > This hasn't been done for the x86_mmx type, and so it is only capable of >> > providing support for C-level intrinsics instead of being used and >> > recognized by >> > passes inside llvm. 
>> > >> > Although our current solution will need to change some of the code that >> > creates >> > new VectorTypes, much of that code doesn't need to care about whether >> > the types >> > are scalable or not -- they can use preexisting methods like >> > `getHalfElementsVectorType`. If the code is a little more complex, >> > `ElementCount` structs can be used instead of an `unsigned` value to >> > represent >> > the number of elements. >> > >> > ==============>> > 2. Size Queries >> > ==============>> > >> > This is a proposal for how to deal with querying the size of scalable >> > types for >> > analysis of IR. While it has not been implemented in full, the general >> > approach >> > works well for calculating offsets into structures with scalable types >> > in a >> > modified version of ComputeValueVTs in our downstream compiler. >> > >> > For current IR types that have a known size, all query functions return >> > a single >> > integer constant. For scalable types a second integer is needed to >> > indicate the >> > number of bytes/bits which need to be scaled by the runtime multiple to >> > obtain >> > the actual length. >> > >> > For primitive types, `getPrimitiveSizeInBits()` will function as it does >> > today, >> > except that it will no longer return a size for vector types (it will >> > return 0, >> > as it does for other derived types). The majority of calls to this >> > function are >> > already for scalar rather than vector types. >> > >> > For derived types, a function `getScalableSizePairInBits()` will be >> > added, which >> > returns a pair of integers (one to indicate unscaled bits, the other for >> > bits >> > that need to be scaled by the runtime multiple). For backends that do >> > not need >> > to deal with scalable types the existing methods will suffice, but a >> > debug-only >> > assert will be added to them to ensure they aren't used on scalable >> > types. >> > >> > Similar functionality will be added to DataLayout. 
>> >
>> > Comparisons between sizes will use the following methods, assuming that X and Y are non-zero integers and the form is { unscaled, scaled }.
>> >
>> > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>> >
>> > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across functions that inherit vector length. Cannot be compared across non-inheriting functions.
>> >
>> > { X, 0 } > { 0, Y }: Cannot return true.
>> >
>> > { X, 0 } = { 0, Y }: Cannot return true.
>> >
>> > { X, 0 } < { 0, Y }: Can return true.
>> >
>> > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated; need to subtract common terms and try the above comparisons. It may not be possible to get a good answer.
>> >
>> > It's worth noting that we don't expect the last case (mixed scaled and unscaled sizes) to occur. Richard Sandiford's proposed C extensions (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly prohibit mixing fixed-size types into sizeless structs.
>> >
>> > I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled vs. unscaled; I believe the gcc implementation of SVE allows for such results, but that supports a generic polynomial length representation.
>> >
>> > My current intention is to rely on functions that clone or copy values to check whether they are being used to copy scalable vectors across function boundaries without the inherit vlen attribute, and to raise an error there instead of requiring that the Function a given type size comes from be passed in for each comparison. If there's a strong preference for moving the check to the size comparison function, let me know; I will be starting work on patches for this later in the year if there are no major problems with the idea.
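[The comparison rules above can be sketched as a three-valued "known less than" query. This is an illustration of the rules as listed, not the proposed API; the names are invented, `nullopt` stands for the 'maybe/unknown' result discussed, and it assumes a single common runtime multiple (i.e. within a function or across vscale-inheriting functions):]

```cpp
#include <optional>

// { Unscaled, Scaled }: total size is Unscaled + Scaled * vscale,
// where vscale is an unknown integer >= 1.
struct ScalableSize { long Unscaled; long Scaled; };

// Returns true/false when the answer holds for every vscale >= 1,
// nullopt when it depends on the runtime multiple.
std::optional<bool> knownLess(ScalableSize A, ScalableSize B) {
  long U = A.Unscaled - B.Unscaled; // cancel common unscaled terms
  long S = A.Scaled - B.Scaled;     // cancel common scaled terms
  if (S == 0) return U < 0;               // { X, 0 } <cmp> { Y, 0 }
  if (U == 0) return S < 0;               // { 0, X } <cmp> { 0, Y }
  if (S < 0 && U + S < 0) return true;    // less even at vscale == 1
  if (S > 0 && U + S >= 0) return false;  // not less for any vscale
  return std::nullopt;                    // depends on vscale: "maybe"
}
```

[Note how this reproduces the table: an unscaled size can be *known* smaller than a scaled one (e.g. {8, 0} vs. {0, 16}), but never known larger, and mixed-sign leftovers after cancelling common terms yield the "maybe" answer.]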
>> > >> > Future Work >> > ----------- >> > >> > Since we cannot determine the exact size of a scalable vector, the >> > existing logic for alias detection won't work when multiple accesses >> > share a common base pointer with different offsets. >> > >> > However, SVE's predication will mean that a dynamic 'safe' vector length >> > can be determined at runtime, so after initial support has been added we >> > can work on vectorizing loops using runtime predication to avoid >> > aliasing >> > problems. >> > >> > Alternatives Considered >> > ----------------------- >> > >> > Marking scalable vectors as unsized doesn't work well, as many parts of >> > llvm dealing with loads and stores assert that 'isSized()' returns true >> > and make use of the size when calculating offsets. >> > >> > We have considered introducing multiple helper functions instead of >> > using direct size queries, but that doesn't cover all cases. It may >> > still be a good idea to introduce them to make the purpose in a given >> > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. >> > >> > =======================================>> > 3. Representing Vector Length at Runtime >> > =======================================>> > >> > With a scalable vector type defined, we now need a way to represent the >> > runtime >> > length in IR in order to generate addresses for consecutive vectors in >> > memory >> > and determine how many elements have been processed in an iteration of a >> > loop. >> > >> > We have added an experimental `vscale` intrinsic to represent the >> > runtime >> > multiple. Multiplying the result of this intrinsic by the minimum number >> > of >> > elements in a vector gives the total number of elements in a scalable >> > vector. 
>> >
>> > Fixed-Length Code
>> > -----------------
>> >
>> > Assuming a vector type of <4 x <ty>>
>> > ``
>> > vector.body:
>> >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>> >   ;; <loop body>
>> >   ;; Increment induction var
>> >   %index.next = add i64 %index, 4
>> >   ;; <check and branch>
>> > ``
>> >
>> > Scalable Equivalent
>> > -------------------
>> >
>> > Assuming a vector type of <scalable 4 x <ty>>
>> > ``
>> > vector.body:
>> >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>> >   ;; <loop body>
>> >   ;; Increment induction var
>> >   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>> >   ;; <check and branch>
>> > ``
>> >
>> > ==========================
>> > 4. Generating Vector Values
>> > ==========================
>> >
>> > For constant vector values, we cannot specify all the elements as we can for fixed-length vectors; fortunately only a small number of easily synthesized patterns are required for autovectorization. The `zeroinitializer` constant can be used in the same manner as fixed-length vectors for a constant zero splat. This can then be combined with `insertelement` and `shufflevector` to create arbitrary value splats in the same manner as fixed-length vectors.
>> >
>> > For constants consisting of a sequence of values, an experimental `stepvector` intrinsic has been added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new start can be added, and changing the step requires multiplying by a splat.
>> >
>> > Fixed-Length Code
>> > -----------------
>> > ``
>> > ;; Splat a value
>> > %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>> > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>> > ;; Add a constant sequence
>> > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
>> > ``
>> >
>> > Scalable Equivalent
>> > -------------------
>> > ``
>> > ;; Splat a value
>> > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>> > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>> > ;; Splat offset + stride (the same in this case)
>> > %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>> > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>> > ;; Create sequence for scalable vector
>> > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>> > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>> > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>> > ;; Add the runtime-generated sequence
>> > %add = add <scalable 4 x i32> %splat, %addoffset
>> > ``
>> >
>> > Future Work
>> > -----------
>> >
>> > Intrinsics cannot currently be used for constant folding. Our downstream compiler (using Constants instead of intrinsics) relies quite heavily on this for good code generation, so we will need to find new ways to recognize and fold these values.
>> >
>> > ==========================================
>> > 5. Splitting and Combining Scalable Vectors
>> > ==========================================
>> >
>> > Splitting and combining scalable vectors in IR is done in the same manner as for fixed-length vectors, but with a non-constant mask for the shufflevector.
>> >
>> > The following is an example of splitting a <scalable 4 x double> into two separate <scalable 2 x double> values.
>> >
>> > ``
>> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> > ;; Stepvector generates the element ids for first subvector
>> > %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>> > ;; Add vscale * 2 to get the starting element for the second subvector
>> > %ec = mul i64 %vscale64, 2
>> > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>> > %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>> > %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>> > ;; Perform the extracts
>> > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>> > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
>> > ``
>> >
>> > =================
>> > 6. Code Generation
>> > =================
>> >
>> > IR splats will be converted to an experimental splatvector intrinsic in SelectionDAGBuilder.
>> >
>> > All three intrinsics are custom lowered and legalized in the AArch64 backend.
>> >
>> > Two new AArch64ISD nodes have been added to represent the same concepts at the SelectionDAG level, while splatvector maps onto the existing AArch64ISD::DUP.
>> >
>> > GlobalISel
>> > ----------
>> >
>> > Since GlobalISel was enabled by default on AArch64, it was necessary to add scalable vector support to the LowLevelType implementation. A single bit was added to the raw_data representation for vectors and vectors of pointers.
>> >
>> > In addition, types that only exist in destination patterns are planted in the enumeration of available types for generated code.
>> > While this may not be necessary in future, generating an all-true 'ptrue' value was necessary to convert a predicated instruction into an unpredicated one.
>> >
>> > =========
>> > 7. Example
>> > =========
>> >
>> > The following example shows a simple C loop which assigns the array index to the array elements matching that index. The IR shows how vscale and stepvector are used to create the needed values and to advance the index variable in the loop.
>> >
>> > C Code
>> > ------
>> >
>> > ``
>> > void IdentityArrayInit(int *a, int count) {
>> >   for (int i = 0; i < count; ++i)
>> >     a[i] = i;
>> > }
>> > ``
>> >
>> > Scalable IR Vector Body
>> > -----------------------
>> >
>> > ``
>> > vector.body.preheader:
>> >   ;; Other setup
>> >   ;; Stepvector used to create initial identity vector
>> >   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>> >   br vector.body
>> >
>> > vector.body:
>> >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>> >   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>> >
>> >   ;; stepvector used for index identity on entry to loop body ;;
>> >   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], [ %stepvector, %vector.body.preheader ]
>> >   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >   %vscale32 = trunc i64 %vscale64 to i32
>> >   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>> >
>> >   ;; vscale splat used to increment identity vector ;;
>> >   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>> >   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>> >   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>> >   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>> >   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>> >   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3,
align 4 >> > >> > ;; vscale used to increment loop index >> > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) >> > %4 = icmp eq i64 %index.next, %n.vec >> > br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 >> > `` >> > >> > =========>> > 8. Patches >> > =========>> > >> > List of patches: >> > >> > 1. Extend VectorType: https://reviews.llvm.org/D32530 >> > 2. Vector element type Tablegen constraint: >> > https://reviews.llvm.org/D47768 >> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 >> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 >> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771 >> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 >> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 >> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 >> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 >> > 10. Initial store patterns: https://reviews.llvm.org/D47776 >> > 11. Initial addition patterns: https://reviews.llvm.org/D47777 >> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 >> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 >> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 >> > >> > > -- > Hal Finkel > Lead, Compiler Technology and Programming Languages > Leadership Computing Facility > Argonne National Laboratory
Renato Golin via llvm-dev
2018-Jul-31 18:32 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi Robin,

On Tue, 31 Jul 2018 at 19:03, Robin Kruppe <robin.kruppe at gmail.com> wrote:
> 1. How large are the machine's vector registers?

This is the only one I'm talking about. :)

> Like Hal, I am very skeptical how, if at all, such a barrier could be
> added to IR. And I've spent a lot of time trying to come up with a
> solution as part of my RISC-V work. That is why my RFC back in April
> proposed a trade-off, which has been incorporated by Graham into this
> RFC: vscale can change between functions, but does not change within a
> function. As an analogy, consider how LLVM supports different
> subtargets (each with different registers, instructions and legal
> types) on a per-function basis but doesn't allow e.g. making a
> register class completely unavailable at a certain point in a
> function.

Cray seems to use changes in vscale the way we use predication for the last loop iteration, while RISC-V also uses them to give resources away to different functions.

In the former case, they may want to change vscale inside the same function for the last iteration, but given that this is semantically equivalent to shortening predicates, it could be a back-end decision and not an IR one. We could have the same notation for both target behaviours and not have to worry about the boundaries.

In the latter case, it's clear that functions are hard boundaries. Provided, of course, that you either inline all functions called before vectorisation, or, if and only if there is a scalable vector PCS ABI, make sure that all of them have the same length? I haven't thought long enough about the latter, and that's why I was proposing we take a conservative approach and restrict ourselves to what we can actually reasonably do now. I think this is what you and Graham are trying to do, right?

cheers,
--renato
Graham Hunter via llvm-dev
2018-Aug-01 19:00 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi Hal,> On 30 Jul 2018, at 20:10, Hal Finkel <hfinkel at anl.gov> wrote: > > > On 07/30/2018 05:34 AM, Chandler Carruth wrote: >> I strongly suspect that there remains widespread concern with the direction of this, I know I have them. >> >> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback. >> >> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread. > > At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.Thanks, that's good to hear.> 1. >> This is a proposal for how to deal with querying the size of scalable types for >> > analysis of IR. While it has not been implemented in full, > > Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.> 2. 
> I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.

I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between functions. For SVE, the h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.

Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but it can be thought of (at a high level) as an implicit predicate on all operations.

-Graham

> Thanks again,
> Hal
>
>> Put differently: I don't think silence is assent here. You really need some clear signal of consensus.
>>
>> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com> wrote:
>> Hi,
>>
>> Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.
>> >> -Graham >> >> > On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com> wrote: >> > >> > Hi, >> > >> > I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed. >> > >> > Thanks, >> > >> > -Graham >> > >> > ============================================================>> > Supporting SIMD instruction sets with variable vector lengths >> > ============================================================>> > >> > In this RFC we propose extending LLVM IR to support code-generation for variable >> > length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our >> > approach is backwards compatible and should be as non-intrusive as possible; the >> > only change needed in other backends is how size is queried on vector types, and >> > it only requires a change in which function is called. We have created a set of >> > proof-of-concept patches to represent a simple vectorized loop in IR and >> > generate SVE instructions from that IR. These patches (listed in section 7 of >> > this rfc) can be found on Phabricator and are intended to illustrate the scope >> > of changes required by the general approach described in this RFC. >> > >> > =========>> > Background >> > =========>> > >> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for >> > AArch64 which is intended to scale with hardware such that the same binary >> > running on a processor with longer vector registers can take advantage of the >> > increased compute power without recompilation. >> > >> > As the vector length is no longer a compile-time known value, the way in which >> > the LLVM vectorizer generates code requires modifications such that certain >> > values are now runtime evaluated expressions instead of compile-time constants. 
>> > >> > Documentation for SVE can be found at >> > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a >> > >> > =======>> > Contents >> > =======>> > >> > The rest of this RFC covers the following topics: >> > >> > 1. Types -- a proposal to extend VectorType to be able to represent vectors that >> > have a length which is a runtime-determined multiple of a known base length. >> > >> > 2. Size Queries - how to reason about the size of types for which the size isn't >> > fully known at compile time. >> > >> > 3. Representing the runtime multiple of vector length in IR for use in address >> > calculations and induction variable comparisons. >> > >> > 4. Generating 'constant' values in IR for vectors with a runtime-determined >> > number of elements. >> > >> > 5. An explanation of splitting/concatenating scalable vectors. >> > >> > 6. A brief note on code generation of these new operations for AArch64. >> > >> > 7. An example of C code and matching IR using the proposed extensions. >> > >> > 8. A list of patches demonstrating the changes required to emit SVE instructions >> > for a loop that has already been vectorized using the extensions described >> > in this RFC. >> > >> > =======>> > 1. Types >> > =======>> > >> > To represent a vector of unknown length a boolean `Scalable` property has been >> > added to the `VectorType` class, which indicates that the number of elements in >> > the vector is a runtime-determined integer multiple of the `NumElements` field. >> > Most code that deals with vectors doesn't need to know the exact length, but >> > does need to know relative lengths -- e.g. get a vector with the same number of >> > elements but a different element type, or with half or double the number of >> > elements. 
>> > >> > In order to allow code to transparently support scalable vectors, we introduce >> > an `ElementCount` class with two members: >> > >> > - `unsigned Min`: the minimum number of elements. >> > - `bool Scalable`: is the element count an unknown multiple of `Min`? >> > >> > For non-scalable vectors (``Scalable=false``) the scale is considered to be >> > equal to one and thus `Min` represents the exact number of elements in the >> > vector. >> > >> > The intent for code working with vectors is to use convenience methods and avoid >> > directly dealing with the number of elements. If needed, calling >> > `getElementCount` on a vector type instead of `getVectorNumElements` can be used >> > to obtain the (potentially scalable) number of elements. Overloaded division and >> > multiplication operators allow an ElementCount instance to be used in much the >> > same manner as an integer for most cases. >> > >> > This mixture of compile-time and runtime quantities allow us to reason about the >> > relationship between different scalable vector types without knowing their >> > exact length. >> > >> > The runtime multiple is not expected to change during program execution for SVE, >> > but it is possible. The model of scalable vectors presented in this RFC assumes >> > that the multiple will be constant within a function but not necessarily across >> > functions. As suggested in the recent RISC-V rfc, a new function attribute to >> > inherit the multiple across function calls will allow for function calls with >> > vector arguments/return values and inlining/outlining optimizations. 
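For illustration, the `ElementCount` pair described above might look roughly like the following C++ sketch. This is hypothetical code, not the actual LLVM implementation; the overloaded operators stand in for the convenience methods (such as `getHalfElementsVectorType`) that the RFC says most code should use instead of raw element counts.

```cpp
#include <cassert>

// Hypothetical sketch of the ElementCount pair from the RFC; the real
// LLVM class may differ in detail.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // is the element count an unknown multiple of Min?

  ElementCount(unsigned Min, bool Scalable) : Min(Min), Scalable(Scalable) {}

  // Halve or double the element count while preserving scalability, so
  // code asking for "same vector, half the lanes" works transparently for
  // both fixed-length and scalable vectors.
  ElementCount operator/(unsigned RHS) const {
    assert(Min % RHS == 0 && "element count not evenly divisible");
    return ElementCount(Min / RHS, Scalable);
  }
  ElementCount operator*(unsigned RHS) const {
    return ElementCount(Min * RHS, Scalable);
  }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For example, `<scalable 4 x i32>` would carry `{Min=4, Scalable=true}`; halving it yields `{2, true}`, i.e. `<scalable 2 x i32>`, without any code needing the runtime multiple.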
>> > >> > IR Textual Form >> > --------------- >> > >> > The textual form for a scalable vector is: >> > >> > ``<scalable <n> x <type>>`` >> > >> > where `type` is the scalar type of each element, `n` is the minimum number of >> > elements, and the string literal `scalable` indicates that the total number of >> > elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice >> > for indicating that the vector is scalable, and could be substituted by another. >> > For fixed-length vectors, the `scalable` is omitted, so there is no change in >> > the format for existing vectors. >> > >> > Scalable vectors with the same `Min` value have the same number of elements, and >> > the same number of bytes if `Min * sizeof(type)` is the same (assuming they are >> > used within the same function): >> > >> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of >> > elements. >> > >> > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of >> > bytes. >> > >> > IR Bitcode Form >> > --------------- >> > >> > To serialize scalable vectors to bitcode, a new boolean field is added to the >> > type record. If the field is not present the type will default to a fixed-length >> > vector type, preserving backwards compatibility. >> > >> > Alternatives Considered >> > ----------------------- >> > >> > We did consider one main alternative -- a dedicated target type, like the >> > x86_mmx type. >> > >> > A dedicated target type would either need to extend all existing passes that >> > work with vectors to recognize the new type, or to duplicate all that code >> > in order to get reasonable code generation and autovectorization. >> > >> > This hasn't been done for the x86_mmx type, and so it is only capable of >> > providing support for C-level intrinsics instead of being used and recognized by >> > passes inside llvm. 
>> > >> > Although our current solution will need to change some of the code that creates >> > new VectorTypes, much of that code doesn't need to care about whether the types >> > are scalable or not -- they can use preexisting methods like >> > `getHalfElementsVectorType`. If the code is a little more complex, >> > `ElementCount` structs can be used instead of an `unsigned` value to represent >> > the number of elements. >> > >> > ==============>> > 2. Size Queries >> > ==============>> > >> > This is a proposal for how to deal with querying the size of scalable types for >> > analysis of IR. While it has not been implemented in full, the general approach >> > works well for calculating offsets into structures with scalable types in a >> > modified version of ComputeValueVTs in our downstream compiler. >> > >> > For current IR types that have a known size, all query functions return a single >> > integer constant. For scalable types a second integer is needed to indicate the >> > number of bytes/bits which need to be scaled by the runtime multiple to obtain >> > the actual length. >> > >> > For primitive types, `getPrimitiveSizeInBits()` will function as it does today, >> > except that it will no longer return a size for vector types (it will return 0, >> > as it does for other derived types). The majority of calls to this function are >> > already for scalar rather than vector types. >> > >> > For derived types, a function `getScalableSizePairInBits()` will be added, which >> > returns a pair of integers (one to indicate unscaled bits, the other for bits >> > that need to be scaled by the runtime multiple). For backends that do not need >> > to deal with scalable types the existing methods will suffice, but a debug-only >> > assert will be added to them to ensure they aren't used on scalable types. >> > >> > Similar functionality will be added to DataLayout. 
>> > >> > Comparisons between sizes will use the following methods, assuming that X and >> > Y are non-zero integers and the form is { unscaled, scaled }. >> > >> > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison. >> > >> > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across >> > functions that inherit vector length. Cannot be >> > compared across non-inheriting functions. >> > >> > { X, 0 } > { 0, Y }: Cannot return true. >> > >> > { X, 0 } = { 0, Y }: Cannot return true. >> > >> > { X, 0 } < { 0, Y }: Can return true. >> > >> > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common >> > terms and try the above comparisons; it >> > may not be possible to get a good answer. >> > >> > It's worth noting that we don't expect the last case (mixed scaled and >> > unscaled sizes) to occur. Richard Sandiford's proposed C extensions >> > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly >> > prohibit mixing fixed-size types into sizeless structs. >> > >> > I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled >> > vs. unscaled; I believe the gcc implementation of SVE allows for such >> > results, but that supports a generic polynomial length representation. >> > >> > My current intention is to rely on functions that clone or copy values to >> > check whether they are being used to copy scalable vectors across function >> > boundaries without the inherit vlen attribute and raise an error there, instead >> > of requiring that the Function a type size came from be passed in for each >> > comparison. If there's a strong preference for moving the check to the size >> > comparison function let me know; I will be starting work on patches for this >> > later in the year if there are no major problems with the idea. 
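The comparison rules above can be sketched as a small C++ routine. This is invented illustrative code, not LLVM's API: a size is modelled as `Unscaled + Scaled * vscale` bits with vscale an unknown positive integer, and a definite answer is returned only when it holds for every possible vscale. (A real implementation would also need to track which function each size came from, since scaled-vs-scaled comparisons are only valid within a function or across vlen-inheriting functions.)

```cpp
#include <cassert>

// Possible outcomes of a conservative size comparison.
enum class SizeCmp { Smaller, Equal, Larger, Unknown };

// A size of the form Unscaled + Scaled * vscale (in bits), where vscale is
// an unknown positive integer. Names are illustrative, not LLVM's.
struct ScalableSize {
  unsigned Unscaled;
  unsigned Scaled;
};

// Return a definite ordering only if it holds for all vscale >= 1.
SizeCmp compareSizes(ScalableSize L, ScalableSize R) {
  if (L.Unscaled == R.Unscaled && L.Scaled == R.Scaled)
    return SizeCmp::Equal;
  // Evaluate both sizes at the smallest possible vscale (vscale == 1).
  unsigned LMin = L.Unscaled + L.Scaled;
  unsigned RMin = R.Unscaled + R.Scaled;
  // L < R for every vscale iff it holds at vscale == 1 and L's scaled term
  // grows no faster than R's (symmetrically for L > R). In particular, a
  // purely fixed size can never be known to exceed a purely scaled one,
  // matching "{ X, 0 } > { 0, Y }: Cannot return true" above.
  if (LMin < RMin && L.Scaled <= R.Scaled)
    return SizeCmp::Smaller;
  if (LMin > RMin && L.Scaled >= R.Scaled)
    return SizeCmp::Larger;
  return SizeCmp::Unknown; // mixed terms: may be no good answer
}
```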
>> > >> > Future Work >> > ----------- >> > >> > Since we cannot determine the exact size of a scalable vector, the >> > existing logic for alias detection won't work when multiple accesses >> > share a common base pointer with different offsets. >> > >> > However, SVE's predication will mean that a dynamic 'safe' vector length >> > can be determined at runtime, so after initial support has been added we >> > can work on vectorizing loops using runtime predication to avoid aliasing >> > problems. >> > >> > Alternatives Considered >> > ----------------------- >> > >> > Marking scalable vectors as unsized doesn't work well, as many parts of >> > llvm dealing with loads and stores assert that 'isSized()' returns true >> > and make use of the size when calculating offsets. >> > >> > We have considered introducing multiple helper functions instead of >> > using direct size queries, but that doesn't cover all cases. It may >> > still be a good idea to introduce them to make the purpose in a given >> > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. >> > >> > =======================================>> > 3. Representing Vector Length at Runtime >> > =======================================>> > >> > With a scalable vector type defined, we now need a way to represent the runtime >> > length in IR in order to generate addresses for consecutive vectors in memory >> > and determine how many elements have been processed in an iteration of a loop. >> > >> > We have added an experimental `vscale` intrinsic to represent the runtime >> > multiple. Multiplying the result of this intrinsic by the minimum number of >> > elements in a vector gives the total number of elements in a scalable vector. 
>> > >> > Fixed-Length Code >> > ----------------- >> > >> > Assuming a vector type of <4 x <ty>> >> > `` >> > vector.body: >> > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] >> > ;; <loop body> >> > ;; Increment induction var >> > %index.next = add i64 %index, 4 >> > ;; <check and branch> >> > `` >> > Scalable Equivalent >> > ------------------- >> > >> > Assuming a vector type of <scalable 4 x <ty>> >> > `` >> > vector.body: >> > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] >> > ;; <loop body> >> > ;; Increment induction var >> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >> > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) >> > ;; <check and branch> >> > `` >> > ==========================>> > 4. Generating Vector Values >> > ==========================>> > For constant vector values, we cannot specify all the elements as we can for >> > fixed-length vectors; fortunately only a small number of easily synthesized >> > patterns are required for autovectorization. The `zeroinitializer` constant >> > can be used in the same manner as fixed-length vectors for a constant zero >> > splat. This can then be combined with `insertelement` and `shufflevector` >> > to create arbitrary value splats in the same manner as fixed-length vectors. >> > >> > For constants consisting of a sequence of values, an experimental `stepvector` >> > intrinsic has been added to represent a simple constant of the form >> > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new >> > start can be added, and changing the step requires multiplying by a splat. 
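As a purely scalar model of those semantics (illustrative only, not proposed code), `stepvector` fills lane `i` with `i`, and an arbitrary arithmetic sequence is then built by multiplying by a stride splat and adding a start splat, exactly as the IR below does:

```cpp
#include <cassert>
#include <vector>

// Scalar model of the stepvector intrinsic: for a vector of n lanes
// (n = vscale * min_elems at runtime), lane i holds the value i.
std::vector<int> stepvector(unsigned NumElems) {
  std::vector<int> V(NumElems);
  for (unsigned I = 0; I < NumElems; ++I)
    V[I] = static_cast<int>(I);
  return V;
}

// Sequence <Start, Start+Stride, Start+2*Stride, ...>: stepvector times a
// stride splat, plus a start splat.
std::vector<int> sequence(unsigned NumElems, int Start, int Stride) {
  std::vector<int> V = stepvector(NumElems);
  for (int &Lane : V)
    Lane = Start + Lane * Stride;
  return V;
}
```

With start and stride both 2 (as in the scalable IR example that follows), four lanes give `<2, 4, 6, 8>`, matching the fixed-length constant.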
>> > >> > Fixed-Length Code >> > ----------------- >> > `` >> > ;; Splat a value >> > %insert = insertelement <4 x i32> undef, i32 %value, i32 0 >> > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer >> > ;; Add a constant sequence >> > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8> >> > `` >> > Scalable Equivalent >> > ------------------- >> > `` >> > ;; Splat a value >> > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0 >> > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer >> > ;; Splat offset + stride (the same in this case) >> > %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0 >> > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer >> > ;; Create sequence for scalable vector >> > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() >> > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off >> > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off >> > ;; Add the runtime-generated sequence >> > %add = add <scalable 4 x i32> %splat, %addoffset >> > `` >> > Future Work >> > ----------- >> > >> > Intrinsics cannot currently be used for constant folding. Our downstream >> > compiler (using Constants instead of intrinsics) relies quite heavily on this >> > for good code generation, so we will need to find new ways to recognize and >> > fold these values. >> > >> > ==========================================>> > 5. Splitting and Combining Scalable Vectors >> > ==========================================>> > >> > Splitting and combining scalable vectors in IR is done in the same manner as >> > for fixed-length vectors, but with a non-constant mask for the shufflevector. >> > >> > The following is an example of splitting a <scalable 4 x double> into two >> > separate <scalable 2 x double> values. 
>> > >> > `` >> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >> > ;; Stepvector generates the element ids for first subvector >> > %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64() >> > ;; Add vscale * 2 to get the starting element for the second subvector >> > %ec = mul i64 %vscale64, 2 >> > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0 >> > %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer >> > %sv2 = add <scalable 2 x i64> %ec.splat, %sv1 >> > ;; Perform the extracts >> > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1 >> > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2 >> > `` >> > >> > =================>> > 6. Code Generation >> > =================>> > >> > IR splats will be converted to an experimental splatvector intrinsic in >> > SelectionDAGBuilder. >> > >> > All three intrinsics are custom lowered and legalized in the AArch64 backend. >> > >> > Two new AArch64ISD nodes have been added to represent the same concepts >> > at the SelectionDAG level, while splatvector maps onto the existing >> > AArch64ISD::DUP. >> > >> > GlobalISel >> > ---------- >> > >> > Since GlobalISel was enabled by default on AArch64, it was necessary to add >> > scalable vector support to the LowLevelType implementation. A single bit was >> > added to the raw_data representation for vectors and vectors of pointers. >> > >> > In addition, types that only exist in destination patterns are planted in >> > the enumeration of available types for generated code. While this may not be >> > necessary in future, generating an all-true 'ptrue' value was necessary to >> > convert a predicated instruction into an unpredicated one. >> > >> > =========>> > 7. 
Example >> > =========>> > >> > The following example shows a simple C loop which assigns the array index to >> > the array elements matching that index. The IR shows how vscale and stepvector >> > are used to create the needed values and to advance the index variable in the >> > loop. >> > >> > C Code >> > ------ >> > >> > `` >> > void IdentityArrayInit(int *a, int count) { >> > for (int i = 0; i < count; ++i) >> > a[i] = i; >> > } >> > `` >> > >> > Scalable IR Vector Body >> > ----------------------- >> > >> > `` >> > vector.body.preheader: >> > ;; Other setup >> > ;; Stepvector used to create initial identity vector >> > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() >> > br label %vector.body >> > >> > vector.body: >> > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] >> > %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ] >> > >> > ;; stepvector used for index identity on entry to loop body ;; >> > %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], >> > [ %stepvector, %vector.body.preheader ] >> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >> > %vscale32 = trunc i64 %vscale64 to i32 >> > %1 = add i64 %0, mul (i64 %vscale64, i64 4) >> > >> > ;; vscale splat used to increment identity vector ;; >> > %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0 >> > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer >> > %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat >> > %2 = getelementptr inbounds i32, i32* %a, i64 %0 >> > %3 = bitcast i32* %2 to <scalable 4 x i32>* >> > store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4 >> > >> > ;; vscale used to increment loop index >> > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) >> > %4 = icmp eq i64 %index.next, %n.vec >> > br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 >> > `` >> > 
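A scalar C++ model of the vectorized loop above may help illustrate the control flow. This is illustrative only: it fixes a particular vscale for the whole function (as the RFC requires), and the scalar remainder loop stands in for the middle/scalar blocks not shown in the IR.

```cpp
#include <cassert>
#include <vector>

// Scalar model of the vectorized IdentityArrayInit loop: each iteration of
// the outer loop processes one whole scalable vector of vscale * 4
// elements. 'VScale' is constant for the duration of the function.
void IdentityArrayInit(std::vector<int> &A, unsigned VScale) {
  unsigned VL = VScale * 4;                  // elements per scalable vector
  unsigned N = (unsigned)A.size() / VL * VL; // %n.vec: largest multiple of VL
  for (unsigned Index = 0; Index < N; Index += VL)   // %index.next
    for (unsigned Lane = 0; Lane < VL; ++Lane)       // one vector store
      A[Index + Lane] = (int)(Index + Lane);         // %vec.ind7 identity
  for (unsigned I = N; I < A.size(); ++I)    // scalar remainder (not shown
    A[I] = (int)I;                           // in the IR excerpt above)
}
```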
>> > =========>> > 8. Patches >> > =========>> > >> > List of patches: >> > >> > 1. Extend VectorType: https://reviews.llvm.org/D32530 >> > 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768 >> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 >> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 >> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771 >> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 >> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 >> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 >> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 >> > 10. Initial store patterns: https://reviews.llvm.org/D47776 >> > 11. Initial addition patterns: https://reviews.llvm.org/D47777 >> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 >> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 >> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 >> > >> > > -- > Hal Finkel > Lead, Compiler Technology and Programming Languages > Leadership Computing Facility > Argonne National Laboratory >
Hal Finkel via llvm-dev
2018-Aug-01 19:43 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 08/01/2018 02:00 PM, Graham Hunter wrote:> Hi Hal, > >> On 30 Jul 2018, at 20:10, Hal Finkel <hfinkel at anl.gov> wrote: >> >> >> On 07/30/2018 05:34 AM, Chandler Carruth wrote: >>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them. >>> >>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback. >>> >>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread. >> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless. > Thanks, that's good to hear. > >> 1. >>> This is a proposal for how to deal with querying the size of scalable types for >>>> analysis of IR. While it has not been implemented in full, >> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified. > Yes. 
> I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.

At least on this point, I think that we'll want to have the implementation to help make sure there aren't important details we're overlooking.

>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.

I think that this will likely work, although I think we want to invert the sense of the attribute. vscale should be inherited by default, and some attribute can say that this isn't so. That same attribute, I imagine, will also forbid scalable vector function arguments and return values on those functions. 
If we don't have inherited vscale as the default, we place an implicit contract on any IR transformation that performs outlining: it needs to scan for certain kinds of vector operations and add the special attribute, or just always add this special attribute, and that just becomes another special case, which will only actually manifest on certain platforms, that it's best to avoid.

> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but it can be thought of (at a high level) as an implicit predicate on all operations.

My point is that, while there may be some sense in which the details can be worked out later, we need to have a good-enough understanding of how this will work now in order to make sure that we're not making design decisions now that make handling the dynamic vscale in a reasonable way later more difficult.

Thanks again,
Hal

> -Graham
>
>> Thanks again,
>> Hal
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Graham Hunter via llvm-dev
2018-Aug-03 10:31 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi, Just a quick question about bitcode format changes; is there anything special I should be doing for that beyond ensuring the reader can still process older bitcode files correctly? The code in the main patch will always emit 3 records for a vector type (as opposed to the current 2), but we could omit the third field for fixed-length vectors if that's preferable. -Graham> On 1 Aug 2018, at 20:00, Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Hi Hal, > >> On 30 Jul 2018, at 20:10, Hal Finkel <hfinkel at anl.gov> wrote: >> >> >> On 07/30/2018 05:34 AM, Chandler Carruth wrote: >>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them. >>> >>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback. >>> >>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread. >> >> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless. > > Thanks, that's good to hear. > >> 1. >>> This is a proposal for how to deal with querying the size of scalable types for >>>> analysis of IR. 
While it has not been implemented in full, >> >> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified. > > Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches. > >> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design. > > I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired. > > Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but can be though of (at a high level) as an implicit predicate on all operations. > > -Graham > >> >> Thanks again, >> Hal >> >>> >>> Put differently: I don't think silence is assent here. 
You really need some clear signal of consensus. >>> >>> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com> wrote: >>> Hi, >>> >>> Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs. >>> >>> -Graham >>> >>>> On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com> wrote: >>>> >>>> Hi, >>>> >>>> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed. >>>> >>>> Thanks, >>>> >>>> -Graham >>>> >>>> ============================================================>>>> Supporting SIMD instruction sets with variable vector lengths >>>> ============================================================>>>> >>>> In this RFC we propose extending LLVM IR to support code-generation for variable >>>> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our >>>> approach is backwards compatible and should be as non-intrusive as possible; the >>>> only change needed in other backends is how size is queried on vector types, and >>>> it only requires a change in which function is called. We have created a set of >>>> proof-of-concept patches to represent a simple vectorized loop in IR and >>>> generate SVE instructions from that IR. These patches (listed in section 8 of >>>> this RFC) can be found on Phabricator and are intended to illustrate the scope >>>> of changes required by the general approach described in this RFC. >>>> >>>> =========>>>> Background >>>> =========>>>> >>>> *ARMv8-A Scalable Vector Extension* (SVE) is a new vector ISA extension for >>>> AArch64 which is intended to scale with hardware such that the same binary >>>> running on a processor with longer vector registers can take advantage of the >>>> increased compute power without recompilation.
>>>> >>>> As the vector length is no longer a compile-time known value, the way in which >>>> the LLVM vectorizer generates code requires modifications such that certain >>>> values are now runtime evaluated expressions instead of compile-time constants. >>>> >>>> Documentation for SVE can be found at >>>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a >>>> >>>> =======>>>> Contents >>>> =======>>>> >>>> The rest of this RFC covers the following topics: >>>> >>>> 1. Types -- a proposal to extend VectorType to be able to represent vectors that >>>> have a length which is a runtime-determined multiple of a known base length. >>>> >>>> 2. Size Queries - how to reason about the size of types for which the size isn't >>>> fully known at compile time. >>>> >>>> 3. Representing the runtime multiple of vector length in IR for use in address >>>> calculations and induction variable comparisons. >>>> >>>> 4. Generating 'constant' values in IR for vectors with a runtime-determined >>>> number of elements. >>>> >>>> 5. An explanation of splitting/concatenating scalable vectors. >>>> >>>> 6. A brief note on code generation of these new operations for AArch64. >>>> >>>> 7. An example of C code and matching IR using the proposed extensions. >>>> >>>> 8. A list of patches demonstrating the changes required to emit SVE instructions >>>> for a loop that has already been vectorized using the extensions described >>>> in this RFC. >>>> >>>> =======>>>> 1. Types >>>> =======>>>> >>>> To represent a vector of unknown length a boolean `Scalable` property has been >>>> added to the `VectorType` class, which indicates that the number of elements in >>>> the vector is a runtime-determined integer multiple of the `NumElements` field. >>>> Most code that deals with vectors doesn't need to know the exact length, but >>>> does need to know relative lengths -- e.g.
get a vector with the same number of >>>> elements but a different element type, or with half or double the number of >>>> elements. >>>> >>>> In order to allow code to transparently support scalable vectors, we introduce >>>> an `ElementCount` class with two members: >>>> >>>> - `unsigned Min`: the minimum number of elements. >>>> - `bool Scalable`: is the element count an unknown multiple of `Min`? >>>> >>>> For non-scalable vectors (``Scalable=false``) the scale is considered to be >>>> equal to one and thus `Min` represents the exact number of elements in the >>>> vector. >>>> >>>> The intent for code working with vectors is to use convenience methods and avoid >>>> directly dealing with the number of elements. If needed, calling >>>> `getElementCount` on a vector type instead of `getVectorNumElements` can be used >>>> to obtain the (potentially scalable) number of elements. Overloaded division and >>>> multiplication operators allow an ElementCount instance to be used in much the >>>> same manner as an integer for most cases. >>>> >>>> This mixture of compile-time and runtime quantities allows us to reason about the >>>> relationship between different scalable vector types without knowing their >>>> exact length. >>>> >>>> The runtime multiple is not expected to change during program execution for SVE, >>>> but it is possible. The model of scalable vectors presented in this RFC assumes >>>> that the multiple will be constant within a function but not necessarily across >>>> functions. As suggested in the recent RISC-V RFC, a new function attribute to >>>> inherit the multiple across function calls will allow for function calls with >>>> vector arguments/return values and inlining/outlining optimizations.
>>>> >>>> IR Textual Form >>>> --------------- >>>> >>>> The textual form for a scalable vector is: >>>> >>>> ``<scalable <n> x <type>>`` >>>> >>>> where `type` is the scalar type of each element, `n` is the minimum number of >>>> elements, and the string literal `scalable` indicates that the total number of >>>> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice >>>> for indicating that the vector is scalable, and could be substituted with another keyword. >>>> For fixed-length vectors, `scalable` is omitted, so there is no change in >>>> the format for existing vectors. >>>> >>>> Scalable vectors with the same `Min` value have the same number of elements, and >>>> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are >>>> used within the same function): >>>> >>>> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of >>>> elements. >>>> >>>> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of >>>> bytes. >>>> >>>> IR Bitcode Form >>>> --------------- >>>> >>>> To serialize scalable vectors to bitcode, a new boolean field is added to the >>>> type record. If the field is not present the type will default to a fixed-length >>>> vector type, preserving backwards compatibility. >>>> >>>> Alternatives Considered >>>> ----------------------- >>>> >>>> We did consider one main alternative -- a dedicated target type, like the >>>> x86_mmx type. >>>> >>>> A dedicated target type would either need to extend all existing passes that >>>> work with vectors to recognize the new type, or to duplicate all that code >>>> in order to get reasonable code generation and autovectorization. >>>> >>>> This hasn't been done for the x86_mmx type, and so it is only capable of >>>> providing support for C-level intrinsics instead of being used and recognized by >>>> passes inside LLVM.
>>>> >>>> Although our current solution will need to change some of the code that creates >>>> new VectorTypes, much of that code doesn't need to care about whether the types >>>> are scalable or not -- they can use preexisting methods like >>>> `getHalfElementsVectorType`. If the code is a little more complex, >>>> `ElementCount` structs can be used instead of an `unsigned` value to represent >>>> the number of elements. >>>> >>>> ==============>>>> 2. Size Queries >>>> ==============>>>> >>>> This is a proposal for how to deal with querying the size of scalable types for >>>> analysis of IR. While it has not been implemented in full, the general approach >>>> works well for calculating offsets into structures with scalable types in a >>>> modified version of ComputeValueVTs in our downstream compiler. >>>> >>>> For current IR types that have a known size, all query functions return a single >>>> integer constant. For scalable types a second integer is needed to indicate the >>>> number of bytes/bits which need to be scaled by the runtime multiple to obtain >>>> the actual length. >>>> >>>> For primitive types, `getPrimitiveSizeInBits()` will function as it does today, >>>> except that it will no longer return a size for vector types (it will return 0, >>>> as it does for other derived types). The majority of calls to this function are >>>> already for scalar rather than vector types. >>>> >>>> For derived types, a function `getScalableSizePairInBits()` will be added, which >>>> returns a pair of integers (one to indicate unscaled bits, the other for bits >>>> that need to be scaled by the runtime multiple). For backends that do not need >>>> to deal with scalable types the existing methods will suffice, but a debug-only >>>> assert will be added to them to ensure they aren't used on scalable types. >>>> >>>> Similar functionality will be added to DataLayout. 
>>>> >>>> Comparisons between sizes will use the following methods, assuming that X and >>>> Y are non-zero integers and the form is { unscaled, scaled }. >>>> >>>> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison. >>>> >>>> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across >>>> functions that inherit vector length. Cannot be >>>> compared across non-inheriting functions. >>>> >>>> { X, 0 } > { 0, Y }: Cannot return true. >>>> >>>> { X, 0 } = { 0, Y }: Cannot return true. >>>> >>>> { X, 0 } < { 0, Y }: Can return true. >>>> >>>> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common >>>> terms and try the above comparisons; it >>>> may not be possible to get a good answer. >>>> >>>> It's worth noting that we don't expect the last case (mixed scaled and >>>> unscaled sizes) to occur. Richard Sandiford's proposed C extensions >>>> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly >>>> prohibit mixing fixed-size types into sizeless structs. >>>> >>>> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled >>>> vs. unscaled; I believe the gcc implementation of SVE allows for such >>>> results, but that supports a generic polynomial length representation. >>>> >>>> My current intention is to rely on functions that clone or copy values to >>>> check whether they are being used to copy scalable vectors across function >>>> boundaries without the inherit vlen attribute and raise an error there instead >>>> of requiring that the Function a type size comes from be passed in for each comparison. If >>>> there's a strong preference for moving the check to the size comparison function >>>> let me know; I will be starting work on patches for this later in the year if >>>> there are no major problems with the idea.
>>>> >>>> Future Work >>>> ----------- >>>> >>>> Since we cannot determine the exact size of a scalable vector, the >>>> existing logic for alias detection won't work when multiple accesses >>>> share a common base pointer with different offsets. >>>> >>>> However, SVE's predication will mean that a dynamic 'safe' vector length >>>> can be determined at runtime, so after initial support has been added we >>>> can work on vectorizing loops using runtime predication to avoid aliasing >>>> problems. >>>> >>>> Alternatives Considered >>>> ----------------------- >>>> >>>> Marking scalable vectors as unsized doesn't work well, as many parts of >>>> llvm dealing with loads and stores assert that 'isSized()' returns true >>>> and make use of the size when calculating offsets. >>>> >>>> We have considered introducing multiple helper functions instead of >>>> using direct size queries, but that doesn't cover all cases. It may >>>> still be a good idea to introduce them to make the purpose in a given >>>> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. >>>> >>>> =======================================>>>> 3. Representing Vector Length at Runtime >>>> =======================================>>>> >>>> With a scalable vector type defined, we now need a way to represent the runtime >>>> length in IR in order to generate addresses for consecutive vectors in memory >>>> and determine how many elements have been processed in an iteration of a loop. >>>> >>>> We have added an experimental `vscale` intrinsic to represent the runtime >>>> multiple. Multiplying the result of this intrinsic by the minimum number of >>>> elements in a vector gives the total number of elements in a scalable vector. 
>>>> >>>> Fixed-Length Code >>>> ----------------- >>>> >>>> Assuming a vector type of <4 x <ty>> >>>> `` >>>> vector.body: >>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] >>>> ;; <loop body> >>>> ;; Increment induction var >>>> %index.next = add i64 %index, 4 >>>> ;; <check and branch> >>>> `` >>>> Scalable Equivalent >>>> ------------------- >>>> >>>> Assuming a vector type of <scalable 4 x <ty>> >>>> `` >>>> vector.body: >>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] >>>> ;; <loop body> >>>> ;; Increment induction var >>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >>>> %index.next = add i64 %index, mul (i64 %vscale64, i64 4) >>>> ;; <check and branch> >>>> `` >>>> ==========================>>>> 4. Generating Vector Values >>>> ==========================>>>> For constant vector values, we cannot specify all the elements as we can for >>>> fixed-length vectors; fortunately only a small number of easily synthesized >>>> patterns are required for autovectorization. The `zeroinitializer` constant >>>> can be used in the same manner as fixed-length vectors for a constant zero >>>> splat. This can then be combined with `insertelement` and `shufflevector` >>>> to create arbitrary value splats in the same manner as fixed-length vectors. >>>> >>>> For constants consisting of a sequence of values, an experimental `stepvector` >>>> intrinsic has been added to represent a simple constant of the form >>>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new >>>> start can be added, and changing the step requires multiplying by a splat. 
>>>> >>>> Fixed-Length Code >>>> ----------------- >>>> `` >>>> ;; Splat a value >>>> %insert = insertelement <4 x i32> undef, i32 %value, i32 0 >>>> %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer >>>> ;; Add a constant sequence >>>> %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8> >>>> `` >>>> Scalable Equivalent >>>> ------------------- >>>> `` >>>> ;; Splat a value >>>> %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0 >>>> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer >>>> ;; Splat offset + stride (the same in this case) >>>> %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0 >>>> %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer >>>> ;; Create sequence for scalable vector >>>> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() >>>> %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off >>>> %addoffset = add <scalable 4 x i32> %mulbystride, %str_off >>>> ;; Add the runtime-generated sequence >>>> %add = add <scalable 4 x i32> %splat, %addoffset >>>> `` >>>> Future Work >>>> ----------- >>>> >>>> Intrinsics cannot currently be used for constant folding. Our downstream >>>> compiler (using Constants instead of intrinsics) relies quite heavily on this >>>> for good code generation, so we will need to find new ways to recognize and >>>> fold these values. >>>> >>>> ==========================================>>>> 5. Splitting and Combining Scalable Vectors >>>> ==========================================>>>> >>>> Splitting and combining scalable vectors in IR is done in the same manner as >>>> for fixed-length vectors, but with a non-constant mask for the shufflevector. >>>> >>>> The following is an example of splitting a <scalable 4 x double> into two >>>> separate <scalable 2 x double> values.
>>>> >>>> `` >>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >>>> ;; Stepvector generates the element ids for first subvector >>>> %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64() >>>> ;; Add vscale * 2 to get the starting element for the second subvector >>>> %ec = mul i64 %vscale64, 2 >>>> %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0 >>>> %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer >>>> %sv2 = add <scalable 2 x i64> %ec.splat, %sv1 >>>> ;; Perform the extracts >>>> %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1 >>>> %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2 >>>> `` >>>> >>>> =================>>>> 6. Code Generation >>>> =================>>>> >>>> IR splats will be converted to an experimental splatvector intrinsic in >>>> SelectionDAGBuilder. >>>> >>>> All three intrinsics are custom lowered and legalized in the AArch64 backend. >>>> >>>> Two new AArch64ISD nodes have been added to represent the same concepts >>>> at the SelectionDAG level, while splatvector maps onto the existing >>>> AArch64ISD::DUP. >>>> >>>> GlobalISel >>>> ---------- >>>> >>>> Since GlobalISel was enabled by default on AArch64, it was necessary to add >>>> scalable vector support to the LowLevelType implementation. A single bit was >>>> added to the raw_data representation for vectors and vectors of pointers. >>>> >>>> In addition, types that only exist in destination patterns are added to >>>> the enumeration of available types for generated code. While this may not be >>>> necessary in future, generating an all-true 'ptrue' value was necessary to >>>> convert a predicated instruction into an unpredicated one. >>>> >>>> =========>>>> 7.
Example >>>> =========>>>> >>>> The following example shows a simple C loop which assigns the array index to >>>> the array elements matching that index. The IR shows how vscale and stepvector >>>> are used to create the needed values and to advance the index variable in the >>>> loop. >>>> >>>> C Code >>>> ------ >>>> >>>> `` >>>> void IdentityArrayInit(int *a, int count) { >>>> for (int i = 0; i < count; ++i) >>>> a[i] = i; >>>> } >>>> `` >>>> >>>> Scalable IR Vector Body >>>> ----------------------- >>>> >>>> `` >>>> vector.body.preheader: >>>> ;; Other setup >>>> ;; Stepvector used to create initial identity vector >>>> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() >>>> br label %vector.body >>>> >>>> vector.body: >>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] >>>> %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ] >>>> >>>> ;; stepvector used for index identity on entry to loop body ;; >>>> %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], >>>> [ %stepvector, %vector.body.preheader ] >>>> %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >>>> %vscale32 = trunc i64 %vscale64 to i32 >>>> %1 = add i64 %0, mul (i64 %vscale64, i64 4) >>>> >>>> ;; vscale splat used to increment identity vector ;; >>>> %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0 >>>> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer >>>> %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat >>>> %2 = getelementptr inbounds i32, i32* %a, i64 %0 >>>> %3 = bitcast i32* %2 to <scalable 4 x i32>* >>>> store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4 >>>> >>>> ;; vscale used to increment loop index >>>> %index.next = add i64 %index, mul (i64 %vscale64, i64 4) >>>> %4 = icmp eq i64 %index.next, %n.vec >>>> br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 >>>> `` >>>>
>>>> =========>>>> 8. Patches >>>> =========>>>> >>>> List of patches: >>>> >>>> 1. Extend VectorType: https://reviews.llvm.org/D32530 >>>> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768 >>>> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 >>>> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 >>>> 5. SVE Calling Convention: https://reviews.llvm.org/D47771 >>>> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 >>>> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 >>>> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 >>>> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 >>>> 10. Initial store patterns: https://reviews.llvm.org/D47776 >>>> 11. Initial addition patterns: https://reviews.llvm.org/D47777 >>>> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 >>>> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 >>>> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 >>>> >>> >> >> -- >> Hal Finkel >> Lead, Compiler Technology and Programming Languages >> Leadership Computing Facility >> Argonne National Laboratory >> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev