Graham Hunter via llvm-dev
2018-Jul-30 09:23 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi, Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs. -Graham> On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com> wrote: > > Hi, > > I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed. > > Thanks, > > -Graham > > ============================================================> Supporting SIMD instruction sets with variable vector lengths > ============================================================> > In this RFC we propose extending LLVM IR to support code-generation for variable > length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our > approach is backwards compatible and should be as non-intrusive as possible; the > only change needed in other backends is how size is queried on vector types, and > it only requires a change in which function is called. We have created a set of > proof-of-concept patches to represent a simple vectorized loop in IR and > generate SVE instructions from that IR. These patches (listed in section 7 of > this rfc) can be found on Phabricator and are intended to illustrate the scope > of changes required by the general approach described in this RFC. > > =========> Background > =========> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for > AArch64 which is intended to scale with hardware such that the same binary > running on a processor with longer vector registers can take advantage of the > increased compute power without recompilation. > > As the vector length is no longer a compile-time known value, the way in which > the LLVM vectorizer generates code requires modifications such that certain > values are now runtime evaluated expressions instead of compile-time constants. > > Documentation for SVE can be found at > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a > > =======> Contents > =======> > The rest of this RFC covers the following topics: > > 1. Types -- a proposal to extend VectorType to be able to represent vectors that > have a length which is a runtime-determined multiple of a known base length. > > 2. Size Queries - how to reason about the size of types for which the size isn't > fully known at compile time. > > 3. Representing the runtime multiple of vector length in IR for use in address > calculations and induction variable comparisons. > > 4. Generating 'constant' values in IR for vectors with a runtime-determined > number of elements. > > 5. An explanation of splitting/concatentating scalable vectors. > > 6. A brief note on code generation of these new operations for AArch64. > > 7. An example of C code and matching IR using the proposed extensions. > > 8. A list of patches demonstrating the changes required to emit SVE instructions > for a loop that has already been vectorized using the extensions described > in this RFC. > > =======> 1. Types > =======> > To represent a vector of unknown length a boolean `Scalable` property has been > added to the `VectorType` class, which indicates that the number of elements in > the vector is a runtime-determined integer multiple of the `NumElements` field. > Most code that deals with vectors doesn't need to know the exact length, but > does need to know relative lengths -- e.g. get a vector with the same number of > elements but a different element type, or with half or double the number of > elements. > > In order to allow code to transparently support scalable vectors, we introduce > an `ElementCount` class with two members: > > - `unsigned Min`: the minimum number of elements. > - `bool Scalable`: is the element count an unknown multiple of `Min`? > > For non-scalable vectors (``Scalable=false``) the scale is considered to be > equal to one and thus `Min` represents the exact number of elements in the > vector. > > The intent for code working with vectors is to use convenience methods and avoid > directly dealing with the number of elements. If needed, calling > `getElementCount` on a vector type instead of `getVectorNumElements` can be used > to obtain the (potentially scalable) number of elements. Overloaded division and > multiplication operators allow an ElementCount instance to be used in much the > same manner as an integer for most cases. > > This mixture of compile-time and runtime quantities allow us to reason about the > relationship between different scalable vector types without knowing their > exact length. > > The runtime multiple is not expected to change during program execution for SVE, > but it is possible. The model of scalable vectors presented in this RFC assumes > that the multiple will be constant within a function but not necessarily across > functions. As suggested in the recent RISC-V rfc, a new function attribute to > inherit the multiple across function calls will allow for function calls with > vector arguments/return values and inlining/outlining optimizations. > > IR Textual Form > --------------- > > The textual form for a scalable vector is: > > ``<scalable <n> x <type>>`` > > where `type` is the scalar type of each element, `n` is the minimum number of > elements, and the string literal `scalable` indicates that the total number of > elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice > for indicating that the vector is scalable, and could be substituted by another. > For fixed-length vectors, the `scalable` is omitted, so there is no change in > the format for existing vectors. > > Scalable vectors with the same `Min` value have the same number of elements, and > the same number of bytes if `Min * sizeof(type)` is the same (assuming they are > used within the same function): > > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of > elements. > > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of > bytes. > > IR Bitcode Form > --------------- > > To serialize scalable vectors to bitcode, a new boolean field is added to the > type record. If the field is not present the type will default to a fixed-length > vector type, preserving backwards compatibility. > > Alternatives Considered > ----------------------- > > We did consider one main alternative -- a dedicated target type, like the > x86_mmx type. > > A dedicated target type would either need to extend all existing passes that > work with vectors to recognize the new type, or to duplicate all that code > in order to get reasonable code generation and autovectorization. > > This hasn't been done for the x86_mmx type, and so it is only capable of > providing support for C-level intrinsics instead of being used and recognized by > passes inside llvm. > > Although our current solution will need to change some of the code that creates > new VectorTypes, much of that code doesn't need to care about whether the types > are scalable or not -- they can use preexisting methods like > `getHalfElementsVectorType`. If the code is a little more complex, > `ElementCount` structs can be used instead of an `unsigned` value to represent > the number of elements. > > ==============> 2. Size Queries > ==============> > This is a proposal for how to deal with querying the size of scalable types for > analysis of IR. While it has not been implemented in full, the general approach > works well for calculating offsets into structures with scalable types in a > modified version of ComputeValueVTs in our downstream compiler. > > For current IR types that have a known size, all query functions return a single > integer constant. For scalable types a second integer is needed to indicate the > number of bytes/bits which need to be scaled by the runtime multiple to obtain > the actual length. > > For primitive types, `getPrimitiveSizeInBits()` will function as it does today, > except that it will no longer return a size for vector types (it will return 0, > as it does for other derived types). The majority of calls to this function are > already for scalar rather than vector types. > > For derived types, a function `getScalableSizePairInBits()` will be added, which > returns a pair of integers (one to indicate unscaled bits, the other for bits > that need to be scaled by the runtime multiple). For backends that do not need > to deal with scalable types the existing methods will suffice, but a debug-only > assert will be added to them to ensure they aren't used on scalable types. > > Similar functionality will be added to DataLayout. > > Comparisons between sizes will use the following methods, assuming that X and > Y are non-zero integers and the form is of { unscaled, scaled }. > > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison. > > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across > functions that inherit vector length. Cannot be > compared across non-inheriting functions. > > { X, 0 } > { 0, Y }: Cannot return true. > > { X, 0 } = { 0, Y }: Cannot return true. > > { X, 0 } < { 0, Y }: Can return true. > > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common > terms and try the above comparisons; it > may not be possible to get a good answer. > > It's worth noting that we don't expect the last case (mixed scaled and > unscaled sizes) to occur. Richard Sandiford's proposed C extensions > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly > prohibits mixing fixed-size types into sizeless struct. > > I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled > vs. unscaled; I believe the gcc implementation of SVE allows for such > results, but that supports a generic polynomial length representation. > > My current intention is to rely on functions that clone or copy values to > check whether they are being used to copy scalable vectors across function > boundaries without the inherit vlen attribute and raise an error there instead > of requiring passing the Function a type size is from for each comparison. If > there's a strong preference for moving the check to the size comparison function > let me know; I will be starting work on patches for this later in the year if > there's no major problems with the idea. > > Future Work > ----------- > > Since we cannot determine the exact size of a scalable vector, the > existing logic for alias detection won't work when multiple accesses > share a common base pointer with different offsets. > > However, SVE's predication will mean that a dynamic 'safe' vector length > can be determined at runtime, so after initial support has been added we > can work on vectorizing loops using runtime predication to avoid aliasing > problems. > > Alternatives Considered > ----------------------- > > Marking scalable vectors as unsized doesn't work well, as many parts of > llvm dealing with loads and stores assert that 'isSized()' returns true > and make use of the size when calculating offsets. > > We have considered introducing multiple helper functions instead of > using direct size queries, but that doesn't cover all cases. It may > still be a good idea to introduce them to make the purpose in a given > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. > > =======================================> 3. Representing Vector Length at Runtime > =======================================> > With a scalable vector type defined, we now need a way to represent the runtime > length in IR in order to generate addresses for consecutive vectors in memory > and determine how many elements have been processed in an iteration of a loop. > > We have added an experimental `vscale` intrinsic to represent the runtime > multiple. Multiplying the result of this intrinsic by the minimum number of > elements in a vector gives the total number of elements in a scalable vector. > > Fixed-Length Code > ----------------- > > Assuming a vector type of <4 x <ty>> > `` > vector.body: > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] > ;; <loop body> > ;; Increment induction var > %index.next = add i64 %index, 4 > ;; <check and branch> > `` > Scalable Equivalent > ------------------- > > Assuming a vector type of <scalable 4 x <ty>> > `` > vector.body: > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] > ;; <loop body> > ;; Increment induction var > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > ;; <check and branch> > `` > ==========================> 4. Generating Vector Values > ==========================> For constant vector values, we cannot specify all the elements as we can for > fixed-length vectors; fortunately only a small number of easily synthesized > patterns are required for autovectorization. The `zeroinitializer` constant > can be used in the same manner as fixed-length vectors for a constant zero > splat. This can then be combined with `insertelement` and `shufflevector` > to create arbitrary value splats in the same manner as fixed-length vectors. > > For constants consisting of a sequence of values, an experimental `stepvector` > intrinsic has been added to represent a simple constant of the form > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new > start can be added, and changing the step requires multiplying by a splat. > > Fixed-Length Code > ----------------- > `` > ;; Splat a value > %insert = insertelement <4 x i32> undef, i32 %value, i32 0 > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer > ;; Add a constant sequence > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8> > `` > Scalable Equivalent > ------------------- > `` > ;; Splat a value > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0 > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer > ;; Splat offset + stride (the same in this case) > %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0 > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer > ;; Create sequence for scalable vector > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off > ;; Add the runtime-generated sequence > %add = add <scalable 4 x i32> %splat, %addoffset > `` > Future Work > ----------- > > Intrinsics cannot currently be used for constant folding. Our downstream > compiler (using Constants instead of intrinsics) relies quite heavily on this > for good code generation, so we will need to find new ways to recognize and > fold these values. > > ==========================================> 5. Splitting and Combining Scalable Vectors > ==========================================> > Splitting and combining scalable vectors in IR is done in the same manner as > for fixed-length vectors, but with a non-constant mask for the shufflevector. > > The following is an example of splitting a <scalable 4 x double> into two > separate <scalable 2 x double> values. > > `` > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > ;; Stepvector generates the element ids for first subvector > %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64() > ;; Add vscale * 2 to get the starting element for the second subvector > %ec = mul i64 %vscale64, 2 > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0 > %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer > %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64 > ;; Perform the extracts > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1 > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2 > `` > > =================> 6. Code Generation > =================> > IR splats will be converted to an experimental splatvector intrinsic in > SelectionDAGBuilder. > > All three intrinsics are custom lowered and legalized in the AArch64 backend. > > Two new AArch64ISD nodes have been added to represent the same concepts > at the SelectionDAG level, while splatvector maps onto the existing > AArch64ISD::DUP. > > GlobalISel > ---------- > > Since GlobalISel was enabled by default on AArch64, it was necessary to add > scalable vector support to the LowLevelType implementation. A single bit was > added to the raw_data representation for vectors and vectors of pointers. > > In addition, types that only exist in destination patterns are planted in > the enumeration of available types for generated code. While this may not be > necessary in future, generating an all-true 'ptrue' value was necessary to > convert a predicated instruction into an unpredicated one. > > =========> 7. Example > =========> > The following example shows a simple C loop which assigns the array index to > the array elements matching that index. The IR shows how vscale and stepvector > are used to create the needed values and to advance the index variable in the > loop. > > C Code > ------ > > `` > void IdentityArrayInit(int *a, int count) { > for (int i = 0; i < count; ++i) > a[i] = i; > } > `` > > Scalable IR Vector Body > ----------------------- > > `` > vector.body.preheader: > ;; Other setup > ;; Stepvector used to create initial identity vector > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() > br vector.body > > vector.body > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] > %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ] > > ;; stepvector used for index identity on entry to loop body ;; > %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], > [ %stepvector, %vector.body.preheader ] > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > %vscale32 = trunc i64 %vscale64 to i32 > %1 = add i64 %0, mul (i64 %vscale64, i64 4) > > ;; vscale splat used to increment identity vector ;; > %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0 > %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer > %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat > %2 = getelementptr inbounds i32, i32* %a, i64 %0 > %3 = bitcast i32* %2 to <scalable 4 x i32>* > store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4 > > ;; vscale used to increment loop index > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > %4 = icmp eq i64 %index.next, %n.vec > br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 > `` > > =========> 8. Patches > =========> > List of patches: > > 1. Extend VectorType: https://reviews.llvm.org/D32530 > 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768 > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 > 5. SVE Calling Convention: https://reviews.llvm.org/D47771 > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 > 10. Initial store patterns: https://reviews.llvm.org/D47776 > 11. Initial addition patterns: https://reviews.llvm.org/D47777 > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 >
Chandler Carruth via llvm-dev
2018-Jul-30 10:34 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
I strongly suspect that there remains widespread concern with the direction of this, I know I have them. I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback. Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread. Put differently: I don't think silence is assent here. You really need some clear signal of consensus. On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com> wrote:> Hi, > > Are there any objections to going ahead with this? If not, we'll try to > get the patches reviewed and committed after the 7.0 branch occurs. > > -Graham > > > On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com> wrote: > > > > Hi, > > > > I've updated the RFC slightly based on the discussion within the thread, > reposted below. Let me know if I've missed anything or if more > clarification is needed. > > > > Thanks, > > > > -Graham > > > > ============================================================> > Supporting SIMD instruction sets with variable vector lengths > > ============================================================> > > > In this RFC we propose extending LLVM IR to support code-generation for > variable > > length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our > > approach is backwards compatible and should be as non-intrusive as > possible; the > > only change needed in other backends is how size is queried on vector > types, and > > it only requires a change in which function is called. We have created a > set of > > proof-of-concept patches to represent a simple vectorized loop in IR and > > generate SVE instructions from that IR. These patches (listed in section > 7 of > > this rfc) can be found on Phabricator and are intended to illustrate the > scope > > of changes required by the general approach described in this RFC. > > > > =========> > Background > > =========> > > > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension > for > > AArch64 which is intended to scale with hardware such that the same > binary > > running on a processor with longer vector registers can take advantage > of the > > increased compute power without recompilation. > > > > As the vector length is no longer a compile-time known value, the way in > which > > the LLVM vectorizer generates code requires modifications such that > certain > > values are now runtime evaluated expressions instead of compile-time > constants. > > > > Documentation for SVE can be found at > > > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a > > > > =======> > Contents > > =======> > > > The rest of this RFC covers the following topics: > > > > 1. Types -- a proposal to extend VectorType to be able to represent > vectors that > > have a length which is a runtime-determined multiple of a known base > length. > > > > 2. Size Queries - how to reason about the size of types for which the > size isn't > > fully known at compile time. > > > > 3. Representing the runtime multiple of vector length in IR for use in > address > > calculations and induction variable comparisons. > > > > 4. Generating 'constant' values in IR for vectors with a > runtime-determined > > number of elements. > > > > 5. An explanation of splitting/concatentating scalable vectors. > > > > 6. A brief note on code generation of these new operations for AArch64. > > > > 7. An example of C code and matching IR using the proposed extensions. > > > > 8. A list of patches demonstrating the changes required to emit SVE > instructions > > for a loop that has already been vectorized using the extensions > described > > in this RFC. > > > > =======> > 1. Types > > =======> > > > To represent a vector of unknown length a boolean `Scalable` property > has been > > added to the `VectorType` class, which indicates that the number of > elements in > > the vector is a runtime-determined integer multiple of the `NumElements` > field. > > Most code that deals with vectors doesn't need to know the exact length, > but > > does need to know relative lengths -- e.g. get a vector with the same > number of > > elements but a different element type, or with half or double the number > of > > elements. > > > > In order to allow code to transparently support scalable vectors, we > introduce > > an `ElementCount` class with two members: > > > > - `unsigned Min`: the minimum number of elements. > > - `bool Scalable`: is the element count an unknown multiple of `Min`? > > > > For non-scalable vectors (``Scalable=false``) the scale is considered to > be > > equal to one and thus `Min` represents the exact number of elements in > the > > vector. > > > > The intent for code working with vectors is to use convenience methods > and avoid > > directly dealing with the number of elements. If needed, calling > > `getElementCount` on a vector type instead of `getVectorNumElements` can > be used > > to obtain the (potentially scalable) number of elements. Overloaded > division and > > multiplication operators allow an ElementCount instance to be used in > much the > > same manner as an integer for most cases. > > > > This mixture of compile-time and runtime quantities allow us to reason > about the > > relationship between different scalable vector types without knowing > their > > exact length. > > > > The runtime multiple is not expected to change during program execution > for SVE, > > but it is possible. The model of scalable vectors presented in this RFC > assumes > > that the multiple will be constant within a function but not necessarily > across > > functions. As suggested in the recent RISC-V rfc, a new function > attribute to > > inherit the multiple across function calls will allow for function calls > with > > vector arguments/return values and inlining/outlining optimizations. > > > > IR Textual Form > > --------------- > > > > The textual form for a scalable vector is: > > > > ``<scalable <n> x <type>>`` > > > > where `type` is the scalar type of each element, `n` is the minimum > number of > > elements, and the string literal `scalable` indicates that the total > number of > > elements is an unknown multiple of `n`; `scalable` is just an arbitrary > choice > > for indicating that the vector is scalable, and could be substituted by > another. > > For fixed-length vectors, the `scalable` is omitted, so there is no > change in > > the format for existing vectors. > > > > Scalable vectors with the same `Min` value have the same number of > elements, and > > the same number of bytes if `Min * sizeof(type)` is the same (assuming > they are > > used within the same function): > > > > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of > > elements. > > > > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of > > bytes. > > > > IR Bitcode Form > > --------------- > > > > To serialize scalable vectors to bitcode, a new boolean field is added > to the > > type record. If the field is not present the type will default to a > fixed-length > > vector type, preserving backwards compatibility. > > > > Alternatives Considered > > ----------------------- > > > > We did consider one main alternative -- a dedicated target type, like the > > x86_mmx type. > > > > A dedicated target type would either need to extend all existing passes > that > > work with vectors to recognize the new type, or to duplicate all that > code > > in order to get reasonable code generation and autovectorization. > > > > This hasn't been done for the x86_mmx type, and so it is only capable of > > providing support for C-level intrinsics instead of being used and > recognized by > > passes inside llvm. > > > > Although our current solution will need to change some of the code that > creates > > new VectorTypes, much of that code doesn't need to care about whether > the types > > are scalable or not -- they can use preexisting methods like > > `getHalfElementsVectorType`. If the code is a little more complex, > > `ElementCount` structs can be used instead of an `unsigned` value to > represent > > the number of elements. > > > > ==============> > 2. Size Queries > > ==============> > > > This is a proposal for how to deal with querying the size of scalable > types for > > analysis of IR. While it has not been implemented in full, the general > approach > > works well for calculating offsets into structures with scalable types > in a > > modified version of ComputeValueVTs in our downstream compiler. > > > > For current IR types that have a known size, all query functions return > a single > > integer constant. For scalable types a second integer is needed to > indicate the > > number of bytes/bits which need to be scaled by the runtime multiple to > obtain > > the actual length. > > > > For primitive types, `getPrimitiveSizeInBits()` will function as it does > today, > > except that it will no longer return a size for vector types (it will > return 0, > > as it does for other derived types). The majority of calls to this > function are > > already for scalar rather than vector types. > > > > For derived types, a function `getScalableSizePairInBits()` will be > added, which > > returns a pair of integers (one to indicate unscaled bits, the other for > bits > > that need to be scaled by the runtime multiple). For backends that do > not need > > to deal with scalable types the existing methods will suffice, but a > debug-only > > assert will be added to them to ensure they aren't used on scalable > types. > > > > Similar functionality will be added to DataLayout. > > > > Comparisons between sizes will use the following methods, assuming that > X and > > Y are non-zero integers and the form is of { unscaled, scaled }. > > > > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison. > > > > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across > > functions that inherit vector length. Cannot be > > compared across non-inheriting functions. > > > > { X, 0 } > { 0, Y }: Cannot return true. > > > > { X, 0 } = { 0, Y }: Cannot return true. > > > > { X, 0 } < { 0, Y }: Can return true. > > > > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common > > terms and try the above comparisons; it > > may not be possible to get a good answer. > > > > It's worth noting that we don't expect the last case (mixed scaled and > > unscaled sizes) to occur. Richard Sandiford's proposed C extensions > > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) > explicitly > > prohibits mixing fixed-size types into sizeless struct. > > > > I don't know if we need a 'maybe' or 'unknown' result for cases > comparing scaled > > vs. unscaled; I believe the gcc implementation of SVE allows for such > > results, but that supports a generic polynomial length representation. > > > > My current intention is to rely on functions that clone or copy values to > > check whether they are being used to copy scalable vectors across > function > > boundaries without the inherit vlen attribute and raise an error there > instead > > of requiring passing the Function a type size is from for each > comparison. If > > there's a strong preference for moving the check to the size comparison > function > > let me know; I will be starting work on patches for this later in the > year if > > there's no major problems with the idea. > > > > Future Work > > ----------- > > > > Since we cannot determine the exact size of a scalable vector, the > > existing logic for alias detection won't work when multiple accesses > > share a common base pointer with different offsets. > > > > However, SVE's predication will mean that a dynamic 'safe' vector length > > can be determined at runtime, so after initial support has been added we > > can work on vectorizing loops using runtime predication to avoid aliasing > > problems. > > > > Alternatives Considered > > ----------------------- > > > > Marking scalable vectors as unsized doesn't work well, as many parts of > > llvm dealing with loads and stores assert that 'isSized()' returns true > > and make use of the size when calculating offsets. > > > > We have considered introducing multiple helper functions instead of > > using direct size queries, but that doesn't cover all cases. It may > > still be a good idea to introduce them to make the purpose in a given > > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. > > > > =======================================> > 3. Representing Vector Length at Runtime > > =======================================> > > > With a scalable vector type defined, we now need a way to represent the > runtime > > length in IR in order to generate addresses for consecutive vectors in > memory > > and determine how many elements have been processed in an iteration of a > loop. > > > > We have added an experimental `vscale` intrinsic to represent the runtime > > multiple. Multiplying the result of this intrinsic by the minimum number > of > > elements in a vector gives the total number of elements in a scalable > vector. > > > > Fixed-Length Code > > ----------------- > > > > Assuming a vector type of <4 x <ty>> > > `` > > vector.body: > > %index = phi i64 [ %index.next, %vector.body ], [ 0, > %vector.body.preheader ] > > ;; <loop body> > > ;; Increment induction var > > %index.next = add i64 %index, 4 > > ;; <check and branch> > > `` > > Scalable Equivalent > > ------------------- > > > > Assuming a vector type of <scalable 4 x <ty>> > > `` > > vector.body: > > %index = phi i64 [ %index.next, %vector.body ], [ 0, > %vector.body.preheader ] > > ;; <loop body> > > ;; Increment induction var > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > > ;; <check and branch> > > `` > > ==========================> > 4. Generating Vector Values > > ==========================> > For constant vector values, we cannot specify all the elements as we can > for > > fixed-length vectors; fortunately only a small number of easily > synthesized > > patterns are required for autovectorization. The `zeroinitializer` > constant > > can be used in the same manner as fixed-length vectors for a constant > zero > > splat. This can then be combined with `insertelement` and `shufflevector` > > to create arbitrary value splats in the same manner as fixed-length > vectors. > > > > For constants consisting of a sequence of values, an experimental > `stepvector` > > intrinsic has been added to represent a simple constant of the form > > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the > new > > start can be added, and changing the step requires multiplying by a > splat. > > > > Fixed-Length Code > > ----------------- > > `` > > ;; Splat a value > > %insert = insertelement <4 x i32> undef, i32 %value, i32 0 > > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> > zeroinitializer > > ;; Add a constant sequence > > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8> > > `` > > Scalable Equivalent > > ------------------- > > `` > > ;; Splat a value > > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0 > > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> > undef, <scalable 4 x i32> zeroinitializer > > ;; Splat offset + stride (the same in this case) > > %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0 > > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x > i32> undef, <scalable 4 x i32> zeroinitializer > > ;; Create sequence for scalable vector > > %stepvector = call <scalable 4 x i32> > @llvm.experimental.vector.stepvector.nxv4i32() > > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off > > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off > > ;; Add the runtime-generated sequence > > %add = add <scalable 4 x i32> %splat, %addoffset > > `` > > Future Work > > ----------- > > > > Intrinsics cannot currently be used for constant folding. Our downstream > > compiler (using Constants instead of intrinsics) relies quite heavily on > this > > for good code generation, so we will need to find new ways to recognize > and > > fold these values. > > > > ==========================================> > 5. Splitting and Combining Scalable Vectors > > ==========================================> > > > Splitting and combining scalable vectors in IR is done in the same > manner as > > for fixed-length vectors, but with a non-constant mask for the > shufflevector. > > > > The following is an example of splitting a <scalable 4 x double> into two > > separate <scalable 2 x double> values. > > > > `` > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > ;; Stepvector generates the element ids for first subvector > > %sv1 = call <scalable 2 x i64> > @llvm.experimental.vector.stepvector.nxv2i64() > > ;; Add vscale * 2 to get the starting element for the second subvector > > %ec = mul i64 %vscale64, 2 > > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0 > > %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x i64> > undef, <scalable 2 x i32> zeroinitializer > > %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64 > > ;; Perform the extracts > > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> > undef, <scalable 2 x i64> %sv1 > > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> > undef, <scalable 2 x i64> %sv2 > > `` > > > > =================> > 6. Code Generation > > =================> > > > IR splats will be converted to an experimental splatvector intrinsic in > > SelectionDAGBuilder. > > > > All three intrinsics are custom lowered and legalized in the AArch64 > backend. > > > > Two new AArch64ISD nodes have been added to represent the same concepts > > at the SelectionDAG level, while splatvector maps onto the existing > > AArch64ISD::DUP. > > > > GlobalISel > > ---------- > > > > Since GlobalISel was enabled by default on AArch64, it was necessary to > add > > scalable vector support to the LowLevelType implementation. A single bit > was > > added to the raw_data representation for vectors and vectors of pointers. > > > > In addition, types that only exist in destination patterns are planted in > > the enumeration of available types for generated code. While this may > not be > > necessary in future, generating an all-true 'ptrue' value was necessary > to > > convert a predicated instruction into an unpredicated one. > > > > =========> > 7. Example > > =========> > > > The following example shows a simple C loop which assigns the array > index to > > the array elements matching that index. The IR shows how vscale and > stepvector > > are used to create the needed values and to advance the index variable > in the > > loop. > > > > C Code > > ------ > > > > `` > > void IdentityArrayInit(int *a, int count) { > > for (int i = 0; i < count; ++i) > > a[i] = i; > > } > > `` > > > > Scalable IR Vector Body > > ----------------------- > > > > `` > > vector.body.preheader: > > ;; Other setup > > ;; Stepvector used to create initial identity vector > > %stepvector = call <scalable 4 x i32> > @llvm.experimental.vector.stepvector.nxv4i32() > > br vector.body > > > > vector.body > > %index = phi i64 [ %index.next, %vector.body ], [ 0, > %vector.body.preheader ] > > %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ] > > > > ;; stepvector used for index identity on entry to loop body ;; > > %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], > > [ %stepvector, > %vector.body.preheader ] > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > %vscale32 = trunc i64 %vscale64 to i32 > > %1 = add i64 %0, mul (i64 %vscale64, i64 4) > > > > ;; vscale splat used to increment identity vector ;; > > %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 > %vscale32, i32 4), i32 0 > > %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> > undef, <scalable 4 x i32> zeroinitializer > > %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat > > %2 = getelementptr inbounds i32, i32* %a, i64 %0 > > %3 = bitcast i32* %2 to <scalable 4 x i32>* > > store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4 > > > > ;; vscale used to increment loop index > > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > > %4 = icmp eq i64 %index.next, %n.vec > > br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 > > `` > > > > =========> > 8. Patches > > =========> > > > List of patches: > > > > 1. Extend VectorType: https://reviews.llvm.org/D32530 > > 2. Vector element type Tablegen constraint: > https://reviews.llvm.org/D47768 > > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 > > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 > > 5. SVE Calling Convention: https://reviews.llvm.org/D47771 > > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 > > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 > > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 > > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 > > 10. Initial store patterns: https://reviews.llvm.org/D47776 > > 11. Initial addition patterns: https://reviews.llvm.org/D47777 > > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 > > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 > > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180730/104ff24d/attachment-0002.html>
Chandler Carruth via llvm-dev
2018-Jul-30 10:35 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
(And in case it isn't clear, I *do* think it is reasonable to push on this thread and lobby to get that feedback! You shouldn't be stalled forever, I'm just trying to point out that the focus should be on getting the feedback, not whether there are objections. You want consensus here, not silence.) On Mon, Jul 30, 2018 at 3:34 AM Chandler Carruth <chandlerc at gmail.com> wrote:> I strongly suspect that there remains widespread concern with the > direction of this, I know I have them. > > I don't think that many of the people who have that concern have had time > to come back to this RFC and make progress on it, likely because of other > commitments or simply the amount of churn around SVE related patches and > such. That is at least why I haven't had time to return to this RFC and try > to write more detailed feedback. > > Certainly, I would want to see pretty clear and considered support for > this change to the IR type system from Hal, Chris, Eric and/or other long > time maintainers of core LLVM IR components before it moves forward, and I > don't see that in this thread. > > Put differently: I don't think silence is assent here. You really need > some clear signal of consensus. > > On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com> > wrote: > >> Hi, >> >> Are there any objections to going ahead with this? If not, we'll try to >> get the patches reviewed and committed after the 7.0 branch occurs. >> >> -Graham >> >> > On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com> wrote: >> > >> > Hi, >> > >> > I've updated the RFC slightly based on the discussion within the >> thread, reposted below. Let me know if I've missed anything or if more >> clarification is needed. >> > >> > Thanks, >> > >> > -Graham >> > >> > ============================================================>> > Supporting SIMD instruction sets with variable vector lengths >> > ============================================================>> > >> > In this RFC we propose extending LLVM IR to support code-generation for >> variable >> > length vector architectures like Arm's SVE or RISC-V's 'V' extension. >> Our >> > approach is backwards compatible and should be as non-intrusive as >> possible; the >> > only change needed in other backends is how size is queried on vector >> types, and >> > it only requires a change in which function is called. We have created >> a set of >> > proof-of-concept patches to represent a simple vectorized loop in IR and >> > generate SVE instructions from that IR. These patches (listed in >> section 7 of >> > this rfc) can be found on Phabricator and are intended to illustrate >> the scope >> > of changes required by the general approach described in this RFC. >> > >> > =========>> > Background >> > =========>> > >> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA >> extension for >> > AArch64 which is intended to scale with hardware such that the same >> binary >> > running on a processor with longer vector registers can take advantage >> of the >> > increased compute power without recompilation. >> > >> > As the vector length is no longer a compile-time known value, the way >> in which >> > the LLVM vectorizer generates code requires modifications such that >> certain >> > values are now runtime evaluated expressions instead of compile-time >> constants. >> > >> > Documentation for SVE can be found at >> > >> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a >> > >> > =======>> > Contents >> > =======>> > >> > The rest of this RFC covers the following topics: >> > >> > 1. Types -- a proposal to extend VectorType to be able to represent >> vectors that >> > have a length which is a runtime-determined multiple of a known base >> length. >> > >> > 2. Size Queries - how to reason about the size of types for which the >> size isn't >> > fully known at compile time. >> > >> > 3. Representing the runtime multiple of vector length in IR for use in >> address >> > calculations and induction variable comparisons. >> > >> > 4. Generating 'constant' values in IR for vectors with a >> runtime-determined >> > number of elements. >> > >> > 5. An explanation of splitting/concatentating scalable vectors. >> > >> > 6. A brief note on code generation of these new operations for AArch64. >> > >> > 7. An example of C code and matching IR using the proposed extensions. >> > >> > 8. A list of patches demonstrating the changes required to emit SVE >> instructions >> > for a loop that has already been vectorized using the extensions >> described >> > in this RFC. >> > >> > =======>> > 1. Types >> > =======>> > >> > To represent a vector of unknown length a boolean `Scalable` property >> has been >> > added to the `VectorType` class, which indicates that the number of >> elements in >> > the vector is a runtime-determined integer multiple of the >> `NumElements` field. >> > Most code that deals with vectors doesn't need to know the exact >> length, but >> > does need to know relative lengths -- e.g. get a vector with the same >> number of >> > elements but a different element type, or with half or double the >> number of >> > elements. >> > >> > In order to allow code to transparently support scalable vectors, we >> introduce >> > an `ElementCount` class with two members: >> > >> > - `unsigned Min`: the minimum number of elements. >> > - `bool Scalable`: is the element count an unknown multiple of `Min`? >> > >> > For non-scalable vectors (``Scalable=false``) the scale is considered >> to be >> > equal to one and thus `Min` represents the exact number of elements in >> the >> > vector. >> > >> > The intent for code working with vectors is to use convenience methods >> and avoid >> > directly dealing with the number of elements. If needed, calling >> > `getElementCount` on a vector type instead of `getVectorNumElements` >> can be used >> > to obtain the (potentially scalable) number of elements. Overloaded >> division and >> > multiplication operators allow an ElementCount instance to be used in >> much the >> > same manner as an integer for most cases. >> > >> > This mixture of compile-time and runtime quantities allow us to reason >> about the >> > relationship between different scalable vector types without knowing >> their >> > exact length. >> > >> > The runtime multiple is not expected to change during program execution >> for SVE, >> > but it is possible. The model of scalable vectors presented in this RFC >> assumes >> > that the multiple will be constant within a function but not >> necessarily across >> > functions. As suggested in the recent RISC-V rfc, a new function >> attribute to >> > inherit the multiple across function calls will allow for function >> calls with >> > vector arguments/return values and inlining/outlining optimizations. >> > >> > IR Textual Form >> > --------------- >> > >> > The textual form for a scalable vector is: >> > >> > ``<scalable <n> x <type>>`` >> > >> > where `type` is the scalar type of each element, `n` is the minimum >> number of >> > elements, and the string literal `scalable` indicates that the total >> number of >> > elements is an unknown multiple of `n`; `scalable` is just an arbitrary >> choice >> > for indicating that the vector is scalable, and could be substituted by >> another. >> > For fixed-length vectors, the `scalable` is omitted, so there is no >> change in >> > the format for existing vectors. >> > >> > Scalable vectors with the same `Min` value have the same number of >> elements, and >> > the same number of bytes if `Min * sizeof(type)` is the same (assuming >> they are >> > used within the same function): >> > >> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of >> > elements. >> > >> > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number >> of >> > bytes. >> > >> > IR Bitcode Form >> > --------------- >> > >> > To serialize scalable vectors to bitcode, a new boolean field is added >> to the >> > type record. If the field is not present the type will default to a >> fixed-length >> > vector type, preserving backwards compatibility. >> > >> > Alternatives Considered >> > ----------------------- >> > >> > We did consider one main alternative -- a dedicated target type, like >> the >> > x86_mmx type. >> > >> > A dedicated target type would either need to extend all existing passes >> that >> > work with vectors to recognize the new type, or to duplicate all that >> code >> > in order to get reasonable code generation and autovectorization. >> > >> > This hasn't been done for the x86_mmx type, and so it is only capable of >> > providing support for C-level intrinsics instead of being used and >> recognized by >> > passes inside llvm. >> > >> > Although our current solution will need to change some of the code that >> creates >> > new VectorTypes, much of that code doesn't need to care about whether >> the types >> > are scalable or not -- they can use preexisting methods like >> > `getHalfElementsVectorType`. If the code is a little more complex, >> > `ElementCount` structs can be used instead of an `unsigned` value to >> represent >> > the number of elements. >> > >> > ==============>> > 2. Size Queries >> > ==============>> > >> > This is a proposal for how to deal with querying the size of scalable >> types for >> > analysis of IR. While it has not been implemented in full, the general >> approach >> > works well for calculating offsets into structures with scalable types >> in a >> > modified version of ComputeValueVTs in our downstream compiler. >> > >> > For current IR types that have a known size, all query functions return >> a single >> > integer constant. For scalable types a second integer is needed to >> indicate the >> > number of bytes/bits which need to be scaled by the runtime multiple to >> obtain >> > the actual length. >> > >> > For primitive types, `getPrimitiveSizeInBits()` will function as it >> does today, >> > except that it will no longer return a size for vector types (it will >> return 0, >> > as it does for other derived types). The majority of calls to this >> function are >> > already for scalar rather than vector types. >> > >> > For derived types, a function `getScalableSizePairInBits()` will be >> added, which >> > returns a pair of integers (one to indicate unscaled bits, the other >> for bits >> > that need to be scaled by the runtime multiple). For backends that do >> not need >> > to deal with scalable types the existing methods will suffice, but a >> debug-only >> > assert will be added to them to ensure they aren't used on scalable >> types. >> > >> > Similar functionality will be added to DataLayout. >> > >> > Comparisons between sizes will use the following methods, assuming that >> X and >> > Y are non-zero integers and the form is of { unscaled, scaled }. >> > >> > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison. >> > >> > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across >> > functions that inherit vector length. Cannot be >> > compared across non-inheriting functions. >> > >> > { X, 0 } > { 0, Y }: Cannot return true. >> > >> > { X, 0 } = { 0, Y }: Cannot return true. >> > >> > { X, 0 } < { 0, Y }: Can return true. >> > >> > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common >> > terms and try the above comparisons; it >> > may not be possible to get a good answer. >> > >> > It's worth noting that we don't expect the last case (mixed scaled and >> > unscaled sizes) to occur. Richard Sandiford's proposed C extensions >> > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) >> explicitly >> > prohibits mixing fixed-size types into sizeless struct. >> > >> > I don't know if we need a 'maybe' or 'unknown' result for cases >> comparing scaled >> > vs. unscaled; I believe the gcc implementation of SVE allows for such >> > results, but that supports a generic polynomial length representation. >> > >> > My current intention is to rely on functions that clone or copy values >> to >> > check whether they are being used to copy scalable vectors across >> function >> > boundaries without the inherit vlen attribute and raise an error there >> instead >> > of requiring passing the Function a type size is from for each >> comparison. If >> > there's a strong preference for moving the check to the size comparison >> function >> > let me know; I will be starting work on patches for this later in the >> year if >> > there's no major problems with the idea. >> > >> > Future Work >> > ----------- >> > >> > Since we cannot determine the exact size of a scalable vector, the >> > existing logic for alias detection won't work when multiple accesses >> > share a common base pointer with different offsets. >> > >> > However, SVE's predication will mean that a dynamic 'safe' vector length >> > can be determined at runtime, so after initial support has been added we >> > can work on vectorizing loops using runtime predication to avoid >> aliasing >> > problems. >> > >> > Alternatives Considered >> > ----------------------- >> > >> > Marking scalable vectors as unsized doesn't work well, as many parts of >> > llvm dealing with loads and stores assert that 'isSized()' returns true >> > and make use of the size when calculating offsets. >> > >> > We have considered introducing multiple helper functions instead of >> > using direct size queries, but that doesn't cover all cases. It may >> > still be a good idea to introduce them to make the purpose in a given >> > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. >> > >> > =======================================>> > 3. Representing Vector Length at Runtime >> > =======================================>> > >> > With a scalable vector type defined, we now need a way to represent the >> runtime >> > length in IR in order to generate addresses for consecutive vectors in >> memory >> > and determine how many elements have been processed in an iteration of >> a loop. >> > >> > We have added an experimental `vscale` intrinsic to represent the >> runtime >> > multiple. Multiplying the result of this intrinsic by the minimum >> number of >> > elements in a vector gives the total number of elements in a scalable >> vector. >> > >> > Fixed-Length Code >> > ----------------- >> > >> > Assuming a vector type of <4 x <ty>> >> > `` >> > vector.body: >> > %index = phi i64 [ %index.next, %vector.body ], [ 0, >> %vector.body.preheader ] >> > ;; <loop body> >> > ;; Increment induction var >> > %index.next = add i64 %index, 4 >> > ;; <check and branch> >> > `` >> > Scalable Equivalent >> > ------------------- >> > >> > Assuming a vector type of <scalable 4 x <ty>> >> > `` >> > vector.body: >> > %index = phi i64 [ %index.next, %vector.body ], [ 0, >> %vector.body.preheader ] >> > ;; <loop body> >> > ;; Increment induction var >> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >> > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) >> > ;; <check and branch> >> > `` >> > ==========================>> > 4. Generating Vector Values >> > ==========================>> > For constant vector values, we cannot specify all the elements as we >> can for >> > fixed-length vectors; fortunately only a small number of easily >> synthesized >> > patterns are required for autovectorization. The `zeroinitializer` >> constant >> > can be used in the same manner as fixed-length vectors for a constant >> zero >> > splat. This can then be combined with `insertelement` and >> `shufflevector` >> > to create arbitrary value splats in the same manner as fixed-length >> vectors. >> > >> > For constants consisting of a sequence of values, an experimental >> `stepvector` >> > intrinsic has been added to represent a simple constant of the form >> > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the >> new >> > start can be added, and changing the step requires multiplying by a >> splat. >> > >> > Fixed-Length Code >> > ----------------- >> > `` >> > ;; Splat a value >> > %insert = insertelement <4 x i32> undef, i32 %value, i32 0 >> > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> >> zeroinitializer >> > ;; Add a constant sequence >> > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8> >> > `` >> > Scalable Equivalent >> > ------------------- >> > `` >> > ;; Splat a value >> > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0 >> > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> >> undef, <scalable 4 x i32> zeroinitializer >> > ;; Splat offset + stride (the same in this case) >> > %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0 >> > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x >> i32> undef, <scalable 4 x i32> zeroinitializer >> > ;; Create sequence for scalable vector >> > %stepvector = call <scalable 4 x i32> >> @llvm.experimental.vector.stepvector.nxv4i32() >> > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off >> > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off >> > ;; Add the runtime-generated sequence >> > %add = add <scalable 4 x i32> %splat, %addoffset >> > `` >> > Future Work >> > ----------- >> > >> > Intrinsics cannot currently be used for constant folding. Our downstream >> > compiler (using Constants instead of intrinsics) relies quite heavily >> on this >> > for good code generation, so we will need to find new ways to recognize >> and >> > fold these values. >> > >> > ==========================================>> > 5. Splitting and Combining Scalable Vectors >> > ==========================================>> > >> > Splitting and combining scalable vectors in IR is done in the same >> manner as >> > for fixed-length vectors, but with a non-constant mask for the >> shufflevector. >> > >> > The following is an example of splitting a <scalable 4 x double> into >> two >> > separate <scalable 2 x double> values. >> > >> > `` >> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >> > ;; Stepvector generates the element ids for first subvector >> > %sv1 = call <scalable 2 x i64> >> @llvm.experimental.vector.stepvector.nxv2i64() >> > ;; Add vscale * 2 to get the starting element for the second subvector >> > %ec = mul i64 %vscale64, 2 >> > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0 >> > %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x i64> >> undef, <scalable 2 x i32> zeroinitializer >> > %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64 >> > ;; Perform the extracts >> > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> >> undef, <scalable 2 x i64> %sv1 >> > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> >> undef, <scalable 2 x i64> %sv2 >> > `` >> > >> > =================>> > 6. Code Generation >> > =================>> > >> > IR splats will be converted to an experimental splatvector intrinsic in >> > SelectionDAGBuilder. >> > >> > All three intrinsics are custom lowered and legalized in the AArch64 >> backend. >> > >> > Two new AArch64ISD nodes have been added to represent the same concepts >> > at the SelectionDAG level, while splatvector maps onto the existing >> > AArch64ISD::DUP. >> > >> > GlobalISel >> > ---------- >> > >> > Since GlobalISel was enabled by default on AArch64, it was necessary to >> add >> > scalable vector support to the LowLevelType implementation. A single >> bit was >> > added to the raw_data representation for vectors and vectors of >> pointers. >> > >> > In addition, types that only exist in destination patterns are planted >> in >> > the enumeration of available types for generated code. While this may >> not be >> > necessary in future, generating an all-true 'ptrue' value was necessary >> to >> > convert a predicated instruction into an unpredicated one. >> > >> > =========>> > 7. Example >> > =========>> > >> > The following example shows a simple C loop which assigns the array >> index to >> > the array elements matching that index. The IR shows how vscale and >> stepvector >> > are used to create the needed values and to advance the index variable >> in the >> > loop. >> > >> > C Code >> > ------ >> > >> > `` >> > void IdentityArrayInit(int *a, int count) { >> > for (int i = 0; i < count; ++i) >> > a[i] = i; >> > } >> > `` >> > >> > Scalable IR Vector Body >> > ----------------------- >> > >> > `` >> > vector.body.preheader: >> > ;; Other setup >> > ;; Stepvector used to create initial identity vector >> > %stepvector = call <scalable 4 x i32> >> @llvm.experimental.vector.stepvector.nxv4i32() >> > br vector.body >> > >> > vector.body >> > %index = phi i64 [ %index.next, %vector.body ], [ 0, >> %vector.body.preheader ] >> > %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ] >> > >> > ;; stepvector used for index identity on entry to loop body ;; >> > %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], >> > [ %stepvector, >> %vector.body.preheader ] >> > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() >> > %vscale32 = trunc i64 %vscale64 to i32 >> > %1 = add i64 %0, mul (i64 %vscale64, i64 4) >> > >> > ;; vscale splat used to increment identity vector ;; >> > %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 >> %vscale32, i32 4), i32 0 >> > %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> >> undef, <scalable 4 x i32> zeroinitializer >> > %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat >> > %2 = getelementptr inbounds i32, i32* %a, i64 %0 >> > %3 = bitcast i32* %2 to <scalable 4 x i32>* >> > store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4 >> > >> > ;; vscale used to increment loop index >> > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) >> > %4 = icmp eq i64 %index.next, %n.vec >> > br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 >> > `` >> > >> > =========>> > 8. Patches >> > =========>> > >> > List of patches: >> > >> > 1. Extend VectorType: https://reviews.llvm.org/D32530 >> > 2. Vector element type Tablegen constraint: >> https://reviews.llvm.org/D47768 >> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 >> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 >> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771 >> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 >> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 >> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 >> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 >> > 10. Initial store patterns: https://reviews.llvm.org/D47776 >> > 11. Initial addition patterns: https://reviews.llvm.org/D47777 >> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 >> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 >> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 >> > >> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180730/15b929f0/attachment-0001.html>
David Greene via llvm-dev
2018-Jul-30 17:37 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Chandler Carruth wrote:> I strongly suspect that there remains widespread concern with the > direction of this, I know I have them. > > I don't think that many of the people who have that concern have had > time to come back to this RFC and make progress on it, likely because > of other commitments or simply the amount of churn around SVE related > patches and such. That is at least why I haven't had time to return to > this RFC and try to write more detailed feedback.We believe ARM SVE will be an important architecture going forward. As such, it's important to us that these questions and concerns get posted and discussed, whatever the outcome may be. If there are objections, alternative proposals would be helpful. I see a lot of SVE patches on Phab that are described as "not for review." I don't know how helpful that is. It would be more helpful to have actual patches intended for review/commit. It is difficult to know which is which in Phab. Could patches not intended for review either be removed if not needed, or their subjects updated to indicate they are not for review but for discussion purposes so that it's easier to filter search results? -David
Hal Finkel via llvm-dev
2018-Jul-30 19:10 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 07/30/2018 05:34 AM, Chandler Carruth wrote:> I strongly suspect that there remains widespread concern with the > direction of this, I know I have them. > > I don't think that many of the people who have that concern have had > time to come back to this RFC and make progress on it, likely because > of other commitments or simply the amount of churn around SVE related > patches and such. That is at least why I haven't had time to return to > this RFC and try to write more detailed feedback. > > Certainly, I would want to see pretty clear and considered support for > this change to the IR type system from Hal, Chris, Eric and/or other > long time maintainers of core LLVM IR components before it moves > forward, and I don't see that in this thread.At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless. I have two questions: 1.> This is a proposal for how to deal with querying the size of scalable > types for > > analysis of IR. While it has not been implemented in full,Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified. 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design. Thanks again, Hal> > Put differently: I don't think silence is assent here. You really need > some clear signal of consensus. > > On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com > <mailto:Graham.Hunter at arm.com>> wrote: > > Hi, > > Are there any objections to going ahead with this? If not, we'll > try to get the patches reviewed and committed after the 7.0 branch > occurs. > > -Graham > > > On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com > <mailto:Graham.Hunter at arm.com>> wrote: > > > > Hi, > > > > I've updated the RFC slightly based on the discussion within the > thread, reposted below. Let me know if I've missed anything or if > more clarification is needed. > > > > Thanks, > > > > -Graham > > > > ============================================================> > Supporting SIMD instruction sets with variable vector lengths > > ============================================================> > > > In this RFC we propose extending LLVM IR to support > code-generation for variable > > length vector architectures like Arm's SVE or RISC-V's 'V' > extension. Our > > approach is backwards compatible and should be as non-intrusive > as possible; the > > only change needed in other backends is how size is queried on > vector types, and > > it only requires a change in which function is called. We have > created a set of > > proof-of-concept patches to represent a simple vectorized loop > in IR and > > generate SVE instructions from that IR. These patches (listed in > section 7 of > > this rfc) can be found on Phabricator and are intended to > illustrate the scope > > of changes required by the general approach described in this RFC. > > > > =========> > Background > > =========> > > > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA > extension for > > AArch64 which is intended to scale with hardware such that the > same binary > > running on a processor with longer vector registers can take > advantage of the > > increased compute power without recompilation. > > > > As the vector length is no longer a compile-time known value, > the way in which > > the LLVM vectorizer generates code requires modifications such > that certain > > values are now runtime evaluated expressions instead of > compile-time constants. > > > > Documentation for SVE can be found at > > > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a > > > > =======> > Contents > > =======> > > > The rest of this RFC covers the following topics: > > > > 1. Types -- a proposal to extend VectorType to be able to > represent vectors that > > have a length which is a runtime-determined multiple of a > known base length. > > > > 2. Size Queries - how to reason about the size of types for > which the size isn't > > fully known at compile time. > > > > 3. Representing the runtime multiple of vector length in IR for > use in address > > calculations and induction variable comparisons. > > > > 4. Generating 'constant' values in IR for vectors with a > runtime-determined > > number of elements. > > > > 5. An explanation of splitting/concatentating scalable vectors. > > > > 6. A brief note on code generation of these new operations for > AArch64. > > > > 7. An example of C code and matching IR using the proposed > extensions. > > > > 8. A list of patches demonstrating the changes required to emit > SVE instructions > > for a loop that has already been vectorized using the > extensions described > > in this RFC. > > > > =======> > 1. Types > > =======> > > > To represent a vector of unknown length a boolean `Scalable` > property has been > > added to the `VectorType` class, which indicates that the number > of elements in > > the vector is a runtime-determined integer multiple of the > `NumElements` field. > > Most code that deals with vectors doesn't need to know the exact > length, but > > does need to know relative lengths -- e.g. get a vector with the > same number of > > elements but a different element type, or with half or double > the number of > > elements. > > > > In order to allow code to transparently support scalable > vectors, we introduce > > an `ElementCount` class with two members: > > > > - `unsigned Min`: the minimum number of elements. > > - `bool Scalable`: is the element count an unknown multiple of > `Min`? > > > > For non-scalable vectors (``Scalable=false``) the scale is > considered to be > > equal to one and thus `Min` represents the exact number of > elements in the > > vector. > > > > The intent for code working with vectors is to use convenience > methods and avoid > > directly dealing with the number of elements. If needed, calling > > `getElementCount` on a vector type instead of > `getVectorNumElements` can be used > > to obtain the (potentially scalable) number of elements. > Overloaded division and > > multiplication operators allow an ElementCount instance to be > used in much the > > same manner as an integer for most cases. > > > > This mixture of compile-time and runtime quantities allow us to > reason about the > > relationship between different scalable vector types without > knowing their > > exact length. > > > > The runtime multiple is not expected to change during program > execution for SVE, > > but it is possible. The model of scalable vectors presented in > this RFC assumes > > that the multiple will be constant within a function but not > necessarily across > > functions. As suggested in the recent RISC-V rfc, a new function > attribute to > > inherit the multiple across function calls will allow for > function calls with > > vector arguments/return values and inlining/outlining optimizations. > > > > IR Textual Form > > --------------- > > > > The textual form for a scalable vector is: > > > > ``<scalable <n> x <type>>`` > > > > where `type` is the scalar type of each element, `n` is the > minimum number of > > elements, and the string literal `scalable` indicates that the > total number of > > elements is an unknown multiple of `n`; `scalable` is just an > arbitrary choice > > for indicating that the vector is scalable, and could be > substituted by another. > > For fixed-length vectors, the `scalable` is omitted, so there is > no change in > > the format for existing vectors. > > > > Scalable vectors with the same `Min` value have the same number > of elements, and > > the same number of bytes if `Min * sizeof(type)` is the same > (assuming they are > > used within the same function): > > > > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same > number of > > elements. > > > > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same > number of > > bytes. > > > > IR Bitcode Form > > --------------- > > > > To serialize scalable vectors to bitcode, a new boolean field is > added to the > > type record. If the field is not present the type will default > to a fixed-length > > vector type, preserving backwards compatibility. > > > > Alternatives Considered > > ----------------------- > > > > We did consider one main alternative -- a dedicated target type, > like the > > x86_mmx type. > > > > A dedicated target type would either need to extend all existing > passes that > > work with vectors to recognize the new type, or to duplicate all > that code > > in order to get reasonable code generation and autovectorization. > > > > This hasn't been done for the x86_mmx type, and so it is only > capable of > > providing support for C-level intrinsics instead of being used > and recognized by > > passes inside llvm. > > > > Although our current solution will need to change some of the > code that creates > > new VectorTypes, much of that code doesn't need to care about > whether the types > > are scalable or not -- they can use preexisting methods like > > `getHalfElementsVectorType`. If the code is a little more complex, > > `ElementCount` structs can be used instead of an `unsigned` > value to represent > > the number of elements. > > > > ==============> > 2. Size Queries > > ==============> > > > This is a proposal for how to deal with querying the size of > scalable types for > > analysis of IR. While it has not been implemented in full, the > general approach > > works well for calculating offsets into structures with scalable > types in a > > modified version of ComputeValueVTs in our downstream compiler. > > > > For current IR types that have a known size, all query functions > return a single > > integer constant. For scalable types a second integer is needed > to indicate the > > number of bytes/bits which need to be scaled by the runtime > multiple to obtain > > the actual length. > > > > For primitive types, `getPrimitiveSizeInBits()` will function as > it does today, > > except that it will no longer return a size for vector types (it > will return 0, > > as it does for other derived types). The majority of calls to > this function are > > already for scalar rather than vector types. > > > > For derived types, a function `getScalableSizePairInBits()` will > be added, which > > returns a pair of integers (one to indicate unscaled bits, the > other for bits > > that need to be scaled by the runtime multiple). For backends > that do not need > > to deal with scalable types the existing methods will suffice, > but a debug-only > > assert will be added to them to ensure they aren't used on > scalable types. > > > > Similar functionality will be added to DataLayout. > > > > Comparisons between sizes will use the following methods, > assuming that X and > > Y are non-zero integers and the form is of { unscaled, scaled }. > > > > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison. > > > > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or > across > > functions that inherit vector length. > Cannot be > > compared across non-inheriting functions. > > > > { X, 0 } > { 0, Y }: Cannot return true. > > > > { X, 0 } = { 0, Y }: Cannot return true. > > > > { X, 0 } < { 0, Y }: Can return true. > > > > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract > common > > terms and try the above comparisons; it > > may not be possible to get a good > answer. > > > > It's worth noting that we don't expect the last case (mixed > scaled and > > unscaled sizes) to occur. Richard Sandiford's proposed C extensions > > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) > explicitly > > prohibits mixing fixed-size types into sizeless struct. > > > > I don't know if we need a 'maybe' or 'unknown' result for cases > comparing scaled > > vs. unscaled; I believe the gcc implementation of SVE allows for > such > > results, but that supports a generic polynomial length > representation. > > > > My current intention is to rely on functions that clone or copy > values to > > check whether they are being used to copy scalable vectors > across function > > boundaries without the inherit vlen attribute and raise an error > there instead > > of requiring passing the Function a type size is from for each > comparison. If > > there's a strong preference for moving the check to the size > comparison function > > let me know; I will be starting work on patches for this later > in the year if > > there's no major problems with the idea. > > > > Future Work > > ----------- > > > > Since we cannot determine the exact size of a scalable vector, the > > existing logic for alias detection won't work when multiple accesses > > share a common base pointer with different offsets. > > > > However, SVE's predication will mean that a dynamic 'safe' > vector length > > can be determined at runtime, so after initial support has been > added we > > can work on vectorizing loops using runtime predication to avoid > aliasing > > problems. > > > > Alternatives Considered > > ----------------------- > > > > Marking scalable vectors as unsized doesn't work well, as many > parts of > > llvm dealing with loads and stores assert that 'isSized()' > returns true > > and make use of the size when calculating offsets. > > > > We have considered introducing multiple helper functions instead of > > using direct size queries, but that doesn't cover all cases. It may > > still be a good idea to introduce them to make the purpose in a > given > > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. > > > > =======================================> > 3. Representing Vector Length at Runtime > > =======================================> > > > With a scalable vector type defined, we now need a way to > represent the runtime > > length in IR in order to generate addresses for consecutive > vectors in memory > > and determine how many elements have been processed in an > iteration of a loop. > > > > We have added an experimental `vscale` intrinsic to represent > the runtime > > multiple. Multiplying the result of this intrinsic by the > minimum number of > > elements in a vector gives the total number of elements in a > scalable vector. > > > > Fixed-Length Code > > ----------------- > > > > Assuming a vector type of <4 x <ty>> > > `` > > vector.body: > > %index = phi i64 [ %index.next, %vector.body ], [ 0, > %vector.body.preheader ] > > ;; <loop body> > > ;; Increment induction var > > %index.next = add i64 %index, 4 > > ;; <check and branch> > > `` > > Scalable Equivalent > > ------------------- > > > > Assuming a vector type of <scalable 4 x <ty>> > > `` > > vector.body: > > %index = phi i64 [ %index.next, %vector.body ], [ 0, > %vector.body.preheader ] > > ;; <loop body> > > ;; Increment induction var > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > > ;; <check and branch> > > `` > > ==========================> > 4. Generating Vector Values > > ==========================> > For constant vector values, we cannot specify all the elements > as we can for > > fixed-length vectors; fortunately only a small number of easily > synthesized > > patterns are required for autovectorization. The > `zeroinitializer` constant > > can be used in the same manner as fixed-length vectors for a > constant zero > > splat. This can then be combined with `insertelement` and > `shufflevector` > > to create arbitrary value splats in the same manner as > fixed-length vectors. > > > > For constants consisting of a sequence of values, an > experimental `stepvector` > > intrinsic has been added to represent a simple constant of the form > > `<0, 1, 2... num_elems-1>`. To change the starting value a splat > of the new > > start can be added, and changing the step requires multiplying > by a splat. > > > > Fixed-Length Code > > ----------------- > > `` > > ;; Splat a value > > %insert = insertelement <4 x i32> undef, i32 %value, i32 0 > > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x > i32> zeroinitializer > > ;; Add a constant sequence > > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8> > > `` > > Scalable Equivalent > > ------------------- > > `` > > ;; Splat a value > > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0 > > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 > x i32> undef, <scalable 4 x i32> zeroinitializer > > ;; Splat offset + stride (the same in this case) > > %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0 > > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable > 4 x i32> undef, <scalable 4 x i32> zeroinitializer > > ;; Create sequence for scalable vector > > %stepvector = call <scalable 4 x i32> > @llvm.experimental.vector.stepvector.nxv4i32() > > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off > > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off > > ;; Add the runtime-generated sequence > > %add = add <scalable 4 x i32> %splat, %addoffset > > `` > > Future Work > > ----------- > > > > Intrinsics cannot currently be used for constant folding. Our > downstream > > compiler (using Constants instead of intrinsics) relies quite > heavily on this > > for good code generation, so we will need to find new ways to > recognize and > > fold these values. > > > > ==========================================> > 5. Splitting and Combining Scalable Vectors > > ==========================================> > > > Splitting and combining scalable vectors in IR is done in the > same manner as > > for fixed-length vectors, but with a non-constant mask for the > shufflevector. > > > > The following is an example of splitting a <scalable 4 x double> > into two > > separate <scalable 2 x double> values. > > > > `` > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > ;; Stepvector generates the element ids for first subvector > > %sv1 = call <scalable 2 x i64> > @llvm.experimental.vector.stepvector.nxv2i64() > > ;; Add vscale * 2 to get the starting element for the second > subvector > > %ec = mul i64 %vscale64, 2 > > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0 > > %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x > i64> undef, <scalable 2 x i32> zeroinitializer > > %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64 > > ;; Perform the extracts > > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x > double> undef, <scalable 2 x i64> %sv1 > > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x > double> undef, <scalable 2 x i64> %sv2 > > `` > > > > =================> > 6. Code Generation > > =================> > > > IR splats will be converted to an experimental splatvector > intrinsic in > > SelectionDAGBuilder. > > > > All three intrinsics are custom lowered and legalized in the > AArch64 backend. > > > > Two new AArch64ISD nodes have been added to represent the same > concepts > > at the SelectionDAG level, while splatvector maps onto the existing > > AArch64ISD::DUP. > > > > GlobalISel > > ---------- > > > > Since GlobalISel was enabled by default on AArch64, it was > necessary to add > > scalable vector support to the LowLevelType implementation. A > single bit was > > added to the raw_data representation for vectors and vectors of > pointers. > > > > In addition, types that only exist in destination patterns are > planted in > > the enumeration of available types for generated code. While > this may not be > > necessary in future, generating an all-true 'ptrue' value was > necessary to > > convert a predicated instruction into an unpredicated one. > > > > =========> > 7. Example > > =========> > > > The following example shows a simple C loop which assigns the > array index to > > the array elements matching that index. The IR shows how vscale > and stepvector > > are used to create the needed values and to advance the index > variable in the > > loop. > > > > C Code > > ------ > > > > `` > > void IdentityArrayInit(int *a, int count) { > > for (int i = 0; i < count; ++i) > > a[i] = i; > > } > > `` > > > > Scalable IR Vector Body > > ----------------------- > > > > `` > > vector.body.preheader: > > ;; Other setup > > ;; Stepvector used to create initial identity vector > > %stepvector = call <scalable 4 x i32> > @llvm.experimental.vector.stepvector.nxv4i32() > > br vector.body > > > > vector.body > > %index = phi i64 [ %index.next, %vector.body ], [ 0, > %vector.body.preheader ] > > %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ] > > > > ;; stepvector used for index identity on entry to loop > body ;; > > %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], > > [ %stepvector, > %vector.body.preheader ] > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > %vscale32 = trunc i64 %vscale64 to i32 > > %1 = add i64 %0, mul (i64 %vscale64, i64 4) > > > > ;; vscale splat used to increment identity vector ;; > > %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 > %vscale32, i32 4), i32 0 > > %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x > i32> undef, <scalable 4 x i32> zeroinitializer > > %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat > > %2 = getelementptr inbounds i32, i32* %a, i64 %0 > > %3 = bitcast i32* %2 to <scalable 4 x i32>* > > store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4 > > > > ;; vscale used to increment loop index > > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > > %4 = icmp eq i64 %index.next, %n.vec > > br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 > > `` > > > > =========> > 8. Patches > > =========> > > > List of patches: > > > > 1. Extend VectorType: https://reviews.llvm.org/D32530 > > 2. Vector element type Tablegen constraint: > https://reviews.llvm.org/D47768 > > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 > > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 > > 5. SVE Calling Convention: https://reviews.llvm.org/D47771 > > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 > > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 > > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 > > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 > > 10. Initial store patterns: https://reviews.llvm.org/D47776 > > 11. Initial addition patterns: https://reviews.llvm.org/D47777 > > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 > > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 > > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 > > >-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180730/25c38df6/attachment.html>
Graham Hunter via llvm-dev
2018-Jul-31 18:10 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi Chandler,> On 30 Jul 2018, at 11:34, Chandler Carruth <chandlerc at gmail.com> wrote: > > I strongly suspect that there remains widespread concern with the direction of this, I know I have them. > > I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.The core IR patches (listed in the RFC) haven't really changed much and are ready for review. I appreciate that Sander has pushed a lot of SVE-related patches for MC through recently though.> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread. > > Put differently: I don't think silence is assent here. You really need some clear signal of consensus.Understood. Thankfully there seems to be more interest in this now... I guess people will be busy with the release in the near future but I can work on responding to all the new messages now. I'll try to log in to irc during the evenings (UK time) if that would help. -Graham> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <Graham.Hunter at arm.com> wrote: > Hi, > > Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs. > > -Graham > > > On 2 Jul 2018, at 10:53, Graham Hunter <Graham.Hunter at arm.com> wrote: > > > > Hi, > > > > I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed. > > > > Thanks, > > > > -Graham > > > > ============================================================> > Supporting SIMD instruction sets with variable vector lengths > > ============================================================> > > > In this RFC we propose extending LLVM IR to support code-generation for variable > > length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our > > approach is backwards compatible and should be as non-intrusive as possible; the > > only change needed in other backends is how size is queried on vector types, and > > it only requires a change in which function is called. We have created a set of > > proof-of-concept patches to represent a simple vectorized loop in IR and > > generate SVE instructions from that IR. These patches (listed in section 7 of > > this rfc) can be found on Phabricator and are intended to illustrate the scope > > of changes required by the general approach described in this RFC. > > > > =========> > Background > > =========> > > > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for > > AArch64 which is intended to scale with hardware such that the same binary > > running on a processor with longer vector registers can take advantage of the > > increased compute power without recompilation. > > > > As the vector length is no longer a compile-time known value, the way in which > > the LLVM vectorizer generates code requires modifications such that certain > > values are now runtime evaluated expressions instead of compile-time constants. > > > > Documentation for SVE can be found at > > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a > > > > =======> > Contents > > =======> > > > The rest of this RFC covers the following topics: > > > > 1. Types -- a proposal to extend VectorType to be able to represent vectors that > > have a length which is a runtime-determined multiple of a known base length. > > > > 2. Size Queries - how to reason about the size of types for which the size isn't > > fully known at compile time. > > > > 3. Representing the runtime multiple of vector length in IR for use in address > > calculations and induction variable comparisons. > > > > 4. Generating 'constant' values in IR for vectors with a runtime-determined > > number of elements. > > > > 5. An explanation of splitting/concatentating scalable vectors. > > > > 6. A brief note on code generation of these new operations for AArch64. > > > > 7. An example of C code and matching IR using the proposed extensions. > > > > 8. A list of patches demonstrating the changes required to emit SVE instructions > > for a loop that has already been vectorized using the extensions described > > in this RFC. > > > > =======> > 1. Types > > =======> > > > To represent a vector of unknown length a boolean `Scalable` property has been > > added to the `VectorType` class, which indicates that the number of elements in > > the vector is a runtime-determined integer multiple of the `NumElements` field. > > Most code that deals with vectors doesn't need to know the exact length, but > > does need to know relative lengths -- e.g. get a vector with the same number of > > elements but a different element type, or with half or double the number of > > elements. > > > > In order to allow code to transparently support scalable vectors, we introduce > > an `ElementCount` class with two members: > > > > - `unsigned Min`: the minimum number of elements. > > - `bool Scalable`: is the element count an unknown multiple of `Min`? > > > > For non-scalable vectors (``Scalable=false``) the scale is considered to be > > equal to one and thus `Min` represents the exact number of elements in the > > vector. > > > > The intent for code working with vectors is to use convenience methods and avoid > > directly dealing with the number of elements. If needed, calling > > `getElementCount` on a vector type instead of `getVectorNumElements` can be used > > to obtain the (potentially scalable) number of elements. Overloaded division and > > multiplication operators allow an ElementCount instance to be used in much the > > same manner as an integer for most cases. > > > > This mixture of compile-time and runtime quantities allow us to reason about the > > relationship between different scalable vector types without knowing their > > exact length. > > > > The runtime multiple is not expected to change during program execution for SVE, > > but it is possible. The model of scalable vectors presented in this RFC assumes > > that the multiple will be constant within a function but not necessarily across > > functions. As suggested in the recent RISC-V rfc, a new function attribute to > > inherit the multiple across function calls will allow for function calls with > > vector arguments/return values and inlining/outlining optimizations. > > > > IR Textual Form > > --------------- > > > > The textual form for a scalable vector is: > > > > ``<scalable <n> x <type>>`` > > > > where `type` is the scalar type of each element, `n` is the minimum number of > > elements, and the string literal `scalable` indicates that the total number of > > elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice > > for indicating that the vector is scalable, and could be substituted by another. > > For fixed-length vectors, the `scalable` is omitted, so there is no change in > > the format for existing vectors. > > > > Scalable vectors with the same `Min` value have the same number of elements, and > > the same number of bytes if `Min * sizeof(type)` is the same (assuming they are > > used within the same function): > > > > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of > > elements. > > > > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of > > bytes. > > > > IR Bitcode Form > > --------------- > > > > To serialize scalable vectors to bitcode, a new boolean field is added to the > > type record. If the field is not present the type will default to a fixed-length > > vector type, preserving backwards compatibility. > > > > Alternatives Considered > > ----------------------- > > > > We did consider one main alternative -- a dedicated target type, like the > > x86_mmx type. > > > > A dedicated target type would either need to extend all existing passes that > > work with vectors to recognize the new type, or to duplicate all that code > > in order to get reasonable code generation and autovectorization. > > > > This hasn't been done for the x86_mmx type, and so it is only capable of > > providing support for C-level intrinsics instead of being used and recognized by > > passes inside llvm. > > > > Although our current solution will need to change some of the code that creates > > new VectorTypes, much of that code doesn't need to care about whether the types > > are scalable or not -- they can use preexisting methods like > > `getHalfElementsVectorType`. If the code is a little more complex, > > `ElementCount` structs can be used instead of an `unsigned` value to represent > > the number of elements. > > > > ==============> > 2. Size Queries > > ==============> > > > This is a proposal for how to deal with querying the size of scalable types for > > analysis of IR. While it has not been implemented in full, the general approach > > works well for calculating offsets into structures with scalable types in a > > modified version of ComputeValueVTs in our downstream compiler. > > > > For current IR types that have a known size, all query functions return a single > > integer constant. For scalable types a second integer is needed to indicate the > > number of bytes/bits which need to be scaled by the runtime multiple to obtain > > the actual length. > > > > For primitive types, `getPrimitiveSizeInBits()` will function as it does today, > > except that it will no longer return a size for vector types (it will return 0, > > as it does for other derived types). The majority of calls to this function are > > already for scalar rather than vector types. > > > > For derived types, a function `getScalableSizePairInBits()` will be added, which > > returns a pair of integers (one to indicate unscaled bits, the other for bits > > that need to be scaled by the runtime multiple). For backends that do not need > > to deal with scalable types the existing methods will suffice, but a debug-only > > assert will be added to them to ensure they aren't used on scalable types. > > > > Similar functionality will be added to DataLayout. > > > > Comparisons between sizes will use the following methods, assuming that X and > > Y are non-zero integers and the form is of { unscaled, scaled }. > > > > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison. > > > > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across > > functions that inherit vector length. Cannot be > > compared across non-inheriting functions. > > > > { X, 0 } > { 0, Y }: Cannot return true. > > > > { X, 0 } = { 0, Y }: Cannot return true. > > > > { X, 0 } < { 0, Y }: Can return true. > > > > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common > > terms and try the above comparisons; it > > may not be possible to get a good answer. > > > > It's worth noting that we don't expect the last case (mixed scaled and > > unscaled sizes) to occur. Richard Sandiford's proposed C extensions > > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly > > prohibits mixing fixed-size types into sizeless struct. > > > > I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled > > vs. unscaled; I believe the gcc implementation of SVE allows for such > > results, but that supports a generic polynomial length representation. > > > > My current intention is to rely on functions that clone or copy values to > > check whether they are being used to copy scalable vectors across function > > boundaries without the inherit vlen attribute and raise an error there instead > > of requiring passing the Function a type size is from for each comparison. If > > there's a strong preference for moving the check to the size comparison function > > let me know; I will be starting work on patches for this later in the year if > > there's no major problems with the idea. > > > > Future Work > > ----------- > > > > Since we cannot determine the exact size of a scalable vector, the > > existing logic for alias detection won't work when multiple accesses > > share a common base pointer with different offsets. > > > > However, SVE's predication will mean that a dynamic 'safe' vector length > > can be determined at runtime, so after initial support has been added we > > can work on vectorizing loops using runtime predication to avoid aliasing > > problems. > > > > Alternatives Considered > > ----------------------- > > > > Marking scalable vectors as unsized doesn't work well, as many parts of > > llvm dealing with loads and stores assert that 'isSized()' returns true > > and make use of the size when calculating offsets. > > > > We have considered introducing multiple helper functions instead of > > using direct size queries, but that doesn't cover all cases. It may > > still be a good idea to introduce them to make the purpose in a given > > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'. > > > > =======================================> > 3. Representing Vector Length at Runtime > > =======================================> > > > With a scalable vector type defined, we now need a way to represent the runtime > > length in IR in order to generate addresses for consecutive vectors in memory > > and determine how many elements have been processed in an iteration of a loop. > > > > We have added an experimental `vscale` intrinsic to represent the runtime > > multiple. Multiplying the result of this intrinsic by the minimum number of > > elements in a vector gives the total number of elements in a scalable vector. > > > > Fixed-Length Code > > ----------------- > > > > Assuming a vector type of <4 x <ty>> > > `` > > vector.body: > > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] > > ;; <loop body> > > ;; Increment induction var > > %index.next = add i64 %index, 4 > > ;; <check and branch> > > `` > > Scalable Equivalent > > ------------------- > > > > Assuming a vector type of <scalable 4 x <ty>> > > `` > > vector.body: > > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] > > ;; <loop body> > > ;; Increment induction var > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > > ;; <check and branch> > > `` > > ==========================> > 4. Generating Vector Values > > ==========================> > For constant vector values, we cannot specify all the elements as we can for > > fixed-length vectors; fortunately only a small number of easily synthesized > > patterns are required for autovectorization. The `zeroinitializer` constant > > can be used in the same manner as fixed-length vectors for a constant zero > > splat. This can then be combined with `insertelement` and `shufflevector` > > to create arbitrary value splats in the same manner as fixed-length vectors. > > > > For constants consisting of a sequence of values, an experimental `stepvector` > > intrinsic has been added to represent a simple constant of the form > > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new > > start can be added, and changing the step requires multiplying by a splat. > > > > Fixed-Length Code > > ----------------- > > `` > > ;; Splat a value > > %insert = insertelement <4 x i32> undef, i32 %value, i32 0 > > %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer > > ;; Add a constant sequence > > %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8> > > `` > > Scalable Equivalent > > ------------------- > > `` > > ;; Splat a value > > %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0 > > %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer > > ;; Splat offset + stride (the same in this case) > > %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0 > > %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer > > ;; Create sequence for scalable vector > > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() > > %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off > > %addoffset = add <scalable 4 x i32> %mulbystride, %str_off > > ;; Add the runtime-generated sequence > > %add = add <scalable 4 x i32> %splat, %addoffset > > `` > > Future Work > > ----------- > > > > Intrinsics cannot currently be used for constant folding. Our downstream > > compiler (using Constants instead of intrinsics) relies quite heavily on this > > for good code generation, so we will need to find new ways to recognize and > > fold these values. > > > > ==========================================> > 5. Splitting and Combining Scalable Vectors > > ==========================================> > > > Splitting and combining scalable vectors in IR is done in the same manner as > > for fixed-length vectors, but with a non-constant mask for the shufflevector. > > > > The following is an example of splitting a <scalable 4 x double> into two > > separate <scalable 2 x double> values. > > > > `` > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > ;; Stepvector generates the element ids for first subvector > > %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64() > > ;; Add vscale * 2 to get the starting element for the second subvector > > %ec = mul i64 %vscale64, 2 > > %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0 > > %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer > > %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64 > > ;; Perform the extracts > > %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1 > > %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2 > > `` > > > > =================> > 6. Code Generation > > =================> > > > IR splats will be converted to an experimental splatvector intrinsic in > > SelectionDAGBuilder. > > > > All three intrinsics are custom lowered and legalized in the AArch64 backend. > > > > Two new AArch64ISD nodes have been added to represent the same concepts > > at the SelectionDAG level, while splatvector maps onto the existing > > AArch64ISD::DUP. > > > > GlobalISel > > ---------- > > > > Since GlobalISel was enabled by default on AArch64, it was necessary to add > > scalable vector support to the LowLevelType implementation. A single bit was > > added to the raw_data representation for vectors and vectors of pointers. > > > > In addition, types that only exist in destination patterns are planted in > > the enumeration of available types for generated code. While this may not be > > necessary in future, generating an all-true 'ptrue' value was necessary to > > convert a predicated instruction into an unpredicated one. > > > > =========> > 7. Example > > =========> > > > The following example shows a simple C loop which assigns the array index to > > the array elements matching that index. The IR shows how vscale and stepvector > > are used to create the needed values and to advance the index variable in the > > loop. > > > > C Code > > ------ > > > > `` > > void IdentityArrayInit(int *a, int count) { > > for (int i = 0; i < count; ++i) > > a[i] = i; > > } > > `` > > > > Scalable IR Vector Body > > ----------------------- > > > > `` > > vector.body.preheader: > > ;; Other setup > > ;; Stepvector used to create initial identity vector > > %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32() > > br vector.body > > > > vector.body > > %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ] > > %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ] > > > > ;; stepvector used for index identity on entry to loop body ;; > > %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], > > [ %stepvector, %vector.body.preheader ] > > %vscale64 = call i64 @llvm.experimental.vector.vscale.64() > > %vscale32 = trunc i64 %vscale64 to i32 > > %1 = add i64 %0, mul (i64 %vscale64, i64 4) > > > > ;; vscale splat used to increment identity vector ;; > > %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0 > > %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer > > %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat > > %2 = getelementptr inbounds i32, i32* %a, i64 %0 > > %3 = bitcast i32* %2 to <scalable 4 x i32>* > > store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4 > > > > ;; vscale used to increment loop index > > %index.next = add i64 %index, mul (i64 %vscale64, i64 4) > > %4 = icmp eq i64 %index.next, %n.vec > > br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5 > > `` > > > > =========> > 8. Patches > > =========> > > > List of patches: > > > > 1. Extend VectorType: https://reviews.llvm.org/D32530 > > 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768 > > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769 > > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770 > > 5. SVE Calling Convention: https://reviews.llvm.org/D47771 > > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772 > > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773 > > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774 > > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775 > > 10. Initial store patterns: https://reviews.llvm.org/D47776 > > 11. Initial addition patterns: https://reviews.llvm.org/D47777 > > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778 > > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779 > > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780 > > >