Simon Moll via llvm-dev
2018-Jul-02 15:08 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi,

I am the main author of RV, the Region Vectorizer (github.com/cdl-saarland/rv). I want to share our standpoint as potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).

-- support for `llvm.experimental.vector.reduce.*` intrinsics --

RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target-specific ones or go through the painful process of building VLA-style reduction trees with loops or the like.

-- setting the vector length (MVL) --

I really like the idea of the `inherits_vlen` attribute. Absence of this attribute in a callee means we can safely stop tracking the vector length across the call boundary.

However, I think there are some issues with the `vlen token` approach.

* Why do you need an explicit vlen token if there is a 1:1 (or 1:0) correspondence between functions and vlen tokens?

* My main concern is that you are navigating towards a local optimum here. All is well as long as there is only one vector length per function. However, if the architecture supports changing the vector length at any point but you explicitly forbid it, programmers will complain; well, I will for one ;-) Once you give in to that demand, you are facing the situation that multiple vector length tokens are live within the same function. This means you have to stop transformations from mixing vector operations with different vector lengths: these would otherwise incur an expensive state change at every vlen transition. However, there is no natural way to express that two SSA values (vlen tokens) must not be live at the same program point.

On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
> There are some operations that use vl for things other than simple masking. To give one example, "speculative" loads (which silence some exceptions to safely permit vectorization of some loops with data-dependent exits, such as strlen) can shrink vl as a side effect. I believe this can be handled by modelling all relevant operations (including setvl itself) as intrinsics that have side effects or read/write inaccessible memory. However, if you want to have the "current" vl (or equivalent mask) around as SSA value, you need to "reload" it after any operation that updates vl. That seems like it could get a bit complex if you want to do it efficiently (in the limit, it seems equivalent to SSA construction).

I think modeling the vector length as state isn't as bad as it may sound at first. In fact, how about modeling the "hard" vector length as a thread_local global variable? That way there is exactly one valid vector length value at every point (defined by the value of the thread_local global variable of that exact name). There is no need for a "demanded vlen" analysis: the global variable yields the value immediately. The RISC-V backend can map the global directly to the vlen register. If a target does not support a re-configurable vector length (SVE), it is safe to run SSA construction during legalization and use explicit predication instead. You'd perform SSA construction only at the backend/legalization phase.

Vice versa, coming from IR targeted at LLVM SVE you can go the other way, run a demanded vlen analysis, and encode it explicitly in the program. vlen changes are expensive and should be rare anyway.

``
; explicit vlen_state modelling in RV could look like this:

@vlen_state = thread_local global token ; this gives AA a fixed point to constrain vlen-dependent operations

llvm.vla.setvl(i32 %n)      ; implicitly writes-only %vlen_state
i32 llvm.vla.getvl()        ; implicitly reads-only %vlen_state

llvm.vla.fadd.f64(f64, f64) ; implicitly reads-only %vlen_state
llvm.vla.fdiv.f64(f64, f64) ; .. same

; this implements the "speculative" load mentioned in the quote above
; (writes %vlen_state. I suppose it also reads it first?)
<scalable 1 x f64> llvm.riscv.probe.f64(%ptr)
``

By relying on memory dependence, this also implies that arithmetic operations can be re-ordered freely as long as vlen_state does not change between them (SLP, "loop mix (CGO16)", ..).

Regarding function calls: if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.

Last but not least, thank you all for working on this! I am really looking forward to playing around with VLA architectures in LLVM.

Regards,
Simon

On 07/02/2018 11:53 AM, Graham Hunter via llvm-dev wrote:
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> ============================================================
> Supporting SIMD instruction sets with variable vector lengths
> ============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our approach is backwards compatible and should be as non-intrusive as possible; the only change needed in other backends is how size is queried on vector types, and it only requires a change in which function is called. We have created a set of proof-of-concept patches to represent a simple vectorized loop in IR and generate SVE instructions from that IR. These patches (listed in section 7 of this RFC) can be found on Phabricator and are intended to illustrate the scope of changes required by the general approach described in this RFC.
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for AArch64 which is intended to scale with hardware such that the same binary running on a processor with longer vector registers can take advantage of the increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which the LLVM vectorizer generates code requires modifications such that certain values are now runtime-evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries -- how to reason about the size of types for which the size isn't fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined number of elements.
>
> 5. An explanation of splitting/concatenating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions for a loop that has already been vectorized using the extensions described in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been added to the `VectorType` class, which indicates that the number of elements in the vector is a runtime-determined integer multiple of the `NumElements` field. Most code that deals with vectors doesn't need to know the exact length, but does need to know relative lengths -- e.g. get a vector with the same number of elements but a different element type, or with half or double the number of elements.
>
> In order to allow code to transparently support scalable vectors, we introduce an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be equal to one and thus `Min` represents the exact number of elements in the vector.
>
> The intent for code working with vectors is to use convenience methods and avoid directly dealing with the number of elements. If needed, calling `getElementCount` on a vector type instead of `getVectorNumElements` can be used to obtain the (potentially scalable) number of elements. Overloaded division and multiplication operators allow an ElementCount instance to be used in much the same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allows us to reason about the relationship between different scalable vector types without knowing their exact length.
>
> The runtime multiple is not expected to change during program execution for SVE, but it is possible. The model of scalable vectors presented in this RFC assumes that the multiple will be constant within a function but not necessarily across functions.
> As suggested in the recent RISC-V RFC, a new function attribute to inherit the multiple across function calls will allow for function calls with vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of elements, and the string literal `scalable` indicates that the total number of elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice for indicating that the vector is scalable, and could be substituted by another. For fixed-length vectors, the `scalable` is omitted, so there is no change in the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and the same number of bytes if `Min * sizeof(type)` is the same (assuming they are used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the type record. If the field is not present the type will default to a fixed-length vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that work with vectors to recognize the new type, or to duplicate all that code in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of providing support for C-level intrinsics instead of being used and recognized by passes inside llvm.
>
> Although our current solution will need to change some of the code that creates new VectorTypes, much of that code doesn't need to care about whether the types are scalable or not -- they can use preexisting methods like `getHalfElementsVectorType`. If the code is a little more complex, `ElementCount` structs can be used instead of an `unsigned` value to represent the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for analysis of IR. While it has not been implemented in full, the general approach works well for calculating offsets into structures with scalable types in a modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single integer constant. For scalable types a second integer is needed to indicate the number of bytes/bits which need to be scaled by the runtime multiple to obtain the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today, except that it will no longer return a size for vector types (it will return 0, as it does for other derived types). The majority of calls to this function are already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which returns a pair of integers (one to indicate unscaled bits, the other for bits that need to be scaled by the runtime multiple). For backends that do not need to deal with scalable types the existing methods will suffice, but a debug-only assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and Y are non-zero integers and the form is of { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across functions that inherit vector length. Cannot be compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated; need to subtract common terms and try the above comparisons. It may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled and unscaled sizes) to occur. Richard Sandiford's proposed C extensions (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly prohibit mixing fixed-size types into sizeless structs.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled vs. unscaled; I believe the gcc implementation of SVE allows for such results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to check whether they are being used to copy scalable vectors across function boundaries without the inherit vlen attribute, and to raise an error there instead of requiring that the Function a type size comes from be passed in for each comparison. If there's a strong preference for moving the check to the size comparison function, let me know; I will be starting work on patches for this later in the year if there are no major problems with the idea.
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the existing logic for alias detection won't work when multiple accesses share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length can be determined at runtime, so after initial support has been added we can work on vectorizing loops using runtime predication to avoid aliasing problems.
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of llvm dealing with loads and stores assert that 'isSized()' returns true and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of using direct size queries, but that doesn't cover all cases. It may still be a good idea to introduce them to make the purpose in a given case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime length in IR in order to generate addresses for consecutive vectors in memory and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime multiple. Multiplying the result of this intrinsic by the minimum number of elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
>
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
>
> ===========================
> 4. Generating Vector Values
> ===========================
>
> For constant vector values, we cannot specify all the elements as we can for fixed-length vectors; fortunately only a small number of easily synthesized patterns are required for autovectorization. The `zeroinitializer` constant can be used in the same manner as fixed-length vectors for a constant zero splat. This can then be combined with `insertelement` and `shufflevector` to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector` intrinsic has been added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
>
> ``
> ;; Splat a value
> %insert = insertelement <4 x i32> undef, i32 %value, i32 0
> %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
> ;; Add a constant sequence
> %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
>
> Scalable Equivalent
> -------------------
>
> ``
> ;; Splat a value
> %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> ;; Splat offset + stride (the same in this case)
> %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
> %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> ;; Create sequence for scalable vector
> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
> %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
> %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
> ;; Add the runtime-generated sequence
> %add = add <scalable 4 x i32> %splat, %addoffset
> ``
>
> Future Work
> -----------
> Intrinsics cannot currently be used for constant folding. Our downstream compiler (using Constants instead of intrinsics) relies quite heavily on this for good code generation, so we will need to find new ways to recognize and fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as for fixed-length vectors, but with a non-constant mask for the shufflevector.
>
> The following is an example of splitting a <scalable 4 x double> into two separate <scalable 2 x double> values.
>
> ``
> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> ;; Stepvector generates the element ids for first subvector
> %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
> ;; Add vscale * 2 to get the starting element for the second subvector
> %ec = mul i64 %vscale64, 2
> %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
> %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
> %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
> ;; Perform the extracts
> %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
> %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> ``
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts at the SelectionDAG level, while splatvector maps onto the existing AArch64ISD::DUP.
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add scalable vector support to the LowLevelType implementation. A single bit was added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in the enumeration of available types for generated code. While this may not be necessary in future, generating an all-true 'ptrue' value was necessary to convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to the array elements matching that index. The IR shows how vscale and stepvector are used to create the needed values and to advance the index variable in the loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   br label %vector.body
>
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>   ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>   ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>   ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
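To make the predicate-reduction idiom from the first section of Simon's mail concrete, here is a sketch (an editorial addition, not from the thread) of how an `or` reduction tames a divergent loop on a fixed-width mask; the scalable variants would look analogous once the VLA types exist. The intrinsic name mangling follows the LLVM 7-era experimental scheme and may differ in other versions:

``
declare i1 @llvm.experimental.vector.reduce.or.i1.v4i1(<4 x i1>)

loop:
  %mask = phi <4 x i1> [ %entry.mask, %entry ], [ %next.mask, %loop ]
  ; ... predicated vector body computes a per-lane continue condition %cont ...
  %next.mask = and <4 x i1> %mask, %cont
  ; "any lane still active?" -- or-reduce the predicate
  %any = call i1 @llvm.experimental.vector.reduce.or.i1.v4i1(<4 x i1> %next.mask)
  br i1 %any, label %loop, label %exit
``

Without such an intrinsic, the same test would need a log2(width) shuffle-and-or tree, which cannot even be enumerated when the width is unknown at compile time -- hence the request to support these generic reductions in the VLA backends from the start.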
Graham Hunter via llvm-dev
2018-Jul-04 13:13 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi Simon,

Replies inline.

-Graham

> I am the main author of RV, the Region Vectorizer (github.com/cdl-saarland/rv). I want to share our standpoint as potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).
>
> -- support for `llvm.experimental.vector.reduce.*` intrinsics --
>
> RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target-specific ones or go through the painful process of building VLA-style reduction trees with loops or the like.

The vector reduction intrinsics were originally created to support SVE in order to avoid loops, so we'll definitely be using them.

> -- setting the vector length (MVL) --
>
> I really like the idea of the `inherits_vlen` attribute. Absence of this attribute in a callee means we can safely stop tracking the vector length across the call boundary.
>
> However, I think there are some issues with the `vlen token` approach.
>
> * Why do you need an explicit vlen token if there is a 1:1 (or 1:0) correspondence between functions and vlen tokens?

I think there's a bit of a mix-up here... my proposal doesn't feature tokens. Robin's proposal earlier in the year did, but I think we've reached a consensus that they aren't necessary.

We do need to decide where to place the appropriate checks for which function an instruction is from before allowing copying:

1. Solely within the passes that perform cross-function optimizations. Lightweight, but easier to get wrong.

2. Within generic methods that insert instructions into blocks. Probably more code changes than method 1. May run into problems if an instruction is cloned first (and therefore has no parent to check -- looking at operands/uses may suffice, though).

3. Within size queries. Probably insufficient in places where entire blocks are copied without looking at the types of each individual instruction, and also suffers from problems when cloning instructions.

My current idea is to proceed with option 2 with some additional checks where needed.

> * My main concern is that you are navigating towards a local optimum here. All is well as long as there is only one vector length per function. However, if the architecture supports changing the vector length at any point but you explicitly forbid it, programmers will complain; well, I will for one ;-) Once you give in to that demand, you are facing the situation that multiple vector length tokens are live within the same function. This means you have to stop transformations from mixing vector operations with different vector lengths: these would otherwise incur an expensive state change at every vlen transition. However, there is no natural way to express that two SSA values (vlen tokens) must not be live at the same program point.

So I think we've agreed that the notion of vscale inside a function is consistent, so that all size comparisons and stack allocations will use the maximum size for that function. However, use of setvl or predication changes the effective length inside the function. This is already the case for masked loads and stores -- although an AVX512 vector is 512 bits in size, a different amount of data can be transferred to/from memory.

Robin will be working on the best way to represent setvl, whereas SVE will just use <scalable n x i1> predicate vectors to control length.

> On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
>> There are some operations that use vl for things other than simple masking. To give one example, "speculative" loads (which silence some exceptions to safely permit vectorization of some loops with data-dependent exits, such as strlen) can shrink vl as a side effect.
>> I believe this can be handled by modelling all relevant operations (including setvl itself) as intrinsics that have side effects or read/write inaccessible memory. However, if you want to have the "current" vl (or equivalent mask) around as SSA value, you need to "reload" it after any operation that updates vl. That seems like it could get a bit complex if you want to do it efficiently (in the limit, it seems equivalent to SSA construction).
>
> I think modeling the vector length as state isn't as bad as it may sound at first. In fact, how about modeling the "hard" vector length as a thread_local global variable? That way there is exactly one valid vector length value at every point (defined by the value of the thread_local global variable of that exact name). There is no need for a "demanded vlen" analysis: the global variable yields the value immediately. The RISC-V backend can map the global directly to the vlen register. If a target does not support a re-configurable vector length (SVE), it is safe to run SSA construction during legalization and use explicit predication instead. You'd perform SSA construction only at the backend/legalization phase.
>
> Vice versa, coming from IR targeted at LLVM SVE you can go the other way, run a demanded vlen analysis, and encode it explicitly in the program. vlen changes are expensive and should be rare anyway.

This was in response to my suggestion to model setvl with predicates; I've withdrawn the idea. The vscale intrinsic is enough to represent 'maxvl', and based on the IR samples I've seen for RVV, a setvl intrinsic would return the dynamic length in order to correctly update offset/induction variables.

> ; explicit vlen_state modelling in RV could look like this:
>
> @vlen_state = thread_local global token ; this gives AA a fixed point to constrain vlen-dependent operations
>
> llvm.vla.setvl(i32 %n)      ; implicitly writes-only %vlen_state
> i32 llvm.vla.getvl()        ; implicitly reads-only %vlen_state
>
> llvm.vla.fadd.f64(f64, f64) ; implicitly reads-only %vlen_state
> llvm.vla.fdiv.f64(f64, f64) ; .. same
>
> ; this implements the "speculative" load mentioned in the quote above (writes %vlen_state. I suppose it also reads it first?)
> <scalable 1 x f64> llvm.riscv.probe.f64(%ptr)

Having separate getvl and setvl intrinsics may work nicely, but I'll leave that to Robin to decide.

> By relying on memory dependence, this also implies that arithmetic operations can be re-ordered freely as long as vlen_state does not change between them (SLP, "loop mix (CGO16)", ..).
>
> Regarding function calls: if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.

I got the impression that the RVV team wanted to be able to reconfigure registers (and therefore potentially change max vector length/number of available registers) for each function; if a call to a function is required from inside a vectorized loop then I think maxvl/vscale has to match and the callee must not reconfigure registers. I suspect there will be a complicated cost model to decide whether to change configuration or stick with a default of all registers enabled.

> Last but not least, thank you all for working on this!
> I am really looking forward to playing around with vla architectures in LLVM.

Glad to hear it; there is an SVE emulator [1] available, so once we've managed to get some code committed you'll be able to try some of this out, at least on one of the architectures.

[1] https://developer.arm.com/products/software-development-tools/hpc/arm-instruction-emulator
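[Editorial note: to make the "active vector length" semantics discussed in this exchange concrete, here is a small scalar C++ model of a RISC-V-style strip-mined loop. This is only an illustrative sketch: `MVL`, `setvl`, and `vadd` are invented names, and the inner lane loop stands in for a single vector instruction executing under the current vl.]

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hardware maximum vector length (MVL); the value is an
// arbitrary stand-in chosen for illustration.
constexpr std::size_t MVL = 8;

// Model of setvl: returns min(remaining, MVL), mirroring the semantics
// discussed in the thread.
std::size_t setvl(std::size_t remaining) {
  return std::min(remaining, MVL);
}

// Strip-mined vector add: each pass processes `vl` lanes, and the final
// pass naturally shrinks vl instead of needing a separate scalar tail.
void vadd(const std::vector<int>& a, const std::vector<int>& b,
          std::vector<int>& c) {
  std::size_t i = 0, n = a.size();
  while (i < n) {
    std::size_t vl = setvl(n - i);                 // active vector length
    for (std::size_t lane = 0; lane < vl; ++lane)  // one "vector op"
      c[i + lane] = a[i + lane] + b[i + lane];
    i += vl;                                       // advance by vl
  }
}
```

Note how the induction variable advances by the value setvl returned, which is exactly why a setvl-style intrinsic needs to return the dynamic length.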
Robin Kruppe via llvm-dev
2018-Jul-07 16:23 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Hi Simon,

to add to what Graham said...

On 2 July 2018 at 17:08, Simon Moll via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> i am the main author of RV, the Region Vectorizer (github.com/cdl-saarland/rv). I want to share our standpoint as potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).

Thanks, this perspective is very useful, just as our discussion at EuroLLVM was.

> -- support for `llvm.experimental.vector.reduce.*` intrinsics --
>
> RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target specific ones or go through the painful process of VLA-style reduction trees with loops or the like.
>
> -- setting the vector length (MVL) --
>
> I really like the idea of the `inherits_vlen` attribute. Absence of this attribute in a callee means we can safely stop tracking the vector length across the call boundary.
>
> However, i think there are some issues with the `vlen token` approach.
>
> * Why do you need an explicit vlen token if there is a 1 : 1-0 correspondence between functions and vlen tokens?
>
> * My main concern is that you are navigating towards a local optimum here. All is well as long as there is only one vector length per function. However, if the architecture supports changing the vector length at any point but you explicitly forbid it, programmers will complain, well, i will for one ;-) Once you give in to that demand you are facing the situation that multiple vector length tokens are live within the same function.
> This means you have to stop transformations from mixing vector operations with different vector lengths: these would otherwise incur an expensive state change at every vlen transition. However, there is no natural way to express that two SSA values (vlen tokens) must not be live at the same program point.

Can you elaborate on what use cases you have in mind for this? I'm very curious because I'm only aware of one case where changing vscale at a specific point is desirable: when you have two independent code regions (e.g., separate loop nests with no vector values flowing between them) that are substantially different in their demands from the vector unit. And even in that case, you only need a way to give the backend the freedom to change it if beneficial, not to exert actual control over the size of the vector registers [1]. As mentioned in my RFC in April, I believe we can still support that case reasonably well with an optimization pass in the backend (operating on MIR, i.e., no IR changes).

Everything else I know of that falls under "changing vector lengths" is better served by predication or RISC-V's "active vector length" (vl) register. Even tricks for running code that is intended for packed SIMD of a particular width on top of the variable-vector-length RISC-V ISA only need to fiddle with the active vector length! And to be clear, the active vector length is a completely separate mechanism from the concept that's called vscale in this RFC, vlen in my previous RFC, and MVL in the RISC-V ISA.

The restriction of 1 function : 1 vscale is not one that was adopted lightly. In some sense, yes, it's just a local optimum and one can think about IR designs that don't couple vscale to function boundaries. However, there are multiple factors that make it challenging to do in LLVM IR, and challenging to implement in the backend too (some of them outlined in my RFC from April).
After tinkering with the problem for months, I'm fairly certain that multiple vscales in one function are several orders of magnitude more complex and difficult to add to LLVM IR (some different IRs would work better), so I'd really like to understand any and all reasons programmers might have for wanting to change vscale, and hopefully find ways to support what they want to do without opening this pandora's box.

[1] Software has extremely little control over vscale/MVL/etc. anyway: on packed SIMD machines vscale = 1 is hardwired, on RISC-V you can only change the vector unit configuration and thereby affect the MVL in very indirect and uarch-dependent ways, and on SVE it's implementation-defined which multiples of 128 bit are supported.

> On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
>> There are some operations that use vl for things other than simple masking. To give one example, "speculative" loads (which silence some exceptions to safely permit vectorization of some loops with data-dependent exits, such as strlen) can shrink vl as a side effect. I believe this can be handled by modelling all relevant operations (including setvl itself) as intrinsics that have side effects or read/write inaccessible memory. However, if you want to have the "current" vl (or equivalent mask) around as SSA value, you need to "reload" it after any operation that updates vl. That seems like it could get a bit complex if you want to do it efficiently (in the limit, it seems equivalent to SSA construction).
>
> I think modeling the vector length as state isn't as bad as it may sound at first.
> In fact, how about modeling the "hard" vector length as a

The term "hard" vector length here makes me suspect you might be mixing up the physical size of vector registers, which is derived from vscale in IR terms or the vector unit configuration in the RISC-V ISA, with the _active_ vector length, which is effectively just encoding a particular kind of predication to limit processing to a subset of lanes in the physical registers. Since everything below makes more sense when read as referring to the active vector length, I will assume you meant that, but just to be sure could you please clarify?

> thread_local global variable? That way there is exactly one valid vector length value at every point (defined by the value of the thread_local global variable of that name). There is no need for a "demanded vlen" analysis: the global variable yields the value immediately. The RISC-V backend can map the global directly to the vlen register. If a target does not support a re-configurable vector length (SVE), it is safe to run SSA construction during legalization and use explicit predication instead. You'd perform SSA construction only at the backend/legalization phase.
>
> Vice versa, coming from IR targeted at LLVM SVE, you can go the other way, run a demanded vlen analysis, and encode it explicitly in the program. vlen changes are expensive and should be rare anyway.

For the active vector length, yes, modelling the architectural state as memory read and written by intrinsics works fine and is in fact roughly what I'm currently doing. Globals can't be of type token, but I use the existing "reads/writes hidden memory" flag on intrinsics instead of globals anyway (which increases AA precision, and doesn't require special-casing these intrinsics in AA code). However, these intrinsics also have some downsides that might make different solutions better in the long run.
For example, the artificial memory accesses block some optimizations that don't bother reasoning about memory in detail, and many operations controlled by the vector length are so similar to the existing add, mul, etc. instructions that we'd duplicate the vast majority of optimizations that apply to those instructions (if we want equal optimization power, that is).

> ; explicit vlen_state modelling in RV could look like this:
>
> @vlen_state = thread_local global token ; this gives AA a fixed point to constrain vlen-dependent operations
>
> llvm.vla.setvl(i32 %n)  ; implicitly writes-only %vlen_state
> i32 llvm.vla.getvl()    ; implicitly reads-only %vlen_state
>
> llvm.vla.fadd.f64(f64, f64) ; implicitly reads-only %vlen_state
> llvm.vla.fdiv.f64(f64, f64) ; .. same
>
> ; this implements the "speculative" load mentioned in the quote above (writes %vlen_state. I suppose it also reads it first?)
> <scalable 1 x f64> llvm.riscv.probe.f64(%ptr)
>
> By relying on memory dependence, this also implies that arithmetic operations can be re-ordered freely as long as vlen_state does not change between them (SLP, "loop mix (CGO16)", ..).
>
> Regarding function calls, if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.

As Graham said, on RISC-V a vector function call means that the caller needs to configure the vector unit in a particular way (determined by the callee's ABI), and the callee runs with that configuration. (And the configuration determines the hardware vector length / vscale.) This is a backend concern, except the backend needs to know whether a function expects to be called in this way, or whether it can and needs to pick a configuration for itself and set it up on function entry.
That's what motivates this attribute. In terms of IR semantics, I would say vscale is *unspecified* on entry into a function without the inherits_vscale (let's rename it to fit this RFC's terminology) attribute. That means it's a program error to assume you get the caller's vscale -- in scenarios where that's what you want, you need to add the attribute everywhere. This corresponds to the fact that in the absence of the attribute, the RISC-V backend will make up a configuration for you and you'll get whatever vscale that configuration implies, rather than the caller's.

Cheers,
Robin

> Last but not least, thank you all for working on this! I am really looking forward to playing around with vla architectures in LLVM.
>
> Regards,
>
> Simon
>
> On 07/02/2018 11:53 AM, Graham Hunter via llvm-dev wrote:
>
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> ============================================================
> Supporting SIMD instruction sets with variable vector lengths
> ============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our approach is backwards compatible and should be as non-intrusive as possible; the only change needed in other backends is how size is queried on vector types, and it only requires a change in which function is called. We have created a set of proof-of-concept patches to represent a simple vectorized loop in IR and generate SVE instructions from that IR. These patches (listed in section 7 of this RFC) can be found on Phabricator and are intended to illustrate the scope of changes required by the general approach described in this RFC.
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for AArch64 which is intended to scale with hardware such that the same binary running on a processor with longer vector registers can take advantage of the increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which the LLVM vectorizer generates code requires modifications such that certain values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries -- how to reason about the size of types for which the size isn't fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined number of elements.
>
> 5. An explanation of splitting/concatenating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions for a loop that has already been vectorized using the extensions described in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been added to the `VectorType` class, which indicates that the number of elements in the vector is a runtime-determined integer multiple of the `NumElements` field. Most code that deals with vectors doesn't need to know the exact length, but does need to know relative lengths -- e.g. get a vector with the same number of elements but a different element type, or with half or double the number of elements.
>
> In order to allow code to transparently support scalable vectors, we introduce an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be equal to one and thus `Min` represents the exact number of elements in the vector.
>
> The intent for code working with vectors is to use convenience methods and avoid directly dealing with the number of elements. If needed, calling `getElementCount` on a vector type instead of `getVectorNumElements` can be used to obtain the (potentially scalable) number of elements. Overloaded division and multiplication operators allow an ElementCount instance to be used in much the same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allows us to reason about the relationship between different scalable vector types without knowing their exact length.
>
> The runtime multiple is not expected to change during program execution for SVE, but it is possible. The model of scalable vectors presented in this RFC assumes that the multiple will be constant within a function but not necessarily across functions.
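[Editorial note: the `ElementCount` idea above can be sketched in a few lines of C++. This is a simplified model for illustration only; the class proposed for LLVM has a richer interface. It just shows `Min`/`Scalable` plus the overloaded multiplication and division operators the RFC mentions.]

```cpp
#include <cassert>

// Minimal sketch of the ElementCount concept described in the RFC: a
// minimum element count plus a flag saying whether the real count is an
// unknown runtime multiple of that minimum.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // true if the real count is an unknown multiple of Min

  // Scaling the element count preserves scalability, which is what lets
  // code halve/double vector types without knowing the exact length.
  ElementCount operator*(unsigned RHS) const { return {Min * RHS, Scalable}; }
  ElementCount operator/(unsigned RHS) const { return {Min / RHS, Scalable}; }
  bool operator==(const ElementCount& RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};
```

For example, halving `<scalable 4 x i32>` yields `<scalable 2 x i32>`: `ElementCount{4, true} / 2 == ElementCount{2, true}`.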
> As suggested in the recent RISC-V RFC, a new function attribute to inherit the multiple across function calls will allow for function calls with vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of elements, and the string literal `scalable` indicates that the total number of elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice for indicating that the vector is scalable, and could be substituted by another. For fixed-length vectors, the `scalable` is omitted, so there is no change in the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and the same number of bytes if `Min * sizeof(type)` is the same (assuming they are used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the type record. If the field is not present the type will default to a fixed-length vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that work with vectors to recognize the new type, or to duplicate all that code in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of providing support for C-level intrinsics instead of being used and recognized by passes inside llvm.
> Although our current solution will need to change some of the code that creates new VectorTypes, much of that code doesn't need to care about whether the types are scalable or not -- they can use preexisting methods like `getHalfElementsVectorType`. If the code is a little more complex, `ElementCount` structs can be used instead of an `unsigned` value to represent the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for analysis of IR. While it has not been implemented in full, the general approach works well for calculating offsets into structures with scalable types in a modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single integer constant. For scalable types a second integer is needed to indicate the number of bytes/bits which need to be scaled by the runtime multiple to obtain the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today, except that it will no longer return a size for vector types (it will return 0, as it does for other derived types). The majority of calls to this function are already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which returns a pair of integers (one to indicate unscaled bits, the other for bits that need to be scaled by the runtime multiple). For backends that do not need to deal with scalable types the existing methods will suffice, but a debug-only assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and Y are non-zero integers and the form is of { unscaled, scaled }.
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across functions that inherit vector length. Cannot be compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common terms and try the above comparisons; it may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled and unscaled sizes) to occur. Richard Sandiford's proposed C extensions (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly prohibit mixing fixed-size types into sizeless structs.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled vs. unscaled; I believe the gcc implementation of SVE allows for such results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to check whether they are being used to copy scalable vectors across function boundaries without the inherit vlen attribute, and raise an error there instead of requiring passing the Function a type size is from for each comparison. If there's a strong preference for moving the check to the size comparison function let me know; I will be starting work on patches for this later in the year if there are no major problems with the idea.
>
> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the existing logic for alias detection won't work when multiple accesses share a common base pointer with different offsets.
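[Editorial note: one possible reading of the comparison rules above, assuming only that the runtime multiple is at least 1, can be sketched as a small C++ helper. `SizePair` and `knownLessThan` are invented names, and `std::nullopt` plays the role of the 'maybe/unknown' result the RFC discusses.]

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Invented-for-illustration model of the { unscaled, scaled } size pairs.
struct SizePair {
  unsigned Unscaled; // bits independent of the runtime multiple
  unsigned Scaled;   // bits multiplied by the runtime multiple (>= 1)
};

// Returns true/false when the comparison is decidable at compile time,
// std::nullopt when the answer depends on the runtime multiple.
std::optional<bool> knownLessThan(SizePair X, SizePair Y) {
  // Subtract common terms first, as the RFC suggests for the mixed case.
  unsigned U = std::min(X.Unscaled, Y.Unscaled);
  unsigned S = std::min(X.Scaled, Y.Scaled);
  X.Unscaled -= U; Y.Unscaled -= U;
  X.Scaled -= S;   Y.Scaled -= S;
  if (X.Scaled == 0 && Y.Scaled == 0)      // { X, 0 } <cmp> { Y, 0 }
    return X.Unscaled < Y.Unscaled;
  if (X.Unscaled == 0 && Y.Unscaled == 0)  // { 0, X } <cmp> { 0, Y }
    return X.Scaled < Y.Scaled;
  // { X, 0 } < { 0, Y }: definitely true when X < Y, since the scaled
  // side is at least Y bits; otherwise it depends on the multiple.
  if (X.Scaled == 0 && Y.Unscaled == 0)
    return X.Unscaled < Y.Scaled ? std::optional<bool>(true) : std::nullopt;
  // { 0, X } < { Y, 0 }: definitely false when X >= Y.
  if (X.Unscaled == 0 && Y.Scaled == 0 && X.Scaled >= Y.Unscaled)
    return false;
  return std::nullopt; // remaining mixed terms: no good answer in general
}
```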
> However, SVE's predication will mean that a dynamic 'safe' vector length can be determined at runtime, so after initial support has been added we can work on vectorizing loops using runtime predication to avoid aliasing problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of llvm dealing with loads and stores assert that 'isSized()' returns true and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of using direct size queries, but that doesn't cover all cases. It may still be a good idea to introduce them to make the purpose in a given case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime length in IR in order to generate addresses for consecutive vectors in memory and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime multiple. Multiplying the result of this intrinsic by the minimum number of elements in a vector gives the total number of elements in a scalable vector.
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
>
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
>
> ===========================
> 4. Generating Vector Values
> ===========================
>
> For constant vector values, we cannot specify all the elements as we can for fixed-length vectors; fortunately only a small number of easily synthesized patterns are required for autovectorization. The `zeroinitializer` constant can be used in the same manner as fixed-length vectors for a constant zero splat. This can then be combined with `insertelement` and `shufflevector` to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector` intrinsic has been added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new start can be added, and changing the step requires multiplying by a splat.
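[Editorial note: the splat-and-stepvector recipe above can be modelled in scalar C++ to check the arithmetic. This is only a sketch: `stepvector` and `sequence` are invented names, `n` stands in for the runtime element count, and the loops stand in for the splat, mul, and add vector operations described in the text.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of the stepvector intrinsic: <0, 1, 2, ..., n-1>.
std::vector<int> stepvector(std::size_t n) {
  std::vector<int> v(n);
  for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<int>(i);
  return v;
}

// To build <start, start+step, start+2*step, ...>, multiply the
// stepvector by a splat of `step`, then add a splat of `start`.
std::vector<int> sequence(int start, int step, std::size_t n) {
  std::vector<int> v = stepvector(n);
  for (int& e : v) e = start + e * step;  // splat-mul, then splat-add
  return v;
}
```

With `start = 2` and `step = 2` this reproduces the `<2, 4, 6, 8>` sequence from the fixed-length example.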
> Fixed-Length Code
> -----------------
>
> ``
> ;; Splat a value
> %insert = insertelement <4 x i32> undef, i32 %value, i32 0
> %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
> ;; Add a constant sequence
> %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
>
> Scalable Equivalent
> -------------------
>
> ``
> ;; Splat a value
> %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
> %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> ;; Splat offset + stride (the same in this case)
> %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
> %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> ;; Create sequence for scalable vector
> %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
> %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
> %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
> ;; Add the runtime-generated sequence
> %add = add <scalable 4 x i32> %splat, %addoffset
> ``
>
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream compiler (using Constants instead of intrinsics) relies quite heavily on this for good code generation, so we will need to find new ways to recognize and fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as for fixed-length vectors, but with a non-constant mask for the shufflevector.
>
> The following is an example of splitting a <scalable 4 x double> into two separate <scalable 2 x double> values.
> ``
> %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> ;; Stepvector generates the element ids for first subvector
> %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
> ;; Add vscale * 2 to get the starting element for the second subvector
> %ec = mul i64 %vscale64, 2
> %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
> %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
> %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
> ;; Perform the extracts
> %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
> %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> ``
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts at the SelectionDAG level, while splatvector maps onto the existing AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add scalable vector support to the LowLevelType implementation. A single bit was added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in the enumeration of available types for generated code. While this may not be necessary in future, generating an all-true 'ptrue' value was necessary to convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to the array elements matching that index.
> The IR shows how vscale and stepvector are used to create the needed values and to advance the index variable in the loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   br label %vector.body
>
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>   ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ], [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>   ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>   ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> --
>
> Simon Moll
> Researcher / PhD Student
>
> Compiler Design Lab (Prof. Hack)
> Saarland University, Computer Science
> Building E1.3, Room 4.31
>
> Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
> Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll
David A. Greene via llvm-dev
2018-Jul-09 15:01 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Robin Kruppe <robin.kruppe at gmail.com> writes:

> Everything else I know of that falls under "changing vector lengths" is better served by predication or RISC-V's "active vector length" (vl) register.

Agreed. A "vl" register is slightly more efficient in some cases because forming predicates can be bothersome.

I also want to caution about predication in LLVM IR. The way it's done now is, I think, not quite kosher. We use select to represent a predicated operation, but select says nothing about suppressing the evaluation of either input. Therefore, there is nothing in the IR to prevent code motion of Values outside the select. Indeed, I ran into this very problem a couple of months ago, where a legitimate (according to the IR) code motion resulted in wrong answers in vectorized code because what was supposed to be predicated was not. We had to disable the transformation to get things working.

Another consequence of this setup is that we need special intrinsics to convey evaluation requirements. We have masked load/store/gather/scatter intrinsics and will be getting masked floating-point intrinsics (or something like them).

Years ago we had some discussion about how to represent predication as a first-class IR construct but at the time it was considered too difficult. With more and more architectures turning to predication for performance, perhaps it's time to revisit that conversation.

-David
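[Editorial note: David's point can be illustrated with a hypothetical scalar model. A select-style formulation like `select(mask, a / b, a)` evaluates `a / b` on every lane before selecting, so a masked-off lane with a zero divisor still faults (and nothing in the IR prevents the division from being hoisted out of the select). The sketch below, with invented names and data, shows only the safe, genuinely predicated form; the unsafe select form is described in the comment.]

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 4;

// Predicated division: the divide is only *evaluated* where the mask is
// set, so a zero divisor under a false mask bit is harmless. The
// select-based IR pattern would instead compute a[i] / b[i] for every
// lane first and only then pick between the two inputs -- nothing stops
// the division from executing (or being hoisted) on masked-off lanes.
std::array<int, N> maskedDiv(const std::array<int, N>& a,
                             const std::array<int, N>& b,
                             const std::array<bool, N>& mask) {
  std::array<int, N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = mask[i] ? a[i] / b[i] : a[i];  // inactive lanes pass a[i] through
  return r;
}
```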