Luke Kenneth Casson Leighton via llvm-dev
2019-Oct-01 08:21 UTC
[llvm-dev] Adding support for vscale
On Tue, Oct 1, 2019 at 8:08 AM Robin Kruppe <robin.kruppe at gmail.com> wrote:
>
> Hello Jacob and Luke,
>
> First off, even if a dynamically changing vscale was truly necessary
> for RVV or SV, this thread would be far too late to raise the question.
> That vscale is constant -- that the number of elements in a scalable
> vector does not change during program execution -- is baked into the
> accepted scalable vector type proposal from top to bottom and in fact
> was one of the conditions for its acceptance...

that should be explicitly made clear in the patches. it sounds very
much like it's only suitable for statically-allocated
arrays-of-vectorisable-types:

    typedef float vec4[4];          // SEW=32,LMUL=4 probably
    static vec4 globalvec[1024];    // vscale == 1024 here

or, would it be intended for use inside functions - again
statically-allocated?

    int somefn(void) {
        static vec4 localvec[1024];   // vscale == 1024 here
    }

*or*, would it be intended to be used like this?

    int somefn(int num_of_vec4s) {
        vec4 localvec[num_of_vec4s];  // vscale == dynamic, here
    }

clarifying this in the documentation strings on vscale, perhaps even
providing c-style examples, would be extremely useful, and avoid
misunderstandings.

> ... (runtime-variable type
> sizes create many more headaches which nobody has worked out
> how to solve to a satisfactory degree in the context of LLVM).

hmmmm. so it looks like data-dependent fail-on-first is something
that's going to come up later, rather than right now.

> *This* thread is just about whether vscale should be exposed to programs
> in the form of a Constant or as an intrinsic which always returns the same
> value during one program execution.
>
> Luckily, this is not a problem for RVV. I do not know anything about this
> "SV" extension you are working on

SV has been designed specifically to help with the creation of
*Hybrid* CPU / VPU / GPUs. it's very similar to RVV except that there
are no new instructions added.

a typical GPU would be happy to have 128-bit-wide SIMD or VLIW-style
instructions, on the basis that (A) the shader programs are usually no
greater than 1K in size and (B) those 128-bit-wide instructions have
an extremely high bang-per-buck ratio, of 32x FP32 operations issued
at once.

in a *hybrid* CPU - VPU - GPU context, even a 1k shader program hits a
significant portion of the 1st level cache, which is *not* separate
from a *GPU*'s 1st level cache, because the CPU *is* the GPU.

consequently, SV has been specifically designed to "compactify"
instruction effectiveness by "prefixing" even RVC 16-bit opcodes with
vectorisation "tags".
this has the side-effect of reducing executable size by over 10% in
many cases when compared to RVV.

> so I cannot comment on that, but I'll sketch the reasons for why it's not
> an issue with RVV and maybe that helps you with SV too.

looks like it does: Jacob explains (in another reply) that MVL is
exactly the same concept, except that in RVV it is hard-coded (baked)
into the hardware, where in SV it is explicitly set as a CSR, and i
explained in the previous reply that in RVV the VL CSR is requested
(and the hardware chooses a value), whereas in SV, the VL CSR *must*
be set to exactly what is requested [within the bounds of MVL, sorry,
left that out earlier].

> As mentioned above, this is tangential to the focus of this thread, so if
> you want to discuss further I'd prefer you do that in a new thread.

it's not yet clear whether vscale is intended for use in
static-allocation involving fixed constants or whether it's intended
for use with runtime-dependent variables inside functions. with that
not being clear, my questions are not tangential to the focus of the
thread.

however yes i would agree that data-dependent fail-on-first is
definitely not the focus of this thread, and would need to be
discussed later.

we are a very small team at the moment, we may end up missing
valuable discussions: how can it be ensured that we are included in
future discussions?

> [...]
> You may be aware of Simon Moll's vector predication (previously:
> explicit vector length) proposal which does just that.

ah yehyehyeh. i remember.

> In contrast, the vscale concept is more about how many elements a
> vector register contains, regardless of whether some operations process
> only a subset of them.

ok so this *might* be answering my question about vscale being
relate-able to a function parameter (the latter of the c examples), it
would be good to clarify.

> In RVV terms that means it's related not to VL but more to VBITS,
> which is indeed a constant (and has been for many months).

ok so VL is definitely "assembly-level" rather than something that
actually is exposed to the intrinsics. that may turn out to be a
mistake when it comes to data-dependent fail-on-first capability
(which is present in a *DIFFERENT* form in ARM SVE, btw), but would,
yes, need discussion separately.

> For example <vscale x 4 x i16> has four times as many elements and
> twice as many bits as <vscale x 1 x i32>, so it captures the distinction
> between a SEW=16,LMUL=2 vtype setting and a SEW=32,LMUL=1
> vtype setting.

hang on - so this may seem like a silly question: is it intended that
the *word* vscale would actually appear in LLVM-IR i.e. it is a new
compiler "keyword"? or did you use it here in the context of just "an
example", where actually the idea is that actual value would be
<5 x 4 x i16> or <5 x 1 x i32>?

let me re-read the summary: "This patch adds vscale as a symbolic
constant to the IR, similar to undef and zeroinitializer, so that it
can be used in constant expressions."

it's a keyword, isn't it? so, that "vscale" keyword would be
substituted at runtime by either a constant (1024) *or* a
runtime-calculated variable or function parameter (num_of_vec4s), is
that correct?

apologies for asking: these are precisely the kinds of
from-zero-prior-knowledge questions that help with any review process
to clarify things for other users/devs.

l.
Hi Luke,

> On 1 Oct 2019, at 09:21, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> First off, even if a dynamically changing vscale was truly necessary
>> for RVV or SV, this thread would be far too late to raise the question.
>> That vscale is constant -- that the number of elements in a scalable
>> vector does not change during program execution -- is baked into the
>> accepted scalable vector type proposal from top to bottom and in fact
>> was one of the conditions for its acceptance...
>
> that should be explicitly made clear in the patches. it sounds very
> much like it's only suitable for statically-allocated
> arrays-of-vectorisable-types:
>
>     typedef float vec4[4];          // SEW=32,LMUL=4 probably
>     static vec4 globalvec[1024];    // vscale == 1024 here

'vscale' just refers to the scaling factor that gives the maximum size of
the vector at runtime, not the number of currently active elements.

SVE will be using predication alone to deal with data that doesn't fill an
entire vector, whereas RVV and SX-Aurora want to use a separate mechanism
that fits with their hardware having a changeable active length.

The scalable type tells you the maximum number of elements that could be
operated on, and individual operations can constrain that to a smaller
number of elements. The latter is what Simon Moll's proposal addresses.

>> ... (runtime-variable type
>> sizes create many more headaches which nobody has worked out
>> how to solve to a satisfactory degree in the context of LLVM).
>
> hmmmm. so it looks like data-dependent fail-on-first is something
> that's going to come up later, rather than right now.

Arm's downstream compiler has been able to use the scalable type and a
constant vscale with first-faulting loads for around 4 years, so there's
no conflict here.

We will need to figure out exactly what form the first-faulting intrinsics
take of course, as I think SVE's predication-only approach doesn't quite
fit with others -- maybe we'll end up with two intrinsics? Or maybe we'll
be able to synthesize a predicate from an active vlen and pattern match?
Something to discuss later I guess. (I'm not even sure AVX512 has a
first-faulting form, possibly just no-faulting and check the first
predicate element?)

>> As mentioned above, this is tangential to the focus of this thread, so if
>> you want to discuss further I'd prefer you do that in a new thread.
>
> it's not yet clear whether vscale is intended for use in
> static-allocation involving fixed constants or whether it's intended
> for use with runtime-dependent variables inside functions.

Runtime-dependent, though you could use C-level types and intrinsics to
try a static approach.

> ok so this *might* be answering my question about vscale being
> relate-able to a function parameter (the latter of the c examples), it
> would be good to clarify.
>
>> In RVV terms that means it's related not to VL but more to VBITS,
>> which is indeed a constant (and has been for many months).
>
> ok so VL is definitely "assembly-level" rather than something that
> actually is exposed to the intrinsics. that may turn out to be a
> mistake when it comes to data-dependent fail-on-first capability
> (which is present in a *DIFFERENT* form in ARM SVE, btw), but would,
> yes, need discussion separately.
>
>> For example <vscale x 4 x i16> has four times as many elements and
>> twice as many bits as <vscale x 1 x i32>, so it captures the distinction
>> between a SEW=16,LMUL=2 vtype setting and a SEW=32,LMUL=1
>> vtype setting.
>
> hang on - so this may seem like a silly question: is it intended that
> the *word* vscale would actually appear in LLVM-IR i.e. it is a new
> compiler "keyword"? or did you use it here in the context of just "an
> example", where actually the idea is that actual value would be <5 x 4
> x i16> or <5 x 1 x i32>?

If you're referring to the '<vscale x 4 x i32>' syntax, that's already part
of LLVM IR now (though effectively still in 'beta'). You can see a few
examples in .ll tests now, e.g. llvm/test/Bitcode/compatibility.ll

It's also documented in the langref.

Sander's patch takes the existing 'vscale' keyword and allows it to be
used outside the type, to serve as an integer constant that represents the
same runtime value as it does in the type.

Some previous discussions proposed using an intrinsic to start with for
this, and that may still happen depending on community reaction, but the
Arm hpc compiler team felt it was important to at least start a wider
discussion on this topic before proceeding. From our experience, using an
intrinsic makes it harder to work with shufflevector or get good code
generation. If someone can spot a problem with our reasoning on that
please let us know.

-Graham
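[A minimal hand-written sketch of the two forms under discussion may help:
the scalable vector type, which is already valid IR, and 'vscale' as the
integer constant proposed in Sander's patch. The spelling of the constant
form below is illustrative only, since that spelling is exactly what this
thread is reviewing and may change:]

    ; a <vscale x 4 x i32> value holds (vscale * 4) i32 elements, where
    ; vscale is an unknown but fixed-for-the-whole-execution positive integer
    define <vscale x 4 x i32> @add_nxv4i32(<vscale x 4 x i32> %a,
                                           <vscale x 4 x i32> %b) {
      %sum = add <vscale x 4 x i32> %a, %b
      ret <vscale x 4 x i32> %sum
    }

    ; proposed in the patch under discussion: 'vscale' as a symbolic integer
    ; constant, usable in expressions, e.g. to compute the element count of
    ; the type above (hypothetical spelling, subject to this review)
    define i64 @elements_in_nxv4i32() {
      %n = mul i64 vscale, 4
      ret i64 %n
    }

[With an intrinsic instead, the second function would have to call a
read-vscale intrinsic rather than fold the value into a constant
expression, which is the shufflevector / code-generation concern
mentioned above.]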
Luke Kenneth Casson Leighton via llvm-dev
2019-Oct-01 11:37 UTC
[llvm-dev] Adding support for vscale
On Tue, Oct 1, 2019 at 11:08 AM Graham Hunter <Graham.Hunter at arm.com> wrote:
>
> Hi Luke,

hi graham, thanks for responding in such an informative fashion.

>> On 1 Oct 2019, at 09:21, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>     typedef float vec4[4];          // SEW=32,LMUL=4 probably
>>     static vec4 globalvec[1024];    // vscale == 1024 here
>
> 'vscale' just refers to the scaling factor that gives the maximum size of
> the vector at runtime, not the number of currently active elements.

ok, this starts to narrow down the definition. i'm attempting to get
clarity on what it means. so, in the example above involving
globalvec, "maximum size of the vector at runtime" would be "1024"
(not involving RVV VL). and... would vscale be dynamically (but
permanently) substituted with the constant "1024", there?

and in that example i gave which was a local function, vscale would
be substituted with "local_vlen_param_len" permanently and
irrevocably at runtime?

or, is it intended to be dynamically (but permanently) substituted
with something related to RVV's *MVL* at runtime?

if it's intended to be substituted by MVL, *that* starts to make more
sense, because MVL may actually vary depending on the hardware on
which the program is being executed. smaller systems may have an MVL
of only 1 (only allowing one element of a vector to be executed at
any one time) whereas Mainframe or massively-parallel systems may
have... MVL in the hundreds.

> SVE will be using predication alone to deal with data that doesn't fill an
> entire vector, whereas RVV and SX-Aurora

[and SV! :) ]

> want to use a separate mechanism
> that fits with their hardware having a changeable active length.

okaaay, now, yes, i Get It. this is MVL (Max Vector Length) in RVV.

btw minor nitpick: it's not that "their" hardware changes, it's that
the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
multiple vendors each choosing an arbitrary MVL suited to their
customers' needs). "RVV-compliant hardware" would fit things better.

hmmm that's going to be interesting for SV, because SV specifically
permits variable MVL *at runtime*. however, just checking the spec
(don't laugh, yes i know i wrote it...) MVL is set through an
immediate. there's a way to bypass that and set it dynamically, but
it's intended for context-switching, *not* for general-purpose use.

ah wait.... argh. ok, is vscale expected to be a global constant *for
the entire application*?

note above: SV allows MVL to be set *arbitrarily*, and this is
extremely important. the reason it is important is because unlike
RVV, SV uses the actual *scalar* register files. it does *NOT* have a
separate "Vector Register File".

so if vscale was set to say 8 on a per-runtime basis, that then sets
the total number of registers *in the scalar register file* which
will be utilised for vectorisation. it becomes impossible to set
vscale to 4, which another function might have been specifically
designed to use. so what would then need to be done is: predicate out
the top 4 elements, which now comes with a performance penalty and a
whole boat-load of mess.

so, apologies: we reaaaally need vscale to be selectable on at the
very least a per-function basis. otherwise, applications would have
to set it (at runtime) to the "least inconvenient" value, wasting
"the least-inconvenient number of registers".

> The scalable type tells you the maximum number of elements that could be
> operated on,
which is related (in RVV) to MVL...

> and individual operations can constrain that to a smaller
> number of elements.

... by setting VL.

>> hmmmm. so it looks like data-dependent fail-on-first is something
>> that's going to come up later, rather than right now.
>
> Arm's downstream compiler has been able to use the scalable type and a
> constant vscale with first-faulting loads for around 4 years, so there's
> no conflict here.

ARM's SVE uses predication. the LDs that would [normally] cause
page-faults create a mask, instead, giving *only* those LDs which
"succeeded". that's then passed into standard [SIMD-based] predicated
operations, masking out operations, the most important one (for the
canonical strcpy / memcpy) being the ST.

> We will need to figure out exactly what form the first faulting intrinsics
> take of course, as I think SVE's predication-only approach doesn't quite
> fit with others -- maybe we'll end up with two intrinsics?

perhaps - as robin suggests, this is for another discussion (not
related to vscale).

or... maybe not. if vscale was permitted to be dynamically set, not
only would it suit SV's ability to set different vscales on a
per-function (or other) basis, it could be utilised by RVV, SV, and
anything else that changes VL based on data-dependent conditions, to
change the following instructions.

what i'm saying is: vscale needs to be permitted to be a variable,
not a constant.

now, ARM SVE wouldn't *use* that capability: it would hard-code it to
512/SEW/etc.etc. (or whatever), by setting it to a global constant.
follow-up LLVM-IR-morphing passes would end up generating
globally-fixed-width SVE instructions.

RVV would be able to set that vscale variable as a way to indicate
data-dependent lengths [various IR-morphing passes would carry out
the required substitutions prior to creating actual assembler].

SV would be able to likewise do that *and* start from a value for
vscale that suited each function's requirements to utilise a subset
of the register file which suited the workload.

SV could then trade off "register spill" with "vector size", which i
can tell you right now will be absolutely critical for 3D GPU
workloads. we can *NOT* allow register spill using LD/STs for a GPU
workload covering gigabytes of data, the power consumption penalty
would just be mental [commercially totally unacceptable]. it would be
far better to allow a function which required that many registers to
dynamically set vscale=2 or possibly even vscale=1 (we have 128
*scalar* registers, where, reminder: MVL is used to say how many of
the *SCALAR* register file get utilised to "make up" a vector).

oh. ah. bruce (et al), isn't there an option in RVV to allow Vectors
to sit on top of the *scalar* register file(s)? (Zfinx)
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-registers

> Or maybe we'll
> be able to synthesize a predicate from an active vlen and pattern match?

the purpose of having a dynamic VL, which comes originally from the
Cray Supercomputer Vector Architecture, is to not have to use the
kinds of instructions that perform bit-manipulation
(mask-manipulation) which are not only wasted CPU cycles, but end up
in many [simpler] hardware implementations with masked-out "Lanes"
running empty, particularly ones that have Vector Front-ends but
predicated-SIMD-style ALU backends.
i would be quite concerned, therefore, if by "synthesise a predicate"
the idea was, instead of using actual dynamic truncation of vlen
(changing vscale), instructions were used to create a predicate which
had its last bits set to zero. basically using RVV/SV fail-on-first
to emulate the way that ARM SVE fail-on-first creates masks. that
would be... yuk :)

> Sander's patch takes the existing 'vscale' keyword and allows it to be
> used outside the type, to serve as an integer constant that represents the
> same runtime value as it does in the type.

if i am understanding things correctly, it reaaally needs to be
allowed to be a variable, definitely not a constant.

> Some previous discussions proposed using an intrinsic to start with for this,
> and that may still happen depending on community reaction, but the Arm
> hpc compiler team felt it was important to at least start a wider discussion
> on this topic before proceeding. From our experience, using an intrinsic makes
> it harder to work with shufflevector or get good code generation. If someone
> can spot a problem with our reasoning on that please let us know.

honestly can't say, can i leave it to you to decide if it's related
to this vscale thread, and, if so, could you elaborate further? if
it's not, feel free to leave it for another time? will see if there
is any follow-up discussion here.

thanks graham.

l.
Thanks @Robin and @Graham for giving some background on scalable
vectors and clarifying some of the details!

Apologies if I'm repeating things here, but it is probably good to
emphasize the conceptually different, but complementary, models for
scalable vectors:

1. Vectors of unknown, but constant, size throughout the program.
2. Vectors of changing size throughout the program.

Where (2) basically builds on (1).

LLVM's scalable vectors support (1) directly. The scalable type is
defined using the concept `vscale`, which is constant throughout the
program and expresses the unknown, but maximum, size of a scalable
vector. My patch builds on that definition by adding `vscale` as a
keyword that can be used in expressions. For this model, predication
can be used to disable the lanes that are not needed. Given that
`vscale` is defined as inherently constant and a corner-stone of the
scalable type, it makes no sense to describe the `vscale` keyword as
an intrinsic.

The other model for scalable vectors (2) requires additional
intrinsics to get/set the `active VL` at runtime. This model would be
complementary to `vscale`, as it still requires the same scalable
vector type to describe a vector of unknown size. `vscale` can be
used to express the maximum vector length, but the `active vector
length` would need to be handled through explicit intrinsics. As
Robin explained, it would also need Simon Moll's vector predication
proposal to express operations on `active VL` elements.

> apologies for asking: these are precisely the kinds of
> from-zero-prior-knowledge questions that help with any review process
> to clarify things for other users/devs.

No apologies required, the discussion on scalable types has been
going on for quite a while, so there are many email threads to read
through. It is important these concepts are clear and well
understood!

> clarifying this in the documentation strings on vscale, perhaps even
> providing c-style examples, would be extremely useful, and avoid
> misunderstandings.

I wonder if we should add a separate document about scalable vectors
that describes these concepts in more detail, with some examples.

Given that (2) is a very different use-case, I hope we can keep
discussions on that model separate from this thread, if possible.

Thanks,
Sander
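[A rough IR sketch of the two models side by side: model (1) uses the
masked-load intrinsic with a scalable type and a predicate (assuming
masked.load is allowed on scalable types), while model (2) adds an
explicit 'active vector length' operand in the style of Simon Moll's
vector-predication proposal. The vp.* naming below is illustrative
only, since that proposal was still under review at the time:]

    declare <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>*, i32, <vscale x 4 x i1>, <vscale x 4 x i32>)
    ; illustrative naming from the vector-predication proposal, not final
    declare <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>, <vscale x 4 x i1>, i32)

    define <vscale x 4 x i32> @two_models(<vscale x 4 x i32>* %p,
                                          <vscale x 4 x i32> %b,
                                          <vscale x 4 x i1> %mask,
                                          i32 %evl) {
      ; model (1): vscale is a program-wide constant; lanes that are not
      ; needed are simply disabled by the predicate %mask
      %v = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(
               <vscale x 4 x i32>* %p, i32 4, <vscale x 4 x i1> %mask,
               <vscale x 4 x i32> undef)
      ; model (2): same scalable type and constant vscale, but the operation
      ; also takes an explicit active-vector-length operand %evl
      %r = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(
               <vscale x 4 x i32> %v, <vscale x 4 x i32> %b,
               <vscale x 4 x i1> %mask, i32 %evl)
      ret <vscale x 4 x i32> %r
    }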