Luke Kenneth Casson Leighton via llvm-dev
2019-Oct-01 11:37 UTC
[llvm-dev] Adding support for vscale
On Tue, Oct 1, 2019 at 11:08 AM Graham Hunter <Graham.Hunter at arm.com> wrote:
> Hi Luke,

hi graham, thanks for responding in such an informative fashion.

>> On 1 Oct 2019, at 09:21, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> typedef vec4 float[4]; // SEW=32,LMUL=4 probably
>> static vec4 globalvec[1024]; // vscale == 1024 here
>
> 'vscale' just refers to the scaling factor that gives the maximum size of
> the vector at runtime, not the number of currently active elements.

ok, this starts to narrow down the definition: i'm attempting to get
clarity on what it means. so, in the example above involving globalvec,
"maximum size of the vector at runtime" would be "1024" (not involving
RVV VL). and... would vscale be dynamically (but permanently)
substituted with the constant "1024" there? and in the example i gave
which was a local function, would vscale be substituted with
"local_vlen_param_len", permanently and irrevocably, at runtime?

or, is it intended to be dynamically (but permanently) substituted with
something related to RVV's *MVL* at runtime? if it's intended to be
substituted by MVL, *that* starts to make more sense, because MVL may
actually vary depending on the hardware on which the program is being
executed. smaller systems may have an MVL of only 1 (only allowing one
element of a vector to be executed at any one time), whereas Mainframe
or massively-parallel systems may have an MVL in the hundreds.

> SVE will be using predication alone to deal with data that doesn't fill an
> entire vector, whereas RVV and SX-Aurora

[and SV! :) ]

> want to use a separate mechanism
> that fits with their hardware having a changeable active length.

okaaay, now, yes, i Get It. this is MVL (Max Vector Length) in RVV.

btw minor nitpick: it's not that "their" hardware changes, it's that
the RISC-V Vector Spec *allows* an arbitrary MVL (so there are multiple
vendors, each choosing an MVL suited to their customers' needs).
"RVV-compliant hardware" would fit things better.

hmmm, that's going to be interesting for SV, because SV specifically
permits variable MVL *at runtime*. however, just checking the spec
(don't laugh, yes i know i wrote it...) MVL is set through an immediate.
there's a way to bypass that and set it dynamically, but it's intended
for context-switching, *not* for general-purpose use.

ah wait... argh. ok, is vscale expected to be a global constant *for
the entire application*? note above: SV allows MVL to be set
*arbitrarily*, and this is extremely important.

the reason it is important is that, unlike RVV, SV uses the actual
*scalar* register files: it does *NOT* have a separate "Vector Register
File". so if vscale were set to, say, 8 on a per-runtime basis, that
would set the total number of registers *in the scalar register file*
utilised for vectorisation. it then becomes impossible to set vscale to
4, which another function might have been specifically designed to use.
what would then need to be done is: predicate out the top 4 elements,
which comes with a performance penalty and a whole boat-load of mess.

so, apologies: we reaaaally need vscale to be selectable at the very
least on a per-function basis. otherwise, applications would have to
set it (at runtime) to the "least inconvenient" value, wasting "the
least-inconvenient number of registers".

> The scalable type tells you the maximum number of elements that could be
> operated on,

... which is related (in RVV) to MVL...

> and individual operations can constrain that to a smaller
> number of elements.

... by setting VL.

>> hmmmm. so it looks like data-dependent fail-on-first is something
>> that's going to come up later, rather than right now.
>
> Arm's downstream compiler has been able to use the scalable type and a
> constant vscale with first-faulting loads for around 4 years, so there's
> no conflict here.

ARM's SVE uses predication. the LDs that would [normally] cause
page-faults create a mask, instead, giving *only* those LDs which
"succeeded". that's then passed into standard [SIMD-based] predicated
operations, masking out operations, the most important one (for the
canonical strcpy / memcpy) being the ST.

> We will need to figure out exactly what form the first faulting intrinsics
> take of course, as I think SVE's predication-only approach doesn't quite
> fit with others -- maybe we'll end up with two intrinsics?

perhaps - as robin suggests, this is for another discussion (not
related to vscale).

or... maybe not. if vscale were permitted to be dynamically set, not
only would it suit SV's ability to set different vscales on a
per-function (or other) basis, it could be utilised by RVV, SV, and
anything else that changes VL based on data-dependent conditions, to
change the following instructions.

what i'm saying is: vscale needs to be permitted to be a variable, not
a constant.

now, ARM SVE wouldn't *use* that capability: it would hard-code it to
512/SEW/etc. (or whatever) by setting it to a global constant.
follow-up LLVM-IR-morphing passes would end up generating
globally-fixed-width SVE instructions.

RVV would be able to set that vscale variable as a way to indicate
data-dependent lengths [various IR-morphing passes would carry out the
required substitutions prior to creating actual assembler].

SV would be able to likewise do that *and* start from a value for
vscale that suited each function's requirement to utilise a subset of
the register file matching the workload. SV could then trade off
"register spill" against "vector size", which i can tell you right now
will be absolutely critical for 3D GPU workloads. we can *NOT* allow
register spill using LD/STs for a GPU workload covering gigabytes of
data: the power-consumption penalty would just be mental [commercially
totally unacceptable]. it would be far better to allow a function which
required that many registers to dynamically set vscale=2 or possibly
even vscale=1 (we have 128 *scalar* registers, where, reminder: MVL is
used to say how many of the *SCALAR* register file get utilised to
"make up" a vector).

oh. ah. bruce (et al), isn't there an option in RVV to allow Vectors to
sit on top of the *scalar* register file(s)? (Zfinx)
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-registers

> Or maybe we'll
> be able to synthesize a predicate from an active vlen and pattern match?

the purpose of having a dynamic VL, which comes originally from the
Cray Supercomputer Vector Architecture, is to avoid the kinds of
instructions that perform bit-manipulation (mask-manipulation): these
are not only wasted CPU cycles, they also end up, in many [simpler]
hardware implementations, with masked-out "Lanes" running empty,
particularly ones that have Vector front-ends but
predicated-SIMD-style ALU back-ends.

i would be quite concerned, therefore, if by "synthesise a predicate"
the idea was, instead of using actual dynamic truncation of vlen
(changing vscale), to use instructions to create a predicate which had
its last bits set to zero: basically using RVV/SV fail-on-first to
emulate the way that ARM SVE fail-on-first creates masks. that would
be... yuk :)

> Sander's patch takes the existing 'vscale' keyword and allows it to be
> used outside the type, to serve as an integer constant that represents the
> same runtime value as it does in the type.

if i am understanding things correctly, it reaaally needs to be allowed
to be a variable, definitely not a constant.

> Some previous discussions proposed using an intrinsic to start with for this,
> and that may still happen depending on community reaction, but the Arm
> hpc compiler team felt it was important to at least start a wider discussion
> on this topic before proceeding. From our experience, using an intrinsic makes
> it harder to work with shufflevector or get good code generation. If someone
> can spot a problem with our reasoning on that please let us know.

honestly can't say: can i leave it to you to decide if it's related to
this vscale thread, and, if so, could you elaborate further? if it's
not, feel free to leave it for another time? will see if there is any
follow-up discussion here.

thanks graham.

l.
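the register-budget trade-off described above (in SV, vectors live in the *scalar* register file, so a larger vscale consumes more scalar registers and leaves fewer for everything else) can be sketched as a toy model. to be clear, everything below is illustrative guesswork, not SV's actual allocation policy: the helper name, the register counts and the brute-force search are all assumptions made up for this sketch.

```python
# Toy model of per-function vscale selection for a unified scalar
# register file (hypothetical; not SV's real allocator).

def pick_vscale(live_vectors, scalar_regs_needed, total_regs=96,
                elems_per_vector=4, max_vscale=4):
    """Choose the largest vscale such that all live vectors plus the
    function's scalar working set still fit without spilling."""
    for vscale in range(max_vscale, 0, -1):
        regs_per_vector = elems_per_vector * vscale   # vector occupies this many scalar regs
        if live_vectors * regs_per_vector + scalar_regs_needed <= total_regs:
            return vscale
    return 1  # worst case: effectively scalar, but still no spill

# a register-hungry function is forced down to vscale=1...
print(pick_vscale(live_vectors=8, scalar_regs_needed=40))
# ...while a lean loop can run at the full width.
print(pick_vscale(live_vectors=4, scalar_regs_needed=8))
```

the point of the sketch: vscale must be choosable per function (or per loop), because the "no spill" constraint depends on each function's scalar working set.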
Hi Luke,

>> want to use a separate mechanism
>> that fits with their hardware having a changeable active length.
>
> okaaay, now, yes, i Get It. this is MVL (Max Vector Length) in RVV.
> btw minor nitpick: it's not that "their" hardware changes, it's that
> the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
> multiple vendors each choosing an arbitrary MVL suited to their
> customer's needs). "RVV-compliant hardware" would fit things better.

Yes, the hardware doesn't change; dynamic/active VL just stops processing
elements past the number of active elements. SVE similarly allows vendors
to choose a maximum hardware vector length, but skips an active VL in
favour of predication only.

I'll try and clear things up with a concrete example for SVE.

Allowable SVE hardware vector lengths are all multiples of 128 bits. So
our main legal types for codegen will have a minimum size of 128 bits,
e.g. <vscale x 4 x i32>. If a low-end device implements SVE at 128 bits,
then at runtime vscale is 1 and you get exactly <4 x i32>. For mid-level
devices I'd guess 256 bits is reasonable, so vscale would be 2 and
<vscale x 4 x i32> would be equivalent to <8 x i32>, but we still only
guarantee that the first 4 lanes exist. For Fujitsu's A64FX at 512 bits,
vscale is 4 and the legal type would now be equivalent to <16 x i32>.

In all cases, vscale is constant at runtime for those machines. While it
is possible to change the maximum vector length from privileged code (so
you could set the A64FX to run with 256b or 128b vectors if you chose...
even 384b if you wanted to), we don't allow for changes at runtime since
that may corrupt data. Expecting the compiler to be able to recover from
a change in vector length when you have spilled registers to the stack
isn't reasonable.

Robin found a way to make this work for RVV; there, he had the additional
concern of registers being joined together in x2,x4,(x8?) combinations.
This was resolved by just making the legal types bigger when that feature
is in use, iirc. Would that approach help SV, or is it just a backend
thing deciding how many scalar registers it can spare?

>> The scalable type tells you the maximum number of elements that could be
>> operated on,
>
> ... which is related (in RVV) to MVL...
>
>> and individual operations can constrain that to a smaller
>> number of elements.
>
> ... by setting VL.

Yes, at least for architectures that support changing VL. Simon's
proposal was to provide intrinsics for common IR operations which took
an additional parameter corresponding to VL; vscale doesn't represent
VL, so doesn't need to change.

>>> hmmmm. so it looks like data-dependent fail-on-first is something
>>> that's going to come up later, rather than right now.
>>
>> Arm's downstream compiler has been able to use the scalable type and a
>> constant vscale with first-faulting loads for around 4 years, so there's
>> no conflict here.
>
> ARM's SVE uses predication. the LDs that would [normally] cause
> page-faults create a mask, instead, giving *only* those LDs which
> "succeeded".

Those that succeeded until the first that didn't -- every bit in the
mask after a fault is unset, even if it would have succeeded with a
first-faulting gather operation.

> that's then passed into standard [SIMD-based] predicated operations,
> masking out operations, the most important one (for the canonical
> strcpy / memcpy) being the ST.

Nod; I wrote an experimental early exit loop vectorizer which made use
of that.

>> We will need to figure out exactly what form the first faulting intrinsics
>> take of course, as I think SVE's predication-only approach doesn't quite
>> fit with others -- maybe we'll end up with two intrinsics?
>
> perhaps - as robin suggests, this is for another discussion (not
> related to vscale).
>
>> Or maybe we'll
>> be able to synthesize a predicate from an active vlen and pattern match?
>
> the purpose of having a dynamic VL, which comes originally from the
> Cray Supercomputer Vector Architecture, is to not have to use the
> kinds of instructions that perform bit-manipulation
> (mask-manipulation) which are not only wasted CPU cycles, but end up
> in many [simpler] hardware implementations with masked-out "Lanes"
> running empty, particularly ones that have Vector Front-ends but
> predicated-SIMD-style ALU backends.

Yeah, I get that, which is why I support Simon's proposals.

> i would be quite concerned, therefore, if by "synthesise a predicate"
> the idea was, instead of using actual dynamic truncation of vlen
> (changing vscale), instructions were used to create a predicate which
> had its last bits set to zero.
>
> basically using RVV/SV fail-on-first to emulate the way that ARM SVE
> fail-on-first creates masks.
>
> that would be... yuk :)

Ah, I could have made it a bit clearer. I meant have a first-faulting
load intrinsic which returns a vector and an integer representing the
number of valid lanes. For architectures using a dynamic VL, you could
then pass that integer to subsequent operations so they are tied to
that number of active elements.

For SVE/AVX512, we'd have to splat that integer and compare against
a stepvector to generate a mask. Ugly, but it can be pattern matched
into the direct first/no-faulting loads and masks for codegen. Or we
just use separate intrinsics.

To discuss later, I think; possibly on the early-exit loopvec thread.

-Graham
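The splat-and-compare-with-stepvector idea Graham describes can be illustrated behaviourally in a few lines of Python. This is only a sketch of the semantics, not real SVE/AVX512 intrinsics; the function name is made up for illustration.

```python
# Behavioural sketch: a first-faulting load yields (vector, n_valid);
# an architecture without a dynamic VL recovers an equivalent
# predicate by splatting n_valid and comparing it lane-wise against a
# stepvector (0, 1, 2, ...).

def mask_from_active_vl(n_valid, vlen):
    stepvector = list(range(vlen))     # lane indices 0..vlen-1
    splat = [n_valid] * vlen           # n_valid broadcast to every lane
    # lane is active iff its index is below the number of valid lanes
    return [step < n for step, n in zip(stepvector, splat)]

print(mask_from_active_vl(3, 8))  # first 3 lanes active, rest masked off
```

For codegen, a pattern like this could then be matched back into the direct first-faulting/no-faulting load plus predicate, as the mail suggests.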
Luke Kenneth Casson Leighton via llvm-dev
2019-Oct-01 14:55 UTC
[llvm-dev] Adding support for vscale
hi graham,

On Tue, Oct 1, 2019 at 2:07 PM Graham Hunter <Graham.Hunter at arm.com> wrote:

>> the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
>> multiple vendors each choosing an arbitrary MVL suited to their
>> customer's needs). "RVV-compliant hardware" would fit things better.
>
> Yes, the hardware doesn't change, dynamic/active VL just stops processing
> elements past the number of active elements.
>
> SVE similarly allows vendors to choose a maximum hardware vector length
> but skips an active VL in favour of predication only.

yes.

> I'll try and clear things up with a concrete example for SVE.
>
> Allowable SVE hardware vector lengths are all multiples of 128 bits. So
> our main legal types for codegen will have a minimum size of 128 bits,
> e.g. <vscale x 4 x i32>.

> For Fujitsu's A64FX at 512 bits, vscale is 4 and legal type would now be
> equivalent to <16 x i32>.

okaaaay, so, right, it is kinda similar to MVL for RVV, except
dynamically settable in powers of 2. okaay, makes sense: just as with
Cray-style Vectors, high-end machines can go extremely wide.

> In all cases, vscale is constant at runtime for those machines. While it
> is possible to change the maximum vector length from privileged code (so
> you could set the A64FX to run with 256b or 128b vectors if you chose...
> even 384b if you wanted to), we don't allow for changes at runtime since
> that may corrupt data. Expecting the compiler to be able to recover from
> a change in vector length when you have spilled registers to the stack
> isn't reasonable.

deep breath: i worked for Aspex Semiconductors, who had what was called
an "Array String Processor". 2-bit ALUs could have a gate opened up
which constructed 4-bit ALUs; open up another gate, now you have 8-bit
ALUs; open another, you have 16-bit, 32-bit, 64-bit and so on. thus, as
a massively-deep SIMD architecture, you could, at runtime, turn
computations round from using 32 cycles with a batch of 32x 2-bit ALUs
to perform 32x separate and distinct parallel 64-bit operations, OR
open up all the gates and use ONE cycle to compute a single 64-bit
operation.

with LOAD/STORE taking fixed time but algorithms (obviously) taking
variable lengths of time, our job, as FAEs, was to write f*****g
SPREADSHEETS (yes, really) giving estimates of which was the best
possible balance to keep LD/STs an equal time-consumer with the frickin
algorithm. as you can probably imagine, this being in assembler, and
literally a dozen algorithms having to be written where one would
normally do, code productivity was measured in DAAAAYYYS per line of
code.

we have a genuine need to do something similar here (except automated,
or at an absolute minimum under the guidance of #pragma). the reason is
that this is for a [hybrid] 3D GPU, to run texturisation and other
workloads. these are pretty much unlike a CPU workload: data comes in,
gets processed, data goes out. there's *one* LD, one algorithm, one ST,
in a massive loop covering tens to hundreds of megabytes per second
(gigabytes, in large GPUs). if there's *any* register spill at all, the
L1/L2 performance and power penalty is so harsh that it's absolutely
unthinkable to let it happen. this was outlined in Jeff Bush's
nyuzipass2016 paper.

the solution is therefore to have fine-grained dynamic control over
vscale, on a per-loop basis. letting the registers spill *cannot* be
permitted, so it is not actually a problem per se. with fine-grained
dynamic control over vscale, we can perform a (much better, automated)
equivalent of the awfulness-we-did-at-Aspex: analysing the best vscale
to use for each loop, one that covers as many registers as possible
*without* spill. even if vscale gets set to 1, that's far, _far_ better
than allowing LD/ST register-spilling. and with most 3D workloads being
very specifically designed to fit into 128 FP32 registers (even for
MALI400 and Vivante GC800), and our design having 128 FP64 registers
that can be MMX-style subdivided into 2x FP32 or 4x FP16, we should be
fine.

> Robin found a way to make this work for RVV; there, he had the additional
> concern of registers being joined together in x2,x4,(x8?) combinations.
> This was resolved by just making the legal types bigger when that feature
> is in use iirc.

unfortunately - although i do not know the full details (Jacob knows
this better than i do) - there are some 3D workloads involving 4x3 or
3x4 matrices, and Texture datasets with arrays of X,Y,Z coordinates,
which means that power-of-two boundaries will result in serious
performance penalties (a 25% reduction due to a Lane always running
empty).

> Would that approach help SV, or is it just a backend thing deciding how
> many scalar registers it can spare?

it would be best to see what Jacob has to say: we're basically likely
to be reserving the top scalar registers, x32-x127, for "use" as
vectors. however, being able to dynamically alter the actual allocation
of registers on a per-loop basis [and never "spilling"] is going to be
critical to ensuring the commercial success and acceptance of the
entire processor.

in the absolute worst case we would be forced to set vscale = 1, which
then "punishes" performance by only utilising, say, x32-x47. this would
(hypothetically) result in a meagre 25% of peak performance (all 16
registers being effectively utilised as scalar-only). if however vscale
could be dynamically set to 4, that loop could (hypothetically) deploy
registers x32-x95, the parallelism would properly kick in, and we'd get
4x the level of performance.

ok, that was quite a lot; cutting much of what follows...

>> ARM's SVE uses predication. the LDs that would [normally] cause
>> page-faults create a mask, instead, giving *only* those LDs which
>> "succeeded".
>
> Those that succeeded until the first that didn't -- every bit in the
> mask after a fault is unset, even if it would have succeeded with a
> first-faulting gather operation.

yehyeh. i do like ffirst, a lot.

>> that's then passed into standard [SIMD-based] predicated operations,
>> masking out operations, the most important one (for the canonical
>> strcpy / memcpy) being the ST.
>
> Nod; I wrote an experimental early exit loop vectorizer which made use of that.

it's pretty awesome, isn't it? :) the one thing that nobody really
expected to be able to parallelise / auto-vectorise, and it's now
possible!

> Ah, I could have made it a bit clearer. I meant have a first-faulting
> load intrinsic which returns a vector and an integer representing the
> number of valid lanes.

[ah, when it comes up (early-exit loopvec thread?) i should mention
that in SV we've augmented fail-first to *true* data-dependent
semantics, based on whether the result of [literally any] operation is
zero or not. work-in-progress here (related to FP "what constitutes
fail", because NaN can be considered "fail").]

> For architectures using a dynamic VL, you could
> then pass that integer to subsequent operations so they are tied to
> that number of active elements.
>
> For SVE/AVX512, we'd have to splat that integer and compare against
> a stepvector to generate a mask. Ugly, but it can be pattern matched
> into the direct first/no-faulting loads and masks for codegen.

this sounds very similar to the RVV use of a special "vmfirst"
predicate-mask instruction, which is used to detect the zero point in
the canonical strcpy example. it... works :)

> Or we just use separate intrinsics.
>
> To discuss later, I think; possibly on the early-exit loopvec thread.

ok, yes, agreed.

thanks graham.

l.
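the two fail-on-first flavours discussed in this exchange can be modelled behaviourally in a few lines of python. this is purely an illustrative sketch with made-up helper names, not real RVV/SVE intrinsics: one helper models the SVE-style first-faulting load (mask bits after the first fault stay clear), the other models data-dependent VL truncation at a zero element (cf. vmfirst on a compare-with-zero predicate in the canonical strcpy).

```python
# Behavioural sketches only; names and semantics are illustrative.

def ffirst_load(memory, base, vl):
    """SVE-style first-faulting load: every lane after the first
    faulting element has its mask bit cleared, even if that lane's
    address would have succeeded."""
    values, mask = [], []
    faulted = False
    for i in range(vl):
        addr = base + i
        if faulted or addr >= len(memory):   # at/after a fault: lane inactive
            faulted = True
            values.append(0)
            mask.append(False)
        else:
            values.append(memory[addr])
            mask.append(True)
    return values, mask

def data_dependent_vl(values):
    """Dynamic-VL-style truncation for strcpy: cut the effective VL at
    the first zero element, including the terminating zero."""
    for i, v in enumerate(values):
        if v == 0:
            return i + 1
    return len(values)

mem = [104, 105, 0, 120]            # "hi\0x" as byte values
vals, mask = ffirst_load(mem, 0, 8) # only the first 4 lanes can load
print(mask)
print(data_dependent_vl(vals[:4]))  # strcpy stops after the NUL
```

the contrast the thread keeps returning to: the first helper yields a *mask* to feed predicated operations, the second yields a *length* that truncates subsequent operations directly.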