Luke Kenneth Casson Leighton via llvm-dev
2021-Aug-03 17:32 UTC
[llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction
(renato thank you for cc'ing, due to digest subscription at the moment)

On Tue, Aug 3, 2021 at 3:25 PM Renato Golin <rengolin at gmail.com> wrote:
>
> On Sat, 31 Jul 2021 at 00:33, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> if however instead of an NxM problem this was turned into N+M,
>> separating out "scalar base" from "augmentation" throughout the IR,
>> the problem disappears entirely.
>
> Hi Luke,
>
> It's not entirely clear to me what you are suggesting here.

it's a nebulous but fundamentally low-level concept that may take some time to sink in, and also to appreciate the significance of. some background: over the past 3+ years i have made a comprehensive comparative study of ISAs (both SIMD and Vector), the latter having only recently been revived thanks to RVV bringing Cray-style Vectors back to the forefront of computing research. here is a quick summary:

* Packed SIMD. the worst kind of ISA. the following article puts it diplomatically: https://www.sigarch.org/simd-instructions-considered-harmful/ i have no such compunction or affiliation, and as an engineer can speak freely and plainly. gloves off, statement of fact: Packed SIMD is the worst thing to become ubiquitous in computer science, bar none. frankly, the sooner Packed SIMD is shot and buried as an embarrassing and incredibly expensive historical footnote in the history of computing, the better [there are always exceptions: for e.g. embedded DSP Audio, Packed SIMD is perfect].

* Predicated SIMD: by total contrast, this is actually not half bad. SVE2, AVX-512, GPU ISAs, anything that takes masks per element: the masks completely eliminate Packed SIMD Hell (except for ISAs that still have SIMD alignment for memory LD/STs even for Predicated LD/STs, but hey, nothing's perfect).

* Horizontal-First Vectors. known best from the Cray-1, and also found in modern ISAs such as NEC's SX-Aurora and now RVV, horizontal vectors are of the form "for i in range(VL): operation(vec_src_reg[i], vec_dest_regs[i])"

* Vertical-First Vectors. these are *NOT* very well-known, and the only two ISAs that i know of (to date) are SVP64 and Mitch Alsup's MyISA 66000.

btw, here's the [terse] page on SVP64's format: https://libre-soc.org/openpower/sv/svp64/ the overview is much more useful for understanding: https://libre-soc.org/openpower/sv/overview/

additional relevant ISAs:

* Broadcom VideoCore IV, which has a Vector-form of "REP", where the operation may be repeated 2, 4, 8, 16, 32 times, or "the count taken from scalar r0".

* the Mill Architecture, which is a "tagged" ISA. here, there is no "ADD32", "ADD16", "ADD64", "ADD8": there is *ONLY* "ADD". the width of the operation is taken *FROM THE LOAD OPERATION*, which is the ONLY place where the register operand width is specified. operation widths are DEDUCED, statically, at compile-time, based on the "tags" associated with source registers as they cascade through. two strategically important instructions are included: WIDEN and NARROW.

the significance of mentioning the Mill is that the ISA has a closer match to the simple (basic) LLVM intrinsics than most other ISAs. (a rough sketch of the tagged idea follows.)
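a rough sketch of the tagged-width idea, in python-style pseudocode. all names here are mine and purely illustrative, NOT actual Mill syntax:

    # width "tags" ride along with values, so a single generic ADD
    # deduces its width from its operands, which got theirs from LOAD.
    class Tagged:
        """a value carrying a width tag (8/16/32/64 bits)."""
        def __init__(self, value, width):
            self.value, self.width = value, width

    def load(mem, addr, width):
        # the LOAD is the ONLY place a width is ever specified
        return Tagged(mem[addr], width)

    def add(a, b):
        # deduced statically at compile-time on the real machine;
        # modelled here as a runtime check
        assert a.width == b.width
        mask = (1 << a.width) - 1
        return Tagged((a.value + b.value) & mask, a.width)

    def widen(a):   # the explicit WIDEN instruction
        return Tagged(a.value, a.width * 2)

    mem = {0: 200, 4: 100}
    x, y = load(mem, 0, 8), load(mem, 4, 8)
    print(add(x, y).value)                 # 44: wrapped at 8-bit width
    print(add(widen(x), widen(y)).value)   # 300: widened to 16-bit first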
however even there (unless they've done drastic and extensive changes) they will be limited to trying to fit a flexible (tagged) ISA into an inflexible IR that was never designed with "context" in mind.

> For context:
> * Historically, we have tried to keep as many instructions as native IR as possible to avoid the explosion of intrinsics, as you describe.

a crucially important goal that gets a big thumbs-up from me.

> * However, traditionally, intrinsics reduce the number of instructions in a basic block instead of increasing them, so there's always the balance.

where the flip-side of that is that the CISC-ness of a given new intrinsic could itself impact ISAs that don't support that feature natively, making it necessary for them to emit rather more assembly instructions than it first appears.

> * For example, some reduction intrinsics were added to address bloat, but no target is forced to use them.

excellent. iteration and reduction (including fixed-schedule paralleliseable reduction) is one of the capabilities being added to SVP64. it's good to hear that that, as a concept, has been added. if i may, i will use that as an example, later.

> * If you can represent the operation as a series of native IR instructions, by all means, you should do so.

this assumes (perfectly reasonably, mind you) that the (hypothetical) ISA itself is not capable of expressing a given operation *in* IR, and consequently it has to be done as a series of passes, substituting for the lack of native (direct) support of a given operation with some (faster?) operations that *do* (ultimately) exist as actual assembler. in some architectures a particular native IR instruction might actually exist, but the native assembler variant is so horribly slow at the hardware level that alternatives are actually *demanded* by users. AVX's native Horizontal Reduction instructions would be a good example.

> I get it that a lot of intrinsics are repeated patterns over all variations and that most targets don't have that many, so it's "ok".
>
> I also get it that most SIMD vector operations aren't intrinsically vector,

[indeed. i have spent considerable time recently on the wikipedia Vector_processor page, and associated nearby related pages (SIMD, GPUs, etc), correcting that unfortunate meme that "SIMD equals vectors". this is unfortunately where Corporate Marketing has badly interfered with actual Computer Science. sigh.]

> but expansions of scalar operations for the benefit of vectorisation
> (plus predication, to avoid undefined behaviour and to allow "funny" patterns, etc).

yes. and this perspective is where Mitch Alsup's MyISA 66000 and SVP64's "Vertical-First" Mode come into play: the instructions in both are effectively executed *in scalar form ONLY* (as far as Program Order is concerned), and, at the end of a loop/branch, you *EXPLICITLY* increment the element index, such that all *SCALAR* operations in the loop now execute on *scalar* element one. at the end of the loop, explicitly increment the element index to 2, loop back, and execute on element *two* of the Vector Register. repeat until the loop-termination condition is met. the challenge ahead for Libre-SOC (and for MyISA 66000) will be to introduce this entirely new concept to compilers (a sketch of the execution model is below).
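a minimal python-style sketch contrasting the two execution models (illustrative only: in the real ISAs the element index lives in hardware state and the increment is an explicit instruction):

    VL = 4
    a, b = [1, 2, 3, 4], [10, 20, 30, 40]

    # horizontal-first (cray-style, RVV, SVE): ONE instruction sweeps
    # all elements before the next instruction in program order begins
    def horizontal_add(a, b):
        return [a[i] + b[i] for i in range(VL)]

    # vertical-first (SVP64, MyISA 66000): every instruction in the
    # loop body operates on ONE element; the element index is then
    # explicitly incremented and the loop branches back
    def vertical_add(a, b):
        dest = [0] * VL
        i = 0                       # element index (hardware state)
        while True:
            dest[i] = a[i] + b[i]   # scalar add, element i only
            i += 1                  # explicit element-index increment
            if i == VL:             # loop-termination condition
                return dest

    print(horizontal_add(a, b))     # [11, 22, 33, 44]
    print(vertical_add(a, b))       # same result, different program order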
however given that it's effectively scalar rather than Vector, the task should actually be a *lot* easier than it is for Horizontal-First ISAs such as SVE and RVV.

> But it's not clear to me what the "augmentation" part would be in other targets.

the proposal is - at its heart - to replace all IR of the form:

   llvm.masked.load.v16f32.predicatespec(arguments)
   llvm.masked.load.v2f64.predicatespec(arguments)

and so on with just:

   llvm.load(mask=x, arguments)

there *is* no llvm.masked.load: there is *only* an optional argument. rather than expanding out explicitly / statically into dozens of special IR intrinsics at compile-time (i.e. when running llvm), *there is only one*: llvm.load.

additional optional arguments then specify whether the operation is twin-predicated, by having a *second* predicate mask (yes, SVP64 can apply one predicate mask to the source and another to the destination. conceptually this is equivalent to back-to-back VGATHER-VSCATTER). additional optional arguments also specify whether there is a SWIZZLE applied, or Sub-Vectors, or any other type of "augmentation".

now, here's the kicker: what we need to support SVP64 is for *all llvm basic intrinsics to support all possible optional augmentations of all possible types*. yes, really, that's not a typo or a mis-statement. we *genuinely* need a sign-extended twin-predicated intrinsic:

   llvm.sext(source_mask=source_pred, dest_mask=dest_pred, source_argument)

>> even permute / shuffle Vector/SIMD operations are separateable into
>> "base" and "abstract Vector Concept": the "base" operation in that
>> case being "MV.X" (scalar register copy, indexable - reg[RT] =
>> reg[reg[RA]] and immediate variant reg[RT] = reg[RA+imm])
>
> Shuffles are already represented as IR instructions (insert/extract vector), so I'm not sure this clarifies much.

ok, so is it possible to do shuffle-sign-extend, shuffle-fptrunc, shuffle-fabs, shuffle-sqrt, shuffle-log, and any other single-src single-dest operation? this is where the "augmentation" - the separation into PREFIX-SUFFIX - comes into play. SVP64 has the ability to set up "SWIZZLE" contexts as well as certain kinds of "REMAP" Schedules (triple-loop butterfly schedules) - PREFIXes - that can be *applied* to base operations (SUFFIXes), which, if we were to expand all those possibilities out, would literally create several MILLION intrinsics.

> Have you looked at the current scalable vector implementation?

briefly, yes. i also helped with some review insights when RVV support was being added, although that was a brief glimpse into a massive world where i was (and still am) constrained, unfortunately, by time and resources, much as i would love that to be otherwise.

> It allows a set of operations on open-ended vectors that are controlled by a predicate, which is possibly the "augmentation" that you're looking for?

no. what is happening there is that it is a reflection of the limitations of the current ISAs. i can say with 100% certainty that the SVE implementation will not have been designed to take SVP64 into consideration. the reason is actually very simple and straightforward: at the time LLVM SVE was added, SVP64 did not even exist.

so for example, let us take the new feature added in LLVM SVE: reduction. most Vector ISAs add *explicit* reduction operations. NEC SX-Aurora for example has reduce-add, reduce-multiply, reduce-OR, reduce-AND, reduce-XOR, and that's about it. SVP64, by contrast, has:

* reduction as a fixed (paralleliseable) schedule
* a base operation

(see the sketch below for how the two compose.)
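a minimal python sketch (purely illustrative, not SVP64 syntax) of what "a fixed paralleliseable schedule composed with a base operation" means. the schedule never changes; the base op is just a parameter:

    # fixed power-2-halving parallel-reduction schedule. every pair at
    # distance 'step' is independent, so hardware can issue them all
    # in parallel. the base op (add, max, AND, ...) is a parameter.
    def parallel_reduce(op, vec):
        vec = list(vec)
        step = 1
        while step < len(vec):
            for i in range(0, len(vec) - step, step * 2):
                vec[i] = op(vec[i], vec[i + step])
            step *= 2
        return vec[0]

    print(parallel_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5, 6, 7, 8]))  # 36
    print(parallel_reduce(max, [3, 1, 4, 1, 5, 9, 2, 6]))                 # 9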
you can LITERALLY apply reduction to... to... the "llvm.maximum" scalar operation, or to divide or subtract (or any other non-commutative operation) if you really really want to, and the ISA will go, "ok, done. next?"

   sv.fmax/MR FRT.v, FRA.v, FRB.v  # MR means "map-reduce mode"

you can apply parallel reduction to Power ISA v3.0 Condition Register operations, "crand" or "cror" or "crnor":

   sv.crand/MR BT.v, BA.v, BC.v

you can even apply parallel reduction to single-argument instructions if you really, really want to: we're not going to stop that from happening, because somebody might find it useful, given the fact that the parallel reduction is on a fixed power-2-halving Schedule that could have practical uses, and the hardware is *required* to write out all intermediate values into a *vector* result.

you can even apply parallel-reduction Schedules to triple-argument instructions (FMA), however there it gets tricky and complicated (and i haven't thought it through, fully, what it actually means, i.e. whether it's useful). certainly if the MUL register argument is considered scalar and the others Vector, that is actually useful (it performs a repeated cumulative multiply as part of the Schedule).

does this help illustrate what i mean by "augmentation"? there is a "base" (scalar) operation, you "augment" it, and it *becomes* SIMD-like, *becomes* Vector-like, *becomes* predicated, *becomes* Swizzled, *becomes* reduced. the development of LLVM SVE would not have taken this possibility into account because, put simply, it is plain common sense in something as complex as LLVM not to waste time writing code for something that does not have a real-world use-case.

>> the issue is that this is a massive intrusive change, effectively a
>> low-level redesign of LLVM IR internals for every single back-end.
>
> Not necessarily.

this would be fantastic (and a huge relief) if it can be so arranged. one of my biggest concerns is that what i am advocating is at such a fundamental level that it could, if done incorrectly, be extremely disruptive. however, even now as i think about it on-the-fly: if the proposal is as simple as adding c++-like "optional named arguments" to (all) base scalar LLVM intrinsics, then i think that would work extremely well. it would have zero impact on other ISAs, which is a huge plus.

> For example, scalable vectors are being introduced in a way that non-scalable back-ends (mostly) won't notice.
> And it's not just adding a few intrinsics, the very concept of vectors was changed.
> There could be a (set of) construct(s) for your particular back-end that is invisible to others.

the problem is that all other Vector ISAs have constrained themselves to 32-bit instructions (or, for GPU ISAs, often 48 or 64). they *explicitly* add *explicit* O(N) opcodes. RVV adds 192 *explicit* opcodes, embedded into a MAJOR 32-bit opcode specifically dedicated for use by RVV, and that was part of its original design. ARM, likewise, will have done something similar with SVE and SVE2. the problem with that approach is that it is extremely limiting in the possible permutations / combinations of *potential* instructions that *could* exist, were it not for the constraint of trying to cram everything into a 32-bit space. (some rough bit-budget arithmetic below illustrates the squeeze.)
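back-of-the-envelope arithmetic (the field sizes here are my own illustrative guesses, not any ISA's actual encoding) showing why per-element context does not fit in a fixed 32-bit word:

    # try to encode vector "context" directly in a 32-bit instruction
    opcode_bits  = 6        # major opcode
    reg_bits     = 3 * 5    # three register fields (RT, RA, RB)
    elwidth_bits = 2 + 2    # source and dest element-width overrides
    pred_bits    = 3 + 3    # source and dest predicate selection
    mode_bits    = 4        # saturation / fail-first / map-reduce etc.

    used = opcode_bits + reg_bits + elwidth_bits + pred_bits + mode_bits
    print(used)             # 35: already over the 32-bit budget, with
                            # no room left for swizzle or REMAP. hence
                            # the prefix approach: N+M rather than NxM.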
[it does have to be said, however, that there are some serious practical benefits to limiting the possibilities of an ISA: validation and verification of silicon before spending USD 16 million on 7nm masks is a Damn Good Reason :) and it is one that we are going to have to give some serious thought to: how do you verify the hardware for an ISA with literally several MILLION instructions?]

we have left the entirety of the Scalar Power v3.0B ISA alone (which is 32-bit), called that "base", and added a full 32-bit Prefix (called SVP64) which contains the Vectorisation Context. SVP64 is - fundamentally - an O(NxM) ISA, where N ~= 250 and M ~= 1,000 to 8,000. actually, it's O(NxMxOxPxQ), where:

* N ~= 250 is the base scalar Power v3.0B ISA
* M ~= 1,000-8,000 is the Vectorisation Context
* O ~= 2^20 (guessing here) is REMAP Schedules
* P ~= 2^(3*12) is SWIZZLE Contexts for GPUs (XXYZ, WZZX)
* Q = 64 (Vector Length, VL)

thus, for example, with Twin-Predication applied to e.g. llvm.sext or llvm.cos we have implicit back-to-back VGATHER-VSCATTER behaviour *WITHOUT* needing a pair of LD/ST operations or register MV operations before and after the Vectorised operation. we anticipate some extremely powerful compact representations, and to be honest it may literally take several years for the full implications of SVP64's power and flexibility to sink in. in-register paralleliseable DCT can be done in 9 instructions, and a paralleliseable in-register-file (small) insertion sort likely in around 11 instructions, thanks to the Data-Dependent Fail-on-First-Condition Mode. we can even implement a (small) Vectorised quick-sort, in-register, fully paralleliseable, in probably about... 20 instructions. it's on my TODO list to investigate.

> Of course, the more invisible things, the harder it is to validate and change intersections of code, so the change must really be worth the extra hassle.

appreciated.

> With both Arm and RISCV implementing scalable extensions, that change was deemed worthy and work is progressing.
> So, if you could leverage the existing code to your advantage, you'd avoid having to convince a huge community to implement a large breaking change.

the possibility that occurred to me above, while writing this - adding optional arguments (containing the Vector Augmentation Context) to base scalar llvm intrinsics - would, i believe, achieve exactly that. if any other ISA vendors wanted to use it, they could, as a first pass, map e.g. llvm.load(optional_predicate=xxx) *onto* llvm.masked.load(....) and thus avoid huge disruption, carrying the work out in an incremental fashion. or not. at their choice.

there are several variants on this theme of optional arguments of some description to the base:

   llvm.add<optional_vector_context>(normal_arguments)

where optional_vector_context is an object of some type that itself contains optional "augmentation" features. i would advocate something like this:

   llvm.add(normal_arguments,
            source_override_width=<8/16/32/64>,
            dest_override_width=<8/16/32/64>,
            saturation_mode=<signed/unsigned>,
            source_pred=xxx, dest_pred=yyyy,
            fail_first_mode,
            swizzle_src=<XYZW>,
            REMAP_schedules=<RA,RB,RC,RT,RS>,
            scalar_or_vector_regs=<RA=v/s, RB=v/s, RC=v/s, RT=v/s, RS=v/s>)

can you imagine expanding all of those out into a declared (flat) list of intrinsics? what that would do to LLVM SVE if we tried? the "augmentation" list is absolutely massive, and it starts to give some idea of why LLVM SVE, as designed, simply won't cope, and why we have to think about this differently.
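just to put a number on that rhetorical question, a quick count of the flat expansion, using my own rough (illustrative) cardinalities for each option:

    src_widths  = 4       # 8/16/32/64
    dest_widths = 4
    saturation  = 3       # none / signed / unsigned
    predication = 4       # none / src-only / dest-only / twin
    fail_first  = 2       # off / on
    sv_marking  = 2**5    # each of RA,RB,RC,RT,RS scalar or vector

    per_base_op = (src_widths * dest_widths * saturation
                   * predication * fail_first * sv_marking)
    base_ops = 250        # approximate size of scalar Power v3.0B
    print(per_base_op)             # 12288 variants of EVERY base op
    print(per_base_op * base_ops)  # ~3 million, before swizzle or REMAP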
the thing is: from the study i've made of other ISAs, i can say with near-100% certainty that there *will* be a direct map to all of the existing LLVM SVE intrinsics recently added, *and to those of all SIMD ISAs as well*. what i expect to happen is that, instead of a massive list of thousands of SIMD intrinsics for e.g. x86, it will reduce down to a fraction of what is in the LLVM x86 backend right now. in fact, i expect the exact same reduction to occur for *all* Packed and Predicated SIMD ISAs supported by LLVM. that will both reduce the maintenance burden and, in theory, reduce compile times as well. in theory. in practical terms it depends what the impact of the "optional" arguments is. hmmm, that will need some thought.

even the Mill, i believe, could benefit, from being able to map much more closely to the actual underlying ISA, which *only* has "ADD" (not ADD8/16/32/64): they could potentially add an "auto" or "implicit" option to the source-width / dest-width arguments, which would be much more in line with how the actual ISA itself works (implicit tagged - polymorphic - registers). https://millcomputing.com/docs/compiler/

> And you'd also give us one more reason for the scalable extension to exist. :)

:) as i mentioned at the start, with that list of ISAs, there _do_ exist other Vector ISAs and actual hardware implementations out there. NEC's SX line has been shipping for decades now - the first machines were announced in April 1983! - and SX-Aurora is its latest incarnation: https://en.wikipedia.org/wiki/NEC_SX

here's some background: https://sx-aurora.github.io/ and yes, they do have a Vector Extension variant of llvm: https://sx-aurora.github.io/posts/llvm-ve-rv/

i do hope at some point that they come out of the woodwork and participate in LLVM SVE, and that the product continues to ship: it's pretty incredible, and i am delighted that NEC has had a strong enough customer base to keep on selling and maintaining SX-Aurora.

Mitch Alsup's MyISA 66000 will need gcc and llvm at some point, and it is another ISA with a form of Scalable Vectors - one that has been specially designed to "thunk" down to simple scalar hardware.

thank you, Renato, for responding, it's given me the opportunity to explain a bit more in-depth. feel free to cc libre-soc-dev in future, we don't mind [relevant!] cross-posts.

warmest,

l.
lkcl via llvm-dev
2021-Aug-04 22:05 UTC
[llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction
On August 3, 2021 5:32:29 PM UTC, Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
> (renato thank you for cc'ing, due to digest subscription at the moment)
>
> On Tue, Aug 3, 2021 at 3:25 PM Renato Golin <rengolin at gmail.com> wrote:
>> * For example, some reduction intrinsics were added to address bloat, but no target is forced to use them.
>
> excellent. iteration and reduction (including fixed schedule paralleliseable reduction) is one of the intrinsics being added to SVP64.

apologies to all for the follow-up: i realised i joined iteration and reduction together as if they were the same concept. they are not. an Iterative Sum, when carried out with add on a Vector containing all 1s, results in a Pascal Triangle Vector output.

an example of existing hardware that has actual Iteration instructions: Section 8.15 of the SX-Aurora ISA guide, p8-297, gives the pseudocode for Iterative Add:

   for (i = 0 to VL-1) { Vx(i) ← Vy(i) + Vx(i-1), where Vx(-1) = Sy }

where, if Vx and Vy are the same register, you get the Pascal Triangle effect. https://www.hpc.nec/documents/guide/pdfs/Aurora_ISA_guide.pdf

SVP64 does not have this *specifically* added: it is achieved incidentally by issuing an add where the src and dest registers differ by one (SVP64 sits on top of a rather large scalar regfile: 128 64-bit entries):

   sv.add r1, r1, r0

we did however need to add a "reverse gear" (for (i = VL-1 downto 0)), which was needed for ffmpeg's MP3 CODEC, ironically to *avoid* the Pascal Triangle effect (and not need to copy a large batch of registers instead).

can anyone say if LLVM SVE happened to add Iteration?

l.
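p.s. a tiny python sketch of the iterative add above, purely illustrative, showing the Pascal Triangle effect on an all-1s vector:

    # iterative (sequential, carried) add: Vx(i) = Vy(i) + Vx(i-1)
    def iterative_add(Vy, Sy=0):
        Vx = [0] * len(Vy)
        acc = Sy                  # Vx(-1) = Sy
        for i in range(len(Vy)):
            Vx[i] = Vy[i] + acc
            acc = Vx[i]
        return Vx

    v = [1, 1, 1, 1, 1]
    for _ in range(3):            # repeated application walks down
        v = iterative_add(v)      # the diagonals of Pascal's Triangle
        print(v)
    # [1, 2, 3, 4, 5]
    # [1, 3, 6, 10, 15]
    # [1, 4, 10, 20, 35]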