Luke Kenneth Casson Leighton via llvm-dev
2021-Jul-30 23:27 UTC
[llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction
(please cc me i am subscribed digest)

i have an idea which i have been meaning to float for some time. as context: i am the lead author of the Draft SVP64 Cray-like Vector Extensions for the Power ISA, which is being designed for Hybrid CPU, VPU and 3D GPU workloads.

SVP64 is similar to Broadcom VideoCore IV's "repeat" feature and to x86 "REP", but with Vectorisation Context. unlike x86 REP, which simply repeats the following instruction, SVP64 *augments* the following instruction to:

* change any one of the src and dest registers to scalar or vector
* add both src *and dest* predication in some cases
* override the src element width and additionally override the dest register width (8/16/32/64-bit or FP16/BF16/FP32/FP64)
* add several modes, including saturation, fail-first, iteration and reduction, and other modes never seen in any commercial ISA

there are also two modes of operation:

* Vertical First, which requires explicit incrementing of the Vector Element offset (effectively turning the register file into an indexable SRAM)
* Horizontal First, which is equivalent to the original Cray Vectors and to RVV

Vertical-First may be permitted to execute an arbitrary number of elements in parallel "batches": interestingly, when those batches are chosen at runtime to be equal to the Maximum Vector Length, that effectively executes *all* element operations Horizontally, and is incidentally directly equivalent to Cray-style Vector execution.

here's the problem: where the Scalar Power ISA for the SFFS compliancy subset is 214 instructions, SVP64 Context is 24 bits, and consequently multiplies those 214 instructions out to well north of a QUARTER OF A MILLION ISA intrinsics. adding in GPU-style Swizzle context and the Draft REMAP looping for Matrix Multiply, FFT, DCT, iterative reduction and other modes, and it could well be several MILLION intrinsics. the standard approach of autogenerating intrinsics with scripts and making them all available as a flat header file or c++ template, which works extremely well for all other ISAs, is therefore absolutely out of the question.

if however instead of an NxM problem this was turned into N+M, separating out "scalar base" from "augmentation" throughout the IR, the problem disappears entirely. the nice thing about that approach is that it also tidies up other ISAs as well, including SIMD ones: very few ISAs have intrinsics which are only inherently meaningful in a Vector context (a cross product instruction would be a perfect illustrative exception to that rule). even permute / shuffle Vector/SIMD operations are separable into "base" and "abstract Vector Concept": the "base" operation in that case being "MV.X" (scalar register copy, indexable - reg[RT] = reg[reg[RA]], with an immediate variant reg[RT] = reg[RA+imm]).

the issue is that this is a massive intrusive change, effectively a low-level redesign of LLVM IR internals for every single back-end. on the other hand, as we make progress over the next few years with SVP64, if there were resistance to this concept, trying to shoe-horn SVP64 into an NxM intrinsics concept is guaranteed to limit SVP64's full capabilities or, worse, run people's machines out of resources during compilation, and ultimately cause complaints about LLVM's performance.
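to make the N+M separation concrete, below is a rough compilable c++ sketch of the *semantics* (every name in it is invented purely for illustration: it is not a real or proposed llvm or svp64 api). one scalar base operation (the "N" side) plus one orthogonal context record (the "M" side) covers what would otherwise be a whole family of fused intrinsics:

    // illustrative sketch only: not a real llvm or svp64 api
    #include <cstdint>
    #include <cstdio>
    #include <functional>

    // the "M" side: a stand-in for the 24-bit svp64 prefix
    struct SVContext {
        int      vl;            // vector length
        bool     src1_vector;   // true: step element-wise; false: scalar
        bool     src2_vector;
        uint32_t pred;          // dest predicate, one bit per element
    };

    // horizontal-first: the context walks all elements of ONE base op.
    // (vertical-first would instead expose the element index to the
    // program and require it to be stepped explicitly.)
    void sv_exec(const SVContext &c,
                 const std::function<int64_t(int64_t, int64_t)> &base,
                 int64_t *rd, const int64_t *ra, const int64_t *rb)
    {
        for (int i = 0; i < c.vl; i++)
            if (c.pred & (1u << i))
                rd[i] = base(ra[c.src1_vector ? i : 0],
                             rb[c.src2_vector ? i : 0]);
    }

    int main()
    {
        int64_t a[4] = {1, 2, 3, 4}, b[4] = {10}, d[4] = {};
        // vector + scalar with elements 0,1,3 enabled: one scalar add
        // plus one context, instead of an add_vs_p_w64-style intrinsic
        SVContext c{4, true, false, 0xB};
        sv_exec(c, [](int64_t x, int64_t y) { return x + y; }, d, a, b);
        for (int64_t v : d)
            std::printf("%lld ", (long long)v);   // prints: 11 12 0 14
        std::printf("\n");
        return 0;
    }

note that the base lambda knows nothing about vectors: all the vector behaviour lives in the context. that orthogonality is exactly what would need to be preserved in the IR, rather than being multiplied out into distinct intrinsics.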
i have no idea where to go with this, and wanted to open up the floor to alternatives, as well as present an opportunity for discussion of the ramifications, advantages and disadvantages of separating out parallelism / vectorisation as an abstract concept from scalar "base" intrinsics, and what that would look like in practice. also, i have to ask: has anything like this ever been considered before?

l.
Renato Golin via llvm-dev
2021-Aug-03 14:19 UTC
[llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction
On Sat, 31 Jul 2021 at 00:33, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> if however instead of an NxM problem this was turned into N+M,
> separating out "scalar base" from "augmentation" throughout the IR,
> the problem disappears entirely.

Hi Luke,

It's not entirely clear to me what you are suggesting here. For context:

* Historically, we have tried to keep as many instructions as native IR as possible, to avoid the explosion of intrinsics, as you describe.
* However, traditionally, intrinsics reduce the number of instructions in a basic block instead of increasing them, so there's always the balance.
* For example, some reduction intrinsics were added to address bloat, but no target is forced to use them.
* If you can represent the operation as a series of native IR instructions, by all means, you should do so.

I get that a lot of intrinsics are repeated patterns over all variations, and that most targets don't have that many, so it's "ok". I also get that most SIMD vector operations aren't intrinsically vector, but expansions of scalar operations for the benefit of vectorisation (plus predication, to avoid undefined behaviour and to allow "funny" patterns, etc). But it's not clear to me what the "augmentation" part would be in other targets.

> even permute / shuffle Vector/SIMD operations are separable into
> "base" and "abstract Vector Concept": the "base" operation in that
> case being "MV.X" (scalar register copy, indexable - reg[RT] =
> reg[reg[RA]], with an immediate variant reg[RT] = reg[RA+imm])

Shuffles are already represented as IR instructions (insert/extract vector), so I'm not sure this clarifies much.

Have you looked at the current scalable vector implementation? It allows a set of operations on open-ended vectors that are controlled by a predicate, which is possibly the "augmentation" that you're looking for?

> the issue is that this is a massive intrusive change, effectively a
> low-level redesign of LLVM IR internals for every single back-end.

Not necessarily. For example, scalable vectors are being introduced in a way that non-scalable back-ends (mostly) won't notice. And it's not just adding a few intrinsics: the very concept of vectors was changed. There could be a (set of) construct(s) for your particular back-end that is invisible to others.

Of course, the more invisible things there are, the harder it is to validate and change intersections of code, so the change must really be worth the extra hassle. With both Arm and RISC-V implementing scalable extensions, that change was deemed worthy, and work is progressing.

So, if you could leverage the existing code to your advantage, you'd avoid having to convince a huge community to implement a large breaking change. And you'd also give us one more reason for the scalable extension to exist. :)
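For concreteness, here is roughly what the existing support already lets you express through the C++ API. This is a minimal sketch (assuming LLVM 12-era headers, not code from any in-tree test): an ordinary 'add' over a scalable vector, with the predication done by a plain 'select', and no per-operation fused intrinsic anywhere:

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Verifier.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    int main() {
      LLVMContext Ctx;
      Module M("scalable_demo", Ctx);
      IRBuilder<> B(Ctx);

      // <vscale x 4 x i32>: element count known only at run time,
      // comparable to a vector-length register in SV or RVV
      auto *VecTy  = ScalableVectorType::get(B.getInt32Ty(), 4);
      auto *PredTy = ScalableVectorType::get(B.getInt1Ty(), 4);

      auto *FTy = FunctionType::get(VecTy, {VecTy, VecTy, PredTy}, false);
      auto *F = Function::Create(FTy, Function::ExternalLinkage,
                                 "masked_add", &M);
      B.SetInsertPoint(BasicBlock::Create(Ctx, "entry", F));

      Value *A = F->getArg(0), *Bv = F->getArg(1), *P = F->getArg(2);
      Value *Sum = B.CreateAdd(A, Bv, "sum");
      // disabled lanes pass the first operand through unchanged
      B.CreateRet(B.CreateSelect(P, Sum, A, "masked"));

      verifyFunction(*F, &errs());
      M.print(outs(), nullptr);
      return 0;
    }

Hope this helps.

cheers,
--renato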