Luke Kenneth Casson Leighton via llvm-dev
2021-Jul-30 23:27 UTC
[llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction
(please cc me i am subscribed digest)

i have an idea which i have been meaning to float for some time. as context: i am the lead author of the Draft SVP64 Cray-like Vector Extensions for the Power ISA, which is being designed for Hybrid CPU, VPU and 3D GPU workloads.

SVP64 is similar to Broadcom VideoCore IV's "repeat" feature and to x86 "REP", but with Vectorisation Context. unlike x86 REP, which simply repeats the following instruction, SVP64 *augments* the following instruction to:

* change any one of the src and dest registers to scalar or vector
* add both src *and dest* predication in some cases
* override the src element width and additionally override the dest register width (8/16/32/64-bit or FP16/BF16/FP32/FP64)
* add several modes, including saturation, fail-first, iteration and reduction, and other modes never seen in any commercial ISA

there are also two modes of operation:

* Vertical First, which requires explicit incrementing of the Vector Element offset (effectively turning the register file into an indexable SRAM)
* Horizontal First, which is equivalent to the original Cray Vectors and to RVV

Vertical-First may be permitted to execute an arbitrary number of elements in parallel "batches": interestingly, when those batches are chosen at runtime to be equal to the Maximum Vector Length, that effectively executes *all* element operations Horizontally, and is incidentally directly equivalent to Cray-style Vector execution.

here's the problem: where the Scalar Power ISA for the SFFS compliancy subset is 214 instructions, SVP64 Context is 24 bits, and consequently multiplies those 214 instructions out to well north of a QUARTER OF A MILLION ISA intrinsics. adding in GPU-style Swizzle context and the Draft REMAP looping for Matrix Multiply, FFT, DCT, iterative reduction and other modes, and it could well be several MILLION intrinsics. the standard approach of autogenerating intrinsics with scripts and making them all available as a flat header file or c++ template, which works extremely well for all other ISAs, is therefore absolutely out of the question.

if however instead of an NxM problem this was turned into N+M, separating out "scalar base" from "augmentation" throughout the IR, the problem disappears entirely. the nice thing about that approach is that it also tidies up other ISAs as well, including SIMD ones: very few ISAs have intrinsics which are only inherently meaningful in a Vector context (a cross product instruction would be a perfect illustrative exception to that rule). even permute / shuffle Vector/SIMD operations are separable into "base" and "abstract Vector Concept": the "base" operation in that case being "MV.X" (scalar register copy, indexable - reg[RT] = reg[reg[RA]], with an immediate variant reg[RT] = reg[RA+imm]).

the issue is that this is a massive intrusive change, effectively a low-level redesign of LLVM IR internals for every single back-end. on the other hand, as we make progress over the next few years with SVP64, if there were resistance to this concept, trying to shoe-horn SVP64 into an NxM intrinsics concept is guaranteed to limit SVP64's full capabilities or, worse, run people's machines out of resources during compilation, and ultimately cause complaints about LLVM's performance.
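to make the N+M separation concrete, below is a rough compilable c++ sketch of the *semantics* (every name in it is invented purely for illustration: it is not a real or proposed llvm or svp64 api). one scalar base operation (the "N" side) plus one orthogonal context record (the "M" side) covers what would otherwise be a whole family of fused intrinsics:

    // illustrative sketch only: not a real llvm or svp64 api
    #include <cstdint>
    #include <cstdio>
    #include <functional>

    // the "M" side: a stand-in for the 24-bit svp64 prefix
    struct SVContext {
        int      vl;            // vector length
        bool     src1_vector;   // true: step element-wise; false: scalar
        bool     src2_vector;
        uint32_t pred;          // dest predicate, one bit per element
    };

    // horizontal-first: the context walks all elements of ONE base op.
    // (vertical-first would instead expose the element index to the
    // program and require it to be stepped explicitly.)
    void sv_exec(const SVContext &c,
                 const std::function<int64_t(int64_t, int64_t)> &base,
                 int64_t *rd, const int64_t *ra, const int64_t *rb)
    {
        for (int i = 0; i < c.vl; i++)
            if (c.pred & (1u << i))
                rd[i] = base(ra[c.src1_vector ? i : 0],
                             rb[c.src2_vector ? i : 0]);
    }

    int main()
    {
        int64_t a[4] = {1, 2, 3, 4}, b[4] = {10}, d[4] = {};
        // vector + scalar with elements 0,1,3 enabled: one scalar add
        // plus one context, instead of an add_vs_p_w64-style intrinsic
        SVContext c{4, true, false, 0xB};
        sv_exec(c, [](int64_t x, int64_t y) { return x + y; }, d, a, b);
        for (int64_t v : d)
            std::printf("%lld ", (long long)v);   // prints: 11 12 0 14
        std::printf("\n");
        return 0;
    }

note that the base lambda knows nothing about vectors: all the vector behaviour lives in the context. that orthogonality is exactly what would need to be preserved in the IR, rather than being multiplied out into distinct intrinsics.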
i have no idea where to go with this, and wanted to open up the floor to alternatives, as well as present an opportunity for discussion of the ramifications, advantages and disadvantages of separating out parallelism / vectorisation as an abstract concept from scalar "base" intrinsics, and what that would look like in practice. also, i have to ask: has anything like this ever been considered before?

l.
Renato Golin via llvm-dev
2021-Aug-03 14:19 UTC
[llvm-dev] [RFC] Vector/SIMD ISA Context Abstraction
On Sat, 31 Jul 2021 at 00:33, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> if however instead of an NxM problem this was turned into N+M,
> separating out "scalar base" from "augmentation" throughout the IR,
> the problem disappears entirely.

Hi Luke,

It's not entirely clear to me what you are suggesting here. For context:

* Historically, we have tried to keep as many instructions as native IR as possible, to avoid the explosion of intrinsics, as you describe.
* However, traditionally, intrinsics reduce the number of instructions in a basic block instead of increasing them, so there's always the balance.
* For example, some reduction intrinsics were added to address bloat, but no target is forced to use them.
* If you can represent the operation as a series of native IR instructions, by all means, you should do so.

I get that a lot of intrinsics are repeated patterns over all variations, and that most targets don't have that many, so it's "ok". I also get that most SIMD vector operations aren't intrinsically vector, but expansions of scalar operations for the benefit of vectorisation (plus predication, to avoid undefined behaviour and to allow "funny" patterns, etc). But it's not clear to me what the "augmentation" part would be in other targets.

> even permute / shuffle Vector/SIMD operations are separable into
> "base" and "abstract Vector Concept": the "base" operation in that
> case being "MV.X" (scalar register copy, indexable - reg[RT] =
> reg[reg[RA]], with an immediate variant reg[RT] = reg[RA+imm])

Shuffles are already represented as IR instructions (insert/extract vector), so I'm not sure this clarifies much.

Have you looked at the current scalable vector implementation? It allows a set of operations on open-ended vectors that are controlled by a predicate, which is possibly the "augmentation" that you're looking for?

> the issue is that this is a massive intrusive change, effectively a
> low-level redesign of LLVM IR internals for every single back-end.

Not necessarily. For example, scalable vectors are being introduced in a way that non-scalable back-ends (mostly) won't notice. And it's not just adding a few intrinsics: the very concept of vectors was changed. There could be a (set of) construct(s) for your particular back-end that is invisible to others.

Of course, the more invisible things there are, the harder it is to validate and change intersections of code, so the change must really be worth the extra hassle. With both Arm and RISC-V implementing scalable extensions, that change was deemed worthy, and work is progressing.

So, if you could leverage the existing code to your advantage, you'd avoid having to convince a huge community to implement a large breaking change. And you'd also give us one more reason for the scalable extension to exist. :)
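For concreteness, here is roughly what the existing support already lets you express through the C++ API. This is a minimal sketch (assuming LLVM 12-era headers, not code from any in-tree test): an ordinary 'add' over a scalable vector, with the predication done by a plain 'select', and no per-operation fused intrinsic anywhere:

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Verifier.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    int main() {
      LLVMContext Ctx;
      Module M("scalable_demo", Ctx);
      IRBuilder<> B(Ctx);

      // <vscale x 4 x i32>: element count known only at run time,
      // comparable to a vector-length register in SV or RVV
      auto *VecTy  = ScalableVectorType::get(B.getInt32Ty(), 4);
      auto *PredTy = ScalableVectorType::get(B.getInt1Ty(), 4);

      auto *FTy = FunctionType::get(VecTy, {VecTy, VecTy, PredTy}, false);
      auto *F = Function::Create(FTy, Function::ExternalLinkage,
                                 "masked_add", &M);
      B.SetInsertPoint(BasicBlock::Create(Ctx, "entry", F));

      Value *A = F->getArg(0), *Bv = F->getArg(1), *P = F->getArg(2);
      Value *Sum = B.CreateAdd(A, Bv, "sum");
      // disabled lanes pass the first operand through unchanged
      B.CreateRet(B.CreateSelect(P, Sum, A, "masked"));

      verifyFunction(*F, &errs());
      M.print(outs(), nullptr);
      return 0;
    }

Hope this helps.

cheers,
--renato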