thr3ads.net - llvm dev - [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths [Jun 2018]

If this information is useful, please help other people find it:
Share via:

Graham Hunter via llvm-dev

2018-Jun-15 16:19 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi David,

Responses below.

-Graham

On 11 Jun 2018, at 22:19, David A. Greene <dag at cray.com<mailto:dag at
cray.com>> wrote:

Graham Hunter <Graham.Hunter at arm.com<mailto:Graham.Hunter at
arm.com>> writes:

=======1. Types
=======
To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

How does this work in practice? Let's say I populate a vector with a
splat. Presumably, that gives me a "full length" vector. Let's
say the
type is <scalable 2 x double>. How do I split the vector and get
something half the width? What is its type? How do I split it again
and get something a quarter of the width? What is its type? How do I
use half- and quarter-width vectors? Must I resort to predication?

To split a <scalable 2 x double> in half, you'd use a shufflevector in
much the
same way you would for fixed-length vector types.

e.g.
``
%sv = call <scalable 1 x i32>
@llvm.experimental.vector.stepvector.nxv1i32()
%halfvec = shufflevector <scalable 2 x double> %fullvec, <scalable 2
x double> undef, <scalable 1 x i32> %sv
``

You can't split it any further than a <scalable 1 x <ty>>, since
there may only be
one element in the actual hardware vector at runtime. The same restriction
applies to
a <1 x <ty>>. This is why we have a minimum number of lanes in
addition to the
scalable flag so that we can concatenate and split vectors, since SVE registers
have
the same number of bytes and will therefore decrease the number of elements per
register as the element type increases in size.

If you want to extract something other than the first part of a vector, you need
to add
offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2)
allows you
to reach the high half of a larger register).

If you check the patch which introduces splatvector
(https://reviews.llvm.org/D47775),
you can see a line which currently produces an error if changing the size of a
vector
is required, and notes that VECTOR_SHUFFLE_VAR hasn't been implemented yet.

In our downstream compiler, this is an ISD alongside VECTOR_SHUFFLE which
allows a shuffle with a variable mask instead of a constant.

If people feel it would be useful, I can prepare another patch which implements
these
shuffles (as an intrinsic rather than a common ISD) for review now instead of
later;
I tried to keep the initial patch set small so didn't cover all cases.

In terms of using predication, that's generally not required for the legal
integer types;
normal promotion via sign or zero extension work. We've tried to reuse
existing
mechanisms wherever possible.

For floating point types, we do use predication to allow the use of otherwise
illegal
types like <scalable 1 x double>, but that's limited to the AArch64
backend and does
not need to be represented in IR.

Ths split question comes into play for backward compatibility. How
would one take a scalable vector and pass it into a NEON library? It is
likely that some math functions, for example, will not have SVE versions
available.

I don't believe we intend to support this, but instead provide libraries
with
SVE versions of functions instead. The problem is that you don't know how
many NEON-size subvectors exist within an SVE vector at compile time.
While you could create a loop with 'vscale' number of iterations and try
to
extract those subvectors, I suspect the IR would end up being quite messy
and potentially hard to recognize and optimize.

The other problem with calling non-SVE functions is that any live SVE
registers must be spilled to the stack and filled after the call, which is
likely to be quite expensive.

Is there a way to represent "double width" vectors? In
mixed-data-size
loops it is sometimes convenient to reason about double-width vectors
rather than having to split them (to legalize for the target
architecture) and keep track of their parts early on. I guess the more
fundamental question is about how such loops should be handled.

For SVE, it's fine to generate IR with types that are 'double size'
or larger,
and just leave it to legalization at SelectionDAG level to split into multiple
legal size registers.

Again, if people would like me to create a patch to illustrate legalization
sooner
rather than later to help understand what's needed to support these types,
let
me know.

What do insertelement and extractelement mean for scalable vectors?
Your examples showed insertelement at index zero. How would I, say,
insertelement into the upper half of the vector? Or any arbitrary
place? Does insertelement at index 10 of a <scalable 2 x double> work,
assuming vscale is large enough? It is sometimes useful to constitute a
vector out of various scalar pieces and insertelement is a convenient
way to do it.

So you can insert or extract any element known to exist (in other words,
it's
within the minimum number of elements). Using a constant index outside
that range will fail, as we won't know whether the element actually exists
until we're running on a cpu.

Our downstream compiler supports inserting and extracting arbitrary elements
from calculated offsets as part of our experiment on search loop vectorization,
but that generates the offsets based on a count of true bits within partitioned
predicates. I was planning on proposing new intrinsics to improve predicate use
within llvm at a later date.

We have been able to implement various types of known shuffles (like the
high/low
half extract, zip, concatention, etc) with vscale, stepvector, and the existing
IR
instructions.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180615/08e788d9/attachment-0001.html>

David Greene via llvm-dev

2018-Jun-15 20:51 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org> writes:
> To split a <scalable 2 x double> in half, you'd use a
shufflevector in much the
> same way you would for fixed-length vector types.
>
> e.g.
> ``
> %sv = call <scalable 1 x i32>
@llvm.experimental.vector.stepvector.nxv1i32()
> %halfvec = shufflevector <scalable 2 x double> %fullvec, <scalable
2 x double> undef, <scalable 1 x i32> %sv
> ``
>
> You can't split it any further than a <scalable 1 x <ty>>,
since there may only be
> one element in the actual hardware vector at runtime. The same restriction
applies to
> a <1 x <ty>>. This is why we have a minimum number of lanes in
addition to the
> scalable flag so that we can concatenate and split vectors, since SVE
registers have
> the same number of bytes and will therefore decrease the number of elements
per
> register as the element type increases in size.
Right.  So let's say the hardware width is 1024.  If I have a
<scalable 2 x double> it is 1024 bits.  If I split it, it's a
<scalable 1 x double> (right?) with 512 bits.  There is no
way to create a 256-bit vector.

It's probably the case that for pure VL-agnostic code, this is ok.  Our
experience with the X1/X2, which also supported VL-agnostic code, was
that at times compiler awareness of the hardware MAXVL allowed us to
generate better code, better enough that we "cheated" regularly.  The
hardware guys loved us.  :)

I'm not at all saying that's a good idea for SVE, just recounting
experience and wondering what the implications would be for SVE and more
generally, LLVM IR.  Would the MIPS V people be interested in a
non-VL-agnostic compilation mode?
> If you want to extract something other than the first part of a vector, you
need to add
> offsets based on a calculation from vscale (e.g. adding vscale *
(min_elts/2) allows you
> to reach the high half of a larger register).
Sure, that makes semse.
> For floating point types, we do use predication to allow the use of
otherwise illegal
> types like <scalable 1 x double>, but that's limited to the
AArch64 backend and does
> not need to be represented in IR.
This is something done during or after isel?
>  Ths split question comes into play for backward compatibility. How
>  would one take a scalable vector and pass it into a NEON library? It is
>  likely that some math functions, for example, will not have SVE versions
>  available.
>
> I don't believe we intend to support this, but instead provide
libraries with
> SVE versions of functions instead. The problem is that you don't know
how
> many NEON-size subvectors exist within an SVE vector at compile time.
> While you could create a loop with 'vscale' number of iterations
and try to
> extract those subvectors, I suspect the IR would end up being quite messy
> and potentially hard to recognize and optimize.
Yes, that was my concern.  The vscale loop is what I came up with as
well.  It is technically feasible, but ugly.  I'm a little concerned
about what vendors will do with this.  Not everyone is going to have the
resources to convert all of their NEON libraries, certainly not all at
once.

Just something to think about.
> The other problem with calling non-SVE functions is that any live SVE
> registers must be spilled to the stack and filled after the call, which is
> likely to be quite expensive.
Understood.
>  Is there a way to represent "double width" vectors? In
mixed-data-size
>  loops it is sometimes convenient to reason about double-width vectors
>  rather than having to split them (to legalize for the target
>  architecture) and keep track of their parts early on. I guess the more
>  fundamental question is about how such loops should be handled.
>
> For SVE, it's fine to generate IR with types that are 'double
size' or larger,
> and just leave it to legalization at SelectionDAG level to split into
multiple
> legal size registers.
Ok, great.  If something is larger than "double size," how can it be
split, given the "split once" restriction above?
>  What do insertelement and extractelement mean for scalable vectors?
>  Your examples showed insertelement at index zero. How would I, say,
>  insertelement into the upper half of the vector? Or any arbitrary
>  place? Does insertelement at index 10 of a <scalable 2 x double>
work,
>  assuming vscale is large enough? It is sometimes useful to constitute a
>  vector out of various scalar pieces and insertelement is a convenient
>  way to do it.
>
> So you can insert or extract any element known to exist (in other words,
it's
> within the minimum number of elements). Using a constant index outside
> that range will fail, as we won't know whether the element actually
exists
> until we're running on a cpu.
In that case to "insert" into the higher elements one would insert
into
the known range and then shufflevector, I suppose.  Ok.
> Our downstream compiler supports inserting and extracting arbitrary
elements
> from calculated offsets as part of our experiment on search loop
vectorization,
> but that generates the offsets based on a count of true bits within
partitioned
> predicates. I was planning on proposing new intrinsics to improve predicate
use
> within llvm at a later date.
Ok, I look forward to seeing them!
> We have been able to implement various types of known shuffles (like the
high/low
> half extract, zip, concatention, etc) with vscale, stepvector, and the
existing IR
> instructions.
Yes, I can certainly see how all of those would be implemented.  The
main case I'm thinking about is something that is "scalarized"
within a
vector loop context.  I'm wondering about the best way to reconstitute a
vector from scalar pieces (or vice-versa).

Thanks for the explanations!

                          -David

Graham Hunter via llvm-dev

2018-Jun-26 09:44 UTC

head link

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi David,

Responses inline.

-Graham
>> You can't split it any further than a <scalable 1 x
<ty>>, since there may only be
>> one element in the actual hardware vector at runtime. The same
restriction applies to
>> a <1 x <ty>>. This is why we have a minimum number of lanes
in addition to the
>> scalable flag so that we can concatenate and split vectors, since SVE
registers have
>> the same number of bytes and will therefore decrease the number of
elements per
>> register as the element type increases in size.
> 
> Right.  So let's say the hardware width is 1024.  If I have a
> <scalable 2 x double> it is 1024 bits.  If I split it, it's a
> <scalable 1 x double> (right?) with 512 bits.  There is no
> way to create a 256-bit vector.
Correct; you'd need to use predication to force smaller vectors.
> It's probably the case that for pure VL-agnostic code, this is ok.  Our
> experience with the X1/X2, which also supported VL-agnostic code, was
> that at times compiler awareness of the hardware MAXVL allowed us to
> generate better code, better enough that we "cheated" regularly. 
The
> hardware guys loved us.  :)
I think if people want to take advantage of knowing the hardware vector
length for their particular machine, the existing fixed-length types will
suffice. We haven't implemented that for SVE at present, though -- a lot
of work would be required.
> I'm not at all saying that's a good idea for SVE, just recounting
> experience and wondering what the implications would be for SVE and more
> generally, LLVM IR.  Would the MIPS V people be interested in a
> non-VL-agnostic compilation mode?
> 
>> If you want to extract something other than the first part of a vector,
you need to add
>> offsets based on a calculation from vscale (e.g. adding vscale *
(min_elts/2) allows you
>> to reach the high half of a larger register).
> 
> Sure, that makes semse.
> 
>> For floating point types, we do use predication to allow the use of
otherwise illegal
>> types like <scalable 1 x double>, but that's limited to the
AArch64 backend and does
>> not need to be represented in IR.
> 
> This is something done during or after isel?
A combination of custom lowering and isel.
>> Ths split question comes into play for backward compatibility. How
>> would one take a scalable vector and pass it into a NEON library? It is
>> likely that some math functions, for example, will not have SVE
versions
>> available.
>> 
>> I don't believe we intend to support this, but instead provide
libraries with
>> SVE versions of functions instead. The problem is that you don't
know how
>> many NEON-size subvectors exist within an SVE vector at compile time.
>> While you could create a loop with 'vscale' number of
iterations and try to
>> extract those subvectors, I suspect the IR would end up being quite
messy
>> and potentially hard to recognize and optimize.
> 
> Yes, that was my concern.  The vscale loop is what I came up with as
> well.  It is technically feasible, but ugly.  I'm a little concerned
> about what vendors will do with this.  Not everyone is going to have the
> resources to convert all of their NEON libraries, certainly not all at
> once.
> 
> Just something to think about.
Understood; it may be that using predication (or a kernel call to limit
the maximum length to 128b) would suffice in those cases -- you can't take
advantage of the extra vector length, but since the bottom 128b of each
vector register alias with the NEON registers you could just call existing
NEON code without needing to do anything special that way. I guess it's a
trade-off as to whether you want to reuse existing tuned code or take
advantage of more powerful hardware. We haven't implemented anything like
this though.
>> The other problem with calling non-SVE functions is that any live SVE
>> registers must be spilled to the stack and filled after the call, which
is
>> likely to be quite expensive.
> 
> Understood.
> 
>> Is there a way to represent "double width" vectors? In
mixed-data-size
>> loops it is sometimes convenient to reason about double-width vectors
>> rather than having to split them (to legalize for the target
>> architecture) and keep track of their parts early on. I guess the more
>> fundamental question is about how such loops should be handled.
>> 
>> For SVE, it's fine to generate IR with types that are 'double
size' or larger,
>> and just leave it to legalization at SelectionDAG level to split into
multiple
>> legal size registers.
> 
> Ok, great.  If something is larger than "double size," how can it
be
> split, given the "split once" restriction above?
There's not a 'split once' restriction, but 'n' in
<scalable n x <ty>> has to
be an integer. A "double width" vector would just double 'n',
so
<scalable 4 x double> would be such a type. That's not a legal type
for SVE,
so it would be split into 2 <scalable 2 x double> values during
legalization
(or nxv2f64, since it's at SDAG at that point).
> 
>> What do insertelement and extractelement mean for scalable vectors?
>> Your examples showed insertelement at index zero. How would I, say,
>> insertelement into the upper half of the vector? Or any arbitrary
>> place? Does insertelement at index 10 of a <scalable 2 x double>
work,
>> assuming vscale is large enough? It is sometimes useful to constitute a
>> vector out of various scalar pieces and insertelement is a convenient
>> way to do it.
>> 
>> So you can insert or extract any element known to exist (in other
words, it's
>> within the minimum number of elements). Using a constant index outside
>> that range will fail, as we won't know whether the element actually
exists
>> until we're running on a cpu.
> 
> In that case to "insert" into the higher elements one would
insert into
> the known range and then shufflevector, I suppose.  Ok.
> 
>> Our downstream compiler supports inserting and extracting arbitrary
elements
>> from calculated offsets as part of our experiment on search loop
vectorization,
>> but that generates the offsets based on a count of true bits within
partitioned
>> predicates. I was planning on proposing new intrinsics to improve
predicate use
>> within llvm at a later date.
> 
> Ok, I look forward to seeing them!
> 
>> We have been able to implement various types of known shuffles (like
the high/low
>> half extract, zip, concatention, etc) with vscale, stepvector, and the
existing IR
>> instructions.
> 
> Yes, I can certainly see how all of those would be implemented.  The
> main case I'm thinking about is something that is
"scalarized" within a
> vector loop context.  I'm wondering about the best way to reconstitute
a
> vector from scalar pieces (or vice-versa).
If actual scalarization is required, SVE needs to use predicates; there are
instructions that allow moving a single predicate bit forward one element at
a time, and you can then extract the element, perform whatever operation is
needed, and insert the result back into a register at the same location.

There's obviously more overhead than just moving directly from/to a fixed
number of lanes, but it works. Hopefully we won't need to use scalarization
often.

llvm dev - Jun 2018 - [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths