thr3ads.net - llvm dev - [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths [Jun 2018]


Graham Hunter via llvm-dev

2018-Jun-05 18:25 UTC

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi David,

Thanks for taking a look.
> On 5 Jun 2018, at 16:23, dag at cray.com wrote:
> 
> Hi Graham,
> 
> Just a few initial comments.
> 
> Graham Hunter <Graham.Hunter at arm.com> writes:
> 
>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>> bytes.
> 
> "scalable" instead of "scalable x."
Yep, missed that in the conversion from the old <n x m x ty> format.
> 
>> For derived types, a function (getSizeExpressionInBits) to return a pair of
>> integers (one to indicate unscaled bits, the other for bits that need to be
>> scaled by the runtime multiple) will be added. For backends that do not need to
>> deal with scalable types, another function (getFixedSizeExpressionInBits) that
>> only returns unscaled bits will be provided, with a debug assert that the type
>> isn't scalable.
> 
> Can you explain a bit about what the two integers represent?  What's the
> "unscaled" part for?
'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
multiplied by vscale'.
> 
> The name "getSizeExpressionInBits" makes me think that a Value
> expression will be returned (something like a ConstantExpr that uses
> vscale).  I would be surprised to get a pair of integers back.  Do
> clients actually need constant integer values or would a ConstantExpr
> suffice?  We could add a ConstantVScale or something to make it work.
I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).

When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled as is.
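
As a rough illustration of the (unscaled, scaled) pair described above, here is a minimal Python sketch. The `SizeInBits` class and the two helper functions are hypothetical names invented for this example and are not LLVM APIs; struct padding and alignment are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeInBits:
    """Hypothetical model of the (unscaled, scaled) pair; not an LLVM class."""
    unscaled: int = 0  # exactly this many bits
    scaled: int = 0    # this many bits multiplied by vscale at run time

    def __add__(self, other):
        # Sizes of aggregate members add term by term.
        return SizeInBits(self.unscaled + other.unscaled,
                          self.scaled + other.scaled)

    def at(self, vscale):
        """Concrete size: (scaled_bits * vscale) + unscaled_bits."""
        return self.scaled * vscale + self.unscaled


def fixed_vector(num_elts, elt_bits):
    return SizeInBits(unscaled=num_elts * elt_bits)


def scalable_vector(min_elts, elt_bits):
    return SizeInBits(scaled=min_elts * elt_bits)


v4i32 = fixed_vector(4, 32)        # <4 x i32>
nxv4i32 = scalable_vector(4, 32)   # <scalable 4 x i32>
# { <scalable 32 x i8>, i64 }, assuming no padding between members
struct = scalable_vector(32, 8) + SizeInBits(unscaled=64)
```

With these definitions, `struct.at(vscale)` gives the value that an address calculation would use once the runtime multiple is known.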
> 
>> Comparing two of these sizes together is straightforward if only unscaled sizes
>> are used. Comparisons between scaled sizes are also simple when comparing sizes
>> within a function (or across functions with the inherit flag mentioned in the
>> changes to the type), but such sizes cannot be compared otherwise. If a mix is present,
>> then any number of unscaled bits will not be considered to have a greater size
>> than a smaller number of scaled bits, but a smaller number of unscaled bits
>> will be considered to have a smaller size than a greater number of scaled bits
>> (since the runtime multiple is at least one).
> 
> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison.  We have code here that adds ConstantExpr support to
> ScalarEvolution.  We just didn't know if anyone else would be interested
> in it since we added it solely for our Fortran frontend.
We added a dedicated SCEV expression class for vscale instead; I suspect it works
either way.
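
The mixed-size comparison rule quoted above can be sketched as a "known for every vscale" predicate. This is only an illustration: the function name is hypothetical, both sizes are assumed to share the same vscale (the within-a-function case), and vscale is assumed to be any integer of at least one with no upper bound.

```python
def known_less_than(a, b):
    """Return True only if size a < size b for every vscale >= 1.

    a and b are (unscaled, scaled) tuples of bit counts, so the concrete
    size is unscaled + scaled * vscale. The difference b - a is linear in
    vscale, so it stays positive for all vscale >= 1 exactly when its slope
    is non-negative and it is already positive at vscale == 1.
    """
    (ua, sa), (ub, sb) = a, b
    return sa <= sb and ua + sa < ub + sb


def known_greater_than(a, b):
    return known_less_than(b, a)
```

For example, 64 unscaled bits are known to be smaller than 128 scaled bits, while 256 unscaled bits are neither known smaller nor known greater than 128 scaled bits, because the answer depends on the runtime multiple.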
> 
>> We have added an experimental `vscale` intrinsic to represent the runtime
>> multiple. Multiplying the result of this intrinsic by the minimum number of
>> elements in a vector gives the total number of elements in a scalable vector.
> 
> I think this may be a case where adding a full-fledged Instruction might
> be worthwhile.  Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important.  I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.
We've tried it as both an instruction and as a 'Constant', and both work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.
> 
>> For constants consisting of a sequence of values, an experimental `stepvector`
>> intrinsic has been added to represent a simple constant of the form
>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>> start can be added, and changing the step requires multiplying by a splat.
> 
> This is another case where an Instruction might be better, for the same
> reasons as with vscale.
> 
> Also, "iota" is the name Cray has traditionally used for this operation
> as it is the mathematical name for the concept.  It's also used by C++
> and Go and so should be familiar to many people.
Iota would be fine with me; I forget the reason we didn't go with that initially. We
also had 'series_vector' in the past, but that was a more generic form with start
and step parameters instead of requiring additional IR instructions to multiply and
add for the result as we do for stepvector.
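
To make the relationship between stepvector and the older series_vector form concrete, here is a small Python model using plain lists in place of vectors. All function names are illustrative stand-ins rather than real intrinsics, and `num_elems` stands for the minimum element count multiplied by vscale.

```python
def stepvector(num_elems):
    """Model of the <0, 1, 2... num_elems-1> constant."""
    return list(range(num_elems))


def splat(value, num_elems):
    """Broadcast one scalar to every lane."""
    return [value] * num_elems


def vmul(a, b):
    return [x * y for x, y in zip(a, b)]


def vadd(a, b):
    return [x + y for x, y in zip(a, b)]


def series_vector(start, step, num_elems):
    """Generic start/step series built from stepvector plus a multiply and an add."""
    scaled = vmul(stepvector(num_elems), splat(step, num_elems))
    return vadd(scaled, splat(start, num_elems))


# <scalable 4 x i32> with an assumed vscale of 2 has 8 lanes.
lanes = 4 * 2
series = series_vector(10, 3, lanes)
```

The multiply and add here correspond to the extra IR instructions that stepvector needs and that series_vector folded into its parameters.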
> 
>> Future Work
>> -----------
>> 
>> Intrinsics cannot currently be used for constant folding. Our downstream
>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>> for good code generation, so we will need to find new ways to recognize and
>> fold these values.
> 
> As above, we could add ConstantVScale and also ConstantStepVector (or
> ConstantIota).  They won't fold to compile-time values but the
> expressions could be simplified.  I haven't really thought through the
> implications of this, just brainstorming ideas.  What does your
> downstream compiler require in terms of constant support?  What kinds of
> queries does it need to do?
It makes things a little easier to pattern match (just looking for a constant to start
instead of having to match multiple different forms of vscale or stepvector multiplied
and/or added in each place you're looking for them).

The bigger reason we currently depend on them being constant is that code generation
generally looks at a single block at a time, and there are several expressions using
vscale that we don't want to be generated in one block and passed around in a register,
since many of the load/store addressing forms for instructions will already scale properly.

We've done this downstream by having them be Constants, but if there's a good way
of doing them with intrinsics we'd be fine with that too.

-Graham

Amara Emerson via llvm-dev

2018-Jun-05 18:46 UTC

> On Jun 5, 2018, at 11:25 AM, Graham Hunter <Graham.Hunter at arm.com> wrote:
> We've tried it as both an instruction and as a 'Constant', and both work fine with
> ScalarEvolution. I have not yet tried it with the intrinsic.

+CC Sanjoy to confirm: I think intrinsics should be fine to add support for in SCEV.

David Greene via llvm-dev

2018-Jun-05 19:08 UTC

Graham Hunter <Graham.Hunter at arm.com> writes:
>> Can you explain a bit about what the two integers represent?  What's the
>> "unscaled" part for?
>
> 'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
> multiplied by vscale'.
Right, but what do they represent?  If I have <scalable 4 x i32>, is "32"
"unscaled" and "4" "scaled"?  Or is "128" "scaled"?  Or something else?

I see you answered this below.
>> The name "getSizeExpressionInBits" makes me think that a Value
>> expression will be returned (something like a ConstantExpr that uses
>> vscale).  I would be surprised to get a pair of integers back.  Do
>> clients actually need constant integer values or would a ConstantExpr
>> suffice?  We could add a ConstantVScale or something to make it work.
>
> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
> integers representing the known-at-compile-time terms in an expression:
> '(scaled_bits * vscale) + unscaled_bits'.
>
> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
> compile time like <4 x i32> the size would be (128, 0).
>
> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>
> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>
> When calculating the offset for memory addresses, you just need to multiply the scaled
> part by vscale and add the unscaled as is.
Ok, now I understand what you're getting at.  A ConstantExpr would
encapsulate this computation.  We already have "non-static-constant"
values for ConstantExpr like sizeof and offsetof.  I would see
VScaleConstant in that same tradition.  In your struct example,
getSizeExpressionInBits would return:

add(mul(256, vscale), 64)

Does that satisfy your needs?
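
A small sketch of how an (unscaled, scaled) pair maps onto the expression form written above. The function is purely illustrative; it only renders text in the same add/mul notation and is not a model of ConstantExpr.

```python
def size_expression(unscaled, scaled):
    """Render (scaled * vscale) + unscaled, dropping terms that are zero."""
    if scaled and unscaled:
        return f"add(mul({scaled}, vscale), {unscaled})"
    if scaled:
        return f"mul({scaled}, vscale)"
    return str(unscaled)


struct_expr = size_expression(64, 256)    # { <scalable 32 x i8>, i64 }
scalable_expr = size_expression(0, 128)   # <scalable 4 x i32>
fixed_expr = size_expression(128, 0)      # <4 x i32>
```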

Is there anything about vscale or a scalable vector that requires a
minimum bit width?  For example, is this legal?

<scalable 1 x double>

I know it won't map to an SVE type.  I'm simply curious because
traditionally Cray machines defined vectors in terms of
machine-dependent "maxvl" with an element type, so with the above vscale
would == maxvl.  Not that we make any such things anymore.  But maybe
someone else does?
>> If we went the ConstantExpr route and added ConstantExpr support to
>> ScalarEvolution, then SCEVs could be compared to do this size
>> comparison.  We have code here that adds ConstantExpr support to
>> ScalarEvolution.  We just didn't know if anyone else would be interested
>> in it since we added it solely for our Fortran frontend.
>
> We added a dedicated SCEV expression class for vscale instead; I suspect it works
> either way.
Yes, that's probably true.  A vscale SCEV is less invasive.
> We've tried it as both an instruction and as a 'Constant', and both work fine with
> ScalarEvolution. I have not yet tried it with the intrinsic.
vscale as a Constant is interesting.  It's a target-dependent Constant
like sizeof and offsetof.  It doesn't have a statically known value and
maybe isn't "constant" across functions.  So it's a strange kind of
constant.

Ultimately whatever is easier for LLVM to analyze in the long run is
best.  Intrinsics often block optimization.  I don't know whether vscale
would be "easier" as a Constant or an Instruction.
>> As above, we could add ConstantVScale and also ConstantStepVector (or
>> ConstantIota).  They won't fold to compile-time values but the
>> expressions could be simplified.  I haven't really thought through the
>> implications of this, just brainstorming ideas.  What does your
>> downstream compiler require in terms of constant support?  What kinds of
>> queries does it need to do?
>
> It makes things a little easier to pattern match (just looking for a constant to start
> instead of having to match multiple different forms of vscale or stepvector multiplied
> and/or added in each place you're looking for them).
Ok.  Normalization could help with this but I certainly understand the
issue.
> The bigger reason we currently depend on them being constant is that code generation
> generally looks at a single block at a time, and there are several expressions using
> vscale that we don't want to be generated in one block and passed around in a register,
> since many of the load/store addressing forms for instructions will already scale properly.
This is kind of like X86 memop folding.  If a load has multiple uses, it
won't be folded, on the theory that one load is better than many folded
loads.  If a load has exactly one use, it will fold.  There's explicit
predicate code in the X86 backend to enforce this requirement.  I
suspect if the X86 backend tried to fold a single load into multiple
places, Bad Things would happen (needed SDNodes might disappear, etc.).
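
The one-use rule described above can be pictured with a toy use-count check. This is a deliberately simplified model with a made-up instruction format; it is not how the X86 backend is written and it ignores everything else that real folding predicates consider.

```python
from collections import Counter


def foldable_loads(instructions):
    """Return the results of loads that have exactly one use.

    instructions is a list of (result, opcode, operands) tuples, a
    hypothetical format used only for this illustration.
    """
    uses = Counter(op for _, _, operands in instructions for op in operands)
    return [result for result, opcode, _ in instructions
            if opcode == "load" and uses[result] == 1]


program = [
    ("a", "load", ["p"]),
    ("b", "load", ["q"]),
    ("c", "add", ["a", "b"]),   # only use of a
    ("d", "mul", ["b", "c"]),   # second use of b
]
```

Here only `a` would be a folding candidate; `b` has two users, so it stays as a single load whose result is kept in a register.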

Codegen probably doesn't understand non-statically-constant
ConstantExprs, since sizeof or offsetof can be resolved by the target
before instruction selection.
> We've done this downstream by having them be Constants, but if there's a good way
> of doing them with intrinsics we'd be fine with that too.
If vscale/stepvector as Constants works, it seems fine to me.

                               -David

Graham Hunter via llvm-dev

2018-Jun-06 09:20 UTC

Hi David,
>>> The name "getSizeExpressionInBits" makes me think that a Value
>>> expression will be returned (something like a ConstantExpr that uses
>>> vscale).  I would be surprised to get a pair of integers back.  Do
>>> clients actually need constant integer values or would a ConstantExpr
>>> suffice?  We could add a ConstantVScale or something to make it work.
>> 
>> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
>> integers representing the known-at-compile-time terms in an expression:
>> '(scaled_bits * vscale) + unscaled_bits'.
>> 
>> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
>> compile time like <4 x i32> the size would be (128, 0).
>> 
>> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>> 
>> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>> 
>> When calculating the offset for memory addresses, you just need to multiply the scaled
>> part by vscale and add the unscaled as is.
> 
> Ok, now I understand what you're getting at.  A ConstantExpr would
> encapsulate this computation.  We already have "non-static-constant"
> values for ConstantExpr like sizeof and offsetof.  I would see
> VScaleConstant in that same tradition.  In your struct example,
> getSizeExpressionInBits would return:
> 
> add(mul(256, vscale), 64)
> 
> Does that satisfy your needs?
Ah, I think the use of 'expression' in the name definitely confuses the issue then. This
isn't for expressing the size in IR, where you would indeed just multiply by vscale and
add any fixed-length size.

This is for the analysis code around the IR -- lots of code asks for the size of a Type in
bits to determine what it can do to a Value with that type. Some of them are specific to
scalar Types, like determining whether a sign/zero extend is needed. Others would
apply to vector types (including scalable vectors), such as checking whether two
Types have the exact same size so that a bitcast can be used instead of a more
expensive operation like copying to memory and back to convert.

See 'getTypeSizeInBits' and 'getTypeStoreSizeInBits' in DataLayout -- they're used
a few hundred times throughout the codebase, and to properly support scalable
types we'd need to return something that isn't just a single integer. Since most
backends won't support scalable vectors I suggested having a 'FixedSize' method
that just returns the single integer, but it may be better to just leave the existing method
as is and create a new method with 'Scalable' or 'VariableLength' or similar in the
name to make it more obvious in common code.
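
A minimal sketch of the kind of query split described above: one accessor for code that only handles fixed sizes, guarded by an assertion, and an exact-size check of the sort a bitcast decision needs. `SketchTypeSize`, `fixed_size`, and `can_bitcast` are hypothetical names for this example and do not describe any existing LLVM interface; a type is assumed to be either wholly fixed or wholly scalable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchTypeSize:
    """Hypothetical result of a size query; not an LLVM class."""
    min_bits: int           # size in bits when vscale == 1
    scalable: bool = False  # True if min_bits must be multiplied by vscale

    def fixed_size(self):
        # Stand-in for the debug assert: callers that cannot handle
        # scalable types should never reach here with one.
        assert not self.scalable, "size query on a scalable type"
        return self.min_bits


def can_bitcast(a, b):
    """Two types have the exact same size only if both terms match."""
    return a == b


v4i32 = SketchTypeSize(128)                     # <4 x i32>
nxv4i32 = SketchTypeSize(128, scalable=True)    # <scalable 4 x i32>
nxv8i16 = SketchTypeSize(128, scalable=True)    # <scalable 8 x i16>
```

Under this model `<scalable 4 x i32>` and `<scalable 8 x i16>` are bitcast-compatible, while `<4 x i32>` and `<scalable 4 x i32>` are not, even though their minimum sizes agree.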

There are a few places where changes in IR may be needed; 'lifetime.start' markers in
IR embed size data, and we would need to either add a scalable term to that or
find some other way of indicating the size. That can be dealt with when we try to
add support for the SVE ACLE though.
> 
> Is there anything about vscale or a scalable vector that requires a
> minimum bit width?  For example, is this legal?
> 
> <scalable 1 x double>
> 
> I know it won't map to an SVE type.  I'm simply curious because
> traditionally Cray machines defined vectors in terms of
> machine-dependent "maxvl" with an element type, so with the above vscale
> would == maxvl.  Not that we make any such things anymore.  But maybe
> someone else does?
That's legal in IR, yes, and we believe it should be usable to represent the vectors for
RISC-V's 'V' extension. The main problem there is that they have a dynamic vector
length within the loop so that they can perform the last iterations of a loop within vector
registers when there's less than a full register worth of data remaining. SVE uses
predication (masking) to achieve the same effect.

For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm hoping that a
'setvl' intrinsic that provides a predicate will avoid the need for modelling a change in
dynamic vector length -- reducing the vector length is effectively equivalent to an implied
predicate on all operations. This avoids needing to add a token operand to all existing
instructions that work on vector types.
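
The equivalence between a reduced vector length and an implied predicate can be sketched with a strip-mined loop in Python. The helper names are hypothetical and lists stand in for vector registers; this only illustrates the idea and does not describe either ISA's actual instructions.

```python
def setvl_predicate(remaining, maxvl):
    """Hypothetical 'setvl'-style helper: enable only the lanes still needed."""
    active = min(remaining, maxvl)
    return [lane < active for lane in range(maxvl)]


def predicated_add(a, b, out, base, pred):
    """Add one vector's worth of elements, skipping disabled lanes."""
    for lane, enabled in enumerate(pred):
        if enabled:
            out[base + lane] = a[base + lane] + b[base + lane]


def vector_add(a, b, maxvl):
    out = [0] * len(a)
    for base in range(0, len(a), maxvl):
        # The final iteration gets a partial predicate instead of a
        # shorter vector length.
        pred = setvl_predicate(len(a) - base, maxvl)
        predicated_add(a, b, out, base, pred)
    return out


result = vector_add(list(range(1, 11)), [10] * 10, maxvl=4)
```

With ten elements and four lanes, the last iteration runs with only two lanes enabled, which has the same effect as shortening the vector length to two.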

-Graham





