thr3ads.net - llvm dev - [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths [Jun 2018]

Graham Hunter via llvm-dev

2018-Jun-07 16:10 UTC

[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Hi,
> On 6 Jun 2018, at 17:36, David A. Greene <dag at cray.com> wrote:
> 
> Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org> writes:
> 
>>> Ok, now I understand what you're getting at.  A ConstantExpr would
>>> encapsulate this computation.  We already have "non-static-constant"
>>> values for ConstantExpr like sizeof and offsetof.  I would see
>>> VScaleConstant in that same tradition.  In your struct example,
>>> getSizeExpressionInBits would return:
>>>
>>> add(mul(256, vscale), 64)
>>>
>>> Does that satisfy your needs?
>>
>> Ah, I think the use of 'expression' in the name definitely confuses the
>> issue then. This isn't for expressing the size in IR, where you would
>> indeed just multiply by vscale and add any fixed-length size.
>
> Ok, thanks for clarifying.  The use of "expression" is confusing.
>
>> This is for the analysis code around the IR -- lots of code asks for the
>> size of a Type in bits to determine what it can do to a Value with that
>> type. Some of them are specific to scalar Types, like determining whether
>> a sign/zero extend is needed. Others would apply to vector types
>> (including scalable vectors), such as checking whether two Types have the
>> exact same size so that a bitcast can be used instead of a more expensive
>> operation like copying to memory and back to convert.
>
> If this method returns two integers, how does LLVM interpret the
> comparison?  If the return value is { <unscaled>, <scaled> } then how
> do, say { 1024, 0 } and { 0, 128 } compare?  Doesn't it depend on the
> vscale?  They could be the same size or not, depending on the target
> characteristics.
I did have a paragraph on that in the RFC, but perhaps a list would be
a better format (assuming X,Y,etc are non-zero):

{ X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.

{ 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
                         functions that inherit vector length. Cannot be
                         compared across non-inheriting functions.

{ X, 0 } > { 0, Y }: Cannot return true.

{ X, 0 } = { 0, Y }: Cannot return true.

{ X, 0 } < { 0, Y }: Can return true.

{ Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
                             terms and try the above comparisons; it
                             may not be possible to get a good answer.

I don't know if we need a 'maybe' result for cases comparing scaled
vs. unscaled; I believe the gcc implementation of SVE allows for such
results, but that supports a generic polynomial length representation.

I think in code, we'd have an inline function to deal with the first case
and a likely-not-taken call to a separate function to handle all the
scalable cases.
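
The comparison rules listed above can be sketched in C++ roughly as follows.
This is only an illustration and not part of the RFC: the names SizeInBits,
knownLess, knownLessScalable and knownEqual are hypothetical, and the sketch
assumes that vscale is an unknown integer that is at least 1, has no known
upper bound, and is the same for both sizes being compared. A result of
false is read as "cannot be proven", matching the "cannot return true" rows.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type: a size of Unscaled + Scaled * vscale bits.
struct SizeInBits {
  int64_t Unscaled; // fixed part, in bits
  int64_t Scaled;   // part multiplied by the runtime vscale, in bits
};

// Out-of-line slow path for sizes with a scaled part. Returns true only if
// A < B holds for every vscale >= 1 (assumption: no upper bound on vscale).
// Overflow of the intermediate sums is ignored in this sketch.
bool knownLessScalable(SizeInBits A, SizeInBits B) {
  int64_t DU = A.Unscaled - B.Unscaled; // subtract the common terms
  int64_t DS = A.Scaled - B.Scaled;
  // DU + DS * vscale must be negative for all vscale >= 1. If DS > 0 the
  // expression grows without bound; otherwise its maximum is at vscale == 1.
  return DS <= 0 && DU + DS < 0;
}

// Inline fast path for the common case where both sizes are fixed.
inline bool knownLess(SizeInBits A, SizeInBits B) {
  assert(A.Unscaled >= 0 && A.Scaled >= 0 && B.Unscaled >= 0 && B.Scaled >= 0);
  if (A.Scaled == 0 && B.Scaled == 0)
    return A.Unscaled < B.Unscaled;
  return knownLessScalable(A, B);
}

// Two sizes are provably equal only if both parts match.
inline bool knownEqual(SizeInBits A, SizeInBits B) {
  return A.Unscaled == B.Unscaled && A.Scaled == B.Scaled;
}
```

With these definitions, knownLess({64, 0}, {0, 128}) is true because
128 * vscale is at least 128, while knownLess({1024, 0}, {0, 128}) and
knownLess({0, 128}, {1024, 0}) are both false: neither ordering can be
proven without knowing vscale.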
> Are bitcasts between scaled types and non-scaled types disallowed?  I
> could certainly see an argument for disallowing it.  I could argue that
> for bitcasting purposes that the unscaled and scaled parts would have to
> exactly match in order to do a legal bitcast.  Is that the intent?
I would propose disallowing bitcasts, but allowing extracting a subvector
if the minimum number of scaled bits matches the number of unscaled bits.
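
As a rough illustration of that rule, here is a minimal C++ sketch. The
VecType struct and both predicates are hypothetical names made up for this
example, not existing LLVM API, and "minimum size" is taken to mean the size
of the scalable type when vscale is 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a vector type: MinElts elements of EltBits
// bits each, multiplied by the runtime vscale when Scalable is set.
struct VecType {
  uint32_t MinElts;
  uint32_t EltBits;
  bool Scalable;
};

// Size in bits at vscale == 1 (the exact size for fixed-width types).
inline uint64_t minSizeInBits(VecType T) {
  return uint64_t(T.MinElts) * T.EltBits;
}

// Bitcasts are only allowed between two fixed types or two scalable types
// of matching size; mixing scaled and unscaled types is disallowed.
bool canBitcast(VecType From, VecType To) {
  return From.Scalable == To.Scalable &&
         minSizeInBits(From) == minSizeInBits(To);
}

// A fixed-width subvector may be extracted from a scalable vector when the
// scalable type's minimum size matches the fixed type's size.
bool canExtractFixedSubvector(VecType From, VecType To) {
  assert(From.EltBits != 0 && To.EltBits != 0);
  return From.Scalable && !To.Scalable &&
         minSizeInBits(From) == minSizeInBits(To);
}
```

For example, a <scalable 4 x i32> value could not be bitcast to <4 x i32>
under this rule, but its low 128 bits could be extracted as <4 x i32>.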
> 
>>> Is there anything about vscale or a scalable vector that requires a
>>> minimum bit width?  For example, is this legal?
>>> 
>>> <scalable 1 x double>
>>> 
>>> I know it won't map to an SVE type.  I'm simply curious because
>>> traditionally Cray machines defined vectors in terms of
>>> machine-dependent "maxvl" with an element type, so with the above
>>> vscale would == maxvl.  Not that we make any such things anymore.
>>> But maybe someone else does?
>> 
>> That's legal in IR, yes, and we believe it should be usable to represent
>> the vectors for RISC-V's 'V' extension. The main problem there is that
>> they have a dynamic vector length within the loop so that they can perform
>> the last iterations of a loop within vector registers when there's less
>> than a full register worth of data remaining. SVE uses predication
>> (masking) to achieve the same effect.
>>
>> For the 'V' extension, vscale would indeed correspond to 'maxvl', and I'm
>> hoping that a 'setvl' intrinsic that provides a predicate will avoid the
>> need for modelling a change in dynamic vector length -- reducing the
>> vector length is effectively equivalent to an implied predicate on all
>> operations. This avoids needing to add a token operand to all existing
>> instructions that work on vector types.
> 
> Right.  In that way the RISC V method is very much like what the old
> Cray machines did with the Vector Length register.
> 
> So in LLVM IR you would have "setvl" return a predicate and then apply
> that predicate to operations using the current select method?  How does
> instruction selection map that back onto a simple setvl + unpredicated
> vector instructions?
>
> For conditional code both vector length and masking must be taken into
> account.  If "setvl" returns a predicate then that predicate would have
> to be combined in some way with the conditional predicate (typically via
> an AND operation in an IR that directly supports predicates).  Since
> LLVM IR doesn't have predicates _per_se_, would it turn into nested
> selects or something?  Untangling that in instruction selection seems
> difficult but perhaps I'm missing something.
My idea is for the RISC-V backend to recognize when a setvl intrinsic has
been used, and replace the use of its value in AND operations with an
all-true value (with constant folding to remove unnecessary ANDs) then
replace any masked instructions (generally loads, stores, anything else
that might generate an exception or modify state that it shouldn't) with
target-specific nodes that understand the dynamic vlen.

This could be part of lowering, or maybe a separate IR pass, rather than ISel.
I *think* this will work, but if someone can come up with some IR where it
wouldn't work then please let me know (e.g. global-state-changing
instructions that could move out of blocks where one setvl predicate is used
and into one where another is used).
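
To make the intended equivalence concrete, here is a small C++ simulation.
It is a toy model under stated assumptions, not backend code: setvlMask
stands in for a hypothetical setvl intrinsic that returns a predicate, and
the two store functions model the IR-level view (mask = setvl predicate AND
condition) and the target-level view after the rewrite (condition only,
with vl limiting which lanes are touched).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<int32_t>;
using Mask = std::vector<bool>;

// Predicate a hypothetical setvl intrinsic could return: the first VL lanes
// are active, the remaining lanes are inactive.
Mask setvlMask(std::size_t MaxVL, std::size_t VL) {
  assert(VL <= MaxVL);
  Mask M(MaxVL, false);
  for (std::size_t I = 0; I < VL; ++I)
    M[I] = true;
  return M;
}

// IR-level view: a masked store whose mask is (setvl predicate AND Cond).
void maskedStoreIR(Vec &Dst, const Vec &Src, const Mask &VLMask,
                   const Mask &Cond) {
  assert(Dst.size() == Src.size() && Dst.size() == VLMask.size() &&
         Dst.size() == Cond.size());
  for (std::size_t I = 0; I < Dst.size(); ++I)
    if (VLMask[I] && Cond[I])
      Dst[I] = Src[I];
}

// Target-level view after the rewrite: the setvl operand of the AND has been
// replaced by all-true, so only Cond remains, and vl bounds the lanes.
void maskedStoreWithVL(Vec &Dst, const Vec &Src, const Mask &Cond,
                       std::size_t VL) {
  assert(Dst.size() == Src.size() && Dst.size() == Cond.size() &&
         VL <= Dst.size());
  for (std::size_t I = 0; I < VL; ++I)
    if (Cond[I])
      Dst[I] = Src[I];
}
```

Both functions write the same lanes for any VL and Cond, which is the
property the proposed rewrite relies on for stores.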

Unfortunately, I can't find a description of the instructions included in
the 'V' extension in the online manual (other than setvl or configuring
registers), so I can't tell if there's something I'm missing.

-Graham

Bruce Hoult via llvm-dev

2018-Jun-07 22:31 UTC

On Fri, Jun 8, 2018 at 4:10 AM, Graham Hunter via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Unfortunately, I can't find a description of the instructions included in
> the 'V' extension in the online manual (other than setvl or configuring
> registers), so I can't tell if there's something I'm missing.
>
RVV is a little bit behind SVE in the process :-) On the whole it's
following the style of vector processor that has had several
implementations at Berkeley, dating back a decade or more. The set of
operations is pretty much nailed down now, but things such as the exact
instruction encodings are still in flux. There is an intention to get some
experience with compilers and FPGA (at least) implementations of the
proposal before ratifying it as part of the RISC-V standard. So details
could well change during that period.

Robin Kruppe via llvm-dev

2018-Jun-11 15:47 UTC

Hi Graham,
Hi David,

glad to hear other people are thinking about RVV codegen!

On 7 June 2018 at 18:10, Graham Hunter <Graham.Hunter at arm.com> wrote:
>
> Hi,
>
> > On 6 Jun 2018, at 17:36, David A. Greene <dag at cray.com> wrote:
> >
> > Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org> writes:
> >>> Is there anything about vscale or a scalable vector that requires a
> >>> minimum bit width?  For example, is this legal?
> >>>
> >>> <scalable 1 x double>
> >>>
> >>> I know it won't map to an SVE type.  I'm simply curious because
> >>> traditionally Cray machines defined vectors in terms of
> >>> machine-dependent "maxvl" with an element type, so with the above
> >>> vscale would == maxvl.  Not that we make any such things anymore.
> >>> But maybe someone else does?
> >>
> >> That's legal in IR, yes, and we believe it should be usable to
> >> represent the vectors for RISC-V's 'V' extension. The main problem
> >> there is that they have a dynamic vector length within the loop so
> >> that they can perform the last iterations of a loop within vector
> >> registers when there's less than a full register worth of data
> >> remaining. SVE uses predication (masking) to achieve the same effect.

Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
x T> is currently allowed and, unlike the scalable variant, that's not
even useful) and such types would be the sole legal vector types in the
RISC-V backend.
>
> >> For the 'V' extension, vscale would indeed correspond to 'maxvl', and
> >> I'm hoping that a 'setvl' intrinsic that provides a predicate will
> >> avoid the need for modelling a change in dynamic vector length --
> >> reducing the vector length is effectively equivalent to an implied
> >> predicate on all operations. This avoids needing to add a token
> >> operand to all existing instructions that work on vector types.
Yes, vscale would be the *maximum* vector length (how many elements
fit into each register), not the *active* vector length (how many
elements are operated on in the current loop iteration).

This has nothing to do with tokens, though. The tokens I proposed were
to encode the fact that even 'maxvl' varies on a function by function
basis. This RFC approaches the same issue differently, but it's still
there -- in terms of this RFC, operations on scalable vectors depend
on `vscale`, which is "not necessarily [constant] across functions".
That implies, for example, that an unmasked <scalable 4 x i32> load or
store (which accesses vscale * 16 bytes of memory) can't generally be
moved from one function to another unless it's somehow ensured that
both functions will have the same vscale. For that matter, the same
restriction applies to calls to `vscale` itself.
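
The arithmetic behind the "vscale * 16 bytes" figure can be written out as a
tiny C++ helper. The function name bytesAccessed is made up for this
illustration; it only multiplies the minimum element count, the element size
and an assumed runtime vscale.

```cpp
#include <cassert>
#include <cstdint>

// Bytes touched by an unmasked load or store of a
// <scalable MinElts x iEltBits> value for a given runtime vscale.
uint64_t bytesAccessed(uint64_t MinElts, uint64_t EltBits, uint64_t VScale) {
  assert(EltBits % 8 == 0 && VScale >= 1);
  return VScale * MinElts * (EltBits / 8);
}
```

For <scalable 4 x i32> this gives 16 bytes when vscale is 1 and 64 bytes
when vscale is 4, so the same IR instruction touches a different amount of
memory if it is moved into a function that runs with a different vscale.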

The evolution of the active vector length is a separate problem and
one that doesn't really impact the IR type system (nor one that can
easily be solved by tokens).
> >
> > Right.  In that way the RISC V method is very much like what the old
> > Cray machines did with the Vector Length register.
> >
> > So in LLVM IR you would have "setvl" return a predicate and then apply
> > that predicate to operations using the current select method?  How does
> > instruction selection map that back onto a simple setvl + unpredicated
> > vector instructions?
> >
> > For conditional code both vector length and masking must be taken into
> > account.  If "setvl" returns a predicate then that predicate would have
> > to be combined in some way with the conditional predicate (typically via
> > an AND operation in an IR that directly supports predicates).  Since
> > LLVM IR doesn't have predicates _per_se_, would it turn into nested
> > selects or something?  Untangling that in instruction selection seems
> > difficult but perhaps I'm missing something.
>
> My idea is for the RISC-V backend to recognize when a setvl intrinsic has
> been used, and replace the use of its value in AND operations with an
> all-true value (with constant folding to remove unnecessary ANDs) then
> replace any masked instructions (generally loads, stores, anything else
> that might generate an exception or modify state that it shouldn't) with
> target-specific nodes that understand the dynamic vlen.
I am not quite so sure about turning the active vector length into
just another mask. It's true that the effects on arithmetic, load,
stores, etc. are the same as if everything executed under a mask like
<1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
active vector length. However, actually materializing the masks in the
IR means the RISCV backend has to reverse-engineer what it must do
with the vl register for any given (masked or unmasked) vector
operation. The stakes for that are rather high, because (1) it applies
to pretty much every single vector operation ever, and (2) when it
fails, the codegen impact is incredibly bad.

(1) The vl register affects not only loads, stores and other
operations with side effects, but all vector instructions, even pure
computation (and reg-reg moves, but that's not relevant for IR). An
integer vector add, for example, only computes src1[i] + src2[i] for 0
<= i < vl and the remaining elements of the destination register (from
vl upwards) are zeroed. This is very natural for strip-mined loops
(you'll never need those elements), but it means an unmasked IR level
vector add is a really bad fit for the RISC-V 'vadd' instruction.
Unless the backend can prove that only the first vl elements of the
result will ever be observed, it will have to temporarily set vl to
MAXVL so that the RVV instruction will actually compute the "full"
result. Establishing that seems like it will require at least some
custom data flow analysis, and it's unclear how robust it can be made.

(2) Failing to properly use vl for some vector operation is worse than
e.g. materializing a mask you wouldn't otherwise need. It requires
that too (if the operation is masked), but more importantly it needs
to save vl, change it to MAXVL, and finally restore the old value.
That's quite expensive: besides the ca. 3 extra instructions and the
scratch GPR required, this save/restore dance can have other nasty
effects depending on uarch style. I'd have to consult the hardware
people to be sure, but from my understanding risks include pipeline
stalls and expensive roundtrips between decoupled vector and scalar
units.
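
Points (1) and (2) can be illustrated with a small C++ simulation. This is a
toy model based only on the behaviour described above (lanes from vl upward
are zeroed); the VectorUnit struct and fullWidthAdd are hypothetical names,
and the actual RVV semantics were still in flux at the time of writing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<int32_t>; // one vector register of MaxVL elements

// Toy model of the vector unit state; not a real RVV interface.
struct VectorUnit {
  std::size_t MaxVL; // elements per register
  std::size_t VL;    // active vector length, VL <= MaxVL

  // 'vadd' as described above: lanes [0, VL) are computed, lanes from VL
  // upward are zeroed.
  Vec vadd(const Vec &A, const Vec &B) const {
    assert(A.size() == MaxVL && B.size() == MaxVL && VL <= MaxVL);
    Vec R(MaxVL, 0);
    for (std::size_t I = 0; I < VL; ++I)
      R[I] = A[I] + B[I];
    return R;
  }
};

// Fallback for an unmasked IR-level add when the backend cannot prove that
// only the first VL lanes of the result are observed.
Vec fullWidthAdd(VectorUnit &U, const Vec &A, const Vec &B) {
  std::size_t Saved = U.VL; // save vl (costs a scratch GPR on real hardware)
  U.VL = U.MaxVL;           // widen to MAXVL
  Vec R = U.vadd(A, B);     // now every lane is computed
  U.VL = Saved;             // restore the old vl
  return R;
}
```

With MaxVL = 4 and VL = 2, vadd on {1, 2, 3, 4} and {10, 20, 30, 40} yields
{11, 22, 0, 0}, whereas the IR-level add means {11, 22, 33, 44}; getting the
latter requires the save, widen and restore sequence in fullWidthAdd.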

To be clear: I have not yet experimented with any of this, so I'm not
saying this is a deal breaker. A well-engineered "demanded elements"
analysis may very well be good enough in practice. But since we
broached the subject, I wanted to mention this challenge. (I'm
currently side stepping it by not using built-in vector instructions
but instead intrinsics that treat vl as magic extra state.)
> This could be part of lowering, or maybe a separate IR pass, rather than
> ISel.
> I *think* this will work, but if someone can come up with some IR where it
> wouldn't work then please let me know (e.g. global-state-changing
> instructions that could move out of blocks where one setvl predicate is
> used and into one where another is used).
There are some operations that use vl for things other than simple
masking. To give one example, "speculative" loads (which silence
some exceptions to safely permit vectorization of some loops with
data-dependent exits, such as strlen) can shrink vl as a side effect.
I believe this can be handled by modelling all relevant operations
(including setvl itself) as intrinsics that have side effects or
read/write inaccessible memory. However, if you want to have the
"current" vl (or equivalent mask) around as an SSA value, you need to
"reload" it after any operation that updates vl. That seems like it
could get a bit complex if you want to do it efficiently (in the
limit, it seems equivalent to SSA construction).
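
The strlen case can be sketched as a C++ simulation. Everything here is a
toy model under stated assumptions: the Machine struct, speculativeLoad and
vectorStrlen are hypothetical names, the end of the std::string buffer
stands in for the end of mapped memory, and the load is assumed to shrink vl
to the number of elements it could read without faulting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Toy machine state; MaxVL and VL are counted in byte elements here.
struct Machine {
  std::size_t MaxVL;
  std::size_t VL;
};

// Hypothetical speculative load: reads up to M.VL bytes starting at Off.
// Bytes at or past Mem.size() stand in for an unmapped page: instead of
// faulting, the load shrinks M.VL to the number of bytes it could read.
std::string speculativeLoad(Machine &M, const std::string &Mem,
                            std::size_t Off) {
  assert(Off < Mem.size()); // the first element must be readable
  M.VL = std::min(M.VL, Mem.size() - Off); // side effect on vl
  return Mem.substr(Off, M.VL);
}

// Strip-mined strlen. Precondition: Mem contains a '\0' byte.
std::size_t vectorStrlen(Machine &M, const std::string &Mem) {
  std::size_t Off = 0;
  for (;;) {
    M.VL = M.MaxVL; // setvl: request a full register
    std::string Chunk = speculativeLoad(M, Mem, Off);
    std::size_t VL = M.VL; // "reload" vl: the load may have changed it
    for (std::size_t I = 0; I < VL; ++I)
      if (Chunk[I] == '\0')
        return Off + I;
    Off += VL;
  }
}
```

The line that copies M.VL after the load is the "reload" mentioned above: a
value of vl captured before the load would be stale whenever the load
shrinks it.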
>
>
> Unfortunately, I can't find a description of the instructions included in
> the 'V' extension in the online manual (other than setvl or configuring
> registers), so I can't tell if there's something I'm missing.
I'm very sorry for that, I know how frustrating it can be. I hope the
above gives a clearer picture of the constraints involved. Exact
instructions, let alone encodings, are still in flux as Bruce said.


Cheers,
Robin

Graham Hunter via llvm-dev

2018-Jun-12 12:47 UTC

Hi Robin,

responses inline.

-Graham
> On 11 Jun 2018, at 16:47, Robin Kruppe <robin.kruppe at gmail.com> wrote:
>
> Hi Graham,
> Hi David,
>
> glad to hear other people are thinking about RVV codegen!
>
> On 7 June 2018 at 18:10, Graham Hunter <Graham.Hunter at arm.com> wrote:
>>
>> Hi,
>>
>>> On 6 Jun 2018, at 17:36, David A. Greene <dag at cray.com> wrote:
>>>
>>> Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org> writes:
>>>>> Is there anything about vscale or a scalable vector that requires a
>>>>> minimum bit width?  For example, is this legal?
>>>>>
>>>>> <scalable 1 x double>
>>>>>
>>>>> I know it won't map to an SVE type.  I'm simply curious because
>>>>> traditionally Cray machines defined vectors in terms of
>>>>> machine-dependent "maxvl" with an element type, so with the above
>>>>> vscale would == maxvl.  Not that we make any such things anymore.
>>>>> But maybe someone else does?
>>>>
>>>> That's legal in IR, yes, and we believe it should be usable to
>>>> represent the vectors for RISC-V's 'V' extension. The main problem
>>>> there is that they have a dynamic vector length within the loop so
>>>> that they can perform the last iterations of a loop within vector
>>>> registers when there's less than a full register worth of data
>>>> remaining. SVE uses predication (masking) to achieve the same effect.
> 
> 
> Yes, <scalable 1 x T> should be allowed in the IR type system (even <1
> x T> is currently allowed and, unlike the scalable variant, that's not
> even useful) and such types would be the sole legal vector types in the
> RISC-V backend.
> 
>> 
>>>> For the 'V' extension, vscale would indeed correspond to 'maxvl', and
>>>> I'm hoping that a 'setvl' intrinsic that provides a predicate will
>>>> avoid the need for modelling a change in dynamic vector length --
>>>> reducing the vector length is effectively equivalent to an implied
>>>> predicate on all operations. This avoids needing to add a token
>>>> operand to all existing instructions that work on vector types.
> 
> Yes, vscale would be the *maximum* vector length (how many elements
> fit into each register), not the *active* vector length (how many
> elements are operated on in the current loop iteration).
> 
> This has nothing to do with tokens, though. The tokens I proposed were
> to encode the fact that even 'maxvl' varies on a function by function
> basis. This RFC approaches the same issue differently, but it's still
> there -- in terms of this RFC, operations on scalable vectors depend
> on `vscale`, which is "not necessarily [constant] across functions".
> That implies, for example, that an unmasked <scalable 4 x i32> load or
> store (which accesses vscale * 16 bytes of memory) can't generally be
> moved from one function to another unless it's somehow ensured that
> both functions will have the same vscale. For that matter, the same
> restriction applies to calls to `vscale` itself.
> 
> The evolution of the active vector length is a separate problem and
> one that doesn't really impact the IR type system (nor one that can
> easily be solved by tokens).
Agreed.
> 
>>> 
>>> Right.  In that way the RISC V method is very much like what the old
>>> Cray machines did with the Vector Length register.
>>>
>>> So in LLVM IR you would have "setvl" return a predicate and then apply
>>> that predicate to operations using the current select method?  How does
>>> instruction selection map that back onto a simple setvl + unpredicated
>>> vector instructions?
>>>
>>> For conditional code both vector length and masking must be taken into
>>> account.  If "setvl" returns a predicate then that predicate would have
>>> to be combined in some way with the conditional predicate (typically via
>>> an AND operation in an IR that directly supports predicates).  Since
>>> LLVM IR doesn't have predicates _per_se_, would it turn into nested
>>> selects or something?  Untangling that in instruction selection seems
>>> difficult but perhaps I'm missing something.
>> 
>> My idea is for the RISC-V backend to recognize when a setvl intrinsic has
>> been used, and replace the use of its value in AND operations with an
>> all-true value (with constant folding to remove unnecessary ANDs) then
>> replace any masked instructions (generally loads, stores, anything else
>> that might generate an exception or modify state that it shouldn't) with
>> target-specific nodes that understand the dynamic vlen.
> 
> I am not quite so sure about turning the active vector length into
> just another mask. It's true that the effects on arithmetic, load,
> stores, etc. are the same as if everything executed under a mask like
> <1, 1, ..., 1, 0, 0, ..., 0> with the number of ones equal to the
> active vector length. However, actually materializing the masks in the
> IR means the RISCV backend has to reverse-engineer what it must do
> with the vl register for any given (masked or unmasked) vector
> operation. The stakes for that are rather high, because (1) it applies
> to pretty much every single vector operation ever, and (2) when it
> fails, the codegen impact is incredibly bad.
I can see where the concern comes from; we had problems reconstructing
semantics when experimenting with search loop vectorization and often
had to fall back on default (slow) generic cases.

My main reason for proposing this was to try and ensure that the size was
consistent from the point of view of the query functions we were discussing
in the main thread. If you're fine with all size queries assuming maxvl (so
things like stack slots would always use the current configured maximum
length), then I don't think there's a problem with dropping this part of
the proposal and letting you find a better representation of active length.
> (1) The vl register affects not only loads, stores and other
> operations with side effects, but all vector instructions, even pure
> computation (and reg-reg moves, but that's not relevant for IR). An
> integer vector add, for example, only computes src1[i] + src2[i] for 0
> <= i < vl and the remaining elements of the destination register (from
> vl upwards) are zeroed. This is very natural for strip-mined loops
> (you'll never need those elements), but it means an unmasked IR level
> vector add is a really bad fit for the RISC-V 'vadd' instruction.
> Unless the backend can prove that only the first vl elements of the
> result will ever be observed, it will have to temporarily set vl to
> MAXVL so that the RVV instruction will actually compute the "full"
> result. Establishing that seems like it will require at least some
> custom data flow analysis, and it's unclear how robust it can be made.
> 
> (2) Failing to properly use vl for some vector operation is worse than
> e.g. materializing a mask you wouldn't otherwise need. It requires
> that too (if the operation is masked), but more importantly it needs
> to save vl, change it to MAXVL, and finally restore the old value.
> That's quite expensive: besides the ca. 3 extra instructions and the
> scratch GPR required, this save/restore dance can have other nasty
> effects depending on uarch style. I'd have to consult the hardware
> people to be sure, but from my understanding risks include pipeline
> stalls and expensive roundtrips between decoupled vector and scalar
> units.
Ah, I hadn't appreciated you might need to save/restore the VL like that.
I'd worked through a couple of small example loops and it seemed fine,
but hadn't looked at more complicated cases.
> To be clear: I have not yet experimented with any of this, so I'm not
> saying this is a deal breaker. A well-engineered "demanded elements"
> analysis may very well be good enough in practice. But since we
> broached the subject, I wanted to mention this challenge. (I'm
> currently side stepping it by not using built-in vector instructions
> but instead intrinsics that treat vl as magic extra state.)
> 
>> This could be part of lowering, or maybe a separate IR pass, rather than
>> ISel.
>> I *think* this will work, but if someone can come up with some IR where it
>> wouldn't work then please let me know (e.g. global-state-changing
>> instructions that could move out of blocks where one setvl predicate is
>> used and into one where another is used).
> 
> There are some operations that use vl for things other than simple
> masking. To give one example, "speculative" loads (which silence
> some exceptions to safely permit vectorization of some loops with
> data-dependent exits, such as strlen) can shrink vl as a side effect.
> I believe this can be handled by modelling all relevant operations
> (including setvl itself) as intrinsics that have side effects or
> read/write inaccessible memory. However, if you want to have the
> "current" vl (or equivalent mask) around as an SSA value, you need to
> "reload" it after any operation that updates vl. That seems like it
> could get a bit complex if you want to do it efficiently (in the
> limit, it seems equivalent to SSA construction).
Ok; the fact that there are more instructions that can change vl and that
you might need to reload it is useful to know.

SVE uses predication to achieve the same effect via the
first-faulting/no-faulting load instructions and the FFR register.

I think SVE having 16 predicate registers (vs. 8 for RVV and AVX-512) has led
to us using the feature quite widely with our own experiments; I'll try
looking for non-predicated solutions as well when we try to expand scalable
vectorization capabilities.
>> Unfortunately, I can't find a description of the instructions included in
>> the 'V' extension in the online manual (other than setvl or configuring
>> registers), so I can't tell if there's something I'm missing.
> 
> I'm very sorry for that, I know how frustrating it can be. I hope the
> above gives a clearer picture of the constraints involved. Exact
> instructions, let alone encodings, are still in flux as Bruce said.
Yes, the above information is definitely useful, even if I don't have a
complete picture yet. Thanks.
