Renato Golin via llvm-dev
2016-Nov-27 15:42 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 27 November 2016 at 13:59, Paul Walker <Paul.Walker at arm.com> wrote:
> Thanks Renato, my takeaway is that I am presenting the design out of order. So let's focus purely on the vector length (VL) and ignore everything else. For SVE the vector length is unknown and can vary across an as-yet-undetermined boundary (process, library...). Within a boundary we propose making VL a constant, with all instructions that operate on this constant locked within its boundary.

This is in line with my current understanding of SVE. Check.

> I know this stretches the meaning of constant and my reasoning (however unsound) is below. We expect changes to VL to be infrequent and not located where they would present an unnecessary barrier to optimisation. With this in mind, the initial implementation of VL barriers would be an intrinsic that prevents any instruction movement across it.
>
> Question: Is this type of intrinsic something LLVM supports today?

Function calls are natural barriers, but they should outline the parameters that cannot cross, especially if they're local, to make sure those don't cross it. In that sense, specially crafted intrinsics can get you the same behaviour, but it will be ugly.

Also, we have special-purpose barriers, e.g. @llvm.arm|aarch64.dmb, which could serve as a template for scalable-specific barriers.

> Why a constant? Well, it doesn't change within the context it is being used. More crucially, the LLVM implementation of constants gives us a property that's very important to SVE (perhaps this is where prototyping laziness has kicked in). Constants remain attached to the instructions that operate on them through until code generation. This allows the semantic meaning of these instructions to be maintained, something non-scalable vectors get for free with their "real" constants.

This makes sense. Not just because it behaves similarly, but because the back-end *must* guarantee it will be a constant within its boundaries and fail otherwise. That's up to the SVE code generator to add enough SVE-specific instructions to get that right.

> shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector ( sub (i32 VL, 1), i32 -1)
>
> Firstly I'll highlight that the use of seriesvector is purely for brevity; let's ignore that debate for now. Our concern is that not treating VL as a Constant means sub and seriesvector are no longer constant and are likely to be hoisted away from the shufflevector. The knock-on effect is to force the code generator into generating generic vector permutes rather than utilising any specialised permute instructions the target provides.

The concept looks ok.

IIGIR, your argument is that an intrinsic will not look "constant enough" to the other IR passes, which can break the constantness required to generate the correct "constant" vector.

I'm also assuming SVE has an instruction that relates to the syntax above, which will reduce the setup process from N instructions to one and will be scale-independent. Otherwise, that whole exercise is meaningless.

Something like:

    mov x2, #i
    const z0.b, p0/z, x2, 2    # From (i) to (2*VF)
    const z1.b, p0/z, x2, -1   # From (i) to (i - VF) in reverse

The undefined behaviour that will come of such instructions needs to be understood in order to not break the IR.

For example, if x2 is an unsigned variable and you iterate through the array but the array length is not a multiple of VF, the last range will pass through zero and become negative at the end. Or, if x2 is a 16-bit variable that must wrap (or saturate), the same tail issue as above happens.

> Does this make sense? I am not after agreement, just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser's chosen VF.

As much sense as is possible, I guess.

But without knowing the guarantees we're aiming for, it'll be hard to know if any of those proposals will make proper sense.

One way to make your "seriesvector" concept show up *before* any spec is out is to apply it to non-scalable vectors.

Today, we have the "zeroinitializer", which is very similar to what you want. You can even completely omit the "vscale" if we get the semantics right.

Hope that helps.

cheers,
--renato
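For illustration, here is how that suggestion might look on today's fixed-width vectors. The "seriesvector (start, step)" form below is the syntax proposed in this thread, not current LLVM IR, so this is only a sketch:

```llvm
; Current IR: reversing a fixed-width vector needs an explicit,
; length-dependent mask constant.
%rev = shufflevector <4 x i32> %a, <4 x i32> undef,
                     <4 x i32> <i32 3, i32 2, i32 1, i32 0>

; With the proposed constant, the same mask is a start/step pair,
; which would work unchanged if the element count were unknown.
%rev2 = shufflevector <4 x i32> %a, <4 x i32> undef,
                      <4 x i32> seriesvector (i32 3, i32 -1)
```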
Amara Emerson via llvm-dev
2016-Nov-27 16:51 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
Bringing the discussion back onto the IR proposals:

> One way to make your "seriesvector" concept show up *before* any spec
> is out is to apply it to non-scalable vectors.
>
> Today, we have the "zeroinitializer", which is very similar to what
> you want. You can even completely omit the "vscale" if we get the
> semantics right.

There is nothing to stop other targets from using stepvector/seriesvector. In fact, for wide vector targets, the IR constant representing a step vector is often explicitly expressed as <i32 0, i32 1, i32 2...> and so on (this gets really cumbersome when your vector length is 512 bits, for example). That could be replaced by a single "stepvector" constant, and it works the same for both fixed-length and scalable vectors.

Amara
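A sketch of the replacement described above; the zero-operand "stepvector" constant is the proposal under discussion, not existing IR, and the value names are invented for illustration:

```llvm
; Today: the step vector for sixteen i32 lanes (a 512-bit vector)
; must be spelled out lane by lane.
%iv = add <16 x i32> %splat.base,
          <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7,
           i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>

; Proposed: one constant, identical in form for fixed-length and
; scalable vectors alike.
%iv2 = add <16 x i32> %splat.base, stepvector
```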
Renato Golin via llvm-dev
2016-Nov-27 16:54 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 27 November 2016 at 16:51, Amara Emerson <amara.emerson at gmail.com> wrote:
> There is nothing to stop other targets from using
> stepvector/seriesvector. In fact for wide vector targets, often the IR
> constant for representing a step vector is explicitly expressed as
> <i32 0, i32 1, i32 2..> and so on (this gets really cumbersome when
> your vector length is 512bits for example). That could be replaced by
> a single "stepvector" constant, and it works the same for both
> fixed-length and scalable vectors.

Indeed! For this particular point, I think we should start there.

Also, on a more general comment regarding David's point about Hwacha, maybe we could get some traction on the RISC-V front, to see if the proposal is acceptable on their end, since they're likely to be using this in the future in LLVM.

Alex, any comments?

cheers,
--renato
Paul Walker via llvm-dev
2016-Nov-28 01:43 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
>> Does this make sense? I am not after agreement just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser's chosen VF.
>
> As much sense as is possible, I guess.

I'll take that. Let's move on to the relationship between scalable vectors and VL. VL is very much a hardware-centric value that we'd prefer not to expose at the IR level, beyond the requirements for a sufficiently accurate cost model.

An initial attempt to represent scalable vectors might be <n x Ty>. The problem with this approach is there's no perfect interpretation as to what the following type definitions mean:

    <n x i8>
    <n x i16>
    <n x i32>
    <n x i64>

[Interpretation 1]

A vector of "n" elements of the specified type. Here "n" is likely to be scaled based on the largest possible element type. This fits well with the following loop:

    (1) for (0..N) { bigger_type[i] += smaller_type[i]; }

but becomes inefficient when the largest element type is not required.

[Interpretation 2]

A vector full of the specified type. Here the isolated meaning of "n" means nothing without an associated element type. This fits well with the following loop:

    (2) for (0..N) { type[i] += type[i]; }

Neither interpretation is ideal, with implicit knowledge required to understand the relationship between different vector types. Our proposal is a vector type where that relationship is explicit, namely <n x M x Ty>.

Reconsidering the above loops with this type system leads to IR like:

    (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32>   ; bigger_type=i32, smaller_type=i8
    (2) <n x 16 x i8> += <n x 16 x i8>

Here the value of "n" is the same across both loops and, more importantly, the bit-width of the largest vectors within both loops is the same. The relevance of the second point is that we now have a property that can be varied based on a cost model. This results in a predictable set of types that should lead to performant code, whilst allowing types outside that range to work as expected, just like non-scalable vectors.

All that remains is the ability to reference the isolated value of the "n" in "<n x M x Ty>", which is where the "vscale" constant proposal comes in.

>> %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
>>
>> for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")
>
> I see what you mean.
>
> Quick question: Since you're saying "vscale" is an unknown constant,
> why not just:
>     %index.next = add nuw nsw i64 %index, i64 vscale

Hopefully the answer to this is now clear. Our intention is for a single constant to represent the runtime part of a scalable vector's length. Using the same loop examples from above, the induction variable updates become:

    (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
    (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)

The runtime part of the scalable vector lengths remains the same, with the second loop processing 4x the number of elements per iteration.

Does this make sense? Is this sufficient argument for the new type and associated "vscale" constant, or is there another topic that needs covering first?

As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation, with the loop vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety, we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal, but it gets the ball rolling.

Paul!!!
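To make that concrete, loop (2) above might vectorise into something like the sketch below. The <n x 16 x i8> type and the "vscale" constant are the proposed extensions (not current IR), and the surrounding value names are invented for illustration:

```llvm
; Hypothetical vectorised body for loop (2): type[i] += type[i], i8 elements.
vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %gep   = getelementptr i8, i8* %base, i64 %index
  %vptr  = bitcast i8* %gep to <n x 16 x i8>*
  %v     = load <n x 16 x i8>, <n x 16 x i8>* %vptr
  %sum   = add <n x 16 x i8> %v, %v
  store <n x 16 x i8> %sum, <n x 16 x i8>* %vptr
  ; VF is n*16, so the induction step is vscale*16.
  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
  %done = icmp uge i64 %index.next, %trip.count
  br i1 %done, label %exit, label %vector.body
```

The same skeleton with <n x 4 x i32> would step by mul (i64 vscale, i64 4), keeping the vectors' bit-widths identical across the two loops, as argued above.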
Mehdi Amini via llvm-dev
2016-Nov-28 04:25 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
> On Nov 27, 2016, at 5:43 PM, Paul Walker via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Reconsidering the above loops with this type system leads to IR like:
>
> (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32>   ; bigger_type=i32, smaller_type=i8
> (2) <n x 16 x i8> += <n x 16 x i8>

I don’t really get why the “naive” <n x Ty> wouldn't be enough for the loops you mentioned:

    1) <n x i32> += zext <n x i8> as <n x i32>
    2) <n x i8> += <n x i8>

— Mehdi
Renato Golin via llvm-dev
2016-Nov-28 09:43 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 01:43, Paul Walker <Paul.Walker at arm.com> wrote:
> Reconsidering the above loops with this type system leads to IR like:
>
> (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32>   ; bigger_type=i32, smaller_type=i8
> (2) <n x 16 x i8> += <n x 16 x i8>

Hi Paul,

I'm with Mehdi on this... these examples don't look problematic. You have shown what the different constructs would be good at, but I still can't see where they wouldn't be.

I originally thought that the extended version "<n x M x Ty>" was required because SVE needs all vector lengths to be a multiple of 128 bits, so they'd be just "glorified" NEON vectors. Without it, there is no way to make sure it will be a multiple.

> (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
>
> The runtime part of the scalable vector lengths remains the same with the second loop processing 4x the number of elements per iteration.

Right, but this is a "constant", and LLVM would be forgiven for asking its "size". With that proposal, there's no way to know if that's a <16 x i8> or <16 x i32>. The vectorizer concerns itself mostly with the number of elements, not raw sizes, but these types will survive the whole process, especially if they come from intrinsics.

> As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation with the loop-vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal but it gets the ball rolling.

Right, got that. Baby steps, safety first.

cheers,
--renato