Renato Golin via llvm-dev
2016-Nov-28 09:43 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 01:43, Paul Walker <Paul.Walker at arm.com> wrote:

> Reconsidering the above loops with this type system leads to IR like:
>
>   (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32> ; bigger_type=i32, smaller_type=i8
>   (2) <n x 16 x i8> += <n x 16 x i8>

Hi Paul,

I'm with Mehdi on this... these examples don't look problematic. You have shown what the different constructs would be good at, but I still can't see where they won't be.

I originally thought that the extended version "<n x m x Ty>" was required because SVE needs all vector lengths to be a multiple of 128 bits, so they'd just be "glorified" NEON vectors. Without it, there is no way to make sure the length will be such a multiple.

> (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
>
> The runtime part of the scalable vector lengths remains the same, with the second loop processing 4x the number of elements per iteration.

Right, but this is a "constant", and LLVM would be forgiven for asking for the "size" of it. With that proposal, there's no way to know whether it's a <16 x i8> or a <16 x i32>.

The vectorizer concerns itself mostly with the number of elements, not raw sizes, but these types will survive the whole process, especially if they come from intrinsics.

> As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation, with the loop vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety, we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal, but it gets the ball rolling.

Right, got that. Baby steps, safety first.

cheers,
--renato
Paul Walker via llvm-dev
2016-Nov-28 11:19 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
>> (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
>> (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
>>
>> The runtime part of the scalable vector lengths remains the same, with the second loop processing 4x the number of elements per iteration.

> Right, but this is a "constant", and LLVM would be forgiven for asking
> for the "size" of it. With that proposal, there's no way to know
> whether it's a <16 x i8> or a <16 x i32>.
>
> The vectorizer concerns itself mostly with the number of elements, not
> raw sizes, but these types will survive the whole process, especially
> if they come from intrinsics.

What is the relevance of the vector's element type? The induction variable update is purely in terms of elements; it doesn't care about the type. If you need to reference the vector length in bytes, you simply multiply the element count by the size of the vector's element type, just as we do for non-scalable vectors.

Paul
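The arithmetic Paul describes can be sketched numerically. This is my own illustration (not code from the RFC): "vscale" is the hardware's runtime scaling factor, fixed per implementation but unknown at compile time, and the concrete value below is an assumption for demonstration.

```python
def index_next(index, vscale, m):
    """Models: %index.next = add i64 %index, mul (i64 vscale, i64 m)."""
    return index + vscale * m

def byte_stride(vscale, m, elem_size_bytes):
    """Vector length in bytes = element count * element size."""
    return vscale * m * elem_size_bytes

# Hypothetical 512-bit SVE implementation: vscale = 4 (4 x 128 bits).
vscale = 4
print(index_next(0, vscale, 4))    # loop (1): 16 i32 elements per iteration
print(index_next(0, vscale, 16))   # loop (2): 64 i8 elements per iteration
print(byte_stride(vscale, 4, 4))   # 64 bytes...
print(byte_stride(vscale, 16, 1))  # ...the same register length in bytes
```

Note that both loops advance through memory by the same number of bytes per iteration; only the element counts differ, which is why the update never needs the element type.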
Paul Walker via llvm-dev
2016-Nov-28 12:02 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
>> An initial attempt to represent scalable vectors might be <n x Ty>. The problem with this approach is that there's no perfect interpretation as to what the following type definitions mean:
>>
>>   <n x i8>
>>   <n x i16>
>>   <n x i32>
>>   <n x i64>
>>
>> [Interpretation 1]
>>
>> A vector of "n" elements of the specified type. Here "n" is likely to be scaled based on the largest possible element type. This fits well with the following loop:
>>
>>   (1) for (0..N) { bigger_type[i] += smaller_type[i]; }
>>
>> but becomes inefficient when the largest element type is not required.
>>
>> [Interpretation 2]
>>
>> A vector full of the specified type. Here the isolated meaning of "n" means nothing without an associated element type. This fits well with the following loop:
>>
>>   (2) for (0..N) { type[i] += type[i]; }

> I'm with Mehdi on this... these examples don't look problematic. You
> have shown what the different constructs would be good at, but I still
> can't see where they won't be.

I'll apply the loops to their opposite interpretation, assuming bigger_type=i64, smaller_type=type=i8:

[Interpretation 1]

  (2) for (0..N) { bytes[i] += other_bytes[i]; }   ====> <n x i8> += <n x i8>
  (2) for (0..N) { int64s[i] += other_int64s[i]; } ====> <n x i64> += <n x i64>

Because this interpretation requires "n" to be the same for all scalable vectors, the int64 loop clearly involves vectors that are 8x bigger than the byte loop's. Structurally this is fine from the IR's point of view, but in hardware they'll operate on vectors of the same length. The code generator will either split the int64 loop's instructions, thus planting 8 adds, or promote the byte loop's instructions, thus utilising only an 8th of the lanes.

[Interpretation 2]

  (1) for (0..N) { int64s[i] += bytes[i]; }   ==> <n x i64> += zext <????? x i8> as <n x i64>

This interpretation falls down at the IR level.
If <n x i8> represents a vector full of bytes, how do you represent a vector that's an 8th full of bytes, ready to be zero-extended?

> I originally thought that the extended version "<n x m x Ty>" was
> required because SVE needs all vector lengths to be a multiple of
> 128 bits, so they'd just be "glorified" NEON vectors. Without it,
> there is no way to make sure the length will be such a multiple.

Surely this is true of most vector architectures, hence the reason for costing vectors across a range of element counts to determine which the code generator likes best. Scalable vectors are no different, with SVE's cost model preferring scalable vectors whose statically known length component (i.e. "M x sizeof(Ty)") is 128 bits, because they'll better match the way the code generator models SVE registers.

Paul
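The cost of interpretation 1 can be made concrete with a little arithmetic. This is an illustrative sketch of my own (not from the RFC), with a hypothetical value of "n":

```python
def vector_bits(n, elem_bits):
    """Bit-width of <n x iK> when "n" is shared by all scalable vectors."""
    return n * elem_bits

# Hypothetical: "n" is scaled for the largest element type (i64), so a
# 512-bit register would give n = 8.
n = 8
byte_loop_bits = vector_bits(n, 8)    # <n x i8>  += <n x i8>
int64_loop_bits = vector_bits(n, 64)  # <n x i64> += <n x i64>

# The int64 vectors are 8x bigger than the byte vectors, even though the
# hardware registers are the same length: the code generator must either
# split each i64 op into 8 native adds, or promote the byte ops and use
# only an 8th of the lanes.
print(int64_loop_bits // byte_loop_bits)
```

The mismatch factor is always sizeof(bigger_type) / sizeof(smaller_type), independent of the register length, which is why fixing "n" across element types is structurally fine in IR but wasteful in hardware.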
Renato Golin via llvm-dev
2016-Nov-28 14:28 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 11:19, Paul Walker <Paul.Walker at arm.com> wrote:

> What is the relevance of the vector's element type? The induction variable update is purely in terms of elements; it doesn't care about the type. If you need to reference the vector length in bytes, you simply multiply the element count by the size of the vector's element type, just as we do for non-scalable vectors.

For pointer inductions, you have to add the total size, not the index count. Wouldn't that need the final vector size?

I'm just trying to figure out if there's any pass that needs to know the vector's actual length. I'm not saying there is... :)

cheers,
--renato
Renato Golin via llvm-dev
2016-Nov-28 14:36 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 12:02, Paul Walker <Paul.Walker at arm.com> wrote:

> (1) for (0..N) { int64s[i] += bytes[i]; }   ==> <n x i64> += zext <????? x i8> as <n x i64>
>
> This interpretation falls down at the IR level. If <n x i8> represents a vector full of bytes, how do you represent a vector that's an 8th full of bytes, ready to be zero-extended?

Right, of course! A <n x i8> vector can occupy any number of lanes. So, for vscale = 4, <4 x 4 x i8> would use 16 lanes (out of a possible 64), while <4 x 16 x i8> would use all 64 lanes. The instructions that are needed are also different: an extend + copy, or just a copy.

All that matters here is the actual number of lanes, which is directly obtained as (n * m) from <n x m x Ty>. If the number of lanes is different, and the types can be converted (extend/truncate), then you'll need additional pre-ops to fudge the data between the moves / ops.

I think I'm getting the idea now. :)

cheers,
--renato
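The lane counts in the reply above can be checked with a one-line calculation. A minimal sketch (my own, assuming the vscale = 4 example from the email):

```python
def lanes_used(vscale, m):
    """For <n x m x Ty>, lanes in use = n * m, where n = vscale at run time."""
    return vscale * m

vscale = 4                        # e.g. a 512-bit SVE implementation
total_i8_lanes = vscale * 16      # a 512-bit register holds 64 i8 lanes

print(lanes_used(vscale, 4))      # <4 x 4 x i8>: 16 lanes
print(lanes_used(vscale, 16))     # <4 x 16 x i8>: all 64 lanes
print(total_i8_lanes)             # 64
```

This is why <n x m x Ty> resolves the ambiguity: the statically known "m" pins down the fraction of the register in use, while "n" carries the runtime scaling.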