Renato Golin via llvm-dev
2016-Nov-28 09:43 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 01:43, Paul Walker <Paul.Walker at arm.com> wrote:

> Reconsidering the above loops with this type system leads to IR like:
>
>   (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32> ; bigger_type=i32, smaller_type=i8
>   (2) <n x 16 x i8> += <n x 16 x i8>

Hi Paul,

I'm with Mehdi on this... these examples don't look problematic. You have shown what the different constructs would be good at, but I still can't see where they won't be.

I originally thought that the extended version "<n x m x Ty>" was required because SVE needs all vector lengths to be a multiple of 128 bits, so they'd just be "glorified" NEON vectors. Without it, there is no way to make sure the length will be such a multiple.

> (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
>
> The runtime part of the scalable vector lengths remains the same, with the second loop processing 4x the number of elements per iteration.

Right, but this is a "constant", and LLVM would be forgiven for asking for the "size" of it. With that proposal, there's no way to know whether it's a <16 x i8> or a <16 x i32>.

The vectorizer concerns itself mostly with the number of elements, not raw sizes, but these types will survive the whole process, especially if they come from intrinsics.

> As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation, with the loop vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety, we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal, but it gets the ball rolling.

Right, got that. Baby steps, safety first.

cheers,
--renato
Paul Walker via llvm-dev
2016-Nov-28 11:19 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
>> (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
>> (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
>>
>> The runtime part of the scalable vector lengths remains the same, with the second loop processing 4x the number of elements per iteration.

> Right, but this is a "constant", and LLVM would be forgiven for asking
> for the "size" of it. With that proposal, there's no way to know
> whether it's a <16 x i8> or a <16 x i32>.
>
> The vectorizer concerns itself mostly with the number of elements, not
> raw sizes, but these types will survive the whole process, especially
> if they come from intrinsics.

What is the relevance of the vector's element type? The induction variable update is purely in terms of elements; it doesn't care about the type. If you need to reference the vector length in bytes, you simply multiply the element count by the size of the vector's element type, just as we do for non-scalable vectors.

Paul
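The arithmetic Paul describes can be sketched numerically. This is my own illustration (not code from the RFC): "vscale" is the hardware's runtime scaling factor, fixed per implementation but unknown at compile time, and the concrete value below is an assumption for demonstration.

```python
def index_next(index, vscale, m):
    """Models: %index.next = add i64 %index, mul (i64 vscale, i64 m)."""
    return index + vscale * m

def byte_stride(vscale, m, elem_size_bytes):
    """Vector length in bytes = element count * element size."""
    return vscale * m * elem_size_bytes

# Hypothetical 512-bit SVE implementation: vscale = 4 (4 x 128 bits).
vscale = 4
print(index_next(0, vscale, 4))    # loop (1): 16 i32 elements per iteration
print(index_next(0, vscale, 16))   # loop (2): 64 i8 elements per iteration
print(byte_stride(vscale, 4, 4))   # 64 bytes...
print(byte_stride(vscale, 16, 1))  # ...the same register length in bytes
```

Note that both loops advance through memory by the same number of bytes per iteration; only the element counts differ, which is why the update never needs the element type.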
Paul Walker via llvm-dev
2016-Nov-28 12:02 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
>> An initial attempt to represent scalable vectors might be <n x Ty>. The problem with this approach is that there's no perfect interpretation as to what the following type definitions mean:
>>
>>   <n x i8>
>>   <n x i16>
>>   <n x i32>
>>   <n x i64>
>>
>> [Interpretation 1]
>>
>> A vector of "n" elements of the specified type. Here "n" is likely to be scaled based on the largest possible element type. This fits well with the following loop:
>>
>>   (1) for (0..N) { bigger_type[i] += smaller_type[i]; }
>>
>> but becomes inefficient when the largest element type is not required.
>>
>> [Interpretation 2]
>>
>> A vector full of the specified type. Here the isolated meaning of "n" means nothing without an associated element type. This fits well with the following loop:
>>
>>   (2) for (0..N) { type[i] += type[i]; }

> I'm with Mehdi on this... these examples don't look problematic. You
> have shown what the different constructs would be good at, but I still
> can't see where they won't be.

I'll apply the loops to their opposite interpretation, assuming bigger_type=i64, smaller_type=type=i8:

[Interpretation 1]

  (2) for (0..N) { bytes[i] += other_bytes[i]; }   ====> <n x i8> += <n x i8>
  (2) for (0..N) { int64s[i] += other_int64s[i]; } ====> <n x i64> += <n x i64>

Because this interpretation requires "n" to be the same for all scalable vectors, the int64 loop clearly involves vectors that are 8x bigger than the byte loop's. Structurally this is fine from the IR's point of view, but in hardware they'll operate on vectors of the same length. The code generator will either split the int64 loop's instructions, thus planting 8 adds, or promote the byte loop's instructions, thus utilising only an 8th of the lanes.

[Interpretation 2]

  (1) for (0..N) { int64s[i] += bytes[i]; }   ==> <n x i64> += zext <????? x i8> as <n x i64>

This interpretation falls down at the IR level.
If <n x i8> represents a vector full of bytes, how do you represent a vector that's an 8th full of bytes, ready to be zero-extended?

> I originally thought that the extended version "<n x m x Ty>" was
> required because SVE needs all vector lengths to be a multiple of
> 128 bits, so they'd just be "glorified" NEON vectors. Without it,
> there is no way to make sure the length will be such a multiple.

Surely this is true of most vector architectures, hence the reason for costing vectors across a range of element counts to determine which the code generator likes best. Scalable vectors are no different, with SVE's cost model preferring scalable vectors whose statically known length component (i.e. "M x sizeof(Ty)") is 128 bits, because they'll better match the way the code generator models SVE registers.

Paul
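The cost of interpretation 1 can be made concrete with a little arithmetic. This is an illustrative sketch of my own (not from the RFC), with a hypothetical value of "n":

```python
def vector_bits(n, elem_bits):
    """Bit-width of <n x iK> when "n" is shared by all scalable vectors."""
    return n * elem_bits

# Hypothetical: "n" is scaled for the largest element type (i64), so a
# 512-bit register would give n = 8.
n = 8
byte_loop_bits = vector_bits(n, 8)    # <n x i8>  += <n x i8>
int64_loop_bits = vector_bits(n, 64)  # <n x i64> += <n x i64>

# The int64 vectors are 8x bigger than the byte vectors, even though the
# hardware registers are the same length: the code generator must either
# split each i64 op into 8 native adds, or promote the byte ops and use
# only an 8th of the lanes.
print(int64_loop_bits // byte_loop_bits)
```

The mismatch factor is always sizeof(bigger_type) / sizeof(smaller_type), independent of the register length, which is why fixing "n" across element types is structurally fine in IR but wasteful in hardware.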
Renato Golin via llvm-dev
2016-Nov-28 14:28 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 11:19, Paul Walker <Paul.Walker at arm.com> wrote:

> What is the relevance of the vector's element type? The induction variable update is purely in terms of elements; it doesn't care about the type. If you need to reference the vector length in bytes, you simply multiply the element count by the size of the vector's element type, just as we do for non-scalable vectors.

For pointer inductions, you have to add the total size, not the index count. Wouldn't that need the final vector size?

I'm just trying to figure out if there's any pass that needs to know the vector's actual length. I'm not saying there is... :)

cheers,
--renato
Renato Golin via llvm-dev
2016-Nov-28 14:36 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 12:02, Paul Walker <Paul.Walker at arm.com> wrote:

> (1) for (0..N) { int64s[i] += bytes[i]; }   ==> <n x i64> += zext <????? x i8> as <n x i64>
>
> This interpretation falls down at the IR level. If <n x i8> represents a vector full of bytes, how do you represent a vector that's an 8th full of bytes, ready to be zero-extended?

Right, of course! A <n x i8> vector can occupy any number of lanes. So, for vscale = 4, <4 x 4 x i8> would use 16 lanes (out of a possible 64), while <4 x 16 x i8> would use all 64 lanes. The instructions that are needed are also different: an extend + copy, or just a copy.

All that matters here is the actual number of lanes, which is directly obtained as (n * m) from <n x m x Ty>. If the number of lanes is different, and the types can be converted (extend/truncate), then you'll need additional pre-ops to fudge the data between the moves / ops.

I think I'm getting the idea now. :)

cheers,
--renato
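The lane counts in the reply above can be checked with a one-line calculation. A minimal sketch (my own, assuming the vscale = 4 example from the email):

```python
def lanes_used(vscale, m):
    """For <n x m x Ty>, lanes in use = n * m, where n = vscale at run time."""
    return vscale * m

vscale = 4                        # e.g. a 512-bit SVE implementation
total_i8_lanes = vscale * 16      # a 512-bit register holds 64 i8 lanes

print(lanes_used(vscale, 4))      # <4 x 4 x i8>: 16 lanes
print(lanes_used(vscale, 16))     # <4 x 16 x i8>: all 64 lanes
print(total_i8_lanes)             # 64
```

This is why <n x m x Ty> resolves the ambiguity: the statically known "m" pins down the fraction of the register in use, while "n" carries the runtime scaling.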