Renato Golin via llvm-dev
2016-Nov-27 15:42 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 27 November 2016 at 13:59, Paul Walker <Paul.Walker at arm.com> wrote:
> Thanks Renato, my takeaway is that I am presenting the design out of order. So let's focus purely on the vector length (VL) and ignore everything else. For SVE the vector length is unknown and can vary across an as-yet-undetermined boundary (process, library...). Within a boundary we propose making VL a constant, with all instructions that operate on this constant locked within its boundary.

This is in line with my current understanding of SVE. Check.

> I know this stretches the meaning of constant and my reasoning (however unsound) is below. We expect changes to VL to be infrequent and not located where they would present an unnecessary barrier to optimisation. With this in mind, the initial implementation of VL barriers would be an intrinsic that prevents any instruction movement across it.
>
> Question: Is this type of intrinsic something LLVM supports today?

Function calls are natural barriers, but they should outline the parameters that cannot cross, especially if they're local, to make sure those don't cross it. In that sense, specially crafted intrinsics can get you the same behaviour, but it will be ugly.

Also, we have special-purpose barriers, e.g. @llvm.arm|aarch64.dmb, which could serve as a template for scalable-specific barriers.

> Why a constant? Well, it doesn't change within the context it is being used. More crucially, the LLVM implementation of constants gives us a property that's very important to SVE (perhaps this is where prototyping laziness has kicked in). Constants remain attached to the instructions that operate on them through until code generation. This allows the semantic meaning of these instructions to be maintained, something non-scalable vectors get for free with their "real" constants.

This makes sense. Not just because it behaves similarly, but because the back-end *must* guarantee it will be a constant within its boundaries and fail otherwise. That's up to the SVE code generator to add enough SVE-specific instructions to get that right.

> shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector ( sub (i32 VL, 1), i32 -1)
>
> Firstly I'll highlight that the use of seriesvector is purely for brevity; let's ignore that debate for now. Our concern is that not treating VL as a Constant means sub and seriesvector are no longer constant and are likely to be hoisted away from the shufflevector. The knock-on effect is to force the code generator into generating generic vector permutes rather than utilising any specialised permute instructions the target provides.

The concept looks ok.

IIGIR, your argument is that an intrinsic will not look "constant enough" to the other IR passes, which can break the constantness required to generate the correct "constant" vector.

I'm also assuming SVE has an instruction that relates to the syntax above, which will reduce the setup process from N instructions to one and will be scale-independent. Otherwise, that whole exercise is meaningless.

Something like:

    mov x2, #i
    const z0.b, p0/z, x2, 2    # From (i) to (2*VF)
    const z1.b, p0/z, x2, -1   # From (i) to (i - VF) in reverse

The undefined behaviour that will come of such instructions needs to be understood in order to not break the IR.

For example, if x2 is an unsigned variable and you iterate through the array but the array length is not a multiple of VF, the last range will pass through zero and become negative at the end. Or, if x2 is a 16-bit variable that must wrap (or saturate), the same tail issue as above happens.

> Does this make sense? I am not after agreement, just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser's chosen VF.

As much sense as is possible, I guess.

But without knowing the guarantees we're aiming for, it'll be hard to know if any of those proposals will make proper sense.

One way to make your "seriesvector" concept show up *before* any spec is out is to apply it to non-scalable vectors.

Today, we have the "zeroinitializer", which is very similar to what you want. You can even completely omit the "vscale" if we get the semantics right.

Hope that helps.

cheers,
--renato
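For illustration, here is how that suggestion might look on today's fixed-width vectors. The "seriesvector (start, step)" form below is the syntax proposed in this thread, not current LLVM IR, so this is only a sketch:

```llvm
; Current IR: reversing a fixed-width vector needs an explicit,
; length-dependent mask constant.
%rev = shufflevector <4 x i32> %a, <4 x i32> undef,
                     <4 x i32> <i32 3, i32 2, i32 1, i32 0>

; With the proposed constant, the same mask is a start/step pair,
; which would work unchanged if the element count were unknown.
%rev2 = shufflevector <4 x i32> %a, <4 x i32> undef,
                      <4 x i32> seriesvector (i32 3, i32 -1)
```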
Amara Emerson via llvm-dev
2016-Nov-27 16:51 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
Bringing the discussion back onto the IR proposals:

> One way to make your "seriesvector" concept show up *before* any spec
> is out is to apply it to non-scalable vectors.
>
> Today, we have the "zeroinitializer", which is very similar to what
> you want. You can even completely omit the "vscale" if we get the
> semantics right.

There is nothing to stop other targets from using stepvector/seriesvector. In fact, for wide vector targets, the IR constant representing a step vector is often explicitly expressed as <i32 0, i32 1, i32 2...> and so on (this gets really cumbersome when your vector length is 512 bits, for example). That could be replaced by a single "stepvector" constant, and it works the same for both fixed-length and scalable vectors.

Amara
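A sketch of the replacement described above; the zero-operand "stepvector" constant is the proposal under discussion, not existing IR, and the value names are invented for illustration:

```llvm
; Today: the step vector for sixteen i32 lanes (a 512-bit vector)
; must be spelled out lane by lane.
%iv = add <16 x i32> %splat.base,
          <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7,
           i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>

; Proposed: one constant, identical in form for fixed-length and
; scalable vectors alike.
%iv2 = add <16 x i32> %splat.base, stepvector
```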
Renato Golin via llvm-dev
2016-Nov-27 16:54 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 27 November 2016 at 16:51, Amara Emerson <amara.emerson at gmail.com> wrote:
> There is nothing to stop other targets from using
> stepvector/seriesvector. In fact for wide vector targets, often the IR
> constant for representing a step vector is explicitly expressed as
> <i32 0, i32 1, i32 2..> and so on (this gets really cumbersome when
> your vector length is 512bits for example). That could be replaced by
> a single "stepvector" constant, and it works the same for both
> fixed-length and scalable vectors.

Indeed! For this particular point, I think we should start there.

Also, on a more general comment regarding David's point about Hwacha, maybe we could get some traction on the RISC-V front, to see if the proposal is acceptable on their end, since they're likely to be using this in the future in LLVM.

Alex, any comments?

cheers,
--renato
Paul Walker via llvm-dev
2016-Nov-28 01:43 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
>> Does this make sense? I am not after agreement just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser's chosen VF.
>
> As much sense as is possible, I guess.

I'll take that. Let's move on to the relationship between scalable vectors and VL. VL is very much a hardware-centric value that we'd prefer not to expose at the IR level, beyond the requirements for a sufficiently accurate cost model.

An initial attempt to represent scalable vectors might be <n x Ty>. The problem with this approach is there's no perfect interpretation as to what the following type definitions mean:

    <n x i8>
    <n x i16>
    <n x i32>
    <n x i64>

[Interpretation 1]

A vector of "n" elements of the specified type. Here "n" is likely to be scaled based on the largest possible element type. This fits well with the following loop:

    (1) for (0..N) { bigger_type[i] += smaller_type[i]; }

but becomes inefficient when the largest element type is not required.

[Interpretation 2]

A vector full of the specified type. Here the isolated meaning of "n" means nothing without an associated element type. This fits well with the following loop:

    (2) for (0..N) { type[i] += type[i]; }

Neither interpretation is ideal, with implicit knowledge required to understand the relationship between different vector types. Our proposal is a vector type where that relationship is explicit, namely <n x M x Ty>.

Reconsidering the above loops with this type system leads to IR like:

    (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32>   ; bigger_type=i32, smaller_type=i8
    (2) <n x 16 x i8> += <n x 16 x i8>

Here the value of "n" is the same across both loops and, more importantly, the bit-width of the largest vectors within both loops is the same. The relevance of the second point is that we now have a property that can be varied based on a cost model. This results in a predictable set of types that should lead to performant code, whilst allowing types outside that range to work as expected, just like non-scalable vectors.

All that remains is the ability to reference the isolated value of the "n" in "<n x M x Ty>", which is where the "vscale" constant proposal comes in.

>> %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
>>
>> for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")
>
> I see what you mean.
>
> Quick question: Since you're saying "vscale" is an unknown constant,
> why not just:
>     %index.next = add nuw nsw i64 %index, i64 vscale

Hopefully the answer to this is now clear. Our intention is for a single constant to represent the runtime part of a scalable vector's length. Using the same loop examples from above, the induction variable updates become:

    (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
    (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)

The runtime part of the scalable vector lengths remains the same, with the second loop processing 4x the number of elements per iteration.

Does this make sense? Is this sufficient argument for the new type and associated "vscale" constant, or is there another topic that needs covering first?

As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation, with the loop vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety, we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal, but it gets the ball rolling.

Paul!!!
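To make that concrete, loop (2) above might vectorise into something like the sketch below. The <n x 16 x i8> type and the "vscale" constant are the proposed extensions (not current IR), and the surrounding value names are invented for illustration:

```llvm
; Hypothetical vectorised body for loop (2): type[i] += type[i], i8 elements.
vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %gep   = getelementptr i8, i8* %base, i64 %index
  %vptr  = bitcast i8* %gep to <n x 16 x i8>*
  %v     = load <n x 16 x i8>, <n x 16 x i8>* %vptr
  %sum   = add <n x 16 x i8> %v, %v
  store <n x 16 x i8> %sum, <n x 16 x i8>* %vptr
  ; VF is n*16, so the induction step is vscale*16.
  %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
  %done = icmp uge i64 %index.next, %trip.count
  br i1 %done, label %exit, label %vector.body
```

The same skeleton with <n x 4 x i32> would step by mul (i64 vscale, i64 4), keeping the vectors' bit-widths identical across the two loops, as argued above.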
Mehdi Amini via llvm-dev
2016-Nov-28 04:25 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
> On Nov 27, 2016, at 5:43 PM, Paul Walker via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Reconsidering the above loops with this type system leads to IR like:
>
> (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32>   ; bigger_type=i32, smaller_type=i8
> (2) <n x 16 x i8> += <n x 16 x i8>

I don’t really get why the “naive” <n x Ty> wouldn't be enough for the loops you mentioned:

    1) <n x i32> += zext <n x i8> as <n x i32>
    2) <n x i8> += <n x i8>

— Mehdi
Renato Golin via llvm-dev
2016-Nov-28 09:43 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 28 November 2016 at 01:43, Paul Walker <Paul.Walker at arm.com> wrote:
> Reconsidering the above loops with this type system leads to IR like:
>
> (1) <n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32>   ; bigger_type=i32, smaller_type=i8
> (2) <n x 16 x i8> += <n x 16 x i8>

Hi Paul,

I'm with Mehdi on this... these examples don't look problematic. You have shown what the different constructs would be good at, but I still can't see where they wouldn't be.

I originally thought that the extended version "<n x M x Ty>" was required because SVE needs all vector lengths to be a multiple of 128 bits, so they'd be just "glorified" NEON vectors. Without it, there is no way to make sure it will be a multiple.

> (1) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> (2) %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)
>
> The runtime part of the scalable vector lengths remains the same with the second loop processing 4x the number of elements per iteration.

Right, but this is a "constant", and LLVM would be forgiven for asking its "size". With that proposal, there's no way to know if that's a <16 x i8> or <16 x i32>. The vectorizer concerns itself mostly with the number of elements, not raw sizes, but these types will survive the whole process, especially if they come from intrinsics.

> As an aside, note that I am not describing a new style of vectorisation here. SVE is perfectly capable of non-predicated vectorisation with the loop-vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors. The exception is that if a strict VF is required to maintain safety we can simply fall back to non-scalable vectors that target Neon. Obviously not ideal but it gets the ball rolling.

Right, got that. Baby steps, safety first.

cheers,
--renato