Hi all,

During the Nov 2016 dev meeting, we had a hackers' lab session where we discussed some issues around loop idiom recognition, IR representation, and cost modelling. I took an action to write up an RFC about introducing reduction intrinsics to LLVM to represent horizontal operations across vectors.

Vector reductions have been discussed before, notably here:

  http://lists.llvm.org/pipermail/llvm-dev/2015-November/092379.html

That discussion was in the context of recognizing patterns for more complex cases like SAD and dot product, and centered around adding a new vector reduce instruction or annotating the reduction PHI to allow pattern recognition later at codegen. The motivations I'm giving here are different.

Current status
=============

Today, scalar loop reductions are recognized in IR through the LoopUtils isReductionPHI utility, which creates a recurrence descriptor after doing some analysis. The descriptor tells us what kind of reduction it is (e.g. integer add, integer or, fp add, fp mul, min/max, etc.). The vectorizer generates vector versions of these as a final reducing step after the vector loop body, doing a tree reduction across all the lanes; a sketch of this pattern is shown at the end of the next section. This approach has multiple disadvantages: the pattern is complex and difficult for other passes in LLVM and for front-ends to generate, and the shuffle-based reduction then has to be matched again in the backend.

For ARM's Scalable Vector Extension (SVE), there are more fundamental reasons that require us to move away from this approach:

1) As our vector lengths are unknown at compile time (and also not necessarily a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize these loops. SVE also provides strictly-ordered FP reductions, which cannot be expressed as a tree reduction.

2) Reduction intrinsics let us implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false}, holds. Without any ability to express reductions as intrinsics, we would need a separate instruction, for the same reasons as described for regular reductions.

3) Other, non-vectorizer, users of LLVM may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that it is matched to efficient instructions later.

Proposal
=======

We propose to introduce the following reduction intrinsics as a starting point:

  int_vector_reduce_add(vector_src)
  int_vector_reduce_fadd(vector_src)
  int_vector_reduce_and(vector_src)
  int_vector_reduce_or(vector_src)
  int_vector_reduce_eor(vector_src)
  int_vector_reduce_mul(vector_src)

These intrinsics do not do any type promotion of the scalar result. Architectures like SVE, which can do type promotion and reduction in a single instruction, can pattern match the promote->reduce sequence.

For min/max recurrence types, we have:

  int_vector_reduce_smax(vector_src)
  int_vector_reduce_smin(vector_src)
  int_vector_reduce_umax(vector_src)
  int_vector_reduce_umin(vector_src)
  int_vector_reduce_fmin(vector_src, i32 NoNaNs)
  int_vector_reduce_fmax(vector_src, i32 NoNaNs)

These reduction operations can be mapped from the recurrence kinds defined in LoopUtils; however, any front-end or pass in LLVM may use them.
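For illustration, here is an i32 add reduction over a <4 x i32> vector in both forms. The first is the log2 shuffle tree the vectorizer emits today; the second uses the proposed intrinsic, written here in IR assembly naming rather than tablegen syntax, with the type-overload mangling omitted for brevity:

  ; Current form: a log2 tree of shuffles and adds.
  %rdx.shuf = shufflevector <4 x i32> %vec, <4 x i32> undef,
                            <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx = add <4 x i32> %vec, %rdx.shuf
  %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef,
                             <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
  %sum = extractelement <4 x i32> %bin.rdx2, i32 0

  ; Proposed form: a single call, independent of the vector length.
  %red = call i32 @llvm.vector.reduce.add(<4 x i32> %vec)

Note that the log2 tree only works because the number of lanes is a known power of 2, which is exactly what SVE cannot guarantee.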
Predication
==========

We have multiple options for expressing vector predication in reductions:

1. Add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.

2. Mask out inactive lanes in the input vector by using a select between the input and a vector splat of the reduction's identity value, e.g. 0.0 for fp-add (see the sketch after the next section).

We believe option 2 will be sufficient for predication-capable architectures, while keeping the intrinsic definitions simple and minimal. If there are targets for which the identity value for a reduction is different, we could use an IR constant to express this in a generic way.

All the intrinsics above take a single vector operand and reduce it to a scalar of the same type as the vector element.

Ordered reductions
==================

SVE introduces the capability to execute reductions while maintaining sequential order; as a result, it can vectorize floating point addition reductions without requiring fast-math to be enabled. This uses a different approach from the fast reductions: instead of doing a partial reduction in the vector loop body and a final reduce-to-scalar in the loop exit block, we reduce to a scalar on every iteration of the vector loop. This requires an extra scalar accumulator operand to be passed into the reduction intrinsic. We propose a generic intrinsic to handle this as well:

  int_vector_reduce_ordered_fadd([float|double] %acc, <vec [float|double]> %input)

We can use the same select-on-input predication scheme with these as we do with the normal reductions, as sketched below.
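As a rough sketch (again in IR assembly naming with type mangling omitted, and assuming the input vector %vec and predicate %pred are computed earlier in the loop), a predicated, in-order fp-add reduction in a vector loop body would look something like:

  vector.body:
    %acc = phi float [ 0.0, %entry ], [ %acc.next, %vector.body ]
    ; ... compute the input vector %vec and the predicate %pred ...
    ; Option 2 predication: inactive lanes are replaced by the identity
    ; value for fp-add (0.0), so they cannot affect the result.
    %masked = select <4 x i1> %pred, <4 x float> %vec,
                     <4 x float> zeroinitializer
    ; The scalar accumulator is threaded through every vector iteration,
    ; preserving the sequential FP semantics.
    %acc.next = call float @llvm.vector.reduce.ordered.fadd(float %acc, <4 x float> %masked)
    ; ... loop latch ...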
Prototype patches
================

I've developed some initial patches as a proof of concept that should allow targets to opt in to using these new reductions with fairly minimal effort. The patches enable the use of the generic reduction intrinsics for the AArch64 target and, as far as I can tell, don't affect codegen in any way. The AArch64 patch cleans up a lot of the old code which was written specifically to match various forms of the log2 shuffle pattern, and as a result the tests are significantly reduced in size. The target-independent changes introduce a new transitional API, documented below, and move some duplicated code into VectorUtils (to generate shuffle patterns for targets which don't opt in). Some basic legalization is implemented, but only enough to support the cases required by AArch64; there may be some things I've missed that are exposed by other targets.

Transitional API
===============

To support both the old-style tree reduction patterns and the new intrinsics for clients, we can use a generalization of the TTI call we introduce in our SVE patches. We could provide something like:

  bool TargetTransformInfo::useReductionIntrinsic(unsigned Opcode,
                                                  Type *Ty, bool IsMaxOp,
                                                  bool IsSigned, bool NoNaN)

In the prototype patch, I've created two separate ways to generate reductions from an IR generator's perspective, depending on whether the client has a reduction descriptor (obtained from a LoopUtils analysis of the loop's PHIs). If we want to generate from a descriptor:

  Value *llvm::createTargetReduction(IRBuilder<> &Builder,
                                     const TargetTransformInfo *TTI,
                                     RecurrenceDescriptor &Desc, Value *Src,
                                     bool InOrder, bool NoNaN)

Or, if no descriptor is given, an alternative API call:

  Value *llvm::createTargetReduction(IRBuilder<> &Builder,
                                     const TargetTransformInfo *TTI,
                                     unsigned Opcode, Value *Src)

By default, these two interfaces would generate a tree reduction for targets which do not opt in to the new intrinsic representation.

What does everyone think of this? Do these new forms seem reasonable?

Amara

-------------- next part --------------
A non-text attachment was scrubbed...
Name: reductions-enable-aarch64.diff
Type: application/octet-stream
Size: 79807 bytes
Desc: reductions-enable-aarch64.diff
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170131/dff98280/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reductions-ir.diff
Type: application/octet-stream
Size: 33375 bytes
Desc: reductions-ir.diff
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170131/dff98280/attachment-0003.obj>
Hi Amara,

We also had some discussions on the SVE side of reductions on the main SVE thread, but this description is much more detailed than what we had before.

I don't want to discuss SVE specifically, as the spec is not out yet, but I think we can cover a lot of ground, get very close to SVE, and do the final step when we get there.

On 31 January 2017 at 17:27, Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 1) As our vector lengths are unknown at compile time (and also not necessarily a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize these loops. SVE also provides strictly-ordered FP reductions, which cannot be expressed as a tree reduction.

Not SVE specific, for example fast-math.

> 2) Reduction intrinsics let us implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false}, holds. Without any ability to express reductions as intrinsics, we would need a separate instruction, for the same reasons as described for regular reductions.

We should leave this for later, as this is very SVE-specific.

> 3) Other, non-vectorizer, users of LLVM may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that it is matched to efficient instructions later.

This is a minor point, since bloated IR is "efficient" if it collapses into a small number of instructions, for example dozens of shuffles into a ZIP.

The argument that the intrinsic is harder to destroy through optimisation passes is the same as in other cases of rich semantics vs. generic pattern matching, so orthogonal to this issue.

> We propose to introduce the following reduction intrinsics as a starting point:
>   int_vector_reduce_add(vector_src)

Is this a C intrinsic? Shouldn't an IR builtin be something like:

  @llvm.vector.reduce.add(...) ?

> These intrinsics do not do any type promotion of the scalar result. Architectures like SVE, which can do type promotion and reduction in a single instruction, can pattern match the promote->reduce sequence.

Yup.

> ...
>   int_vector_reduce_fmax(vector_src, i32 NoNaNs)

A large list, and it probably doesn't even cover all SVE can do, let alone other reductions. Why not simplify this into something like:

  %sum = fadd <N x float> %a, %b
  %red = call float @llvm.reduce(%sum, float %acc)

or

  %fast_red = call float @llvm.reduce(%sum)

For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

> We have multiple options for expressing vector predication in reductions:
> 1. Add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.
> 2. Mask out inactive lanes in the input vector by using a select between the input and a vector splat of the reduction's identity value, e.g. 0.0 for fp-add.
>
> We believe option 2 will be sufficient for predication-capable architectures, while keeping the intrinsic definitions simple and minimal. If there are targets for which the identity value for a reduction is different, we could use an IR constant to express this in a generic way.

I agree. I haven't followed Elena's work on the similar concept for Intel, but I vaguely remember we reached a similar conclusion for AVX512.

cheers,
--renato
+cc Simon, who's also interested in reductions for the any_true, all_true predicate vectors.

On 31 January 2017 at 20:19, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi Amara,
>
> We also had some discussions on the SVE side of reductions on the main SVE thread, but this description is much more detailed than what we had before.
>
> I don't want to discuss SVE specifically, as the spec is not out yet, but I think we can cover a lot of ground, get very close to SVE, and do the final step when we get there.

The goal of this proposal is to agree on a new representation, so we are looking at more than SVE and improving things for LLVM as a whole.

> On 31 January 2017 at 17:27, Amara Emerson via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 1) As our vector lengths are unknown at compile time (and also not necessarily a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize these loops. SVE also provides strictly-ordered FP reductions, which cannot be expressed as a tree reduction.
>
> Not SVE specific, for example fast-math.

Can you explain what you mean here? Other targets may well have ordered reductions; I can't comment on that aspect. However, fast-math vectorization is a case where you *don't* want ordered reductions, as the relaxed fp-contract rules mean that the conventional tree reduction algorithm preserves the required semantics.

>> 2) Reduction intrinsics let us implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false}, holds. Without any ability to express reductions as intrinsics, we would need a separate instruction, for the same reasons as described for regular reductions.
>
> We should leave this for later, as this is very SVE-specific.

As I stated earlier, this goes beyond SVE. I hope to achieve a consensus with other targets as well, even if some don't yet have efficient ways to handle it.

>> 3) Other, non-vectorizer, users of LLVM may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that it is matched to efficient instructions later.
>
> This is a minor point, since bloated IR is "efficient" if it collapses into a small number of instructions, for example dozens of shuffles into a ZIP.

The hassle of generating reductions may well be at most a minor motivator, but my point still stands. If a front-end wants the target to be able to generate the best code for a reduction idiom, it must generate a lot of IR for many-element vectors. You still have to pay the price in bloated IR; see the tests changed as part of the AArch64 NEON patch.

> The argument that the intrinsic is harder to destroy through optimisation passes is the same as in other cases of rich semantics vs. generic pattern matching, so orthogonal to this issue.

>> We propose to introduce the following reduction intrinsics as a starting point:
>>   int_vector_reduce_add(vector_src)
>
> Is this a C intrinsic? Shouldn't an IR builtin be something like:
>
>   @llvm.vector.reduce.add(...) ?

You're right, in IR assembly it would appear like that. The forms I proposed are the tablegen syntax of the intrinsic definitions, so in practice they would look like @llvm.vector.reduce.[operation] in IR asm.

>> These intrinsics do not do any type promotion of the scalar result. Architectures like SVE, which can do type promotion and reduction in a single instruction, can pattern match the promote->reduce sequence.
>
> Yup.
>
>> ...
>>   int_vector_reduce_fmax(vector_src, i32 NoNaNs)
>
> A large list, and it probably doesn't even cover all SVE can do, let alone other reductions. Why not simplify this into something like:
>
>   %sum = fadd <N x float> %a, %b
>   %red = call float @llvm.reduce(%sum, float %acc)
>
> or
>
>   %fast_red = call float @llvm.reduce(%sum)

Because the semantics of the operation would then depend not solely on the operand value types and the operation itself, but on the chain of computations forming the operands. If the input operand is a phi, you have to do potentially inter-block analysis. If it's a function parameter, or simply a load from memory, then you're pretty much stuck and can't resolve the semantics. During the dev meeting, a reductions proposal where the operation to be performed was a kind of opcode operand was discussed, and rejected by the community. I don't believe having many intrinsics would be a problem.
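For example (a hypothetical sketch of the generic form), if the reduction input is just a load, there is no feeding computation at all from which to recover the reduction kind:

  ; %v comes straight from memory: a generic @llvm.reduce has no way to
  ; tell whether this should be an add, a multiply, a min, a max, etc.
  %v = load <4 x float>, <4 x float>* %p
  %red = call float @llvm.reduce(<4 x float> %v)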
> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

For the same reasons that we don't re-use the other binary operator instructions like add, sub, and mul: the vector versions of those are not horizontal operations; they produce vector results instead.
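To illustrate (the reduce call below again uses the proposed IR assembly naming, with the i32 NoNaNs flag from the proposal and type mangling omitted):

  ; Element-wise min: produces another vector, one result per lane.
  %m = call <4 x float> @llvm.minnum.v4f32(<4 x float> %a, <4 x float> %b)
  ; Horizontal min: collapses all lanes of a single vector to a scalar.
  %min = call float @llvm.vector.reduce.fmin(<4 x float> %v, i32 1)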