+cc Simon, who's also interested in reductions for the any_true, all_true predicate vectors.

On 31 January 2017 at 20:19, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi Amara,
>
> We also had some discussions on the SVE side of reductions on the main
> SVE thread, but this description is much more detailed than we had
> before.
>
> I don't want to discuss specifically about SVE, as the spec is not out
> yet, but I think we can cover a lot of ground until very close to SVE
> and do the final step when we get there.

The goal of this proposal is to agree on a new representation, so we are looking at more than SVE and improving things for LLVM as a whole.

> On 31 January 2017 at 17:27, Amara Emerson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> 1) As our vector lengths are unknown at compile time (and also not a
>> power of 2), we cannot generate the same reduction IR pattern as other
>> targets. We need a direct IR representation of reductions in order to
>> vectorize them. SVE also contains strictly ordered FP reductions,
>> which cannot be done using a tree reduction.
>
> Not SVE specific, for example fast-math.

Can you explain what you mean here? Other targets may well have ordered reductions, I can't comment on that aspect. However, fast-math vectorization is a case where you *don't* want ordered reductions, as the relaxed fp-contract means that the conventional tree-reduction algorithm preserves the required semantics.

>> 2) That we can use reduction intrinsics to implement our proposed SVE
>> "test" instruction, intended to check whether a property of a
>> predicate, {first,last,any,all}x{true,false}, holds. Without any
>> ability to express reductions as intrinsics, we need a separate
>> instruction for the same reasons as described for regular reductions.
>
> We should leave this for later, as this is very SVE-specific.

As I stated earlier, this is going beyond SVE. I hope to achieve a consensus between other targets as well, even if some don't yet have efficient ways to handle it.

>> 3) Other, non-vectorizer, users of LLVM may want to generate vector
>> code themselves. Front-ends which want to do this currently must deal
>> with the difficulties of generating the tree-shuffle pattern to ensure
>> that they're matched to efficient instructions later.
>
> This is a minor point, since bloated IR is "efficient" if it collapses
> into a small number of instructions, for example dozens of shuffles
> into a ZIP.

The hassle of generating reductions may well be at most a minor motivator, but my point still stands. If a front-end wants the target to be able to generate the best code for a reduction idiom, it must generate a lot of IR for many-element vectors. You still have to pay the price in bloated IR; see the tests changed as part of the AArch64 NEON patch.

> The argument that the intrinsic is harder to destroy through
> optimisation passes is the same as other cases of stiff rich semantics
> vs. generic pattern matching, so orthogonal to this issue.
>
>> We propose to introduce the following reduction intrinsics as a
>> starting point:
>> int_vector_reduce_add(vector_src)
>
> Is this a C intrinsic? Shouldn't an IR builtin be something like:
>
> @llvm.vector.reduce.add(...) ?

You're right, in IR assembly it would appear like that. The ones I proposed were the tablegen syntax of the intrinsics definitions, so in practice they would look like @llvm.vector.reduce.[operation] in IR asm.

>> These intrinsics do not do any type promotion of the scalar result.
>> Architectures like SVE, which can do type promotion and reduction in a
>> single instruction, can pattern-match the promote->reduce sequence.
>
> Yup.
>
>> ...
>> int_vector_reduce_fmax(vector_src, i32 NoNaNs)
>
> A large list, and probably doesn't even cover all SVE can do, let
> alone other reductions.
>
> Why not simplify this into something like:
>
> %sum = fadd <N x float> %a, %b
> %red = @llvm.reduce(%sum, float %acc)
> or
> %fast_red = @llvm.reduce(%sum)

Because the semantics of an operation would not depend solely on the operand value types and the operation, but on a chain of computations forming the operands. If the input operand is a phi, you then have to do potentially inter-block analysis. If it's a function parameter or simply a load from memory, then you're pretty much stuck and you can't resolve the semantics. During the dev meeting, a reductions proposal where the operation to be performed was a kind of opcode was discussed, and rejected by the community. I don't believe having many intrinsics would be a problem.

> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

For the same reasons that we don't re-use the other binary operator instructions like add, sub, mul. The vector versions of those are not horizontal operations; they instead produce vector results.

>> We have multiple options for expressing vector predication in reductions:
>> 1. The first is to simply add a predicate operand to the intrinsics,
>> and require that targets without predication explicitly pattern-match
>> for an all-true predicate in order to select hardware instructions.
>> 2. The second option is to mask out inactive lanes in the input vector
>> by using a select between the input and a vector splat of the
>> reduction's identity value, e.g. 0.0 for fp-add.
>>
>> We believe option 2 will be sufficient for predication-capable
>> architectures, while keeping the intrinsics definitions simple and
>> minimal. If there are targets for which the identity value for a
>> reduction is different, then we could use an IR constant to express
>> this in a generic way.
>
> I agree. I haven't followed Elena's work on the similar concept for
> Intel, but I vaguely remember we reached a similar conclusion for
> AVX512.
>
> cheers,
> --renato
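To make option 2 concrete, here is a minimal sketch of the select-based masking in IR, using the @llvm.vector.reduce.* naming discussed above (the exact name and signature are illustrative, not fixed by this proposal):

  ; Inactive lanes are replaced with the reduction's identity value
  ; (0.0 for fp-add, per the proposal), so they cannot affect the result.
  %masked = select <4 x i1> %mask, <4 x float> %v, <4 x float> zeroinitializer
  %sum = call float @llvm.vector.reduce.fadd(<4 x float> %masked)

A target without predication simply sees an ordinary unpredicated reduction, while a predicating target can fold the select into the mask operand of its native instruction.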
+Elena, as she has done a lot of work for AVX512, which has similar concepts.

On 31 January 2017 at 23:16, Amara Emerson <amara.emerson at gmail.com> wrote:
>> Not SVE specific, for example fast-math.
>
> Can you explain what you mean here? Other targets may well have
> ordered reductions, I can't comment on that aspect. However, fast-math
> vectorization is a case where you *don't* want ordered reductions, as
> the relaxed fp-contract means that the conventional tree-reduction
> algorithm preserves the required semantics.

That's my point. Fast-math can change the target's semantics regarding reductions, independently of scalable vectors, so it's worth discussing in a more general case, i.e. in this thread. Sorry for being terse.

> The hassle of generating reductions may well be at most a minor
> motivator, but my point still stands. If a front-end wants the target
> to be able to generate the best code for a reduction idiom, it must
> generate a lot of IR for many-element vectors. You still have to pay
> the price in bloated IR; see the tests changed as part of the AArch64
> NEON patch.

This is a completely orthogonal discussion, as I stated here:

>> The argument that the intrinsic is harder to destroy through
>> optimisation passes is the same as other cases of stiff rich
>> semantics vs. generic pattern matching, so orthogonal to this issue.

One that we have had multiple times, and the usual consensus is: if it can be represented in plain IR, it must. Adding multiple semantics for the same concept, especially stiff ones like builtins, adds complexity to the optimiser. Regardless of the merits in this case, builtins should only be introduced IFF there is no other way. So first we should discuss adding it to IR with generic concepts, just like we did for scatter/gather and strided access.

>> Why not simplify this into something like:
>>
>> %sum = fadd <N x float> %a, %b
>> %red = @llvm.reduce(%sum, float %acc)
>> or
>> %fast_red = @llvm.reduce(%sum)
>
> Because the semantics of an operation would not depend solely on the
> operand value types and the operation, but on a chain of computations
> forming the operands. If the input operand is a phi, you then have to
> do potentially inter-block analysis. If it's a function parameter or
> simply a load from memory, then you're pretty much stuck and you can't
> resolve the semantics.

I think you have just described the pattern-matching algorithm, meaning it's possible to write that in a sequence of IR instructions, thus using add+reduce should work. Same with pointer types and other reduction operations. If the argument comes from a function parameter that is a non-strict pointer to memory, then all bets are off anyway and the front-end wouldn't be able to generate anything more specific, unless you're using SIMD intrinsics, in which case this point is moot.

> During the dev meeting, a reductions proposal where the operation to
> be performed was a kind of opcode was discussed, and rejected by the
> community.

Well, that was certainly a smaller group than the list. Design decisions should not be taken off list, so we must have this discussion on the list again, I'm afraid.

> I don't believe having many intrinsics would be a problem.

This is against every decision I remember. Saying it out loud in a meeting is one thing; writing them down, implementing them, and having to bear the maintenance costs is another entirely. That's why the consensus has to happen on the list.

>> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?
>
> For the same reasons that we don't re-use the other binary operator
> instructions like add, sub, mul. The vector versions of those are not
> horizontal operations; they instead produce vector results.

Sorry, I meant min/max + reduce, just like above:

%sum = fadd <N x float> %a, %b
%min = @llvm.minnum(<N x float> %sum)
%red = @llvm.reduce(%min, float %acc)

cheers,
--renato
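For reference, the "plain IR" form both sides are referring to is the log2(N)-step shuffle tree that the vectorizer emits today. For a <4 x float> sum it looks roughly like this:

  ; Round 1: fold the high half onto the low half.
  %rdx1 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %sum1 = fadd <4 x float> %v, %rdx1
  ; Round 2: fold the remaining pair.
  %rdx2 = shufflevector <4 x float> %sum1, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %sum2 = fadd <4 x float> %sum1, %rdx2
  %res = extractelement <4 x float> %sum2, i32 0

The sequence grows with the element count, and for scalable vectors the element count is unknown at compile time, which is why SVE cannot emit this pattern at all.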
Demikhovsky, Elena via llvm-dev
2017-Feb-01 09:38 UTC
[llvm-dev] RFC: Generic IR reductions
> One that we have had multiple times, and the usual consensus is: if it
> can be represented in plain IR, it must. Adding multiple semantics for
> the same concept, especially stiff ones like builtins, adds complexity
> to the optimiser.
> Regardless of the merits in this case, builtins should only be
> introduced IFF there is no other way. So first we should discuss
> adding it to IR with generic concepts, just like we did for
> scatter/gather and strided access.

I suppose we can let the target "decide". The target may decide to use an intrinsic, since it will be converted later to one instruction, or leave plain IR, letting the optimizer work smoothly. Actually, we do the same for gather/scatter: the vectorizer does not build intrinsics if the target does not support them.

> adds complexity to the optimizer

The optimizer should not have to deal with this kind of intrinsics, at least. The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false", and in that case we'll stay with plain IR. But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

> if it can be represented in plain IR, it must

The IR is plain and the ISAs are complex, so we miss optimizations at the end. The IR should be rich in order to serve complex ISAs. As for the intrinsics set, we probably need "ordered" and "plain" modes for FP reductions. X86 does not have the "ordered" one today, but might be interested in a loop-tail operation in FP fast mode.

- Elena
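To make the "ordered"/"plain" distinction concrete: without fast-math, a source-level accumulation implies strict left-to-right association, which in IR is a serial chain rather than a tree. A sketch, assuming a fixed lane count of 4 for illustration:

  ; Strictly ordered fadd reduction: each fadd depends on the previous
  ; one, so reassociating into a tree would change the FP semantics.
  %e0 = extractelement <4 x float> %v, i32 0
  %a0 = fadd float %acc, %e0
  %e1 = extractelement <4 x float> %v, i32 1
  %a1 = fadd float %a0, %e1
  %e2 = extractelement <4 x float> %v, i32 2
  %a2 = fadd float %a1, %e2
  %e3 = extractelement <4 x float> %v, i32 3
  %a3 = fadd float %a2, %e3

An "ordered" intrinsic would carry exactly these semantics in a single call. A target with a strictly ordered reduction instruction (such as SVE's FADDA) can select it directly, while a target without one can answer "false" to the profitability query and keep the plain form.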
On 1 February 2017 at 08:27, Renato Golin <renato.golin at linaro.org> wrote:
> Sorry, I meant min/max + reduce, just like above.
>
> %sum = fadd <N x float> %a, %b
> %min = @llvm.minnum(<N x float> %sum)
> %red = @llvm.reduce(%min, float %acc)

No, this is wrong. I actually meant overriding the max/min intrinsics to take vectors instead of two scalar operands.

The semantics of those intrinsics is to take a number of parameters and return the max/min of their own type. A vector is just a different packing of parameters, all of which have the same type, so there's no semantic difference other than the number of arguments. However, when they're lowered, they'll end up as a short sequence of instructions (if supported), in the same way.

You'd only emit a max/min in vectorisation if the target supports it and the cost is low. For instance, in AArch64 it would only emit it if SVE support is enabled.

Makes sense?

cheers,
--renato
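In other words, the suggestion is a hypothetical vector overload of the existing two-operand intrinsic, along the lines of:

  ; Hypothetical overload: reduce a vector to its minimum element.
  ; The in-tree @llvm.minnum today takes two operands of the same type.
  %min = call float @llvm.minnum.v4f32(<4 x float> %v)

Whether such an overload or a separate reduction intrinsic is the cleaner design is exactly the point under discussion in this thread.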
+CC Chandler and Philip.

On 1 February 2017 at 08:27, Renato Golin <renato.golin at linaro.org> wrote:
> Well, that was certainly a smaller group than the list. Design
> decisions should not be taken off list, so we must have this
> discussion on the list again, I'm afraid.
>
>> I don't believe having many intrinsics would be a problem.
>
> This is against every decision I remember. Saying it out loud in a
> meeting is one thing; writing them down, implementing them, and having
> to bear the maintenance costs is another entirely.

That's fair. To bring some context: during Elena's BoF, Chandler and Philip made some arguments against the idea of general-purpose single intrinsics that take an opcode parameter, which went along the lines of: the design would lower the bar for introducing new idioms, to the extent that each one would not have to undergo as much scrutiny as separate additions would. I specifically recall someone, and I think it was Chandler but correct me if I'm wrong, saying that many intrinsics were not in themselves a problem. Their points were also broader than reductions, and covered the general approach of handling groups of idioms.

Chandler, Philip: I hope I haven't mischaracterized your thoughts.

Amara