Demikhovsky, Elena via llvm-dev
2017-Feb-01 11:59 UTC
[llvm-dev] RFC: Generic IR reductions
> My proposal was to have a reduction intrinsic that can infer the type by the predecessors.
> For example:
> @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))

And if we don't have %b? We just want to sum all elements of %a? Something like:

  @llvm.reduce(ext <N x double> ( add <N x float> %a, zeroinitializer))

Don't we have a problem with constant propagation in this approach?

I proposed a "generic" intrinsic approach at the BoF (Nov 2016), like:

  %scalar = @llvm.reduce(OPCODE, %vector_input)

OPCODE may be a string, an integer or metadata.

- Elena

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 12:54
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Amara Emerson <amara.emerson at gmail.com>; Amara Emerson <Amara.Emerson at arm.com>; llvm-dev at lists.llvm.org; nd <nd at arm.com>; Simon Pilgrim <llvm-dev at redking.me.uk>
Subject: Re: [llvm-dev] RFC: Generic IR reductions

On 1 February 2017 at 10:30, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
>> If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with it, no?
>
> The back-end should be as robust as possible, I agree. The problem that I see is in adding another kind of complexity to the optimizer that works between the vectorizer and the back-end. It should be able to recognize all the "obvious" patterns in order to preserve them.

Right! Also, I may have been my own enemy again and muddled the question. Let me try again... :)

I'm not against a reduction intrinsic. I'm against one reduction intrinsic for {every kind} x {ordered, unordered}. At least until further evidence comes to light.

My proposal was to have a reduction intrinsic that can infer the type by the predecessors.
For example:

  @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))

would generate a widening unordered reduction (fast-math).

> Now we look at a reduction phi, and if the FP mode requires the "ordered" reduction, which is not supported, the whole loop remains scalar.

Right, but this is orthogonal to having separate intrinsics or not:

  %fast = @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))
  %order = @llvm.reduce(ext <N x double> ( add <N x float> %a, %b), double %acc)

If the IR contains %order, then this will *have* to be scalar if the target doesn't support a native ordered reduction or a fast path to it. And this is up to the cost model.

> If we were to leave this decision to the cost model, it would provide a cost of scalarization. And in the end we may decide to scalarize the reduction operation inside the vector loop. Now, once the decision taken is to vectorize, inserting an intrinsic or generating plain IR should be a target decision.

Hum, I'm beginning to see your point, I think.

I agree this is again a target decision, but it's also a larger, compiler-wide decision, too. The target's decision can be: the IR doesn't have the required semantics, so I *must* use intrinsics. It can't be: I'd rather have intrinsics because they're easier to match in the back-end. The first is a requirement; the second is a personal choice, and one that can impact the generic instruction selection between IR and target-specific selection.

cheers,
--renato

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
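[The distinction between %fast and %order above matters because floating-point addition is not associative: a vectorized (unordered, fast-math) reduction reassociates the sum, which can change the result relative to the scalar loop. A minimal Python sketch of the two evaluation orders, purely illustrative and not LLVM's implementation:]

```python
def ordered_reduce_fadd(acc, vec):
    """Strict left-to-right reduction: ((acc + v0) + v1) + ...
    This is what the scalar loop computes, and what an 'ordered'
    reduction must preserve."""
    for x in vec:
        acc = acc + x
    return acc

def tree_reduce_fadd(vec):
    """Unordered (fast-math) reduction: pairwise halving tree, the
    shape a vector unit produces. Reassociates the sum."""
    while len(vec) > 1:
        half = len(vec) // 2
        vec = [vec[i] + vec[i + half] for i in range(half)]
    return vec[0]

# 1.0 is absorbed when added to 1e16 first (below one ulp of 1e16),
# but survives when the large terms cancel first.
v = [1e16, 1.0, -1e16, 0.0]
print(ordered_reduce_fadd(0.0, v))  # 0.0
print(tree_reduce_fadd(v))          # 1.0
```

[The two results differ, which is why an ordered reduction cannot simply be lowered to the tree form unless fast-math permits reassociation.]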
On 1 February 2017 at 11:59, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
> > @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))
>
> And if we don't have %b? We just want to sum all elements of %a? Something like @llvm.reduce(ext <N x double> ( add <N x float> %a, zeroinitializer))

Hum, that's a good point. My examples were actually wrong, as they weren't related to simple reductions. Your zeroinit is the thing I was looking for.

> Don't we have a problem with constant propagation in this approach?

I'm not sure. Can you expand on this?

> I proposed a "generic" intrinsic approach at the BoF (Nov 2016), like
> %scalar = @llvm.reduce(OPCODE, %vector_input)
> OPCODE may be a string, an integer or metadata.

I wouldn't use metadata. An integer would be cumbersome and lead to eventual ABI breakages, and "text" would be the same as:

  %scalar = @llvm.reduce.add(%vector)

which is the same thing Amara proposed. I'm not saying it is wrong; I'm just worried that, by mandating the encoding of the reduction into an intrinsic, we'll force the middle-end to convert high-level code patterns to the intrinsic, or the target will ignore it completely.

There is a pattern already for reductions, and the back-ends already match it. This should not change unless there is a serious flaw in it, for the targets that *already* support it. This is an orthogonal discussion.

SVE has more restrictions; for instance, one cannot know how many shuffles to do because the vector size is unknown, so the current representation is insufficient, in which case we need the intrinsic. But replacing everything else with intrinsics just because one target can't cope with it doesn't work.

One thing that does happen is that code optimisations expose patterns that would otherwise not be apparent. This includes potential reduction or fusion patterns and can lead to massively smaller code, or even eliding the whole block.
If you convert a block to an intrinsic too early, you may lose the ability to merge it back again later, as we're doing today.

These are all hypothetical wrt SVE, but they did happen in NEON in the past and were the reason why we only have a handful of NEON intrinsics. Everything else is encoded with sequences of instructions.

cheers,
--renato
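[The "pattern already there for reductions" that Renato refers to is a statically unrolled log2(N)-step sequence of shuffles and lane-wise adds, which only exists when N is a compile-time constant; this is precisely what a scalable vector length breaks. A Python emulation of that unrolled sequence for a <4 x i32> vector, with hypothetical helper names modelling shufflevector/add lane semantics:]

```python
def shufflevector(a, b, mask):
    """Emulate LLVM's shufflevector: select lanes from the concatenation
    of a and b; a -1 mask entry is an undef lane, modeled as None."""
    src = a + b
    return [None if i == -1 else src[i] for i in mask]

def vadd(a, b):
    # Lane-wise add; any lane touching undef stays undef.
    return [None if x is None or y is None else x + y for x, y in zip(a, b)]

# Unrolled reduction of <4 x i32> %v, step by step as it appears in IR:
v = [3, 1, 4, 1]
undef = [None] * 4
hi = shufflevector(v, undef, [2, 3, -1, -1])    # move upper half down
s1 = vadd(v, hi)                                 # partial sums in lanes 0..1
hi2 = shufflevector(s1, undef, [1, -1, -1, -1])  # move lane 1 down
s2 = vadd(s1, hi2)                               # full sum in lane 0
result = s2[0]                                   # extractelement %s2, i32 0
print(result)  # 9
```

[The two halving steps are written out explicitly, so the pattern is only expressible when the element count (and hence log2 of it) is known; with an unknown vector length there is no fixed number of shuffles to emit, which is the SVE case above.]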
Demikhovsky, Elena via llvm-dev
2017-Feb-01 13:06 UTC
[llvm-dev] RFC: Generic IR reductions
Constant propagation:

  %sum = add <N x float> %a, %b
  @llvm.reduce(ext <N x double> %sum)

If %a and %b are vectors of constants, %sum also becomes a vector of constants. At this point you have @llvm.reduce(ext <N x double> %sum) and don't know what kind of reduction you need.

- Elena