Demikhovsky, Elena via llvm-dev
2017-Feb-01 10:30 UTC
[llvm-dev] RFC: Generic IR reductions
> If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with it, no?

The Back-end should be as robust as possible, I agree. The problem that I see is in adding another kind of complexity to the optimizer that works between the Vectorizer and the Back-end. It should be able to recognize all "obvious" patterns in order to preserve them.

>> The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF) - ordered reductions, for example, are not supported on X86 and the answer will, probably, be "false".
>> In this case we'll stay with plain IR.
>> But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

> I'm not sure what you mean. The cost analysis is per instruction, which adds up to per block or function, be they intrinsics or legal IR instructions.

Now we look at a Reduction Phi and if the FP mode requires the "Ordered" reduction which is not supported, the whole loop remains scalar. If we left this decision to the Cost Model, it would provide a cost of scalarization. And at the end we may decide to scalarize the reduction operation inside the vector loop. Now, once the decision taken is vectorization, inserting an intrinsic or generating plain IR should be a Target decision.

- Elena

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 11:58
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Amara Emerson <amara.emerson at gmail.com>; Amara Emerson <Amara.Emerson at arm.com>; llvm-dev at lists.llvm.org; nd <nd at arm.com>; Simon Pilgrim <llvm-dev at redking.me.uk>
Subject: Re: [llvm-dev] RFC: Generic IR reductions

On 1 February 2017 at 09:38, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
> I suppose, we can let the Target "decide". Target may decide to use an intrinsic since it will be converted later to one instruction or leave a plain IR letting the optimizer work smoothly.
> Actually, we do the same for gather/scatter - vectorizer does not build intrinsics if the Target does not support them.

Absolutely, this is a target-specific decision.

For gather / scatter we had some intrinsics, but we also used plain IR for others, no? For strided access, we used mostly IR, despite the original proposal being mostly intrinsics.

I think the balance is: what is the cost across the compiler to generate and keep the intrinsics syntactically correct and optimal?

I'm not claiming I know, as I don't know AVX or SVE too well, but we need answers to those questions before we take any decision.

>> adds complexity to the optimizer
> Optimizer should not deal with intrinsics, with this kind of intrinsics at least.

That's why it adds complexity. To most optimisers, intrinsics are function calls that could either be replaced by a code block, or a specific instruction or a library call. This may make certain transformations opaque.

We have had in the past examples in the NEON generation that have shown better code after inlining due to new IR patterns being found, and if we had used intrinsics, they wouldn't.

> The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF) - ordered reductions, for example, are not supported on X86 and the answer will, probably, be "false".
> In this case we'll stay with plain IR.
> But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

I'm not sure what you mean.

The cost analysis is per instruction, which adds up to per block or function, be they intrinsics or legal IR instructions.

The implementation of functions that calculate profitability already has to take into account non-intrinsics, so that shouldn't make any difference.

> The IR is plain and the ISAs are complex, we miss optimizations at the end. IR should be rich in order to serve complex ISAs.

Creating more intrinsics actually makes the IR *more* complex. What kind of optimisations would we miss?

If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with it, no? I mean, we already do this for almost everything else.

If the current IR can express the required semantics, then we should use plain IR.

The alternative is to start forcing intrinsics for everything, which would make IR really pointless.

> As for the intrinsics set, we probably need "ordered" and "plain" mode for FP reductions.

The ordered and plain modes can be chosen via an additional parameter, the accumulator.

> X86 does not have the "ordered" today, but might be interesting in a loop-tail operation in FP fast mode.

Indeed, same for NEON and I guess most other SIMD engines.

cheers,
--renato

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
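
The decision flow argued over in this message (ask the target whether a given reduction is profitable at a given VF, then emit either an intrinsic or plain IR) can be sketched roughly as follows. This is only an illustration in Python, not LLVM code: `Target`, `ReductionKind`, `is_profitable` and `lower_reduction` are hypothetical names, and the two example targets simply encode the claims made in the thread (no ordered FP reduction on X86, native support on SVE).

```python
from dataclasses import dataclass
from enum import Enum, auto


class ReductionKind(Enum):
    UNORDERED_FADD = auto()
    ORDERED_FADD = auto()


@dataclass(frozen=True)
class Target:
    """Hypothetical stand-in for a back-end's capability description."""
    name: str
    native_reductions: frozenset

    def is_profitable(self, kind, vf):
        # Assumption for this sketch: a reduction intrinsic pays off only when
        # the target has a native instruction for this kind and VF is > 1.
        return vf > 1 and kind in self.native_reductions


def lower_reduction(target, kind, vf):
    """Pick the form to emit once vectorization has already been decided."""
    return "intrinsic" if target.is_profitable(kind, vf) else "plain-ir"


# Illustrative targets, modelled only on the statements made in the thread.
x86_like = Target("x86-like", frozenset({ReductionKind.UNORDERED_FADD}))
sve_like = Target(
    "sve-like",
    frozenset({ReductionKind.UNORDERED_FADD, ReductionKind.ORDERED_FADD}),
)

print(lower_reduction(x86_like, ReductionKind.ORDERED_FADD, 4))
print(lower_reduction(sve_like, ReductionKind.ORDERED_FADD, 4))
```

In this toy model the target answers a yes/no question, which is Elena's framing; Renato's counterpoint is that the same answer could fall out of the ordinary per-instruction cost model instead of a dedicated hook.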
On 1 February 2017 at 10:30, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
>> If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with it, no?
> The Back-end should be as robust as possible, I agree. The problem that I see is in adding another kind of complexity to the optimizer that works between the Vectorizer and the Back-end. It should be able to recognize all "obvious" patterns in order to preserve them.

Right!

Also, I may have been my own enemy again and muddled the question. Let me try again... :)

I'm not against a reduction intrinsic. I'm against one reduction intrinsic for {every kind} x {ordered, unordered}. At least until further evidence comes to light.

My proposal was to have a reduction intrinsic that can infer the type by the predecessors.

For example:

  @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))

would generate a widening unordered reduction (fast-math).

> Now we look at a Reduction Phi and if the FP mode requires the "Ordered" reduction which is not supported, the whole loop remains scalar.

Right, but this is orthogonal to having separate intrinsics or not.

  %fast = @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))
  %order = @llvm.reduce(ext <N x double> ( add <N x float> %a, %b), double %acc)

If the IR contains %order, then this will *have* to be scalar if the target doesn't support native ordered or a fast path to it. And this is up to the cost model.

> If we would leave this decision to the Cost Model, it will provide a cost of scalarization. And at the end we may decide to scalarize reduction operation inside vector loop. Now, once the taken decision is vectorization, inserting an intrinsic or generating a plain IR should be a Target decision.

Hum, I'm beginning to see your point, I think.

I agree this is again a target decision, but it's also a larger compiler-wide decision, too.

The target's decision is: IR doesn't have the required semantics, I *must* use intrinsics.

It can't be: I'd rather have intrinsics because it's easier to match in the back-end.

The first is a requirement, the second is a personal choice, and one that can impact the generic instruction selection between IR and target specific selection.

cheers,
--renato
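
Why the accumulator operand matters can be shown with a small numeric sketch. It assumes (this is an interpretation, not something the thread specifies) that the form with an accumulator means a strict left-to-right sum starting from %acc, and that the form without one is free to reassociate, modelled here as a pairwise tree. `reduce_fadd` is a hypothetical Python stand-in, it works on Python floats (doubles), and it ignores the widening step in the IR example.

```python
def reduce_fadd(values, acc=None):
    """Toy model of the two reduction forms discussed above.

    With an accumulator: ordered, strict left-to-right addition.
    Without one: unordered, modelled as a pairwise (tree) reduction.
    """
    if acc is not None:
        total = acc
        for v in values:
            total += v  # each add depends on the previous one
        return total

    work = list(values)
    if not work:
        return 0.0
    while len(work) > 1:
        # Add neighbouring lanes; carry an odd trailing lane over unchanged.
        pairs = [work[i] + work[i + 1] for i in range(0, len(work) - 1, 2)]
        if len(work) % 2:
            pairs.append(work[-1])
        work = pairs
    return work[0]


lanes = [1e16, 1.0, -1e16, 1.0]
ordered = reduce_fadd(lanes, acc=0.0)   # ((((0 + 1e16) + 1) - 1e16) + 1)
unordered = reduce_fadd(lanes)          # (1e16 + 1) + (-1e16 + 1)
print(ordered, unordered)
```

The two calls give different results for the same lanes because floating-point addition is not associative, which is why a target without a native ordered reduction has to fall back to a scalar chain when the ordered form is required.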
Demikhovsky, Elena via llvm-dev
2017-Feb-01 11:59 UTC
[llvm-dev] RFC: Generic IR reductions
> My proposal was to have a reduction intrinsic that can infer the type by the predecessors.
> For example:
> @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))

And if we don't have %b? We just want to sum all elements of %a? Something like

  @llvm.reduce(ext <N x double> ( add <N x float> %a, zeroinitializer))

Don't we have a problem with constant propagation in this approach?

I proposed a "generic" intrinsic approach on the BOF (Nov, 2016), like

  %scalar = @llvm.reduce(OPCODE, %vector_input)

- OPCODE may be a string, integer or metadata.

- Elena

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 12:54
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Amara Emerson <amara.emerson at gmail.com>; Amara Emerson <Amara.Emerson at arm.com>; llvm-dev at lists.llvm.org; nd <nd at arm.com>; Simon Pilgrim <llvm-dev at redking.me.uk>
Subject: Re: [llvm-dev] RFC: Generic IR reductions
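
The "generic" form, where the operation is passed as an operand and the reduction works directly on a single vector (so no second operand or zeroinitializer is needed), might look like the following Python sketch. `generic_reduce` and the opcode strings are hypothetical, chosen only for illustration; the proposal leaves open whether OPCODE would be a string, an integer or metadata.

```python
import operator
from functools import reduce as fold

# Hypothetical opcode table; the names are illustrative, not taken from LLVM.
OPCODES = {
    "add": operator.add,
    "mul": operator.mul,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
    "smax": max,
    "smin": min,
}


def generic_reduce(opcode, vector):
    """Collapse all lanes of `vector` into one scalar using `opcode`."""
    if opcode not in OPCODES:
        raise ValueError(f"unsupported reduction opcode: {opcode!r}")
    if not vector:
        raise ValueError("cannot reduce an empty vector")
    return fold(OPCODES[opcode], vector)


a = [1, 2, 3, 4]
print(generic_reduce("add", a))
print(generic_reduce("smax", a))
```

A single entry point keeps the number of intrinsics down, at the cost of having every consumer inspect the opcode operand before it knows which operation it is looking at.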