Hi All,

Renato wrote:
> As they say: if it's not broken, don't fix it.
> Let's talk about the reductions that AVX512 and SVE can't handle with IR semantics, but let's not change the current IR semantics for no reason.

The main problem for SVE: we can't write a straight-line IR instruction sequence for the reduction's last-value compute without knowing the number of elements in the vector to start from.

For non-scalable vectors, a series of "reduce to half" steps (or any other straight-line IR instruction sequence) works functionally, but it is certainly not ideal: it's ugly, the optimal sequence is target dependent, and it artificially inflates the outgoing IL for any optimizer not interested in optimizing the reduction's last-value compute, which translates to longer compile times, and so on. So it's not as though we have no reasons for asking to change the status quo.

Here's what I'd like to see at the end of this discussion: a nice and concise representation of reducing a vector into a scalar value. An IR instruction to do so would be ideal, but I understand that the hurdle for that is very high. I'm okay with an intrinsic function call, and I hear that's a reasonable step towards an instruction.

Let's say someone comes up with a 1024-bit vector working on char data. Nobody is really happy to see a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem at half that size (only a few instructions fewer). I don't think anything proportional to LOG(#elems) is "nice and concise".

Such a representation is also useful inside a vectorized loop when the programmer wants a bitwise-identical FP reduction value (obviously at much slower speed) without losing vectorization of the rest of the loop body. So this isn't just "outside the vectorized loop" stuff.

Now, can we focus on the value of a "nice and concise representation"?
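As a concrete illustration of the LOG(#elems) growth described above, here is a small Python model of the "reduce to half" ladder (a sketch, not actual LLVM IR; it assumes a power-of-two element count): each step folds the upper half of the vector onto the lower half, so 128 char elements require seven shuffle-plus-add steps of straight-line code.

```python
def reduce_to_half_add(vec):
    """Model the 'reduce to half' idiom: every step adds the upper half
    of the vector onto the lower half, so reducing n elements takes
    log2(n) straight-line steps (each a shuffle + add in real IR).
    Assumes len(vec) is a power of two."""
    steps = 0
    while len(vec) > 1:
        half = len(vec) // 2
        vec = [vec[i] + vec[i + half] for i in range(half)]
        steps += 1
    return vec[0], steps

# 128 elements, as in the 1024-bit char-vector example:
total, steps = reduce_to_half_add(list(range(128)))  # total == 8128, steps == 7
```

A single reduction intrinsic would replace all seven steps with one call, which is the "nice and concise" representation being argued for.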
If the community consensus is "Yes", we can then talk about how to deliver that vision --- one IR instruction, one intrinsic call, a small number of IR instructions (I don't know how, but someone might have brilliant ideas), etc. --- and then dig deeper into the details (e.g., an IR instruction for each reduction operator? is the operator specified as an operand or as part of the intrinsic name?). If the community consensus is "No", we can stop talking about details right away. Obviously, I'm voting YES on "nice and concise representation".

Thanks,
Hideki Saito
Intel Compilers and Languages

------------------------------

Date: Wed, 1 Feb 2017 15:10:36 +0000
From: Renato Golin via llvm-dev <llvm-dev at lists.llvm.org>
To: Amara Emerson <amara.emerson at gmail.com>
Cc: "llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, nd <nd at arm.com>
Subject: Re: [llvm-dev] RFC: Generic IR reductions
Message-ID: <CAMSE1keZFe+EGTZ6zu69J+953Y87jnMKKZuHbR-P63sEjE0Ymw at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

On 1 February 2017 at 14:44, Amara Emerson <amara.emerson at gmail.com> wrote:
> Her point is that the %sum value will no longer be an add node, but simply a constant vector. There is no way to resolve the semantics and have meaningful IR after a simple constant prop. This means that other simple transformations will all have to have special-case logic just to handle reductions, for example LCSSA.

Right.

> Can you give a specific example? Reductions are actually very robust to optimizations breaking the idiom, which is why I was able to replace the reductions with intrinsics in my patches and, with simple matching, generate identical code to before. No other changes were required in the mid-end.

First, this is a demonstration that keeping them as IR is a good thing, not a bad one. We keep the semantics as well as allowing for further introspection in the code block. Examples of introspection are vector widening and instruction fusion after inlining.
Say you have a loop with a function call and a reduction, but that call contains a reduction of its own. If the function gets inlined after you have turned your loop's reduction into a builtin, the optimiser has no visibility into whether the function's reduction pattern can be merged, either by widening the vectors (saving on loads/stores) or by fusing operations (e.g. MLA). We have seen both cases with NEON after using IR instructions for everything but the impossible.

In 2010, I went through a similar discussion with Bob Wilson, who defended the position I'm defending now. And I defend this position today because I have been categorically proven wrong by the results I describe above.

Chandler's arguments are perfectly to the point. Intrinsics are not only necessary when we can't represent things in IR; they're a *good* way of representing odd things. But if things are not odd (i.e. many targets have them), or if we can already represent them in IR, then it stands to reason that adding duplicated stuff only adds complexity. It increases the maintenance cost (more node types to consider), it increases the chance of missing some of them (and either not optimising or generating bad code), it stops optimisers that know nothing about the new node (because it's too new) from doing any inference, and it can actually generate worse code than before (shuffle explosion).

As they say: if it's not broken, don't fix it. Let's talk about the reductions that AVX512 and SVE can't handle with IR semantics, but let's not change the current IR semantics for no reason.

cheers,
--renato
On 1 February 2017 at 20:27, Saito, Hideki <hideki.saito at intel.com> wrote:
> Main problem for SVE: We can't write straight-line IR instruction sequence for reduction last value compute, without knowing #elements in vector to start from.

Hi Saito,

Indeed, I think there's at least consensus that, for scalable vectors, there's no other way around it.

> For non-scalable vector, series of "reduce to half" (or any other straight-line IR instruction sequence) is functionally working, but certainly not ideal (ugly, optimal sequence is target dependent, artificially inflating the outgoing IL for any optimizers not interested in optimizing reduction last value compute, that translates to longer compile time, etc.)

I agree it's not pretty, but IR doesn't have to be pretty. IR is a compiler intermediate language, not a user-facing language. So, if the representation is accurate and the compile-time costs are within noise, there is no reason to change.

Now, you touched on one subject that is worth considering: binary reduction is target specific. I don't know AVX, AltiVec and others too well, but I assumed there are only two types of reduction strategy: ordered and binary. The first exists because of precision; the second because it's the most logical generic step: you only need one instruction with multiple register sizes to do any kind of reduction, and that reduces the complexity of the ISA.

So if there are non-ordered / non-binary reductions, the representation is indeed target specific, and that's a very negative point for it. But this only makes the case for non-ordered / non-binary reductions to gain a builtin, not for moving everything into a builtin.

> So, it's not like we don't have reasons for asking to change the status quo.

I apologise if that's what got across. I'm not trying to shut anything down; I'm just reiterating the reasons that I have received (and given) in the past with regard to IR changes. We have done this for NEON, stride access, etc.,
and it has worked well in general, with exposed optimisations that we wouldn't have gotten from a builtin.

> Here's what I'd like to see at the end of this discussion.
> Nice and concise representation of reducing a vector into a scalar value.
> An IR instruction to do so is ideal, but I understand that the hurdle for that is very high.
> I'm okay with an intrinsic function call, and I heard that's a reasonable step to get to instruction.

Yes. If we have an intrinsic that can be built in a generic way, that's half-way towards an instruction. That's what I was trying to do in my first reply, when I suggested the @llvm.reduce() builtin. But it seems impractical to have a single one. It'll most likely be a family: @llvm.reduce.op.typein.typeout.ordered?(%vector), with "op" being add/sub/max/min/etc., "typein" the vector type, "typeout" if different for widening/narrowing, and "ordered" if FP precision is important. It'll be harder to transform this into an instruction (mainly because of the op), but it can be done; it just needs time and community support.

> Let's say someone comes up with 1024bit vector working on char data. Nobody is really happy to see a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem of half that size (only a few instructions less). I don't think anything that is proportional to "LOG(#elems)" is "nice and concise".

I agree it would be ugly, but I want to make sure we're clear that this is mostly irrelevant, unless it impacts compile time or pushes the complexity of pattern matching beyond what is reasonable. I would expect larger vectors to have longer patterns, but I wouldn't oppose an intrinsic for 1024-bit vector reduction. :)

> Such a representation is also useful inside of vectorized loop if the programmer wants bitwise identical FP reduction value (obviously at the much slower speed), without losing the vectorization of rest of the loop body.
> So, this isn't just "outside the vectorized loop" stuff.

I'm not sure what you mean by this. Do you mean that @llvm.reduce.add.ordered(...) would still be useful scalarised? We can get the same ordered behaviour by introducing instruction dependencies (%sum = %a + %b -> %sum = %sum + %c, etc.). Also, we'd have to teach the middle end or all back-ends to lower that intrinsic, even if the target has no concept of reduction in its ISA, in case it does appear from some other pass. Wouldn't that be risky?

> Now, can we focus on the value of "nice and concise representation"? If the community consensus is "Yes", we can then talk about how to deliver that vision --- one IR instruction, one intrinsic call, a small number of IR instructions (I don't know how, but someone might have brilliant ideas), etc., and further dig deeper (e.g., IR instruction for each reduction operator?, whether the operator specification is part of operand or part of intrinsic name?, etc). If the community consensus is "No", we can stop talking about details right away.

As I said, I don't see why we need a concise representation at all, unless it impacts compile time or pattern-match complexity (maintainability), or it's the only way (e.g. SVE). So I vote "it depends". :)

cheers,
--renato
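The intrinsic family and the ordered-vs-binary distinction discussed in the emails above can be sketched together. The following Python model is purely illustrative (the function name and flag are hypothetical, not actual LLVM intrinsics): a generic reduce parameterised by "op", where ordered=True evaluates strictly left to right (the %sum = %sum + %c dependency chain), while the default may reassociate, modelled here as pairwise halving. For FP inputs the two are not bitwise identical, which is exactly why the "ordered" variant exists.

```python
import operator
from functools import reduce

# Hypothetical model of an @llvm.reduce.<op>[.ordered] family.
# These names do not correspond to actual LLVM intrinsics.
OPS = {"add": operator.add, "mul": operator.mul, "max": max, "min": min}

def reduce_vector(op, vec, ordered=False):
    fn = OPS[op]
    if ordered:
        # Strict left-to-right chain (%sum = %sum + %c), preserving the
        # scalar loop's FP rounding exactly.
        return reduce(fn, vec)
    # Unordered: the target may use any association; model it as
    # pairwise halving (carrying over a leftover odd element).
    while len(vec) > 1:
        half = len(vec) // 2
        tail = [vec[-1]] if len(vec) % 2 else []
        vec = [fn(vec[i], vec[i + half]) for i in range(half)] + tail
    return vec[0]

vals = [1.0, 1e16, 1.0, -1e16]
ordered_result = reduce_vector("add", vals, ordered=True)  # 0.0
pairwise_result = reduce_vector("add", vals)               # 2.0
```

For integer ops the association is irrelevant, which is why an unordered form is always safe there; the ordered flag only matters for FP (and other non-associative) reductions, matching the "bitwise identical FP reduction value" requirement quoted above.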
Hi, Renato.

> So I vote "it depends". :)

My preference is for the vectorizer to emit one kind of "reduce vector into scalar" rather than choosing among many different ways. I'm perfectly fine with @llvm.reduce.op.typein.typeout.ordered?(%vector) being that one kind of "reduce vector into scalar". I think we are converging well enough at the detail level, but have a big difference of opinion at the "vision" level. :)

Thanks,
Hideki

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 12:52 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: llvm-dev at lists.llvm.org; Amara Emerson <amara.emerson at gmail.com>; Chandler Carruth <chandlerc at gmail.com>; Philip Reames <listmail at philipreames.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; Simon Pilgrim <llvm-dev at redking.me.uk>; Mehdi Amini <mehdi.amini at apple.com>; Michael Kuperstein <mkuper at google.com>; nd <nd at arm.com>
Subject: Re: RFC: Generic IR reductions