Hi, Renato.

> So I vote "it depends". :)

My preference is to let the vectorizer emit one kind of "reduce vector into scalar" instead of letting the vectorizer choose one of many different ways. I'm perfectly fine with @llvm.reduce.op.typein.typeout.ordered?(%vector) being that "one kind of reduce vector into scalar".

I think we are converging enough at the detail level, but have a big difference of opinion at the "vision" level. :)

Thanks,
Hideki

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 12:52 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: llvm-dev at lists.llvm.org; Amara Emerson <amara.emerson at gmail.com>; Chandler Carruth <chandlerc at gmail.com>; Philip Reames <listmail at philipreames.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; Simon Pilgrim <llvm-dev at redking.me.uk>; Mehdi Amini <mehdi.amini at apple.com>; Michael Kuperstein <mkuper at google.com>; nd <nd at arm.com>
Subject: Re: RFC: Generic IR reductions

On 1 February 2017 at 20:27, Saito, Hideki <hideki.saito at intel.com> wrote:
> Main problem for SVE: we can't write a straight-line IR instruction sequence for the reduction last-value
> compute without knowing the number of elements in the vector to start from.

Hi Saito,

Indeed, I think there's at least consensus that, for scalable vectors, there's no way around it.

> For non-scalable vectors, a series of "reduce to half" steps (or any other straight-line IR instruction sequence) is
> functionally correct, but certainly not ideal (it's ugly, the optimal sequence is target dependent, and it artificially
> inflates the outgoing IR for any optimizer not interested in the reduction last-value compute, which translates to
> longer compile time, etc.)

I agree it's not pretty, but IR doesn't have to be pretty. IR is a compiler intermediate language, not a user-facing language. So, if the representation is accurate and the compile-time costs are within noise, there is no reason to change.

Now, you touched on one subject that is worth considering: binary reduction is target specific. I don't know AVX, AltiVec and the others too well, but I assumed there are only two reduction strategies: ordered and binary. The first exists because of precision, and the second because it's the most logical generic step: you only need one instruction with multiple register sizes to do any kind of reduction, which keeps the ISA simple.

So if there are non-ordered / non-binary reductions, the representation is indeed target specific, and that's a strong point against it. But this only makes the case for non-ordered / non-binary reductions to gain a builtin, not for moving everything into a builtin.

> So, it's not like we don't have reasons for asking to change the status quo.

I apologise if that's what got across. I'm not trying to shut anything down; I'm just reiterating the reasons I have received (and given) in the past with regard to IR changes. We have done this for NEON, strided access, etc., and it has worked well in general, exposing optimisations that we wouldn't have gotten with a builtin.

> Here's what I'd like to see at the end of this discussion:
> a nice and concise representation for reducing a vector into a scalar value.
> An IR instruction to do so would be ideal, but I understand that the hurdle for that is very high.
> I'm okay with an intrinsic function call, and I heard that's a reasonable step towards an instruction.

Yes.
If we have an intrinsic that can be built in a generic way, that's half-way towards an instruction. That's what I was trying to do in my first reply, when I suggested the @llvm.reduce() builtin. But it seems impractical to have a single one. It will most likely be a family: @llvm.reduce.op.typein.typeout.ordered?(%vector), with "op" being add/sub/max/min/etc., "typein" the vector type, "typeout" if different for widening/narrowing, and "ordered" if FP precision is important. It will be harder to turn this into an instruction (mainly because of the op), but it can be done; it just needs time and community support.

> Let's say someone comes up with a 1024-bit vector working on char data. Nobody is really happy to see
> a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem
> at half that size (only a few instructions less). I don't think anything that is proportional to log(#elems)
> is "nice and concise".

I agree it would be ugly (the sketch after this message shows the shape of that pattern), but I want to make sure we're clear that this is mostly irrelevant unless it impacts compile time or makes pattern matching unreasonably complex. I would expect larger vectors to have longer patterns, but I wouldn't oppose an intrinsic for 1024-bit vector reductions. :)

> Such a representation is also useful inside a vectorized loop if the programmer wants a bitwise-identical
> FP reduction value (obviously at much slower speed), without losing the vectorization of the rest of
> the loop body. So, this isn't just "outside the vectorized loop" stuff.

I'm not sure what you mean by this. Do you mean that @llvm.reduce.add.ordered(...) would still be useful scalarised? We can get the same ordered behaviour by introducing instruction dependencies (%sum = %a + %b -> %sum = %sum + %c, etc.). Also, we'd have to teach the middle end or all back-ends to lower that intrinsic, even if the target has no concept of reduction in its ISA, in case it does appear from some other pass. Wouldn't that be risky?

> Now, can we focus on the value of a "nice and concise representation"? If the community consensus
> is "yes", we can then talk about how to deliver that vision: one IR instruction, one intrinsic call, a small
> number of IR instructions (I don't know how, but someone might have brilliant ideas), etc., and then dig
> deeper (e.g., an IR instruction for each reduction operator? is the operator specified as an operand or as
> part of the intrinsic name? etc.). If the community consensus is "no", we can stop talking about
> details right away.

As I said, I don't see why we need a concise representation at all, unless it impacts compile time or pattern-matching complexity (maintainability), or it's the only way (e.g. SVE).

So I vote "it depends". :)

cheers,
--renato
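For readers skimming the thread, here is roughly what the two representations being compared look like for a fast (unordered) add reduction of an <8 x float> value. This is only a sketch: the shuffle masks follow the usual "reduce to half" pattern, and the intrinsic name and signature are illustrative of the proposed family, not a finalised spelling.

  ; Current "reduce to half" form: log2(8) = 3 shuffle+fadd pairs, then an extract of lane 0.
  %h1  = shufflevector <8 x float> %v, <8 x float> undef,
           <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %s1  = fadd fast <8 x float> %v, %h1
  %h2  = shufflevector <8 x float> %s1, <8 x float> undef,
           <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %s2  = fadd fast <8 x float> %s1, %h2
  %h3  = shufflevector <8 x float> %s2, <8 x float> undef,
           <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %s3  = fadd fast <8 x float> %s2, %h3
  %sum = extractelement <8 x float> %s3, i32 0

  ; Proposed intrinsic form: the same value as a single call (hypothetical name).
  %sum = call float @llvm.reduce.fadd.v8f32.f32(<8 x float> %v)

Both encodings mean a reassociable ("fast") reduction; the ordered variant discussed above would instead chain the adds in source order, element by element.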
On 1 February 2017 at 21:22, Saito, Hideki <hideki.saito at intel.com> wrote:
> I think we are converging enough at the detail level, but having a big
> difference in the opinions at the "vision" level. :)

Vision is always harder than engineering. :)

So, I have copied a lot more people now, to make sure we get a wider audience. I want to give them a chance to chime in early enough. Folks, feel free to copy more people if you think they need to have an opinion.

Here's a summary:

Problem #1: Reductions are common, but:
* the code generated is large (log(lanes) operations plus extracts) and will get "too large" with 1024/2048-bit vectors
* this may cause compile-time slowdowns (we're not sure, we need data) and increases the complexity of the pattern-matching descriptions in the back-end
* the semantics can be target specific (ordered / non-ordered / non-binary reduction), though I don't know of any target that is not ordered/binary
* constant propagation can hide the semantics half-way through, and the back-end loses the ability to match

Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
* can't generate binary extracts + ops (to mean unordered, aka fast), as we don't know the number of lanes
* will have to represent ordered reduction as a loop based on the (unknown at compile time) "compile-time constant" that represents the vector length

So, Amara's proposal converged on solving all those problems with a new intrinsic similar to:

  %red = type-out @llvm.reduce.op.vec-type.type-out.ordered(<vec-ty> %vector)

With:
* "op" one of { add, sub, mul, max, min, etc. } x { '', 'f' } for integer / float
* vec-ty the input type, which can be a scalable type in the future (<n x N x type>)
* type-out not necessarily equal to the lane type in vec-ty, to allow widening/shortening or FP-to-int conversions, if supported
* "ordered" some flag meaning FP precision is a concern (it could equally be "fast", meaning the opposite; that doesn't matter here)

The solution simplifies the loop vectoriser a lot, as well as the SLP vectoriser, and, as Amara said, could even be used by front-ends or other passes when the semantics are guaranteed (Haskell?).

However, it also incurs the same issue we had with the other vector intrinsics: they stiffen the IR, making other optimisations discard information, avoid valid transformations, or even produce plain bad code-gen.

So far, on the specific topic of reductions, the only case I could think of where we'd miss optimisations is when we have two compatible reductions and we want to merge them. For example, NEON can horizontally reduce one or two vectors at the same time, or we could have three strided reductions and then combine them all into one after inlining. What stops us from doing so with intrinsics is just the knowledge, so we trade complexity in the back-end to match long patterns for complexity in optimisation passes that have to know about the new intrinsics.

The arguments were:

Pro intrinsic:
* Simpler and shorter IR
* Easier to read, easier for the vectorisers to generate
* Easier for the back-end to match
* Easier transition towards scalable vectors in the future

Cons:
* The already-existing patterns work and can expose further opportunities
* We may have to change more middle-end passes / back-ends than would otherwise be affected

I really have no strong opinion either way. Both are challenging in their own right, and this is worth doing anyway.

Did I miss something?

cheers,
--renato
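To make the family above concrete, here are a few hypothetical instantiations. The names, type mangling, and the scalable-vector syntax are all illustrative only, following the <n x N x type> notation used in the summary rather than any finalised spelling.

  %s1 = call i32   @llvm.reduce.add.v4i32.i32(<4 x i32> %v)              ; integer add
  %s2 = call float @llvm.reduce.fmax.v8f32.f32(<8 x float> %v)           ; FP max, unordered
  %s3 = call float @llvm.reduce.fadd.ordered.v8f32.f32(<8 x float> %v)   ; strict FP order
  %s4 = call i64   @llvm.reduce.add.v16i8.i64(<16 x i8> %v)              ; widening: type-out wider than the lane type
  %s5 = call i32   @llvm.reduce.add.nxv4i32.i32(<n x 4 x i32> %v)        ; scalable (SVE) vector, lane count unknown

The last form is the one that cannot be expressed as a straight-line shuffle tree at all, since the lane count is not a compile-time constant.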
Thanks for the summary, some more comments inline.

On 1 February 2017 at 22:02, Renato Golin <renato.golin at linaro.org> wrote:
> On 1 February 2017 at 21:22, Saito, Hideki <hideki.saito at intel.com> wrote:
>> I think we are converging enough at the detail level, but having a big
>> difference in the opinions at the "vision" level. :)
>
> Vision is always harder than engineering. :)
>
> So, I have copied a lot more people now, to make sure we get a wider
> audience. I want to give them a chance to chime in early enough.
> Folks, feel free to copy more people if you think they need to have an
> opinion.
>
> Here's a summary:
>
> Problem #1: Reductions are common, but:
> * the code generated is large (log(lanes) operations plus extracts) and
> will get "too large" with 1024/2048-bit vectors
> * this may cause compile-time slowdowns (we're not sure, we need data)
> and increases the complexity of the pattern-matching descriptions in
> the back-end
> * the semantics can be target specific (ordered / non-ordered /
> non-binary reduction), though I don't know of any target that is not
> ordered/binary
> * constant propagation can hide the semantics half-way through, and
> the back-end loses the ability to match

Unless I'm mistaken, I think the constant-propagation issue was with the proposal to split the reduction into two stages, where the reduce step depends on the instruction used to produce the input value. This is unrelated to the current shuffle representation.

> Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
> * can't generate binary extracts + ops (to mean unordered, aka fast),
> as we don't know the number of lanes
> * will have to represent ordered reduction as a loop based on the
> (unknown at compile time) "compile-time constant" that represents the
> vector length
>
> So, Amara's proposal converged on solving all those problems with a new
> intrinsic similar to:
>
>   %red = type-out @llvm.reduce.op.vec-type.type-out.ordered(<vec-ty> %vector)
>
> With:
> * "op" one of { add, sub, mul, max, min, etc. } x { '', 'f' } for
> integer / float
> * vec-ty the input type, which can be a scalable type in the future
> (<n x N x type>)
> * type-out not necessarily equal to the lane type in vec-ty, to allow
> widening/shortening or FP-to-int conversions, if supported

I didn't include this in my original proposal, although I'm not particularly opposed to it. In the spirit of keeping the IR as simple as possible, the widening/narrowing behaviour doesn't need to be bundled into the reduction. A promotion can be a separate ext of the input, and I think it should be easy to pattern match.

> * "ordered" some flag meaning FP precision is a concern (it could
> equally be "fast", meaning the opposite; that doesn't matter here)
>
> The solution simplifies the loop vectoriser a lot, as well as the SLP
> vectoriser, and, as Amara said, could even be used by front-ends or
> other passes when the semantics are guaranteed (Haskell?).
>
> However, it also incurs the same issue we had with the other vector
> intrinsics: they stiffen the IR, making other optimisations discard
> information, avoid valid transformations, or even produce plain bad
> code-gen.
>
> So far, on the specific topic of reductions, the only case I could
> think of where we'd miss optimisations is when we have two compatible
> reductions and we want to merge them.
> For example, NEON can
> horizontally reduce one or two vectors at the same time, or we could
> have three strided reductions and then combine them all into one after
> inlining.

My implementation of this for AArch64 NEON (I've not looked at AArch32 yet) converts the generic reduce ISD nodes into AArch64-specific ones followed by an extract. It doesn't try to do anything else, so once it gets past the original tree-shuffle matching code, the DAG should be identical.

> What stops us from doing so with intrinsics is just the knowledge, so
> we trade complexity in the back-end to match long patterns for
> complexity in optimisation passes that have to know about the new
> intrinsics.
>
> The arguments were:
>
> Pro intrinsic:
> * Simpler and shorter IR
> * Easier to read, easier for the vectorisers to generate
> * Easier for the back-end to match
> * Easier transition towards scalable vectors in the future

I also want to reiterate that reductions are useful for testing bits of predicate vectors, which can be applicable to targets other than SVE for things like early-exit vectorization. Simon mentioned this to me off-list. Simon, could you comment here on whether this proposal would work for your needs? (There is a small sketch of both points below.)

> Cons:
> * The already-existing patterns work and can expose further opportunities
> * We may have to change more middle-end passes / back-ends than would
> otherwise be affected
>
> I really have no strong opinion either way. Both are challenging in
> their own right, and this is worth doing anyway.
>
> Did I miss something?
>
> cheers,
> --renato
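Two of the points above lend themselves to small sketches, again with illustrative intrinsic names only: a promotion kept outside the reduction, as suggested above, and an "any lane set" test on a predicate vector expressed as an or-reduction.

  ; Widening kept separate from the reduction: ext first, then reduce.
  %wide = zext <16 x i8> %bytes to <16 x i32>
  %sum  = call i32 @llvm.reduce.add.v16i32.i32(<16 x i32> %wide)

  ; Testing a predicate vector (e.g. for early-exit vectorization):
  ; "did any comparison fire?" as an or-reduction to a single bit.
  %cmp = icmp eq <16 x i32> %data, %needle
  %any = call i1 @llvm.reduce.or.v16i1.i1(<16 x i1> %cmp)
  br i1 %any, label %exit, label %continue

The first pair is the pattern Amara expects the back-end to match instead of a combined widening reduction; the second is the kind of use that is not tied to SVE at all.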