Thanks for the summary, some more comments inline.

On 1 February 2017 at 22:02, Renato Golin <renato.golin at linaro.org> wrote:
> On 1 February 2017 at 21:22, Saito, Hideki <hideki.saito at intel.com> wrote:
>> I think we are converging enough at the detail level, but having a big
>> difference in the opinions at the "vision" level. :)
>
> Vision is always harder than engineering. :)
>
> So, I have copied a lot more people now, to make sure we get a wider
> audience. I want to give them a chance to chime in early enough.
> Folks, feel free to copy more people if you think they need to have an
> opinion.
>
> Here's a summary:
>
> Problem #1: Reductions are common, but:
> * the code generated is large (log2(lanes) operations plus extracts) and
>   will get too large with 1024/2048-bit vectors
> * this may cause compile-time slowdowns (we're not sure, we need data)
>   and adds complexity to the pattern-matching descriptions in the
>   back-end
> * the semantics can be target-specific (ordered / unordered /
>   non-binary reduction), but I don't know of any target that is not
>   ordered/binary
> * constant propagation can hide the semantics half-way through, and
>   the back-end loses the ability to match

Unless I'm mistaken, I think the constant-propagation issue was with the
proposal to split the reduction into two stages, where the reduce step
depends on the instruction used to produce the input value. This is
unrelated to the current shuffle representation.

> Problem #2: As we move to scalable vectors (of unknown length, like
> SVE), we:
> * can't generate binary extracts + ops (to mean unordered, aka fast),
>   as we don't know the number of lanes
> * will have to represent an ordered reduction as a loop based on the
>   (unknown at compile time) "compile-time constant" that represents
>   the vector length
>
> So, Amara's proposal converged to solve all those problems with a new
> intrinsic similar to:
>
>   %red = type-out @llvm.reduce.op.vec-ty.type-out.ordered(<vec-ty> %vector)
>
> With:
> * "op" one of { add, sub, mul, max, min, etc. } x { '', 'f' } for
>   integer / float
> * vec-ty the input type, which can be a scalable type in the future
>   (<n x N x type>)
> * type-out not necessarily equal to the lane type in vec-ty, to allow
>   for widening/shortening or FP-to-int conversions, if supported

I didn't include this in my original proposal, although I'm not
particularly opposed to it. In the spirit of keeping the IR as simple as
possible, the widening/narrowing behaviour doesn't need to be bundled
into the reduction. A promotion can be a separate ext of the input, and
I think it should be easy to pattern-match.
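To make the size difference concrete, here is a rough sketch of a fast
(unordered) fadd reduction of a <4 x float> in the current shuffle form
next to the proposed intrinsic form. The exact intrinsic name mangling
below is illustrative only, not part of the proposal:

  ; Current form: log2(lanes) shuffle+op stages, then an extract.
  %rdx1 = shufflevector <4 x float> %vec, <4 x float> undef,
              <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %sum1 = fadd fast <4 x float> %vec, %rdx1
  %rdx2 = shufflevector <4 x float> %sum1, <4 x float> undef,
              <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %sum2 = fadd fast <4 x float> %sum1, %rdx2
  %red  = extractelement <4 x float> %sum2, i32 0

  ; Proposed form (fast/unordered variant, illustrative mangling):
  %red = call float @llvm.reduce.fadd.v4f32.f32(<4 x float> %vec)

The shuffle tree gains another shuffle+op stage every time the vector
width doubles, which is the "too large with 1024/2048-bit vectors"
concern above.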
> * "ordered" some flag meaning FP precision is a concern (could be
>   "fast" meaning the opposite, it doesn't matter)
>
> The solution simplifies the loop vectoriser a lot, as well as the SLP
> vectoriser, and, as Amara said, it could even be used by front-ends or
> other passes when the semantics are guaranteed (Haskell?).
>
> However, it also incurs the same issue we had with the other vector
> intrinsics: they stiffen the IR, making other optimisations discard
> information, avoid valid transformations, or even generate plain bad
> code.
>
> So far, on the specific topic of reductions, the only case I could
> think of where we'd miss optimisations is when we have two compatible
> reductions and we want to merge them. For example, NEON can
> horizontally reduce one or two vectors at the same time, or we could
> have three strided reductions and then combine them all into one after
> inlining.

My implementation of this for AArch64 NEON (I've not looked at AArch32
yet) converts the generic reduce ISD nodes to AArch64-specific ones plus
an extract. It doesn't try to do anything else, so once it gets past the
original tree-shuffle matching code, the DAG should be identical.

> What stops us from doing so with intrinsics is just the knowledge, so
> we trade complexity in the back-end to match long patterns for
> complexity in the optimisation passes that have to know about the new
> intrinsics.
>
> The arguments were:
>
> Pro intrinsic:
> * Simpler and shorter IR
> * Easier to read, easier for the vectorisers to generate
> * Easier for the back-end to match
> * Easier transition towards scalable vectors in the future

I also want to re-iterate that reductions are useful for testing bits of
predicate vectors, which can be applicable to targets other than SVE for
things like early-exit vectorization. Simon mentioned this to me
off-list. Simon, could you comment here on whether this proposal would
work for your needs?

> Cons:
> * The already-existing patterns work and can expose further
>   opportunities
> * We may have to change more middle-end passes / back-ends than would
>   otherwise be affected
>
> I really have no strong opinion either way. Both are challenging in
> their own rights, and this is worth doing anyway.
>
> Did I miss something?
>
> cheers,
> --renato
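To make the merging case above concrete, this is the kind of transform
an intrinsic-aware pass would have to learn (a sketch only; the
intrinsic mangling is again illustrative). Integer add is associative
and commutative, so two reductions feeding a scalar add can become one
wider reduction:

  ; Before: two reductions plus a scalar add.
  %r1 = call i32 @llvm.reduce.add.v4i32.i32(<4 x i32> %a)
  %r2 = call i32 @llvm.reduce.add.v4i32.i32(<4 x i32> %b)
  %s  = add i32 %r1, %r2

  ; After: concatenate the inputs and reduce once.
  %ab = shufflevector <4 x i32> %a, <4 x i32> %b,
            <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                       i32 4, i32 5, i32 6, i32 7>
  %s2 = call i32 @llvm.reduce.add.v8i32.i32(<8 x i32> %ab)

With the shuffle representation this opportunity can fall out of the
existing combines; with intrinsics it has to be taught explicitly.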
> Also want to re-iterate that reductions are useful for testing bits of
> predicate vectors, which can be applicable to other targets than SVE
> for things like early-exit vectorization. Simon mentioned this to me
> off-list. Simon, could you comment here if this proposal would work
> for your needs?

For fixed-size vectors, bitcasting a vector of i1s into a single integer
with the same number of bits takes care of most cases. For AVX512, it is
a K-register-to-GPR move. I don't think that works for SVE, though. This
is one place where I don't mind the vectorizer not generating a single
kind of representation: the non-intrinsic alternative is too nice to
give up. If we are talking about a reduction instruction, however, that
might convey the vectorizer's logic better than bitcasting. That's where
I'd draw the line. For future outer-loop vectorization, we'll be testing
"if the vector of predicates is not all-false", which is a bitcast
followed by a compare against zero.
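Concretely, the fixed-width test is just the following (a 16-lane mask
is used here for the AVX512 case, where the bitcast result becomes a
kmov to a GPR plus a compare):

  ; "Not all-false" test on a 16-lane predicate.
  %bits = bitcast <16 x i1> %pred to i16
  %any  = icmp ne i16 %bits, 0
  br i1 %any, label %loop.body, label %exit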
Thanks,
Hideki
> On 2 Feb 2017, at 01:06, Amara Emerson <amara.emerson at gmail.com> wrote:
>
> Also want to re-iterate that reductions are useful for testing bits of
> predicate vectors, which can be applicable to other targets than SVE
> for things like early-exit vectorization. Simon mentioned this to me
> off-list. Simon, could you comment here if this proposal would work
> for your needs?

Yes - I’m hoping that we can both vectorise early-out ‘any_of’
predicate-test code and perform early-out breaks from already-vectorised
cases, and nothing I’ve seen suggests this proposal will get in the way.
It’s mainly going to be a case of correct recognition in the loop
vectoriser, handling dereferenced arrays, etc., and I don’t think these
intrinsics will obfuscate those cases/attributes.

Does SVE have an early-out ability, or is it an all-or-nothing
mechanism?
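A rough sketch of what such an ‘any_of’ early-exit test could look like
with a reduce-style 'or' in the spirit of the proposal (the intrinsic
name is illustrative only):

  ; Inside a vectorised loop body: exit early if any lane matches.
  %cmp = icmp eq <8 x i32> %data, %needle      ; per-lane predicate
  %any = call i1 @llvm.reduce.or.v8i1.i1(<8 x i1> %cmp)
  br i1 %any, label %early.exit, label %loop.latch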
Simon.

Yes, SVE can vectorize early-exit loops by using speculative
(first-faulting) loads, which essentially give you a predicate of the
lanes that were loaded successfully. For uncounted loops with these
special loads, the loop predicate tests can be done using a 'ptest'
instruction, checking whether the last element is active.

Amara
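As a very rough sketch of how that might look in IR terms (everything
here is hypothetical: the first-faulting load intrinsic is invented for
illustration, and the <n x 4 x ...> scalable syntax is the notation
from the summary above, not something that exists today):

  ; Speculatively load up to a whole vector; %ok is the predicate of
  ; lanes that loaded without faulting.
  %ld  = call { <n x 4 x i32>, <n x 4 x i1> }
             @llvm.speculative.load.nxv4i32(i32* %addr, <n x 4 x i1> %p)
  %val = extractvalue { <n x 4 x i32>, <n x 4 x i1> } %ld, 0
  %ok  = extractvalue { <n x 4 x i32>, <n x 4 x i1> } %ld, 1
  ; A ptest-style check then decides whether the loop continues, e.g.
  ; whether the last requested lane is still active in %ok.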