Cameron McInally via llvm-dev
2020-Nov-12 21:36 UTC
[llvm-dev] Complex proposal v3 + roundtable agenda
On Thu, Nov 12, 2020 at 2:47 PM Florian Hahn <florian_hahn at apple.com> wrote:
>
> > On Nov 12, 2020, at 18:52, Cameron McInally <cameron.mcinally at nyu.edu> wrote:
> >
> > On Thu, Nov 12, 2020 at 12:03 PM Florian Hahn via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > >
> > > Hi,
> > >
> > > There's growing interest among our users in making better use of dedicated hardware instructions for complex math, and I would like to re-start the discussion on the topic. Given that the original thread was started a while ago, apologies if I missed anything already discussed earlier on the list or at the round-table. The original mail is quoted below.
> > >
> > > In particular, I'm interested in the AArch64 side of things, like using FCMLA [1] for complex multiplications to start with.
> > >
> > > To get the discussion going, I'd like to share an alternative pitch. Instead of starting with adding complex types, we could start with adding a set of intrinsics that operate on complex values packed into vectors.
> > >
> > > Starting with intrinsics would allow us to bring up the lowering of those intrinsics to target-specific nodes incrementally, without having to make substantial changes across the codebase, as adding new types would require. Initially, we could try to match IR patterns that correspond to complex operations late in the pipeline. We can then work on incrementally moving the point where the intrinsics are introduced earlier in the pipeline, as we adapt more passes to deal with them. This way, we won't have to teach all passes about complex types at once, or risk losing all the existing combines on the corresponding floating-point operations.
> > >
> > > I think if we introduce a small set of intrinsics for complex math (like @llvm.complex.multiply), we could use them to improve code generation in key passes like the vectorizers and deliver large improvements to our users fairly quickly. There might be some scenarios which require a dedicated IR type, but I think we can get a long way with a set of specialized intrinsics at a much lower cost. If we later decide that dedicated IR types are needed, replacing the intrinsics should be easy, and we will benefit from having already updated various passes to deal with the intrinsics.
> > >
> > > We took a similar approach when adding matrix support to LLVM, and I think that worked out very well in the end. The implementation upstream generates equivalent or better code than our earlier implementation using dedicated IR matrix types, while being simpler and impacting a much smaller area of the codebase.
> > >
> > > An independent issue to discuss is how to generate complex math intrinsics. As part of the initial bring-up, I'd propose matching the code Clang generates for operations on std::complex<> & co to introduce the complex math intrinsics. This won't be perfect and will miss cases, but it allows us to deliver initial improvements without requiring extensive updates to existing libraries or frontends. I don't think either the intrinsic-only or the complex-type variant is inherently more convenient for frontends to emit.
> > >
> > > To better illustrate what this approach could look like, I put up a set of rough patches that introduce a @llvm.complex.multiply intrinsic (https://reviews.llvm.org/D91347), replace a set of fadd/fsub/fmul instructions with @llvm.complex.multiply (https://reviews.llvm.org/D91353), and lower the intrinsic to FCMLA on AArch64 (https://reviews.llvm.org/D91354). Note that those are just rough proof-of-concept patches.
> > >
> > > Cheers,
> > > Florian
> >
> > The proposed experimental intrinsics are a difficult detour to accept for performance reasons. With a complex type, the usual algebraic simplifications fall out for free (or close to it). Teaching existing optimizations how to handle the new complex intrinsics seems like a LOT of unnecessary work.
>
> Thanks for taking a look!
>
> Could you expand a bit more on what kind of unnecessary work you expect? I would expect most of the code dealing with the intrinsics to be easily migrated once/if we decide to switch to a dedicated type.
>
> Concretely, for the lowering code, it should hopefully just boil down to updating the patterns that get matched in the backends (matching the complex multiply instruction instead of @llvm.complex.multiply). For supporting complex math in the vectorizers, we have to add support for cost modeling and for widening the intrinsics. Again, wouldn't changing from intrinsics to a type just mean adjusting from dealing with intrinsic calls to dealing with the corresponding instructions?
>
> There certainly are some pieces around the edges that will need adjusting or will become obsolete, but I would hope/expect the majority of the work to be reusable.
>
> As for getting the usual algebraic simplifications for free, even with a new type I suppose we would have to teach instcombine/InstructionSimplify about them. This is indeed an area where a dedicated type is probably quite a bit easier to deal with. But I think part of the vector-predication proposal includes some generic matchers for the different intrinsics, which should be relatively straightforward to update as well.

Yes, the IR passes are where I see complex types win. For example, pretty much all of InstCombine's visitFMul(...) transformations should not depend on type.*** So there's no need to duplicate all that code for @llvm.complex.multiply.

It's true that teaching the pattern matcher to recognize intrinsics gets us a long way. But we'd also need to update IR passes in places where the Opcode is explicitly checked. E.g.:

```
switch (Opcode) {
<snipped>
case Instruction::FAdd:   <== ***HERE***
  return SimplifyFAddInst(LHS, RHS, FastMathFlags(), Q, MaxRecurse);
```

And when a transformation succeeds, we'd need to update how the result is returned. E.g.:

```
// (-X) + Y --> Y - X
Value *X, *Y;
if (match(&I, m_c_FAdd(m_FNeg(m_Value(X)), m_Value(Y))))
  return BinaryOperator::CreateFSubFMF(Y, X, &I);   <== ***HERE***
```

Both of those come for free with a complex type, along with the pattern matching code.

I'm not opposed to finding solutions for these problems, but this pitch has been around for a while now, and it hasn't really taken off yet. It's hard to rally behind it again.

(***) Let me point out that my number theory and abstract algebra are *rusty*. So don't be fooled into thinking my initial assumptions about complex numbers are correct.

> I'll admit that the scope of my pitch is much more limited than the original proposal and very focused on allowing LLVM to use specialized instructions. But it should be relatively simple to implement (without impacting anything unrelated to the passes/backends that are adjusted) and easy to extend to meet additional needs of other people & backends. It also allows us to reuse the existing instructions for insert/extract/shuffles and the corresponding folds.

Generating the special instructions is good, but not if it costs other optimizations. Let's get both. :D

But in all seriousness, my opinion on this topic has hit stark resistance before.
I won't harp on it if others don't feel the same. It's just pretty clear to me that experimental intrinsics are a major hurdle to performant code for larger projects.
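To make the pitch quoted above more concrete: a call to the proposed intrinsic might look roughly like the sketch below, assuming two single-precision complex values packed as interleaved (real, imaginary) lanes of a <4 x float>. The intrinsic name, signature, and lane layout shown here are assumptions for illustration; the actual definitions are whatever D91347 specifies.

```
; Sketch only: a hypothetical shape for the proposed intrinsic. Each
; <4 x float> is assumed to hold two complex numbers as interleaved
; (real, imag) lanes; the real signature and layout come from D91347
; and may differ.
declare <4 x float> @llvm.complex.multiply.v4f32(<4 x float>, <4 x float>)

define <4 x float> @two_complex_products(<4 x float> %a, <4 x float> %b) {
  %prod = call <4 x float> @llvm.complex.multiply.v4f32(<4 x float> %a, <4 x float> %b)
  ret <4 x float> %prod
}
```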
Florian Hahn via llvm-dev
2020-Nov-13 20:01 UTC
[llvm-dev] Complex proposal v3 + roundtable agenda
> On Nov 12, 2020, at 21:36, Cameron McInally <cameron.mcinally at nyu.edu> wrote:
>
> On Thu, Nov 12, 2020 at 2:47 PM Florian Hahn <florian_hahn at apple.com> wrote:
> >
> > There certainly are some pieces around the edges that will need adjusting or will become obsolete, but I would hope/expect the majority of the work to be reusable.
> >
> > As for getting the usual algebraic simplifications for free, even with a new type I suppose we would have to teach instcombine/InstructionSimplify about them. This is indeed an area where a dedicated type is probably quite a bit easier to deal with. But I think part of the vector-predication proposal includes some generic matchers for the different intrinsics, which should be relatively straightforward to update as well.
>
> Yes, the IR passes are where I see complex types win. For example, pretty much all of InstCombine's visitFMul(...) transformations should not depend on type.*** So there's no need to duplicate all that code for @llvm.complex.multiply.
>
> It's true that teaching the pattern matcher to recognize intrinsics gets us a long way. But we'd also need to update IR passes in places where the Opcode is explicitly checked. E.g.:
>
> ```
> switch (Opcode) {
> <snipped>
> case Instruction::FAdd:   <== ***HERE***
>   return SimplifyFAddInst(LHS, RHS, FastMathFlags(), Q, MaxRecurse);
> ```
>
> And when a transformation succeeds, we'd need to update how the result is returned. E.g.:
>
> ```
> // (-X) + Y --> Y - X
> Value *X, *Y;
> if (match(&I, m_c_FAdd(m_FNeg(m_Value(X)), m_Value(Y))))
>   return BinaryOperator::CreateFSubFMF(Y, X, &I);   <== ***HERE***
> ```
>
> Both of those come for free with a complex type, along with the pattern matching code.

Thanks for expanding on this! I think for InstructionSimplify it should be fairly easy to treat @llvm.complex.multiply & co as their instruction counterparts. InstCombine (or other passes that create new instructions as they combine) will require more additional changes.

This is certainly a trade-off compared to adding dedicated types. But given the additional complexity and overhead adding new types entails, I think the intrinsics are a good alternative that can get us quite far, especially if we can re-use/build on some of the techniques Simon mentioned.

> I'm not opposed to finding solutions for these problems, but this pitch has been around for a while now, and it hasn't really taken off yet. It's hard to rally behind it again.

If there's been an earlier discussion about introducing a small set of intrinsics, I think I missed that. The references linked from the complex type proposal seem to all focus on adding a new type.

> (***) Let me point out that my number theory and abstract algebra are *rusty*. So don't be fooled into thinking my initial assumptions about complex numbers are correct.
>
> > I'll admit that the scope of my pitch is much more limited than the original proposal and very focused on allowing LLVM to use specialized instructions. But it should be relatively simple to implement (without impacting anything unrelated to the passes/backends that are adjusted) and easy to extend to meet additional needs of other people & backends. It also allows us to reuse the existing instructions for insert/extract/shuffles and the corresponding folds.
>
> Generating the special instructions is good, but not if it costs other optimizations. Let's get both. :D
>
> But in all seriousness, my opinion on this topic has hit stark resistance before. I won't harp on it if others don't feel the same. It's just pretty clear to me that experimental intrinsics are a major hurdle to performant code for larger projects.

That's an interesting point. I am expecting to get most benefits in terms of performance from better vectorizing complex math by using specialized hardware instructions, and would expect benefits from better combines to be smaller. But besides reports from our users, I do not have anything to back this up, and I suppose this highly depends on the particulars of the codebases.

As discussed earlier, the intrinsics pitch comes with some trade-offs when it comes to re-using existing combines and simplifications. But the key benefit I see is that it is much more lightweight (and hence hopefully easier to get started with) and gives us a path forward to make incremental and noticeable improvements in a relatively short time span. Even if that doesn't get us to a perfect state directly, it should make it much easier to start moving in the right direction. And once we hit the point at which we really need dedicated types, we will already have laid much of the groundwork with the intrinsics.

Cheers,
Florian
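For reference, the "match what Clang generates" bring-up strategy from the pitch targets code along the lines of the sketch below: when Clang is allowed to use the limited-range lowering for complex multiplication (e.g. under -ffast-math or -fcx-limited-range; Clang's default lowering for _Complex multiplication instead falls back to __mulsc3 to handle infinities and NaNs), the product of two single-precision complex values reduces to plain fmul/fsub/fadd. The function and value names here are illustrative only; this is the kind of pattern a matching patch along the lines of D91353 would need to recognize and replace with the intrinsic.

```
; Illustrative only: (a.re + i*a.im) * (b.re + i*b.im) lowered to scalar
; float ops, as in the limited-range case. Names are made up.
define { float, float } @complex_mul(float %a.re, float %a.im,
                                     float %b.re, float %b.im) {
  %ac = fmul fast float %a.re, %b.re
  %bd = fmul fast float %a.im, %b.im
  %re = fsub fast float %ac, %bd        ; real part: a.re*b.re - a.im*b.im
  %ad = fmul fast float %a.re, %b.im
  %bc = fmul fast float %a.im, %b.re
  %im = fadd fast float %ad, %bc        ; imag part: a.re*b.im + a.im*b.re
  %r0 = insertvalue { float, float } undef, float %re, 0
  %r1 = insertvalue { float, float } %r0, float %im, 1
  ret { float, float } %r1
}
```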
Cameron McInally via llvm-dev
2020-Nov-16 16:37 UTC
[llvm-dev] Complex proposal v3 + roundtable agenda
On Fri, Nov 13, 2020 at 3:01 PM Florian Hahn <florian_hahn at apple.com> wrote:
>
> > On Nov 12, 2020, at 21:36, Cameron McInally <cameron.mcinally at nyu.edu> wrote:
> >
> > On Thu, Nov 12, 2020 at 2:47 PM Florian Hahn <florian_hahn at apple.com> wrote:
> > >
> > > There certainly are some pieces around the edges that will need adjusting or will become obsolete, but I would hope/expect the majority of the work to be reusable.
> > >
> > > As for getting the usual algebraic simplifications for free, even with a new type I suppose we would have to teach instcombine/InstructionSimplify about them. This is indeed an area where a dedicated type is probably quite a bit easier to deal with. But I think part of the vector-predication proposal includes some generic matchers for the different intrinsics, which should be relatively straightforward to update as well.
> >
> > Yes, the IR passes are where I see complex types win. For example, pretty much all of InstCombine's visitFMul(...) transformations should not depend on type.*** So there's no need to duplicate all that code for @llvm.complex.multiply.
> >
> > It's true that teaching the pattern matcher to recognize intrinsics gets us a long way. But we'd also need to update IR passes in places where the Opcode is explicitly checked. E.g.:
> >
> > ```
> > switch (Opcode) {
> > <snipped>
> > case Instruction::FAdd:   <== ***HERE***
> >   return SimplifyFAddInst(LHS, RHS, FastMathFlags(), Q, MaxRecurse);
> > ```
> >
> > And when a transformation succeeds, we'd need to update how the result is returned. E.g.:
> >
> > ```
> > // (-X) + Y --> Y - X
> > Value *X, *Y;
> > if (match(&I, m_c_FAdd(m_FNeg(m_Value(X)), m_Value(Y))))
> >   return BinaryOperator::CreateFSubFMF(Y, X, &I);   <== ***HERE***
> > ```
> >
> > Both of those come for free with a complex type, along with the pattern matching code.
>
> Thanks for expanding on this! I think for InstructionSimplify it should be fairly easy to treat @llvm.complex.multiply & co as their instruction counterparts. InstCombine (or other passes that create new instructions as they combine) will require more additional changes.
>
> This is certainly a trade-off compared to adding dedicated types. But given the additional complexity and overhead adding new types entails, I think the intrinsics are a good alternative that can get us quite far, especially if we can re-use/build on some of the techniques Simon mentioned.
>
> > I'm not opposed to finding solutions for these problems, but this pitch has been around for a while now, and it hasn't really taken off yet. It's hard to rally behind it again.
>
> If there's been an earlier discussion about introducing a small set of intrinsics, I think I missed that. The references linked from the complex type proposal seem to all focus on adding a new type.

Sorry, I wasn't clear. I meant the pitch to optimize experimental intrinsics in general. That general plan to optimize experimental intrinsics has been around since the constrained FP intrinsics project started, probably 2016 or 2017.

> > (***) Let me point out that my number theory and abstract algebra are *rusty*. So don't be fooled into thinking my initial assumptions about complex numbers are correct.
> >
> > > I'll admit that the scope of my pitch is much more limited than the original proposal and very focused on allowing LLVM to use specialized instructions. But it should be relatively simple to implement (without impacting anything unrelated to the passes/backends that are adjusted) and easy to extend to meet additional needs of other people & backends. It also allows us to reuse the existing instructions for insert/extract/shuffles and the corresponding folds.
> >
> > Generating the special instructions is good, but not if it costs other optimizations. Let's get both. :D
> >
> > But in all seriousness, my opinion on this topic has hit stark resistance before. I won't harp on it if others don't feel the same. It's just pretty clear to me that experimental intrinsics are a major hurdle to performant code for larger projects.
>
> That's an interesting point. I am expecting to get most benefits in terms of performance from better vectorizing complex math by using specialized hardware instructions, and would expect benefits from better combines to be smaller. But besides reports from our users, I do not have anything to back this up, and I suppose this highly depends on the particulars of the codebases.
>
> As discussed earlier, the intrinsics pitch comes with some trade-offs when it comes to re-using existing combines and simplifications. But the key benefit I see is that it is much more lightweight (and hence hopefully easier to get started with) and gives us a path forward to make incremental and noticeable improvements in a relatively short time span. Even if that doesn't get us to a perfect state directly, it should make it much easier to start moving in the right direction. And once we hit the point at which we really need dedicated types, we will already have laid much of the groundwork with the intrinsics.

That seems fair.