> On Dec 23, 2014, at 10:54 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
>> On Dec 23, 2014, at 10:41 AM, Chris Lattner <clattner at apple.com> wrote:
>>
>> On Dec 23, 2014, at 10:28 AM, Chris Bieneman <beanz at apple.com> wrote:
>>>>> It should be straight-forward to have something like LLVMInitializeX86Target/RegisterTargetMachine install the intrinsics into a registry.
>>>>
>>>> I tried doing that a few years ago. It’s not nearly as easy as it sounds because we’ve got hardcoded references to various target intrinsics scattered throughout the code.
>>>
>>> I was just writing to say exactly this. There are a number of ways we could work toward something like this. I’m completely in favor of a world where intrinsics are properties of the targets and don’t leak out; however, today they do in a lot of places.
>>
>> What are the specific problems here? Anything that does an equality comparison with the IntrinsicID can be changed to do a strcmp with the name. That would handle the one-off cases like InstCombiner::SimplifyDemandedUseBits in InstCombine.
>>
>> The other cases in InstCombine could be handled similarly, but may be better handled by adding intrinsic behavior query APIs to the intrinsic registry, or would be better handled (eventually) by adding new attributes to the intrinsics themselves.
>
> I don’t think there’s anything fundamentally difficult, but it’s a big change. For example:
>
> $ git grep Intrinsic:: | wc
>     3364   12286  281078
>
> The vast majority of those 3,364 lines have hardcoded references to specific intrinsics. Many of them are used in contexts where you can’t easily insert a strcmp (e.g., case values in large switch statements, or worse, the m_Intrinsic PatternMatch templates).

I don’t find this convincing. It should be simple to introduce a new m_Intrinsic PatternMatch template that takes a string. The switches are also straight-forward. In BasicAA, for example, instead of:

  switch (II->getIntrinsicID()) {
  default: break;
  case Intrinsic::memset:
  case Intrinsic::memcpy:
  case Intrinsic::memmove: {
    ...
  case Intrinsic::arm_neon_vld1: {
    assert(ArgIdx == 0 && "Invalid argument index");
    assert(Loc.Ptr == II->getArgOperand(ArgIdx) &&
           "Intrinsic location pointer not argument?");
    // LLVM's vld1 and vst1 intrinsics currently only support a single
    // vector register.
    if (DL)
      Loc.Size = DL->getTypeStoreSize(II->getType());
    break;
  }
  }

Just change this to:

  switch (II->getIntrinsicID()) {
  case Intrinsic::memset:
  case Intrinsic::memcpy:
  case Intrinsic::memmove: {
    ...
  default:
    if (II->getName() == "llvm.arm.neon.vld1") {
      same stuff as above
    }
    break;

Spreading ifdefs through the codebase would be really unfortunate.

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141223/eb3be06b/attachment.html>
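To make the name-based fallback concrete, here is a minimal, self-contained sketch of the pattern Chris describes. `GenericIntrinsic`, `CallSite`, and `knowsMemoryBehavior` are invented stand-ins for illustration, not LLVM API; in real code this would be `II->getIntrinsicID()` and `II->getName()`:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins for LLVM's IntrinsicInst interface.
enum class GenericIntrinsic { NotIntrinsic, MemSet, MemCpy, MemMove };

struct CallSite {
  GenericIntrinsic ID; // target-independent intrinsics keep an enum ID
  std::string Name;    // target-specific ones are matched by name only
};

// Toy version of the BasicAA logic: generic intrinsics stay in the switch;
// target-specific ones fall through to a name comparison, so no
// Intrinsic::arm_neon_vld1 enumerator needs to exist in generic code.
bool knowsMemoryBehavior(const CallSite &CS) {
  switch (CS.ID) {
  case GenericIntrinsic::MemSet:
  case GenericIntrinsic::MemCpy:
  case GenericIntrinsic::MemMove:
    return true;
  default:
    return CS.Name == "llvm.arm.neon.vld1";
  }
}
```

With this shape, compiling out the ARM backend removes the enumerator without touching the generic pass; the string comparison simply never matches.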
> On Dec 23, 2014, at 12:41 PM, Chris Lattner <clattner at apple.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> Spreading ifdefs through the codebase would be really unfortunate.

I agree. I wasn’t at all trying to discourage anyone from doing as you suggest — just pointing out that it’s not a small project. Much of it would be mechanical, though, and it does seem like the right long-term solution.
On Tue, Dec 23, 2014 at 12:41 PM, Chris Lattner <clattner at apple.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> Spreading ifdefs through the codebase would be really unfortunate.

If we're going to talk about what the right long-term design is, let me put out a different opinion. I used to be somewhat torn on this issue, but between this discussion and looking at the particular intrinsics in question, I'm rapidly being persuaded.

We shouldn't have any target-specific intrinsics. At the very least, we shouldn't use them anywhere in the front- or middle-end, even if we have them.

Today, frontends need to emit specific target intrinsics *and* have the optimizer be aware of them. I can see a few reasons why:

1) Missing semantics -- the IR may not have *quite* the semantics desired and provided by the target's ISA.
2) Historical expectations -- the GCC-compatible builtins are named after the instructions, and the target-independent builtins lower to intrinsics, so the target-specific ones should too.
3) Poor instruction selection -- we could emit the logic as boring IR, but we fail to instruction-select it well, so as a hack we emit the instruction directly and teach the optimizer to still optimize through it.

If we want to pursue the *right* design, I think we should be fixing these three issues, and then we won't need the optimizer to be aware of any of this.

Fixing #1) We just need to flush these out and define target-independent intrinsics. The number of these is relatively small and not likely to be a burden. I'm looking at you, cvtss2si.

Fixing #2) I think this just requires a giant pile of work to re-target builtins and other things that people expect the optimizer to operate on into proper IR. Yes, it's hard, but it is the right design. With x86 we're already about 90% of the way there. My understanding is that ARM has gone the other direction. While unpleasant, I think we have to make a choice: either the builtin doesn't get optimized or it gets lowered to "normal" IR.

Fixing #3) Yes, we need better instruction selection. We have always needed better instruction selection. If there are truly critical needs that aren't met, we can even achieve this through much more localized hacks *in the backend* that form specific intrinsics to "pre-select" instructions that we know are important. While just as hacky and gross, this seems like a much better tradeoff to make than the current approach.

In the end, we don't have target-specific intrinsics in the middle end at all. Then we get to decide whether it is worth our time to provide a really efficient target-specific *encoding* of these operations, or if we should just lower to the target-specific encoding we already have: inline assembly. I don't have strong feelings about this either way.

OK, so I think that is the right design, but I also completely agree with Chris B here -- it is *way more work* than is really called for to address a very narrow, targeted problem that WebKit has: binary size. But I also agree with Chris L's point here:

> Spreading ifdefs through the codebase would be really unfortunate.

I just don't think this is necessary. And I don't think relying on a fragile frontend-based "optimization" to achieve the same effects is really great either.

The huge code in IR/Function.cpp.o comes from *terrible* code being emitted by tablegen. We should just change tablegen to emit something smaller. This will get most of the size win, and will get it even when the target *is* compiled in! In particular, if we switch to a sorted table of intrinsic names, change the *lookup* of names to use the same sorted table so we share the memory, and do a binary search, I think the code size will be essentially free.

Once we address that, Chris B can use the somewhat ugly #ifdefs or his clang attribute patch to look for other hotspots where we burn size on intrinsics. I bet a few more could be fixed by similar effort. If the code size is *still* a problem, we could start factoring things incrementally into target-independent intrinsics and removing the target-specific ones, which might help incrementally move us toward the "right design" (IMO). Then if someone is really motivated (which would be cool) to do all the hard work to really move us over to the right design, we would have lost nothing, there would be essentially no more technical debt than there is today, and WebKit and others won't have to wait for their binary size improvements.

-Chandler
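A minimal sketch of the sorted-table scheme described above. The table contents, function names, and ID encoding are invented for illustration; the real tablegen-emitted table would list every intrinsic:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <iterator>
#include <string>

// One sorted array of names, shared by the name->ID and ID->name
// directions, so the string data is stored only once.
static const char *const IntrinsicNames[] = {
    "llvm.memcpy", "llvm.memmove", "llvm.memset", "llvm.sqrt",
}; // must stay sorted for the binary search below

// Binary search over the sorted table instead of a huge generated matcher.
// Returns an index (a stand-in for Intrinsic::ID), or -1 if not found.
int lookupIntrinsicID(const std::string &Name) {
  const char *const *Begin = std::begin(IntrinsicNames);
  const char *const *End = std::end(IntrinsicNames);
  const char *const *It = std::lower_bound(
      Begin, End, Name, [](const char *Entry, const std::string &N) {
        return std::strcmp(Entry, N.c_str()) < 0;
      });
  if (It != End && Name == *It)
    return static_cast<int>(It - Begin);
  return -1;
}

// The reverse mapping reuses the same table.
const char *intrinsicName(int ID) { return IntrinsicNames[ID]; }
```

The lookup cost is O(log n) string comparisons, and because both directions share one table, the size of the emitted code and data grows only linearly with the number of intrinsics.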
> On Dec 23, 2014, at 1:40 PM, Chandler Carruth <chandlerc at google.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> If we want to pursue the *right* design, I think we should be fixing these three issues and then we won't need the optimizer to be aware of any of this.

I strongly disagree with your conclusions here. Everything you’re suggesting is rooted in three base assumptions that are not true for many clients:
- that all source languages are basically C
- that all programming models are more or less like C on a *nix system
- that all hardware is basically like the intersection of X86 and ARM (“typical RISC machine”)

Consider the use case of an OpenGL shader compiler. Its source language is not C (despite syntactic appearances), and the frontend may need to express semantics that are difficult or impossible to express in target-independent IR. Its programming model is not like a C compiler's, including constructs like cross-thread derivatives, uniform vs. varying calculations, etc. Its target instruction set is likely nothing at all like X86 or ARM, likely including an arithmetic set that is very different from your typical CPU's, as well as lots of ISA-level constructs for interacting with various fixed-function hardware units.

Consider the less exotic use case of a DSP compiler. DSPs typically have lots of instructions for “unusual” arithmetic operations that are intended to map to very specific use cases: lots of variants of rounding and/or wrapping control, lots of extending/widening/doubling operations, memory accesses with unusual stride patterns. The entire purpose of the existence of a DSP is to deliver high computation bandwidth under tight latency constraints. If your DSP compiler fails to make use of the exotic arithmetic operations that the user requested, the whole system has *failed* at being a DSP.

Consider the even-closer-to-home use case of vector programming. There are three major families of vector extensions in widespread use (SSE, NEON, and Altivec), as well as many variants and lesser-known instruction sets. And while all three agree on a small core of functionality (fadd <4 x float>!), all of them include large bodies of just plain arithmetic that are not covered by the others and are not practically expressible in target-independent IR. Even if we add the union of their functionality to target-independent IR, then we have the reverse problem, where the frontend and optimizers may produce IR that most backends have little to no hope of generating good code for. And let’s not forget that, while the requirements here are somewhat less strict than on a DSP, our users will still be very unhappy if they write a top-half-extending-saturating-absolute-difference builtin and we give them 100 instructions of emulated gunk back.

While I agree with the underlying sentiment that we should strive to minimize the intrusion of target-specific intrinsics as much as possible, and to compartmentalize them into their source backends as much as possible, expecting to reach a world with no intrinsic considerations in any part of the frontend or optimizer just seems hopelessly idealistic.

—Owen
> On Dec 23, 2014, at 3:07 PM, Owen Anderson <resistor at mac.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> While I agree with the underlying sentiment that we should strive to minimize the intrusion of target-specific intrinsics as much as possible, and compartmentalizing them into their source backends as much as possible, expecting to reach a world with no intrinsic considerations in any part of the frontend or optimizer just seems hopelessly idealistic.

I don’t see exactly why the middle-end optimizer has to know about the mighty DSP intrinsic. If it maps to this super-powerful hardware instruction anyway, can’t the middle-end optimizer just ignore it and leave it as-is to be handled by the backend? (I agree the front-end has to know about it anyway...)

A more generic question I have is: why does the code that handles “target-specific” intrinsics have to be in the middle-end? I am surprised to see all that X86-specific code in the middle-end. Why don’t we expose a hook (TargetMachine?) for the different targets to handle their own intrinsics their own way? InstCombine, ConstantFolding, AliasAnalysis, … would fall back to the target to handle non-generic intrinsics.

— Mehdi
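One way the hook being asked about could be shaped is sketched below. Everything here is invented for illustration (the registry, the hook signature, the intrinsic names); it is not an existing LLVM interface, just a toy model of generic passes delegating unknown intrinsics to the owning target:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Per-target callbacks for handling that target's own intrinsics.
struct TargetIntrinsicInfo {
  // Returns true and sets Result if this target can constant-fold the call.
  std::function<bool(const std::string &Name, int Arg, int &Result)>
      ConstantFold;
};

// Targets register themselves here (e.g. from LLVMInitializeX86Target),
// so targets that are not compiled in contribute nothing.
std::map<std::string, TargetIntrinsicInfo> &targetRegistry() {
  static std::map<std::string, TargetIntrinsicInfo> Registry;
  return Registry;
}

// What a generic pass like ConstantFolding would do for a call it does not
// recognize: if the target is absent (or has no hook), leave the call
// alone; otherwise let the target decide.
bool tryTargetConstantFold(const std::string &Target, const std::string &Name,
                           int Arg, int &Result) {
  auto It = targetRegistry().find(Target);
  if (It == targetRegistry().end() || !It->second.ConstantFold)
    return false;
  return It->second.ConstantFold(Name, Arg, Result);
}
```

A backend would install its hook at initialization time, e.g. `targetRegistry()["x86"].ConstantFold = ...;`, keeping all knowledge of x86 intrinsics inside the x86 target while still letting generic passes fold them.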
> On Dec 29, 2014, at 5:51 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> A more generic question I have is: why does the code that handles “target-specific” intrinsics have to be in the middle-end? I am surprised to see all that X86-specific code in the middle-end.
> Why don’t we expose a hook (TargetMachine?) for the different targets to handle their own intrinsics their own way?
> InstCombine, ConstantFolding, AliasAnalysis, … would fall back to the target to handle non-generic intrinsics.

This is what I was getting at when I said that I think there could be a better separation of concerns with respect to the handling of target-specific intrinsics, which could move some of their handling into the targets themselves, behind hooks that can be invoked from generic passes.

As to why we need to optimize them, just consider constant folding. It’s quite possible for constant-foldable forms of target-specific intrinsics to arise from macros, template expansions, etc. It’s important that we do that constant folding as early as possible because it can affect later decisions like inlining, loop unrolling, etc. If we can’t constant-fold the target-specific operation until after the middle-end, then we’ve potentially lost out.

—Owen