> On Dec 23, 2014, at 10:54 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
>> On Dec 23, 2014, at 10:41 AM, Chris Lattner <clattner at apple.com> wrote:
>>
>> On Dec 23, 2014, at 10:28 AM, Chris Bieneman <beanz at apple.com> wrote:
>>>>> It should be straight-forward to have something like LLVMInitializeX86Target/RegisterTargetMachine install the intrinsics into a registry.
>>>>
>>>> I tried doing that a few years ago. It’s not nearly as easy as it sounds because we’ve got hardcoded references to various target intrinsics scattered throughout the code.
>>>
>>> I was just writing to say exactly this. There are a number of ways we could work toward something like this. I’m completely in favor of a world where intrinsics are properties of the targets and don’t leak out; however, today they do in a lot of places.
>>
>> What are the specific problems here? Anything that does an equality comparison with the IntrinsicID can be changed to do a strcmp with the name. That would handle the one-off cases like InstCombiner::SimplifyDemandedUseBits in InstCombine.
>>
>> The other cases in InstCombine could be handled similarly, but may be better handled by adding intrinsic behavior query APIs to the intrinsic registry, or would be better handled (eventually) by adding new attributes to the intrinsics themselves.
>
> I don’t think there’s anything fundamentally difficult, but it’s a big change. For example:
>
> $ git grep Intrinsic:: | wc
>     3364   12286  281078
>
> The vast majority of those 3,364 lines have hardcoded references to specific intrinsics. Many of them are used in contexts where you can’t easily insert a strcmp (e.g., case values in large switch statements, or worse, the m_Intrinsic PatternMatch templates).

I don’t find this convincing. It should be simple to introduce a new m_Intrinsic PatternMatch template that takes a string. The switches are also straight-forward. In BasicAA, for example, instead of:

  switch (II->getIntrinsicID()) {
  default: break;
  case Intrinsic::memset:
  case Intrinsic::memcpy:
  case Intrinsic::memmove: {
    ...
  case Intrinsic::arm_neon_vld1: {
    assert(ArgIdx == 0 && "Invalid argument index");
    assert(Loc.Ptr == II->getArgOperand(ArgIdx) &&
           "Intrinsic location pointer not argument?");
    // LLVM's vld1 and vst1 intrinsics currently only support a single
    // vector register.
    if (DL)
      Loc.Size = DL->getTypeStoreSize(II->getType());
    break;
  }
  }

Just change this to:

  switch (II->getIntrinsicID()) {
  case Intrinsic::memset:
  case Intrinsic::memcpy:
  case Intrinsic::memmove: {
    ...
  default:
    if (II->getName() == "llvm.arm.neon.vld1") {
      same stuff as above
    }
    break;

Spreading ifdefs through the codebase would be really unfortunate.

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141223/eb3be06b/attachment.html>
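To make the name-based fallback concrete, here is a minimal, self-contained sketch of the pattern Chris describes. `GenericIntrinsic`, `CallSite`, and `knowsMemoryBehavior` are invented stand-ins for illustration, not LLVM API; in real code this would be `II->getIntrinsicID()` and `II->getName()`:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins for LLVM's IntrinsicInst interface.
enum class GenericIntrinsic { NotIntrinsic, MemSet, MemCpy, MemMove };

struct CallSite {
  GenericIntrinsic ID; // target-independent intrinsics keep an enum ID
  std::string Name;    // target-specific ones are matched by name only
};

// Toy version of the BasicAA logic: generic intrinsics stay in the switch;
// target-specific ones fall through to a name comparison, so no
// Intrinsic::arm_neon_vld1 enumerator needs to exist in generic code.
bool knowsMemoryBehavior(const CallSite &CS) {
  switch (CS.ID) {
  case GenericIntrinsic::MemSet:
  case GenericIntrinsic::MemCpy:
  case GenericIntrinsic::MemMove:
    return true;
  default:
    return CS.Name == "llvm.arm.neon.vld1";
  }
}
```

With this shape, compiling out the ARM backend removes the enumerator without touching the generic pass; the string comparison simply never matches.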
> On Dec 23, 2014, at 12:41 PM, Chris Lattner <clattner at apple.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> Spreading ifdefs through the codebase would be really unfortunate.

I agree. I wasn’t at all trying to discourage anyone from doing as you suggest — just pointing out that it’s not a small project. Much of it would be mechanical, though, and it does seem like the right long-term solution.
On Tue, Dec 23, 2014 at 12:41 PM, Chris Lattner <clattner at apple.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> Spreading ifdefs through the codebase would be really unfortunate.

If we're going to talk about what the right long-term design is, let me put out a different opinion. I used to be somewhat torn on this issue, but between this discussion and looking at the particular intrinsics in question, I'm rapidly being persuaded.

We shouldn't have any target-specific intrinsics. At the very least, we shouldn't use them anywhere in the front- or middle-end, even if we have them.

Today, frontends need to emit specific target intrinsics *and* have the optimizer be aware of them. I can see a few reasons why:

1) Missing semantics -- the IR may not have *quite* the semantics desired and provided by the target's ISA.
2) Historical expectations -- the GCC-compatible builtins are named after the instructions, and the target-independent builtins lower to intrinsics, so the target-specific ones should too.
3) Poor instruction selection -- we could emit the logic as boring IR, but we fail to instruction-select it well, so as a hack we emit the instruction directly and teach the optimizer to still optimize through it.

If we want to pursue the *right* design, I think we should be fixing these three issues, and then we won't need the optimizer to be aware of any of this.

Fixing #1) We just need to flush these out and define target-independent intrinsics. The number of these is relatively small and not likely to be a burden. I'm looking at you, cvtss2si.

Fixing #2) I think this just requires a giant pile of work to re-target builtins and other things that people expect the optimizer to operate on into proper IR. Yes, it's hard, but it is the right design. With x86 we're already about 90% of the way there. My understanding is that ARM has gone the other direction. While unpleasant, I think we have to make a choice: either the builtin doesn't get optimized or it gets lowered to "normal" IR.

Fixing #3) Yes, we need better instruction selection. We have always needed better instruction selection. If there are truly critical needs that aren't met, we can even achieve this through much more localized hacks *in the backend* that form specific intrinsics to "pre-select" instructions that we know are important. While just as hacky and gross, this seems like a much better tradeoff to make than the current approach.

In the end, we don't have target-specific intrinsics in the middle end at all. Then we get to decide whether it is worth our time to provide a really efficient target-specific *encoding* of these operations, or if we should just lower to the target-specific encoding we already have: inline assembly. I don't have strong feelings about this either way.

OK, so I think that is the right design, but I also completely agree with Chris B here -- it is *way more work* than is really called for to address a very narrow, targeted problem that WebKit has: binary size. But I also agree with Chris L's point here:

> Spreading ifdefs through the codebase would be really unfortunate.

I just don't think this is necessary. And I don't think relying on a fragile frontend-based "optimization" to achieve the same effects is really great either.

The huge code in IR/Function.cpp.o comes from *terrible* code being emitted by tablegen. We should just change tablegen to emit something smaller. This will get most of the size win, and will get it even when the target *is* compiled in! In particular, if we switch to a sorted table of intrinsic names, change the *lookup* of names to use the same sorted table so we share the memory, and do a binary search, I think the code size will be essentially free.

Once we address that, Chris B can use the somewhat ugly #ifdefs or his clang attribute patch to look for other hotspots where we burn size on intrinsics. I bet a few more could be fixed by similar effort. If the code size is *still* a problem, we could start factoring things incrementally into target-independent intrinsics and removing the target-specific ones, which might help incrementally move us toward the "right design" (IMO). Then if someone is really motivated (which would be cool) to do all the hard work to really move us over to the right design, we would have lost nothing, there would be essentially no more technical debt than there is today, and WebKit and others won't have to wait for their binary size improvements.

-Chandler
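A minimal sketch of the sorted-table scheme described above. The table contents, function names, and ID encoding are invented for illustration; the real tablegen-emitted table would list every intrinsic:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <iterator>
#include <string>

// One sorted array of names, shared by the name->ID and ID->name
// directions, so the string data is stored only once.
static const char *const IntrinsicNames[] = {
    "llvm.memcpy", "llvm.memmove", "llvm.memset", "llvm.sqrt",
}; // must stay sorted for the binary search below

// Binary search over the sorted table instead of a huge generated matcher.
// Returns an index (a stand-in for Intrinsic::ID), or -1 if not found.
int lookupIntrinsicID(const std::string &Name) {
  const char *const *Begin = std::begin(IntrinsicNames);
  const char *const *End = std::end(IntrinsicNames);
  const char *const *It = std::lower_bound(
      Begin, End, Name, [](const char *Entry, const std::string &N) {
        return std::strcmp(Entry, N.c_str()) < 0;
      });
  if (It != End && Name == *It)
    return static_cast<int>(It - Begin);
  return -1;
}

// The reverse mapping reuses the same table.
const char *intrinsicName(int ID) { return IntrinsicNames[ID]; }
```

The lookup cost is O(log n) string comparisons, and because both directions share one table, the size of the emitted code and data grows only linearly with the number of intrinsics.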
> On Dec 23, 2014, at 1:40 PM, Chandler Carruth <chandlerc at google.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> If we want to pursue the *right* design, I think we should be fixing these three issues and then we won't need the optimizer to be aware of any of this.

I strongly disagree with your conclusions here. Everything you’re suggesting is rooted in three base assumptions that are not true for many clients:
- that all source languages are basically C
- that all programming models are more or less like C on a *nix system
- that all hardware is basically like the intersection of X86 and ARM (“typical RISC machine”)

Consider the use case of an OpenGL shader compiler. Its source language is not C (despite syntactic appearances), and the frontend may need to express semantics that are difficult or impossible to express in target-independent IR. Its programming model is not like a C compiler's, including constructs like cross-thread derivatives, uniform vs. varying calculations, etc. Its target instruction set is likely nothing at all like X86 or ARM, likely including an arithmetic set that is very different from your typical CPU's, as well as lots of ISA-level constructs for interacting with various fixed-function hardware units.

Consider the less exotic use case of a DSP compiler. DSPs typically have lots of instructions for “unusual” arithmetic operations that are intended to map to very specific use cases: lots of variants of rounding and/or wrapping control, lots of extending/widening/doubling operations, memory accesses with unusual stride patterns. The entire purpose of the existence of a DSP is to deliver high computation bandwidth under tight latency constraints. If your DSP compiler fails to make use of the exotic arithmetic operations that the user requested, the whole system has *failed* at being a DSP.

Consider the even-closer-to-home use case of vector programming. There are three major families of vector extensions in widespread use (SSE, NEON, and Altivec), as well as many variants and lesser-known instruction sets. And while all three agree on a small core of functionality (fadd <4 x float>!), all of them include large bodies of just plain arithmetic that are not covered by the others and are not practically expressible in target-independent IR. Even if we add the union of their functionality to target-independent IR, then we have the reverse problem, where the frontend and optimizers may produce IR that most backends have little to no hope of generating good code for. And let’s not forget that, while the requirements here are somewhat less strict than on a DSP, our users will still be very unhappy if they write a top-half-extending-saturating-absolute-difference builtin and we give them 100 instructions of emulated gunk back.

While I agree with the underlying sentiment that we should strive to minimize the intrusion of target-specific intrinsics as much as possible, and to compartmentalize them into their source backends as much as possible, expecting to reach a world with no intrinsic considerations in any part of the frontend or optimizer just seems hopelessly idealistic.

—Owen
> On Dec 23, 2014, at 3:07 PM, Owen Anderson <resistor at mac.com> wrote:
>
> [... full quote of the previous message trimmed ...]
>
> While I agree with the underlying sentiment that we should strive to minimize the intrusion of target-specific intrinsics as much as possible, and compartmentalizing them into their source backends as much as possible, expecting to reach a world with no intrinsic considerations in any part of the frontend or optimizer just seems hopelessly idealistic.

I don’t see exactly why the middle-end optimizer has to know about the mighty DSP intrinsic. If it maps to this super-powerful hardware instruction anyway, can’t the middle-end optimizer just ignore it and leave it as-is to be handled by the backend? (I agree the front-end has to know about it anyway...)

A more generic question I have is: why does the code that handles “target-specific” intrinsics have to be in the middle-end? I am surprised to see all that X86-specific code in the middle-end. Why don’t we expose a hook (TargetMachine?) for the different targets to handle their own intrinsics their own way? InstCombine, ConstantFolding, AliasAnalysis, … would fall back to the target to handle non-generic intrinsics.

— Mehdi
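One way the hook being asked about could be shaped is sketched below. Everything here is invented for illustration (the registry, the hook signature, the intrinsic names); it is not an existing LLVM interface, just a toy model of generic passes delegating unknown intrinsics to the owning target:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Per-target callbacks for handling that target's own intrinsics.
struct TargetIntrinsicInfo {
  // Returns true and sets Result if this target can constant-fold the call.
  std::function<bool(const std::string &Name, int Arg, int &Result)>
      ConstantFold;
};

// Targets register themselves here (e.g. from LLVMInitializeX86Target),
// so targets that are not compiled in contribute nothing.
std::map<std::string, TargetIntrinsicInfo> &targetRegistry() {
  static std::map<std::string, TargetIntrinsicInfo> Registry;
  return Registry;
}

// What a generic pass like ConstantFolding would do for a call it does not
// recognize: if the target is absent (or has no hook), leave the call
// alone; otherwise let the target decide.
bool tryTargetConstantFold(const std::string &Target, const std::string &Name,
                           int Arg, int &Result) {
  auto It = targetRegistry().find(Target);
  if (It == targetRegistry().end() || !It->second.ConstantFold)
    return false;
  return It->second.ConstantFold(Name, Arg, Result);
}
```

A backend would install its hook at initialization time, e.g. `targetRegistry()["x86"].ConstantFold = ...;`, keeping all knowledge of x86 intrinsics inside the x86 target while still letting generic passes fold them.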
> On Dec 29, 2014, at 5:51 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> A more generic question I have is: why does the code that handles “target-specific” intrinsics have to be in the middle-end? I am surprised to see all that X86-specific code in the middle-end.
> Why don’t we expose a hook (TargetMachine?) for the different targets to handle their own intrinsics their own way?
> InstCombine, ConstantFolding, AliasAnalysis, … would fall back to the target to handle non-generic intrinsics.

This is what I was getting at when I said that I think there could be a better separation of concerns with respect to the handling of target-specific intrinsics, which could move some of their handling into the targets themselves, behind hooks that can be invoked from generic passes.

As to why we need to optimize them, just consider constant folding. It’s quite possible for constant-foldable forms of target-specific intrinsics to arise from macros, template expansions, etc. It’s important that we do that constant folding as early as possible because it can affect later decisions like inlining, loop unrolling, etc. If we can’t constant-fold the target-specific operation until after the middle-end, then we’ve potentially lost out.

—Owen