On April 25, 2014 at 1:44:37 PM, Reid Kleckner (rnk at google.com) wrote:

> Thanks for the writeup! It's very helpful.
>
> On Fri, Apr 25, 2014 at 11:49 AM, Filip Pizlo <fpizlo at apple.com> wrote:
>> On April 25, 2014 at 10:48:18 AM, Reid Kleckner (rnk at google.com) wrote:
>>> On Fri, Apr 25, 2014 at 10:19 AM, Filip Pizlo <fpizlo at apple.com> wrote:
>>>> The sdiv operation in LLVM IR only makes sense for C and its very direct
>>>> relatives. The amount of control flow necessary to represent a safe
>>>> division for any other language is ghastly. (a/b) becomes something like
>>>> (b != 0 ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0). It's more
>>>> useful to the optimizer to see this as the single operation that it
>>>> really is.
>>>
>>> This argument makes sense to me. It would be hard for the arm64 backend
>>> to pattern match all this control flow back into a single instruction
>>> after optimizations.
>
> I want to soften this statement a bit: while ISel can't look across basic
> blocks, we could probably pattern match away the branches in the
> SelectionDAGBuilder, because it can see the original use-def chain for the
> result of the division. This would take a little work, but I can see how it
> would work and I've only touched SelectionDAGBuilder a few times.

So, it seems that for JavaScript and Java on ARM - and probably any
environment with emulated division - these intrinsics would result in tighter
code without requiring target intrinsics. For all other language/hardware
combinations, the main benefit here is to materialize the gnarly control flow
later and so hopefully reveal more optimization opportunities. That seems
like a pretty good sweet-spot for the systems that I know about.

>>> If we're going to emit the control flow anyway, then we actually want to
>>> expose the control flow to the mid-level optimizers as early as possible.
>>
>> Not if you want to move code around. Consider a loop that is too large to
>> unswitch but that contains a "safe division" expressed as branches and
>> that division is loop invariant. What would happen? My guess is that the
>> optimizer would have a hard time with it.
>
> For example, on a non-ARM64 ISA, if the user checked that the divisor was
> zero already, we can eliminate the zero check branch. We could also
> unswitch this loop:
>
>   int b = ...;
>   for (int i = 0; i < n; ++i)
>     c[i] = a[i] / b;
>
> Lowering this with branches produces:
>
>   for (int i = 0; i < n; ++i) {
>     if (b == 0)
>       c[i] = 0;
>     else if (b == -1 && a[i] == INT_MIN)
>       c[i] = INT_MIN;
>     else
>       c[i] = a[i] / b;
>   }
>
> b is loop invariant, and we could unswitch this on non-AArch64:
>
>   if (b == 0)
>     for (int i = 0; i < n; ++i)
>       c[i] = 0;
>   else if (b == -1)
>     for (int i = 0; i < n; ++i)
>       c[i] = (a[i] == INT_MIN) ? INT_MIN : a[i] / b;
>   else
>     for (int i = 0; i < n; ++i)
>       c[i] = a[i] / b;
>
> Note that the middle branch goes away for unsigned division. It seems to me
> that the branches are better all around, at least for other ISAs.

Sure, but that would be definitely suboptimal for JS on ARM64. So if the
claim is that ARM64 will optimize this by pattern matching in the backend,
then this is a clear example of where such pattern matching would be rather
hard: an earlier optimization could inhibit our ability to do this
effectively.

> As for generality, every language except asm.js (or
> JS-with-i32-coercion-hints) has to branch on the "overflow" bit of the
> intrinsic anyway.

Heh, I wouldn't call it asm.js. asm.js adopted something that was already a
de-facto idiom. But yes - most people want to be told if the denominator was
zero but want INT_MIN/-1 to flow through gracefully and have defined
behavior.
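For concreteness, here is a rough sketch - not taken from the thread, with
illustrative function and value names only - of the IR a JS-style frontend
has to emit today for an (a / b) | 0 style division without such an
intrinsic. This is the "ghastly" control flow referred to above, and the
shape a backend would have to pattern-match back into a single division:

  define i32 @js_int_div(i32 %a, i32 %b) {
  entry:
    ; divide-by-zero produces 0
    %den.zero = icmp eq i32 %b, 0
    br i1 %den.zero, label %done, label %check.ovf

  check.ovf:
    ; INT_MIN / -1 produces INT_MIN instead of trapping
    %num.min = icmp eq i32 %a, -2147483648
    %den.neg1 = icmp eq i32 %b, -1
    %ovf = and i1 %num.min, %den.neg1
    br i1 %ovf, label %done, label %do.div

  do.div:
    %q = sdiv i32 %a, %b
    br label %done

  done:
    %res = phi i32 [ 0, %entry ], [ -2147483648, %check.ovf ], [ %q, %do.div ]
    ret i32 %res
  }

Once other passes have hoisted, sunk, or merged pieces of this CFG, recovering
the single-operation form in the backend becomes the hard part.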
> There's also the discussion of iload and istore, where the consensus was
> that we'd rather have explicit null checks and branches.

Lol. Since you brought that up, I figure I might as well throw this out:
probably some VMs will want a guaranteed trap on divide-by-zero as a free way
of throwing the exception. I actually think that the safediv intrinsic would
ultimately make this easier. The client would generate code like
"if (extractelement (safediv (...), 1) == 1) trap", and that could be
pattern-matched to mean "emit code that deterministically traps on divide by
zero". I don't think we have the vocabulary necessary to express this in IR.
But I think that matching this in the backend would be easier than matching
"if (b == 0) then trap else do a division", since the division could be moved
around independently of the branch.

> I'm also not excited by the prospect of teaching all the mid-level
> optimizers about the arithmetic properties of safe.[us]div, which has been
> a big drawback to the other overflow-checked arithmetic intrinsics.
>
> So, in conclusion, I think we should at least *try* to pattern match out
> the branches in SDAGBuilder for ARM64 before giving up and adding this
> intrinsic, or something like it. Does that sound reasonable?

My fear is that since the control flow is complex, the matching would get
really nutty. For example, you could be clever for a/b and say
"b + 1 >_unsigned 1 ? a / b : slow path" - I actually did that in some VM at
some point. Then there are all of the ways that the control flow could get
messed up by optimization phases. If you ask the frontend that generates LLVM
IR to do this on its own, then you'll end up with a lot of patterns on your
hands. It would take some effort to canonicalize control flow of this
complexity to the point where the pattern matching would be reliable. I think
it's better if we give frontends that need this some way of expressing it to
LLVM explicitly.

Note that this doesn't preclude LLVM from lowering the intrinsic as early as
we deem appropriate. But then the choice to do so is within LLVM rather than
being delegated to the client, who is less likely to know about specifics
like what LLVM can or can't pattern match.

In short, I agree with your observations that these intrinsics are not an
obvious slam-dunk compared to making the explicit control flow, but I think
that the intrinsics do give enough flexibility on the LLVM side that it would
be great if front-ends used them rather than rolling the control flow
themselves.

-Filip
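As a rough illustration of the trap pattern described above - a sketch only,
assuming the proposed (hypothetical) @llvm.safe.sdiv.i32 declaration rather
than anything that exists in LLVM today - the client-side IR might look like
this, and a backend could match this shape to emit a division that
deterministically traps on a zero divisor:

  ; hypothetical intrinsic from the proposal; @llvm.trap is real
  declare { i32, i1 } @llvm.safe.sdiv.i32(i32, i32)
  declare void @llvm.trap() noreturn

  define i32 @div_or_trap(i32 %a, i32 %b) {
  entry:
    %res = call { i32, i1 } @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %bit = extractvalue { i32, i1 } %res, 1
    br i1 %bit, label %trap, label %continue

  trap:        ; bit set means the divide-by-zero case per the discussion above
    call void @llvm.trap()
    unreachable

  continue:
    %div = extractvalue { i32, i1 } %res, 0
    ret i32 %div
  }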
Eric Christopher
2014-Apr-25 21:21 UTC
[LLVMdev] Proposal: add intrinsics for safe division
> In short, I agree with your observations that these intrinsics are not an
> obvious slam-dunk compared to making the explicit control flow, but I think
> that the intrinsics do give enough flexibility on the LLVM side that it
> would be great if front-ends used them rather than rolling the control flow
> themselves.

The problem is that then we have two problems: all targets (except for arm64)
have to lower the intrinsic as the first thing they do (giving us a TTI pass
as the first thing in the pipeline) in order to take advantage of the
information later during optimization, _and_ we have to plumb in all of the
work of optimizing the intrinsic as well, giving us a situation where we've
split our optimization efforts in addition to the pain of maintaining an
intrinsic that's useful for a single target.

I really think that this is just solidifying my position that the intrinsic
is a bad idea and that this should be done as later optimizations.

-eric
On 04/25/2014 02:21 PM, Eric Christopher wrote:
>> In short, I agree with your observations that these intrinsics are not an
>> obvious slam-dunk compared to making the explicit control flow, but I think
>> that the intrinsics do give enough flexibility on the LLVM side that it
>> would be great if front-ends used them rather than rolling the control flow
>> themselves.
>>
> The problem is that then we have 2 problems: All targets (except for
> arm64) then have to lower the intrinsic as the first thing they do
> (giving us a TTI pass as the first thing in the pipeline) to take
> advantage of the information later during optimization, _and_ we have
> to plumb all of the work optimizing the intrinsic as well giving us a
> situation where we've now split our optimization efforts as well as
> the pain of maintaining an intrinsic that's useful for a single
> target.
>
> I really think that this is just solidifying my position that the
> intrinsic is a bad idea and that this should be done as later
> optimizations.
>
> -eric

Does anyone have performance data on ARM for the early vs. late lowering
approaches? Having data would make the discussion about the tradeoffs of the
two approaches much more concrete. To put it another way, how much do we lose
performance-wise to missed pattern matching?

Philip
On Apr 25, 2014, at 2:21 PM, Eric Christopher <echristo at gmail.com> wrote:
>> In short, I agree with your observations that these intrinsics are not an
>> obvious slam-dunk compared to making the explicit control flow, but I think
>> that the intrinsics do give enough flexibility on the LLVM side that it
>> would be great if front-ends used them rather than rolling the control flow
>> themselves.
>>
> The problem is that then we have 2 problems: All targets (except for
> arm64) then have to lower the intrinsic as the first thing they do
> (giving us a TTI pass as the first thing in the pipeline) to take
> advantage of the information later during optimization, _and_ we have
> to plumb all of the work optimizing the intrinsic as well giving us a
> situation where we've now split our optimization efforts as well as
> the pain of maintaining an intrinsic that's useful for a single
> target.
>
> I really think that this is just solidifying my position that the
> intrinsic is a bad idea and that this should be done as later
> optimizations.

The goal of the intrinsic wasn't stated clearly. This intrinsic isn't
particularly necessary for any specific frontend or architecture. I think the
LLVM community would benefit from a canonical way to represent well-defined
integer division. We don't need to add an IR instruction, because it's fine
for most optimization passes to ignore these operations. A target-independent
intrinsic works well as long as:

- It cleanly captures the language-level semantics.
- It facilitates mid-level optimization.
- It naturally lowers into ideal code, both on architectures that trap (x86)
  and those that don't (arm).

To summarize my understanding of the concerns:

(1) The semantics of the safe.div intrinsic need to be useful for the
Language/ISA matrix that LLVMers care about.

At canonical IR level, the intrinsic is useful because it eliminates control
flow merges and represents divide-by-zero and/or signed overflow in a
canonical form:

  %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
  %bit = extractvalue {i32, i1} %res, 1
  br i1 %bit, label %trap, label %continue
trap:
  call ...
  unreachable
continue:
  %div = extractvalue {i32, i1} %res, 0

The original proposal fails to achieve this because the common case of
Java/Go would require a check in the slow path to differentiate
divide-by-zero from signed overflow. That should be fixed by generalizing the
intrinsic so that the two conditions are distinct:

  %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
  %div0 = extractvalue {i32, i1, i1} %res, 1
  br i1 %div0, label %trap, label %checkovf
checkovf:
  %ovf = extractvalue {i32, i1, i1} %res, 2
  br i1 %ovf, label %trap, label %continue
trap:
  call ...
  unreachable
continue:
  %div = extractvalue {i32, i1, i1} %res, 0

...or some variation of the above. I don't have a personal stake in this.

(2) The safe.div intrinsic inhibits generic code motion and Correlated Value
Propagation based optimization.

This goes both ways. CVP could miss out on cases unless we teach it about the
semantics of the intrinsics. Correct me if I'm wrong, but I don't think this
would actually be too difficult. OTOH, with the intrinsics it would be easy
to write a simple optimization pass that hoists and combines checks along the
domtree. After divide-by-zero optimization, you would have something like:

  %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
  %div0 = extractvalue {i32, i1, i1} %res, 1
  br i1 %div0, label %trap, label %continue
trap:
  ; No call here!
  unreachable
continue:
  %div = extractvalue {i32, i1, i1} %res, 0

And the branch to trap just goes away at some point.

Now considering Reid's LICM example:

  for (int i = 0; i < n; ++i) {
    if (b == 0)
      c[i] = 0;
    else if (b == -1 && a[i] == INT_MIN)
      c[i] = INT_MIN;
    else
      c[i] = a[i] / b;
  }

Simply put, if we want to unswitch this code for x86, then we need a TTI pass
to lower the intrinsic. On the other hand, I believe it is more important to
eliminate the control flow within the loop to aid loop analysis and other
optimizations. So we still want the front end to emit the intrinsic; we just
may eventually want it lowered earlier than CGP (CodeGenPrepare). I don't
think this issue has any bearing on the intrinsic's LangRef spec.

There was some comparison made to iload/istore, which I don't follow:

- subsuming loads and stores into another instruction is really scary,
  considering how much logic we have for analyzing the side effects of memory
  access.
- there is no benefit to IR optimization from the iload/istore instruction.
- it is easy to detect null-checked loads/stores in IR at any point.
- it is very rare that a platform would want to resume from a trapping
  load/store.

(3) The safe.div intrinsic is a target-specific codegen optimization.

Regardless of the target, we want to eliminate control flow merges in all
cases, and completely eliminate control flow when we don't have trapping
semantics. To do this, we need an intrinsic at canonical IR level. Clearly it
should be a target-independent intrinsic, since *some* optimizations/analyses
want to be aware of it.

Even ignoring the rationale above, supporting codegen with a target-specific
intrinsic would minimally require the DAG builder to match
"(b != 0) ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0" after all
optimizations have had their way with it. A big problem here is that there
are actually many different forms of this expression and surrounding control
flow, depending on the frontend's integer division semantics. We would have
to recognize all the variations of checking for divide-by-zero and/or
overflow, trapping and/or producing a certain constant, and branching or
selecting values. This is obviously terrible from an engineering standpoint.

-Andy
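To make the Java/Go point in (1) concrete, here is a sketch - mine, not
Andy's - of how a Java-style frontend might use the generalized three-result
form. It assumes the intrinsic is defined to put INT_MIN in the result slot
for INT_MIN / -1, so only the divide-by-zero bit needs a branch; the runtime
helper name is made up:

  ; hypothetical generalized intrinsic and a made-up runtime helper
  declare { i32, i1, i1 } @llvm.safe.sdiv.i32(i32, i32)
  declare void @throw_arithmetic_exception() noreturn

  define i32 @java_idiv(i32 %a, i32 %b) {
  entry:
    %res = call { i32, i1, i1 } @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %div0 = extractvalue { i32, i1, i1 } %res, 1
    br i1 %div0, label %divzero, label %continue

  divzero:     ; divide by zero: throw ArithmeticException via the runtime
    call void @throw_arithmetic_exception()
    unreachable

  continue:    ; overflow bit ignored: INT_MIN / -1 already yields INT_MIN
    %div = extractvalue { i32, i1, i1 } %res, 0
    ret i32 %div
  }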