On April 25, 2014 at 10:48:18 AM, Reid Kleckner (rnk at google.com) wrote:

> On Fri, Apr 25, 2014 at 10:19 AM, Filip Pizlo <fpizlo at apple.com> wrote:
>> The sdiv operation in LLVM IR only makes sense for C and its very direct
>> relatives. The amount of control flow necessary to represent a safe
>> division for any other language is ghastly. (a/b) becomes something like
>> (b != 0 ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0). It's more
>> useful to the optimizer to see this as the single operation that it
>> really is.
>
> This argument makes sense to me. It would be hard for the arm64 backend
> to pattern match all this control flow back into a single instruction
> after optimizations.
>
> Are there ISAs out there that produce different results without trapping
> for these boundary conditions? Should we leave the quotient result
> unspecified in the b == 0 || (a == INT_MIN && b == -1) cases? If we do
> that, we require frontends to check the "overflow" bit if they want
> deterministic cross-platform behavior.

That's a good question. I believe that all ISAs take one of the following
paths:

- Trap on x/0 and INT_MIN/-1, like x86.
- Return 0 for x/0 and INT_MIN for INT_MIN/-1, like ARM.

Also, here's what I know about what different languages do:

C/C++/etc: undefined on x/0 and INT_MIN/-1, so they would use LLVM's
existing div instruction.

JavaScript: optimizing compilers look for places where people write things
like (a/b)|0 - in JavaScript, x|0 means "truncate to int32" - and they will
do an integer division if the inputs were already represented as integers.
x/0 produces NaN, and NaN|0 produces 0; INT_MIN/-1 produces -INT_MIN (which
is representable as a double), and then (-INT_MIN)|0 produces INT_MIN. So,
the ARM64 semantics match optimized JavaScript semantics. There is also the
case where you say a/b in JavaScript and we have inferred that a and b are
integers but there is no |0. In that case we want to detect x/0 and
INT_MIN/-1 so that we can promote the result to double.

Java: x/0 throws an exception, but INT_MIN/-1 produces INT_MIN.

Ruby and probably a lot of others: provide BigInt semantics but have fast
paths for small ints. So, these will throw an exception on x/0, and they
will inflate to a BigInt on INT_MIN/-1.

I don't know about other languages.

Looking at this, it makes me *slightly* prefer to generalize these
intrinsics a bit. Instead of returning an i1 flag for all overflow cases,
you could return an i2 status enum that says 0 for neither overflow
condition, 1 for x/0, and 2 for INT_MIN/-1. Then you get the following
matrix for how Java and JavaScript would map to this and how it would be
lowered:

- Java on x86: use safediv and check if the status is 1. If it's 1, then
  throw. This lowers to a bunch of branches fairly late in the backend.
- JavaScript on x86 with |0: use safediv and ignore the status. Also
  lowers to a bunch of branches, which is as good as it gets.
- JavaScript on x86 without |0: use safediv and deoptimize if status != 0.
  Also a bunch of branches.
- Ruby on x86: use safediv and detect both status == 1 and status == 2.
  This lowers to a bunch of branches.
- Java on ARM: use safediv as on x86, but now it lowers to just a branch
  to detect x/0, and not the other case. Note that without the safediv
  intrinsic, you wouldn't be able to get this down to one branch unless
  you used a target intrinsic.
- JavaScript on ARM with |0: use safediv, ignore the status, and you get
  no branches. It maps directly to ARM.
- JavaScript on ARM without |0: use safediv and test the status, so you
  get a bunch of branches as before.
- Ruby on ARM: this lowers to a bunch of branches, as on x86.

So, it seems that for JavaScript and Java on ARM - and probably any
environment with emulated division - these intrinsics would result in
tighter code without requiring target intrinsics. For all other
language/hardware combinations, the main benefit here is to materialize
the gnarly control flow later and thereby hopefully reveal more
optimization opportunities. That seems like a pretty good sweet spot for
the systems that I know about.

-Filip
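[For concreteness, here is a minimal C model of the status-returning
safediv sketched above. The function name and the status encoding are
illustrative assumptions - the thread never fixes either:]

  #include <limits.h>

  /* Status values mirroring the proposed i2 enum: 0 = no special case,
     1 = division by zero, 2 = INT_MIN/-1 overflow. Hypothetical names. */
  enum safediv_status { SAFEDIV_OK = 0, SAFEDIV_DIVZERO = 1,
                        SAFEDIV_OVERFLOW = 2 };

  /* Reference semantics of the proposed safe signed division: never
     traps, always produces the ARM-style result, and reports which
     special case occurred. */
  static int safe_sdiv(int a, int b, enum safediv_status *status) {
      if (b == 0) {
          *status = SAFEDIV_DIVZERO;
          return 0;                 /* ARM returns 0 for x/0 */
      }
      if (a == INT_MIN && b == -1) {
          *status = SAFEDIV_OVERFLOW;
          return INT_MIN;           /* ARM returns INT_MIN for INT_MIN/-1 */
      }
      *status = SAFEDIV_OK;
      return a / b;
  }

[Under this model, a Java-style front end branches only on
SAFEDIV_DIVZERO to throw, while a JS |0 front end ignores the status
entirely - matching the lowering matrix above.]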
Thanks for the writeup! It's very helpful.

On Fri, Apr 25, 2014 at 11:49 AM, Filip Pizlo <fpizlo at apple.com> wrote:

> On April 25, 2014 at 10:48:18 AM, Reid Kleckner (rnk at google.com) wrote:
>> On Fri, Apr 25, 2014 at 10:19 AM, Filip Pizlo <fpizlo at apple.com> wrote:
>>> The sdiv operation in LLVM IR only makes sense for C and its very
>>> direct relatives. The amount of control flow necessary to represent a
>>> safe division for any other language is ghastly. (a/b) becomes
>>> something like (b != 0 ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN)
>>> : 0). It's more useful to the optimizer to see this as the single
>>> operation that it really is.
>>
>> This argument makes sense to me. It would be hard for the arm64 backend
>> to pattern match all this control flow back into a single instruction
>> after optimizations.

I want to soften this statement a bit: while ISel can't look across basic
blocks, we could probably pattern match away the branches in the
SelectionDAGBuilder, because it can see the original use-def chain for the
result of the division. This would take a little work, but I can see how
it would work, even though I've only touched SelectionDAGBuilder a few
times.

> So, it seems that for JavaScript and Java on ARM - and probably any
> environment with emulated division - these intrinsics would result in
> tighter code without requiring target intrinsics. For all other
> language/hardware combinations, the main benefit here is to materialize
> the gnarly control flow later and thereby hopefully reveal more
> optimization opportunities. That seems like a pretty good sweet spot for
> the systems that I know about.

If we're going to emit the control flow anyway, then we actually want to
expose the control flow to the mid-level optimizers as early as possible.
For example, on a non-ARM64 ISA, if the user already checked that the
divisor was zero, we can eliminate the zero-check branch. We could also
unswitch this loop:

  int b = ...;
  for (int i = 0; i < n; ++i)
    c[i] = a[i] / b;

Lowering this with branches produces:

  for (int i = 0; i < n; ++i) {
    if (b == 0)
      c[i] = 0;
    else if (b == -1 && a[i] == INT_MIN)
      c[i] = INT_MIN;
    else
      c[i] = a[i] / b;
  }

b is loop invariant, and we could unswitch this on non-AArch64:

  if (b == 0)
    for (int i = 0; i < n; ++i)
      c[i] = 0;
  else if (b == -1)
    for (int i = 0; i < n; ++i)
      c[i] = (a[i] == INT_MIN) ? INT_MIN : a[i] / b;
  else
    for (int i = 0; i < n; ++i)
      c[i] = a[i] / b;

Note that the middle branch goes away for unsigned division. It seems to
me that the branches are better all around, at least for other ISAs.

As for generality, every language except asm.js (or
JS-with-i32-coercion-hints) has to branch on the "overflow" bit of the
intrinsic anyway. There's also the discussion of iload and istore, where
the consensus was that we'd rather have explicit null checks and branches.
I'm also not excited by the prospect of teaching all the mid-level
optimizers about the arithmetic properties of safe.[us]div, which has been
a big drawback to the other overflow-checked arithmetic intrinsics.

So, in conclusion, I think we should at least *try* to pattern match out
the branches in SDAGBuilder for ARM64 before giving up and adding this
intrinsic, or something like it. Does that sound reasonable?
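[To make the first point concrete, a hedged sketch - hypothetical
function, not from the thread - of the redundancy that early, explicit
control flow exposes. Once the caller's own guard dominates the expanded
branches, the duplicate b == 0 test is trivially dead:]

  #include <limits.h>

  /* Hypothetical example: the caller's own guard makes the expanded
     zero-check branch dead for any mid-level optimizer. */
  int div_or_default(int a, int b) {
      if (b == 0)
          return -1;                  /* user-written guard */

      /* Branchy expansion of safe division. The b == 0 arm below is
         unreachable because of the guard above, so GVN/SimplifyCFG can
         delete it - an opportunity that stays invisible if the division
         is hidden inside an intrinsic until the backend. */
      if (b == 0)
          return 0;
      if (b == -1 && a == INT_MIN)
          return INT_MIN;
      return a / b;
  }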
On April 25, 2014 at 1:44:37 PM, Reid Kleckner (rnk at google.com) wrote:

> Thanks for the writeup! It's very helpful.
>
> On Fri, Apr 25, 2014 at 11:49 AM, Filip Pizlo <fpizlo at apple.com> wrote:
>> On April 25, 2014 at 10:48:18 AM, Reid Kleckner (rnk at google.com)
>> wrote:
>>> On Fri, Apr 25, 2014 at 10:19 AM, Filip Pizlo <fpizlo at apple.com>
>>> wrote:
>>>> The sdiv operation in LLVM IR only makes sense for C and its very
>>>> direct relatives. The amount of control flow necessary to represent a
>>>> safe division for any other language is ghastly. (a/b) becomes
>>>> something like (b != 0 ? ((a != INT_MIN || b != -1) ? a / b :
>>>> INT_MIN) : 0). It's more useful to the optimizer to see this as the
>>>> single operation that it really is.
>>>
>>> This argument makes sense to me. It would be hard for the arm64
>>> backend to pattern match all this control flow back into a single
>>> instruction after optimizations.
>
> I want to soften this statement a bit: while ISel can't look across
> basic blocks, we could probably pattern match away the branches in the
> SelectionDAGBuilder, because it can see the original use-def chain for
> the result of the division. This would take a little work, but I can see
> how it would work, even though I've only touched SelectionDAGBuilder a
> few times.
>
>> So, it seems that for JavaScript and Java on ARM - and probably any
>> environment with emulated division - these intrinsics would result in
>> tighter code without requiring target intrinsics. For all other
>> language/hardware combinations, the main benefit here is to materialize
>> the gnarly control flow later and thereby hopefully reveal more
>> optimization opportunities. That seems like a pretty good sweet spot
>> for the systems that I know about.
>
> If we're going to emit the control flow anyway, then we actually want to
> expose the control flow to the mid-level optimizers as early as
> possible.

Not if you want to move code around. Consider a loop that is too large to
unswitch but that contains a "safe division" expressed as branches, where
that division is loop invariant. What would happen? My guess is that the
optimizer would have a hard time with it.

> For example, on a non-ARM64 ISA, if the user already checked that the
> divisor was zero, we can eliminate the zero-check branch. We could also
> unswitch this loop:
>
>   int b = ...;
>   for (int i = 0; i < n; ++i)
>     c[i] = a[i] / b;
>
> Lowering this with branches produces:
>
>   for (int i = 0; i < n; ++i) {
>     if (b == 0)
>       c[i] = 0;
>     else if (b == -1 && a[i] == INT_MIN)
>       c[i] = INT_MIN;
>     else
>       c[i] = a[i] / b;
>   }
>
> b is loop invariant, and we could unswitch this on non-AArch64:
>
>   if (b == 0)
>     for (int i = 0; i < n; ++i)
>       c[i] = 0;
>   else if (b == -1)
>     for (int i = 0; i < n; ++i)
>       c[i] = (a[i] == INT_MIN) ? INT_MIN : a[i] / b;
>   else
>     for (int i = 0; i < n; ++i)
>       c[i] = a[i] / b;
>
> Note that the middle branch goes away for unsigned division. It seems to
> me that the branches are better all around, at least for other ISAs.

Sure, but that would be definitely suboptimal for JS on ARM64. So if the
claim is that ARM64 will optimize this by pattern matching in the backend,
then this is a clear example of where such pattern matching would be
rather hard: an earlier optimization could inhibit our ability to do this
effectively.

> As for generality, every language except asm.js (or
> JS-with-i32-coercion-hints) has to branch on the "overflow" bit of the
> intrinsic anyway.

Heh, I wouldn't call it asm.js. asm.js adopted something that was already
a de-facto idiom. But yes - most people want to be told if the denominator
was zero, but want INT_MIN/-1 to flow through gracefully and have defined
behavior.
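[A hedged sketch of the code-motion concern raised above - names and loop
body hypothetical. In branchy form, a loop-invariant safe division is
smeared across several basic blocks inside the loop, so LICM sees control
flow rather than one hoistable instruction; it also cannot speculate the
raw sdiv out of the loop, since the divisor might be zero:]

  #include <limits.h>

  /* Loop assumed too large to unswitch, containing a loop-invariant
     safe division. The expansion below repeats every iteration even
     though num and den never change. */
  void scale_all(int *out, const int *in, int n, int num, int den) {
      for (int i = 0; i < n; ++i) {
          int q;
          /* branchy safe-division expansion: three-way control flow
             that LICM cannot hoist as a unit without unswitching */
          if (den == 0)
              q = 0;
          else if (num == INT_MIN && den == -1)
              q = INT_MIN;
          else
              q = num / den;

          out[i] = in[i] * q;   /* ...imagine a much larger body here */
      }
  }

[With a single side-effect-free safediv call computing q, the whole
computation would presumably hoist trivially - which, as I read it, is the
point of the intrinsic.]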
> There's also the discussion of iload and istore, where the consensus was
> that we'd rather have explicit null checks and branches.

Lol. Since you brought that up, I figure I might as well throw this out:
probably some VMs will want a guaranteed trap on divide-by-zero as a free
way of throwing the exception. I actually think that the safediv intrinsic
would ultimately make this easier. The client would generate code like "if
(extractelement (safediv (...), 1) == 1) trap", and that could be pattern
matched to mean "emit code that deterministically traps on divide by
zero". I don't think we have the vocabulary necessary to express this in
IR. But I think that matching this in the backend would be easier than
matching "if (b == 0) then trap else do a division", since the division
could be moved around independently of the branch.

> I'm also not excited by the prospect of teaching all the mid-level
> optimizers about the arithmetic properties of safe.[us]div, which has
> been a big drawback to the other overflow-checked arithmetic intrinsics.
>
> So, in conclusion, I think we should at least *try* to pattern match out
> the branches in SDAGBuilder for ARM64 before giving up and adding this
> intrinsic, or something like it. Does that sound reasonable?

My fear is that since the control flow is complex, the matching would get
really nutty. For example, you could be clever for a/b and say "b + 1
>_unsigned 1 ? a / b : slow path". I actually did that in some VM at some
point. Then there are all of the ways that the control flow could get
messed up by optimization phases. If you ask the frontend that generates
LLVM IR to do this on its own, then you'll end up with a lot of patterns
on your hands. It would take some effort to canonicalize control flow of
this complexity to the point where the pattern matching would be reliable.
I think it's better if we give frontends that need this some way of
expressing it to LLVM explicitly.

Note that this doesn't preclude LLVM from lowering the intrinsic as early
as we deem appropriate. But then the choice to do so is within LLVM rather
than being delegated to the client, who is less likely to know about
specifics like what LLVM can or can't pattern match.

In short, I agree with your observation that these intrinsics are not an
obvious slam dunk compared to emitting the explicit control flow, but I
think that the intrinsics do give enough flexibility on the LLVM side that
it would be great if front-ends used them rather than rolling the control
flow themselves.

-Filip
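[An aside on the "b + 1 >_unsigned 1" trick mentioned above - a minimal C
rendering under assumed names, to show why a backend matcher would face
many equivalent patterns. The wrap-around of the unsigned add folds both
special divisors into a single compare:]

  #include <limits.h>

  /* One unsigned compare covers both special cases:
     b == 0  -> (unsigned)b + 1 == 1, not > 1
     b == -1 -> (unsigned)b + 1 == 0 (wraps), not > 1
     anything else -> > 1, so the fast path is taken. */
  static int sdiv_fastpath_ok(int b) {
      return (unsigned)b + 1u > 1u;
  }

  static int safe_sdiv_clever(int a, int b) {
      if (sdiv_fastpath_ok(b))
          return a / b;           /* b is neither 0 nor -1 */
      /* slow path: b is 0 or -1 */
      if (b == 0)
          return 0;
      return (a == INT_MIN) ? INT_MIN : a / b;   /* b == -1 */
  }

[This is semantically identical to the three-branch expansion, but a
pattern matcher that only recognizes the explicit b == 0 / b == -1 form
would miss it - which is the canonicalization burden being described.]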