On 04/29/2014 10:44 AM, Filip Pizlo wrote:

> TL;DR: Your desire to use trapping on x86 only further convinces me that Michael's proposed intrinsics are the best way to go.

I'm still not convinced, but am not going to actively oppose it either. I'm leery of designing a solution with major assumptions we don't have data to back up.

I worry your assumptions about deoptimization are potentially unsound. But I don't have data to actually show this (yet).

> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com) wrote:

>> As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal. I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment.

>> One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal. I don't really see a clean way. Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely. (And rely on hardware traps w/exceptions instead.) I don't really see a way to represent that type of construct given the current proposal.

> This is a deeper problem, and to solve it you'd need a solution to trapping in general. Let's consider the case of Java. A Java program may want to catch the arithmetic exception due to divide by zero. How would you do this with a trap in LLVM IR? Spill all state that is live at the catch? Use a patchpoint for the entire division instruction?

We'd likely use something similar to a patchpoint. You'd need the "abstract vm state" (which is not necessarily the compiled frame) available at the div instruction. You could then re-enter the interpreter at the specified index (part of the vm state). We have almost all of these mechanisms in place. Ideally, you'd trigger a recompile and otherwise ensure re-entry into compiled code at the soonest possible moment.

This requires a lot of runtime support, but we already have most of it implemented for another compiler. From our perspective, the runtime requirements are not a major blocker.

> In a lot of languages, a divide produces some result even in the exceptional case, and this result requires effectively deoptimizing since the result won't be the one you would have predicted (double instead of int, or BigInt instead of small int), which sort of means that if the CPU exception occurs you have to be able to reconstruct all state. A patchpoint could do this, and so could spilling all state to the stack before the divide - but both are very heavy hammers that are sure to be more expensive than just doing a branch.

This isn't necessarily as expensive as you might believe. I'd recommend reading the Graal project papers on this topic.

Whether deopt or branching is more profitable *in this case*, I can't easily say. I'm not yet to the point of being able to run that experiment. We can argue about what "should" be better all we want, but real performance data is the only way to truly know.

> For these reasons, WebKit has long gone with the approach of emitting branches even in our custom JITs; they are cheaper than the costs incurred by spilling or other craziness. My bet is that the cheapest implementation in LLVM will involve some kind of branching on x86.
You may be right. I simply don't want to design a system which assumes you are until I have data in hand.

>> Another point that is bothering me is how these intrinsics will interact with loop invariant code motion, and other condition check optimizations. We have the potential to hide control flow from optimizations which would otherwise remove it. I'm not stating these *aren't* beneficial. I'm simply stating that I remain somewhat unconvinced. They seem like clear wins on some architectures (i.e. ARM, where the hardware includes the specific checks already), but not on others (i.e. x86 with its exceptions).

> I don't follow this. One of the proposed intrinsic's advantages is precisely to help with things like loop invariant code motion by not emitting confusing control flow early.

Er, not sure about this. Consider:

  for (int i = 0; i < n; i++) {
    a[i] = a[i] / b;
  }

Yes, if b is a constant the conditions immediately fall away. If you can't prove the bounds on b, you'd still like to transform this to:

  if (b == 0) throw;
  for (int i = 0; i < n; i++) {
    if (a[i] == INT_MIN && b == -1) throw;
    a[i] = a[i] / b;
  }

You might even want to split the loop into two versions (i.e. b is -1, otherwise). These transforms wouldn't matter on ARM, but do matter on x86.

Even if you're using a deopt scheme mentioned above, this still could be profitable. The frequencies of the two edge cases aren't necessarily correlated. (i.e. you could be trying to divide by zero a lot)

> As for condition check optimizations, this has too low of profit for me to consider. It's true that if you can prove that one of the operands to a division is a constant, then you probably don't want to emit all of the branches, and surely the proposed intrinsic allows for this. But you're less likely to see code dividing by the same value repeatedly. Or so goes my experience with this, anyway.

I suspect your workloads may differ from mine in this area. I also need to restate: I haven't spent much time looking at this yet. I don't actually know if any of this matters. Looking through the source code for the existing compilers, I suspect it does. But that's not conclusive evidence.

p.s. w.r.t. terminology, I was being somewhat sloppy. By condition check optimization, I meant everything from simplifycfg, cvp, and others.

>> Given the above, I'm going to withdraw my active support from the current proposal. (Note: This does not mean that I'm going to actively oppose it either!)

>> Let me make a counter proposal of my own.

>> 1) Let's not add generic intrinsics at this time. Instead, let's add cpu specific intrinsics with the exact semantics of the associated div instructions.
>>   a) for ARM, this would be the semantics of the current proposed instruction (right?)
>>   b) for x86, this would be an instruction which is defined to fault on the corner cases (the resume behaviour would *not* be specified for the moment)

> I don't like 1.b because the only thing you'd be able to do with the fault in the runtime is:

> - Unwind the entire function if you're careful enough and you don't mind losing all of its state. This would only apply to Java-like languages, and only if they didn't have any finally/catch(all)/catch(arithmetic) statements in the surrounding scope.

> - "interpret" the division to get your own semantics and resume execution, if your language allows for division to return an integer in the corner cases (i.e. JavaScript's (a/b)|0).
> I concede that a trapping intrinsic would allow you to make this one case work, in cases where your profiling told you that the corner cases are rare.

> - Arrange to spill all state to the stack just to be able to throw an exception and/or deoptimize - i.e. you're crazy enough to think that the branch you just saved was more expensive than a bunch of stack traffic.

> I believe that the currently proposed intrinsics give us more of what we want on x86, than does your proposal, since they are more general, and since the branches will be cheaper than spilling everything to the stack.

I was trying to avoid getting into all of the details around deopt and resuming from traps. :) I agree this is a complex area, and frankly, it's not one I've explored enough in LLVM to have strong opinions on yet.

> To be clear, trap-based optimizations are something that we may eventually want to support, but they require more magic unrelated to the specific issues of division. I suspect that the nicest way of expressing trap-based optimizations in IR is to:

> - Have a branch in IR that is weighted in such a way as to convey to LLVM that we believe that one side is never taken and that therefore implementing that side in terms of traps is OK.

> - The thing being branched on is a predicate that is known to LLVM to be something that can be trapped on.

I agree with the above. I think. (I reserve the right to change my mind later...)

> Notice that the proposed intrinsics are *the best* way of achieving this, once we have a way of optimizing such branches. The LLVM backend will know that the i1's returned from the safe.div intrinsic are "trappable" on x86, and the frontend can arrange to weight the branches on those i1's appropriately if it wants a trap-based implementation. Better yet, the frontend could simply feed whatever profiling it wants ("I saw the corner case X times and the main case Y times") and LLVM can decide what to do on its own. The fact that the mechanism by which execution is diverted to the "trapping" basic block is by way of a trap could even be contained entirely within LLVM's innards.

I agree this sounds like a potentially workable design.

OT: Do we want to be implementing the trapping logic inside LLVM? I would guess not.

I actually agree with you on the general approach. However, I'm not sure we're ready to get into this yet. That's why I was proposing a simpler starting point.

My concern is how to balance the IR level CFG optimizations above with the trapping semantics and profiling data. Until someone actually builds something, this is all *really* speculative. I'd like to not lock down a design until we have *data* and *experience in LLVM*.

> This could be used to support not only the unwinding semantics in Java on div/0, but also the (a/b)|0 in JS, and all of the others. But first we need an intrinsic that reveals some predicate that is associated with the trapping condition, rather than having an intrinsic that is defined to simply trap.

> (Notice also how my approach to exposing trapping in LLVM could handle things like null checks - if LLVM sees a branch on a pointer to see if it's null in the close vicinity of a memory access, and the branch is weighted 100% in favor of non-null, then it can "fold" the branch and load together and use a trap. I imagine this being super easy to implement and I'd be happy to say more about how I'd do it.
> So, to me, the issue with division isn't that we don't know how to trap on it - it's that we don't have a crystal-clear way of revealing the trapping conditions in IR.)

Agreed, particularly with the last statement.

>> 1b) Teach the optimizer about the semantics of each - if we do go with a unified safe.div instruction later, this effort could be mostly reused.

> Except for x86, where your proposal above doesn't work.

Er, how so? Same as the reasons we've discussed? Or different?

>> 2) Add optimizations (fairly late) which fold generic-div+control flow into machine specific div instructions.

> This is purely aspirational. There are many ways of doing the control flow for the division since there is quite a bit of it. Are you going to pattern match all of them? What if you miss a case because one of the many IR optimizations transformed it into something that is no longer recognizable? Or are you going to impose rules for canonicalizing any control flow around any division to ensure that a div-by-zero and INT_MIN/-1 check never stops looking like the pattern that you wanted it to look like?

> To me this sounds kind of crazy. Once the frontend has determined that it wants safe division, it should be able to say this explicitly enough that this information is preserved through compilation.

You don't need to recognize all patterns; you don't even want to. See above for LICM and related.

I suspect this is less hard than it initially sounds. Particularly since you really only need to identify cases in hot code. If the control flow was moved outside of the loop, it is almost certainly not hot any more.

Note: This scheme is definitely biased towards x86. It probably is NOT the optimal scheme for ARM. I freely acknowledge that. I'm just not happy with the way the previous proposal seems to prefer ARM over x86.

>> 3) Frontends that want machine specific semantics (i.e. trap or defined edge cases), can use the machine specific intrinsics directly.
>> 4) We provide a set of IR helper functions which implement safe division in what we believe to be the best way. This can be different for each supported platform. This could either be in the form of sample IR, or even as functions on IRBuilder.
>> 5) We spend some time optimizing the IR generated by (4) and use that to inform our discussion about what is truly necessary from a performance standpoint.

>> Once this is in place, we can come back to the discussion of what an appropriate generic safe div intrinsic would look like. In particular, we would have the performance data at hand to lay to rest the concerns raised by myself and Eric. My strong suspicion is that the counter proposal will be "good enough" in practice for most users.

>> p.s. If we do go ahead with the current proposal, I'm going to *strongly* request they be marked experimental. I think we'll need to iterate on these a few times.

>> Philip
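For concreteness, the "IR helper functions" of item 4 - and the "explicit control flow emitted by the frontend" referred to later in the thread - amount to something like the sketch below. This is only an illustration: the function name is made up, and the results chosen for the two corner cases (0 and INT_MIN) are placeholders, since the counter-proposal deliberately leaves those semantics to each frontend.

  define i32 @safe_sdiv_expanded(i32 %a, i32 %b) {
  entry:
    %den.zero = icmp eq i32 %b, 0
    br i1 %den.zero, label %div.zero, label %check.ovf

  check.ovf:
    %num.min = icmp eq i32 %a, -2147483648
    %den.neg1 = icmp eq i32 %b, -1
    %ovf = and i1 %num.min, %den.neg1
    br i1 %ovf, label %overflow, label %do.div

  do.div:                 ; common path: a plain sdiv the optimizer fully understands
    %q = sdiv i32 %a, %b
    ret i32 %q

  div.zero:               ; language-defined divide-by-zero behaviour goes here
    ret i32 0

  overflow:               ; language-defined INT_MIN / -1 behaviour goes here
    ret i32 -2147483648
  }

On x86 these two compares are exactly the branches the rest of the thread is arguing about; on ARM, as noted above, the hardware already includes the corresponding checks.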
Eric Christopher
2014-Apr-29 18:30 UTC
[LLVMdev] Proposal: add intrinsics for safe division
I have nothing to say that Philip hasn't already said so I'll just add a "me too". Thanks!

-eric

On Tue, Apr 29, 2014 at 11:27 AM, Philip Reames <listmail at philipreames.com> wrote:
> [...]
On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at philipreames.com) wrote:

> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>> TL;DR: Your desire to use trapping on x86 only further convinces me that Michael's proposed intrinsics are the best way to go.
> I'm still not convinced, but am not going to actively oppose it either. I'm leery of designing a solution with major assumptions we don't have data to back up.
> I worry your assumptions about deoptimization are potentially unsound. But I don't have data to actually show this (yet).

I *think* I may have been unclear about my assumptions; in particular, my claims with respect to deoptimization are probably more subtle than they appeared. WebKit can use LLVM and it has divisions and we do all possible deoptimization/profiling/etc tricks for it, so this is grounded in experience. Forgive me if the rest of this e-mail contains a lecture on things that are obvious - I'll try to err on the side of clarity and completeness since this discussion is sufficiently dense that we run the risk of talking cross-purposes unless some baseline assumptions are established.

>> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com) wrote:
>>> As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal. I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment.
>>> One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal. I don't really see a clean way. Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely. (And rely on hardware traps w/exceptions instead.) I don't really see a way to represent that type of construct given the current proposal.

>> This is a deeper problem and to solve it you'd need a solution to trapping in general. Let's consider the case of Java. A Java program may want to catch the arithmetic exception due to divide by zero. How would you do this with a trap in LLVM IR? Spill all state that is live at the catch? Use a patchpoint for the entire division instruction?

> We'd likely use something similar to a patchpoint. You'd need the "abstract vm state" (which is not necessarily the compiled frame) available at the div instruction. You could then re-enter the interpreter at the specified index (part of the vm state). We have almost all of these mechanisms in place. Ideally, you'd trigger a recompile and otherwise ensure re-entry into compiled code at the soonest possible moment.
> This requires a lot of runtime support, but we already have most of it implemented for another compiler. From our perspective, the runtime requirements are not a major blocker.

Right, you'll use a patchpoint. That's way more expensive than using a safe division intrinsic with branches, because it's opaque to the optimizer.

>> In a lot of languages, a divide produces some result even in the exceptional case and this result requires effectively deoptimizing since the result won't be the one you would have predicted (double instead of int, or BigInt instead of small int), which sort of means that if the CPU exception occurs you have to be able to reconstruct all state.
>> A patchpoint could do this, and so could spilling all state to the stack before the divide - but both are very heavy hammers that are sure to be more expensive than just doing a branch.

> This isn't necessarily as expensive as you might believe. I'd recommend reading the Graal project papers on this topic.
> Whether deopt or branching is more profitable *in this case*, I can't easily say. I'm not yet to the point of being able to run that experiment. We can argue about what "should" be better all we want, but real performance data is the only way to truly know.

My point may have been confusing. I know that deoptimization is cheap and WebKit uses it everywhere, including division corner cases, if profiling tells us that it's profitable to do so (which it does, in the common case). WebKit is a heavy user of deoptimization in general, so you don't need to convince me that it's worth it.

Note that I want *both* deopt *and* branching, because in this case, a branch is the fastest overall way of detecting when to deopt. In the future, I will want to implement the deopt in terms of branching, and when we do this, I believe that the most sound and performant approach would be using Michael's intrinsics. This is subtle and I'll try to explain why it's the case.

The point is that you wouldn't want to do deoptimization by spilling state on the main path or by using a patchpoint for the main path of the division.

You don't want the common path of executing the division to involve a patchpoint instruction, although using a patchpoint or stackmap to implement deoptimization on the failing path is great:

*Good:* if (division would fail) { call @patchpoint(all of my state) } else { result = a / b }

*Bad:* call @patchpoint(all of my state) // patch with a divide instruction - bad because the optimizer has no clue what you're doing and assumes the very worst

*Worse:* spill all state to the stack; call @trapping.div(a, b) // the spills will hurt you far more than a branch, so this should be avoided

I suppose we could imagine a fourth option that involves a patchpoint to pick up the state and a trapping divide intrinsic. But a trapping divide intrinsic alone is not enough. Consider this:

  result = call @trapping.div(a, b); call @stackmap(all of my state)

As soon as these are separate instructions, you have no guarantees that the state that the stackmap reports is sound for the point at which the div would trap. So, the division itself shouldn't be a trapping instruction and instead you want to detect the bad case with a branch.

To be clear:

- Whether you use deoptimization for division or anything else - like WebKit has done since before any of the Graal papers were written - is mostly unrelated to how you represent the division, unless you wanted to add a new intrinsic that is like a trapping-division-with-stackmap:

  result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)

Now, maybe you do want such an intrinsic, in which case you should propose it! The reason why I haven't proposed it is that I think that long-term, the currently proposed intrinsics are a better path to getting the trapping optimizations. See my previous mail, where I show how we could tell LLVM what the failing path is (which may have deoptimization code that uses a stackmap or whatever), what the trapping predicate is (it comes from the safe.div intrinsic), and the fact that trapping is wise (branch weights).
- If you want to do the deoptimization with a trap, then your only choice currently is to use a patchpoint for the main path of the division. This will be slower than using a branch to an OSR exit basic block, because you're making the division itself opaque to the optimizer (bad) just to get rid of a branch (which was probably cheap to begin with).

Hence, what you want to do - one way or another, regardless of whether this proposed intrinsic is added - is to branch on the corner case condition, and have the slow case of the branch go to a basic block that deoptimizes. Unless of course you have profiling that says that the case does happen often, in which case you can have that basic block handle the corner case inline without leaving optimized code (FWIW, we do have such paths in WebKit and they are useful).

So the question for me is whether the branching involves explicit control flow or is hidden inside an intrinsic. I prefer for it to be within an intrinsic because it:

- allows the optimizer to do more interesting things in the common cases, like hoisting the entire division.

- will give us a clearer path for implementing trapping optimizations in the future.

- is an immediate win on ARM.

I'd be curious to hear what specific idea you have about how to implement trap-based deoptimization with your trapping division intrinsic for x86 - maybe it's different from the two "bad" idioms I showed above.

Finally, as for performance data, which part of this do you want performance data for? I concede that I don't have performance data for using Michael's new intrinsic. Part of what the intrinsic accomplishes is it gives a less ugly way of doing something that is already possible with target intrinsics on ARM. I think it would be great if you could get those semantics - along with a known-good implementation - on other architectures as well.

But this discussion has also involved suggestions that we should use trapping to implement deoptimization, and the main point of my message is to strongly argue against anything like this given the current state of trapping semantics and how patchpoints work. My point is that using traps for division corner cases would either be unsound (see the stackmap after the trap, above), or would require you to do things that are obviously inefficient. If you truly believe that the branch to detect division slow paths is more expensive than spilling all bytecode state to the stack or using a patchpoint for the division, then I could probably hack something up in WebKit to show you the performance implications. (Or you could do it yourself, the code is open source...)

-Filip
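To make the "Good" shape above concrete, here is a sketch of what the branch-plus-deopt pattern could look like once a safe division intrinsic exists. The signature of the intrinsic is assumed for illustration (a quotient plus an i1 corner-case flag; no such intrinsic exists in LLVM as of this thread), and the stackmap/patchpoint ID, the extra state operands, and the branch weights are likewise made up. It is written in present-day IR syntax, not as anyone's actual patch.

  declare { i32, i1 } @llvm.safe.sdiv.i32(i32, i32)               ; assumed shape of the proposal
  declare void @llvm.experimental.patchpoint.void(i64, i32, ptr, i32, ...)

  define i32 @checked_div(i32 %a, i32 %b, i32 %bc_index, i32 %local0) {
  entry:
    %pair = call { i32, i1 } @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %quot = extractvalue { i32, i1 } %pair, 0
    %corner = extractvalue { i32, i1 } %pair, 1
    ; weight the corner case as essentially never taken
    br i1 %corner, label %deopt, label %fast, !prof !0

  fast:                                  ; common path: just the quotient
    ret i32 %quot

  deopt:                                 ; failing path: a patchpoint picks up the abstract VM state
    call void (i64, i32, ptr, i32, ...) @llvm.experimental.patchpoint.void(i64 1, i32 8, ptr null, i32 0, i32 %bc_index, i32 %local0)
    unreachable                          ; the runtime patches in its OSR exit / interpreter re-entry here
  }

  !0 = !{!"branch_weights", i32 1, i32 65536}

The property being argued for above is visible in the sketch: the division itself stays visible to the optimizer (it can be hoisted or folded), while only the rarely taken branch needs to know anything about deopt state.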
Filip,

Thanks for spelling this out. I'm out of time to respond right now, but I'll try to get back to you later today or tomorrow morning. After a quick read through, I don't think you've changed my opinion, but I need to read through what you wrote more carefully before responding.

Philip

On 04/29/2014 12:39 PM, Filip Pizlo wrote:
> [...]
On 04/29/2014 12:39 PM, Filip Pizlo wrote:

> On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at philipreames.com) wrote:
>> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>>> TL;DR: Your desire to use trapping on x86 only further convinces me that Michael's proposed intrinsics are the best way to go.
>> I'm still not convinced, but am not going to actively oppose it either. I'm leery of designing a solution with major assumptions we don't have data to back up.
>> I worry your assumptions about deoptimization are potentially unsound. But I don't have data to actually show this (yet).

> I *think* I may have been unclear about my assumptions; in particular, my claims with respect to deoptimization are probably more subtle than they appeared. WebKit can use LLVM and it has divisions and we do all possible deoptimization/profiling/etc tricks for it, so this is grounded in experience. Forgive me if the rest of this e-mail contains a lecture on things that are obvious - I'll try to err on the side of clarity and completeness since this discussion is sufficiently dense that we run the risk of talking cross-purposes unless some baseline assumptions are established.

I think we're using the same terminology, but with slightly different sets of assumptions. I'll point this out below where relevant. Also, thanks for taking the time to expand. It helps clarify the discussion quite a bit.

>>> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com) wrote:
>>>> As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal. I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment.
>>>> One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal. I don't really see a clean way. Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely. (And rely on hardware traps w/exceptions instead.) I don't really see a way to represent that type of construct given the current proposal.

>>> This is a deeper problem and to solve it you'd need a solution to trapping in general. Let's consider the case of Java. A Java program may want to catch the arithmetic exception due to divide by zero. How would you do this with a trap in LLVM IR? Spill all state that is live at the catch? Use a patchpoint for the entire division instruction?

>> We'd likely use something similar to a patchpoint. You'd need the "abstract vm state" (which is not necessarily the compiled frame) available at the div instruction. You could then re-enter the interpreter at the specified index (part of the vm state). We have almost all of these mechanisms in place. Ideally, you'd trigger a recompile and otherwise ensure re-entry into compiled code at the soonest possible moment.
>> This requires a lot of runtime support, but we already have most of it implemented for another compiler. From our perspective, the runtime requirements are not a major blocker.

> Right, you'll use a patchpoint.
> That's way more expensive than using a safe division intrinsic with branches, because it's opaque to the optimizer.

This statement is true at the moment, but it shouldn't be. I think this is our fundamental difference in approach. You should be able to write something like:

  i32 %res = invoke patchpoint(... x86_trapping_divide, a, b) normal_dest invoke_dest

  normal_dest:
    ;; use %res

  invoke_dest:
    landingpad
    ;; dispatch edge cases
    ;; this could be unreachable code if you deopt this frame in the trap
    ;; handler and jump directly to an interpreter or other bit of code

A patchpoint should not require any excess spilling. If values are live in registers, that should be reflected in the stack map. (I do not know if this is the case for patchpoint at the moment or not.)

The Value called by a patchpoint should participate in optimization normally. We really want the patchpoint part of the call to be supplemental. It should still be a call. It should be constant propagated, transformed, etc. This is not the case currently. I've got a couple of off the wall ideas for improving the current status, but I'll agree this is a hardish problem.

It should be legal to use a patchpoint in an invoke. It's an ABI issue of how the invoke path gets invoked. (i.e. side tables for the runtime to look up, etc.) This is not possible today, and probably requires a fair amount of work. Some of it, I've already done and will be sharing shortly. Other parts, I haven't even thought about.

If you didn't want to use the trapping semantics, you'd insert dedicated control flow _before_ the divide. This would allow normal optimization of the control flow.

Notes:
1) This might require a new PATCHPOINT pseudo op in the backend. Haven't thought much about that yet.
2) I *think* your current intrinsic could be translated into something like this. (Leaving aside the question of where the deopt state comes from.) In fact, the more I look at this, the less difference I actually see between the approaches.

>>> In a lot of languages, a divide produces some result even in the exceptional case and this result requires effectively deoptimizing since the result won't be the one you would have predicted (double instead of int, or BigInt instead of small int), which sort of means that if the CPU exception occurs you have to be able to reconstruct all state. A patchpoint could do this, and so could spilling all state to the stack before the divide - but both are very heavy hammers that are sure to be more expensive than just doing a branch.

>> This isn't necessarily as expensive as you might believe. I'd recommend reading the Graal project papers on this topic.
>> Whether deopt or branching is more profitable *in this case*, I can't easily say. I'm not yet to the point of being able to run that experiment. We can argue about what "should" be better all we want, but real performance data is the only way to truly know.

> My point may have been confusing. I know that deoptimization is cheap and WebKit uses it everywhere, including division corner cases, if profiling tells us that it's profitable to do so (which it does, in the common case). WebKit is a heavy user of deoptimization in general, so you don't need to convince me that it's worth it.

Acknowledged.

> Note that I want *both* deopt *and* branching, because in this case, a branch is the fastest overall way of detecting when to deopt.
> In the future, I will want to implement the deopt in terms of branching, and when we do this, I believe that the most sound and performant approach would be using Michael's intrinsics. This is subtle and I'll try to explain why it's the case.

> The point is that you wouldn't want to do deoptimization by spilling state on the main path or by using a patchpoint for the main path of the division.

This is the main point I disagree with. I don't believe that having a patchpoint on the main path should be any more expensive than the original call. (see above)

Worth noting explicitly: I'm assuming that all of your deopt state would already be available for other purposes in nearby code. It's on the stack or in registers. I'm assuming that by adding the deopt point, you are not radically changing the set of computations which need to be done. If that's not the case, you should avoid deopt and instead just inline the slow paths with explicit checks.

I'll note that given your assumptions about the cost of a patchpoint, the rest of your position makes a lot more sense. :) As I spelled out above, I believe this cost is not fundamental.

> You don't want the common path of executing the division to involve a patchpoint instruction, although using a patchpoint or stackmap to implement deoptimization on the failing path is great:

> *Good:* if (division would fail) { call @patchpoint(all of my state) } else { result = a / b }

Given your cost assumptions, I'd agree.

> *Bad:* call @patchpoint(all of my state) // patch with a divide instruction - bad because the optimizer has no clue what you're doing and assumes the very worst

Yuck. Agreed.

> *Worse:* spill all state to the stack; call @trapping.div(a, b) // the spills will hurt you far more than a branch, so this should be avoided

I'm assuming this is an explicit spill rather than simply recording a stack map *at the div*. If so, agreed.

> I suppose we could imagine a fourth option that involves a patchpoint to pick up the state and a trapping divide intrinsic. But a trapping divide intrinsic alone is not enough. Consider this:

>   result = call @trapping.div(a, b); call @stackmap(all of my state)

> As soon as these are separate instructions, you have no guarantees that the state that the stackmap reports is sound for the point at which the div would trap.

This is the closest to what I'd propose, except that the two calls would be merged into a single patchpoint. Isn't the entire point of a patchpoint to record the stack map for a call? (Well, ignoring the actual patching part..) Why not write this as:

  patchpoint(..., trapping.div, a, b);

Is there something I'm missing here? Just to note: I fully agree that the two call proposal is unsound and should be avoided.

> So, the division itself shouldn't be a trapping instruction and instead you want to detect the bad case with a branch.

> To be clear:

> - Whether you use deoptimization for division or anything else - like WebKit has done since before any of the Graal papers were written - is mostly unrelated to how you represent the division, unless you wanted to add a new intrinsic that is like a trapping-division-with-stackmap:

>   result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)

> Now, maybe you do want such an intrinsic, in which case you should propose it!

Given what we already have with patchpoints, I don't think a merged intrinsic is necessary. (See above).
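A sketch of the single-patchpoint shape being described here, under the assumption that x86_trapping_divide is the machine-specific trapping divide from the counter-proposal (it does not exist today), using the existing patchpoint intrinsic in present-day syntax; the IDs, byte counts, and state operands are made up. The live deopt state rides along as the trailing patchpoint operands and so ends up in the stack map, which is what lets a trap inside the divide be mapped back to abstract VM state. The invoke form that would route the trap to an exceptional successor is, as noted above, not yet expressible.

  declare i64 @llvm.experimental.patchpoint.i64(i64, i32, ptr, i32, ...)
  declare i32 @x86_trapping_divide(i32, i32)     ; hypothetical machine-specific trapping divide

  define i32 @div_via_patchpoint(i32 %a, i32 %b, i32 %bc_index, i32 %local0) {
  entry:
    ; id 2, 8 shadow bytes, callee, 2 call args (%a, %b); the remaining
    ; operands are live deopt state and are recorded in the stack map.
    %wide = call i64 (i64, i32, ptr, i32, ...) @llvm.experimental.patchpoint.i64(i64 2, i32 8, ptr @x86_trapping_divide, i32 2, i32 %a, i32 %b, i32 %bc_index, i32 %local0)
    %res = trunc i64 %wide to i32
    ret i32 %res
  }

This is also the form the "opaque to the optimizer" objection applies to: until the backend, the divide is hidden behind a call.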
I believe we have all the parts to build this solution, and that - in theory - they should compose neatly.

p.s. The bits I was referring to were not deopt per se; it was specifically which set of deopt state you use at each deopt point. That's a bit of a tangent for the rest of the discussion now though.

> The reason why I haven't proposed it is that I think that long-term,
> the currently proposed intrinsics are a better path to getting the
> trapping optimizations. See my previous mail, where I show how we
> could tell LLVM what the failing path is (which may have
> deoptimization code that uses a stackmap or whatever), what the
> trapping predicate is (it comes from the safe.div intrinsic), and the
> fact that trapping is wise (branch weights).
>
> - If you want to do the deoptimization with a trap, then your only
> choice currently is to use a patchpoint for the main path of the
> division. This will be slower than using a branch to an OSR exit
> basic block, because you're making the division itself opaque to the
> optimizer (bad) just to get rid of a branch (which was probably cheap
> to begin with).
>
> Hence, what you want to do - one way or another, regardless of whether
> this proposed intrinsic is added - is to branch on the corner case
> condition, and have the slow case of the branch go to a basic block
> that deoptimizes. Unless of course you have profiling that says that
> the case does happen often, in which case you can have that basic
> block handle the corner case inline without leaving optimized code
> (FWIW, we do have such paths in WebKit and they are useful).
>
> So the question for me is whether the branching involves explicit
> control flow or is hidden inside an intrinsic. I prefer for it to be
> within an intrinsic because it:
>
> - allows the optimizer to do more interesting things in the common
> cases, like hoisting the entire division.
>
> - will give us a clearer path for implementing trapping optimizations
> in the future.
>
> - is an immediate win on ARM.
>
> I'd be curious to hear what specific idea you have about how to
> implement trap-based deoptimization with your trapping division
> intrinsic for x86 - maybe it's different from the two "bad" idioms I
> showed above.

I hope my explanation above helps. If not, ask, and I'll try to explain more clearly.

One point just for clarity; I don't believe it affects the conclusions of our discussion so far, and I'm fairly sure that you (Filip) are aware of it, but I want to spell it out for other readers. You seem to be assuming that compiled code needs to explicitly branch to a point where the deopt state is known in order to exit a compiled frame. Worth noting is that you can also exit a compiled frame on a trap (without an explicit condition/branch!) if the deopt state is known at the point you take the trap. This "exit frame on trap" behavior shows up with null pointer exceptions as well. I'll note that both compilers in OpenJDK support some combination of "exit-on-trap" conditions for division and null dereferences; the two differ on exactly which strategies they use in each case though. :)

I'm not really arguing that either scheme is "better" in all cases. I'm simply arguing that we should support both and allow optimization and tuning between them. (For concreteness, a sketch of the explicit-branch flavor follows below.) As far as I can tell, you seem to be assuming that an explicit branch to a known exit point is always superior.
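Here is a minimal sketch of the explicit-branch scheme. The stackmap id is made up and %osr_exit_stub stands in for some assumed runtime OSR-exit entry point; neither comes from the proposal. The trap-based flavor would simply drop the branch and have the runtime recover the same state keyed by the faulting pc instead:

  ;; explicit corner-case checks guarding the fast divide
  %den.zero = icmp eq i64 %b, 0
  %num.min  = icmp eq i64 %a, -9223372036854775808
  %den.m1   = icmp eq i64 %b, -1
  %overflow = and i1 %num.min, %den.m1
  %bad      = or i1 %den.zero, %overflow
  br i1 %bad, label %deopt, label %fast

  fast:
    %res = sdiv i64 %a, %b
    ;; ... optimized code continues using %res ...

  deopt:
    ;; record the abstract vm state and hand control to the runtime,
    ;; which re-enters the interpreter at %bc_index
    call void (i64, i32, i8*, i32, ...)*
         @llvm.experimental.patchpoint.void(
           i64 7, i32 15, i8* %osr_exit_stub, i32 0,
           i64 %bc_index, i64 %local0, i64 %local1)
    unreachable
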
Ok, back to the topic at hand...

With regards to the current proposal, I'm going to take a step back. You guys seem to have already looked into this in a fair amount of depth. I'm not necessarily convinced you've come to the best solution, but at some point we need to make forward progress, and what you have is clearly better than nothing. Please go ahead and submit your current approach. We can come back and revise later if we really need to.

I do request the following changes:
- Mark it clearly as experimental.
- Either don't specify the value computed in the edge cases, or allow those values to be specified as constant arguments to the call. This allows efficient lowering to x86's div instruction if you want to make use of the trapping semantics.

> Finally, as for performance data, which part of this do you want
> performance data for? I concede that I don't have performance data
> for using Michael's new intrinsic. Part of what the intrinsic
> accomplishes is that it gives a less ugly way of doing something that
> is already possible with target intrinsics on ARM. I think it would
> be great if you could get those semantics - along with a known-good
> implementation - on other architectures as well.

I would be very interested in seeing data comparing two schemes:
- Explicit control flow emitted by the frontend
- The safe.div intrinsic emitted by the frontend, desugared in CodeGenPrep

My strong suspicion is that each would perform well in some cases and not in others, at least on x86. Since the edge-case checks are essentially free on ARM, the second scheme would probably be strictly superior there. I am NOT asking that we block submission on this data, however.

> But this discussion has also involved suggestions that we should use
> trapping to implement deoptimization, and the main point of my message
> is to strongly argue against anything like this given the current
> state of trapping semantics and how patchpoints work. My point is
> that using traps for division corner cases would either be unsound
> (see the stackmap after the trap, above), or would require you to do
> things that are obviously inefficient. If you truly believe that the
> branch to detect division slow paths is more expensive than spilling
> all bytecode state to the stack or using a patchpoint for the
> division, then I could probably hack something up in WebKit to show
> you the performance implications. (Or you could do it yourself, the
> code is open source...)

In a couple of months, I'll probably have the performance data to discuss this for real. When that happens, let's pick this up and continue the debate. Alternatively, if you want to chat this over more with a beer in hand at the social next week, let me know. In the meantime, let's not stall the current proposal any more.

Philip