thr3ads.net - llvm dev - [LLVMdev] Proposal: add intrinsics for safe division [May 2014]

If this information is useful, please help other people find it:
Share via:

Philip Reames

2014-May-01 05:34 UTC

[LLVMdev] Proposal: add intrinsics for safe division

Andy - If you're not already following this closely, please start.  
We've gotten into fairly fundamental questions of what a patchpoint does.

Filip,

I think you've hit the nail on the head.  What I'm thinking of as being 
patchpoints are not what you think they are.  Part of that is that I've 
got a local change which adds a very similar construction (called 
"statepoints" for the moment), but I was trying to keep that separate.
That also includes a lot of GC semantics which are not under discussion 
currently.  My apologies if that experience bled over into this 
conversation and made things more confusing.

I will note that the documentation for patchpoint say explicitly the 
following:
"The 'llvm.experimental.patchpoint.*' intrinsics creates a function
call
to the specified <target> and records the location of specified values 
in the stack map."

My reading has always been that a patchpoint *that isn't patched* is 
simply a call with a stackmap associated with it.  To my reading, this 
can (and did, and does) indicate my proposed usage would be legal.

I will agree that I've confused the topic badly on the optimization 
front.  My "statepoint" isn't patchable, so a lot more
optimizations are
legal.  Sorry about that.  To restate what I think you've been saying 
all along, the optimizer can't make assumptions about what function is 
called by a patchpoint because that might change based on later 
patching.  Is this the key point you've been trying to make?

I'm not objecting to separating "my patchpoint" from "your
patchpoint".
Let's just hammer out the semantics of each first. :)

Again, longer response to follow in a day or so. :)

Philip

On 04/30/2014 10:09 PM, Filip Pizlo wrote:>
>
> On April 30, 2014 at 9:06:20 PM, Philip Reames 
> (listmail at philipreames.com <mailto:listmail at philipreames.com>)
wrote:
>
>> On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>>> On April 29, 2014 at 11:27:06 AM, Philip Reames 
>>> (listmail at philipreames.com <mailto:listmail at
philipreames.com>) wrote:
>>>> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>>>>> LD;DR: Your desire to use trapping on x86 only further
convinces
>>>>> me that Michael's proposed intrinsics are the best way
to go.
>>>> I'm still not convinced, but am not going to actively
oppose it
>>>> either.  I'm leery of designing a solution with major
assumptions
>>>> we don't have data to backup.
>>>>
>>>> I worry your assumptions about deoptimization are potentially 
>>>> unsound. But I don't have data to actually show this (yet).
>>>
>>> I *think* I may have been unclear about my assumptions; in 
>>> particular, my claims with respect to deoptimization are probably 
>>> more subtle than they appeared.  WebKit can use LLVM and it has 
>>> divisions and we do all possible deoptimization/profiling/etc
tricks
>>> for it, so this is grounded in experience.  Forgive me if the rest 
>>> of this e-mail contains a lecture on things that are obvious -
I'll
>>> try to err on the side of clarity and completeness since this 
>>> discussion is sufficiently dense that we run the risk of talking 
>>> cross-purposes unless some baseline assumptions are established.
>>>
>> I think we're using the same terminology, but with slightly
different
>> sets of assumptions.  I'll point this out below where relevant.
>>
>> Also, thanks for taking the time to expand.  It help clarify the 
>> discussion quite a bit.
>
> I think we may be converging to an understanding of what you want 
> versus what I want, and I think that there are some points - possibly 
> unrelated to division - that are worth clarifying.  I think that the 
> main difference is that when I say "patchpoint", I am referring
to a
> concrete intrinsic with specific semantics that cannot change without 
> breaking WebKit, while you are using the term to refer to a broad 
> concept, or rather, a class of as-yet-unimplemented intrinsics that 
> share some of the same features with patchpoints but otherwise have 
> incompatible semantics.
>
> Also, when I say that you wouldn't want to use the existing patchpoint 
> to do your trapping deopt, what I mean is that the performance of 
> doing this would suck for reasons that are not related to 
> deoptimization or trapping.  I'm not claiming that deoptimization 
> performs poorly (trust me, I know better) or that trapping to 
> deoptimize is bad (I've done this many, many times and I know better). 
>  I'm saying that with the current patchpoint intrinsics in LLVM, as 
> they are currently specified and implemented, doing it would be a bad 
> idea because you'd have to compromise a bunch of other optimizations 
> to achieve it.
>
> You have essentially described new intrinsics that would make this 
> less of a bad idea and I am interested in your intrinsics, so I'll try 
> to both respond with why patchpoints don't currently give you what you 
> want (and why simply changing patchpoint semantics would be evil) and 
> I'll also try to comment on what I think of the intrinsic that
you're
> effectively proposing.  Long story short, I think you should formally 
> propose your intrinsic even if it's not completely fleshed out.  I 
> think that it's an interesting capability and in its most basic form, 
> it is a simple variation of the current patchpoint/stackmap intrinsics.
>
>>>>
>>>>
>>>>
>>>>>
>>>>> On April 29, 2014 at 10:09:49 AM, Philip Reames 
>>>>> (listmail at philipreames.com <mailto:listmail at
philipreames.com>) wrote:
>>>>>
>>>>>> As the discussion has progressed and I've spent
more time
>>>>>> thinking about the topic, I find myself less and less
enthused
>>>>>> about the current proposal.  I'm in full support of
having
>>>>>> idiomatic ways to express safe division, but I'm
starting to
>>>>>> doubt that using an intrinsic is the right way at the
moment.
>>>>>>
>>>>>> One case I find myself thinking about is how one would
combine
>>>>>> profiling information and implicit div-by-zero/overflow
checks
>>>>>> with this proposal.  I don't really see a clean
way. Ideally, for
>>>>>> a "safe div" which never has the exceptional
paths taken, you'd
>>>>>> like to completely do away with the control flow
entirely.  (And
>>>>>> rely on hardware traps w/exceptions instead.)  I
don't really see
>>>>>> a way to represent that type of construct given the
current
>>>>>> proposal.
>>>>>
>>>>> This is a deeper problem and to solve it you'd need a
solution to
>>>>> trapping in general.  Let's consider the case of Java. 
A Java
>>>>> program may want to catch the arithmetic exception due to
divide
>>>>> by zero.  How would you do this with a trap in LLVM IR? 
Spill all
>>>>> state that is live at the catch?  Use a patchpoint for the
entire
>>>>> division instruction?
>>>>>
>>>> We'd likely use something similar to a patchpoint. 
You'd need the
>>>> "abstract vm state" (which is not the compiled frame
necessarily)
>>>> available at the div instruction.  You could then re-enter the 
>>>> interpreter at the specified index (part of the vm state).  We
have
>>>> all most of these mechanisms in place.  Ideally, you'd
trigger a
>>>> recompile and otherwise ensure re-entry into compiled code at
the
>>>> soonest possible moment.
>>>>
>>>> This requires a lot of runtime support, but we already have
most of
>>>> it implemented for another compiler.  From our perspective, the
>>>> runtime requirements are not a major blocker.
>>>
>>> Right, you'll use a patchpoint.  That's way more expensive
than
>>> using a safe division intrinsic with branches, because it's
opaque
>>> to the optimizer.
>>>
>> This statement is true at the moment, but it shouldn't be.  I think
>> this is our fundamental difference in approach.
>>
>> You should be able to write something like:
>> i32 %res = invoke patchpoint (... x86_trapping_divide, a, b) 
>> normal_dest invoke_dest
>>
>> normal_dest:
>>   ;; use %res
>> invoke_dest:
>>   landingpad
>>   ;; dispatch edge cases
>>   ;; this could be unreachable code if you deopt this frame in the 
>> trap handler and jump directly to an interpreter or other bit of code
>
> I see.  It sounds like you want a generalization of the 
> "div.with.stackmap" that I thought you wanted - you want to be
able to
> wrap anything in a stackmap.
>
> The current patchpoint intrinsic does not do this, and you run the 
> risk of breaking existing semantics if you changed this.  You'd 
> probably break WebKit, which treats the call target of the patchpoint 
> as nothing more than a quirk - we always pass null.  Also, the current 
> patchpoint treats the callee as an i8* if I remember right and it 
> would be super weird if all LLVM phases had to interpret this i8* by 
> unwrapping a possible bitcast to get to a declared function that may 
> be an intrinsic.  Yuck!  Basically, the call target of existing 
> patchpoints is meant to be a kind of convenience feature rather than 
> the core of the mechanism.
>
> I agree in principle that the intrinsic that you want would be a 
> useful intrinsic.  But let's not call it a patchpoint for the purposes 
> of this discussion, and let's not confuse the discussion by claiming 
> (incorrectly) that the existing patchpoint facility gives you what you 
> want.  It doesn't: patchpoints are designed to make the call target 
> opaque (hence the use of i8*) and there shouldn't be a correlation 
> between what the patchpoint does at run-time and what the called 
> function would have done.  You could make the call target be null 
> (like WebKit does) and the patchpoint should still mean "this code can
> do anything" because the expectation is that the client JIT will patch
> over it anyway.
>
> Also, "patchpoint" would probably not be the right term for the 
> intrinsic that you want.  I think that what you want is 
> "call.with.stackmap".  Or maybe "stackmap.wrapper".  Or
just
> "stackmap" - I'd be OK, in principle, with changing the name
of the
> current "stackmap" intrinsic to something that reflects the fact
that
> it's less of a stackmap than what you want.
>
> To summarize.  A patchpoint's main purpose is that you can patch it 
> with arbitrary code.  The current "stackmap" means that you can
patch
> it with arbitrary code and that patching may be destructive to a 
> shadow of machine code bytes, so it's really just like patchpoints - 
> we could change its name to "patchpoint.shadow" for example.
>
> If you were to propose such a stackmap intrinsic, then I think there 
> could be some ways of doing it that wouldn't be too terrible. 
>  Basically you want something that is like a patchpoint in that it 
> reports a stackmap via a side channel, but unlike patchpoints, it 
> doesn't allow arbitrary patching - instead the optimizer should be 
> allowed to assume that the code within the patchpoint will always do 
> the same thing that the call target would have done.  There are 
> downsides to truly doing this.  For example, to make division 
> efficient with such an intrinsic, you'd have to do something that is 
> somewhat worse than just recognizing intrinsics in the optimizer - 
> you'd have to first recognize a call to your "stackmap
wrapper"
> intrinsic and then observe that its call target argument is in turn 
> another intrinsic.  To me personally, that's kind of yucky, but I 
> won't deny that it could be useful.
>
> As to the use of invoke: I don't believe that the use of invoke versus 
> my suggested "branch on a trap predicate" idea are different in
any
> truly meaningful way.  I buy that either would work.
>
>>
>>
>> A patchpoint should not require any excess spilling. If values are 
>> live in registers, that should be reflected in the stack map.  (I do 
>> not know if this is the case for patchpoint at the moment or not.)
>
> Patchpoints do not require spilling.
>
> My point was that with existing patchpoints, you *either* use a 
> patchpoint for the entire division which makes the division opaque to 
> the optimizer - because a patchpoint means "this code can do
anything"
> - *or* you could spill stuff to the stack prior to your trapping 
> division intrinsic, since spilling is something that you could do as a 
> workaround if you didn't have a patchpoint.
>
> The reason why I brought up spilling at all is that I suspect that 
> spilling all state to the stack might be cheaper - for some systems - 
> than turning the division into a patchpoint.  Turning the division 
> into a patchpoint is horrendously brutal - the patchpoint looks like 
> it clobbers the heap (which a division doesn't do), has to execute (a 
> division is an obvious DCE candidate), cannot be hoisted (hoisting 
> divisions is awesome), etc.  Perhaps most importantly, though, a 
> patchpoint doesn't tell LLVM that you're *doing a division* - so
all
> constant folding, strenght reduction, and algebraic reasoning flies 
> out the window.  On the other hand, spilling all state to the stack is 
> an arguably sound and performant solution to a lot of VM problems. 
>  I've seen JVM implementations that ensure that there is always a copy 
> of state on the stack at some critical points, just because it makes 
> loads of stuff simpler (debugging, profiling, GC, and of course 
> deopt).  I personally prefer to stay away from such a strategy because 
> it's not free.
>
> On the other hand, branches can be cheap.  A branch on a divide is 
> cheaper than not being able to optimize the divide.
>
>>
>>
>> The Value called by a patchpoint should participate in optimization 
>> normally.
>
> I agree that you could have a different intrinsic that behaves like this.
>
>> We really want the patchpoint part of the call to be supplemental.  
>> It should still be a call. It should be constant propagated, 
>> transformed, etc..  This is not the case currently.  I've got a 
>> couple of off the wall ideas for improving the current status, but 
>> I'll agree this is a hardish problem.
>>
>> It should be legal to use a patchpoint in an invoke.  It's an ABI 
>> issue of how the invoke path gets invoked.  (i.e. side tables for the 
>> runtime to lookup, etc..)  This is not possible today, and probably 
>> requires a fair amount of work.  Some of it, I've already done and 
>> will be sharing shortly.  Other parts, I haven't even thought
about.
>
> Right, it's significantly more complex than either the existing 
> patchpoints or Michael's proposed safe.div.
>
>>
>>
>> If you didn't want to use the trapping semantics, you'd insert 
>> dedicated control flow _before_ the divide.  This would allow normal 
>> optimization of the control flow.
>>
>> Notes:
>> 1) This might require a new PATCHPOINT pseudo op in the backend.  
>> Haven't thought much about that yet.
>> 2) I *think* your current intrinsic could be translated into 
>> something like this.  (Leaving aside the question of where the deopt 
>> state comes from.)  In fact, the more I look at this, the less 
>> difference I actually see between the approaches.
>>
>>
>>>>>
>>>>> In a lot of languages, a divide produces some result even
in the
>>>>> exceptional case and this result requires effectively
deoptimizing
>>>>> since the resut won't be the one you would have
predicted (double
>>>>> instead of int, or BigInt instead of small int), which sort
of
>>>>> means that if the CPU exception occurs you have to be able
to
>>>>> reconstruct all state.  A patchpoint could do this, and so
could
>>>>> spilling all state to the stack before the divide - but
both are
>>>>> very heavy hammers that are sure to be more expensive than
just
>>>>> doing a branch.
>>>>>
>>>> This isn't necessarily as expensive as you might believe.
I'd
>>>> recommend reading the Graal project papers on this topic.
>>>>
>>>> Whether deopt or branching is more profitable *in this case*, I
>>>> can't easily say.  I'm not yet to the point of being
able to run
>>>> that experiment.  We can argue about what "should" be
better all we
>>>> want, but real performance data is the only way to truly know.
>>>
>>> My point may have been confusing.  I know that deoptimization is 
>>> cheap and WebKit uses it everywhere, including division corner 
>>> cases, if profiling tells us that it's profitable to do so
(which it
>>> does, in the common case).  WebKit is a heavy user of
deoptimization
>>> in general, so you don't need to convince me that it's
worth it.
>>>
>> Acknowledged.
>>>
>>> Note that I want *both* deopt *and* branching, because in this
case,
>>> a branch is the fastest overall way of detecting when to deopt.  In
>>> the future, I will want to implement the deopt in terms of 
>>> branching, and when we do this, I believe that the most sound and 
>>> performat approach would be using Michael's intrinsics.  This
is
>>> subtle and I'll try to explain why it's the case.
>>>
>>> The point is that you wouldn't want to do deoptimization by
spilling
>>> state on the main path or by using a patchpoint for the main path
of
>>> the division.
>>>
>> This is the main point I disagree with.  I don't believe that
having
>> a patchpoint on the main path should be any more expensive then the 
>> original call.  (see above)
>
> The reason why the patchpoint is expensive is that if you use a 
> patchpoint to implement a division then the optimizer won't be allowed 
> to assume that it's a division, because the whole point of 
> "patchpoint" is to tell the optimizer to piss off.
>
>>
>>
>> Worth noting explicitly: I'm assuming that all of your deopt state 
>> would already be available for other purposes in nearby code.  It's
>> on the stack or in registers.  I'm assuming that by adding the
deopt
>> point, you are not radically changing the set of computations which 
>> need to be done.  If that's not the case, you should avoid deopt
and
>> instead just inline the slow paths with explicit checks.
>
> Yes, of course it is.  That's not the issue.
>
>>
>>
>> I'll note that given your assumptions about the cost of a
patchpoint,
>> the rest of your position makes a lot more sense.  :)  As I spelled 
>> out above, I believe this cost is not fundamental.
>>>
>>> You don't want the common path of executing the division to
involve
>>> a patchpoint instruction, although using a patchpoint or stackmap
to
>>> implement deoptimization on the failing path is great:
>>>
>>> *Good:*if (division would fail) { call @patchpoint(all of my state)
>>> } else { result = a / b }
>>>
>> Given your cost assumptions, I'd agree.
>
> Not my cost assumptions.  The reason why this is better is that the 
> division is expressed in LLVM IR so that LLVM can do useful things to 
> it - like eliminate it, for example.
>
>>> *
>>> Bad:*call @patchpoint(all of my state) // patch with a divide 
>>> instruction - bad because the optimizer has no clue what you're
>>> doing and assumes the very worst
>>>
>> Yuck.  Agreed.
>
> To be clear, this is what you're proposing - except that you're 
> assuming that LLVM will know that you've patched a division because 
> you're expecting the call target to have semantic meaning.  Or, 
> rather, you're expecting that you can make the contents of the 
> patchpoint be a division by having the call target be a division 
> intrinsic.  In the current implementation and as it is currently 
> specified, the call target has no meaning and so you get the yuck that 
> I'm showing.
>
>>> *
>>> Worse:*spill all state to the stack; call @trapping.div(a, b) //
the
>>> spills will hurt you far more than a branch, so this should be
avoided
>>>
>> I'm assuming this is an explicit spill rather than simply recording
a
>> stack map *at the div*.  If so, agreed.
>>>
>>> I suppose we could imagine a fourth option that involves a 
>>> patchpoint to pick up the state and a trapping divide instrinsic. 
>>>  But a trapping divide intrinsic alone is not enough.  Consider
this:
>>>
>>> result = call @trapping.div(a, b); call @stackmap(all of my state)
>>>
>>> As soon as these are separate instructions, you have no guarantees 
>>> that the state that the stackmap reports is sound for the point at 
>>> which the div would trap.
>>>
>> This is the closest to what I'd propose, except that the two calls 
>> would be merged into a single patchpoint.  Isn't the entire point
of
>> a patchpoint to record the stack map for a call?
>
> No.  It would be bad if that's what the documentation says.  That's
> not at all how WebKit uses it or probably any IC client would use it.
>
> Patchpoints are designed to be inline assembly on steroids.  They're 
> there to allow the client JIT to tell LLVM to piss off.
>
>> (Well, ignoring the actual patching part..)  Why not write this as:
>> patchpoint(..., trapping.div, a, b);
>>
>> Is there something I'm missing here?
>>
>> Just to note: I fully agree that the two call proposal is unsound and 
>> should be avoided.
>>
>>> So, the division itself shouldn't be a trapping instruction and
>>> instead you want to detect the bad case with a branch.
>>>
>>> To be clear:
>>>
>>> - Whether you use deoptimization for division or anything else - 
>>> like WebKit has done since before any of the Graal papers were 
>>> written - is mostly unrelated to how you represent the division, 
>>> unless you wanted to add a new intrinsic that is like a 
>>> trapping-division-with-stackmap:
>>>
>>> result = call @trapping.div.with.stackmap(a, b, ... all of my state
...)
>>>
>>> Now, maybe you do want such an intrinsic, in which case you should 
>>> propose it!
>>>
>> Given what we already have with patchpoints, I don't think a merged
>> intrinsic is necessary.  (See above). I believe we have all the parts 
>> to build this solution, and that - in theory - they should compose 
>> neatly.
>>
>> p.s. The bits I was referring to was not deopt per se.  It was 
>> particularly which set of deopt state you used for each deopt point.  
>> That's a bit of tangent for the rest of the discussion now though.
>>>
>>> The reason why I haven't proposed it is that I think that
long-term,
>>> the currently proposed intrinsics are a better path to getting the 
>>> trapping optimizations.  See my previous mail, where I show how we 
>>> could tell LLVM what the failing path is (which may have 
>>> deoptimization code that uses a stackmap or whatever), what the 
>>> trapping predicate is (it comes from the safe.div intrinsic), and 
>>> the fact that trapping is wise (branch weights).
>>>
>>> - If you want to do the deoptimization with a trap, then your only 
>>> choice currently is to use a patchpoint for the main path of the 
>>> division.  This will be slower than using a branch to an OSR exit 
>>> basic block, because you're making the division itself opaque
to the
>>> optimizer (bad) just to get rid of a branch (which was probably 
>>> cheap to begin with).
>>>
>>> Hence, what you want to do - one way or another, regardless of 
>>> whether this proposed intrinsic is added - is to branch on the 
>>> corner case condition, and have the slow case of the branch go to a
>>> basic block that deoptimizes.  Unless of course you have profiling 
>>> that says that the case does happen often, in which case you can 
>>> have that basic block handle the corner case inline without leaving
>>> optimized code (FWIW, we do have such paths in WebKit and they are 
>>> useful).
>>>
>>> So the question for me is whether the branching involves explicit 
>>> control flow or is hidden inside an intrinsic.  I prefer for it to 
>>> be within an intrinsic because it:
>>>
>>> - allows the optimizer to do more interesting things in the common 
>>> cases, like hoisting the entire division.
>>>
>>> - will give us a clearer path for implementing trapping 
>>> optimizations in the future.
>>>
>>> - is an immediate win on ARM.
>>>
>>> I'd be curious to hear what specific idea you have about how to
>>> implement trap-based deoptimization with your trapping division 
>>> intrinsic for x86 - maybe it's different from the two
"bad" idioms I
>>> showed above.
>>>
>> I hope my explanation above helps.  If not, ask, and I'll try to 
>> explain more clearly.
>
> I think I understand it.  I think that the only issue is that:
>
> - Patchpoints currently don't do what you want.
>
> - If you made patchpoints do what you want then you'd break WebKit - 
> not to mention anyone who wants to use them for inline caches.
>
> So it seems like you want a new intrinsic.  You should officially 
> propose this new intrinsic, particularly since the core semantic 
> differences are not so great from what we have now.  OTOH, if you 
> truly believe that patchpoints should just be changed to your 
> semantics in a way that does break WebKit, then that's probably also 
> something you should get off your chest. ;-)
>
>>
>>
>> One point just for clarity; I don't believe this effects the 
>> conclusions of our discussion so far.  I'm also fairly sure that
you
>> (Filip) are aware of this, but want to spell it out for other readers.
>>
>> You seem to be assuming that compiled code needs to explicitly branch 
>> to a point where deopt state is known to exit a compiled frame.
>
> This is a slightly unclear characterization of my assumptions.  Our 
> JIT does deoptimization without explicit branches for many, many 
> things.  You should look at it some time, it's pretty fancy. ;-)
>
>> Worth noting is that you can also exit a compiled frame on a trap 
>> (without an explicitly condition/branch!) if the deopt state is known 
>> at the point you take the trap.  This "exit frame on trap"
behavior
>> shows up with null pointer exceptions as well. I'll note that both 
>> compilers in OpenJDK support some combination of
"exit-on-trap"
>> conditions for division and null dereferences. The two differ on 
>> exactly what strategies they use in each case though.  :)
>
> Yeah, and I've also implemented VMs that do this - and I endorse the 
> basic idea.  I know what you want, and my only point is that the 
> existing patchpoints only give you this if you're willing to make a 
> huge compromise: namely, that you're willing to make the division (or 
> heap load for the null case) completely opaque to the compiler to the 
> point that GVN, LICM, SCCP, and all algebraic reasoning have to give 
> up on optimizing it.  The point of using LLVM is that it can optimize 
> code.  It can optimize branches and divisions pretty well.  So, 
> eliminating an explicit branch by replacing it with a construct that 
> appears opaque to the optimizer is not a smart trade-off.
>
> You could add a new intrinsic that, like patchpoints, records the 
> layout of state in a side-table, but that is used as a kind of wrapper 
> for operations that LLVM understands.  This may or may not be hairy - 
> you seem to have sort of acknowledged that it's got some complexity 
> and I've also pointed out some possible issues.  If this is something 
> that you want, you should propose it so that others know what you're 
> talking about.  One danger of how we're discussing this right now is 
> that you're overloading patchpoints to mean the thing you want them to 
> mean rather than what they actually mean, which makes it seem like we 
> don't need Michael's intrinsics on the grounds that patchpoints 
> already offer a solution.  They don't already offer a solution 
> precisely because patchpoints don't do what your intrinsics would do.
>
>>
>>
>> I'm not really arguing that either scheme is "better" in
all cases.
>> I'm simply arguing that we should support both and allow
optimization
>> and tuning between them.  As far as I can tell, you seem to be 
>> assuming that an explicit branch to known exit point is always
superior.
>>
>>
>> Ok, back to the topic at hand...
>>
>> With regards to the current proposal, I'm going to take a step
back.
>> You guys seem to have already looked in this in a fair amount of 
>> depth.  I'm not necessarily convinced you've come to the best 
>> solution, but at some point, we need to make forward progress.  What 
>> you have is clearly better than nothing.
>>
>> Please go ahead and submit your current approach.  We can come back 
>> and revise later if we really need to.
>>
>> I do request the following changes:
>> - Mark it clearly as experimental.
>>
>> - Either don't specify the value computed in the edge cases, or
allow
>> those values to be specified as constant arguments to the call.  This 
>> allows efficient lowering to x86's div instruction if you want to 
>> make use of the trapping semantics.
>
> Once again: how would you use this to get trapping semantics without 
> throwing all of LLVM's optimizations out the window, in the absence of 
> the kind of patchpoint-like intrinsic that you want?  I ask just to 
> make sure that we're on the same page.
>
>>
>>
>>> Finally, as for performance data, which part of this do you want 
>>> performance data for?  I concede that I don't have performance
data
>>> for using Michael's new intrinsic.  Part of what the intrinsic 
>>> accomplishes is it gives a less ugly way of doing something that is
>>> already possible with target intrinsics on ARM.  I think it would
be
>>> great if you could get those semantics - along with a known-good 
>>> implementation - on other architectures as well.
>>>
>> I would be very interested in seeing data comparing two schemes:
>> - Explicit control flow emited by the frontend
>> - The safe.div intrinsic emitted by the frontend, desugared in 
>> CodeGenPrep
>>
>> My strong suspicion is that each would preform well in some cases and 
>> not in others. At least on x86.  Since the edge-checks are 
>> essentially free on ARM, the second scheme would probably be strictly 
>> superior there.
>>
>> I am NOT asking that we block submission on this data however.
>>
>>> But this discussion has also involved suggestions that we should
use
>>> trapping to implement deoptimization, and the main point of my 
>>> message is to strongly argue against anything like this given the 
>>> current state of trapping semantics and how patchpoints work.  My 
>>> point is that using traps for division corner cases would either be
>>> unsound (see the stackmap after the trap, above), or would require 
>>> you to do things that are obviously inefficient.  If you truly 
>>> believe that the branch to detect division slow paths is more 
>>> expensive than spilling all bytecode state to the stack or using a 
>>> patchpoint for the division, then I could probably hack something
up
>>> in WebKit to show you the performance implications.  (Or you could 
>>> do it yourself, the code is open source...)
>>>
>> In a couple of months, I'll probably have the performance data to 
>> discuss this for real.  When that happens, let's pick this up and 
>> continue the debate. Alternatively, if you want to chat this over 
>> more with a beer in hand at the social next week, let me know.  In 
>> the meantime, let's not stall the current proposal any more.
>>
>> Philip
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140430/4eadf969/attachment.html>

Filip Pizlo

2014-May-01 05:57 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

> On Apr 30, 2014, at 10:34 PM, Philip Reames <listmail at
philipreames.com> wrote:
> 
> Andy - If you're not already following this closely, please start. 
We've gotten into fairly fundamental questions of what a patchpoint does.
> 
> Filip, 
> 
> I think you've hit the nail on the head.  What I'm thinking of as
being patchpoints are not what you think they are.  Part of that is that
I've got a local change which adds a very similar construction (called
"statepoints" for the moment), but I was trying to keep that separate.
That also includes a lot of GC semantics which are not under discussion
currently.  My apologies if that experience bled over into this conversation and
made things more confusing.
> 
> I will note that the documentation for patchpoint say explicitly the
following:
> "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function
call to the specified <target> and records the location of specified
values in the stack map."
I'm not disputing that the patch point will *initially* call the thing you
want. But the point of the patch point - and the reason why the word
"patch" is part of the name - is that you're allowed to modify the
machine code in the patch point from the client JIT. This presumes that the LLVM
optimizer cannot infer semantics from the call target.

Once we started using the patch points in WebKit, we quickly realized that the
only call target that makes sense is null: our inline cache state machine may
return to "just call a function" so our client JIT might as well have
full control of what such a call looks like. So we overwrite all of the machine
code that LLVM generates for the patch point and passing null means that LLVM
only emits nops.
> 
> My reading has always been that a patchpoint *that isn't patched* is
simply a call with a stackmap associated with it.  To my reading, this can (and
did, and does) indicate my proposed usage would be legal.
Yes and I agree, but any optimizations in LLVM based on the call target would be
illegal.
> 
> I will agree that I've confused the topic badly on the optimization
front.  My "statepoint" isn't patchable, so a lot more
optimizations are legal.  Sorry about that.  To restate what I think you've
been saying all along, the optimizer can't make assumptions about what
function is called by a patchpoint because that might change based on later
patching.  Is this the key point you've been trying to make?
Yup!
> 
> I'm not objecting to separating "my patchpoint" from
"your patchpoint".  Let's just hammer out the semantics of each
first.  :)
> 
> Again, longer response to follow in a day or so. :)
> 
> Philip
> 
>> On 04/30/2014 10:09 PM, Filip Pizlo wrote:
>> 
>> 
>>> On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at
philipreames.com) wrote:
>>> 
>>>> On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>>>>> On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail
at philipreames.com) wrote:
>>>>>> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>>>>>> LD;DR: Your desire to use trapping on x86 only further
convinces me that Michael's proposed intrinsics are the best way to go.
>>>>> I'm still not convinced, but am not going to actively
oppose it either.  I'm leery of designing a solution with major assumptions
we don't have data to backup.
>>>>> 
>>>>> I worry your assumptions about deoptimization are
potentially unsound.                                But I don't have data to
actually show this (yet).
>>>> 
>>>> I *think* I may have been unclear about my assumptions; in
particular, my claims with respect to deoptimization are probably more subtle
than they appeared.  WebKit can use LLVM and it has divisions and we do all
possible deoptimization/profiling/etc tricks for it, so                      
this is grounded in experience.  Forgive me if the rest of this e-mail contains
a lecture on things that are obvious - I'll try to err on the side of
clarity and completeness since this discussion is sufficiently dense that we run
the risk of talking cross-purposes unless some baseline assumptions are
established.
>>>> 
>>> I think we're using the same terminology, but with slightly
different sets of assumptions.  I'll point this out below where relevant.
>>> 
>>> Also, thanks for taking the time to expand.  It help clarify the
discussion quite a bit.
>> 
>> I think we may be converging to an understanding of what you want
versus what I want, and I think that there are some points - possibly unrelated
to division - that are worth clarifying.  I think that the main difference is
that when I say "patchpoint", I am referring to a concrete intrinsic
with specific semantics that cannot change without breaking WebKit, while you
are using the term to refer to a broad concept, or rather, a class of
as-yet-unimplemented intrinsics that share some of the same features with
patchpoints but otherwise have incompatible semantics.
>> 
>> Also, when I say that you wouldn't want to use the existing
patchpoint to do your trapping deopt, what I mean is that the performance of
doing this would suck for reasons that are not related to deoptimization or
trapping.  I'm not claiming that deoptimization performs poorly (trust me, I
know better) or that trapping to deoptimize is bad (I've done this many,
many times and I know better).  I'm saying that with the current          
patchpoint intrinsics in LLVM, as they are currently specified and implemented,
doing it would be a bad idea because you'd have to compromise a bunch of
other optimizations to achieve it.
>> 
>> You have essentially described new intrinsics that would make this less
of a bad idea and I am interested in your intrinsics, so I'll try to both
respond with why patchpoints don't currently give you what you want (and why
simply changing patchpoint semantics would be evil) and I'll also try to
comment on what I think of the intrinsic that you're effectively proposing. 
Long story short, I think you should formally propose your intrinsic even if
it's not completely fleshed out.  I think that it's an interesting
capability and in its most basic form, it is a simple variation of the current
patchpoint/stackmap intrinsics.
>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>>> On April 29, 2014 at 10:09:49 AM, Philip Reames
(listmail at philipreames.com) wrote:
>>>>>>> 
>>>>>>> As the discussion has progressed and I've spent
more time thinking about the topic, I find myself less and less enthused about
the current proposal.  I'm in full support of having idiomatic ways to
express safe division, but I'm starting to doubt that using an intrinsic is
the right way at the moment.
>>>>>>> 
>>>>>>> One case I find myself thinking about is how one
would combine profiling information and implicit div-by-zero/overflow checks
with this proposal.  I don't really see a clean way.  Ideally, for a
"safe div" which never has the exceptional paths taken, you'd like
to completely do away with the control flow entirely.  (And rely on hardware
traps w/exceptions instead.)  I don't really see a way to represent that
type of construct given the current proposal.
>>>>>> 
>>>>>> This is a deeper problem and to solve it you'd need
a solution to trapping in general.  Let's consider the case of Java.  A Java
program may want to catch the arithmetic exception due to divide by zero.  How
would you do this with a trap in LLVM IR?  Spill all state that is live at the
catch?  Use a patchpoint for the entire division instruction?
>>>>>> 
>>>>> We'd likely use something similar to a patchpoint. 
You'd need the "abstract vm state" (which is not the compiled
frame necessarily) available at the div instruction.  You could then re-enter
the                               interpreter at the specified index (part of
the vm state).  We have all most of these mechanisms in place.  Ideally,
you'd trigger a recompile and otherwise ensure re-entry into compiled code
at the soonest possible moment.
>>>>> 
>>>>> This requires a lot of runtime support, but we already have
most of it implemented for another compiler.  From our perspective, the runtime
requirements are not a major blocker.
>>>> 
>>>> Right, you'll use a patchpoint.  That's way            
more expensive than using a safe division intrinsic with branches, because
it's opaque to the optimizer.
>>>> 
>>> This statement is true at the moment, but it shouldn't be.  I
think this is our fundamental difference in approach.
>>> 
>>> You should be able to write something like:
>>> i32 %res = invoke patchpoint (... x86_trapping_divide, a, b)
normal_dest invoke_dest
>>> 
>>> normal_dest:
>>>   ;; use %res
>>> invoke_dest:
>>>   landingpad
>>>   ;; dispatch edge cases
>>>   ;; this could be unreachable code if you deopt this frame in the
trap handler and jump directly to an interpreter or other bit of code
>> 
>> I see.  It sounds like you want a generalization of the
"div.with.stackmap" that I thought you wanted - you want to be able to
wrap anything in a stackmap.
>> 
>> The current patchpoint intrinsic does not do this, and you run the risk
of breaking existing semantics if you changed this.  You'd probably break
WebKit, which treats the call target of the patchpoint as nothing more than a
quirk - we always pass null.  Also, the current patchpoint treats the callee as
an i8* if I remember right and it would be super weird if all LLVM phases had to
interpret this i8* by unwrapping a possible bitcast to get to a declared
function that may be an intrinsic.  Yuck!  Basically, the call target of
existing patchpoints is meant to be a kind of convenience feature rather than
the core of the mechanism.
>> 
>> I agree in principle that the intrinsic that you want would be a useful
intrinsic.  But let's not call it a patchpoint for the purposes of this
discussion, and let's not confuse the discussion by claiming (incorrectly)
that the existing patchpoint facility gives you what you want.  It doesn't:
patchpoints are designed to make the call target opaque (hence the use of i8*)
and there shouldn't be a correlation between what the patchpoint does at
run-time and what the called function would have done.  You could make the call
target be null (like WebKit does) and the patchpoint should still mean
"this code can do anything" because the expectation is that the client
JIT will patch over it anyway.
>> 
>> Also, "patchpoint" would probably not be the right term for
the intrinsic that you want.  I think that what you want is
"call.with.stackmap".  Or maybe "stackmap.wrapper".  Or just
"stackmap" - I'd be OK, in principle, with changing the name of
the current "stackmap" intrinsic to something that reflects the fact
that it's less of a stackmap than what you want.
>> 
>> To summarize.  A patchpoint's main purpose is that you can patch it
with arbitrary code.  The current "stackmap" means that you can patch
it with arbitrary code and that patching may be destructive to a shadow of
machine code bytes, so it's really just like patchpoints - we could change
its name to "patchpoint.shadow" for example.
>> 
>> If you were to propose such a stackmap intrinsic, then I think there
could be some ways of doing it that wouldn't be too terrible.  Basically you
want something that is like a patchpoint in that it reports a stackmap via a
side channel, but unlike patchpoints, it doesn't allow arbitrary patching -
instead the optimizer should be allowed to assume that the code within the
patchpoint will always do the same thing that the call target would have done. 
There are downsides to truly doing this.  For example, to make division
efficient with such an intrinsic, you'd have to do something that is
somewhat worse than just recognizing intrinsics in the optimizer - you'd
have to first recognize a call to your "stackmap wrapper" intrinsic
and then observe that its call target argument is in turn         another
intrinsic.  To me personally, that's kind of yucky, but I won't deny
that it could be useful.
>> 
>> As to the use of invoke: I don't believe that the use of invoke
versus my suggested "branch on a trap predicate" idea are different in
any truly meaningful way.  I buy that either would work.
>> 
>>> 
>>> 
>>> A patchpoint should not require any excess spilling.  If values are
live in registers, that should be reflected in the stack map.  (I do not know if
this is the case for patchpoint at the moment or not.)
>> 
>> Patchpoints do not require spilling.
>> 
>> My point was that with existing patchpoints, you *either* use a
patchpoint for the entire division which makes the division opaque to the
optimizer - because a patchpoint means "this code can do anything" -
*or* you could spill stuff to the stack prior to your trapping division
intrinsic, since spilling is something that you could do as a workaround if you
didn't have a patchpoint.
>> 
>> The reason why I brought up spilling at all is that I suspect that
spilling all state to the stack might be cheaper - for some systems - than
turning the division into a patchpoint.  Turning the division into a patchpoint
is horrendously brutal - the patchpoint looks like it clobbers the heap (which a
division doesn't do), has to execute (a division is an obvious DCE
candidate), cannot be hoisted (hoisting divisions is awesome), etc.  Perhaps
most importantly, though, a patchpoint doesn't tell LLVM that you're
*doing a division* - so all constant folding, strenght reduction, and algebraic
reasoning flies out the window.  On the other hand, spilling all state to the
stack is an arguably sound and performant solution to a lot of VM problems. 
I've seen JVM implementations that ensure that there is always a copy of
state on the stack at some critical points, just because it makes loads of stuff
simpler (debugging, profiling, GC, and of course deopt).  I personally prefer to
stay away from such a strategy because it's not free.
>> 
>> On the other hand, branches can be cheap.  A branch on a divide is
cheaper than not being able to optimize the divide.
>> 
>>> 
>>> 
>>> The Value called by a patchpoint should participate in optimization
normally.
>> 
>> I agree that you could have a different intrinsic that behaves like
this.
>> 
>>> We really want the patchpoint part of the call to be supplemental. 
It should still be a call.  It should be constant propagated, transformed, etc..
This is not the case currently.  I've got a couple of off the wall ideas for
improving the current status, but I'll agree this is a hardish problem.
>>> 
>>> It should be legal to use a patchpoint in an invoke.  It's an
ABI issue of how the invoke path gets invoked.  (i.e. side tables for the
runtime to lookup, etc..)  This is not possible today, and probably requires a
fair amount of work.  Some of it, I've already done and will be sharing
shortly.  Other parts, I haven't even thought about.
>> 
>> Right, it's significantly more complex than either the existing
patchpoints or Michael's proposed safe.div.
>> 
>>> 
>>> 
>>> If you didn't want to use the trapping semantics, you'd
insert dedicated control flow _before_ the divide.  This would allow normal
optimization of the control flow.
>>> 
>>> Notes:
>>> 1) This might require a new PATCHPOINT pseudo op in the backend. 
Haven't thought much about that yet.
>>> 2) I *think* your current intrinsic could be translated into
something like this.  (Leaving aside the question of where the deopt state comes
from.)  In fact, the more I look at this, the less difference I actually see
between the approaches.
>>> 
>>> 
>>>>>> 
>>>>>> In a lot of languages, a divide produces some result
even in the exceptional case and this result requires effectively deoptimizing
since the resut won't be the one you would have predicted (double instead of
int, or BigInt instead of small int), which sort of means that if the CPU
exception occurs you have to be able                                            
to reconstruct all state.  A patchpoint could do this, and so could spilling all
state to the stack before the divide - but both are very heavy hammers that are
sure to be more expensive than just doing a branch.
>>>>>> 
>>>>> This isn't necessarily as expensive as you might
believe.  I'd recommend reading the Graal project papers on this topic.
>>>>> 
>>>>> Whether deopt or branching is more profitable *in this
case*, I can't easily say.  I'm not yet to the point of being able to
run that experiment.  We can argue about what "should" be better all
we want, but real performance data is the only way to truly know.
>>>> 
>>>> My point may have been confusing.  I know that deoptimization
is cheap and WebKit uses it everywhere, including division corner cases, if
profiling tells us that it's profitable to do so (which it does, in the
common case).  WebKit is a heavy user of deoptimization in general, so you
don't need to convince me that it's worth it.
>>>> 
>>> Acknowledged. 
>>>> Note that I want *both* deopt *and* branching, because in this
case, a branch is the fastest overall way of detecting when to deopt.  In the
future, I will want to implement the deopt in terms of branching, and when we do
this, I believe that the most sound and performat approach would be using
Michael's intrinsics.  This is subtle and I'll try to explain why
it's the case.
>>>> 
>>>> The point is that you wouldn't want to do deoptimization by
spilling state on the main path or by using a patchpoint for the main path of
the division.
>>>> 
>>> This is the main point I disagree with.  I don't believe that
having a patchpoint on the main path should be any more expensive then the
original call.  (see above)
>> 
>> The reason why the patchpoint is expensive is that if you use a
patchpoint to implement a division then the optimizer won't be allowed to
assume that it's a division, because the whole point of
"patchpoint" is to tell the optimizer to piss off.
>> 
>>> 
>>> 
>>> Worth noting explicitly: I'm assuming that all                 
of your deopt state would already be available for other purposes in nearby
code.  It's on the stack or in registers.  I'm assuming that by adding
the deopt point, you are not radically changing the set of computations which
need to be done.  If that's not the case, you should avoid deopt and instead
just inline the slow paths with explicit checks.
>> 
>> Yes, of course it is.  That's not the issue.
>> 
>>> 
>>> 
>>> I'll note that given your assumptions about the cost of a
patchpoint, the rest of your position makes a lot more sense.  :)  As I spelled
out above, I believe this cost is not fundamental.
>>>> You don't want the common path of executing the division to
involve a patchpoint instruction, although using a patchpoint or stackmap to
implement deoptimization on the failing path is great:
>>>> 
>>>> Good: if (division would fail) { call @patchpoint(all of my
state) } else { result = a / b }
>>>> 
>>> Given your cost assumptions, I'd agree. 
>> 
>> Not my cost assumptions.  The reason why this is better is that the
division is expressed in LLVM IR so that LLVM can do useful things to it - like
eliminate it, for example.
>> 
>>>> 
>>>> Bad: call @patchpoint(all of my state) // patch with a divide
instruction - bad because the optimizer has no clue what you're doing and
assumes the very worst
>>>> 
>>> Yuck.  Agreed. 
>> 
>> To be clear, this is what you're proposing - except that you're
assuming that LLVM will know that you've patched a division because
you're expecting the call target to have semantic meaning.  Or, rather,
you're expecting that you can make the contents of the patchpoint be a
division by having the call target be a division intrinsic.  In the current
implementation and as it is currently specified, the call target has no meaning
and so you get the yuck that I'm showing.
>> 
>>>> 
>>>> Worse: spill all state to the stack; call @trapping.div(a, b)
// the spills will hurt you far more than a branch, so this should be avoided
>>>> 
>>> I'm assuming this is an explicit spill rather than simply
recording a stack map *at the div*.  If so, agreed.
>>>> I suppose we could imagine a fourth option that involves a
patchpoint to pick up the state and a trapping divide instrinsic.  But a
trapping divide intrinsic alone is not enough.  Consider this:
>>>> 
>>>> result = call @trapping.div(a, b); call @stackmap(all of my
state)
>>>> 
>>>> As soon as these are separate instructions, you have no
guarantees that the state that the stackmap reports is sound for the point at
which the div would trap.
>>> This is the closest to what I'd propose, except that the two
calls would be merged into a single patchpoint.  Isn't the entire point of a
patchpoint to record the stack map for a call?
>> 
>> No.  It would be bad if that's what the documentation says. 
That's not at all how WebKit uses it or probably any IC client would use it.
>> 
>> Patchpoints are designed to be inline assembly on steroids. 
They're there to allow the client JIT to tell LLVM to piss off.
>> 
>>> (Well, ignoring the actual patching part..)  Why not write this as:
>>> patchpoint(..., trapping.div, a, b);
>>> 
>>> Is there something I'm missing here?
>>> 
>>> Just to note: I fully agree that the two call proposal is unsound
and should be avoided.
>>> 
>>>> So, the division itself shouldn't be a trapping instruction
and instead you want to detect the bad case with a branch.
>>>> 
>>>> To be clear:
>>>> 
>>>> - Whether you use deoptimization for division or anything else
- like WebKit has done since before any of the Graal papers were written - is
mostly unrelated to how you represent the division, unless you wanted to add a
new intrinsic that is like a trapping-division-with-stackmap:
>>>> 
>>>> result = call @trapping.div.with.stackmap(a, b, ... all of my
state ...)
>>>> 
>>>> Now, maybe you do want such an intrinsic, in which case you
should propose it!
>>> Given what we already have with patchpoints, I don't think a
merged intrinsic is necessary.  (See above).  I believe we have all the parts to
build this solution, and that - in theory - they should compose neatly.
>>> 
>>> p.s. The bits I was referring to was not deopt per se.  It was
particularly which set of deopt state you used for each deopt point.  That's
a bit of tangent for the rest of the discussion now though.
>>>> The reason why I haven't proposed it is that I think that
long-term, the currently proposed intrinsics are a better path to getting the
trapping optimizations.  See my previous mail, where I show how we could tell
LLVM what the failing path is (which may have deoptimization code that uses a
stackmap or whatever), what the trapping predicate is (it comes from the
safe.div intrinsic), and the fact that trapping is wise (branch weights).
>>>> 
>>>> - If you want to do the deoptimization with a trap, then your
only choice currently is to use a patchpoint for the main path of the division. 
This will be slower than using a branch to an OSR exit basic block, because
you're making the division itself opaque to the optimizer (bad) just to get
rid of a branch (which was probably cheap to begin with).
>>>> 
>>>> Hence, what you want to do - one way or another, regardless of
whether this proposed intrinsic is added - is to branch on the corner case
condition, and have the slow case of the branch go to a basic block that
deoptimizes.  Unless of course you have profiling that says that the case does
happen often, in which case you can have that basic block handle the corner case
inline without leaving optimized code (FWIW, we do have such paths in WebKit and
they are useful).
>>>> 
>>>> So the question for me is whether the branching involves
explicit control flow or is hidden inside an intrinsic.  I prefer for it to be
within an intrinsic because it:
>>>> 
>>>> - allows the optimizer to do more interesting things in the
common cases, like hoisting the entire division.
>>>> 
>>>> - will give us a clearer path for implementing trapping
optimizations in the future.
>>>> 
>>>> - is an immediate win on ARM.
>>>> 
>>>> I'd be curious to hear what specific idea you have about
how to implement trap-based deoptimization with your trapping division intrinsic
for x86 - maybe it's different from the two "bad" idioms I showed
above.
>>>> 
>>> I hope my explanation above helps.  If not, ask, and I'll try
to explain more clearly.
>> 
>> I think I understand it.  I think that the only issue is that:
>> 
>> - Patchpoints currently don't do what you want.
>> 
>> - If you made patchpoints do what you want then you'd break WebKit
- not to mention anyone who wants to use them for inline caches.
>> 
>> So it seems like you want a new intrinsic.  You should officially
propose this new intrinsic, particularly since the core semantic differences are
not so great from what we have now.  OTOH, if you truly believe that patchpoints
should just be changed to your semantics in a way that does break WebKit, then
that's probably also something you should get off your chest. ;-)
>> 
>>> 
>>> 
>>> One point just for clarity; I don't believe this effects the
conclusions of our discussion so far.  I'm also fairly sure that you (Filip)
are aware of this, but want to spell it out for other readers.
>>> 
>>> You seem to be assuming that compiled code needs to explicitly
branch to a point where deopt state is known to exit a compiled frame.
>> 
>> This is a slightly unclear characterization of my assumptions.  Our JIT
does deoptimization without explicit branches for many, many things.  You should
look at it some time, it's pretty fancy. ;-)
>> 
>>> Worth noting is that you can also exit a compiled frame on a trap
(without an explicitly condition/branch!) if the deopt state is known at the
point you take the trap.  This "exit frame on trap" behavior shows up
with null pointer exceptions as well.  I'll note that both compilers in
OpenJDK support some combination of "exit-on-trap" conditions for
division and null dereferences.  The two differ on exactly what strategies they
use in each case though.  :)
>> 
>> Yeah, and I've also implemented VMs that do this - and I endorse
the basic idea.  I know what you want, and my only point is that the existing
patchpoints only give you this if you're willing to make a huge compromise:
namely, that you're willing to make the division (or heap load for the null
case) completely opaque to the compiler to the point that GVN, LICM, SCCP, and
all algebraic reasoning have to give up on optimizing it.  The point of using
LLVM is that it can optimize code.  It can optimize branches and divisions
pretty well.  So, eliminating an explicit branch by replacing it with a
construct that appears opaque to the optimizer is not a smart trade-off.
>> 
>> You could add a new intrinsic that, like patchpoints, records the
layout of state in a side-table, but that is used as a kind of wrapper for
operations that LLVM understands.  This may or may not be hairy - you seem to
have sort of acknowledged that it's got some complexity and I've also
pointed out some possible issues.  If this is something that you want, you
should propose it so that others know what you're talking about.  One danger
of how we're discussing this right now is that you're overloading
patchpoints to mean the thing you want them to mean rather than what they
actually mean, which makes it seem like we don't need Michael's
intrinsics on the grounds that patchpoints already offer a solution.  They
don't already offer a solution precisely because patchpoints don't do
what your intrinsics would do.
>> 
>>> 
>>> 
>>> I'm not really arguing that either scheme is "better"
in all cases.  I'm simply arguing that                                      
we should support both and allow optimization and tuning between them.  As far
as I can tell, you seem to be assuming that an explicit branch to known exit
point is always superior.
>>> 
>>> 
>>> Ok, back to the topic at hand...
>>> 
>>> With regards to the current proposal, I'm going to take a step
back.  You guys seem to have already looked in this in a fair amount of depth. 
I'm not necessarily convinced you've come to the best solution, but at
some point, we need to make forward progress.  What you have is clearly better
than nothing.
>>> 
>>> Please go ahead and submit your current approach.  We can come back
and revise later if we really need to.
>>> 
>>> I do request the following changes:
>>> - Mark it clearly as experimental.
>> 
>>> 
>>> - Either don't specify the value computed in the edge cases, or
allow those values to be specified as constant arguments to the call.  This
allows efficient lowering to x86's div instruction if you want to make use
of the trapping semantics.
>> 
>> Once again: how would you use this to get trapping semantics without
throwing all of LLVM's optimizations out the window, in the absence of the
kind of patchpoint-like intrinsic that you want?  I ask just to make sure that
we're on the same page.
>> 
>>> 
>>> 
>>>> Finally, as for performance data, which part of this do you
want performance data for?  I concede that I don't have performance data for
using Michael's new intrinsic.  Part of what the intrinsic accomplishes is
it gives a less ugly way of doing something that is already possible with target
intrinsics on ARM.  I think it would be great if you could get those semantics -
along with a known-good implementation - on other architectures as well.
>>>> 
>>> I would be very interested in seeing data comparing two schemes:
>>> - Explicit control flow emited by the frontend
>>> - The safe.div intrinsic emitted by the frontend, desugared in
CodeGenPrep
>>> 
>>> My strong suspicion is that each would preform well in some cases
and not in others.  At least on x86.  Since the edge-checks are essentially free
on ARM, the second scheme would probably be strictly superior there.
>>> 
>>> I am NOT asking that we block submission on this data however. 
>>> 
>>>> But this discussion has also involved suggestions that we
should use trapping to implement deoptimization, and the main point of my
message is to strongly argue against anything like this given the current state
of trapping semantics and how patchpoints work.  My point is that using traps
for division corner cases would either be unsound (see the stackmap after the
trap, above), or would require you to do things that are obviously inefficient. 
If you truly believe that the branch to detect division slow paths is more
expensive than spilling all bytecode state to the stack or using a patchpoint
for the division, then I could probably hack something up in WebKit to show you
the performance implications.  (Or you could do it yourself, the code is open
source...)
>>>> 
>>> In a couple of months, I'll probably have the performance data
to discuss this for real.  When that happens, let's pick this up and
continue the debate.  Alternatively, if you want to chat this over more with a
beer in hand at the social next week, let me know.  In the meantime, let's
not stall the current proposal any more.
>>> 
>>> Philip
> -------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140430/33d56a5f/attachment.html>

Jeff Bezanson

2014-May-02 15:53 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

I just want to chime in that I'd love to see something done about safe
division in LLVM. In my opinion, the problem is that division that is
both (1) as fast as possible, (2) has no undefined behavior is simply
not available in LLVM. I'd be happy with target-specific intrinsics.
That way if some architecture happens to do what your language wants,
and you're targeting that architecture, you can choose to emit that
intrinsic. This strikes me as a simpler and less contentious first
step than adding a generic intrinsic that favors one architecture or
another. I'd also be happy with a way to represent traps in LLVM IR,
and have a branch+trap recognized and lowered to a raw div instruction
on x86. But IIUC that is a much bigger can of worms.


On Thu, May 1, 2014 at 1:57 AM, Filip Pizlo <fpizlo at apple.com>
wrote:>
>
> On Apr 30, 2014, at 10:34 PM, Philip Reames <listmail at
philipreames.com>
> wrote:
>
> Andy - If you're not already following this closely, please start. 
We've
> gotten into fairly fundamental questions of what a patchpoint does.
>
> Filip,
>
> I think you've hit the nail on the head.  What I'm thinking of as
being
> patchpoints are not what you think they are.  Part of that is that I've
got
> a local change which adds a very similar construction (called
"statepoints"
> for the moment), but I was trying to keep that separate.  That also
includes
> a lot of GC semantics which are not under discussion currently.  My
> apologies if that experience bled over into this conversation and made
> things more confusing.
>
> I will note that the documentation for patchpoint say explicitly the
> following:
> "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function
call to
> the specified <target> and records the location of specified values
in the
> stack map."
>
>
> I'm not disputing that the patch point will *initially* call the thing
you
> want. But the point of the patch point - and the reason why the word
"patch"
> is part of the name - is that you're allowed to modify the machine code
in
> the patch point from the client JIT. This presumes that the LLVM optimizer
> cannot infer semantics from the call target.
>
> Once we started using the patch points in WebKit, we quickly realized that
> the only call target that makes sense is null: our inline cache state
> machine may return to "just call a function" so our client JIT
might as well
> have full control of what such a call looks like. So we overwrite all of
the
> machine code that LLVM generates for the patch point and passing null means
> that LLVM only emits nops.
>
>
> My reading has always been that a patchpoint *that isn't patched* is
simply
> a call with a stackmap associated with it.  To my reading, this can (and
> did, and does) indicate my proposed usage would be legal.
>
>
> Yes and I agree, but any optimizations in LLVM based on the call target
> would be illegal.
>
>
> I will agree that I've confused the topic badly on the optimization
front.
> My "statepoint" isn't patchable, so a lot more optimizations
are legal.
> Sorry about that.  To restate what I think you've been saying all
along, the
> optimizer can't make assumptions about what function is called by a
> patchpoint because that might change based on later patching.  Is this the
> key point you've been trying to make?
>
>
> Yup!
>
>
> I'm not objecting to separating "my patchpoint" from
"your patchpoint".
> Let's just hammer out the semantics of each first.  :)
>
> Again, longer response to follow in a day or so. :)
>
> Philip
>
> On 04/30/2014 10:09 PM, Filip Pizlo wrote:
>
>
>
> On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at
philipreames.com)
> wrote:
>
> On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>
> On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at
philipreames.com)
> wrote:
>
> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>
> LD;DR: Your desire to use trapping on x86 only further convinces me that
> Michael's proposed intrinsics are the best way to go.
>
> I'm still not convinced, but am not going to actively oppose it either.
I'm
> leery of designing a solution with major assumptions we don't have data
to
> backup.
>
> I worry your assumptions about deoptimization are potentially unsound.  But
> I don't have data to actually show this (yet).
>
> I *think* I may have been unclear about my assumptions; in particular, my
> claims with respect to deoptimization are probably more subtle than they
> appeared.  WebKit can use LLVM and it has divisions and we do all possible
> deoptimization/profiling/etc tricks for it, so this is grounded in
> experience.  Forgive me if the rest of this e-mail contains a lecture on
> things that are obvious - I'll try to err on the side of clarity and
> completeness since this discussion is sufficiently dense that we run the
> risk of talking cross-purposes unless some baseline assumptions are
> established.
>
> I think we're using the same terminology, but with slightly different
sets
> of assumptions.  I'll point this out below where relevant.
>
> Also, thanks for taking the time to expand.  It help clarify the discussion
> quite a bit.
>
> I think we may be converging to an understanding of what you want versus
> what I want, and I think that there are some points - possibly unrelated to
> division - that are worth clarifying.  I think that the main difference is
> that when I say "patchpoint", I am referring to a concrete
intrinsic with
> specific semantics that cannot change without breaking WebKit, while you
are
> using the term to refer to a broad concept, or rather, a class of
> as-yet-unimplemented intrinsics that share some of the same features with
> patchpoints but otherwise have incompatible semantics.
>
> Also, when I say that you wouldn't want to use the existing patchpoint
to do
> your trapping deopt, what I mean is that the performance of doing this
would
> suck for reasons that are not related to deoptimization or trapping. 
I'm
> not claiming that deoptimization performs poorly (trust me, I know better)
> or that trapping to deoptimize is bad (I've done this many, many times
and I
> know better).  I'm saying that with the current patchpoint intrinsics
in
> LLVM, as they are currently specified and implemented, doing it would be a
> bad idea because you'd have to compromise a bunch of other
optimizations to
> achieve it.
>
> You have essentially described new intrinsics that would make this less of
a
> bad idea and I am interested in your intrinsics, so I'll try to both
respond
> with why patchpoints don't currently give you what you want (and why
simply
> changing patchpoint semantics would be evil) and I'll also try to
comment on
> what I think of the intrinsic that you're effectively proposing.  Long
story
> short, I think you should formally propose your intrinsic even if it's
not
> completely fleshed out.  I think that it's an interesting capability
and in
> its most basic form, it is a simple variation of the current
> patchpoint/stackmap intrinsics.
>
>
>
>
>
> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at
philipreames.com)
> wrote:
>
> As the discussion has progressed and I've spent more time thinking
about the
> topic, I find myself less and less enthused about the current proposal. 
I'm
> in full support of having idiomatic ways to express safe division, but
I'm
> starting to doubt that using an intrinsic is the right way at the moment.
>
> One case I find myself thinking about is how one would combine profiling
> information and implicit div-by-zero/overflow checks with this proposal.  I
> don't really see a clean way.  Ideally, for a "safe div"
which never has the
> exceptional paths taken, you'd like to completely do away with the
control
> flow entirely.  (And rely on hardware traps w/exceptions instead.)  I
don't
> really see a way to represent that type of construct given the current
> proposal.
>
> This is a deeper problem and to solve it you'd need a solution to
trapping
> in general.  Let's consider the case of Java.  A Java program may want
to
> catch the arithmetic exception due to divide by zero.  How would you do
this
> with a trap in LLVM IR?  Spill all state that is live at the catch?  Use a
> patchpoint for the entire division instruction?
>
> We'd likely use something similar to a patchpoint.  You'd need the
"abstract
> vm state" (which is not the compiled frame necessarily) available at
the div
> instruction.  You could then re-enter the interpreter at the specified
index
> (part of the vm state).  We have all most of these mechanisms in place.
> Ideally, you'd trigger a recompile and otherwise ensure re-entry into
> compiled code at the soonest possible moment.
>
> This requires a lot of runtime support, but we already have most of it
> implemented for another compiler.  From our perspective, the runtime
> requirements are not a major blocker.
>
> Right, you'll use a patchpoint.  That's way more expensive than
using a safe
> division intrinsic with branches, because it's opaque to the optimizer.
>
> This statement is true at the moment, but it shouldn't be.  I think
this is
> our fundamental difference in approach.
>
> You should be able to write something like:
> i32 %res = invoke patchpoint (... x86_trapping_divide, a, b) normal_dest
> invoke_dest
>
> normal_dest:
>   ;; use %res
> invoke_dest:
>   landingpad
>   ;; dispatch edge cases
>   ;; this could be unreachable code if you deopt this frame in the trap
> handler and jump directly to an interpreter or other bit of code
>
> I see.  It sounds like you want a generalization of the
"div.with.stackmap"
> that I thought you wanted - you want to be able to wrap anything in a
> stackmap.
>
> The current patchpoint intrinsic does not do this, and you run the risk of
> breaking existing semantics if you changed this.  You'd probably break
> WebKit, which treats the call target of the patchpoint as nothing more than
> a quirk - we always pass null.  Also, the current patchpoint treats the
> callee as an i8* if I remember right and it would be super weird if all
LLVM
> phases had to interpret this i8* by unwrapping a possible bitcast to get to
> a declared function that may be an intrinsic.  Yuck!  Basically, the call
> target of existing patchpoints is meant to be a kind of convenience feature
> rather than the core of the mechanism.
>
> I agree in principle that the intrinsic that you want would be a useful
> intrinsic.  But let's not call it a patchpoint for the purposes of this
> discussion, and let's not confuse the discussion by claiming
(incorrectly)
> that the existing patchpoint facility gives you what you want.  It
doesn't:
> patchpoints are designed to make the call target opaque (hence the use of
> i8*) and there shouldn't be a correlation between what the patchpoint
does
> at run-time and what the called function would have done.  You could make
> the call target be null (like WebKit does) and the patchpoint should still
> mean "this code can do anything" because the expectation is that
the client
> JIT will patch over it anyway.
>
> Also, "patchpoint" would probably not be the right term for the
intrinsic
> that you want.  I think that what you want is
"call.with.stackmap".  Or
> maybe "stackmap.wrapper".  Or just "stackmap" - I'd
be OK, in principle,
> with changing the name of the current "stackmap" intrinsic to
something that
> reflects the fact that it's less of a stackmap than what you want.
>
> To summarize.  A patchpoint's main purpose is that you can patch it
with
> arbitrary code.  The current "stackmap" means that you can patch
it with
> arbitrary code and that patching may be destructive to a shadow of machine
> code bytes, so it's really just like patchpoints - we could change its
name
> to "patchpoint.shadow" for example.
>
> If you were to propose such a stackmap intrinsic, then I think there could
> be some ways of doing it that wouldn't be too terrible.  Basically you
want
> something that is like a patchpoint in that it reports a stackmap via a
side
> channel, but unlike patchpoints, it doesn't allow arbitrary patching -
> instead the optimizer should be allowed to assume that the code within the
> patchpoint will always do the same thing that the call target would have
> done.  There are downsides to truly doing this.  For example, to make
> division efficient with such an intrinsic, you'd have to do something
that
> is somewhat worse than just recognizing intrinsics in the optimizer -
you'd
> have to first recognize a call to your "stackmap wrapper"
intrinsic and then
> observe that its call target argument is in turn another intrinsic.  To me
> personally, that's kind of yucky, but I won't deny that it could be
useful.
>
> As to the use of invoke: I don't believe that the use of invoke versus
my
> suggested "branch on a trap predicate" idea are different in any
truly
> meaningful way.  I buy that either would work.
>
>
>
> A patchpoint should not require any excess spilling.  If values are live in
> registers, that should be reflected in the stack map.  (I do not know if
> this is the case for patchpoint at the moment or not.)
>
> Patchpoints do not require spilling.
>
> My point was that with existing patchpoints, you *either* use a patchpoint
> for the entire division which makes the division opaque to the optimizer -
> because a patchpoint means "this code can do anything" - *or* you
could
> spill stuff to the stack prior to your trapping division intrinsic, since
> spilling is something that you could do as a workaround if you didn't
have a
> patchpoint.
>
> The reason why I brought up spilling at all is that I suspect that spilling
> all state to the stack might be cheaper - for some systems - than turning
> the division into a patchpoint.  Turning the division into a patchpoint is
> horrendously brutal - the patchpoint looks like it clobbers the heap (which
> a division doesn't do), has to execute (a division is an obvious DCE
> candidate), cannot be hoisted (hoisting divisions is awesome), etc. 
Perhaps
> most importantly, though, a patchpoint doesn't tell LLVM that
you're *doing
> a division* - so all constant folding, strenght reduction, and algebraic
> reasoning flies out the window.  On the other hand, spilling all state to
> the stack is an arguably sound and performant solution to a lot of VM
> problems.  I've seen JVM implementations that ensure that there is
always a
> copy of state on the stack at some critical points, just because it makes
> loads of stuff simpler (debugging, profiling, GC, and of course deopt).  I
> personally prefer to stay away from such a strategy because it's not
free.
>
> On the other hand, branches can be cheap.  A branch on a divide is cheaper
> than not being able to optimize the divide.
>
>
>
> The Value called by a patchpoint should participate in optimization
> normally.
>
> I agree that you could have a different intrinsic that behaves like this.
>
> We really want the patchpoint part of the call to be supplemental.  It
> should still be a call.  It should be constant propagated, transformed,
> etc..  This is not the case currently.  I've got a couple of off the
wall
> ideas for improving the current status, but I'll agree this is a
hardish
> problem.
>
> It should be legal to use a patchpoint in an invoke.  It's an ABI issue
of
> how the invoke path gets invoked.  (i.e. side tables for the runtime to
> lookup, etc..)  This is not possible today, and probably requires a fair
> amount of work.  Some of it, I've already done and will be sharing
shortly.
> Other parts, I haven't even thought about.
>
> Right, it's significantly more complex than either the existing
patchpoints
> or Michael's proposed safe.div.
>
>
>
> If you didn't want to use the trapping semantics, you'd insert
dedicated
> control flow _before_ the divide.  This would allow normal optimization of
> the control flow.
>
> Notes:
> 1) This might require a new PATCHPOINT pseudo op in the backend. 
Haven't
> thought much about that yet.
> 2) I *think* your current intrinsic could be translated into something like
> this.  (Leaving aside the question of where the deopt state comes from.) 
In
> fact, the more I look at this, the less difference I actually see between
> the approaches.
>
>
>
> In a lot of languages, a divide produces some result even in the
exceptional
> case and this result requires effectively deoptimizing since the resut
won't
> be the one you would have predicted (double instead of int, or BigInt
> instead of small int), which sort of means that if the CPU exception occurs
> you have to be able to reconstruct all state.  A patchpoint could do this,
> and so could spilling all state to the stack before the divide - but both
> are very heavy hammers that are sure to be more expensive than just doing a
> branch.
>
> This isn't necessarily as expensive as you might believe.  I'd
recommend
> reading the Graal project papers on this topic.
>
> Whether deopt or branching is more profitable *in this case*, I can't
easily
> say.  I'm not yet to the point of being able to run that experiment. 
We can
> argue about what "should" be better all we want, but real
performance data
> is the only way to truly know.
>
> My point may have been confusing.  I know that deoptimization is cheap and
> WebKit uses it everywhere, including division corner cases, if profiling
> tells us that it's profitable to do so (which it does, in the common
case).
> WebKit is a heavy user of deoptimization in general, so you don't need
to
> convince me that it's worth it.
>
> Acknowledged.
>
> Note that I want *both* deopt *and* branching, because in this case, a
> branch is the fastest overall way of detecting when to deopt.  In the
> future, I will want to implement the deopt in terms of branching, and when
> we do this, I believe that the most sound and performat approach would be
> using Michael's intrinsics.  This is subtle and I'll try to explain
why it's
> the case.
>
> The point is that you wouldn't want to do deoptimization by spilling
state
> on the main path or by using a patchpoint for the main path of the
division.
>
> This is the main point I disagree with.  I don't believe that having a
> patchpoint on the main path should be any more expensive then the original
> call.  (see above)
>
> The reason why the patchpoint is expensive is that if you use a patchpoint
> to implement a division then the optimizer won't be allowed to assume
that
> it's a division, because the whole point of "patchpoint" is
to tell the
> optimizer to piss off.
>
>
>
> Worth noting explicitly: I'm assuming that all of your deopt state
would
> already be available for other purposes in nearby code.  It's on the
stack
> or in registers.  I'm assuming that by adding the deopt point, you are
not
> radically changing the set of computations which need to be done.  If
that's
> not the case, you should avoid deopt and instead just inline the slow paths
> with explicit checks.
>
> Yes, of course it is.  That's not the issue.
>
>
>
> I'll note that given your assumptions about the cost of a patchpoint,
the
> rest of your position makes a lot more sense.  :)  As I spelled out above,
I
> believe this cost is not fundamental.
>
> You don't want the common path of executing the division to involve a
> patchpoint instruction, although using a patchpoint or stackmap to
implement
> deoptimization on the failing path is great:
>
> Good: if (division would fail) { call @patchpoint(all of my state) } else {
> result = a / b }
>
> Given your cost assumptions, I'd agree.
>
> Not my cost assumptions.  The reason why this is better is that the
division
> is expressed in LLVM IR so that LLVM can do useful things to it - like
> eliminate it, for example.
>
>
> Bad: call @patchpoint(all of my state) // patch with a divide instruction -
> bad because the optimizer has no clue what you're doing and assumes the
very
> worst
>
> Yuck.  Agreed.
>
> To be clear, this is what you're proposing - except that you're
assuming
> that LLVM will know that you've patched a division because you're
expecting
> the call target to have semantic meaning.  Or, rather, you're expecting
that
> you can make the contents of the patchpoint be a division by having the
call
> target be a division intrinsic.  In the current implementation and as it is
> currently specified, the call target has no meaning and so you get the yuck
> that I'm showing.
>
>
> Worse: spill all state to the stack; call @trapping.div(a, b) // the spills
> will hurt you far more than a branch, so this should be avoided
>
> I'm assuming this is an explicit spill rather than simply recording a
stack
> map *at the div*.  If so, agreed.
>
> I suppose we could imagine a fourth option that involves a patchpoint to
> pick up the state and a trapping divide instrinsic.  But a trapping divide
> intrinsic alone is not enough.  Consider this:
>
> result = call @trapping.div(a, b); call @stackmap(all of my state)
>
> As soon as these are separate instructions, you have no guarantees that the
> state that the stackmap reports is sound for the point at which the div
> would trap.
>
> This is the closest to what I'd propose, except that the two calls
would be
> merged into a single patchpoint.  Isn't the entire point of a
patchpoint to
> record the stack map for a call?
>
> No.  It would be bad if that's what the documentation says.  That's
not at
> all how WebKit uses it or probably any IC client would use it.
>
> Patchpoints are designed to be inline assembly on steroids.  They're
there
> to allow the client JIT to tell LLVM to piss off.
>
> (Well, ignoring the actual patching part..)  Why not write this as:
> patchpoint(..., trapping.div, a, b);
>
> Is there something I'm missing here?
>
> Just to note: I fully agree that the two call proposal is unsound and
should
> be avoided.
>
> So, the division itself shouldn't be a trapping instruction and instead
you
> want to detect the bad case with a branch.
>
> To be clear:
>
> - Whether you use deoptimization for division or anything else - like
WebKit
> has done since before any of the Graal papers were written - is mostly
> unrelated to how you represent the division, unless you wanted to add a new
> intrinsic that is like a trapping-division-with-stackmap:
>
> result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)
>
> Now, maybe you do want such an intrinsic, in which case you should propose
> it!
>
> Given what we already have with patchpoints, I don't think a merged
> intrinsic is necessary.  (See above).  I believe we have all the parts to
> build this solution, and that - in theory - they should compose neatly.
>
> p.s. The bits I was referring to was not deopt per se.  It was particularly
> which set of deopt state you used for each deopt point.  That's a bit
of
> tangent for the rest of the discussion now though.
>
> The reason why I haven't proposed it is that I think that long-term,
the
> currently proposed intrinsics are a better path to getting the trapping
> optimizations.  See my previous mail, where I show how we could tell LLVM
> what the failing path is (which may have deoptimization code that uses a
> stackmap or whatever), what the trapping predicate is (it comes from the
> safe.div intrinsic), and the fact that trapping is wise (branch weights).
>
> - If you want to do the deoptimization with a trap, then your only choice
> currently is to use a patchpoint for the main path of the division.  This
> will be slower than using a branch to an OSR exit basic block, because
> you're making the division itself opaque to the optimizer (bad) just to
get
> rid of a branch (which was probably cheap to begin with).
>
> Hence, what you want to do - one way or another, regardless of whether this
> proposed intrinsic is added - is to branch on the corner case condition,
and
> have the slow case of the branch go to a basic block that deoptimizes.
> Unless of course you have profiling that says that the case does happen
> often, in which case you can have that basic block handle the corner case
> inline without leaving optimized code (FWIW, we do have such paths in
WebKit
> and they are useful).
>
> So the question for me is whether the branching involves explicit control
> flow or is hidden inside an intrinsic.  I prefer for it to be within an
> intrinsic because it:
>
> - allows the optimizer to do more interesting things in the common cases,
> like hoisting the entire division.
>
> - will give us a clearer path for implementing trapping optimizations in
the
> future.
>
> - is an immediate win on ARM.
>
> I'd be curious to hear what specific idea you have about how to
implement
> trap-based deoptimization with your trapping division intrinsic for x86 -
> maybe it's different from the two "bad" idioms I showed
above.
>
> I hope my explanation above helps.  If not, ask, and I'll try to
explain
> more clearly.
>
> I think I understand it.  I think that the only issue is that:
>
> - Patchpoints currently don't do what you want.
>
> - If you made patchpoints do what you want then you'd break WebKit -
not to
> mention anyone who wants to use them for inline caches.
>
> So it seems like you want a new intrinsic.  You should officially propose
> this new intrinsic, particularly since the core semantic differences are
not
> so great from what we have now.  OTOH, if you truly believe that
patchpoints
> should just be changed to your semantics in a way that does break WebKit,
> then that's probably also something you should get off your chest. ;-)
>
>
>
> One point just for clarity; I don't believe this effects the
conclusions of
> our discussion so far.  I'm also fairly sure that you (Filip) are aware
of
> this, but want to spell it out for other readers.
>
> You seem to be assuming that compiled code needs to explicitly branch to a
> point where deopt state is known to exit a compiled frame.
>
> This is a slightly unclear characterization of my assumptions.  Our JIT
does
> deoptimization without explicit branches for many, many things.  You should
> look at it some time, it's pretty fancy. ;-)
>
> Worth noting is that you can also exit a compiled frame on a trap (without
> an explicitly condition/branch!) if the deopt state is known at the point
> you take the trap.  This "exit frame on trap" behavior shows up
with null
> pointer exceptions as well.  I'll note that both compilers in OpenJDK
> support some combination of "exit-on-trap" conditions for
division and null
> dereferences.  The two differ on exactly what strategies they use in each
> case though.  :)
>
> Yeah, and I've also implemented VMs that do this - and I endorse the
basic
> idea.  I know what you want, and my only point is that the existing
> patchpoints only give you this if you're willing to make a huge
compromise:
> namely, that you're willing to make the division (or heap load for the
null
> case) completely opaque to the compiler to the point that GVN, LICM, SCCP,
> and all algebraic reasoning have to give up on optimizing it.  The point of
> using LLVM is that it can optimize code.  It can optimize branches and
> divisions pretty well.  So, eliminating an explicit branch by replacing it
> with a construct that appears opaque to the optimizer is not a smart
> trade-off.
>
> You could add a new intrinsic that, like patchpoints, records the layout of
> state in a side-table, but that is used as a kind of wrapper for operations
> that LLVM understands.  This may or may not be hairy - you seem to have
sort
> of acknowledged that it's got some complexity and I've also pointed
out some
> possible issues.  If this is something that you want, you should propose it
> so that others know what you're talking about.  One danger of how
we're
> discussing this right now is that you're overloading patchpoints to
mean the
> thing you want them to mean rather than what they actually mean, which
makes
> it seem like we don't need Michael's intrinsics on the grounds that
> patchpoints already offer a solution.  They don't already offer a
solution
> precisely because patchpoints don't do what your intrinsics would do.
>
>
>
> I'm not really arguing that either scheme is "better" in all
cases.  I'm
> simply arguing that we should support both and allow optimization and
tuning
> between them.  As far as I can tell, you seem to be assuming that an
> explicit branch to known exit point is always superior.
>
>
> Ok, back to the topic at hand...
>
> With regards to the current proposal, I'm going to take a step back. 
You
> guys seem to have already looked in this in a fair amount of depth. 
I'm not
> necessarily convinced you've come to the best solution, but at some
point,
> we need to make forward progress.  What you have is clearly better than
> nothing.
>
> Please go ahead and submit your current approach.  We can come back and
> revise later if we really need to.
>
> I do request the following changes:
> - Mark it clearly as experimental.
>
>
> - Either don't specify the value computed in the edge cases, or allow
those
> values to be specified as constant arguments to the call.  This allows
> efficient lowering to x86's div instruction if you want to make use of
the
> trapping semantics.
>
> Once again: how would you use this to get trapping semantics without
> throwing all of LLVM's optimizations out the window, in the absence of
the
> kind of patchpoint-like intrinsic that you want?  I ask just to make sure
> that we're on the same page.
>
>
>
> Finally, as for performance data, which part of this do you want
performance
> data for?  I concede that I don't have performance data for using
Michael's
> new intrinsic.  Part of what the intrinsic accomplishes is it gives a less
> ugly way of doing something that is already possible with target intrinsics
> on ARM.  I think it would be great if you could get those semantics - along
> with a known-good implementation - on other architectures as well.
>
> I would be very interested in seeing data comparing two schemes:
> - Explicit control flow emited by the frontend
> - The safe.div intrinsic emitted by the frontend, desugared in CodeGenPrep
>
> My strong suspicion is that each would preform well in some cases and not
in
> others.  At least on x86.  Since the edge-checks are essentially free on
> ARM, the second scheme would probably be strictly superior there.
>
> I am NOT asking that we block submission on this data however.
>
> But this discussion has also involved suggestions that we should use
> trapping to implement deoptimization, and the main point of my message is
to
> strongly argue against anything like this given the current state of
> trapping semantics and how patchpoints work.  My point is that using traps
> for division corner cases would either be unsound (see the stackmap after
> the trap, above), or would require you to do things that are obviously
> inefficient.  If you truly believe that the branch to detect division slow
> paths is more expensive than spilling all bytecode state to the stack or
> using a patchpoint for the division, then I could probably hack something
up
> in WebKit to show you the performance implications.  (Or you could do it
> yourself, the code is open source...)
>
> In a couple of months, I'll probably have the performance data to
discuss
> this for real.  When that happens, let's pick this up and continue the
> debate.  Alternatively, if you want to chat this over more with a beer in
> hand at the social next week, let me know.  In the meantime, let's not
stall
> the current proposal any more.
>
> Philip
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

Andrew Trick

2014-May-02 18:33 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

On Apr 30, 2014, at 10:34 PM, Philip Reames <listmail at philipreames.com>
wrote:
> Andy - If you're not already following this closely, please start. 
We've gotten into fairly fundamental questions of what a patchpoint does.
Sorry, I’ve been following but worried about causing more confusion in this
thread. The very short answer is that @safe.div does not help you implement
exceptions using x86 trapping divide instruction. In your case, either way,
you’ll have a branch to deoptimization in the canonical IR with no control flow
merge anyway. You’ll just be looking for a different pattern during codegen.

I can understand why you would want to extend the call to @safe.div with a
stackmap to be consistent with the approach for safepointing normal calls.
However, it’s not easy to do this in LLVM IR, and we need to think of every
possibility that keeps the IR minimal. Given that trapping loads/divides are a
codegen optimization (you can always fall back to branches), I think they can be
generated opportunistically during lowering without worrying about all the edge
cases. I will respond to you regarding extending patchpoint or introducing a new
safepoint/statemap intrinsic in a separate thread. I have a lot of ideas but
don’t want to make any proposals prematurely. I’ll get back to you after
consulting with Juergen.

-Andy
> Filip, 
> 
> I think you've hit the nail on the head.  What I'm thinking of as
being patchpoints are not what you think they are.  Part of that is that
I've got a local change which adds a very similar construction (called
"statepoints" for the moment), but I was trying to keep that separate.
That also includes a lot of GC semantics which are not under discussion
currently.  My apologies if that experience bled over into this conversation and
made things more confusing.
> 
> I will note that the documentation for patchpoint say explicitly the
following:
> "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function
call to the specified <target> and records the location of specified
values in the stack map."
> 
> My reading has always been that a patchpoint *that isn't patched* is
simply a call with a stackmap associated with it.  To my reading, this can (and
did, and does) indicate my proposed usage would be legal.
> 
> I will agree that I've confused the topic badly on the optimization
front.  My "statepoint" isn't patchable, so a lot more
optimizations are legal.  Sorry about that.  To restate what I think you've
been saying all along, the optimizer can't make assumptions about what
function is called by a patchpoint because that might change based on later
patching.  Is this the key point you've been trying to make?
> 
> I'm not objecting to separating "my patchpoint" from
"your patchpoint".  Let's just hammer out the semantics of each
first.  :)
> 
> Again, longer response to follow in a day or so. :)
> 
> Philip
> 
> On 04/30/2014 10:09 PM, Filip Pizlo wrote:
>> 
>> 
>> On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at
philipreames.com) wrote:
>> 
>>> On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>>> On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at
philipreames.com) wrote:On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>>> LD;DR: Your desire to use trapping on x86 only further convinces me
that Michael's proposed intrinsics are the best way to go.I'm still not
convinced, but am not going to actively oppose it either.  I'm leery of
designing a solution with major assumptions we don't have data to backup.
>>> 
>>> I worry your assumptions about deoptimization are potentially
unsound.  But I don't have data to actually show this (yet).
>>> I *think* I may have been unclear about my assumptions; in
particular, my claims with respect to deoptimization are probably more subtle
than they appeared.  WebKit can use LLVM and it has divisions and we do all
possible deoptimization/profiling/etc tricks for it, so this is grounded in
experience.  Forgive me if the rest of this e-mail contains a lecture on things
that are obvious - I'll try to err on the side of clarity and completeness
since this discussion is sufficiently dense that we run the risk of talking
cross-purposes unless some baseline assumptions are established.
>>> 
>>> I think we're using the same terminology, but with slightly
different sets of assumptions.  I'll point this out below where relevant.
>>> 
>>> Also, thanks for taking the time to expand.  It help clarify the
discussion quite a bit.
>> 
>> I think we may be converging to an understanding of what you want
versus what I want, and I think that there are some points - possibly unrelated
to division - that are worth clarifying.  I think that the main difference is
that when I say "patchpoint", I am referring to a concrete intrinsic
with specific semantics that cannot change without breaking WebKit, while you
are using the term to refer to a broad concept, or rather, a class of
as-yet-unimplemented intrinsics that share some of the same features with
patchpoints but otherwise have incompatible semantics.
>> 
>> Also, when I say that you wouldn't want to use the existing
patchpoint to do your trapping deopt, what I mean is that the performance of
doing this would suck for reasons that are not related to deoptimization or
trapping.  I'm not claiming that deoptimization performs poorly (trust me, I
know better) or that trapping to deoptimize is bad (I've done this many,
many times and I know better).  I'm saying that with the current patchpoint
intrinsics in LLVM, as they are currently specified and implemented, doing it
would be a bad idea because you'd have to compromise a bunch of other
optimizations to achieve it.
>> 
>> You have essentially described new intrinsics that would make this less
of a bad idea and I am interested in your intrinsics, so I'll try to both
respond with why patchpoints don't currently give you what you want (and why
simply changing patchpoint semantics would be evil) and I'll also try to
comment on what I think of the intrinsic that you're effectively proposing. 
Long story short, I think you should formally propose your intrinsic even if
it's not completely fleshed out.  I think that it's an interesting
capability and in its most basic form, it is a simple variation of the current
patchpoint/stackmap intrinsics.
>> 
>>> 
>>> 
>>> 
>>> 
>>> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at
philipreames.com) wrote:As the discussion has progressed and I've spent more
time thinking about the topic, I find myself less and less enthused about the
current proposal.  I'm in full support of having idiomatic ways to express
safe division, but I'm starting to doubt that using an intrinsic is the
right way at the moment.
>>> 
>>> One case I find myself thinking about is how one would combine
profiling information and implicit div-by-zero/overflow checks with this
proposal.  I don't really see a clean way.  Ideally, for a "safe
div" which never has the exceptional paths taken, you'd like to
completely do away with the control flow entirely.  (And rely on hardware traps
w/exceptions instead.)  I don't really see a way to represent that type of
construct given the current proposal.
>>> This is a deeper problem and to solve it you'd need a solution
to trapping in general.  Let's consider the case of Java.  A Java program
may want to catch the arithmetic exception due to divide by zero.  How would you
do this with a trap in LLVM IR?  Spill all state that is live at the catch?  Use
a patchpoint for the entire division instruction?
>>> 
>>> We'd likely use something similar to a patchpoint.  You'd
need the "abstract vm state" (which is not the compiled frame
necessarily) available at the div instruction.  You could then re-enter the
interpreter at the specified index (part of the vm state).  We have all most of
these mechanisms in place.  Ideally, you'd trigger a recompile and otherwise
ensure re-entry into compiled code at the soonest possible moment.
>>> 
>>> This requires a lot of runtime support, but we already have most of
it implemented for another compiler.  From our perspective, the runtime
requirements are not a major blocker.
>>> Right, you'll use a patchpoint.  That's way more expensive
than using a safe division intrinsic with branches, because it's opaque to
the optimizer.
>>> 
>>> This statement is true at the moment, but it shouldn't be.  I
think this is our fundamental difference in approach.
>>> 
>>> You should be able to write something like:
>>> i32 %res = invoke patchpoint (... x86_trapping_divide, a, b)
normal_dest invoke_dest
>>> 
>>> normal_dest:
>>>   ;; use %res
>>> invoke_dest:
>>>   landingpad
>>>   ;; dispatch edge cases
>>>   ;; this could be unreachable code if you deopt this frame in the
trap handler and jump directly to an interpreter or other bit of code
>> 
>> I see.  It sounds like you want a generalization of the
"div.with.stackmap" that I thought you wanted - you want to be able to
wrap anything in a stackmap.
>> 
>> The current patchpoint intrinsic does not do this, and you run the risk
of breaking existing semantics if you changed this.  You'd probably break
WebKit, which treats the call target of the patchpoint as nothing more than a
quirk - we always pass null.  Also, the current patchpoint treats the callee as
an i8* if I remember right and it would be super weird if all LLVM phases had to
interpret this i8* by unwrapping a possible bitcast to get to a declared
function that may be an intrinsic.  Yuck!  Basically, the call target of
existing patchpoints is meant to be a kind of convenience feature rather than
the core of the mechanism.
>> 
>> I agree in principle that the intrinsic that you want would be a useful
intrinsic.  But let's not call it a patchpoint for the purposes of this
discussion, and let's not confuse the discussion by claiming (incorrectly)
that the existing patchpoint facility gives you what you want.  It doesn't:
patchpoints are designed to make the call target opaque (hence the use of i8*)
and there shouldn't be a correlation between what the patchpoint does at
run-time and what the called function would have done.  You could make the call
target be null (like WebKit does) and the patchpoint should still mean
"this code can do anything" because the expectation is that the client
JIT will patch over it anyway.
>> 
>> Also, "patchpoint" would probably not be the right term for
the intrinsic that you want.  I think that what you want is
"call.with.stackmap".  Or maybe "stackmap.wrapper".  Or just
"stackmap" - I'd be OK, in principle, with changing the name of
the current "stackmap" intrinsic to something that reflects the fact
that it's less of a stackmap than what you want.
>> 
>> To summarize.  A patchpoint's main purpose is that you can patch it
with arbitrary code.  The current "stackmap" means that you can patch
it with arbitrary code and that patching may be destructive to a shadow of
machine code bytes, so it's really just like patchpoints - we could change
its name to "patchpoint.shadow" for example.
>> 
>> If you were to propose such a stackmap intrinsic, then I think there
could be some ways of doing it that wouldn't be too terrible.  Basically you
want something that is like a patchpoint in that it reports a stackmap via a
side channel, but unlike patchpoints, it doesn't allow arbitrary patching -
instead the optimizer should be allowed to assume that the code within the
patchpoint will always do the same thing that the call target would have done. 
There are downsides to truly doing this.  For example, to make division
efficient with such an intrinsic, you'd have to do something that is
somewhat worse than just recognizing intrinsics in the optimizer - you'd
have to first recognize a call to your "stackmap wrapper" intrinsic
and then observe that its call target argument is in turn another intrinsic.  To
me personally, that's kind of yucky, but I won't deny that it could be
useful.
>> 
>> As to the use of invoke: I don't believe that the use of invoke
versus my suggested "branch on a trap predicate" idea are different in
any truly meaningful way.  I buy that either would work.
>> 
>>> 
>>> 
>>> A patchpoint should not require any excess spilling.  If values are
live in registers, that should be reflected in the stack map.  (I do not know if
this is the case for patchpoint at the moment or not.)
>> 
>> Patchpoints do not require spilling.
>> 
>> My point was that with existing patchpoints, you *either* use a
patchpoint for the entire division which makes the division opaque to the
optimizer - because a patchpoint means "this code can do anything" -
*or* you could spill stuff to the stack prior to your trapping division
intrinsic, since spilling is something that you could do as a workaround if you
didn't have a patchpoint.
>> 
>> The reason why I brought up spilling at all is that I suspect that
spilling all state to the stack might be cheaper - for some systems - than
turning the division into a patchpoint.  Turning the division into a patchpoint
is horrendously brutal - the patchpoint looks like it clobbers the heap (which a
division doesn't do), has to execute (a division is an obvious DCE
candidate), cannot be hoisted (hoisting divisions is awesome), etc.  Perhaps
most importantly, though, a patchpoint doesn't tell LLVM that you're
*doing a division* - so all constant folding, strenght reduction, and algebraic
reasoning flies out the window.  On the other hand, spilling all state to the
stack is an arguably sound and performant solution to a lot of VM problems. 
I've seen JVM implementations that ensure that there is always a copy of
state on the stack at some critical points, just because it makes loads of stuff
simpler (debugging, profiling, GC, and of course deopt).  I personally prefer to
stay away from such a strategy because it's not free.
>> 
>> On the other hand, branches can be cheap.  A branch on a divide is
cheaper than not being able to optimize the divide.
>> 
>>> 
>>> 
>>> The Value called by a patchpoint should participate in optimization
normally.
>> 
>> I agree that you could have a different intrinsic that behaves like
this.
>> 
>>> We really want the patchpoint part of the call to be supplemental. 
It should still be a call.  It should be constant propagated, transformed, etc..
This is not the case currently.  I've got a couple of off the wall ideas for
improving the current status, but I'll agree this is a hardish problem.
>>> 
>>> It should be legal to use a patchpoint in an invoke.  It's an
ABI issue of how the invoke path gets invoked.  (i.e. side tables for the
runtime to lookup, etc..)  This is not possible today, and probably requires a
fair amount of work.  Some of it, I've already done and will be sharing
shortly.  Other parts, I haven't even thought about.
>> 
>> Right, it's significantly more complex than either the existing
patchpoints or Michael's proposed safe.div.
>> 
>>> 
>>> 
>>> If you didn't want to use the trapping semantics, you'd
insert dedicated control flow _before_ the divide.  This would allow normal
optimization of the control flow.
>>> 
>>> Notes:
>>> 1) This might require a new PATCHPOINT pseudo op in the backend. 
Haven't thought much about that yet.
>>> 2) I *think* your current intrinsic could be translated into
something like this.  (Leaving aside the question of where the deopt state comes
from.)  In fact, the more I look at this, the less difference I actually see
between the approaches.
>>> 
>>> 
>>> 
>>> In a lot of languages, a divide produces some result even in the
exceptional case and this result requires effectively deoptimizing since the
resut won't be the one you would have predicted (double instead of int, or
BigInt instead of small int), which sort of means that if the CPU exception
occurs you have to be able to reconstruct all state.  A patchpoint could do
this, and so could spilling all state to the stack before the divide - but both
are very heavy hammers that are sure to be more expensive than just doing a
branch.This isn't necessarily as expensive as you might believe.  I'd
recommend reading the Graal project papers on this topic.
>>> 
>>> Whether deopt or branching is more profitable *in this case*, I
can't easily say.  I'm not yet to the point of being able to run that
experiment.  We can argue about what "should" be better all we want,
but real performance data is the only way to truly know.
>>> My point may have been confusing.  I know that deoptimization is
cheap and WebKit uses it everywhere, including division corner cases, if
profiling tells us that it's profitable to do so (which it does, in the
common case).  WebKit is a heavy user of deoptimization in general, so you
don't need to convince me that it's worth it.
>>> 
>>> Acknowledged. 
>>>> Note that I want *both* deopt *and* branching, because in this
case, a branch is the fastest overall way of detecting when to deopt.  In the
future, I will want to implement the deopt in terms of branching, and when we do
this, I believe that the most sound and performat approach would be using
Michael's intrinsics.  This is subtle and I'll try to explain why
it's the case.
>>>> 
>>>> The point is that you wouldn't want to do deoptimization by
spilling state on the main path or by using a patchpoint for the main path of
the division.
>>>> 
>>> This is the main point I disagree with.  I don't believe that
having a patchpoint on the main path should be any more expensive then the
original call.  (see above)
>> 
>> The reason why the patchpoint is expensive is that if you use a
patchpoint to implement a division then the optimizer won't be allowed to
assume that it's a division, because the whole point of
"patchpoint" is to tell the optimizer to piss off.
>> 
>>> 
>>> 
>>> Worth noting explicitly: I'm assuming that all of your deopt
state would already be available for other purposes in nearby code.  It's on
the stack or in registers.  I'm assuming that by adding the deopt point, you
are not radically changing the set of computations which need to be done.  If
that's not the case, you should avoid deopt and instead just inline the slow
paths with explicit checks.
>> 
>> Yes, of course it is.  That's not the issue.
>> 
>>> 
>>> 
>>> I'll note that given your assumptions about the cost of a
patchpoint, the rest of your position makes a lot more sense.  :)  As I spelled
out above, I believe this cost is not fundamental.
>>> You don't want the common path of executing the division to
involve a patchpoint instruction, although using a patchpoint or stackmap to
implement deoptimization on the failing path is great:
>>> 
>>> Good: if (division would fail) { call @patchpoint(all of my state)
} else { result = a / b }Given your cost assumptions, I'd agree.
>> 
>> Not my cost assumptions.  The reason why this is better is that the
division is expressed in LLVM IR so that LLVM can do useful things to it - like
eliminate it, for example.
>> 
>>> 
>>> Bad: call @patchpoint(all of my state) // patch with a divide
instruction - bad because the optimizer has no clue what you're doing and
assumes the very worstYuck.  Agreed.
>> 
>> To be clear, this is what you're proposing - except that you're
assuming that LLVM will know that you've patched a division because
you're expecting the call target to have semantic meaning.  Or, rather,
you're expecting that you can make the contents of the patchpoint be a
division by having the call target be a division intrinsic.  In the current
implementation and as it is currently specified, the call target has no meaning
and so you get the yuck that I'm showing.
>> 
>>> 
>>> Worse: spill all state to the stack; call @trapping.div(a, b) //
the spills will hurt you far more than a branch, so this should be
avoidedI'm assuming this is an explicit spill rather than simply recording a
stack map *at the div*.  If so, agreed.
>>>> I suppose we could imagine a fourth option that involves a
patchpoint to pick up the state and a trapping divide instrinsic.  But a
trapping divide intrinsic alone is not enough.  Consider this:
>>>> 
>>>> result = call @trapping.div(a, b); call @stackmap(all of my
state)
>>>> 
>>>> As soon as these are separate instructions, you have no
guarantees that the state that the stackmap reports is sound for the point at
which the div would trap.
>>>> 
>>> This is the closest to what I'd propose, except that the two
calls would be merged into a single patchpoint.  Isn't the entire point of a
patchpoint to record the stack map for a call?
>> 
>> No.  It would be bad if that's what the documentation says. 
That's not at all how WebKit uses it or probably any IC client would use it.
>> 
>> Patchpoints are designed to be inline assembly on steroids. 
They're there to allow the client JIT to tell LLVM to piss off.
>> 
>>> (Well, ignoring the actual patching part..)  Why not write this as:
>>> patchpoint(..., trapping.div, a, b);
>>> 
>>> Is there something I'm missing here?
>>> 
>>> Just to note: I fully agree that the two call proposal is unsound
and should be avoided.
>>> 
>>>> So, the division itself shouldn't be a trapping instruction
and instead you want to detect the bad case with a branch.
>>>> 
>>>> To be clear:
>>>> 
>>>> - Whether you use deoptimization for division or anything else
- like WebKit has done since before any of the Graal papers were written - is
mostly unrelated to how you represent the division, unless you wanted to add a
new intrinsic that is like a trapping-division-with-stackmap:
>>>> 
>>>> result = call @trapping.div.with.stackmap(a, b, ... all of my
state ...)
>>>> 
>>>> Now, maybe you do want such an intrinsic, in which case you
should propose it!
>>>> 
>>> Given what we already have with patchpoints, I don't think a
merged intrinsic is necessary.  (See above).  I believe we have all the parts to
build this solution, and that - in theory - they should compose neatly.
>>> 
>>> p.s. The bits I was referring to was not deopt per se.  It was
particularly which set of deopt state you used for each deopt point.  That's
a bit of tangent for the rest of the discussion now though.
>>>> The reason why I haven't proposed it is that I think that
long-term, the currently proposed intrinsics are a better path to getting the
trapping optimizations.  See my previous mail, where I show how we could tell
LLVM what the failing path is (which may have deoptimization code that uses a
stackmap or whatever), what the trapping predicate is (it comes from the
safe.div intrinsic), and the fact that trapping is wise (branch weights).
>>>> 
>>>> - If you want to do the deoptimization with a trap, then your
only choice currently is to use a patchpoint for the main path of the division. 
This will be slower than using a branch to an OSR exit basic block, because
you're making the division itself opaque to the optimizer (bad) just to get
rid of a branch (which was probably cheap to begin with).
>>>> 
>>>> Hence, what you want to do - one way or another, regardless of
whether this proposed intrinsic is added - is to branch on the corner case
condition, and have the slow case of the branch go to a basic block that
deoptimizes.  Unless of course you have profiling that says that the case does
happen often, in which case you can have that basic block handle the corner case
inline without leaving optimized code (FWIW, we do have such paths in WebKit and
they are useful).
>>>> 
>>>> So the question for me is whether the branching involves
explicit control flow or is hidden inside an intrinsic.  I prefer for it to be
within an intrinsic because it:
>>>> 
>>>> - allows the optimizer to do more interesting things in the
common cases, like hoisting the entire division.
>>>> 
>>>> - will give us a clearer path for implementing trapping
optimizations in the future.
>>>> 
>>>> - is an immediate win on ARM.
>>>> 
>>>> I'd be curious to hear what specific idea you have about
how to implement trap-based deoptimization with your trapping division intrinsic
for x86 - maybe it's different from the two "bad" idioms I showed
above.
>>>> 
>>> I hope my explanation above helps.  If not, ask, and I'll try
to explain more clearly.
>> 
>> I think I understand it.  I think that the only issue is that:
>> 
>> - Patchpoints currently don't do what you want.
>> 
>> - If you made patchpoints do what you want then you'd break WebKit
- not to mention anyone who wants to use them for inline caches.
>> 
>> So it seems like you want a new intrinsic.  You should officially
propose this new intrinsic, particularly since the core semantic differences are
not so great from what we have now.  OTOH, if you truly believe that patchpoints
should just be changed to your semantics in a way that does break WebKit, then
that's probably also something you should get off your chest. ;-)
>> 
>>> 
>>> 
>>> One point just for clarity; I don't believe this effects the
conclusions of our discussion so far.  I'm also fairly sure that you (Filip)
are aware of this, but want to spell it out for other readers.
>>> 
>>> You seem to be assuming that compiled code needs to explicitly
branch to a point where deopt state is known to exit a compiled frame.
>> 
>> This is a slightly unclear characterization of my assumptions.  Our JIT
does deoptimization without explicit branches for many, many things.  You should
look at it some time, it's pretty fancy. ;-)
>> 
>>> Worth noting is that you can also exit a compiled frame on a trap
(without an explicitly condition/branch!) if the deopt state is known at the
point you take the trap.  This "exit frame on trap" behavior shows up
with null pointer exceptions as well.  I'll note that both compilers in
OpenJDK support some combination of "exit-on-trap" conditions for
division and null dereferences.  The two differ on exactly what strategies they
use in each case though.  :)
>> 
>> Yeah, and I've also implemented VMs that do this - and I endorse
the basic idea.  I know what you want, and my only point is that the existing
patchpoints only give you this if you're willing to make a huge compromise:
namely, that you're willing to make the division (or heap load for the null
case) completely opaque to the compiler to the point that GVN, LICM, SCCP, and
all algebraic reasoning have to give up on optimizing it.  The point of using
LLVM is that it can optimize code.  It can optimize branches and divisions
pretty well.  So, eliminating an explicit branch by replacing it with a
construct that appears opaque to the optimizer is not a smart trade-off.
>> 
>> You could add a new intrinsic that, like patchpoints, records the
layout of state in a side-table, but that is used as a kind of wrapper for
operations that LLVM understands.  This may or may not be hairy - you seem to
have sort of acknowledged that it's got some complexity and I've also
pointed out some possible issues.  If this is something that you want, you
should propose it so that others know what you're talking about.  One danger
of how we're discussing this right now is that you're overloading
patchpoints to mean the thing you want them to mean rather than what they
actually mean, which makes it seem like we don't need Michael's
intrinsics on the grounds that patchpoints already offer a solution.  They
don't already offer a solution precisely because patchpoints don't do
what your intrinsics would do.
>> 
>>> 
>>> 
>>> I'm not really arguing that either scheme is "better"
in all cases.  I'm simply arguing that we should support both and allow
optimization and tuning between them.  As far as I can tell, you seem to be
assuming that an explicit branch to known exit point is always superior.
>>> 
>>> 
>>> Ok, back to the topic at hand...
>>> 
>>> With regards to the current proposal, I'm going to take a step
back.  You guys seem to have already looked in this in a fair amount of depth. 
I'm not necessarily convinced you've come to the best solution, but at
some point, we need to make forward progress.  What you have is clearly better
than nothing.
>>> 
>>> Please go ahead and submit your current approach.  We can come back
and revise later if we really need to.
>>> 
>>> I do request the following changes:
>>> - Mark it clearly as experimental.
>> 
>>> 
>>> - Either don't specify the value computed in the edge cases, or
allow those values to be specified as constant arguments to the call.  This
allows efficient lowering to x86's div instruction if you want to make use
of the trapping semantics.
>> 
>> Once again: how would you use this to get trapping semantics without
throwing all of LLVM's optimizations out the window, in the absence of the
kind of patchpoint-like intrinsic that you want?  I ask just to make sure that
we're on the same page.
>> 
>>> 
>>> 
>>>> Finally, as for performance data, which part of this do you
want performance data for?  I concede that I don't have performance data for
using Michael's new intrinsic.  Part of what the intrinsic accomplishes is
it gives a less ugly way of doing something that is already possible with target
intrinsics on ARM.  I think it would be great if you could get those semantics -
along with a known-good implementation - on other architectures as well.
>>>> 
>>> I would be very interested in seeing data comparing two schemes:
>>> - Explicit control flow emited by the frontend
>>> - The safe.div intrinsic emitted by the frontend, desugared in
CodeGenPrep
>>> 
>>> My strong suspicion is that each would preform well in some cases
and not in others.  At least on x86.  Since the edge-checks are essentially free
on ARM, the second scheme would probably be strictly superior there.
>>> 
>>> I am NOT asking that we block submission on this data however. 
>>> 
>>>> But this discussion has also involved suggestions that we
should use trapping to implement deoptimization, and the main point of my
message is to strongly argue against anything like this given the current state
of trapping semantics and how patchpoints work.  My point is that using traps
for division corner cases would either be unsound (see the stackmap after the
trap, above), or would require you to do things that are obviously inefficient. 
If you truly believe that the branch to detect division slow paths is more
expensive than spilling all bytecode state to the stack or using a patchpoint
for the division, then I could probably hack something up in WebKit to show you
the performance implications.  (Or you could do it yourself, the code is open
source...)
>>>> 
>>> In a couple of months, I'll probably have the performance data
to discuss this for real.  When that happens, let's pick this up and
continue the debate.  Alternatively, if you want to chat this over more with a
beer in hand at the social next week, let me know.  In the meantime, let's
not stall the current proposal any more.
>>> 
>>> Philip
>>> 
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140502/8a160d99/attachment.html>

Eric Christopher

2014-May-02 18:53 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

On Wed, Apr 30, 2014 at 10:34 PM, Philip Reames
<listmail at philipreames.com> wrote:> Andy - If you're not already following this closely, please start. 
We've
> gotten into fairly fundamental questions of what a patchpoint does.
>
> Filip,
>
> I think you've hit the nail on the head.  What I'm thinking of as
being
> patchpoints are not what you think they are.  Part of that is that I've
got
> a local change which adds a very similar construction (called
"statepoints"
> for the moment), but I was trying to keep that separate.  That also
includes
> a lot of GC semantics which are not under discussion currently.  My
> apologies if that experience bled over into this conversation and made
> things more confusing.
>
> I will note that the documentation for patchpoint say explicitly the
> following:
> "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function
call to
> the specified <target> and records the location of specified values
in the
> stack map."
>
> My reading has always been that a patchpoint *that isn't patched* is
simply
> a call with a stackmap associated with it.  To my reading, this can (and
> did, and does) indicate my proposed usage would be legal.
>
I like the idea that the target can be assumed to be called. It makes
optimization of the call possible, etc. I think it's definitely worth
exploring before we lock down the patchpoint intrinsic.

-eric
> I will agree that I've confused the topic badly on the optimization
front.
> My "statepoint" isn't patchable, so a lot more optimizations
are legal.
> Sorry about that.  To restate what I think you've been saying all
along, the
> optimizer can't make assumptions about what function is called by a
> patchpoint because that might change based on later patching.  Is this the
> key point you've been trying to make?
>
> I'm not objecting to separating "my patchpoint" from
"your patchpoint".
> Let's just hammer out the semantics of each first.  :)
>
> Again, longer response to follow in a day or so. :)
>
> Philip
>
>
> On 04/30/2014 10:09 PM, Filip Pizlo wrote:
>
>
>
> On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at
philipreames.com)
> wrote:
>
> On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>
> On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at
philipreames.com)
> wrote:
>
> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>
> LD;DR: Your desire to use trapping on x86 only further convinces me that
> Michael's proposed intrinsics are the best way to go.
>
> I'm still not convinced, but am not going to actively oppose it either.
I'm
> leery of designing a solution with major assumptions we don't have data
to
> backup.
>
> I worry your assumptions about deoptimization are potentially unsound.  But
> I don't have data to actually show this (yet).
>
> I *think* I may have been unclear about my assumptions; in particular, my
> claims with respect to deoptimization are probably more subtle than they
> appeared.  WebKit can use LLVM and it has divisions and we do all possible
> deoptimization/profiling/etc tricks for it, so this is grounded in
> experience.  Forgive me if the rest of this e-mail contains a lecture on
> things that are obvious - I'll try to err on the side of clarity and
> completeness since this discussion is sufficiently dense that we run the
> risk of talking cross-purposes unless some baseline assumptions are
> established.
>
> I think we're using the same terminology, but with slightly different
sets
> of assumptions.  I'll point this out below where relevant.
>
> Also, thanks for taking the time to expand.  It help clarify the discussion
> quite a bit.
>
> I think we may be converging to an understanding of what you want versus
> what I want, and I think that there are some points - possibly unrelated to
> division - that are worth clarifying.  I think that the main difference is
> that when I say "patchpoint", I am referring to a concrete
intrinsic with
> specific semantics that cannot change without breaking WebKit, while you
are
> using the term to refer to a broad concept, or rather, a class of
> as-yet-unimplemented intrinsics that share some of the same features with
> patchpoints but otherwise have incompatible semantics.
>
> Also, when I say that you wouldn't want to use the existing patchpoint
to do
> your trapping deopt, what I mean is that the performance of doing this
would
> suck for reasons that are not related to deoptimization or trapping. 
I'm
> not claiming that deoptimization performs poorly (trust me, I know better)
> or that trapping to deoptimize is bad (I've done this many, many times
and I
> know better).  I'm saying that with the current patchpoint intrinsics
in
> LLVM, as they are currently specified and implemented, doing it would be a
> bad idea because you'd have to compromise a bunch of other
optimizations to
> achieve it.
>
> You have essentially described new intrinsics that would make this less of
a
> bad idea and I am interested in your intrinsics, so I'll try to both
respond
> with why patchpoints don't currently give you what you want (and why
simply
> changing patchpoint semantics would be evil) and I'll also try to
comment on
> what I think of the intrinsic that you're effectively proposing.  Long
story
> short, I think you should formally propose your intrinsic even if it's
not
> completely fleshed out.  I think that it's an interesting capability
and in
> its most basic form, it is a simple variation of the current
> patchpoint/stackmap intrinsics.
>
>
>
>
>
> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at
philipreames.com)
> wrote:
>
> As the discussion has progressed and I've spent more time thinking
about the
> topic, I find myself less and less enthused about the current proposal. 
I'm
> in full support of having idiomatic ways to express safe division, but
I'm
> starting to doubt that using an intrinsic is the right way at the moment.
>
> One case I find myself thinking about is how one would combine profiling
> information and implicit div-by-zero/overflow checks with this proposal.  I
> don't really see a clean way.  Ideally, for a "safe div"
which never has the
> exceptional paths taken, you'd like to completely do away with the
control
> flow entirely.  (And rely on hardware traps w/exceptions instead.)  I
don't
> really see a way to represent that type of construct given the current
> proposal.
>
> This is a deeper problem and to solve it you'd need a solution to
trapping
> in general.  Let's consider the case of Java.  A Java program may want
to
> catch the arithmetic exception due to divide by zero.  How would you do
this
> with a trap in LLVM IR?  Spill all state that is live at the catch?  Use a
> patchpoint for the entire division instruction?
>
> We'd likely use something similar to a patchpoint.  You'd need the
"abstract
> vm state" (which is not the compiled frame necessarily) available at
the div
> instruction.  You could then re-enter the interpreter at the specified
index
> (part of the vm state).  We have all most of these mechanisms in place.
> Ideally, you'd trigger a recompile and otherwise ensure re-entry into
> compiled code at the soonest possible moment.
>
> This requires a lot of runtime support, but we already have most of it
> implemented for another compiler.  From our perspective, the runtime
> requirements are not a major blocker.
>
> Right, you'll use a patchpoint.  That's way more expensive than
using a safe
> division intrinsic with branches, because it's opaque to the optimizer.
>
> This statement is true at the moment, but it shouldn't be.  I think
this is
> our fundamental difference in approach.
>
> You should be able to write something like:
> i32 %res = invoke patchpoint (... x86_trapping_divide, a, b) normal_dest
> invoke_dest
>
> normal_dest:
>   ;; use %res
> invoke_dest:
>   landingpad
>   ;; dispatch edge cases
>   ;; this could be unreachable code if you deopt this frame in the trap
> handler and jump directly to an interpreter or other bit of code
>
> I see.  It sounds like you want a generalization of the
"div.with.stackmap"
> that I thought you wanted - you want to be able to wrap anything in a
> stackmap.
>
> The current patchpoint intrinsic does not do this, and you run the risk of
> breaking existing semantics if you changed this.  You'd probably break
> WebKit, which treats the call target of the patchpoint as nothing more than
> a quirk - we always pass null.  Also, the current patchpoint treats the
> callee as an i8* if I remember right and it would be super weird if all
LLVM
> phases had to interpret this i8* by unwrapping a possible bitcast to get to
> a declared function that may be an intrinsic.  Yuck!  Basically, the call
> target of existing patchpoints is meant to be a kind of convenience feature
> rather than the core of the mechanism.
>
> I agree in principle that the intrinsic that you want would be a useful
> intrinsic.  But let's not call it a patchpoint for the purposes of this
> discussion, and let's not confuse the discussion by claiming
(incorrectly)
> that the existing patchpoint facility gives you what you want.  It
doesn't:
> patchpoints are designed to make the call target opaque (hence the use of
> i8*) and there shouldn't be a correlation between what the patchpoint
does
> at run-time and what the called function would have done.  You could make
> the call target be null (like WebKit does) and the patchpoint should still
> mean "this code can do anything" because the expectation is that
the client
> JIT will patch over it anyway.
>
> Also, "patchpoint" would probably not be the right term for the
intrinsic
> that you want.  I think that what you want is
"call.with.stackmap".  Or
> maybe "stackmap.wrapper".  Or just "stackmap" - I'd
be OK, in principle,
> with changing the name of the current "stackmap" intrinsic to
something that
> reflects the fact that it's less of a stackmap than what you want.
>
> To summarize.  A patchpoint's main purpose is that you can patch it
with
> arbitrary code.  The current "stackmap" means that you can patch
it with
> arbitrary code and that patching may be destructive to a shadow of machine
> code bytes, so it's really just like patchpoints - we could change its
name
> to "patchpoint.shadow" for example.
>
> If you were to propose such a stackmap intrinsic, then I think there could
> be some ways of doing it that wouldn't be too terrible.  Basically you
want
> something that is like a patchpoint in that it reports a stackmap via a
side
> channel, but unlike patchpoints, it doesn't allow arbitrary patching -
> instead the optimizer should be allowed to assume that the code within the
> patchpoint will always do the same thing that the call target would have
> done.  There are downsides to truly doing this.  For example, to make
> division efficient with such an intrinsic, you'd have to do something
that
> is somewhat worse than just recognizing intrinsics in the optimizer -
you'd
> have to first recognize a call to your "stackmap wrapper"
intrinsic and then
> observe that its call target argument is in turn another intrinsic.  To me
> personally, that's kind of yucky, but I won't deny that it could be
useful.
>
> As to the use of invoke: I don't believe that the use of invoke versus
my
> suggested "branch on a trap predicate" idea are different in any
truly
> meaningful way.  I buy that either would work.
>
>
>
> A patchpoint should not require any excess spilling.  If values are live in
> registers, that should be reflected in the stack map.  (I do not know if
> this is the case for patchpoint at the moment or not.)
>
> Patchpoints do not require spilling.
>
> My point was that with existing patchpoints, you *either* use a patchpoint
> for the entire division which makes the division opaque to the optimizer -
> because a patchpoint means "this code can do anything" - *or* you
could
> spill stuff to the stack prior to your trapping division intrinsic, since
> spilling is something that you could do as a workaround if you didn't
have a
> patchpoint.
>
> The reason why I brought up spilling at all is that I suspect that spilling
> all state to the stack might be cheaper - for some systems - than turning
> the division into a patchpoint.  Turning the division into a patchpoint is
> horrendously brutal - the patchpoint looks like it clobbers the heap (which
> a division doesn't do), has to execute (a division is an obvious DCE
> candidate), cannot be hoisted (hoisting divisions is awesome), etc. 
Perhaps
> most importantly, though, a patchpoint doesn't tell LLVM that
you're *doing
> a division* - so all constant folding, strenght reduction, and algebraic
> reasoning flies out the window.  On the other hand, spilling all state to
> the stack is an arguably sound and performant solution to a lot of VM
> problems.  I've seen JVM implementations that ensure that there is
always a
> copy of state on the stack at some critical points, just because it makes
> loads of stuff simpler (debugging, profiling, GC, and of course deopt).  I
> personally prefer to stay away from such a strategy because it's not
free.
>
> On the other hand, branches can be cheap.  A branch on a divide is cheaper
> than not being able to optimize the divide.
>
>
>
> The Value called by a patchpoint should participate in optimization
> normally.
>
> I agree that you could have a different intrinsic that behaves like this.
>
> We really want the patchpoint part of the call to be supplemental.  It
> should still be a call.  It should be constant propagated, transformed,
> etc..  This is not the case currently.  I've got a couple of off the
wall
> ideas for improving the current status, but I'll agree this is a
hardish
> problem.
>
> It should be legal to use a patchpoint in an invoke.  It's an ABI issue
of
> how the invoke path gets invoked.  (i.e. side tables for the runtime to
> lookup, etc..)  This is not possible today, and probably requires a fair
> amount of work.  Some of it, I've already done and will be sharing
shortly.
> Other parts, I haven't even thought about.
>
> Right, it's significantly more complex than either the existing
patchpoints
> or Michael's proposed safe.div.
>
>
>
> If you didn't want to use the trapping semantics, you'd insert
dedicated
> control flow _before_ the divide.  This would allow normal optimization of
> the control flow.
>
> Notes:
> 1) This might require a new PATCHPOINT pseudo op in the backend. 
Haven't
> thought much about that yet.
> 2) I *think* your current intrinsic could be translated into something like
> this.  (Leaving aside the question of where the deopt state comes from.) 
In
> fact, the more I look at this, the less difference I actually see between
> the approaches.
>
>
>
> In a lot of languages, a divide produces some result even in the
exceptional
> case and this result requires effectively deoptimizing since the resut
won't
> be the one you would have predicted (double instead of int, or BigInt
> instead of small int), which sort of means that if the CPU exception occurs
> you have to be able to reconstruct all state.  A patchpoint could do this,
> and so could spilling all state to the stack before the divide - but both
> are very heavy hammers that are sure to be more expensive than just doing a
> branch.
>
> This isn't necessarily as expensive as you might believe.  I'd
recommend
> reading the Graal project papers on this topic.
>
> Whether deopt or branching is more profitable *in this case*, I can't
easily
> say.  I'm not yet to the point of being able to run that experiment. 
We can
> argue about what "should" be better all we want, but real
performance data
> is the only way to truly know.
>
> My point may have been confusing.  I know that deoptimization is cheap and
> WebKit uses it everywhere, including division corner cases, if profiling
> tells us that it's profitable to do so (which it does, in the common
case).
> WebKit is a heavy user of deoptimization in general, so you don't need
to
> convince me that it's worth it.
>
> Acknowledged.
>
> Note that I want *both* deopt *and* branching, because in this case, a
> branch is the fastest overall way of detecting when to deopt.  In the
> future, I will want to implement the deopt in terms of branching, and when
> we do this, I believe that the most sound and performat approach would be
> using Michael's intrinsics.  This is subtle and I'll try to explain
why it's
> the case.
>
> The point is that you wouldn't want to do deoptimization by spilling
state
> on the main path or by using a patchpoint for the main path of the
division.
>
> This is the main point I disagree with.  I don't believe that having a
> patchpoint on the main path should be any more expensive then the original
> call.  (see above)
>
> The reason why the patchpoint is expensive is that if you use a patchpoint
> to implement a division then the optimizer won't be allowed to assume
that
> it's a division, because the whole point of "patchpoint" is
to tell the
> optimizer to piss off.
>
>
>
> Worth noting explicitly: I'm assuming that all of your deopt state
would
> already be available for other purposes in nearby code.  It's on the
stack
> or in registers.  I'm assuming that by adding the deopt point, you are
not
> radically changing the set of computations which need to be done.  If
that's
> not the case, you should avoid deopt and instead just inline the slow paths
> with explicit checks.
>
> Yes, of course it is.  That's not the issue.
>
>
>
> I'll note that given your assumptions about the cost of a patchpoint,
the
> rest of your position makes a lot more sense.  :)  As I spelled out above,
I
> believe this cost is not fundamental.
>
> You don't want the common path of executing the division to involve a
> patchpoint instruction, although using a patchpoint or stackmap to
implement
> deoptimization on the failing path is great:
>
> Good: if (division would fail) { call @patchpoint(all of my state) } else {
> result = a / b }
>
> Given your cost assumptions, I'd agree.
>
> Not my cost assumptions.  The reason why this is better is that the
division
> is expressed in LLVM IR so that LLVM can do useful things to it - like
> eliminate it, for example.
>
>
> Bad: call @patchpoint(all of my state) // patch with a divide instruction -
> bad because the optimizer has no clue what you're doing and assumes the
very
> worst
>
> Yuck.  Agreed.
>
> To be clear, this is what you're proposing - except that you're
assuming
> that LLVM will know that you've patched a division because you're
expecting
> the call target to have semantic meaning.  Or, rather, you're expecting
that
> you can make the contents of the patchpoint be a division by having the
call
> target be a division intrinsic.  In the current implementation and as it is
> currently specified, the call target has no meaning and so you get the yuck
> that I'm showing.
>
>
> Worse: spill all state to the stack; call @trapping.div(a, b) // the spills
> will hurt you far more than a branch, so this should be avoided
>
> I'm assuming this is an explicit spill rather than simply recording a
stack
> map *at the div*.  If so, agreed.
>
> I suppose we could imagine a fourth option that involves a patchpoint to
> pick up the state and a trapping divide instrinsic.  But a trapping divide
> intrinsic alone is not enough.  Consider this:
>
> result = call @trapping.div(a, b); call @stackmap(all of my state)
>
> As soon as these are separate instructions, you have no guarantees that the
> state that the stackmap reports is sound for the point at which the div
> would trap.
>
> This is the closest to what I'd propose, except that the two calls
would be
> merged into a single patchpoint.  Isn't the entire point of a
patchpoint to
> record the stack map for a call?
>
> No.  It would be bad if that's what the documentation says.  That's
not at
> all how WebKit uses it or probably any IC client would use it.
>
> Patchpoints are designed to be inline assembly on steroids.  They're
there
> to allow the client JIT to tell LLVM to piss off.
>
> (Well, ignoring the actual patching part..)  Why not write this as:
> patchpoint(..., trapping.div, a, b);
>
> Is there something I'm missing here?
>
> Just to note: I fully agree that the two call proposal is unsound and
should
> be avoided.
>
> So, the division itself shouldn't be a trapping instruction and instead
you
> want to detect the bad case with a branch.
>
> To be clear:
>
> - Whether you use deoptimization for division or anything else - like
WebKit
> has done since before any of the Graal papers were written - is mostly
> unrelated to how you represent the division, unless you wanted to add a new
> intrinsic that is like a trapping-division-with-stackmap:
>
> result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)
>
> Now, maybe you do want such an intrinsic, in which case you should propose
> it!
>
> Given what we already have with patchpoints, I don't think a merged
> intrinsic is necessary.  (See above).  I believe we have all the parts to
> build this solution, and that - in theory - they should compose neatly.
>
> p.s. The bits I was referring to was not deopt per se.  It was particularly
> which set of deopt state you used for each deopt point.  That's a bit
of
> tangent for the rest of the discussion now though.
>
> The reason why I haven't proposed it is that I think that long-term,
the
> currently proposed intrinsics are a better path to getting the trapping
> optimizations.  See my previous mail, where I show how we could tell LLVM
> what the failing path is (which may have deoptimization code that uses a
> stackmap or whatever), what the trapping predicate is (it comes from the
> safe.div intrinsic), and the fact that trapping is wise (branch weights).
>
> - If you want to do the deoptimization with a trap, then your only choice
> currently is to use a patchpoint for the main path of the division.  This
> will be slower than using a branch to an OSR exit basic block, because
> you're making the division itself opaque to the optimizer (bad) just to
get
> rid of a branch (which was probably cheap to begin with).
>
> Hence, what you want to do - one way or another, regardless of whether this
> proposed intrinsic is added - is to branch on the corner case condition,
and
> have the slow case of the branch go to a basic block that deoptimizes.
> Unless of course you have profiling that says that the case does happen
> often, in which case you can have that basic block handle the corner case
> inline without leaving optimized code (FWIW, we do have such paths in
WebKit
> and they are useful).
>
> So the question for me is whether the branching involves explicit control
> flow or is hidden inside an intrinsic.  I prefer for it to be within an
> intrinsic because it:
>
> - allows the optimizer to do more interesting things in the common cases,
> like hoisting the entire division.
>
> - will give us a clearer path for implementing trapping optimizations in
the
> future.
>
> - is an immediate win on ARM.
>
> I'd be curious to hear what specific idea you have about how to
implement
> trap-based deoptimization with your trapping division intrinsic for x86 -
> maybe it's different from the two "bad" idioms I showed
above.
>
> I hope my explanation above helps.  If not, ask, and I'll try to
explain
> more clearly.
>
> I think I understand it.  I think that the only issue is that:
>
> - Patchpoints currently don't do what you want.
>
> - If you made patchpoints do what you want then you'd break WebKit -
not to
> mention anyone who wants to use them for inline caches.
>
> So it seems like you want a new intrinsic.  You should officially propose
> this new intrinsic, particularly since the core semantic differences are
not
> so great from what we have now.  OTOH, if you truly believe that
patchpoints
> should just be changed to your semantics in a way that does break WebKit,
> then that's probably also something you should get off your chest. ;-)
>
>
>
> One point just for clarity; I don't believe this effects the
conclusions of
> our discussion so far.  I'm also fairly sure that you (Filip) are aware
of
> this, but want to spell it out for other readers.
>
> You seem to be assuming that compiled code needs to explicitly branch to a
> point where deopt state is known to exit a compiled frame.
>
> This is a slightly unclear characterization of my assumptions.  Our JIT
does
> deoptimization without explicit branches for many, many things.  You should
> look at it some time, it's pretty fancy. ;-)
>
> Worth noting is that you can also exit a compiled frame on a trap (without
> an explicitly condition/branch!) if the deopt state is known at the point
> you take the trap.  This "exit frame on trap" behavior shows up
with null
> pointer exceptions as well.  I'll note that both compilers in OpenJDK
> support some combination of "exit-on-trap" conditions for
division and null
> dereferences.  The two differ on exactly what strategies they use in each
> case though.  :)
>
> Yeah, and I've also implemented VMs that do this - and I endorse the
basic
> idea.  I know what you want, and my only point is that the existing
> patchpoints only give you this if you're willing to make a huge
compromise:
> namely, that you're willing to make the division (or heap load for the
null
> case) completely opaque to the compiler to the point that GVN, LICM, SCCP,
> and all algebraic reasoning have to give up on optimizing it.  The point of
> using LLVM is that it can optimize code.  It can optimize branches and
> divisions pretty well.  So, eliminating an explicit branch by replacing it
> with a construct that appears opaque to the optimizer is not a smart
> trade-off.
>
> You could add a new intrinsic that, like patchpoints, records the layout of
> state in a side-table, but that is used as a kind of wrapper for operations
> that LLVM understands.  This may or may not be hairy - you seem to have
sort
> of acknowledged that it's got some complexity and I've also pointed
out some
> possible issues.  If this is something that you want, you should propose it
> so that others know what you're talking about.  One danger of how
we're
> discussing this right now is that you're overloading patchpoints to
mean the
> thing you want them to mean rather than what they actually mean, which
makes
> it seem like we don't need Michael's intrinsics on the grounds that
> patchpoints already offer a solution.  They don't already offer a
solution
> precisely because patchpoints don't do what your intrinsics would do.
>
>
>
> I'm not really arguing that either scheme is "better" in all
cases.  I'm
> simply arguing that we should support both and allow optimization and
tuning
> between them.  As far as I can tell, you seem to be assuming that an
> explicit branch to known exit point is always superior.
>
>
> Ok, back to the topic at hand...
>
> With regards to the current proposal, I'm going to take a step back. 
You
> guys seem to have already looked in this in a fair amount of depth. 
I'm not
> necessarily convinced you've come to the best solution, but at some
point,
> we need to make forward progress.  What you have is clearly better than
> nothing.
>
> Please go ahead and submit your current approach.  We can come back and
> revise later if we really need to.
>
> I do request the following changes:
> - Mark it clearly as experimental.
>
>
> - Either don't specify the value computed in the edge cases, or allow
those
> values to be specified as constant arguments to the call.  This allows
> efficient lowering to x86's div instruction if you want to make use of
the
> trapping semantics.
>
> Once again: how would you use this to get trapping semantics without
> throwing all of LLVM's optimizations out the window, in the absence of
the
> kind of patchpoint-like intrinsic that you want?  I ask just to make sure
> that we're on the same page.
>
>
>
> Finally, as for performance data, which part of this do you want
performance
> data for?  I concede that I don't have performance data for using
Michael's
> new intrinsic.  Part of what the intrinsic accomplishes is it gives a less
> ugly way of doing something that is already possible with target intrinsics
> on ARM.  I think it would be great if you could get those semantics - along
> with a known-good implementation - on other architectures as well.
>
> I would be very interested in seeing data comparing two schemes:
> - Explicit control flow emited by the frontend
> - The safe.div intrinsic emitted by the frontend, desugared in CodeGenPrep
>
> My strong suspicion is that each would preform well in some cases and not
in
> others.  At least on x86.  Since the edge-checks are essentially free on
> ARM, the second scheme would probably be strictly superior there.
>
> I am NOT asking that we block submission on this data however.
>
> But this discussion has also involved suggestions that we should use
> trapping to implement deoptimization, and the main point of my message is
to
> strongly argue against anything like this given the current state of
> trapping semantics and how patchpoints work.  My point is that using traps
> for division corner cases would either be unsound (see the stackmap after
> the trap, above), or would require you to do things that are obviously
> inefficient.  If you truly believe that the branch to detect division slow
> paths is more expensive than spilling all bytecode state to the stack or
> using a patchpoint for the division, then I could probably hack something
up
> in WebKit to show you the performance implications.  (Or you could do it
> yourself, the code is open source...)
>
> In a couple of months, I'll probably have the performance data to
discuss
> this for real.  When that happens, let's pick this up and continue the
> debate.  Alternatively, if you want to chat this over more with a beer in
> hand at the social next week, let me know.  In the meantime, let's not
stall
> the current proposal any more.
>
> Philip
>
>

Filip Pizlo

2014-May-02 18:57 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

On May 2, 2014 at 11:53:25 AM, Eric Christopher (echristo at gmail.com) wrote:

On Wed, Apr 30, 2014 at 10:34 PM, Philip Reames 
<listmail at philipreames.com> wrote: > Andy - If you're not already following this closely, please start.
We've 
> gotten into fairly fundamental questions of what a patchpoint does. 
> 
> Filip, 
> 
> I think you've hit the nail on the head. What I'm thinking of as
being 
> patchpoints are not what you think they are. Part of that is that I've
got 
> a local change which adds a very similar construction (called
"statepoints" 
> for the moment), but I was trying to keep that separate. That also
includes 
> a lot of GC semantics which are not under discussion currently. My 
> apologies if that experience bled over into this conversation and made 
> things more confusing. 
> 
> I will note that the documentation for patchpoint say explicitly the 
> following: 
> "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function
call to 
> the specified <target> and records the location of specified values
in the 
> stack map." 
> 
> My reading has always been that a patchpoint *that isn't patched* is
simply 
> a call with a stackmap associated with it. To my reading, this can (and 
> did, and does) indicate my proposed usage would be legal. 
> 
I like the idea that the target can be assumed to be called. It makes 
optimization of the call possible, etc. I think it's definitely worth 
exploring before we lock down the patchpoint intrinsic. 
I will actively oppose this.

-Filip





-eric 
> I will agree that I've confused the topic badly on the optimization
front. 
> My "statepoint" isn't patchable, so a lot more optimizations
are legal. 
> Sorry about that. To restate what I think you've been saying all along,
the 
> optimizer can't make assumptions about what function is called by a 
> patchpoint because that might change based on later patching. Is this the 
> key point you've been trying to make? 
> 
> I'm not objecting to separating "my patchpoint" from
"your patchpoint". 
> Let's just hammer out the semantics of each first. :) 
> 
> Again, longer response to follow in a day or so. :) 
> 
> Philip 
> 
> 
> On 04/30/2014 10:09 PM, Filip Pizlo wrote: 
> 
> 
> 
> On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at
philipreames.com) 
> wrote: 
> 
> On 04/29/2014 12:39 PM, Filip Pizlo wrote: 
> 
> On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at
philipreames.com) 
> wrote: 
> 
> On 04/29/2014 10:44 AM, Filip Pizlo wrote: 
> 
> LD;DR: Your desire to use trapping on x86 only further convinces me that 
> Michael's proposed intrinsics are the best way to go. 
> 
> I'm still not convinced, but am not going to actively oppose it either.
I'm 
> leery of designing a solution with major assumptions we don't have data
to 
> backup. 
> 
> I worry your assumptions about deoptimization are potentially unsound. But 
> I don't have data to actually show this (yet). 
> 
> I *think* I may have been unclear about my assumptions; in particular, my 
> claims with respect to deoptimization are probably more subtle than they 
> appeared. WebKit can use LLVM and it has divisions and we do all possible 
> deoptimization/profiling/etc tricks for it, so this is grounded in 
> experience. Forgive me if the rest of this e-mail contains a lecture on 
> things that are obvious - I'll try to err on the side of clarity and 
> completeness since this discussion is sufficiently dense that we run the 
> risk of talking cross-purposes unless some baseline assumptions are 
> established. 
> 
> I think we're using the same terminology, but with slightly different
sets 
> of assumptions. I'll point this out below where relevant. 
> 
> Also, thanks for taking the time to expand. It help clarify the discussion 
> quite a bit. 
> 
> I think we may be converging to an understanding of what you want versus 
> what I want, and I think that there are some points - possibly unrelated
to 
> division - that are worth clarifying. I think that the main difference is 
> that when I say "patchpoint", I am referring to a concrete
intrinsic with 
> specific semantics that cannot change without breaking WebKit, while you
are 
> using the term to refer to a broad concept, or rather, a class of 
> as-yet-unimplemented intrinsics that share some of the same features with 
> patchpoints but otherwise have incompatible semantics. 
> 
> Also, when I say that you wouldn't want to use the existing patchpoint
to do 
> your trapping deopt, what I mean is that the performance of doing this
would 
> suck for reasons that are not related to deoptimization or trapping.
I'm 
> not claiming that deoptimization performs poorly (trust me, I know better) 
> or that trapping to deoptimize is bad (I've done this many, many times
and I 
> know better). I'm saying that with the current patchpoint intrinsics
in 
> LLVM, as they are currently specified and implemented, doing it would be a 
> bad idea because you'd have to compromise a bunch of other
optimizations to 
> achieve it. 
> 
> You have essentially described new intrinsics that would make this less of
a 
> bad idea and I am interested in your intrinsics, so I'll try to both
respond 
> with why patchpoints don't currently give you what you want (and why
simply 
> changing patchpoint semantics would be evil) and I'll also try to
comment on 
> what I think of the intrinsic that you're effectively proposing. Long
story 
> short, I think you should formally propose your intrinsic even if it's
not 
> completely fleshed out. I think that it's an interesting capability and
in 
> its most basic form, it is a simple variation of the current 
> patchpoint/stackmap intrinsics. 
> 
> 
> 
> 
> 
> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at
philipreames.com) 
> wrote: 
> 
> As the discussion has progressed and I've spent more time thinking
about the 
> topic, I find myself less and less enthused about the current proposal.
I'm 
> in full support of having idiomatic ways to express safe division, but
I'm 
> starting to doubt that using an intrinsic is the right way at the moment. 
> 
> One case I find myself thinking about is how one would combine profiling 
> information and implicit div-by-zero/overflow checks with this proposal. I 
> don't really see a clean way. Ideally, for a "safe div" which
never has the 
> exceptional paths taken, you'd like to completely do away with the
control 
> flow entirely. (And rely on hardware traps w/exceptions instead.) I
don't 
> really see a way to represent that type of construct given the current 
> proposal. 
> 
> This is a deeper problem and to solve it you'd need a solution to
trapping 
> in general. Let's consider the case of Java. A Java program may want
to 
> catch the arithmetic exception due to divide by zero. How would you do
this 
> with a trap in LLVM IR? Spill all state that is live at the catch? Use a 
> patchpoint for the entire division instruction? 
> 
> We'd likely use something similar to a patchpoint. You'd need the
"abstract 
> vm state" (which is not the compiled frame necessarily) available at
the div 
> instruction. You could then re-enter the interpreter at the specified
index 
> (part of the vm state). We have all most of these mechanisms in place. 
> Ideally, you'd trigger a recompile and otherwise ensure re-entry into 
> compiled code at the soonest possible moment. 
> 
> This requires a lot of runtime support, but we already have most of it 
> implemented for another compiler. From our perspective, the runtime 
> requirements are not a major blocker. 
> 
> Right, you'll use a patchpoint. That's way more expensive than
using a safe 
> division intrinsic with branches, because it's opaque to the
optimizer. 
> 
> This statement is true at the moment, but it shouldn't be. I think this
is 
> our fundamental difference in approach. 
> 
> You should be able to write something like: 
> i32 %res = invoke patchpoint (... x86_trapping_divide, a, b) normal_dest 
> invoke_dest 
> 
> normal_dest: 
> ;; use %res 
> invoke_dest: 
> landingpad 
> ;; dispatch edge cases 
> ;; this could be unreachable code if you deopt this frame in the trap 
> handler and jump directly to an interpreter or other bit of code 
> 
> I see. It sounds like you want a generalization of the
"div.with.stackmap" 
> that I thought you wanted - you want to be able to wrap anything in a 
> stackmap. 
> 
> The current patchpoint intrinsic does not do this, and you run the risk of 
> breaking existing semantics if you changed this. You'd probably break 
> WebKit, which treats the call target of the patchpoint as nothing more
than 
> a quirk - we always pass null. Also, the current patchpoint treats the 
> callee as an i8* if I remember right and it would be super weird if all
LLVM 
> phases had to interpret this i8* by unwrapping a possible bitcast to get
to 
> a declared function that may be an intrinsic. Yuck! Basically, the call 
> target of existing patchpoints is meant to be a kind of convenience
feature 
> rather than the core of the mechanism. 
> 
> I agree in principle that the intrinsic that you want would be a useful 
> intrinsic. But let's not call it a patchpoint for the purposes of this 
> discussion, and let's not confuse the discussion by claiming
(incorrectly) 
> that the existing patchpoint facility gives you what you want. It
doesn't: 
> patchpoints are designed to make the call target opaque (hence the use of 
> i8*) and there shouldn't be a correlation between what the patchpoint
does 
> at run-time and what the called function would have done. You could make 
> the call target be null (like WebKit does) and the patchpoint should still 
> mean "this code can do anything" because the expectation is that
the client 
> JIT will patch over it anyway. 
> 
> Also, "patchpoint" would probably not be the right term for the
intrinsic 
> that you want. I think that what you want is
"call.with.stackmap". Or 
> maybe "stackmap.wrapper". Or just "stackmap" - I'd
be OK, in principle, 
> with changing the name of the current "stackmap" intrinsic to
something that 
> reflects the fact that it's less of a stackmap than what you want. 
> 
> To summarize. A patchpoint's main purpose is that you can patch it
with 
> arbitrary code. The current "stackmap" means that you can patch
it with 
> arbitrary code and that patching may be destructive to a shadow of machine 
> code bytes, so it's really just like patchpoints - we could change its
name 
> to "patchpoint.shadow" for example. 
> 
> If you were to propose such a stackmap intrinsic, then I think there could 
> be some ways of doing it that wouldn't be too terrible. Basically you
want 
> something that is like a patchpoint in that it reports a stackmap via a
side 
> channel, but unlike patchpoints, it doesn't allow arbitrary patching - 
> instead the optimizer should be allowed to assume that the code within the 
> patchpoint will always do the same thing that the call target would have 
> done. There are downsides to truly doing this. For example, to make 
> division efficient with such an intrinsic, you'd have to do something
that 
> is somewhat worse than just recognizing intrinsics in the optimizer -
you'd 
> have to first recognize a call to your "stackmap wrapper"
intrinsic and then 
> observe that its call target argument is in turn another intrinsic. To me 
> personally, that's kind of yucky, but I won't deny that it could be
useful. 
> 
> As to the use of invoke: I don't believe that the use of invoke versus
my 
> suggested "branch on a trap predicate" idea are different in any
truly 
> meaningful way. I buy that either would work. 
> 
> 
> 
> A patchpoint should not require any excess spilling. If values are live in 
> registers, that should be reflected in the stack map. (I do not know if 
> this is the case for patchpoint at the moment or not.) 
> 
> Patchpoints do not require spilling. 
> 
> My point was that with existing patchpoints, you *either* use a patchpoint 
> for the entire division which makes the division opaque to the optimizer - 
> because a patchpoint means "this code can do anything" - *or* you
could 
> spill stuff to the stack prior to your trapping division intrinsic, since 
> spilling is something that you could do as a workaround if you didn't
have a 
> patchpoint. 
> 
> The reason why I brought up spilling at all is that I suspect that
spilling 
> all state to the stack might be cheaper - for some systems - than turning 
> the division into a patchpoint. Turning the division into a patchpoint is 
> horrendously brutal - the patchpoint looks like it clobbers the heap
(which 
> a division doesn't do), has to execute (a division is an obvious DCE 
> candidate), cannot be hoisted (hoisting divisions is awesome), etc.
Perhaps 
> most importantly, though, a patchpoint doesn't tell LLVM that
you're *doing 
> a division* - so all constant folding, strenght reduction, and algebraic 
> reasoning flies out the window. On the other hand, spilling all state to 
> the stack is an arguably sound and performant solution to a lot of VM 
> problems. I've seen JVM implementations that ensure that there is
always a 
> copy of state on the stack at some critical points, just because it makes 
> loads of stuff simpler (debugging, profiling, GC, and of course deopt). I 
> personally prefer to stay away from such a strategy because it's not
free. 
> 
> On the other hand, branches can be cheap. A branch on a divide is cheaper 
> than not being able to optimize the divide. 
> 
> 
> 
> The Value called by a patchpoint should participate in optimization 
> normally. 
> 
> I agree that you could have a different intrinsic that behaves like this. 
> 
> We really want the patchpoint part of the call to be supplemental. It 
> should still be a call. It should be constant propagated, transformed, 
> etc.. This is not the case currently. I've got a couple of off the
wall 
> ideas for improving the current status, but I'll agree this is a
hardish 
> problem. 
> 
> It should be legal to use a patchpoint in an invoke. It's an ABI issue
of 
> how the invoke path gets invoked. (i.e. side tables for the runtime to 
> lookup, etc..) This is not possible today, and probably requires a fair 
> amount of work. Some of it, I've already done and will be sharing
shortly. 
> Other parts, I haven't even thought about. 
> 
> Right, it's significantly more complex than either the existing
patchpoints 
> or Michael's proposed safe.div. 
> 
> 
> 
> If you didn't want to use the trapping semantics, you'd insert
dedicated 
> control flow _before_ the divide. This would allow normal optimization of 
> the control flow. 
> 
> Notes: 
> 1) This might require a new PATCHPOINT pseudo op in the backend.
Haven't 
> thought much about that yet. 
> 2) I *think* your current intrinsic could be translated into something
like 
> this. (Leaving aside the question of where the deopt state comes from.) In 
> fact, the more I look at this, the less difference I actually see between 
> the approaches. 
> 
> 
> 
> In a lot of languages, a divide produces some result even in the
exceptional 
> case and this result requires effectively deoptimizing since the resut
won't 
> be the one you would have predicted (double instead of int, or BigInt 
> instead of small int), which sort of means that if the CPU exception
occurs 
> you have to be able to reconstruct all state. A patchpoint could do this, 
> and so could spilling all state to the stack before the divide - but both 
> are very heavy hammers that are sure to be more expensive than just doing
a 
> branch. 
> 
> This isn't necessarily as expensive as you might believe. I'd
recommend 
> reading the Graal project papers on this topic. 
> 
> Whether deopt or branching is more profitable *in this case*, I can't
easily 
> say. I'm not yet to the point of being able to run that experiment. We
can 
> argue about what "should" be better all we want, but real
performance data 
> is the only way to truly know. 
> 
> My point may have been confusing. I know that deoptimization is cheap and 
> WebKit uses it everywhere, including division corner cases, if profiling 
> tells us that it's profitable to do so (which it does, in the common
case). 
> WebKit is a heavy user of deoptimization in general, so you don't need
to 
> convince me that it's worth it. 
> 
> Acknowledged. 
> 
> Note that I want *both* deopt *and* branching, because in this case, a 
> branch is the fastest overall way of detecting when to deopt. In the 
> future, I will want to implement the deopt in terms of branching, and when 
> we do this, I believe that the most sound and performat approach would be 
> using Michael's intrinsics. This is subtle and I'll try to explain
why it's 
> the case. 
> 
> The point is that you wouldn't want to do deoptimization by spilling
state 
> on the main path or by using a patchpoint for the main path of the
division. 
> 
> This is the main point I disagree with. I don't believe that having a 
> patchpoint on the main path should be any more expensive then the original 
> call. (see above) 
> 
> The reason why the patchpoint is expensive is that if you use a patchpoint 
> to implement a division then the optimizer won't be allowed to assume
that 
> it's a division, because the whole point of "patchpoint" is
to tell the 
> optimizer to piss off. 
> 
> 
> 
> Worth noting explicitly: I'm assuming that all of your deopt state
would 
> already be available for other purposes in nearby code. It's on the
stack 
> or in registers. I'm assuming that by adding the deopt point, you are
not 
> radically changing the set of computations which need to be done. If
that's 
> not the case, you should avoid deopt and instead just inline the slow
paths 
> with explicit checks. 
> 
> Yes, of course it is. That's not the issue. 
> 
> 
> 
> I'll note that given your assumptions about the cost of a patchpoint,
the 
> rest of your position makes a lot more sense. :) As I spelled out above, I 
> believe this cost is not fundamental. 
> 
> You don't want the common path of executing the division to involve a 
> patchpoint instruction, although using a patchpoint or stackmap to
implement 
> deoptimization on the failing path is great: 
> 
> Good: if (division would fail) { call @patchpoint(all of my state) } else
{ 
> result = a / b } 
> 
> Given your cost assumptions, I'd agree. 
> 
> Not my cost assumptions. The reason why this is better is that the
division 
> is expressed in LLVM IR so that LLVM can do useful things to it - like 
> eliminate it, for example. 
> 
> 
> Bad: call @patchpoint(all of my state) // patch with a divide instruction
- 
> bad because the optimizer has no clue what you're doing and assumes the
very 
> worst 
> 
> Yuck. Agreed. 
> 
> To be clear, this is what you're proposing - except that you're
assuming 
> that LLVM will know that you've patched a division because you're
expecting 
> the call target to have semantic meaning. Or, rather, you're expecting
that 
> you can make the contents of the patchpoint be a division by having the
call 
> target be a division intrinsic. In the current implementation and as it is 
> currently specified, the call target has no meaning and so you get the
yuck 
> that I'm showing. 
> 
> 
> Worse: spill all state to the stack; call @trapping.div(a, b) // the
spills 
> will hurt you far more than a branch, so this should be avoided 
> 
> I'm assuming this is an explicit spill rather than simply recording a
stack 
> map *at the div*. If so, agreed. 
> 
> I suppose we could imagine a fourth option that involves a patchpoint to 
> pick up the state and a trapping divide instrinsic. But a trapping divide 
> intrinsic alone is not enough. Consider this: 
> 
> result = call @trapping.div(a, b); call @stackmap(all of my state) 
> 
> As soon as these are separate instructions, you have no guarantees that
the 
> state that the stackmap reports is sound for the point at which the div 
> would trap. 
> 
> This is the closest to what I'd propose, except that the two calls
would be 
> merged into a single patchpoint. Isn't the entire point of a patchpoint
to 
> record the stack map for a call? 
> 
> No. It would be bad if that's what the documentation says. That's
not at 
> all how WebKit uses it or probably any IC client would use it. 
> 
> Patchpoints are designed to be inline assembly on steroids. They're
there 
> to allow the client JIT to tell LLVM to piss off. 
> 
> (Well, ignoring the actual patching part..) Why not write this as: 
> patchpoint(..., trapping.div, a, b); 
> 
> Is there something I'm missing here? 
> 
> Just to note: I fully agree that the two call proposal is unsound and
should 
> be avoided. 
> 
> So, the division itself shouldn't be a trapping instruction and instead
you 
> want to detect the bad case with a branch. 
> 
> To be clear: 
> 
> - Whether you use deoptimization for division or anything else - like
WebKit 
> has done since before any of the Graal papers were written - is mostly 
> unrelated to how you represent the division, unless you wanted to add a
new 
> intrinsic that is like a trapping-division-with-stackmap: 
> 
> result = call @trapping.div.with.stackmap(a, b, ... all of my state ...) 
> 
> Now, maybe you do want such an intrinsic, in which case you should propose 
> it! 
> 
> Given what we already have with patchpoints, I don't think a merged 
> intrinsic is necessary. (See above). I believe we have all the parts to 
> build this solution, and that - in theory - they should compose neatly. 
> 
> p.s. The bits I was referring to was not deopt per se. It was particularly 
> which set of deopt state you used for each deopt point. That's a bit
of 
> tangent for the rest of the discussion now though. 
> 
> The reason why I haven't proposed it is that I think that long-term,
the 
> currently proposed intrinsics are a better path to getting the trapping 
> optimizations. See my previous mail, where I show how we could tell LLVM 
> what the failing path is (which may have deoptimization code that uses a 
> stackmap or whatever), what the trapping predicate is (it comes from the 
> safe.div intrinsic), and the fact that trapping is wise (branch weights). 
> 
> - If you want to do the deoptimization with a trap, then your only choice 
> currently is to use a patchpoint for the main path of the division. This 
> will be slower than using a branch to an OSR exit basic block, because 
> you're making the division itself opaque to the optimizer (bad) just to
get 
> rid of a branch (which was probably cheap to begin with). 
> 
> Hence, what you want to do - one way or another, regardless of whether
this 
> proposed intrinsic is added - is to branch on the corner case condition,
and 
> have the slow case of the branch go to a basic block that deoptimizes. 
> Unless of course you have profiling that says that the case does happen 
> often, in which case you can have that basic block handle the corner case 
> inline without leaving optimized code (FWIW, we do have such paths in
WebKit 
> and they are useful). 
> 
> So the question for me is whether the branching involves explicit control 
> flow or is hidden inside an intrinsic. I prefer for it to be within an 
> intrinsic because it: 
> 
> - allows the optimizer to do more interesting things in the common cases, 
> like hoisting the entire division. 
> 
> - will give us a clearer path for implementing trapping optimizations in
the 
> future. 
> 
> - is an immediate win on ARM. 
> 
> I'd be curious to hear what specific idea you have about how to
implement 
> trap-based deoptimization with your trapping division intrinsic for x86 - 
> maybe it's different from the two "bad" idioms I showed
above. 
> 
> I hope my explanation above helps. If not, ask, and I'll try to
explain 
> more clearly. 
> 
> I think I understand it. I think that the only issue is that: 
> 
> - Patchpoints currently don't do what you want. 
> 
> - If you made patchpoints do what you want then you'd break WebKit -
not to 
> mention anyone who wants to use them for inline caches. 
> 
> So it seems like you want a new intrinsic. You should officially propose 
> this new intrinsic, particularly since the core semantic differences are
not 
> so great from what we have now. OTOH, if you truly believe that
patchpoints 
> should just be changed to your semantics in a way that does break WebKit, 
> then that's probably also something you should get off your chest. ;-) 
> 
> 
> 
> One point just for clarity; I don't believe this effects the
conclusions of 
> our discussion so far. I'm also fairly sure that you (Filip) are aware
of 
> this, but want to spell it out for other readers. 
> 
> You seem to be assuming that compiled code needs to explicitly branch to a 
> point where deopt state is known to exit a compiled frame. 
> 
> This is a slightly unclear characterization of my assumptions. Our JIT
does 
> deoptimization without explicit branches for many, many things. You should 
> look at it some time, it's pretty fancy. ;-) 
> 
> Worth noting is that you can also exit a compiled frame on a trap (without 
> an explicitly condition/branch!) if the deopt state is known at the point 
> you take the trap. This "exit frame on trap" behavior shows up
with null 
> pointer exceptions as well. I'll note that both compilers in OpenJDK 
> support some combination of "exit-on-trap" conditions for
division and null 
> dereferences. The two differ on exactly what strategies they use in each 
> case though. :) 
> 
> Yeah, and I've also implemented VMs that do this - and I endorse the
basic 
> idea. I know what you want, and my only point is that the existing 
> patchpoints only give you this if you're willing to make a huge
compromise: 
> namely, that you're willing to make the division (or heap load for the
null 
> case) completely opaque to the compiler to the point that GVN, LICM, SCCP, 
> and all algebraic reasoning have to give up on optimizing it. The point of 
> using LLVM is that it can optimize code. It can optimize branches and 
> divisions pretty well. So, eliminating an explicit branch by replacing it 
> with a construct that appears opaque to the optimizer is not a smart 
> trade-off. 
> 
> You could add a new intrinsic that, like patchpoints, records the layout
of 
> state in a side-table, but that is used as a kind of wrapper for
operations 
> that LLVM understands. This may or may not be hairy - you seem to have
sort 
> of acknowledged that it's got some complexity and I've also pointed
out some 
> possible issues. If this is something that you want, you should propose it 
> so that others know what you're talking about. One danger of how
we're 
> discussing this right now is that you're overloading patchpoints to
mean the 
> thing you want them to mean rather than what they actually mean, which
makes 
> it seem like we don't need Michael's intrinsics on the grounds
that 
> patchpoints already offer a solution. They don't already offer a
solution 
> precisely because patchpoints don't do what your intrinsics would do. 
> 
> 
> 
> I'm not really arguing that either scheme is "better" in all
cases. I'm 
> simply arguing that we should support both and allow optimization and
tuning 
> between them. As far as I can tell, you seem to be assuming that an 
> explicit branch to known exit point is always superior. 
> 
> 
> Ok, back to the topic at hand... 
> 
> With regards to the current proposal, I'm going to take a step back.
You 
> guys seem to have already looked in this in a fair amount of depth. I'm
not 
> necessarily convinced you've come to the best solution, but at some
point, 
> we need to make forward progress. What you have is clearly better than 
> nothing. 
> 
> Please go ahead and submit your current approach. We can come back and 
> revise later if we really need to. 
> 
> I do request the following changes: 
> - Mark it clearly as experimental. 
> 
> 
> - Either don't specify the value computed in the edge cases, or allow
those 
> values to be specified as constant arguments to the call. This allows 
> efficient lowering to x86's div instruction if you want to make use of
the 
> trapping semantics. 
> 
> Once again: how would you use this to get trapping semantics without 
> throwing all of LLVM's optimizations out the window, in the absence of
the 
> kind of patchpoint-like intrinsic that you want? I ask just to make sure 
> that we're on the same page. 
> 
> 
> 
> Finally, as for performance data, which part of this do you want
performance 
> data for? I concede that I don't have performance data for using
Michael's 
> new intrinsic. Part of what the intrinsic accomplishes is it gives a less 
> ugly way of doing something that is already possible with target
intrinsics 
> on ARM. I think it would be great if you could get those semantics - along 
> with a known-good implementation - on other architectures as well. 
> 
> I would be very interested in seeing data comparing two schemes: 
> - Explicit control flow emited by the frontend 
> - The safe.div intrinsic emitted by the frontend, desugared in CodeGenPrep 
> 
> My strong suspicion is that each would preform well in some cases and not
in 
> others. At least on x86. Since the edge-checks are essentially free on 
> ARM, the second scheme would probably be strictly superior there. 
> 
> I am NOT asking that we block submission on this data however. 
> 
> But this discussion has also involved suggestions that we should use 
> trapping to implement deoptimization, and the main point of my message is
to 
> strongly argue against anything like this given the current state of 
> trapping semantics and how patchpoints work. My point is that using traps 
> for division corner cases would either be unsound (see the stackmap after 
> the trap, above), or would require you to do things that are obviously 
> inefficient. If you truly believe that the branch to detect division slow 
> paths is more expensive than spilling all bytecode state to the stack or 
> using a patchpoint for the division, then I could probably hack something
up 
> in WebKit to show you the performance implications. (Or you could do it 
> yourself, the code is open source...) 
> 
> In a couple of months, I'll probably have the performance data to
discuss 
> this for real. When that happens, let's pick this up and continue the 
> debate. Alternatively, if you want to chat this over more with a beer in 
> hand at the social next week, let me know. In the meantime, let's not
stall 
> the current proposal any more. 
> 
> Philip 
> 
> -------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140502/c068d1f3/attachment.html>

Philip Reames

2014-May-02 23:02 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

Ok, let's take a major step back.  I think we're getting too much into 
the details of the current implementation and loosing track of the use 
cases.

 From what I can tell, there are a number of possible uses for 
"patchable calls" or "stack maps".  Note, I'm purposesly
not using the
term "patchpoint" here.  Let's walk through from first principles
and
make sure we're in agreement.  Once we've got that, we can figure out 
how to evolve patchpoint as needed (if needed).

Call Dispatch Optimizations (i.e. inline call cache) -- In this case, 
the final callee is unknown at the call site.  The purpose is to cache 
the result of the function resolution (virtual call, interface call, 
"other" call).  The actual call target may change, so little is known 
about the callee.  In theory, the optimizer could leverage information 
about the set of functions potentially reachable, but that information 
is not exposed today.

Code Lifetime Management - If one compiled function calls another, it 
may be necessary to redirect this call to another implementation of the 
same function.  (In the most common case, this would be an 
interpreter.)  In this case, the callee is always the same and the call 
site can be optimized with that assumption provided the ability to 
redirect the actual call is preserved.

Unpatched Call w/stack map - The runtime may need to know the state of 
certain values during the call.  Conceptually, these values become 
arguments to the function, but the calling convention is highly flexible 
to reduce the call overhead.  In this usage, the call site is not 
patched, but the stack map is required.  The call target is known and 
can be used for optimization purposes.

Opaque Patching - The runtime is going to patch in some completely 
unknown bit of code at this location.  This is completely opaque to the 
optimizer.

Are there any cases I've missed?

Let's move on to the existing implementations.

As far as I can tell, Filip expects the "opaque patching" semantics,
but
is actually implementing something closer to the "call dispatch 
optimization".  I was expecting that the existing implementation could 
be used for any of the above, but got confused about optimizations which 
were legal only for "lifetime management" and "unpatched calls
w/stack
maps" and tried to apply them to patchpoint as a whole.

At least to my reading, the current documentation would seem to indicate 
that all uses above are legal, but only optimizations applicable to all 
would be legal.  (i.e. none)  The last half could stand to be clarified 
in the documentation.

Is this a correct summary of the discussion to date?


It's starting to sound like we need a split within the
"patchpoint"
intrinsic.  We need one version which provides *no* information about 
the eventual callee.  We need another which provides the "patchable" 
part, but also specifies some information about the eventual callee for 
the purpose of optimization.  Potentially, we even need a version which 
has a "stack map", but isn't patchable.

Thoughts?  This isn't a concrete proposal or anything.  If we settle on 
something, I'll make a more formal proposal with a proper spec for 
general discussion.

Philip





On 04/30/2014 10:57 PM, Filip Pizlo wrote:
*Conversation trimmed to avoid moderation due to length*

llvm dev - May 2014 - [LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division