On 05/02/2014 11:57 AM, Filip Pizlo wrote:
> On May 2, 2014 at 11:53:25 AM, Eric Christopher (echristo at gmail.com
> <mailto:echristo at gmail.com>) wrote:
>
>> On Wed, Apr 30, 2014 at 10:34 PM, Philip Reames
>> <listmail at philipreames.com> wrote:
>> > Andy - If you're not already following this closely, please start. We've gotten into fairly fundamental questions of what a patchpoint does.
>> >
>> > Filip,
>> >
>> > I think you've hit the nail on the head. What I'm thinking of as being patchpoints are not what you think they are. Part of that is that I've got a local change which adds a very similar construction (called "statepoints" for the moment), but I was trying to keep that separate. That also includes a lot of GC semantics which are not under discussion currently. My apologies if that experience bled over into this conversation and made things more confusing.
>> >
>> > I will note that the documentation for patchpoint says explicitly the following: "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function call to the specified <target> and records the location of specified values in the stack map."
>> >
>> > My reading has always been that a patchpoint *that isn't patched* is simply a call with a stackmap associated with it. To my reading, this can (and did, and does) indicate my proposed usage would be legal.
>> >
>>
>> I like the idea that the target can be assumed to be called. It makes optimization of the call possible, etc. I think it's definitely worth exploring before we lock down the patchpoint intrinsic.
>
> I will actively oppose this.
>
I think it sounds like we need to split patchpoint into two cases. I'm going to send a rough proposal for this later today.
> -Filip
>
>
>>
>>
>> -eric
>>
>> > I will agree that I've confused the topic badly on the optimization front. My "statepoint" isn't patchable, so a lot more optimizations are legal. Sorry about that. To restate what I think you've been saying all along, the optimizer can't make assumptions about what function is called by a patchpoint because that might change based on later patching. Is this the key point you've been trying to make?
>> >
>> > I'm not objecting to separating "my patchpoint" from "your patchpoint". Let's just hammer out the semantics of each first. :)
>> >
>> > Again, longer response to follow in a day or so. :)
>> >
>> > Philip
>> >
>> >
>> > On 04/30/2014 10:09 PM, Filip Pizlo wrote:
>> >
>> >
>> >
>> > On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at philipreames.com) wrote:
>> >
>> > On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>> >
>> > On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at philipreames.com) wrote:
>> >
>> > On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>> >
>> > TL;DR: Your desire to use trapping on x86 only further convinces me that Michael's proposed intrinsics are the best way to go.
>> >
>> > I'm still not convinced, but am not going to actively oppose it either. I'm leery of designing a solution with major assumptions we don't have data to back up.
>> >
>> > I worry your assumptions about deoptimization are potentially unsound. But I don't have data to actually show this (yet).
>> >
>> > I *think* I may have been unclear about my assumptions; in particular, my claims with respect to deoptimization are probably more subtle than they appeared. WebKit can use LLVM and it has divisions and we do all possible deoptimization/profiling/etc tricks for it, so this is grounded in experience. Forgive me if the rest of this e-mail contains a lecture on things that are obvious - I'll try to err on the side of clarity and completeness since this discussion is sufficiently dense that we run the risk of talking cross-purposes unless some baseline assumptions are established.
>> >
>> > I think we're using the same terminology, but with slightly different sets of assumptions. I'll point this out below where relevant.
>> >
>> > Also, thanks for taking the time to expand. It helps clarify the discussion quite a bit.
>> >
>> > I think we may be converging to an understanding of what you want versus what I want, and I think that there are some points - possibly unrelated to division - that are worth clarifying. I think that the main difference is that when I say "patchpoint", I am referring to a concrete intrinsic with specific semantics that cannot change without breaking WebKit, while you are using the term to refer to a broad concept, or rather, a class of as-yet-unimplemented intrinsics that share some of the same features with patchpoints but otherwise have incompatible semantics.
>> >
>> > Also, when I say that you wouldn't want to use the existing patchpoint to do your trapping deopt, what I mean is that the performance of doing this would suck for reasons that are not related to deoptimization or trapping. I'm not claiming that deoptimization performs poorly (trust me, I know better) or that trapping to deoptimize is bad (I've done this many, many times and I know better). I'm saying that with the current patchpoint intrinsics in LLVM, as they are currently specified and implemented, doing it would be a bad idea because you'd have to compromise a bunch of other optimizations to achieve it.
>> >
>> > You have essentially described new intrinsics that would make this less of a bad idea and I am interested in your intrinsics, so I'll try to both respond with why patchpoints don't currently give you what you want (and why simply changing patchpoint semantics would be evil) and I'll also try to comment on what I think of the intrinsic that you're effectively proposing. Long story short, I think you should formally propose your intrinsic even if it's not completely fleshed out. I think that it's an interesting capability and in its most basic form, it is a simple variation of the current patchpoint/stackmap intrinsics.
>> >
>> >
>> >
>> >
>> >
>> > On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com) wrote:
>> >
>> > As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal. I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment.
>> >
>> > One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal. I don't really see a clean way. Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely. (And rely on hardware traps w/exceptions instead.) I don't really see a way to represent that type of construct given the current proposal.
>> >
>> > This is a deeper problem and to solve it you'd need a solution to trapping in general. Let's consider the case of Java. A Java program may want to catch the arithmetic exception due to divide by zero. How would you do this with a trap in LLVM IR? Spill all state that is live at the catch? Use a patchpoint for the entire division instruction?
>> >
>> > We'd likely use something similar to a patchpoint. You'd need the "abstract vm state" (which is not the compiled frame necessarily) available at the div instruction. You could then re-enter the interpreter at the specified index (part of the vm state). We have almost all of these mechanisms in place. Ideally, you'd trigger a recompile and otherwise ensure re-entry into compiled code at the soonest possible moment.
>> >
>> > This requires a lot of runtime support, but we already have most of it implemented for another compiler. From our perspective, the runtime requirements are not a major blocker.
>> >
>> > Right, you'll use a patchpoint. That's way more expensive than using a safe division intrinsic with branches, because it's opaque to the optimizer.
>> >
>> > This statement is true at the moment, but it shouldn't be. I think this is our fundamental difference in approach.
>> >
>> > You should be able to write something like:
>> > i32 %res = invoke patchpoint (... x86_trapping_divide, a, b) normal_dest invoke_dest
>> >
>> > normal_dest:
>> > ;; use %res
>> > invoke_dest:
>> > landingpad
>> > ;; dispatch edge cases
>> > ;; this could be unreachable code if you deopt this frame in the trap handler and jump directly to an interpreter or other bit of code
>> >
>> > I see. It sounds like you want a generalization of the "div.with.stackmap" that I thought you wanted - you want to be able to wrap anything in a stackmap.
>> >
>> > The current patchpoint intrinsic does not do this, and you run the risk of breaking existing semantics if you changed this. You'd probably break WebKit, which treats the call target of the patchpoint as nothing more than a quirk - we always pass null. Also, the current patchpoint treats the callee as an i8* if I remember right and it would be super weird if all LLVM phases had to interpret this i8* by unwrapping a possible bitcast to get to a declared function that may be an intrinsic. Yuck! Basically, the call target of existing patchpoints is meant to be a kind of convenience feature rather than the core of the mechanism.
>> >
>> > I agree in principle that the intrinsic that you want would be a useful intrinsic. But let's not call it a patchpoint for the purposes of this discussion, and let's not confuse the discussion by claiming (incorrectly) that the existing patchpoint facility gives you what you want. It doesn't: patchpoints are designed to make the call target opaque (hence the use of i8*) and there shouldn't be a correlation between what the patchpoint does at run-time and what the called function would have done. You could make the call target be null (like WebKit does) and the patchpoint should still mean "this code can do anything" because the expectation is that the client JIT will patch over it anyway.
>> >
>> > Also, "patchpoint" would probably not be the right term for the intrinsic that you want. I think that what you want is "call.with.stackmap". Or maybe "stackmap.wrapper". Or just "stackmap" - I'd be OK, in principle, with changing the name of the current "stackmap" intrinsic to something that reflects the fact that it's less of a stackmap than what you want.
>> >
>> > To summarize. A patchpoint's main purpose is that you can patch it with arbitrary code. The current "stackmap" means that you can patch it with arbitrary code and that patching may be destructive to a shadow of machine code bytes, so it's really just like patchpoints - we could change its name to "patchpoint.shadow" for example.
>> >
>> > If you were to propose such a stackmap intrinsic, then I think there could be some ways of doing it that wouldn't be too terrible. Basically you want something that is like a patchpoint in that it reports a stackmap via a side channel, but unlike patchpoints, it doesn't allow arbitrary patching - instead the optimizer should be allowed to assume that the code within the patchpoint will always do the same thing that the call target would have done. There are downsides to truly doing this. For example, to make division efficient with such an intrinsic, you'd have to do something that is somewhat worse than just recognizing intrinsics in the optimizer - you'd have to first recognize a call to your "stackmap wrapper" intrinsic and then observe that its call target argument is in turn another intrinsic. To me personally, that's kind of yucky, but I won't deny that it could be useful.
>> >
>> > As to the use of invoke: I don't believe that the use of invoke versus my suggested "branch on a trap predicate" idea are different in any truly meaningful way. I buy that either would work.
>> >
>> >
>> >
>> > A patchpoint should not require any excess spilling. If values are live in registers, that should be reflected in the stack map. (I do not know if this is the case for patchpoint at the moment or not.)
>> >
>> > Patchpoints do not require spilling.
>> >
>> > My point was that with existing patchpoints, you *either* use a patchpoint for the entire division which makes the division opaque to the optimizer - because a patchpoint means "this code can do anything" - *or* you could spill stuff to the stack prior to your trapping division intrinsic, since spilling is something that you could do as a workaround if you didn't have a patchpoint.
>> >
>> > The reason why I brought up spilling at all is that I suspect that spilling all state to the stack might be cheaper - for some systems - than turning the division into a patchpoint. Turning the division into a patchpoint is horrendously brutal - the patchpoint looks like it clobbers the heap (which a division doesn't do), has to execute (a division is an obvious DCE candidate), cannot be hoisted (hoisting divisions is awesome), etc. Perhaps most importantly, though, a patchpoint doesn't tell LLVM that you're *doing a division* - so all constant folding, strength reduction, and algebraic reasoning flies out the window. On the other hand, spilling all state to the stack is an arguably sound and performant solution to a lot of VM problems. I've seen JVM implementations that ensure that there is always a copy of state on the stack at some critical points, just because it makes loads of stuff simpler (debugging, profiling, GC, and of course deopt). I personally prefer to stay away from such a strategy because it's not free.
>> >
>> > On the other hand, branches can be cheap. A branch on a divide is cheaper than not being able to optimize the divide.
>> >
>> >
>> >
>> > The Value called by a patchpoint should participate in optimization normally.
>> >
>> > I agree that you could have a different intrinsic that behaves like this.
>> >
>> > We really want the patchpoint part of the call to be supplemental. It should still be a call. It should be constant propagated, transformed, etc.. This is not the case currently. I've got a couple of off the wall ideas for improving the current status, but I'll agree this is a hardish problem.
>> >
>> > It should be legal to use a patchpoint in an invoke. It's an ABI issue of how the invoke path gets invoked. (i.e. side tables for the runtime to lookup, etc..) This is not possible today, and probably requires a fair amount of work. Some of it, I've already done and will be sharing shortly. Other parts, I haven't even thought about.
>> >
>> > Right, it's significantly more complex than either the existing patchpoints or Michael's proposed safe.div.
>> >
>> >
>> >
>> > If you didn't want to use the trapping semantics, you'd insert dedicated control flow _before_ the divide. This would allow normal optimization of the control flow.
>> >
>> > Notes:
>> > 1) This might require a new PATCHPOINT pseudo op in the backend. Haven't thought much about that yet.
>> > 2) I *think* your current intrinsic could be translated into something like this. (Leaving aside the question of where the deopt state comes from.) In fact, the more I look at this, the less difference I actually see between the approaches.
>> >
>> >
>> >
>> > In a lot of languages, a divide produces some result even in the exceptional case and this result requires effectively deoptimizing since the result won't be the one you would have predicted (double instead of int, or BigInt instead of small int), which sort of means that if the CPU exception occurs you have to be able to reconstruct all state. A patchpoint could do this, and so could spilling all state to the stack before the divide - but both are very heavy hammers that are sure to be more expensive than just doing a branch.
>> >
>> > This isn't necessarily as expensive as you might believe. I'd recommend reading the Graal project papers on this topic.
>> >
>> > Whether deopt or branching is more profitable *in this case*, I can't easily say. I'm not yet to the point of being able to run that experiment. We can argue about what "should" be better all we want, but real performance data is the only way to truly know.
>> >
>> > My point may have been confusing. I know that deoptimization is cheap and WebKit uses it everywhere, including division corner cases, if profiling tells us that it's profitable to do so (which it does, in the common case). WebKit is a heavy user of deoptimization in general, so you don't need to convince me that it's worth it.
>> >
>> > Acknowledged.
>> >
>> > Note that I want *both* deopt *and* branching, because in this case, a branch is the fastest overall way of detecting when to deopt. In the future, I will want to implement the deopt in terms of branching, and when we do this, I believe that the most sound and performant approach would be using Michael's intrinsics. This is subtle and I'll try to explain why it's the case.
>> >
>> > The point is that you wouldn't want to do deoptimization by spilling state on the main path or by using a patchpoint for the main path of the division.
>> >
>> > This is the main point I disagree with. I don't believe that having a patchpoint on the main path should be any more expensive than the original call. (see above)
>> >
>> > The reason why the patchpoint is expensive is that if you use a patchpoint to implement a division then the optimizer won't be allowed to assume that it's a division, because the whole point of "patchpoint" is to tell the optimizer to piss off.
>> >
>> >
>> >
>> > Worth noting explicitly: I'm assuming that all of your deopt state would already be available for other purposes in nearby code. It's on the stack or in registers. I'm assuming that by adding the deopt point, you are not radically changing the set of computations which need to be done. If that's not the case, you should avoid deopt and instead just inline the slow paths with explicit checks.
>> >
>> > Yes, of course it is. That's not the issue.
>> >
>> > I'll note that given your assumptions about the cost of a patchpoint, the rest of your position makes a lot more sense. :) As I spelled out above, I believe this cost is not fundamental.
>> >
>> > You don't want the common path of executing the division to involve a patchpoint instruction, although using a patchpoint or stackmap to implement deoptimization on the failing path is great:
>> >
>> > Good: if (division would fail) { call @patchpoint(all of my state) } else { result = a / b }
>> >
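[Editorial note: the "Good" idiom above, written out as a hypothetical LLVM IR sketch. The stackmap ID, branch-weight metadata node, and block names are invented for illustration; the point is that the sdiv stays visible to the optimizer and the stackmap call sits only on the cold path.]

```llvm
entry:
  %is.zero = icmp eq i32 %b, 0
  %is.m1   = icmp eq i32 %b, -1
  %is.min  = icmp eq i32 %a, -2147483648
  %ovf     = and i1 %is.m1, %is.min
  %fail    = or i1 %is.zero, %ovf
  br i1 %fail, label %deopt, label %fast, !prof !0   ; !0: weights marking %deopt cold

fast:
  %res = sdiv i32 %a, %b                             ; ordinary, optimizable divide
  br label %cont

deopt:                                               ; OSR exit: record live state
  call void (i64, i32, ...) @llvm.experimental.stackmap(i64 42, i32 0, i32 %a, i32 %b)
  unreachable
```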
>> > Given your cost assumptions, I'd agree.
>> >
>> > Not my cost assumptions. The reason why this is better is that the division is expressed in LLVM IR so that LLVM can do useful things to it - like eliminate it, for example.
>> >
>> > Bad: call @patchpoint(all of my state) // patch with a divide instruction - bad because the optimizer has no clue what you're doing and assumes the very worst
>> >
>> > Yuck. Agreed.
>> >
>> > To be clear, this is what you're proposing - except that you're assuming that LLVM will know that you've patched a division because you're expecting the call target to have semantic meaning. Or, rather, you're expecting that you can make the contents of the patchpoint be a division by having the call target be a division intrinsic. In the current implementation and as it is currently specified, the call target has no meaning and so you get the yuck that I'm showing.
>> >
>> > Worse: spill all state to the stack; call @trapping.div(a, b) // the spills will hurt you far more than a branch, so this should be avoided
>> >
>> > I'm assuming this is an explicit spill rather than simply recording a stack map *at the div*. If so, agreed.
>> >
>> > I suppose we could imagine a fourth option that involves a patchpoint to pick up the state and a trapping divide intrinsic. But a trapping divide intrinsic alone is not enough. Consider this:
>> >
>> > result = call @trapping.div(a, b); call @stackmap(all of my state)
>> >
>> > As soon as these are separate instructions, you have no guarantees that the state that the stackmap reports is sound for the point at which the div would trap.
>> >
>> > This is the closest to what I'd propose, except that the two calls would be merged into a single patchpoint. Isn't the entire point of a patchpoint to record the stack map for a call?
>> >
>> > No. It would be bad if that's what the documentation says. That's not at all how WebKit uses it or probably any IC client would use it.
>> >
>> > Patchpoints are designed to be inline assembly on steroids. They're there to allow the client JIT to tell LLVM to piss off.
>> >
>> > (Well, ignoring the actual patching part..) Why not write this as:
>> > patchpoint(..., trapping.div, a, b);
>> >
>> > Is there something I'm missing here?
>> >
>> > Just to note: I fully agree that the two call proposal is unsound and should be avoided.
>> >
>> > So, the division itself shouldn't be a trapping instruction and instead you want to detect the bad case with a branch.
>> >
>> > To be clear:
>> >
>> > - Whether you use deoptimization for division or anything else - like WebKit has done since before any of the Graal papers were written - is mostly unrelated to how you represent the division, unless you wanted to add a new intrinsic that is like a trapping-division-with-stackmap:
>> >
>> > result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)
>> >
>> > Now, maybe you do want such an intrinsic, in which case you should propose it!
>> >
>> > Given what we already have with patchpoints, I don't think a merged intrinsic is necessary. (See above). I believe we have all the parts to build this solution, and that - in theory - they should compose neatly.
>> >
>> > p.s. The bits I was referring to were not deopt per se. It was particularly which set of deopt state you used for each deopt point. That's a bit of a tangent for the rest of the discussion now though.
>> >
>> > The reason why I haven't proposed it is that I think that long-term, the currently proposed intrinsics are a better path to getting the trapping optimizations. See my previous mail, where I show how we could tell LLVM what the failing path is (which may have deoptimization code that uses a stackmap or whatever), what the trapping predicate is (it comes from the safe.div intrinsic), and the fact that trapping is wise (branch weights).
>> >
>> > - If you want to do the deoptimization with a trap, then your only choice currently is to use a patchpoint for the main path of the division. This will be slower than using a branch to an OSR exit basic block, because you're making the division itself opaque to the optimizer (bad) just to get rid of a branch (which was probably cheap to begin with).
>> >
>> > Hence, what you want to do - one way or another, regardless of whether this proposed intrinsic is added - is to branch on the corner case condition, and have the slow case of the branch go to a basic block that deoptimizes. Unless of course you have profiling that says that the case does happen often, in which case you can have that basic block handle the corner case inline without leaving optimized code (FWIW, we do have such paths in WebKit and they are useful).
>> >
>> > So the question for me is whether the branching involves explicit control flow or is hidden inside an intrinsic. I prefer for it to be within an intrinsic because it:
>> >
>> > - allows the optimizer to do more interesting things in the common cases, like hoisting the entire division.
>> >
>> > - will give us a clearer path for implementing trapping optimizations in the future.
>> >
>> > - is an immediate win on ARM.
>> >
>> > I'd be curious to hear what specific idea you have about how to implement trap-based deoptimization with your trapping division intrinsic for x86 - maybe it's different from the two "bad" idioms I showed above.
>> >
>> > I hope my explanation above helps. If not, ask, and I'll try to explain more clearly.
>> >
>> > I think I understand it. I think that the only issue is that:
>> >
>> > - Patchpoints currently don't do what you want.
>> >
>> > - If you made patchpoints do what you want then you'd break WebKit - not to mention anyone who wants to use them for inline caches.
>> >
>> > So it seems like you want a new intrinsic. You should officially propose this new intrinsic, particularly since the core semantic differences are not so great from what we have now. OTOH, if you truly believe that patchpoints should just be changed to your semantics in a way that does break WebKit, then that's probably also something you should get off your chest. ;-)
>> >
>> >
>> >
>> > One point just for clarity; I don't believe this affects the conclusions of our discussion so far. I'm also fairly sure that you (Filip) are aware of this, but want to spell it out for other readers.
>> >
>> > You seem to be assuming that compiled code needs to explicitly branch to a point where deopt state is known to exit a compiled frame.
>> >
>> > This is a slightly unclear characterization of my assumptions. Our JIT does deoptimization without explicit branches for many, many things. You should look at it some time, it's pretty fancy. ;-)
>> >
>> > Worth noting is that you can also exit a compiled frame on a trap (without an explicit condition/branch!) if the deopt state is known at the point you take the trap. This "exit frame on trap" behavior shows up with null pointer exceptions as well. I'll note that both compilers in OpenJDK support some combination of "exit-on-trap" conditions for division and null dereferences. The two differ on exactly what strategies they use in each case though. :)
>> >
>> > Yeah, and I've also implemented VMs that do this - and I endorse the basic idea. I know what you want, and my only point is that the existing patchpoints only give you this if you're willing to make a huge compromise: namely, that you're willing to make the division (or heap load for the null case) completely opaque to the compiler to the point that GVN, LICM, SCCP, and all algebraic reasoning have to give up on optimizing it. The point of using LLVM is that it can optimize code. It can optimize branches and divisions pretty well. So, eliminating an explicit branch by replacing it with a construct that appears opaque to the optimizer is not a smart trade-off.
>> >
>> > You could add a new intrinsic that, like patchpoints, records the layout of state in a side-table, but that is used as a kind of wrapper for operations that LLVM understands. This may or may not be hairy - you seem to have sort of acknowledged that it's got some complexity and I've also pointed out some possible issues. If this is something that you want, you should propose it so that others know what you're talking about. One danger of how we're discussing this right now is that you're overloading patchpoints to mean the thing you want them to mean rather than what they actually mean, which makes it seem like we don't need Michael's intrinsics on the grounds that patchpoints already offer a solution. They don't already offer a solution precisely because patchpoints don't do what your intrinsics would do.
>> >
>> >
>> >
>> > I'm not really arguing that either scheme is "better" in all cases. I'm simply arguing that we should support both and allow optimization and tuning between them. As far as I can tell, you seem to be assuming that an explicit branch to a known exit point is always superior.
>> >
>> > Ok, back to the topic at hand...
>> >
>> > With regards to the current proposal, I'm going to take a step back. You guys seem to have already looked into this in a fair amount of depth. I'm not necessarily convinced you've come to the best solution, but at some point, we need to make forward progress. What you have is clearly better than nothing.
>> >
>> > Please go ahead and submit your current approach. We can come back and revise later if we really need to.
>> >
>> > I do request the following changes:
>> > - Mark it clearly as experimental.
>> > - Either don't specify the value computed in the edge cases, or allow those values to be specified as constant arguments to the call. This allows efficient lowering to x86's div instruction if you want to make use of the trapping semantics.
>> >
>> > Once again: how would you use this to get trapping semantics without throwing all of LLVM's optimizations out the window, in the absence of the kind of patchpoint-like intrinsic that you want? I ask just to make sure that we're on the same page.
>> >
>> >
>> >
>> > Finally, as for performance data, which part of this do you want performance data for? I concede that I don't have performance data for using Michael's new intrinsic. Part of what the intrinsic accomplishes is it gives a less ugly way of doing something that is already possible with target intrinsics on ARM. I think it would be great if you could get those semantics - along with a known-good implementation - on other architectures as well.
>> >
>> > I would be very interested in seeing data comparing two schemes:
>> > - Explicit control flow emitted by the frontend
>> > - The safe.div intrinsic emitted by the frontend, desugared in CodeGenPrep
>> >
>> > My strong suspicion is that each would perform well in some cases and not in others. At least on x86. Since the edge-checks are essentially free on ARM, the second scheme would probably be strictly superior there.
>> >
>> > I am NOT asking that we block submission on this data however.
>> >
>> > But this discussion has also involved suggestions that we should use trapping to implement deoptimization, and the main point of my message is to strongly argue against anything like this given the current state of trapping semantics and how patchpoints work. My point is that using traps for division corner cases would either be unsound (see the stackmap after the trap, above), or would require you to do things that are obviously inefficient. If you truly believe that the branch to detect division slow paths is more expensive than spilling all bytecode state to the stack or using a patchpoint for the division, then I could probably hack something up in WebKit to show you the performance implications. (Or you could do it yourself, the code is open source...)
>> >
>> > In a couple of months, I'll probably have the performance data to discuss this for real. When that happens, let's pick this up and continue the debate. Alternatively, if you want to chat this over more with a beer in hand at the social next week, let me know. In the meantime, let's not stall the current proposal any more.
>> >
>> > Philip
>> >
>> >