As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal. I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment. One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal. I don't really see a clean way. Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely. (And rely on hardware traps w/exceptions instead.) I don't really see a way to represent that type of construct given the current proposal. Another point that is bothering me is how these intrinsics will interact with loop invariant code motion, and other condition check optimizations. We have the potential to hide control flow from optimizations which would otherwise remove it. I'm not stating these /aren't/ beneficial. I'm simply stating that I remain somewhat unconvinced. They seem like clear wins on some architectures (i.e. ARM, where the hardware includes the specific checks already), but not on others (i.e. x86 with it's exceptions). Given the above, I'm going withdraw my active support from the current proposal. (Note: This does not mean that I'm going to actively oppose it either!) Let me make a counter proposal of my own. 1) Let's not add generic intrinsics at this time. Instead, let's add cpu specific intrinsics with the exact semantics of the associated div instructions. a) for ARM, this would be the semantics of the current proposed instruction (right?) b) for x86, this would be an instruction which is defined to fault on the corner cases (the resume behaviour would /not/ be specified for the moment) 1b) Teach the optimizer about the semantics of each - if we do go with a unified safe.div instruction later, this effort could be mostly reused. 2) Add optimizations (fairly late) which fold generic-div+control flow into machine specific div instructions. 3) Frontends that want machine specific semantics (i.e. trap or defined edge cases), can use the machine specific intrinsics directly. 4) We provide a set of IR helper functions which implement safe division in what we believe to be the best way. This can be different for each supported platform. This could either be the form of sample IR, or even as functions on IRBuilder. 5) We spend some time optimizing the IR generated by (4) and use that to inform our discussion about what is truly necessary from a performance standpoint. Once this is in place, we can come back to the discussion of what an appropriate generic safe div intrinsic would look like. In particular, we would have the performance data at hand to lay to rest the concerns raised by myself and Eric. My strong suspicion is that the counter proposal will be "good enough" in practice for most users. p.s. If we do go ahead with the current proposal, I'm going to *strongly* request they be marked experimental. I think we'll need to iterate on these a few times. Philip On 04/29/2014 04:17 AM, Michael Zolotukhin wrote:> Hi, > >> I am very much in favor of having a div instruction with well defined >> div-by-zero and overflow behavior. > Actually, an interest in these intrinsics has been shown for quite a > long time, see for example thread [1]. > > I updated the patch, now the intrinsics return {iN, i1, i1} to > separately indicate division by zero and overflow events. Though for > unsigned version only division by zero is relevant, I decided to keep > three elements in return structure for consistency (the last one is > always 0 in this case). With this change various front-ends can take > advantage of the intrinsics, depending on desired semantics. > > Also I changed semantics of srem intrinsics to better match semantics > of, for instance, Go. The original proposal stated that > safe.srem(min<T>,-1) = min<T>, but it makes more sense to return 0 > here. Thus, we'll keep invariant X = div(X,Y)*Y + rem(X,Y). > > > > > > > Does the patch look good? I'd be glad to hear any remarks and comments > regarding the patch, and adjust it correspondingly. > > [1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/060955.html > > Best regards, > Michael > > On Apr 27, 2014, at 2:45 AM, Keno Fischer > <kfischer at college.harvard.edu <mailto:kfischer at college.harvard.edu>> > wrote: > >> I am very much in favor of having a div instruction with well defined >> div-by-zero and overflow behavior. The undefined behavior on certain >> values for LLVM intrinsics has been a major pain point for us in >> Julia, because adding the extra branches just kills performance and >> we know that there is an X86 instruction that just does what we want. >> Anyway, this was brought up briefly above, but want to bring back the >> discussion of a div instruction that reliably traps, as that seems to >> me to be one of the only ways to get adequate performance in the >> common case. This can probably be done as suggested above with >> matching on the trap condition of this instruction, but I think while >> we're at it, we should tackle that case as well. >> >> Thanks, >> Keno >> >> >> On Sat, Apr 26, 2014 at 5:11 PM, Michael Zolotukhin >> <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote: >> >> >> Hi, >> >> Thanks to everybody for the very informative feedback! Due to >> difference in timezones, I usually miss the most active time of >> discussions, however I'll try to cover the raised questions here >> and share my vision on them. >> >> Generally speaking, we are trying to find answers to the >> following two questions: >> o Are these intrinsics useful in general? >> o If yes, when to lower them to actual IR/DAG? >> >> In a nutshell, my answers are: >> o Yes, they are useful, and useful for all targets. Not >> only for ARM64, where they are especially useful. >> o After loop optimizations but before lowering IR to DAG. >> Currently it's in CodeGenPrepare, but we might lower them before >> that if there is any need in this. >> >> I'll try to explain why I think so. For the beginning, let's >> consider a simple expression div=x/y in different languages and >> write the corresponding IR with use of the intrinsics (here I >> omit types for conciseness, assuming result type of the safe.sdiv >> intrinsic being {i32, i1, i1}) : >> >> *** C/C-like *** >> %div = sdiv %x, %y >> >> *** Java, Go *** >> %divr = call %safe.sdiv(%x, %y) >> %div = extractvalue %divr, 0 >> %divbyzero_bit = extractvalue %divr, 1 >> br i1 %divbyzero_bit, label %throw, label %normal >> >> *** JavaScript, with "|0" *** >> %divr = call %safe.sdiv(%x, %y) >> %div = extractvalue %divr, 0 >> >> *** JavaScript without "|0", Ruby and maybe others *** >> %divr = call %safe.sdiv(%x, %y) >> %div = extractvalue %divr, 0 >> %divbyzero_bit = extractvalue %divr, 1 >> %ovf_bit = extractvalue %divr, 2 >> %exceptional_case_bit = or %ovf_bit, %divbyzero_bit >> br i1 %exceptional_case_bit, label %handle_exceptional_case, >> label %normal >> >> >> Now let's write the IR for the same expression without the >> intrinsics: >> >> *** C/C-like *** >> %div = sdiv %x, %y >> >> *** Java, Go *** >> %divbyzero_bit = cmp %y, 0 >> br i1 %divbyzero_bit, label %throw, label %normal >> normal: >> %div = sdiv %x, %y >> >> *** JavaScript, with "|0" *** >> %divbyzero_bit = cmp %y, 0 >> br i1 %divbyzero_bit, label %division_by_zero, label %chkovf >> division_by_zero: >> br label %end >> chkovf: >> %tmin_bit = cmp %x, min<T> >> %minus_one_bit = cmp %y, -1 >> %ovf_bit = and %tmin_bit, %minus_one_bit >> br i1 %ovf_bit, label %ovf, label %normal >> ovf: >> br label %end >> normal: >> %div.0 = sdiv %x, %y >> br label %end >> end: >> %div = phi [%div.0, %normal], [min<T>, %ovf], [0, >> %division_by_zero] >> >> *** JavaScript without "|0", Ruby and maybe others *** >> %divbyzero_bit = cmp %y, 0 >> br i1 %divbyzero_bit, label %throw, label %chkovf >> chkovf: >> %tmin_bit = cmp %x, min<T> >> %minus_one_bit = cmp %y, -1 >> %ovf_bit = and %tmin_bit, %minus_one_bit >> br i1 %ovf_bit, label %ovf, label %normal >> ovf: >> ; Special handling of min<T>/-1: converting to BigInt, double >> or something other >> unreachable >> normal: >> %div.0 = sdiv %x, %y >> >> As one can see, if the intrinsic is lowered to some CFG (i.e. on >> x86), there is nor loss neither win: due to the implemented >> optimizations the final code would be roughly the same as in the >> case with explicitly written CFG. >> >> If the intrinsic could be directly lowered to a single >> instruction (i.e. on ARM64 and if we don't bother about >> divz/ovf-flag), use of the intrinsic would definitely be >> beneficial - that takes place for the <JavaScript, with "|0"> case. >> >> That might make one think that the intrinsics are useful only for >> ARM64, and in general aren't that useful at all. But in fact even >> when it can't be lowered to a single instruction, keeping the >> control flow 'inside' the intrinsic till its lowering might be >> beneficial. For instance, loop passes work much better on loops >> without control flow: IndVars might be a good example of such >> pass. And, for instance, it seems much easier to vectorize a loop >> with call to one of these intrinsics than vectorize a loop with >> explicitly created control flow (and it could become even worse >> if there are several calls to the intrinsics in the same loop). >> >> In some cases we might want to lower these intrinsics earlier >> than in CGP - as Reid mentioned, it could help us to hoist checks >> out of loops. That's true but we need to keep in mind possible >> disadvantages of such early lowering (see previous paragraph). >> But in general I don't see any problems with allowing earlier >> lowering. >> >> As for extending the result structure to keep one more flag >> (being that i2 instead of i1, or one more i1 flag) - that seems >> fine, and both options {iN, i2} and {iN, i1, i1} look good to me. >> >> Now, why not to lower these intrinsics even later? Almost for all >> targets we want to get very similar CFG, and I don't see any >> benefits of duplicating very similar code all across different >> backends. Even on ARM64 we need to have control flow in some >> cases (e.g. for Java) - so we actually win nothing from making >> lowering completely target-specific. >> >> I hope that covers the raised concerns, at least to some extent. >> Does that sound reasonable enough? If so, I'll prepare and post >> here an updated version of the patch. >> >> Thanks, >> Michael >> >> >> On Apr 26, 2014, at 4:04 AM, Andrew Trick <atrick at apple.com >> <mailto:atrick at apple.com>> wrote: >> >> > >> > On Apr 25, 2014, at 2:21 PM, Eric Christopher >> <echristo at gmail.com <mailto:echristo at gmail.com>> wrote: >> > >> >>> In short, I agree with your observations that these >> intrinsics are not an >> >>> obvious slam-dunk compared to making the explicit control >> flow, but I think >> >>> that the intrinsics do give enough flexibility on the LLVM >> side that it >> >>> would be great if front-ends used them rather than rolling >> the control flow >> >>> themselves. >> >>> >> >> >> >> The problem is that then we have 2 problems: All targets >> (except for >> >> arm64) then have to lower the intrinsic as the first thing they do >> >> (giving us a TTI pass as the first thing in the pipeline) to take >> >> advantage of the information later during optimization, _and_ >> we have >> >> to plumb all of the work optimizing the intrinsic as well >> giving us a >> >> situation where we've now split our optimization efforts as >> well as >> >> the pain of maintaining an intrinsic that's useful for a single >> >> target. >> >> >> >> I really think that this is just solidifying my position that the >> >> intrinsic is a bad idea and that this should be done as later >> >> optimizations. >> > >> > The goal of the intrinsic wasn't stated clearly. >> > >> > This intrinsic isn't particularly necessary for any specific >> frontend >> > or architecture. I think the LLVM community would benefit from a >> > canonical way to represent well-defined integer division. We don't >> > need to add an IR instruction, because it's fine for most >> optimization >> > passes to ignore these operations. A target independent intrinsic >> > works well as long as: >> > >> > - It cleanly captures the language level semantics. >> > >> > - It facilitates mid-level optimization. >> > >> > - It naturally lowers into ideal code both on architectures >> that trap >> > (x86) and those that don't (arm). >> > >> > To summarize my understanding of the concerns: >> > >> > (1) The semantics of the safe.div intrinsic need to be useful >> for the >> > Language/ISA Matrix that LLVMers care about. >> > >> > At canonical IR level, the intrinsic is useful by eliminating >> control >> > flow merges and representing divide-by-zero and/or signed >> overflow in >> > a canonical form: >> > >> > %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b) >> > %bit = extractvalue {i32, i1} %res, 1 >> > br i1 %bit, label %trap, label %continue >> > trap: >> > call ... >> > unreachable >> > >> > continue: >> > %div = extractvalue {i32, i1} %res, 0 >> > >> > The original proposal fails to achieve this because the common >> case of >> > Java/Go would require a check in the slow-path to differentiate >> > divide-by-zero from signed overflow. That should be fixed by >> > generalizing the intrinsic so that the two conditions are distinct: >> > >> > %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b) >> > %div0 = extractvalue {i32, i1} %res, 1 >> > br i1 %div0, label %trap, label %checkovf >> > >> > checkovf: >> > %ovf = extractvalue {i32, i1} %res, 2 >> > br i1 %div0, label %trap, label %continue >> > >> > trap: >> > call ... >> > unreachable >> > >> > continue: >> > %div = extractvalue {i32, i1} %res, 0 >> > >> > ...or some variation of the above. I don't have a personal >> stake in this. >> > >> > (2) The safe.div intrinsic inhibits generic code motion and >> Correlated >> > Value Prop based optimization. >> > >> > This goes both ways. >> > >> > CVP could miss out on cases unless we teach it about the >> semantics of >> > the intrinsics. Correct me if I'm wrong, but I don't think this >> would >> > actually be too difficult. >> > >> > OTOH, with the intrinsics, it would be easy to write a simple >> > optimization pass that hoists and combines checks along the >> domtree. >> > >> > After divide-by-zero optimization, you would have something like: >> > >> > %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b) >> > %div0 = extractvalue {i32, i1} %res, 1 >> > br i1 %div0, label %trap, label %continue >> > >> > trap: >> > # No call here! >> > unreachable >> > >> > continue: >> > %div = extractvalue {i32, i1} %res, 0 >> > >> > And the branch to trap just goes away at some point. >> > >> > Now considering Reid's LICM example: >> > >> > for (int i = 0; i < n; ++i) { >> > if (b == 0) c[i] = 0; >> > else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN; >> > else c[i] = a[i] / b; >> > } >> > >> > Simply put, if we want to unswitch this code for x86, then we >> need a >> > TTI pass to lower the intrinsic. On the other hand, I believe it is >> > more important to eliminate the control flow within the loop to aid >> > loop analysis and other optimizations. So we still want the >> front end >> > to emit the intrinsic, we just may eventually want it lowered >> earlier >> > than CGP. I don't think this issue has any bearing on the >> intrinsic's >> > LangRef spec. >> > >> > There was some comparison made to iload/istore, which I don't >> > follow: >> > - subsuming loads and stores into another instruction is really >> scary >> > considering how much logic we have for analyzing the side >> effects of >> > memory access. >> > - there is no benefit to IR optimization to the iload/istore >> intruction. >> > - it is easy to detect null checked load/stores in IR at any point. >> > - it is very rare that a platform would want to resume from >> trapping load/store. >> > >> > (3) The safe.div intrinsic is a target-specific codegen >> optimization. >> > >> > Regardless of the target, we want to eliminate control flow >> merges in >> > all cases, and completely eliminate control flow when we don't have >> > trapping semantics. To do this, we need an intrinsic at >> canonical IR >> > level. Clearly it should be a target independent intrinsic >> > since *some* optimizations/analyses want to be aware of it. >> > >> > Even ignoring the rationale above, supporting codegen with a >> > target-specific intrinsic would minimally require the DAG >> builder to >> > match this >> > "if (b != 0) ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0)" >> > >> > after all optimizations have had their way with it. A big >> problem here >> > is that there are actually many different forms of this >> expression and >> > surrounding control flow, depending on the frontend's integer >> division >> > semantics. We would have to recognize all the variations of >> checking for >> > divide-by-zero and/or overflow, trapping and/or producing a certain >> > constant, branching or selecting values. This is obviously terrible >> > from an engineering standpoint. >> > >> > -Andy >> > _______________________________________________ >> > LLVM Developers mailing list >> > LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> >> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> >> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >> >> >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> >> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >> >> >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/53e2cbb4/attachment.html>
LD;DR: Your desire to use trapping on x86 only further convinces me that Michael's proposed intrinsics are the best way to go. On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com) wrote: As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal. I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment. One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal. I don't really see a clean way. Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely. (And rely on hardware traps w/exceptions instead.) I don't really see a way to represent that type of construct given the current proposal. This is a deeper problem and to solve it you'd need a solution to trapping in general. Let's consider the case of Java. A Java program may want to catch the arithmetic exception due to divide by zero. How would you do this with a trap in LLVM IR? Spill all state that is live at the catch? Use a patchpoint for the entire division instruction? In a lot of languages, a divide produces some result even in the exceptional case and this result requires effectively deoptimizing since the resut won't be the one you would have predicted (double instead of int, or BigInt instead of small int), which sort of means that if the CPU exception occurs you have to be able to reconstruct all state. A patchpoint could do this, and so could spilling all state to the stack before the divide - but both are very heavy hammers that are sure to be more expensive than just doing a branch. For these reasons, WebKit has long gone with the approach of emitting branches even in our custom JITs; they are cheaper than the costs incurred by spilling or other craziness. My bet is that the cheapest implementation in LLVM will involve some kind of branching on x86. Another point that is bothering me is how these intrinsics will interact with loop invariant code motion, and other condition check optimizations. We have the potential to hide control flow from optimizations which would otherwise remove it. I'm not stating these aren't beneficial. I'm simply stating that I remain somewhat unconvinced. They seem like clear wins on some architectures (i.e. ARM, where the hardware includes the specific checks already), but not on others (i.e. x86 with it's exceptions). I don't follow this. One of the proposed intrinsic's advantages is precisely to help with things like loop invariant code motion by not emitting confusing control flow early. As for condition check optimizations, this has too low of profit for me to consider. It's true that if you can prove that one of the operands to a division is a constant, then you probably don't want to emit all of the branches, and surely the proposed intrinsic allows for this. But you're less likely to see code dividing by the same value repeatedly. Or so goes my experience with this, anyway. Given the above, I'm going withdraw my active support from the current proposal. (Note: This does not mean that I'm going to actively oppose it either!) Let me make a counter proposal of my own. 1) Let's not add generic intrinsics at this time. Instead, let's add cpu specific intrinsics with the exact semantics of the associated div instructions. a) for ARM, this would be the semantics of the current proposed instruction (right?) b) for x86, this would be an instruction which is defined to fault on the corner cases (the resume behaviour would not be specified for the moment) I don't like 1.b because the only thing you'd be able to do with the fault in the runtime is: - Unwind the entire function if you're careful enough and you don't mind losing all of its state. This woud only apply to Java-like languages, and only if they didn't have any finally/catch(all)/catch(arithmetic) statements in the surrounding scope. - "interpret" the division to get your own semantics and resume execution, if your language allows for division to return an integer in the corner cases (i.e. JavaScript's (a/b)|0). I concede that a trapping intrinsic would allow you to make this one case work, in cases where your profiling told you that the corner cases are rare. - Arrange to spill all state to the stack just to be able to throw an exception and/or deoptimize - i.e. you're crazy enough to think that the branch you just saved was more expensive than a bunch of stack traffic. I believe that the currently proposed intrinsics give us more of what we want on x86, than does your proposal, since they are more general, and since the branches will be cheaper than spilling everything to the stack. To be clear, trap-based optimizations are something that we may eventually want to support but they require more magic unrelated to the specific issues of division. I suspect that the nicest way of expressing trap-based optimizations in IR is to: - Have a branch in IR that is weighted in such a way as to convey to LLVM that we believe that one side is never taken and that therefore implementing that side in terms of traps is OK. - The thing being branched on is a predicate that is known to LLVM to be something that can be trapped on. Notice that the proposed intrinsics are *the best* way of achieving this, once we have a way of optimizing such branches. The LLVM backend will know that the i1's returned from the safe.div intrinsic are "trappable" on x86, and the frontend can arrange to weight the branches on those i1's appropriately if it wants a trap-based implementation. Better yet, the frontend could simply feed whatever profiling it wants ("I saw the corner case X times and the main case Y times") and LLVM can decide what to do on its own. The fact that the machanism by which execution is diverted to the "trapping" basic block is by way of a trap could even be contained entirely within LLVM's innards. This could be used to support not only the unwinding semantics in Java on div/0, but also the (a/b)|0 in JS, and all of the others. But first we need an intrinsic that reveals some predicate that is associated with the trapping condition, rather than having an intrinsic that is defined to simply trap. (Notice also how my approach to exposing trapping in LLVM could handle things like null checks - if LLVM sees a branch on a pointer to see if it's null in the close vicinity of a memory access, and the branch is weighted 100% in favor of non-null, then it can "fold" the branch and load together and use a trap. I imagine this being super easy to implement and I'd be happy to say more about how I'd do it. So, to me, the issue with division isn't that we don't know how to trap on it - it's that we don't have a crystal-clear way of revealing the trapping conditions in IR.) 1b) Teach the optimizer about the semantics of each - if we do go with a unified safe.div instruction later, this effort could be mostly reused. Except for x86, where your proposal above doesn't work. 2) Add optimizations (fairly late) which fold generic-div+control flow into machine specific div instructions. This is purely aspirational. There are many ways of doing the control flow for the division since there is quite a bit of it. Are you going to pattern match all of them? What if you miss a case because one of the many IR optimizations transformed it into something that is no longer recognizable? Or are you going to impose rules for canonicalizing any control flow around any division to ensure that a div-by-zero and INT_MIN/-1 check never stops looking like the pattern that you wanted it to look like? To me this sounds kind of crazy. Once the frontend has determined that it wants safe division, it should be able to say this explicitly enough that this information is preserved through compilation. 3) Frontends that want machine specific semantics (i.e. trap or defined edge cases), can use the machine specific intrinsics directly. 4) We provide a set of IR helper functions which implement safe division in what we believe to be the best way. This can be different for each supported platform. This could either be the form of sample IR, or even as functions on IRBuilder. 5) We spend some time optimizing the IR generated by (4) and use that to inform our discussion about what is truly necessary from a performance standpoint. Once this is in place, we can come back to the discussion of what an appropriate generic safe div intrinsic would look like. In particular, we would have the performance data at hand to lay to rest the concerns raised by myself and Eric. My strong suspicion is that the counter proposal will be "good enough" in practice for most users. p.s. If we do go ahead with the current proposal, I'm going to *strongly* request they be marked experimental. I think we'll need to iterate on these a few times. Philip -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/da416eed/attachment.html>
On 04/29/2014 10:44 AM, Filip Pizlo wrote:> LD;DR: Your desire to use trapping on x86 only further convinces me > that Michael's proposed intrinsics are the best way to go.I'm still not convinced, but am not going to actively oppose it either. I'm leery of designing a solution with major assumptions we don't have data to backup. I worry your assumptions about deoptimization are potentially unsound. But I don't have data to actually show this (yet).> > On April 29, 2014 at 10:09:49 AM, Philip Reames > (listmail at philipreames.com <mailto:listmail at philipreames.com>) wrote: > >> As the discussion has progressed and I've spent more time thinking >> about the topic, I find myself less and less enthused about the >> current proposal. I'm in full support of having idiomatic ways to >> express safe division, but I'm starting to doubt that using an >> intrinsic is the right way at the moment. >> >> One case I find myself thinking about is how one would combine >> profiling information and implicit div-by-zero/overflow checks with >> this proposal. I don't really see a clean way. Ideally, for a "safe >> div" which never has the exceptional paths taken, you'd like to >> completely do away with the control flow entirely. (And rely on >> hardware traps w/exceptions instead.) I don't really see a way to >> represent that type of construct given the current proposal. > > This is a deeper problem and to solve it you'd need a solution to > trapping in general. Let's consider the case of Java. A Java program > may want to catch the arithmetic exception due to divide by zero. How > would you do this with a trap in LLVM IR? Spill all state that is > live at the catch? Use a patchpoint for the entire division instruction? >We'd likely use something similar to a patchpoint. You'd need the "abstract vm state" (which is not the compiled frame necessarily) available at the div instruction. You could then re-enter the interpreter at the specified index (part of the vm state). We have all most of these mechanisms in place. Ideally, you'd trigger a recompile and otherwise ensure re-entry into compiled code at the soonest possible moment. This requires a lot of runtime support, but we already have most of it implemented for another compiler. From our perspective, the runtime requirements are not a major blocker.> > In a lot of languages, a divide produces some result even in the > exceptional case and this result requires effectively deoptimizing > since the resut won't be the one you would have predicted (double > instead of int, or BigInt instead of small int), which sort of means > that if the CPU exception occurs you have to be able to reconstruct > all state. A patchpoint could do this, and so could spilling all > state to the stack before the divide - but both are very heavy hammers > that are sure to be more expensive than just doing a branch. >This isn't necessarily as expensive as you might believe. I'd recommend reading the Graal project papers on this topic. Whether deopt or branching is more profitable *in this case*, I can't easily say. I'm not yet to the point of being able to run that experiment. We can argue about what "should" be better all we want, but real performance data is the only way to truly know.> > For these reasons, WebKit has long gone with the approach of emitting > branches even in our custom JITs; they are cheaper than the costs > incurred by spilling or other craziness. My bet is that the cheapest > implementation in LLVM will involve some kind of branching on x86. >You may be right. I simply don't want to design a system which assumes you are until I have data in hand.>> >> >> Another point that is bothering me is how these intrinsics will >> interact with loop invariant code motion, and other condition check >> optimizations. We have the potential to hide control flow from >> optimizations which would otherwise remove it. I'm not stating >> these/aren't/beneficial. I'm simply stating that I remain somewhat >> unconvinced. They seem like clear wins on some architectures (i.e. >> ARM, where the hardware includes the specific checks already), but >> not on others (i.e. x86 with it's exceptions). > > I don't follow this. One of the proposed intrinsic's advantages is > precisely to help with things like loop invariant code motion by not > emitting confusing control flow early. >Er, not sure about this. Consider: for(int i = 0; i < n; i++) { a[i] = a[i] / b; } Yes, if b is a constant the conditions immediately fail away. If you can't prove the bounds on b, you'd still like to transform this to: if( b == 0 ) throw; for(int i = 0; i < n; i++) { if( a[i] == INT_MIN && b == -1) throw; a[i] = a[i] / b; } You might even want to split the loop into two versions (i.e. b is -1, otherwise). These transforms wouldn't matter on ARM, but do matter on x86. Even if you're using a deopt scheme mentioned above, this still could be profitable. The frequencies of the two edge cases aren't necessarily correlated. (i.e. you could be trying to divide by zero a lot)> As for condition check optimizations, this has too low of profit for > me to consider. It's true that if you can prove that one of the > operands to a division is a constant, then you probably don't want to > emit all of the branches, and surely the proposed intrinsic allows for > this. But you're less likely to see code dividing by the same value > repeatedly. Or so goes my experience with this, anyway. >I suspect your workloads may differ from mine in this area. I also need to restate: I haven't spent much time looking at this yet. I don't actually know if any of this matters. Looking through the source code for the existing compilers, I suspect it does. But that's not conclusive evidence. p.s. w.r.t to terminology, I was being somewhat sloppy. By condition check optimization, I meant everything from simplifycfg, cvp, and others.>> >> >> Given the above, I'm going withdraw my active support from the >> current proposal. (Note: This does not mean that I'm going to >> actively oppose it either!) >> >> Let me make a counter proposal of my own. >> >> 1) Let's not add generic intrinsics at this time. Instead, let's add >> cpu specific intrinsics with the exact semantics of the associated >> div instructions. >> a) for ARM, this would be the semantics of the current proposed >> instruction (right?) >> b) for x86, this would be an instruction which is defined to fault >> on the corner cases (the resume behaviour would/not/be specified for >> the moment) > > I don't like 1.b because the only thing you'd be able to do with the > fault in the runtime is: > > - Unwind the entire function if you're careful enough and you don't > mind losing all of its state. This woud only apply to Java-like > languages, and only if they didn't have any > finally/catch(all)/catch(arithmetic) statements in the surrounding scope. > > - "interpret" the division to get your own semantics and resume > execution, if your language allows for division to return an integer > in the corner cases (i.e. JavaScript's (a/b)|0). I concede that a > trapping intrinsic would allow you to make this one case work, in > cases where your profiling told you that the corner cases are rare. > > - Arrange to spill all state to the stack just to be able to throw an > exception and/or deoptimize - i.e. you're crazy enough to think that > the branch you just saved was more expensive than a bunch of stack > traffic. > > I believe that the currently proposed intrinsics give us more of what > we want on x86, than does your proposal, since they are more general, > and since the branches will be cheaper than spilling everything to the > stack. >I was trying to avoid getting into all of the details around deopt and resuming from traps. :) I agree this is a complex area, and frankly, it's not one I've explored enough in LLVM to have strong opinions on yet.> > To be clear, trap-based optimizations are something that we may > eventually want to support but they require more magic unrelated to > the specific issues of division. I suspect that the nicest way of > expressing trap-based optimizations in IR is to: > > - Have a branch in IR that is weighted in such a way as to convey to > LLVM that we believe that one side is never taken and that therefore > implementing that side in terms of traps is OK. > > - The thing being branched on is a predicate that is known to LLVM to > be something that can be trapped on. >I agree with the above. I think. (I reserve the right to change my mind later...)> > Notice that the proposed intrinsics are *the best* way of achieving > this, once we have a way of optimizing such branches. The LLVM > backend will know that the i1's returned from the safe.div intrinsic > are "trappable" on x86, and the frontend can arrange to weight the > branches on those i1's appropriately if it wants a trap-based > implementation. Better yet, the frontend could simply feed whatever > profiling it wants ("I saw the corner case X times and the main case Y > times") and LLVM can decide what to do on its own. The fact that the > machanism by which execution is diverted to the "trapping" basic block > is by way of a trap could even be contained entirely within LLVM's > innards. >I agree this sounds like a potentially workable design. OT: Do we want to be implementing the trapping logic inside LLVM? I would guess not. I actually agree with you on the general approach. However, I'm not sure we're ready to get into this yet. That's why I was proposing a simpler starting point. My concern is how to way to balance the IR level CFG optimizations above with the trapping semantics and profiling data. Until someone actually builds something, this is all *really* speculative. I'd like to not lock down a design until we have *data* and *experience in LLVM*.> > This could be used to support not only the unwinding semantics in Java > on div/0, but also the (a/b)|0 in JS, and all of the others. But > first we need an intrinsic that reveals some predicate that is > associated with the trapping condition, rather than having an > intrinsic that is defined to simply trap. > > (Notice also how my approach to exposing trapping in LLVM could handle > things like null checks - if LLVM sees a branch on a pointer to see if > it's null in the close vicinity of a memory access, and the branch is > weighted 100% in favor of non-null, then it can "fold" the branch and > load together and use a trap. I imagine this being super easy to > implement and I'd be happy to say more about how I'd do it. So, to > me, the issue with division isn't that we don't know how to trap on it > - it's that we don't have a crystal-clear way of revealing the > trapping conditions in IR.) >Agreed, particularly with the last statement.>> >> 1b) Teach the optimizer about the semantics of each - if we do go >> with a unified safe.div instruction later, this effort could be >> mostly reused. > > Except for x86, where your proposal above doesn't work. >Er, how so? Same as the reasons we've discussed? Or different?>> >> 2) Add optimizations (fairly late) which fold generic-div+control >> flow into machine specific div instructions. > > This is purely aspirational. There are many ways of doing the control > flow for the division since there is quite a bit of it. Are you going > to pattern match all of them? What if you miss a case because one of > the many IR optimizations transformed it into something that is no > longer recognizable? Or are you going to impose rules for > canonicalizing any control flow around any division to ensure that a > div-by-zero and INT_MIN/-1 check never stops looking like the pattern > that you wanted it to look like? > > To me this sounds kind of crazy. Once the frontend has determined > that it wants safe division, it should be able to say this explicitly > enough that this information is preserved through compilation. >You don't need to recognize all patterns; you don't even want to. See above for LICM and related. I suspect this is less hard than it initially sounds. Particularly since you really only need to identify cases in hot code. If the control flow was moved outside of the loop, it is almost certainly not hot any more. Note: This scheme is definitely biased towards x86. It probably is NOT the optimal scheme for ARM. I freely acknowledge that. I'm just not happy with the way the previous proposal seems to prefer ARM over x86.>> >> 3) Frontends that want machine specific semantics (i.e. trap or >> defined edge cases), can use the machine specific intrinsics directly. >> 4) We provide a set of IR helper functions which implement safe >> division in what we believe to be the best way. This can be >> different for each supported platform. This could either be the form >> of sample IR, or even as functions on IRBuilder. >> 5) We spend some time optimizing the IR generated by (4) and use that >> to inform our discussion about what is truly necessary from a >> performance standpoint. >> >> Once this is in place, we can come back to the discussion of what an >> appropriate generic safe div intrinsic would look like. In >> particular, we would have the performance data at hand to lay to rest >> the concerns raised by myself and Eric. My strong suspicion is that >> the counter proposal will be "good enough" in practice for most users. >> >> p.s. If we do go ahead with the current proposal, I'm going to >> *strongly* request they be marked experimental. I think we'll need >> to iterate on these a few times. >> >> Philip >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/5ff30a1a/attachment.html>
Possibly Parallel Threads
- [LLVMdev] Proposal: add intrinsics for safe division
- [LLVMdev] Proposal: add intrinsics for safe division
- [LLVMdev] Proposal: add intrinsics for safe division
- [LLVMdev] Proposal: add intrinsics for safe division
- [LLVMdev] Proposal: add intrinsics for safe division