thr3ads.net - llvm dev - [LLVMdev] Proposal: add intrinsics for safe division [Apr 2014]

If this information is useful, please help other people find it:
Share via:

Keno Fischer

2014-Apr-26 22:45 UTC

[LLVMdev] Proposal: add intrinsics for safe division

I am very much in favor of having a div instruction with well defined
div-by-zero and overflow behavior. The undefined behavior on certain values
for LLVM intrinsics has been a major pain point for us in Julia, because
adding the extra branches just kills performance and we know that there is
an X86 instruction that just does what we want. Anyway, this was brought up
briefly above, but want to bring back the discussion of a div instruction
that reliably traps, as that seems to me to be one of the only ways to get
adequate performance in the common case. This can probably be done as
suggested above with matching on the trap condition of this instruction,
but I think while we're at it, we should tackle that case as well.

Thanks,
Keno


On Sat, Apr 26, 2014 at 5:11 PM, Michael Zolotukhin
<mzolotukhin at apple.com>wrote:
>
> Hi,
>
> Thanks to everybody for the very informative feedback! Due to difference
> in timezones, I usually miss the most active time of discussions, however
> I'll try to cover the raised questions here and share my vision on
them.
>
> Generally speaking, we are trying to find answers to the following two
> questions:
>     o   Are these intrinsics useful in general?
>     o   If yes, when to lower them to actual IR/DAG?
>
> In a nutshell, my answers are:
>     o   Yes, they are useful, and useful for all targets. Not only for
> ARM64, where they are especially useful.
>     o   After loop optimizations but before lowering IR to DAG. Currently
> it's in CodeGenPrepare, but we might lower them before that if there is
any
> need in this.
>
> I'll try to explain why I think so. For the beginning, let's
consider a
> simple expression div=x/y in different languages and write the
> corresponding IR with use of the intrinsics (here I omit types for
> conciseness, assuming result type of the safe.sdiv intrinsic being {i32,
> i1, i1}) :
>
> *** C/C-like ***
>   %div = sdiv %x, %y
>
> *** Java, Go ***
>   %divr = call %safe.sdiv(%x, %y)
>   %div = extractvalue %divr, 0
>   %divbyzero_bit = extractvalue %divr, 1
>   br i1 %divbyzero_bit, label %throw, label %normal
>
> *** JavaScript, with "|0" ***
>   %divr = call %safe.sdiv(%x, %y)
>   %div = extractvalue %divr, 0
>
> *** JavaScript without "|0", Ruby and maybe others ***
>   %divr = call %safe.sdiv(%x, %y)
>   %div = extractvalue %divr, 0
>   %divbyzero_bit = extractvalue %divr, 1
>   %ovf_bit = extractvalue %divr, 2
>   %exceptional_case_bit = or %ovf_bit, %divbyzero_bit
>   br i1 %exceptional_case_bit, label %handle_exceptional_case, label
> %normal
>
>
> Now let's write the IR for the same expression without the intrinsics:
>
> *** C/C-like ***
>   %div = sdiv %x, %y
>
> *** Java, Go ***
>   %divbyzero_bit = cmp %y, 0
>   br i1 %divbyzero_bit, label %throw, label %normal
> normal:
>   %div = sdiv %x, %y
>
> *** JavaScript, with "|0" ***
>   %divbyzero_bit = cmp %y, 0
>   br i1 %divbyzero_bit, label %division_by_zero, label %chkovf
> division_by_zero:
>   br label %end
> chkovf:
>   %tmin_bit = cmp %x, min<T>
>   %minus_one_bit = cmp %y, -1
>   %ovf_bit = and %tmin_bit, %minus_one_bit
>   br i1 %ovf_bit, label %ovf, label %normal
> ovf:
>   br label %end
> normal:
>   %div.0 = sdiv %x, %y
>   br label %end
> end:
>   %div = phi [%div.0, %normal], [min<T>, %ovf], [0,
%division_by_zero]
>
> *** JavaScript without "|0", Ruby and maybe others ***
>   %divbyzero_bit = cmp %y, 0
>   br i1 %divbyzero_bit, label %throw, label %chkovf
> chkovf:
>   %tmin_bit = cmp %x, min<T>
>   %minus_one_bit = cmp %y, -1
>   %ovf_bit = and %tmin_bit, %minus_one_bit
>   br i1 %ovf_bit, label %ovf, label %normal
> ovf:
>   ; Special handling of min<T>/-1: converting to BigInt, double or
> something other
>   unreachable
> normal:
>   %div.0 = sdiv %x, %y
>
> As one can see, if the intrinsic is lowered to some CFG (i.e. on x86),
> there is nor loss neither win: due to the implemented optimizations the
> final code would be roughly the same as in the case with explicitly written
> CFG.
>
> If the intrinsic could be directly lowered to a single instruction (i.e.
> on ARM64 and if we don't bother about divz/ovf-flag), use of the
intrinsic
> would definitely be beneficial - that takes place for the <JavaScript,
with
> "|0"> case.
>
> That might make one think that the intrinsics are useful only for ARM64,
> and in general aren't that useful at all. But in fact even when it
can't be
> lowered to a single instruction, keeping the control flow 'inside'
the
> intrinsic till its lowering might be beneficial. For instance, loop passes
> work much better on loops without control flow: IndVars might be a good
> example of such pass. And, for instance, it seems much easier to vectorize
> a loop with call to one of these intrinsics than vectorize a loop with
> explicitly created control flow (and it could become even worse if there
> are several calls to the intrinsics in the same loop).
>
> In some cases we might want to lower these intrinsics earlier than in CGP
> - as Reid mentioned, it could help us to hoist checks out of loops.
That's
> true but we need to keep in mind possible disadvantages of such early
> lowering (see previous paragraph). But in general I don't see any
problems
> with allowing earlier lowering.
>
> As for extending the result structure to keep one more flag (being that i2
> instead of i1, or one more i1 flag) - that seems fine, and both options
> {iN, i2} and {iN, i1, i1} look good to me.
>
> Now, why not to lower these intrinsics even later? Almost for all targets
> we want to get very similar CFG, and I don't see any benefits of
> duplicating very similar code all across different backends. Even on ARM64
> we need to have control flow in some cases (e.g. for Java) - so we actually
> win nothing from making lowering completely target-specific.
>
> I hope that covers the raised concerns, at least to some extent. Does that
> sound reasonable enough? If so, I'll prepare and post here an updated
> version of the patch.
>
> Thanks,
> Michael
>
>
> On Apr 26, 2014, at 4:04 AM, Andrew Trick <atrick at apple.com>
wrote:
>
> >
> > On Apr 25, 2014, at 2:21 PM, Eric Christopher <echristo at
gmail.com>
> wrote:
> >
> >>> In short, I agree with your observations that these intrinsics
are not
> an
> >>> obvious slam-dunk compared to making the explicit control
flow, but I
> think
> >>> that the intrinsics do give enough flexibility on the LLVM
side that it
> >>> would be great if front-ends used them rather than rolling the
control
> flow
> >>> themselves.
> >>>
> >>
> >> The problem is that then we have 2 problems: All targets (except
for
> >> arm64) then have to lower the intrinsic as the first thing they do
> >> (giving us a TTI pass as the first thing in the pipeline) to take
> >> advantage of the information later during optimization, _and_ we
have
> >> to plumb all of the work optimizing the intrinsic as well giving
us a
> >> situation where we've now split our optimization efforts as
well as
> >> the pain of maintaining an intrinsic that's useful for a
single
> >> target.
> >>
> >> I really think that this is just solidifying my position that the
> >> intrinsic is a bad idea and that this should be done as later
> >> optimizations.
> >
> > The goal of the intrinsic wasn't stated clearly.
> >
> > This intrinsic isn't particularly necessary for any specific
frontend
> > or architecture. I think the LLVM community would benefit from a
> > canonical way to represent well-defined integer division. We don't
> > need to add an IR instruction, because it's fine for most
optimization
> > passes to ignore these operations. A target independent intrinsic
> > works well as long as:
> >
> > - It cleanly captures the language level semantics.
> >
> > - It facilitates mid-level optimization.
> >
> > - It naturally lowers into ideal code both on architectures that trap
> >  (x86) and those that don't (arm).
> >
> > To summarize my understanding of the concerns:
> >
> > (1) The semantics of the safe.div intrinsic need to be useful for the
> > Language/ISA Matrix that LLVMers care about.
> >
> > At canonical IR level, the intrinsic is useful by eliminating control
> > flow merges and representing divide-by-zero and/or signed overflow in
> > a canonical form:
> >
> >    %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
> >    %bit = extractvalue {i32, i1} %res, 1
> >    br i1 %bit, label %trap, label %continue
> > trap:
> >    call ...
> >    unreachable
> >
> > continue:
> >    %div = extractvalue {i32, i1} %res, 0
> >
> > The original proposal fails to achieve this because the common case of
> > Java/Go would require a check in the slow-path to differentiate
> > divide-by-zero from signed overflow. That should be fixed by
> > generalizing the intrinsic so that the two conditions are distinct:
> >
> >    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
> >    %div0 = extractvalue {i32, i1} %res, 1
> >    br i1 %div0, label %trap, label %checkovf
> >
> > checkovf:
> >    %ovf = extractvalue {i32, i1} %res, 2
> >    br i1 %div0, label %trap, label %continue
> >
> > trap:
> >    call ...
> >    unreachable
> >
> > continue:
> >    %div = extractvalue {i32, i1} %res, 0
> >
> > ...or some variation of the above. I don't have a personal stake
in this.
> >
> > (2) The safe.div intrinsic inhibits generic code motion and Correlated
> > Value Prop based optimization.
> >
> > This goes both ways.
> >
> > CVP could miss out on cases unless we teach it about the semantics of
> > the intrinsics. Correct me if I'm wrong, but I don't think
this would
> > actually be too difficult.
> >
> > OTOH, with the intrinsics, it would be easy to write a simple
> > optimization pass that hoists and combines checks along the domtree.
> >
> > After divide-by-zero optimization, you would have something like:
> >
> >    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
> >    %div0 = extractvalue {i32, i1} %res, 1
> >    br i1 %div0, label %trap, label %continue
> >
> > trap:
> >    # No call here!
> >    unreachable
> >
> > continue:
> >    %div = extractvalue {i32, i1} %res, 0
> >
> > And the branch to trap just goes away at some point.
> >
> > Now considering Reid's LICM example:
> >
> > for (int i = 0; i < n; ++i) {
> >  if (b == 0) c[i] = 0;
> >  else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN;
> >  else c[i] = a[i] / b;
> > }
> >
> > Simply put, if we want to unswitch this code for x86, then we need a
> > TTI pass to lower the intrinsic. On the other hand, I believe it is
> > more important to eliminate the control flow within the loop to aid
> > loop analysis and other optimizations. So we still want the front end
> > to emit the intrinsic, we just may eventually want it lowered earlier
> > than CGP. I don't think this issue has any bearing on the
intrinsic's
> > LangRef spec.
> >
> > There was some comparison made to iload/istore, which I don't
> > follow:
> > - subsuming loads and stores into another instruction is really scary
> >  considering how much logic we have for analyzing the side effects of
> >  memory access.
> > - there is no benefit to IR optimization to the iload/istore
intruction.
> > - it is easy to detect null checked load/stores in IR at any point.
> > - it is very rare that a platform would want to resume from trapping
> load/store.
> >
> > (3) The safe.div intrinsic is a target-specific codegen optimization.
> >
> > Regardless of the target, we want to eliminate control flow merges in
> > all cases, and completely eliminate control flow when we don't
have
> > trapping semantics. To do this, we need an intrinsic at canonical IR
> > level. Clearly it should be a target independent intrinsic
> > since *some* optimizations/analyses want to be aware of it.
> >
> > Even ignoring the rationale above, supporting codegen with a
> > target-specific intrinsic would minimally require the DAG builder to
> > match this
> > "if (b != 0) ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) :
0)"
> >
> > after all optimizations have had their way with it. A big problem here
> > is that there are actually many different forms of this expression and
> > surrounding control flow, depending on the frontend's integer
division
> > semantics. We would have to recognize all the variations of checking
for
> > divide-by-zero and/or overflow, trapping and/or producing a certain
> > constant, branching or selecting values. This is obviously terrible
> > from an engineering standpoint.
> >
> > -Andy
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140426/13e59478/attachment.html>

Michael Zolotukhin

2014-Apr-29 11:17 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

Hi,
> I am very much in favor of having a div instruction with well defined
div-by-zero and overflow behavior.
Actually, an interest in these intrinsics has been shown for quite a long time,
see for example thread [1].

I updated the patch, now the intrinsics return {iN, i1, i1} to separately
indicate division by zero and overflow events. Though for unsigned version only
division by zero is relevant, I decided to keep three elements in return
structure for consistency (the last one is always 0 in this case). With this
change various front-ends can take advantage of the intrinsics, depending on
desired semantics.

Also I changed semantics of srem intrinsics to better match semantics of, for
instance, Go. The original proposal stated that safe.srem(min<T>,-1) =
min<T>, but it makes more sense to return 0 here. Thus, we’ll keep
invariant X = div(X,Y)*Y + rem(X,Y).



Does the patch look good? I’d be glad to hear any remarks and comments regarding
the patch, and adjust it correspondingly.

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/060955.html

Best regards,
Michael

On Apr 27, 2014, at 2:45 AM, Keno Fischer <kfischer at
college.harvard.edu> wrote:
> I am very much in favor of having a div instruction with well defined
div-by-zero and overflow behavior. The undefined behavior on certain values for
LLVM intrinsics has been a major pain point for us in Julia, because adding the
extra branches just kills performance and we know that there is an X86
instruction that just does what we want. Anyway, this was brought up briefly
above, but want to bring back the discussion of a div instruction that reliably
traps, as that seems to me to be one of the only ways to get adequate
performance in the common case. This can probably be done as suggested above
with matching on the trap condition of this instruction, but I think while
we're at it, we should tackle that case as well.
> 
> Thanks,
> Keno
> 
> 
> On Sat, Apr 26, 2014 at 5:11 PM, Michael Zolotukhin <mzolotukhin at
apple.com> wrote:
> 
> Hi,
> 
> Thanks to everybody for the very informative feedback! Due to difference in
timezones, I usually miss the most active time of discussions, however I’ll try
to cover the raised questions here and share my vision on them.
> 
> Generally speaking, we are trying to find answers to the following two
questions:
>     o   Are these intrinsics useful in general?
>     o   If yes, when to lower them to actual IR/DAG?
> 
> In a nutshell, my answers are:
>     o   Yes, they are useful, and useful for all targets. Not only for
ARM64, where they are especially useful.
>     o   After loop optimizations but before lowering IR to DAG. Currently
it’s in CodeGenPrepare, but we might lower them before that if there is any need
in this.
> 
> I'll try to explain why I think so. For the beginning, let's
consider a simple expression div=x/y in different languages and write the
corresponding IR with use of the intrinsics (here I omit types for conciseness,
assuming result type of the safe.sdiv intrinsic being {i32, i1, i1}) :
> 
> *** C/C-like ***
>   %div = sdiv %x, %y
> 
> *** Java, Go ***
>   %divr = call %safe.sdiv(%x, %y)
>   %div = extractvalue %divr, 0
>   %divbyzero_bit = extractvalue %divr, 1
>   br i1 %divbyzero_bit, label %throw, label %normal
> 
> *** JavaScript, with “|0” ***
>   %divr = call %safe.sdiv(%x, %y)
>   %div = extractvalue %divr, 0
> 
> *** JavaScript without “|0”, Ruby and maybe others ***
>   %divr = call %safe.sdiv(%x, %y)
>   %div = extractvalue %divr, 0
>   %divbyzero_bit = extractvalue %divr, 1
>   %ovf_bit = extractvalue %divr, 2
>   %exceptional_case_bit = or %ovf_bit, %divbyzero_bit
>   br i1 %exceptional_case_bit, label %handle_exceptional_case, label
%normal
> 
> 
> Now let’s write the IR for the same expression without the intrinsics:
> 
> *** C/C-like ***
>   %div = sdiv %x, %y
> 
> *** Java, Go ***
>   %divbyzero_bit = cmp %y, 0
>   br i1 %divbyzero_bit, label %throw, label %normal
> normal:
>   %div = sdiv %x, %y
> 
> *** JavaScript, with “|0” ***
>   %divbyzero_bit = cmp %y, 0
>   br i1 %divbyzero_bit, label %division_by_zero, label %chkovf
> division_by_zero:
>   br label %end
> chkovf:
>   %tmin_bit = cmp %x, min<T>
>   %minus_one_bit = cmp %y, -1
>   %ovf_bit = and %tmin_bit, %minus_one_bit
>   br i1 %ovf_bit, label %ovf, label %normal
> ovf:
>   br label %end
> normal:
>   %div.0 = sdiv %x, %y
>   br label %end
> end:
>   %div = phi [%div.0, %normal], [min<T>, %ovf], [0,
%division_by_zero]
> 
> *** JavaScript without “|0”, Ruby and maybe others ***
>   %divbyzero_bit = cmp %y, 0
>   br i1 %divbyzero_bit, label %throw, label %chkovf
> chkovf:
>   %tmin_bit = cmp %x, min<T>
>   %minus_one_bit = cmp %y, -1
>   %ovf_bit = and %tmin_bit, %minus_one_bit
>   br i1 %ovf_bit, label %ovf, label %normal
> ovf:
>   ; Special handling of min<T>/-1: converting to BigInt, double or
something other
>   unreachable
> normal:
>   %div.0 = sdiv %x, %y
> 
> As one can see, if the intrinsic is lowered to some CFG (i.e. on x86),
there is nor loss neither win: due to the implemented optimizations the final
code would be roughly the same as in the case with explicitly written CFG.
> 
> If the intrinsic could be directly lowered to a single instruction (i.e. on
ARM64 and if we don’t bother about divz/ovf-flag), use of the intrinsic would
definitely be beneficial - that takes place for the <JavaScript, with
“|0”> case.
> 
> That might make one think that the intrinsics are useful only for ARM64,
and in general aren’t that useful at all. But in fact even when it can’t be
lowered to a single instruction, keeping the control flow ‘inside’ the intrinsic
till its lowering might be beneficial. For instance, loop passes work much
better on loops without control flow: IndVars might be a good example of such
pass. And, for instance, it seems much easier to vectorize a loop with call to
one of these intrinsics than vectorize a loop with explicitly created control
flow (and it could become even worse if there are several calls to the
intrinsics in the same loop).
> 
> In some cases we might want to lower these intrinsics earlier than in CGP -
as Reid mentioned, it could help us to hoist checks out of loops. That’s true
but we need to keep in mind possible disadvantages of such early lowering (see
previous paragraph). But in general I don’t see any problems with allowing
earlier lowering.
> 
> As for extending the result structure to keep one more flag (being that i2
instead of i1, or one more i1 flag) - that seems fine, and both options {iN, i2}
and {iN, i1, i1} look good to me.
> 
> Now, why not to lower these intrinsics even later? Almost for all targets
we want to get very similar CFG, and I don’t see any benefits of duplicating
very similar code all across different backends. Even on ARM64 we need to have
control flow in some cases (e.g. for Java) - so we actually win nothing from
making lowering completely target-specific.
> 
> I hope that covers the raised concerns, at least to some extent. Does that
sound reasonable enough? If so, I’ll prepare and post here an updated version of
the patch.
> 
> Thanks,
> Michael
> 
> 
> On Apr 26, 2014, at 4:04 AM, Andrew Trick <atrick at apple.com>
wrote:
> 
> >
> > On Apr 25, 2014, at 2:21 PM, Eric Christopher <echristo at
gmail.com> wrote:
> >
> >>> In short, I agree with your observations that these intrinsics
are not an
> >>> obvious slam-dunk compared to making the explicit control
flow, but I think
> >>> that the intrinsics do give enough flexibility on the LLVM
side that it
> >>> would be great if front-ends used them rather than rolling the
control flow
> >>> themselves.
> >>>
> >>
> >> The problem is that then we have 2 problems: All targets (except
for
> >> arm64) then have to lower the intrinsic as the first thing they do
> >> (giving us a TTI pass as the first thing in the pipeline) to take
> >> advantage of the information later during optimization, _and_ we
have
> >> to plumb all of the work optimizing the intrinsic as well giving
us a
> >> situation where we've now split our optimization efforts as
well as
> >> the pain of maintaining an intrinsic that's useful for a
single
> >> target.
> >>
> >> I really think that this is just solidifying my position that the
> >> intrinsic is a bad idea and that this should be done as later
> >> optimizations.
> >
> > The goal of the intrinsic wasn't stated clearly.
> >
> > This intrinsic isn't particularly necessary for any specific
frontend
> > or architecture. I think the LLVM community would benefit from a
> > canonical way to represent well-defined integer division. We don't
> > need to add an IR instruction, because it's fine for most
optimization
> > passes to ignore these operations. A target independent intrinsic
> > works well as long as:
> >
> > - It cleanly captures the language level semantics.
> >
> > - It facilitates mid-level optimization.
> >
> > - It naturally lowers into ideal code both on architectures that trap
> >  (x86) and those that don't (arm).
> >
> > To summarize my understanding of the concerns:
> >
> > (1) The semantics of the safe.div intrinsic need to be useful for the
> > Language/ISA Matrix that LLVMers care about.
> >
> > At canonical IR level, the intrinsic is useful by eliminating control
> > flow merges and representing divide-by-zero and/or signed overflow in
> > a canonical form:
> >
> >    %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
> >    %bit = extractvalue {i32, i1} %res, 1
> >    br i1 %bit, label %trap, label %continue
> > trap:
> >    call ...
> >    unreachable
> >
> > continue:
> >    %div = extractvalue {i32, i1} %res, 0
> >
> > The original proposal fails to achieve this because the common case of
> > Java/Go would require a check in the slow-path to differentiate
> > divide-by-zero from signed overflow. That should be fixed by
> > generalizing the intrinsic so that the two conditions are distinct:
> >
> >    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
> >    %div0 = extractvalue {i32, i1} %res, 1
> >    br i1 %div0, label %trap, label %checkovf
> >
> > checkovf:
> >    %ovf = extractvalue {i32, i1} %res, 2
> >    br i1 %div0, label %trap, label %continue
> >
> > trap:
> >    call ...
> >    unreachable
> >
> > continue:
> >    %div = extractvalue {i32, i1} %res, 0
> >
> > ...or some variation of the above. I don't have a personal stake
in this.
> >
> > (2) The safe.div intrinsic inhibits generic code motion and Correlated
> > Value Prop based optimization.
> >
> > This goes both ways.
> >
> > CVP could miss out on cases unless we teach it about the semantics of
> > the intrinsics. Correct me if I'm wrong, but I don't think
this would
> > actually be too difficult.
> >
> > OTOH, with the intrinsics, it would be easy to write a simple
> > optimization pass that hoists and combines checks along the domtree.
> >
> > After divide-by-zero optimization, you would have something like:
> >
> >    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
> >    %div0 = extractvalue {i32, i1} %res, 1
> >    br i1 %div0, label %trap, label %continue
> >
> > trap:
> >    # No call here!
> >    unreachable
> >
> > continue:
> >    %div = extractvalue {i32, i1} %res, 0
> >
> > And the branch to trap just goes away at some point.
> >
> > Now considering Reid's LICM example:
> >
> > for (int i = 0; i < n; ++i) {
> >  if (b == 0) c[i] = 0;
> >  else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN;
> >  else c[i] = a[i] / b;
> > }
> >
> > Simply put, if we want to unswitch this code for x86, then we need a
> > TTI pass to lower the intrinsic. On the other hand, I believe it is
> > more important to eliminate the control flow within the loop to aid
> > loop analysis and other optimizations. So we still want the front end
> > to emit the intrinsic, we just may eventually want it lowered earlier
> > than CGP. I don't think this issue has any bearing on the
intrinsic's
> > LangRef spec.
> >
> > There was some comparison made to iload/istore, which I don't
> > follow:
> > - subsuming loads and stores into another instruction is really scary
> >  considering how much logic we have for analyzing the side effects of
> >  memory access.
> > - there is no benefit to IR optimization to the iload/istore
intruction.
> > - it is easy to detect null checked load/stores in IR at any point.
> > - it is very rare that a platform would want to resume from trapping
load/store.
> >
> > (3) The safe.div intrinsic is a target-specific codegen optimization.
> >
> > Regardless of the target, we want to eliminate control flow merges in
> > all cases, and completely eliminate control flow when we don't
have
> > trapping semantics. To do this, we need an intrinsic at canonical IR
> > level. Clearly it should be a target independent intrinsic
> > since *some* optimizations/analyses want to be aware of it.
> >
> > Even ignoring the rationale above, supporting codegen with a
> > target-specific intrinsic would minimally require the DAG builder to
> > match this
> > "if (b != 0) ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) :
0)"
> >
> > after all optimizations have had their way with it. A big problem here
> > is that there are actually many different forms of this expression and
> > surrounding control flow, depending on the frontend's integer
division
> > semantics. We would have to recognize all the variations of checking
for
> > divide-by-zero and/or overflow, trapping and/or producing a certain
> > constant, branching or selecting values. This is obviously terrible
> > from an engineering standpoint.
> >
> > -Andy
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/3123e720/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1-safediv-intrinsics-v2.patch
Type: application/octet-stream
Size: 29504 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/3123e720/attachment.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/3123e720/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2-safediv-intrinsics-docs-v2.patch
Type: application/octet-stream
Size: 7190 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/3123e720/attachment-0001.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/3123e720/attachment-0002.html>

Philip Reames

2014-Apr-29 17:07 UTC

head link

[LLVMdev] Proposal: add intrinsics for safe division

As the discussion has progressed and I've spent more time thinking about 
the topic, I find myself less and less enthused about the current 
proposal.  I'm in full support of having idiomatic ways to express safe 
division, but I'm starting to doubt that using an intrinsic is the right 
way at the moment.

One case I find myself thinking about is how one would combine profiling 
information and implicit div-by-zero/overflow checks with this 
proposal.  I don't really see a clean way.  Ideally, for a "safe
div"
which never has the exceptional paths taken, you'd like to completely do 
away with the control flow entirely.  (And rely on hardware traps 
w/exceptions instead.)  I don't really see a way to represent that type 
of construct given the current proposal.

Another point that is bothering me is how these intrinsics will interact 
with loop invariant code motion, and other condition check 
optimizations.  We have the potential to hide control flow from 
optimizations which would otherwise remove it.  I'm not stating these 
/aren't/ beneficial.  I'm simply stating that I remain somewhat 
unconvinced.  They seem like clear wins on some architectures (i.e. ARM, 
where the hardware includes the specific checks already), but not on 
others (i.e. x86 with it's exceptions).

Given the above, I'm going withdraw my active support from the current 
proposal.  (Note: This does not mean that I'm going to actively oppose 
it either!)

Let me make a counter proposal of my own.

1) Let's not add generic intrinsics at this time.  Instead, let's add 
cpu specific intrinsics with the exact semantics of the associated div 
instructions.
   a) for ARM, this would be the semantics of the current proposed 
instruction (right?)
   b) for x86, this would be an instruction which is defined to fault on 
the corner cases (the resume behaviour would /not/ be specified for the 
moment)
1b) Teach the optimizer about the semantics of each - if we do go with a 
unified safe.div instruction later, this effort could be mostly reused.
2) Add optimizations (fairly late) which fold generic-div+control flow 
into machine specific div instructions.
3) Frontends that want machine specific semantics (i.e. trap or defined 
edge cases), can use the machine specific intrinsics directly.
4) We provide a set of IR helper functions which implement safe division 
in what we believe to be the best way.  This can be different for each 
supported platform.  This could either be the form of sample IR, or even 
as functions on IRBuilder.
5) We spend some time optimizing the IR generated by (4) and use that to 
inform our discussion about what is truly necessary from a performance 
standpoint.

Once this is in place, we can come back to the discussion of what an 
appropriate generic safe div intrinsic would look like.  In particular, 
we would have the performance data at hand to lay to rest the concerns 
raised by myself and Eric.  My strong suspicion is that the counter 
proposal will be "good enough" in practice for most users.

p.s. If we do go ahead with the current proposal, I'm going to 
*strongly* request they be marked experimental.  I think we'll need to 
iterate on these a few times.

Philip

On 04/29/2014 04:17 AM, Michael Zolotukhin wrote:> Hi,
>
>> I am very much in favor of having a div instruction with well defined 
>> div-by-zero and overflow behavior.
> Actually, an interest in these intrinsics has been shown for quite a 
> long time, see for example thread [1].
>
> I updated the patch, now the intrinsics return {iN, i1, i1} to 
> separately indicate division by zero and overflow events. Though for 
> unsigned version only division by zero is relevant, I decided to keep 
> three elements in return structure for consistency (the last one is 
> always 0 in this case). With this change various front-ends can take 
> advantage of the intrinsics, depending on desired semantics.
>
> Also I changed semantics of srem intrinsics to better match semantics 
> of, for instance, Go. The original proposal stated that 
> safe.srem(min<T>,-1) = min<T>, but it makes more sense to
return 0
> here. Thus, we'll keep invariant X = div(X,Y)*Y + rem(X,Y).
>
>
>
>
>
>
> Does the patch look good? I'd be glad to hear any remarks and comments 
> regarding the patch, and adjust it correspondingly.
>
> [1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-April/060955.html
>
> Best regards,
> Michael
>
> On Apr 27, 2014, at 2:45 AM, Keno Fischer 
> <kfischer at college.harvard.edu <mailto:kfischer at
college.harvard.edu>>
> wrote:
>
>> I am very much in favor of having a div instruction with well defined 
>> div-by-zero and overflow behavior. The undefined behavior on certain 
>> values for LLVM intrinsics has been a major pain point for us in 
>> Julia, because adding the extra branches just kills performance and 
>> we know that there is an X86 instruction that just does what we want. 
>> Anyway, this was brought up briefly above, but want to bring back the 
>> discussion of a div instruction that reliably traps, as that seems to 
>> me to be one of the only ways to get adequate performance in the 
>> common case. This can probably be done as suggested above with 
>> matching on the trap condition of this instruction, but I think while 
>> we're at it, we should tackle that case as well.
>>
>> Thanks,
>> Keno
>>
>>
>> On Sat, Apr 26, 2014 at 5:11 PM, Michael Zolotukhin 
>> <mzolotukhin at apple.com <mailto:mzolotukhin at
apple.com>> wrote:
>>
>>
>>     Hi,
>>
>>     Thanks to everybody for the very informative feedback! Due to
>>     difference in timezones, I usually miss the most active time of
>>     discussions, however I'll try to cover the raised questions
here
>>     and share my vision on them.
>>
>>     Generally speaking, we are trying to find answers to the
>>     following two questions:
>>         o   Are these intrinsics useful in general?
>>         o   If yes, when to lower them to actual IR/DAG?
>>
>>     In a nutshell, my answers are:
>>         o   Yes, they are useful, and useful for all targets. Not
>>     only for ARM64, where they are especially useful.
>>         o   After loop optimizations but before lowering IR to DAG.
>>     Currently it's in CodeGenPrepare, but we might lower them
before
>>     that if there is any need in this.
>>
>>     I'll try to explain why I think so. For the beginning,
let's
>>     consider a simple expression div=x/y in different languages and
>>     write the corresponding IR with use of the intrinsics (here I
>>     omit types for conciseness, assuming result type of the safe.sdiv
>>     intrinsic being {i32, i1, i1}) :
>>
>>     *** C/C-like ***
>>       %div = sdiv %x, %y
>>
>>     *** Java, Go ***
>>       %divr = call %safe.sdiv(%x, %y)
>>       %div = extractvalue %divr, 0
>>       %divbyzero_bit = extractvalue %divr, 1
>>       br i1 %divbyzero_bit, label %throw, label %normal
>>
>>     *** JavaScript, with "|0" ***
>>       %divr = call %safe.sdiv(%x, %y)
>>       %div = extractvalue %divr, 0
>>
>>     *** JavaScript without "|0", Ruby and maybe others ***
>>       %divr = call %safe.sdiv(%x, %y)
>>       %div = extractvalue %divr, 0
>>       %divbyzero_bit = extractvalue %divr, 1
>>       %ovf_bit = extractvalue %divr, 2
>>       %exceptional_case_bit = or %ovf_bit, %divbyzero_bit
>>       br i1 %exceptional_case_bit, label %handle_exceptional_case,
>>     label %normal
>>
>>
>>     Now let's write the IR for the same expression without the
>>     intrinsics:
>>
>>     *** C/C-like ***
>>       %div = sdiv %x, %y
>>
>>     *** Java, Go ***
>>       %divbyzero_bit = cmp %y, 0
>>       br i1 %divbyzero_bit, label %throw, label %normal
>>     normal:
>>       %div = sdiv %x, %y
>>
>>     *** JavaScript, with "|0" ***
>>       %divbyzero_bit = cmp %y, 0
>>       br i1 %divbyzero_bit, label %division_by_zero, label %chkovf
>>     division_by_zero:
>>       br label %end
>>     chkovf:
>>       %tmin_bit = cmp %x, min<T>
>>       %minus_one_bit = cmp %y, -1
>>       %ovf_bit = and %tmin_bit, %minus_one_bit
>>       br i1 %ovf_bit, label %ovf, label %normal
>>     ovf:
>>       br label %end
>>     normal:
>>       %div.0 = sdiv %x, %y
>>       br label %end
>>     end:
>>       %div = phi [%div.0, %normal], [min<T>, %ovf], [0,
>>     %division_by_zero]
>>
>>     *** JavaScript without "|0", Ruby and maybe others ***
>>       %divbyzero_bit = cmp %y, 0
>>       br i1 %divbyzero_bit, label %throw, label %chkovf
>>     chkovf:
>>       %tmin_bit = cmp %x, min<T>
>>       %minus_one_bit = cmp %y, -1
>>       %ovf_bit = and %tmin_bit, %minus_one_bit
>>       br i1 %ovf_bit, label %ovf, label %normal
>>     ovf:
>>       ; Special handling of min<T>/-1: converting to BigInt,
double
>>     or something other
>>       unreachable
>>     normal:
>>       %div.0 = sdiv %x, %y
>>
>>     As one can see, if the intrinsic is lowered to some CFG (i.e. on
>>     x86), there is nor loss neither win: due to the implemented
>>     optimizations the final code would be roughly the same as in the
>>     case with explicitly written CFG.
>>
>>     If the intrinsic could be directly lowered to a single
>>     instruction (i.e. on ARM64 and if we don't bother about
>>     divz/ovf-flag), use of the intrinsic would definitely be
>>     beneficial - that takes place for the <JavaScript, with
"|0"> case.
>>
>>     That might make one think that the intrinsics are useful only for
>>     ARM64, and in general aren't that useful at all. But in fact
even
>>     when it can't be lowered to a single instruction, keeping the
>>     control flow 'inside' the intrinsic till its lowering might
be
>>     beneficial. For instance, loop passes work much better on loops
>>     without control flow: IndVars might be a good example of such
>>     pass. And, for instance, it seems much easier to vectorize a loop
>>     with call to one of these intrinsics than vectorize a loop with
>>     explicitly created control flow (and it could become even worse
>>     if there are several calls to the intrinsics in the same loop).
>>
>>     In some cases we might want to lower these intrinsics earlier
>>     than in CGP - as Reid mentioned, it could help us to hoist checks
>>     out of loops. That's true but we need to keep in mind possible
>>     disadvantages of such early lowering (see previous paragraph).
>>     But in general I don't see any problems with allowing earlier
>>     lowering.
>>
>>     As for extending the result structure to keep one more flag
>>     (being that i2 instead of i1, or one more i1 flag) - that seems
>>     fine, and both options {iN, i2} and {iN, i1, i1} look good to me.
>>
>>     Now, why not to lower these intrinsics even later? Almost for all
>>     targets we want to get very similar CFG, and I don't see any
>>     benefits of duplicating very similar code all across different
>>     backends. Even on ARM64 we need to have control flow in some
>>     cases (e.g. for Java) - so we actually win nothing from making
>>     lowering completely target-specific.
>>
>>     I hope that covers the raised concerns, at least to some extent.
>>     Does that sound reasonable enough? If so, I'll prepare and post
>>     here an updated version of the patch.
>>
>>     Thanks,
>>     Michael
>>
>>
>>     On Apr 26, 2014, at 4:04 AM, Andrew Trick <atrick at apple.com
>>     <mailto:atrick at apple.com>> wrote:
>>
>>     >
>>     > On Apr 25, 2014, at 2:21 PM, Eric Christopher
>>     <echristo at gmail.com <mailto:echristo at gmail.com>>
wrote:
>>     >
>>     >>> In short, I agree with your observations that these
>>     intrinsics are not an
>>     >>> obvious slam-dunk compared to making the explicit
control
>>     flow, but I think
>>     >>> that the intrinsics do give enough flexibility on the
LLVM
>>     side that it
>>     >>> would be great if front-ends used them rather than
rolling
>>     the control flow
>>     >>> themselves.
>>     >>>
>>     >>
>>     >> The problem is that then we have 2 problems: All targets
>>     (except for
>>     >> arm64) then have to lower the intrinsic as the first thing
they do
>>     >> (giving us a TTI pass as the first thing in the pipeline)
to take
>>     >> advantage of the information later during optimization,
_and_
>>     we have
>>     >> to plumb all of the work optimizing the intrinsic as well
>>     giving us a
>>     >> situation where we've now split our optimization
efforts as
>>     well as
>>     >> the pain of maintaining an intrinsic that's useful for
a single
>>     >> target.
>>     >>
>>     >> I really think that this is just solidifying my position
that the
>>     >> intrinsic is a bad idea and that this should be done as
later
>>     >> optimizations.
>>     >
>>     > The goal of the intrinsic wasn't stated clearly.
>>     >
>>     > This intrinsic isn't particularly necessary for any
specific
>>     frontend
>>     > or architecture. I think the LLVM community would benefit from
a
>>     > canonical way to represent well-defined integer division. We
don't
>>     > need to add an IR instruction, because it's fine for most
>>     optimization
>>     > passes to ignore these operations. A target independent
intrinsic
>>     > works well as long as:
>>     >
>>     > - It cleanly captures the language level semantics.
>>     >
>>     > - It facilitates mid-level optimization.
>>     >
>>     > - It naturally lowers into ideal code both on architectures
>>     that trap
>>     >  (x86) and those that don't (arm).
>>     >
>>     > To summarize my understanding of the concerns:
>>     >
>>     > (1) The semantics of the safe.div intrinsic need to be useful
>>     for the
>>     > Language/ISA Matrix that LLVMers care about.
>>     >
>>     > At canonical IR level, the intrinsic is useful by eliminating
>>     control
>>     > flow merges and representing divide-by-zero and/or signed
>>     overflow in
>>     > a canonical form:
>>     >
>>     >    %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
>>     >    %bit = extractvalue {i32, i1} %res, 1
>>     >    br i1 %bit, label %trap, label %continue
>>     > trap:
>>     >    call ...
>>     >    unreachable
>>     >
>>     > continue:
>>     >    %div = extractvalue {i32, i1} %res, 0
>>     >
>>     > The original proposal fails to achieve this because the common
>>     case of
>>     > Java/Go would require a check in the slow-path to
differentiate
>>     > divide-by-zero from signed overflow. That should be fixed by
>>     > generalizing the intrinsic so that the two conditions are
distinct:
>>     >
>>     >    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32
%b)
>>     >    %div0 = extractvalue {i32, i1} %res, 1
>>     >    br i1 %div0, label %trap, label %checkovf
>>     >
>>     > checkovf:
>>     >    %ovf = extractvalue {i32, i1} %res, 2
>>     >    br i1 %div0, label %trap, label %continue
>>     >
>>     > trap:
>>     >    call ...
>>     >    unreachable
>>     >
>>     > continue:
>>     >    %div = extractvalue {i32, i1} %res, 0
>>     >
>>     > ...or some variation of the above. I don't have a personal
>>     stake in this.
>>     >
>>     > (2) The safe.div intrinsic inhibits generic code motion and
>>     Correlated
>>     > Value Prop based optimization.
>>     >
>>     > This goes both ways.
>>     >
>>     > CVP could miss out on cases unless we teach it about the
>>     semantics of
>>     > the intrinsics. Correct me if I'm wrong, but I don't
think this
>>     would
>>     > actually be too difficult.
>>     >
>>     > OTOH, with the intrinsics, it would be easy to write a simple
>>     > optimization pass that hoists and combines checks along the
>>     domtree.
>>     >
>>     > After divide-by-zero optimization, you would have something
like:
>>     >
>>     >    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32
%b)
>>     >    %div0 = extractvalue {i32, i1} %res, 1
>>     >    br i1 %div0, label %trap, label %continue
>>     >
>>     > trap:
>>     >    # No call here!
>>     >    unreachable
>>     >
>>     > continue:
>>     >    %div = extractvalue {i32, i1} %res, 0
>>     >
>>     > And the branch to trap just goes away at some point.
>>     >
>>     > Now considering Reid's LICM example:
>>     >
>>     > for (int i = 0; i < n; ++i) {
>>     >  if (b == 0) c[i] = 0;
>>     >  else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN;
>>     >  else c[i] = a[i] / b;
>>     > }
>>     >
>>     > Simply put, if we want to unswitch this code for x86, then we
>>     need a
>>     > TTI pass to lower the intrinsic. On the other hand, I believe
it is
>>     > more important to eliminate the control flow within the loop
to aid
>>     > loop analysis and other optimizations. So we still want the
>>     front end
>>     > to emit the intrinsic, we just may eventually want it lowered
>>     earlier
>>     > than CGP. I don't think this issue has any bearing on the
>>     intrinsic's
>>     > LangRef spec.
>>     >
>>     > There was some comparison made to iload/istore, which I
don't
>>     > follow:
>>     > - subsuming loads and stores into another instruction is
really
>>     scary
>>     >  considering how much logic we have for analyzing the side
>>     effects of
>>     >  memory access.
>>     > - there is no benefit to IR optimization to the iload/istore
>>     intruction.
>>     > - it is easy to detect null checked load/stores in IR at any
point.
>>     > - it is very rare that a platform would want to resume from
>>     trapping load/store.
>>     >
>>     > (3) The safe.div intrinsic is a target-specific codegen
>>     optimization.
>>     >
>>     > Regardless of the target, we want to eliminate control flow
>>     merges in
>>     > all cases, and completely eliminate control flow when we
don't have
>>     > trapping semantics. To do this, we need an intrinsic at
>>     canonical IR
>>     > level. Clearly it should be a target independent intrinsic
>>     > since *some* optimizations/analyses want to be aware of it.
>>     >
>>     > Even ignoring the rationale above, supporting codegen with a
>>     > target-specific intrinsic would minimally require the DAG
>>     builder to
>>     > match this
>>     > "if (b != 0) ? ((a != INT_MIN || b != -1) ? a / b :
INT_MIN) : 0)"
>>     >
>>     > after all optimizations have had their way with it. A big
>>     problem here
>>     > is that there are actually many different forms of this
>>     expression and
>>     > surrounding control flow, depending on the frontend's
integer
>>     division
>>     > semantics. We would have to recognize all the variations of
>>     checking for
>>     > divide-by-zero and/or overflow, trapping and/or producing a
certain
>>     > constant, branching or selecting values. This is obviously
terrible
>>     > from an engineering standpoint.
>>     >
>>     > -Andy
>>     > _______________________________________________
>>     > LLVM Developers mailing list
>>     > LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>
>>     http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/>
>>     > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>>     _______________________________________________
>>     LLVM Developers mailing list
>>     LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>
>>     http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/>
>>     http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>
http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/53e2cbb4/attachment.html>

Apparently Analagous Threads

Search for more seemingly similar threads

llvm dev - Apr 2014 - [LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

[LLVMdev] Proposal: add intrinsics for safe division

Apparently Analagous Threads