Sanjay Patel via llvm-dev
2017-May-19 23:11 UTC
[llvm-dev] [llvm] r303387 - [InstCombine] add more tests for xor-of-icmps; NFC
[adding llvm-dev for wider audience] On Fri, May 19, 2017 at 12:28 PM, Daniel Berlin <dberlin at dberlin.org> wrote:> > > On Fri, May 19, 2017 at 11:00 AM, Davide Italiano <davide at freebsd.org> > wrote: > >> On Fri, May 19, 2017 at 10:00 AM, Sanjay Patel <spatel at rotateright.com> >> wrote: >> > Is "VRP" (value range propagation) in LLVM-speak "CVP" (correlated value >> > propagation)? >> > >> > If so, we have this comment regarding compares: >> > // As a policy choice, we choose not to waste compile time on anything >> > where >> > // the comparison is testing local values. >> > >> > Or this for div/rem: >> > // As a policy choice, we choose not >> > // to waste compile time on anything where the operands are local >> defs. >> > >> > "Local" means in the same basic block from what I can tell by the code >> here. >> > >> > I think this means that this pass explicitly defers handling simple >> cases >> > like: >> > https://reviews.llvm.org/D33342 >> > ...to another pass, and currently that pass is InstCombine (because the >> > patterns really can't be any simpler than the tests in that patch, >> right?). >> > >> > I think that improving compile-time should not preclude improving >> > InstCombine. We should do both. >> > >> >> Just thoughts, feel free to ignore them. >> I didn't measure the impact in LLVM, but I'm sure you can do VRP >> relatively fast (GCC does that both interprocedurally and >> intraprocedurally and their pass is much faster in some cases than >> LLVM's), i.e. O(N) in practice, so, maybe we could re-evaluate this >> policy? >> > > Yeah, that's kind of silly, we can do much better. > > >> I think replacing a pass in LLVM is not trivial (I'm learning it the >> hard way with NewGVN). OTOH, I'm still not entirely convinced >> `InstCombine` should handle these cases given it's already a >> compile-time sink? >> > >Correct me where I'm going wrong. 1. We got a bug report with a perf regression that was root caused to this IR: define i1 @xor_of_icmps(i64 %a) { %b = icmp sgt i64 %a, 0 %c = icmp eq i64 %a, 1 %r = xor i1 %b, %c ret i1 %r } Because this was a regression, we know that the optimal/canonical IR for this program is: define i1 @xor_of_icmps_canonical(i64 %a) { %r = icmp sgt i64 %a, 1 ret i1 %r } 2. I saw this as a bug in InstCombine because I think InstCombine's job is to canonicalize simple IR sequences. 3. I assumed that matching an (xor icmp, icmp) pattern can be done efficiently in InstCombine. In fact, I knew that we already did that match, so there is zero incremental cost to find this pattern in InstCombine. 4. I knew that we already handle related and-of-cmps and or-of-cmps patterns in InstSimplify/InstCombine. 5. Based on that, I have proposed a patch that mostly uses existing logic in those passes to solve this bug in the most general way I am aware of. It makes 2 calls to InstSimplify to find a potential fold before morphing/creating new instructions that may get folded by other existing logic. Questions: 1. Was/is InstCombine intended to handle this transform? 2. If there is a measurable compile-time cost for this patch, then there must be a structural problem with InstSimplify, InstCombine, and/or the pass pipeline? 3. Is there another pass that is more suitable to solve this bug than InstCombine? CVP was mentioned as a candidate for enhancement. Any others? 4. If there is a more suitable pass, why is it more suitable? 5. If we add to that pass or create a new pass that specifically targets logic-of-cmps, can we pull that code out of InstSimplify and InstCombine? 6. If compile-time has become unbearable, shouldn't we be moving optimization logic and/or passes from -O1 to -O2 to -O3, etc? My understanding as a compiler consumer has always been that I'll get better performing code the higher up the -O# ladder I climb, but it will cost increasingly more time to compile. Is that not the model for LLVM? People seem intent on adding lots and lots of stuff to instcombine because> it's easy. > This makes it harder to ever replace, while simultaneously making it > slower. > It's not hard to see where that path ends up. > InstCombine does a lot of random crap that isn't even combining or graph > rewrite, too, because well second and third order effects are hard. >Can you provide some examples? I honestly don't know where I should draw the line. If I've crossed the line, I'd like to fix it. If we want an optimal graph rewriter, we should actually just build one,> with sane rewrite rules, verification that it fixpoints, and some sense of > good ordering, etc, instead of slowly every adding possible pattern match > to InstCombine. >Can you recommend papers or projects to look at to get a better understanding of an optimal graph rewriter? LLVM is the only compiler I've worked on from the inside, so I don't really know what else is possible / feasible. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170519/b8d97a5e/attachment.html>
Davide Italiano via llvm-dev
2017-May-19 23:27 UTC
[llvm-dev] [llvm] r303387 - [InstCombine] add more tests for xor-of-icmps; NFC
On Fri, May 19, 2017 at 4:11 PM, Sanjay Patel <spatel at rotateright.com> wrote:> [adding llvm-dev for wider audience] > > On Fri, May 19, 2017 at 12:28 PM, Daniel Berlin <dberlin at dberlin.org> wrote: >> >> >> >> On Fri, May 19, 2017 at 11:00 AM, Davide Italiano <davide at freebsd.org> >> wrote: >>> >>> On Fri, May 19, 2017 at 10:00 AM, Sanjay Patel <spatel at rotateright.com> >>> wrote: >>> > Is "VRP" (value range propagation) in LLVM-speak "CVP" (correlated >>> > value >>> > propagation)? >>> > >>> > If so, we have this comment regarding compares: >>> > // As a policy choice, we choose not to waste compile time on >>> > anything >>> > where >>> > // the comparison is testing local values. >>> > >>> > Or this for div/rem: >>> > // As a policy choice, we choose not >>> > // to waste compile time on anything where the operands are local >>> > defs. >>> > >>> > "Local" means in the same basic block from what I can tell by the code >>> > here. >>> > >>> > I think this means that this pass explicitly defers handling simple >>> > cases >>> > like: >>> > https://reviews.llvm.org/D33342 >>> > ...to another pass, and currently that pass is InstCombine (because the >>> > patterns really can't be any simpler than the tests in that patch, >>> > right?). >>> > >>> > I think that improving compile-time should not preclude improving >>> > InstCombine. We should do both. >>> > >>> >>> Just thoughts, feel free to ignore them. >>> I didn't measure the impact in LLVM, but I'm sure you can do VRP >>> relatively fast (GCC does that both interprocedurally and >>> intraprocedurally and their pass is much faster in some cases than >>> LLVM's), i.e. O(N) in practice, so, maybe we could re-evaluate this >>> policy? >> >> >> Yeah, that's kind of silly, we can do much better. >> >>> >>> I think replacing a pass in LLVM is not trivial (I'm learning it the >>> hard way with NewGVN). OTOH, I'm still not entirely convinced >>> `InstCombine` should handle these cases given it's already a >>> compile-time sink? >> >> > > Correct me where I'm going wrong. > > 1. We got a bug report with a perf regression that was root caused to this > IR: > > define i1 @xor_of_icmps(i64 %a) { > %b = icmp sgt i64 %a, 0 > %c = icmp eq i64 %a, 1 > %r = xor i1 %b, %c > ret i1 %r > } > > Because this was a regression, we know that the optimal/canonical IR for > this program is: > > define i1 @xor_of_icmps_canonical(i64 %a) { > %r = icmp sgt i64 %a, 1 > ret i1 %r > } > > 2. I saw this as a bug in InstCombine because I think InstCombine's job is > to canonicalize simple IR sequences. > > 3. I assumed that matching an (xor icmp, icmp) pattern can be done > efficiently in InstCombine. In fact, I knew that we already did that match, > so there is zero incremental cost to find this pattern in InstCombine. > > 4. I knew that we already handle related and-of-cmps and or-of-cmps patterns > in InstSimplify/InstCombine. > > 5. Based on that, I have proposed a patch that mostly uses existing logic in > those passes to solve this bug in the most general way I am aware of. It > makes 2 calls to InstSimplify to find a potential fold before > morphing/creating new instructions that may get folded by other existing > logic. > > > Questions: > 1. Was/is InstCombine intended to handle this transform? > > 2. If there is a measurable compile-time cost for this patch, then there > must be a structural problem with InstSimplify, InstCombine, and/or the pass > pipeline? > > 3. Is there another pass that is more suitable to solve this bug than > InstCombine? CVP was mentioned as a candidate for enhancement. Any others? > > 4. If there is a more suitable pass, why is it more suitable? >I can provide an example + analysis: https://reviews.llvm.org/D33172#754563> 5. If we add to that pass or create a new pass that specifically targets > logic-of-cmps, can we pull that code out of InstSimplify and InstCombine? > > 6. If compile-time has become unbearable, shouldn't we be moving > optimization logic and/or passes from -O1 to -O2 to -O3, etc? My > understanding as a compiler consumer has always been that I'll get better > performing code the higher up the -O# ladder I climb, but it will cost > increasingly more time to compile. Is that not the model for LLVM? >The real problem of instcombine IMHO is that it suffer from an additive effect. i.e. matchers over matchers are added over time which per-se don't slow down the compiler in any measurable manner. But all of them sum up result in a significant compiler slowdown, and we notice only when it's too late (i.e. when the code has been released). We've all under the assumption that the time in InstCombine was spent inside auxiliary analysis computing the bitwise domain (ComputeKnownBits et similia). Turns out we're not able to find a case where caching speeds up things.> >> People seem intent on adding lots and lots of stuff to instcombine because >> it's easy. >> This makes it harder to ever replace, while simultaneously making it >> slower. >> It's not hard to see where that path ends up. >> InstCombine does a lot of random crap that isn't even combining or graph >> rewrite, too, because well second and third order effects are hard. > > > Can you provide some examples? I honestly don't know where I should draw the > line. If I've crossed the line, I'd like to fix it. >I don't have the code handy, but something that came up recently (and I was a little surprised) is that instcombine checks for irreducible PHI cycles, see the functions/comments in InstCombinePHI.>> If we want an optimal graph rewriter, we should actually just build one, >> with sane rewrite rules, verification that it fixpoints, and some sense of >> good ordering, etc, instead of slowly every adding possible pattern match >> to InstCombine. > > > Can you recommend papers or projects to look at to get a better > understanding of an optimal graph rewriter? LLVM is the only compiler I've > worked on from the inside, so I don't really know what else is possible / > feasible. > >-- Davide "There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare
Sanjay Patel via llvm-dev
2017-May-21 15:55 UTC
[llvm-dev] [llvm] r303387 - [InstCombine] add more tests for xor-of-icmps; NFC
On Fri, May 19, 2017 at 5:27 PM, Davide Italiano <davide at freebsd.org> wrote:> On Fri, May 19, 2017 at 4:11 PM, Sanjay Patel <spatel at rotateright.com> > wrote: > > [adding llvm-dev for wider audience] > > > > On Fri, May 19, 2017 at 12:28 PM, Daniel Berlin <dberlin at dberlin.org> > wrote: > >> > >> > >> > >> On Fri, May 19, 2017 at 11:00 AM, Davide Italiano <davide at freebsd.org> > >> wrote: > >>> > >>> On Fri, May 19, 2017 at 10:00 AM, Sanjay Patel <spatel at rotateright.com > > > >>> wrote: > >>> > Is "VRP" (value range propagation) in LLVM-speak "CVP" (correlated > >>> > value > >>> > propagation)? > >>> > > >>> > If so, we have this comment regarding compares: > >>> > // As a policy choice, we choose not to waste compile time on > >>> > anything > >>> > where > >>> > // the comparison is testing local values. > >>> > > >>> > Or this for div/rem: > >>> > // As a policy choice, we choose not > >>> > // to waste compile time on anything where the operands are local > >>> > defs. > >>> > > >>> > "Local" means in the same basic block from what I can tell by the > code > >>> > here. > >>> > > >>> > I think this means that this pass explicitly defers handling simple > >>> > cases > >>> > like: > >>> > https://reviews.llvm.org/D33342 > >>> > ...to another pass, and currently that pass is InstCombine (because > the > >>> > patterns really can't be any simpler than the tests in that patch, > >>> > right?). > >>> > > >>> > I think that improving compile-time should not preclude improving > >>> > InstCombine. We should do both. > >>> > > >>> > >>> Just thoughts, feel free to ignore them. > >>> I didn't measure the impact in LLVM, but I'm sure you can do VRP > >>> relatively fast (GCC does that both interprocedurally and > >>> intraprocedurally and their pass is much faster in some cases than > >>> LLVM's), i.e. O(N) in practice, so, maybe we could re-evaluate this > >>> policy? > >> > >> > >> Yeah, that's kind of silly, we can do much better. > >> > >>> > >>> I think replacing a pass in LLVM is not trivial (I'm learning it the > >>> hard way with NewGVN). OTOH, I'm still not entirely convinced > >>> `InstCombine` should handle these cases given it's already a > >>> compile-time sink? > >> > >> > > > > Correct me where I'm going wrong. > > > > 1. We got a bug report with a perf regression that was root caused to > this > > IR: > > > > define i1 @xor_of_icmps(i64 %a) { > > %b = icmp sgt i64 %a, 0 > > %c = icmp eq i64 %a, 1 > > %r = xor i1 %b, %c > > ret i1 %r > > } > > > > Because this was a regression, we know that the optimal/canonical IR for > > this program is: > > > > define i1 @xor_of_icmps_canonical(i64 %a) { > > %r = icmp sgt i64 %a, 1 > > ret i1 %r > > } > > > > 2. I saw this as a bug in InstCombine because I think InstCombine's job > is > > to canonicalize simple IR sequences. > > > > 3. I assumed that matching an (xor icmp, icmp) pattern can be done > > efficiently in InstCombine. In fact, I knew that we already did that > match, > > so there is zero incremental cost to find this pattern in InstCombine. > > > > 4. I knew that we already handle related and-of-cmps and or-of-cmps > patterns > > in InstSimplify/InstCombine. > > > > 5. Based on that, I have proposed a patch that mostly uses existing > logic in > > those passes to solve this bug in the most general way I am aware of. It > > makes 2 calls to InstSimplify to find a potential fold before > > morphing/creating new instructions that may get folded by other existing > > logic. > > > > > > Questions: > > 1. Was/is InstCombine intended to handle this transform? > > > > 2. If there is a measurable compile-time cost for this patch, then there > > must be a structural problem with InstSimplify, InstCombine, and/or the > pass > > pipeline? > > > > 3. Is there another pass that is more suitable to solve this bug than > > InstCombine? CVP was mentioned as a candidate for enhancement. Any > others? > > > > 4. If there is a more suitable pass, why is it more suitable? > > > > I can provide an example + analysis: https://reviews.llvm.org/D3317 > 2#754563 > >Yes, I agreed that case should be solved more generally. There are follow-up questions in that review that I'd like to reach consensus on though. For example: define i32 @cmp_br_1(i32 %a, i32 %b) { entry: %cmp = icmp sge i32 %a, %b br i1 %cmp, label %t, label %f t: ret i32 42 f: ret i32 31415 } define i32 @cmp_br_2(i32 %a, i32 %b) { entry: %cmp = icmp slt i32 %a, %b ; invert predicate br i1 %cmp, label %f, label %t ; swap successors t: ret i32 42 f: ret i32 31415 } Which of these is canonical? Is InstCombine responsible for that canonicalization?> > 5. If we add to that pass or create a new pass that specifically targets > > logic-of-cmps, can we pull that code out of InstSimplify and InstCombine? > > > > 6. If compile-time has become unbearable, shouldn't we be moving > > optimization logic and/or passes from -O1 to -O2 to -O3, etc? My > > understanding as a compiler consumer has always been that I'll get better > > performing code the higher up the -O# ladder I climb, but it will cost > > increasingly more time to compile. Is that not the model for LLVM? > > > > The real problem of instcombine IMHO is that it suffer from an > additive effect. i.e. matchers over matchers are added over time which > per-se don't slow down the compiler in any measurable manner. But all > of them sum up result in a significant compiler slowdown, and we > notice only when it's too late (i.e. when the code has been released). > We've all under the assumption that the time in InstCombine was spent > inside auxiliary analysis computing the bitwise domain > (ComputeKnownBits et similia). Turns out we're not able to find a case > where caching speeds up things. > >This conclusion doesn't line up with the experimental results I showed here: http://lists.llvm.org/pipermail/llvm-dev/2017-March/111416.html Is there a bug report with a test case for the stated case of "matchers over matchers"? I'd like to try some experiments with that test. From the earlier thread, it sounded like improving that case was something that was being investigated or would be investigated soon. The general idea, as I understand it, is that we need to better define/distinguish canonicalization from optimization and then separate InstCombine into 'InstCanonicalize' and 'InstOptimize' and/or not run InstCombine at full strength so many times.> > > >> People seem intent on adding lots and lots of stuff to instcombine > because > >> it's easy. > >> This makes it harder to ever replace, while simultaneously making it > >> slower. > >> It's not hard to see where that path ends up. > >> InstCombine does a lot of random crap that isn't even combining or graph > >> rewrite, too, because well second and third order effects are hard. > > > > > > Can you provide some examples? I honestly don't know where I should draw > the > > line. If I've crossed the line, I'd like to fix it. > > > > I don't have the code handy, but something that came up recently (and > I was a little surprised) is that instcombine checks for irreducible > PHI cycles, see the functions/comments in InstCombinePHI. > > >> If we want an optimal graph rewriter, we should actually just build one, > >> with sane rewrite rules, verification that it fixpoints, and some sense > of > >> good ordering, etc, instead of slowly every adding possible pattern > match > >> to InstCombine. > > > > > > Can you recommend papers or projects to look at to get a better > > understanding of an optimal graph rewriter? LLVM is the only compiler > I've > > worked on from the inside, so I don't really know what else is possible / > > feasible. > > > > > > > > -- > Davide > > "There are no solved problems; there are only problems that are more > or less solved" -- Henri Poincare >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170521/7854f858/attachment.html>
Daniel Berlin via llvm-dev
2017-May-21 16:02 UTC
[llvm-dev] [llvm] r303387 - [InstCombine] add more tests for xor-of-icmps; NFC
(I'll reply in more depth a bit later this week, i'm slammed ATM) On Fri, May 19, 2017 at 4:11 PM, Sanjay Patel <spatel at rotateright.com> wrote:> [adding llvm-dev for wider audience] > > On Fri, May 19, 2017 at 12:28 PM, Daniel Berlin <dberlin at dberlin.org> > wrote: > >> >> >> On Fri, May 19, 2017 at 11:00 AM, Davide Italiano <davide at freebsd.org> >> wrote: >> >>> On Fri, May 19, 2017 at 10:00 AM, Sanjay Patel <spatel at rotateright.com> >>> wrote: >>> > Is "VRP" (value range propagation) in LLVM-speak "CVP" (correlated >>> value >>> > propagation)? >>> > >>> > If so, we have this comment regarding compares: >>> > // As a policy choice, we choose not to waste compile time on >>> anything >>> > where >>> > // the comparison is testing local values. >>> > >>> > Or this for div/rem: >>> > // As a policy choice, we choose not >>> > // to waste compile time on anything where the operands are local >>> defs. >>> > >>> > "Local" means in the same basic block from what I can tell by the code >>> here. >>> > >>> > I think this means that this pass explicitly defers handling simple >>> cases >>> > like: >>> > https://reviews.llvm.org/D33342 >>> > ...to another pass, and currently that pass is InstCombine (because the >>> > patterns really can't be any simpler than the tests in that patch, >>> right?). >>> > >>> > I think that improving compile-time should not preclude improving >>> > InstCombine. We should do both. >>> > >>> >>> Just thoughts, feel free to ignore them. >>> I didn't measure the impact in LLVM, but I'm sure you can do VRP >>> relatively fast (GCC does that both interprocedurally and >>> intraprocedurally and their pass is much faster in some cases than >>> LLVM's), i.e. O(N) in practice, so, maybe we could re-evaluate this >>> policy? >>> >> >> Yeah, that's kind of silly, we can do much better. >> >> >>> I think replacing a pass in LLVM is not trivial (I'm learning it the >>> hard way with NewGVN). OTOH, I'm still not entirely convinced >>> `InstCombine` should handle these cases given it's already a >>> compile-time sink? >>> >> >> > Correct me where I'm going wrong. > > 1. We got a bug report with a perf regression that was root caused to this > IR: > > define i1 @xor_of_icmps(i64 %a) { > %b = icmp sgt i64 %a, 0 > %c = icmp eq i64 %a, 1 > %r = xor i1 %b, %c > ret i1 %r > } >Yup> > Because this was a regression, we know that the optimal/canonical IR for > this program is: > > define i1 @xor_of_icmps_canonical(i64 %a) { > %r = icmp sgt i64 %a, 1 > ret i1 %r > } > > Yes.> 2. I saw this as a bug in InstCombine because I think InstCombine's job is > to canonicalize simple IR sequences. >Where is the dividing line between what InstCombine does and what something else does?> > 3. I assumed that matching an (xor icmp, icmp) pattern can be done > efficiently in InstCombine. In fact, I knew that we already did that match, > so there is zero incremental cost to find this pattern in InstCombine. > >> 4. I knew that we already handle related and-of-cmps and or-of-cmps > patterns in InstSimplify/InstCombine. >> > 5. Based on that, I have proposed a patch that mostly uses existing logic > in those passes to solve this bug in the most general way I am aware of. It > makes 2 calls to InstSimplify to find a potential fold before > morphing/creating new instructions that may get folded by other existing > logic. > > > Questions: > 1. Was/is InstCombine intended to handle this transform? >InstCombine has no real philosophy right now. It can handle *anything*. The more it handles, the more people want to shove stuff in it. You can see this already. It does random store sinking, phi conversion, etc. Because if some other pass handles it, then they don't get as many simplifications, because nobody has built out those places. Without some real philosophy line drawing, instcombine expands to cover everything, gets slow and can never be replaced without introducing performance regressions. It becomes the only thing that does certain things, and then people want to run it more as a result. But you can't run it more because it's slow (which also incentivizes people to make it do more, so that it catches whatever case when it does run)> > 2. If there is a measurable compile-time cost for this patch, then there > must be a structural problem with InstSimplify, InstCombine, and/or the > pass pipeline? > >Places like InstCombine get slower by very small percentages at a time. Really.> > People seem intent on adding lots and lots of stuff to instcombine because >> it's easy. >> This makes it harder to ever replace, while simultaneously making it >> slower. >> It's not hard to see where that path ends up. >> InstCombine does a lot of random crap that isn't even combining or graph >> rewrite, too, because well second and third order effects are hard. >> > > Can you provide some examples?] >It does: Demanded bits analysis Phi translation and attempts to fold Instruction sinking Reassociation Factorization Dead code elimination> I honestly don't know where I should draw the line. If I've crossed the > line, I'd like to fix it. >I'm sorry if you feel like this is people saying it's you. It's really not. It's really "hey, uh, are we ever going to stop trying to shove stuff in this thing?" I think if you look, every so often people take notice of these kinds of things and it's usually started in a random review thread. :) As for where to draw the line, i think that's a good discussion for us all to have. I think we *should* say what the line is, draw it, and stick to it, and probably split it up into multiple things (canonicalization, etc). Truthfully, right now, Instcombine crossed whatever line you want to draw a long time ago. It's 30k lines of code. For reference: The entirety of transforms/Scalar, combined, is only 60k lines of code. Of course, line counting is not a great measure of a lot, i'm just pointing out it contains roughly half the code of the rest of our scalar optimization infrastructure, and 30% of the total code of it overall. It's twice as big as our entire vectorization infrastructure. It's big. It does a lot. I'd argue that even if most of these transforms belong there, there are probably cleaner, more performance, more maintainable, and easier to reason about ways at this point to produce the same result.> If we want an optimal graph rewriter, we should actually just build one, >> with sane rewrite rules, verification that it fixpoints, and some sense of >> good ordering, etc, instead of slowly every adding possible pattern match >> to InstCombine. >> > > Can you recommend papers or projects to look at to get a better > understanding of an optimal graph rewriter? LLVM is the only compiler I've > worked on from the inside, so I don't really know what else is possible / > feasible. > > Sure, let me get you some.> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170521/5c6c697a/attachment.html>
David Majnemer via llvm-dev
2017-May-21 22:24 UTC
[llvm-dev] [llvm] r303387 - [InstCombine] add more tests for xor-of-icmps; NFC
On Sun, May 21, 2017 at 9:02 AM, Daniel Berlin via llvm-dev < llvm-dev at lists.llvm.org> wrote:> (I'll reply in more depth a bit later this week, i'm slammed ATM) > > On Fri, May 19, 2017 at 4:11 PM, Sanjay Patel <spatel at rotateright.com> > wrote: > >> [adding llvm-dev for wider audience] >> >> On Fri, May 19, 2017 at 12:28 PM, Daniel Berlin <dberlin at dberlin.org> >> wrote: >> >>> >>> >>> On Fri, May 19, 2017 at 11:00 AM, Davide Italiano <davide at freebsd.org> >>> wrote: >>> >>>> On Fri, May 19, 2017 at 10:00 AM, Sanjay Patel <spatel at rotateright.com> >>>> wrote: >>>> > Is "VRP" (value range propagation) in LLVM-speak "CVP" (correlated >>>> value >>>> > propagation)? >>>> > >>>> > If so, we have this comment regarding compares: >>>> > // As a policy choice, we choose not to waste compile time on >>>> anything >>>> > where >>>> > // the comparison is testing local values. >>>> > >>>> > Or this for div/rem: >>>> > // As a policy choice, we choose not >>>> > // to waste compile time on anything where the operands are local >>>> defs. >>>> > >>>> > "Local" means in the same basic block from what I can tell by the >>>> code here. >>>> > >>>> > I think this means that this pass explicitly defers handling simple >>>> cases >>>> > like: >>>> > https://reviews.llvm.org/D33342 >>>> > ...to another pass, and currently that pass is InstCombine (because >>>> the >>>> > patterns really can't be any simpler than the tests in that patch, >>>> right?). >>>> > >>>> > I think that improving compile-time should not preclude improving >>>> > InstCombine. We should do both. >>>> > >>>> >>>> Just thoughts, feel free to ignore them. >>>> I didn't measure the impact in LLVM, but I'm sure you can do VRP >>>> relatively fast (GCC does that both interprocedurally and >>>> intraprocedurally and their pass is much faster in some cases than >>>> LLVM's), i.e. O(N) in practice, so, maybe we could re-evaluate this >>>> policy? >>>> >>> >>> Yeah, that's kind of silly, we can do much better. >>> >>> >>>> I think replacing a pass in LLVM is not trivial (I'm learning it the >>>> hard way with NewGVN). OTOH, I'm still not entirely convinced >>>> `InstCombine` should handle these cases given it's already a >>>> compile-time sink? >>>> >>> >>> >> Correct me where I'm going wrong. >> >> 1. We got a bug report with a perf regression that was root caused to >> this IR: >> >> define i1 @xor_of_icmps(i64 %a) { >> %b = icmp sgt i64 %a, 0 >> %c = icmp eq i64 %a, 1 >> %r = xor i1 %b, %c >> ret i1 %r >> } >> > Yup > > >> >> Because this was a regression, we know that the optimal/canonical IR for >> this program is: >> >> define i1 @xor_of_icmps_canonical(i64 %a) { >> %r = icmp sgt i64 %a, 1 >> ret i1 %r >> } >> >> Yes. > > >> 2. I saw this as a bug in InstCombine because I think InstCombine's job >> is to canonicalize simple IR sequences. >> > > Where is the dividing line between what InstCombine does and what > something else does? >In practice? CFG transformations and inter-procedural reasoning are forbidden. The driving philosophy of InstCombine, as I know it, is "do reasonable stuff which exposes further canonicalization opportunities." This mandate is widely defined but is not unreasonable in and of itself. It is unreasonable to expect InstCombine to perform _all_ canonicalization opportunities. InstCombine is a soupy combination of many passes because it is a engineering trade-off between perfect separation of concerns (performing no overlapping work with any other pass which has a specific responsibility) and having one giant pass which can do everything and anything.> > >> >> 3. I assumed that matching an (xor icmp, icmp) pattern can be done >> efficiently in InstCombine. In fact, I knew that we already did that match, >> so there is zero incremental cost to find this pattern in InstCombine. >> >> > >> 4. I knew that we already handle related and-of-cmps and or-of-cmps >> patterns in InstSimplify/InstCombine. >> > > >> >> 5. Based on that, I have proposed a patch that mostly uses existing logic >> in those passes to solve this bug in the most general way I am aware of. It >> makes 2 calls to InstSimplify to find a potential fold before >> morphing/creating new instructions that may get folded by other existing >> logic. >> >> >> Questions: >> 1. Was/is InstCombine intended to handle this transform? >> > > InstCombine has no real philosophy right now. It can handle *anything*. > The more it handles, the more people want to shove stuff in it. You can > see this already. It does random store sinking, phi conversion, etc. > > Because if some other pass handles it, then they don't get as many > simplifications, because nobody has built out those places. > Without some real philosophy line drawing, instcombine expands to cover > everything, gets slow and can never be replaced without introducing > performance regressions. > > It becomes the only thing that does certain things, and then people want > to run it more as a result. But you can't run it more because it's slow > (which also incentivizes people to make it do more, so that it catches > whatever case when it does run) > >> >> 2. If there is a measurable compile-time cost for this patch, then there >> must be a structural problem with InstSimplify, InstCombine, and/or the >> pass pipeline? >> >> > Places like InstCombine get slower by very small percentages at a time. > Really. > >> >> People seem intent on adding lots and lots of stuff to instcombine >>> because it's easy. >>> This makes it harder to ever replace, while simultaneously making it >>> slower. >>> It's not hard to see where that path ends up. >>> InstCombine does a lot of random crap that isn't even combining or graph >>> rewrite, too, because well second and third order effects are hard. >>> >> >> Can you provide some examples?] >> > > It does: > > Demanded bits analysis > Phi translation and attempts to fold > Instruction sinking > Reassociation > Factorization > Dead code elimination >It also does DSE, some memory forwarding optimizations and a restrict subset of SROA. It can also do some really weird stuff like rewriting and/or coalescing allocas. It can even do speculation :)> > >> I honestly don't know where I should draw the line. If I've crossed the >> line, I'd like to fix it. >> > > I'm sorry if you feel like this is people saying it's you. It's really > not. It's really "hey, uh, are we ever going to stop trying to shove stuff > in this thing?" > I think if you look, every so often people take notice of these kinds of > things and it's usually started in a random review thread. :) > As for where to draw the line, i think that's a good discussion for us all > to have. > I think we *should* say what the line is, draw it, and stick to it, and > probably split it up into multiple things (canonicalization, etc). > > Truthfully, right now, Instcombine crossed whatever line you want to draw > a long time ago. > It's 30k lines of code. > > For reference: > The entirety of transforms/Scalar, combined, is only 60k lines of code. > > Of course, line counting is not a great measure of a lot, i'm just > pointing out it contains roughly half the code of the rest of our scalar > optimization infrastructure, and 30% of the total code of it overall. It's > twice as big as our entire vectorization infrastructure. It's big. It does > a lot. > I'd argue that even if most of these transforms belong there, there are > probably cleaner, more performance, more maintainable, and easier to reason > about ways at this point to produce the same result. >If there is a way to do what we do today which is more Pareto optimal, I doubt you will hear any complaints. You will hear loud complains if we made InstCombine as enjoyable to debug as SelectionDAGISel :)> >> If we want an optimal graph rewriter, we should actually just build one, >>> with sane rewrite rules, verification that it fixpoints, and some sense of >>> good ordering, etc, instead of slowly every adding possible pattern match >>> to InstCombine. >>> >> >> Can you recommend papers or projects to look at to get a better >> understanding of an optimal graph rewriter? LLVM is the only compiler I've >> worked on from the inside, so I don't really know what else is possible / >> feasible. >> >> Sure, let me get you some. > > >> >> > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170521/eef76c26/attachment.html>
Davide Italiano via llvm-dev
2017-Jul-21 19:05 UTC
[llvm-dev] [llvm] r303387 - [InstCombine] add more tests for xor-of-icmps; NFC
On Fri, May 19, 2017 at 4:11 PM, Sanjay Patel <spatel at rotateright.com> wrote:> [adding llvm-dev for wider audience] > > > Can you recommend papers or projects to look at to get a better > understanding of an optimal graph rewriter? LLVM is the only compiler I've > worked on from the inside, so I don't really know what else is possible / > feasible. > >Sorry, I realized this never got a reply. If I were you, I'd start from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.8495&rep=rep1&type=pdf When I last looked at this area [for different purposes, many years ago] I found GrGen nice to play with (to understand/visualize the problem). http://ai2-s2-pdfs.s3.amazonaws.com/a509/550d8427c210f6a5900c12733181e9a2e862.pdf The algebraic approach to graph rewriting requires a very superficial knowledge of categories, as the problem is stated in terms of graph homomorphism, but I wouldn't worry too much , TBH, unless you want to dig into the theory behind this. Thanks, -- Davide "There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare