Hi Danny, Thanks for that :) However I've just updated the prototype patch to NewGVN and it didn't need any API changes - all I rely on is GVNExpression. Hongbin, I wanted to explain a little about what GVNSink can currently do, what it was designed for and hopefully how to make it handle your testcase. *Background* Common code sinking is more difficult to efficently do than one might expect. Prior to October 2016, LLVM was able to handle simple cases of this within SimplifyCFG, for example: if (a) { b += 42; } else { b += 42; } // -> b += 42 It could even sink operations where a phi was required: if (a) { b += 42; } else { b += 64; } // -> b += phi(42, 64) However it had a simplistic execution model and so: 1) Restricted itself to blocks with only TWO incoming arcs 2) Would only ever produce one PHI and then bail out Generally, the factors that make the search space for a good generic sinking optimization large are: 1) The number of incoming edges to a block. 2) Non-monotonicity of heuristic - for example it might not be worthwhile sinking 2 instructions but may be worthwhile sinking 3 instructions. 3) Permutations of otherwise similar instructions in incoming blocks. I'll describe them all in a bit more detail: *1) The number of incoming edges to a block* There are many constructs that can create a block with more than two input edges. An obvious example is a switch: switch (a) { case 1: b += 32; break; case 2: b += 43; break; default: break; } Another is a chain of if-elses: if (a) { b += 32; } else if (c) { b += 43; } else { b += 65; } Note that those two examples are subtly different. In the latter example, every branch performs an addition to 'b'. We can therefore sink all the adds to the end and perform only one add: b += phi(32, 43, 65); However in the former case things aren't so simple. The default case never does any add, so we have to speculate the add instruction in that case. For adds, this is simple - adds don't have sideeffects and we can speculate by adding with an identity (0): b += phi(32, 43, 0); However what if 'b' was a global variable? In this case, each case of the switch will have a load/add/store sequence: switch(a) --------------- / \ \ | | | [t1 = load @b] [t2 = load @b] | [t1 = add t1, 32] [t2 = add t2, 43] | [store @b, t1] [store @b, t2] | \ / / v v v------------- [ ] In most cases, we cannot be sure that in the default case @b is allowed to be written to. Speculating a store in this case could be invalid. Therefore, if we want to sink the store (and thus the add and load) we need to invent a new block to sink them to. switch(a) --------------- / \ \ | | | [t1 = load @b] [t2 = load @b] | [t1 = add t1, 32] [t2 = add t2, 43] | \ / | [ p = phi(t1, t2) ] / [ store @b, p ] / | +------------ v v [ ] Creating this new block allows the switch to be turned into a trivial one that can be lowered to a bit test or jump table. There is an even more complex case that appears quite a bit: if (a) { switch (b) { case 1: c += 42; break; case 2: c += 2; break; default: c += 3; break; } } else { switch (d) { case 1: e -= 2; break; case 2: e -= 6; break; default: e -= 32; break; } } We canonicalize this CFG to one cleanup block with six incoming arcs. In this case, it takes some nontrivial analysis to determine what to do, as it requires partitioning the incoming arcs into sets, performing different sinking operations on each (in this case two disjoint sets, but there are cases where the sets are not disjoint and some cost heuristic needs to be used). *2) Non-monotonicity of heuristic* What I'm trying to say here is the greedy algorithms work best when the heuristic they are computing is monotonic. Consider this: if (a) { tmp1 = *b; tmp2 = *c; tmp3 = tmp1 + tmp2; *d = tmp3; } else { tmp4 = *b; tmp5 = *c; tmp6 = tmp4 + tmp5; *e = tmp6; } If we proceed sequentially from the bottom of these blocks, trying to determine if sinking them is worthwhile: Step 1: [*e = tmp6], [*d = tmp3] - This is the same operation, a store, but it requires two PHIs - one for (e,d) and one for (tmp6, tmp3). - This is NOT worthwhile. We are constraining register allocation for only one instruction commonned. Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2] - Again, the operation is the same, but now we need another two PHIs, plus one PHI from Step 1 (the phi required for (tmp6, tmp3) is now no longer required). - Three PHIs required for two instructions sunk. NOT worth it. Step 3: [tmp5 = *c], [tmp2 = *c] - These can be sunk, and have the same arguments so no PHIs required. We still need two of the PHIs from step 2 (the (tmp5, tmp2) phi now is no longer required) - At this point we've sunk three instructions for two PHIs. Maybe it's worth it? Step 4: [tmp4 = *b], [tmp1 = *b] - Same as step 3, but now the (tmp4, tmp1) PHI from step 2 goes away too - So now we're sinking 4 instructions for one PHI. - Now it's definitely worth it! Whatever heuristic one chooses, in a case like this it will get worse, then maybe get better again. So we can't simply bail out when the heuristic reaches some threshold, because maybe it will improve as we analyze more instructions! *3) Permutations of otherwise similar instructions* This is the case you describe. Stores (for example) that look similar but have been permuted: if (a) { *b = 52; *c = 43; } else { *c = 87; *b = 54; } If we approach this bottom-up, we will compare [*b = 54], [*c = 43], then [*c = 87], [*b = 52] and conclude that two instructions can be sunk, but 4 PHIs are required (because the target of the store and value are different in each case). If we looked at [*b = 54], [*b = 52], perhaps we would have seen that we can sink this with only *two* PHIs required and that might change our decision. This is generally quite difficult and especially shows up with memory instructions like this (if c and b alias, then permuting them when sinking would be illegal. So we need alias analysis here). *What GVNSink attempts to solve* The current implementation of GVNSink was meant to solve 1) and 2), but NOT 3). 3) wasn't within scope when I was writing GVNSink, but I think 3) is doable for some cases within the framework of GVNSink. *How GVNSink works* GVNSink works by computing a value number for each IR instruction, similar to GVN. However, GVNSink's value numbering works slightly from GVN's. To recap, the value number (VN) of an instruction is a function of the instruction. If VN[I] == VN[J], I and J can be considered equivalent, for some definition of equivalence. In GVN, VN[I] is based upon I and all of I's operands, transitively. Therefore if VN[I] == VN[J], I and J were computed in the same way and J can be used instead of I. In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*, transitively. Therefore if VN[I] == VN[J], all of the uses of I are the same as all of the uses of J. Therefore I and J can be merged. So this: [ %a = add %I, 1 ] [ %c = add %J, 1 ] [ %b = sub %a, 5 ] [ %d = sub %c, 5 ] [ ret %b ] [ ret %d ] Can be transformed into this: [] [] \ / [ %IJ = phi %I, %J ] [ %a = add %IJ, 1 ] [ %b = sub %a, 5 ] [ ret %b ] In GVNSink, we step backwards across all instructions in all incoming basic blocks (see LockstepReverseIterator). This produces, conceptually, a sequence of sets of instructions. Each of these sets is analyzed in turn (see analyzeInstructionForSinking) - the value number of each contained instruction is calculated and used to determine if the entire set is mergeable or not (all values have the same VN). If not all instructions in the set have the same VN, the set is pruned to just those with the same VN (or we stop if all instructions had different VNs). If this set passes extra legality checks, a SinkingInstructionCandidate is created. This contains pertinent data for the cost heuristic: 1) Number of instructions from each incoming block being sunk 2) Number of blocks to pick instructions from (this can be a subset of all the incoming blocks!) 3) How many PHIs would be generated in total After all candidates have been examined, they are sorted based on a heuristic and the best candidate picked and applied. The use of value numbering in this algorithm is what allows us to do this in somewhat-linear time with respect to the number of instructions inspected. Now, to solve 3), I think the component that needs swapping out is LockstepReverseIterator. This currently iterates backwards in lockstep across a set of blocks. To handle permutations, I think we could use a priority queue that feeds the main algorithm instructions from the input blocks in a slightly sorted order. For example, perhaps it might order sequences of stores by their target operand (assuming alias analysis says it can), ensuring sequences of stores in different blocks appear in the same order? Anyway, I hope this has helped understand a bit more about what GVNSink does, can do and how to modify it if you want to solve your problem. Cheers, James On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote:> I have a patch i'm working on to split newgvn into a proper analysis. > It works (and happy to share it), it's just the interface i'm using is a > little ugly (haven't decided what the right API is), but let me know what > GVNSink needs from life and i'll make sure it's there :) > > > On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> It's basically ready to commit; the reviewers were fairly happy with it. >> It needs rebasing on top of NewGVN and any bugs that shakes out fixed, but >> that's about it. >> >> I want to get around to it soon-ish, but I've wanted that for a while! >> >> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com> wrote: >> >>> Hi James, >>> >>> I have an ad-hoc solution in mind to solve this problem. >>> But if it can be solved under the framework of GVN"Sink", it is even >>> better. >>> >>> any plan on your https://reviews.llvm.org/D24805? >>> >>> Thanks >>> Hongbin >>> >>> >>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <james at jamesmolloy.co.uk> >>> wrote: >>> >>>> Hi, >>>> >>>> Yes, I can see why that would not work. >>>> >>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In >>>> particular it can't reason about memory ordering and aliasing. In >>>> unswitch1(), it can identify that the stores correlate because the >>>> correlating stores appear in the same relative source order. In unswitch2() >>>> they have been permuted, and the algorithm cannot deal with this. This >>>> requires some kind of value numbering to do efficiently. >>>> >>>> The GVNSink pass that I really should get around to updating should >>>> solve this, eventually! >>>> >>>> James >>>> >>>> >>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev < >>>> llvm-dev at lists.llvm.org> wrote: >>>> >>>>> Looks like this do not work: >>>>> >>>>> // Type your code here, or load an example. >>>>> int a[10]; >>>>> >>>>> void unswitch2(int i, int x, int y0, int y1) { >>>>> if (x) { >>>>> a[i] = y0; >>>>> a[i + 1] = y1; >>>>> } else { >>>>> a[i + 1] = y0; >>>>> a[i] = y1; >>>>> } >>>>> } >>>>> >>>>> https://godbolt.org/g/Ldd5qV >>>>> >>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <etherzhhb at gmail.com> >>>>> wrote: >>>>> >>>>>> Thanks, >>>>>> >>>>>> Looks like inst combine do the job >>>>>> >>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano <davide at freebsd.org> >>>>>> wrote: >>>>>> >>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev >>>>>>> <llvm-dev at lists.llvm.org> wrote: >>>>>>> > Hi, >>>>>>> > >>>>>>> > Is there a pass in LLVM that can optimize: >>>>>>> > >>>>>>> > if (x) >>>>>>> > a[i] = y0; >>>>>>> > else >>>>>>> > a[i] = y1; >>>>>>> > >>>>>>> > to >>>>>>> > >>>>>>> > a[i] = x ? y0 : y1? >>>>>>> > >>>>>>> > I tried opt (3.9) with -O3 but looks like such an optimization do >>>>>>> not >>>>>>> > happened. >>>>>>> > >>>>>>> >>>>>>> The same IR at -O3 for both cases on this example. >>>>>>> https://godbolt.org/g/Tk2MM8 >>>>>>> >>>>>>> -- >>>>>>> Davide >>>>>>> >>>>>>> "There are no solved problems; there are only problems that are more >>>>>>> or less solved" -- Henri Poincare >>>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> LLVM Developers mailing list >>>>> llvm-dev at lists.llvm.org >>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>> >>>> >>> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170428/e290cd35/attachment.html>
On Fri, Apr 28, 2017 at 7:42 AM, James Molloy <james at jamesmolloy.co.uk> wrote:> Hi Danny, > > Thanks for that :) However I've just updated the prototype patch to NewGVN > and it didn't need any API changes - all I rely on is GVNExpression. > >Oh, nice. Note, Bryant has been working on PDSE, which, combined with your code here, would probably give us value based partial dead store elimination :) (since the value numbering issues are related) Hongbin,> > I wanted to explain a little about what GVNSink can currently do, what it > was designed for and hopefully how to make it handle your testcase. > > *Background* > > Common code sinking is more difficult to efficently do than one might > expect. Prior to October 2016, LLVM was able to handle simple cases of this > within SimplifyCFG, for example: > > if (a) { > b += 42; > } else { > b += 42; > } > // -> b += 42 > > It could even sink operations where a phi was required: > > if (a) { > b += 42; > } else { > b += 64; > } > // -> b += phi(42, 64) > > However it had a simplistic execution model and so: > 1) Restricted itself to blocks with only TWO incoming arcs > 2) Would only ever produce one PHI and then bail out > > Generally, the factors that make the search space for a good generic > sinking optimization large are: > > 1) The number of incoming edges to a block. > 2) Non-monotonicity of heuristic - for example it might not be > worthwhile sinking 2 instructions but may be worthwhile sinking 3 > instructions. > 3) Permutations of otherwise similar instructions in incoming blocks. > > I'll describe them all in a bit more detail: > > *1) The number of incoming edges to a block* > > There are many constructs that can create a block with more than two input > edges. An obvious example is a switch: > > switch (a) { > case 1: b += 32; break; > case 2: b += 43; break; > default: break; > } > > Another is a chain of if-elses: > > if (a) { > b += 32; > } else if (c) { > b += 43; > } else { > b += 65; > } > > Note that those two examples are subtly different. In the latter example, > every branch performs an addition to 'b'. We can therefore sink all the > adds to the end and perform only one add: > > b += phi(32, 43, 65); > > However in the former case things aren't so simple. The default case never > does any add, so we have to speculate the add instruction in that case. For > adds, this is simple - adds don't have sideeffects and we can speculate by > adding with an identity (0): > > b += phi(32, 43, 0); > > However what if 'b' was a global variable? In this case, each case of the > switch will have a load/add/store sequence: > > switch(a) --------------- > / \ \ > | | | > [t1 = load @b] [t2 = load @b] | > [t1 = add t1, 32] [t2 = add t2, 43] | > [store @b, t1] [store @b, t2] | > \ / / > v v v------------- > [ ] > > In most cases, we cannot be sure that in the default case @b is allowed to > be written to. Speculating a store in this case could be invalid. > Therefore, if we want to sink the store (and thus the add and load) we need > to invent a new block to sink them to. > > switch(a) --------------- > / \ \ > | | | > [t1 = load @b] [t2 = load @b] | > [t1 = add t1, 32] [t2 = add t2, 43] | > \ / | > [ p = phi(t1, t2) ] / > [ store @b, p ] / > | +------------ > v v > [ ] > > Creating this new block allows the switch to be turned into a trivial one > that can be lowered to a bit test or jump table. > > There is an even more complex case that appears quite a bit: > > if (a) { > switch (b) { > case 1: c += 42; break; > case 2: c += 2; break; > default: c += 3; break; > } > } else { > switch (d) { > case 1: e -= 2; break; > case 2: e -= 6; break; > default: e -= 32; break; > } > } > > We canonicalize this CFG to one cleanup block with six incoming arcs. In > this case, it takes some nontrivial analysis to determine what to do, as it > requires partitioning the incoming arcs into sets, performing different > sinking operations on each (in this case two disjoint sets, but there are > cases where the sets are not disjoint and some cost heuristic needs to be > used). > > *2) Non-monotonicity of heuristic* > > What I'm trying to say here is the greedy algorithms work best when the > heuristic they are computing is monotonic. Consider this: > > if (a) { > tmp1 = *b; > tmp2 = *c; > tmp3 = tmp1 + tmp2; > *d = tmp3; > } else { > tmp4 = *b; > tmp5 = *c; > tmp6 = tmp4 + tmp5; > *e = tmp6; > } > > If we proceed sequentially from the bottom of these blocks, trying to > determine if sinking them is worthwhile: > > Step 1: [*e = tmp6], [*d = tmp3] > - This is the same operation, a store, but it requires two PHIs - one > for (e,d) and one for (tmp6, tmp3). > - This is NOT worthwhile. We are constraining register allocation for > only one instruction commonned. > > Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2] > - Again, the operation is the same, but now we need another two PHIs, > plus one PHI from Step 1 (the phi required for (tmp6, tmp3) is now no > longer required). > - Three PHIs required for two instructions sunk. NOT worth it. > > Step 3: [tmp5 = *c], [tmp2 = *c] > - These can be sunk, and have the same arguments so no PHIs required. We > still need two of the PHIs from step 2 (the (tmp5, tmp2) phi now is no > longer required) > - At this point we've sunk three instructions for two PHIs. Maybe it's > worth it? > > Step 4: [tmp4 = *b], [tmp1 = *b] > - Same as step 3, but now the (tmp4, tmp1) PHI from step 2 goes away too > - So now we're sinking 4 instructions for one PHI. > - Now it's definitely worth it! > > Whatever heuristic one chooses, in a case like this it will get worse, > then maybe get better again. So we can't simply bail out when the heuristic > reaches some threshold, because maybe it will improve as we analyze more > instructions! > > *3) Permutations of otherwise similar instructions* > > This is the case you describe. Stores (for example) that look similar but > have been permuted: > > if (a) { > *b = 52; > *c = 43; > } else { > *c = 87; > *b = 54; > } > > If we approach this bottom-up, we will compare [*b = 54], [*c = 43], then > [*c = 87], [*b = 52] and conclude that two instructions can be sunk, but 4 > PHIs are required (because the target of the store and value are different > in each case). If we looked at [*b = 54], [*b = 52], perhaps we would have > seen that we can sink this with only *two* PHIs required and that might > change our decision. > > This is generally quite difficult and especially shows up with memory > instructions like this (if c and b alias, then permuting them when sinking > would be illegal. So we need alias analysis here). > > *What GVNSink attempts to solve* > > The current implementation of GVNSink was meant to solve 1) and 2), but > NOT 3). 3) wasn't within scope when I was writing GVNSink, but I think 3) > is doable for some cases within the framework of GVNSink. > > *How GVNSink works* > > GVNSink works by computing a value number for each IR instruction, similar > to GVN. However, GVNSink's value numbering works slightly from GVN's. > > To recap, the value number (VN) of an instruction is a function of the > instruction. If VN[I] == VN[J], I and J can be considered equivalent, for > some definition of equivalence. > > In GVN, VN[I] is based upon I and all of I's operands, transitively. > Therefore if VN[I] == VN[J], I and J were computed in the same way and J > can be used instead of I. > > In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*, > transitively. Therefore if VN[I] == VN[J], all of the uses of I are the > same as all of the uses of J. Therefore I and J can be merged. So this: > > [ %a = add %I, 1 ] [ %c = add %J, 1 ] > [ %b = sub %a, 5 ] [ %d = sub %c, 5 ] > [ ret %b ] [ ret %d ] > > Can be transformed into this: > > [] [] > \ / > [ %IJ = phi %I, %J ] > [ %a = add %IJ, 1 ] > [ %b = sub %a, 5 ] > [ ret %b ] > > In GVNSink, we step backwards across all instructions in all incoming > basic blocks (see LockstepReverseIterator). This produces, conceptually, a > sequence of sets of instructions. > > Each of these sets is analyzed in turn (see analyzeInstructionForSinking) > - the value number of each contained instruction is calculated and used to > determine if the entire set is mergeable or not (all values have the same > VN). > > If not all instructions in the set have the same VN, the set is pruned to > just those with the same VN (or we stop if all instructions had different > VNs). If this set passes extra legality checks, a > SinkingInstructionCandidate is created. This contains pertinent data for > the cost heuristic: > > 1) Number of instructions from each incoming block being sunk > 2) Number of blocks to pick instructions from (this can be a subset of > all the incoming blocks!) > 3) How many PHIs would be generated in total > > After all candidates have been examined, they are sorted based on a > heuristic and the best candidate picked and applied. > > The use of value numbering in this algorithm is what allows us to do this > in somewhat-linear time with respect to the number of instructions > inspected. > > Now, to solve 3), I think the component that needs swapping out is > LockstepReverseIterator. This currently iterates backwards in lockstep > across a set of blocks. To handle permutations, I think we could use a > priority queue that feeds the main algorithm instructions from the input > blocks in a slightly sorted order. > > For example, perhaps it might order sequences of stores by their target > operand (assuming alias analysis says it can), ensuring sequences of stores > in different blocks appear in the same order? > > Anyway, I hope this has helped understand a bit more about what GVNSink > does, can do and how to modify it if you want to solve your problem. > > Cheers, > > James > > > > On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote: > >> I have a patch i'm working on to split newgvn into a proper analysis. >> It works (and happy to share it), it's just the interface i'm using is a >> little ugly (haven't decided what the right API is), but let me know what >> GVNSink needs from life and i'll make sure it's there :) >> >> >> On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >>> It's basically ready to commit; the reviewers were fairly happy with it. >>> It needs rebasing on top of NewGVN and any bugs that shakes out fixed, but >>> that's about it. >>> >>> I want to get around to it soon-ish, but I've wanted that for a while! >>> >>> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com> wrote: >>> >>>> Hi James, >>>> >>>> I have an ad-hoc solution in mind to solve this problem. >>>> But if it can be solved under the framework of GVN"Sink", it is even >>>> better. >>>> >>>> any plan on your https://reviews.llvm.org/D24805? >>>> >>>> Thanks >>>> Hongbin >>>> >>>> >>>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <james at jamesmolloy.co.uk> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> Yes, I can see why that would not work. >>>>> >>>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In >>>>> particular it can't reason about memory ordering and aliasing. In >>>>> unswitch1(), it can identify that the stores correlate because the >>>>> correlating stores appear in the same relative source order. In unswitch2() >>>>> they have been permuted, and the algorithm cannot deal with this. This >>>>> requires some kind of value numbering to do efficiently. >>>>> >>>>> The GVNSink pass that I really should get around to updating should >>>>> solve this, eventually! >>>>> >>>>> James >>>>> >>>>> >>>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev < >>>>> llvm-dev at lists.llvm.org> wrote: >>>>> >>>>>> Looks like this do not work: >>>>>> >>>>>> // Type your code here, or load an example. >>>>>> int a[10]; >>>>>> >>>>>> void unswitch2(int i, int x, int y0, int y1) { >>>>>> if (x) { >>>>>> a[i] = y0; >>>>>> a[i + 1] = y1; >>>>>> } else { >>>>>> a[i + 1] = y0; >>>>>> a[i] = y1; >>>>>> } >>>>>> } >>>>>> >>>>>> https://godbolt.org/g/Ldd5qV >>>>>> >>>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <etherzhhb at gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Looks like inst combine do the job >>>>>>> >>>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano <davide at freebsd.org >>>>>>> > wrote: >>>>>>> >>>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev >>>>>>>> <llvm-dev at lists.llvm.org> wrote: >>>>>>>> > Hi, >>>>>>>> > >>>>>>>> > Is there a pass in LLVM that can optimize: >>>>>>>> > >>>>>>>> > if (x) >>>>>>>> > a[i] = y0; >>>>>>>> > else >>>>>>>> > a[i] = y1; >>>>>>>> > >>>>>>>> > to >>>>>>>> > >>>>>>>> > a[i] = x ? y0 : y1? >>>>>>>> > >>>>>>>> > I tried opt (3.9) with -O3 but looks like such an optimization do >>>>>>>> not >>>>>>>> > happened. >>>>>>>> > >>>>>>>> >>>>>>>> The same IR at -O3 for both cases on this example. >>>>>>>> https://godbolt.org/g/Tk2MM8 >>>>>>>> >>>>>>>> -- >>>>>>>> Davide >>>>>>>> >>>>>>>> "There are no solved problems; there are only problems that are more >>>>>>>> or less solved" -- Henri Poincare >>>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> LLVM Developers mailing list >>>>>> llvm-dev at lists.llvm.org >>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>>> >>>>> >>>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >>> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170428/157cc043/attachment-0001.html>
Hi James, Thanks a lot for the explanations, they make sense. On Fri, Apr 28, 2017 at 7:42 AM, James Molloy <james at jamesmolloy.co.uk> wrote:> Hi Danny, > > Thanks for that :) However I've just updated the prototype patch to NewGVN > and it didn't need any API changes - all I rely on is GVNExpression. > > Hongbin, > > I wanted to explain a little about what GVNSink can currently do, what it > was designed for and hopefully how to make it handle your testcase. > > *Background* > > Common code sinking is more difficult to efficently do than one might > expect. Prior to October 2016, LLVM was able to handle simple cases of this > within SimplifyCFG, for example: > > if (a) { > b += 42; > } else { > b += 42; > } > // -> b += 42 > > It could even sink operations where a phi was required: > > if (a) { > b += 42; > } else { > b += 64; > } > // -> b += phi(42, 64) > > However it had a simplistic execution model and so: > 1) Restricted itself to blocks with only TWO incoming arcs > 2) Would only ever produce one PHI and then bail out > > Generally, the factors that make the search space for a good generic > sinking optimization large are: > > 1) The number of incoming edges to a block. > 2) Non-monotonicity of heuristic - for example it might not be > worthwhile sinking 2 instructions but may be worthwhile sinking 3 > instructions. > 3) Permutations of otherwise similar instructions in incoming blocks. > > I'll describe them all in a bit more detail: > > *1) The number of incoming edges to a block* > > There are many constructs that can create a block with more than two input > edges. An obvious example is a switch: > > switch (a) { > case 1: b += 32; break; > case 2: b += 43; break; > default: break; > } > > Another is a chain of if-elses: > > if (a) { > b += 32; > } else if (c) { > b += 43; > } else { > b += 65; > } > > Note that those two examples are subtly different. In the latter example, > every branch performs an addition to 'b'. We can therefore sink all the > adds to the end and perform only one add: > > b += phi(32, 43, 65); > > However in the former case things aren't so simple. The default case never > does any add, so we have to speculate the add instruction in that case. For > adds, this is simple - adds don't have sideeffects and we can speculate by > adding with an identity (0): > > b += phi(32, 43, 0); > > However what if 'b' was a global variable? In this case, each case of the > switch will have a load/add/store sequence: > > switch(a) --------------- > / \ \ > | | | > [t1 = load @b] [t2 = load @b] | > [t1 = add t1, 32] [t2 = add t2, 43] | > [store @b, t1] [store @b, t2] | > \ / / > v v v------------- > [ ] > > In most cases, we cannot be sure that in the default case @b is allowed to > be written to. Speculating a store in this case could be invalid. > Therefore, if we want to sink the store (and thus the add and load) we need > to invent a new block to sink them to. > > switch(a) --------------- > / \ \ > | | | > [t1 = load @b] [t2 = load @b] | > [t1 = add t1, 32] [t2 = add t2, 43] | > \ / | > [ p = phi(t1, t2) ] / > [ store @b, p ] / > | +------------ > v v > [ ] > > Creating this new block allows the switch to be turned into a trivial one > that can be lowered to a bit test or jump table. > > There is an even more complex case that appears quite a bit: > > if (a) { > switch (b) { > case 1: c += 42; break; > case 2: c += 2; break; > default: c += 3; break; > } > } else { > switch (d) { > case 1: e -= 2; break; > case 2: e -= 6; break; > default: e -= 32; break; > } > } > > We canonicalize this CFG to one cleanup block with six incoming arcs. In > this case, it takes some nontrivial analysis to determine what to do, as it > requires partitioning the incoming arcs into sets, performing different > sinking operations on each (in this case two disjoint sets, but there are > cases where the sets are not disjoint and some cost heuristic needs to be > used). > > *2) Non-monotonicity of heuristic* > > What I'm trying to say here is the greedy algorithms work best when the > heuristic they are computing is monotonic. Consider this: > > if (a) { > tmp1 = *b; > tmp2 = *c; > tmp3 = tmp1 + tmp2; > *d = tmp3; > } else { > tmp4 = *b; > tmp5 = *c; > tmp6 = tmp4 + tmp5; > *e = tmp6; > } > > If we proceed sequentially from the bottom of these blocks, trying to > determine if sinking them is worthwhile: > > Step 1: [*e = tmp6], [*d = tmp3] > - This is the same operation, a store, but it requires two PHIs - one > for (e,d) and one for (tmp6, tmp3). > - This is NOT worthwhile. We are constraining register allocation for > only one instruction commonned. > > Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2] > - Again, the operation is the same, but now we need another two PHIs, > plus one PHI from Step 1 (the phi required for (tmp6, tmp3) is now no > longer required). > - Three PHIs required for two instructions sunk. NOT worth it. > > Step 3: [tmp5 = *c], [tmp2 = *c] > - These can be sunk, and have the same arguments so no PHIs required. We > still need two of the PHIs from step 2 (the (tmp5, tmp2) phi now is no > longer required) > - At this point we've sunk three instructions for two PHIs. Maybe it's > worth it? > > Step 4: [tmp4 = *b], [tmp1 = *b] > - Same as step 3, but now the (tmp4, tmp1) PHI from step 2 goes away too > - So now we're sinking 4 instructions for one PHI. > - Now it's definitely worth it! > > Whatever heuristic one chooses, in a case like this it will get worse, > then maybe get better again. So we can't simply bail out when the heuristic > reaches some threshold, because maybe it will improve as we analyze more > instructions! > > *3) Permutations of otherwise similar instructions* > > This is the case you describe. Stores (for example) that look similar but > have been permuted: > > if (a) { > *b = 52; > *c = 43; > } else { > *c = 87; > *b = 54; > } > > If we approach this bottom-up, we will compare [*b = 54], [*c = 43], then > [*c = 87], [*b = 52] and conclude that two instructions can be sunk, but 4 > PHIs are required (because the target of the store and value are different > in each case). If we looked at [*b = 54], [*b = 52], perhaps we would have > seen that we can sink this with only *two* PHIs required and that might > change our decision. > > This is generally quite difficult and especially shows up with memory > instructions like this (if c and b alias, then permuting them when sinking > would be illegal. So we need alias analysis here). > > *What GVNSink attempts to solve* > > The current implementation of GVNSink was meant to solve 1) and 2), but > NOT 3). 3) wasn't within scope when I was writing GVNSink, but I think 3) > is doable for some cases within the framework of GVNSink. > > *How GVNSink works* > > GVNSink works by computing a value number for each IR instruction, similar > to GVN. However, GVNSink's value numbering works slightly from GVN's. > > To recap, the value number (VN) of an instruction is a function of the > instruction. If VN[I] == VN[J], I and J can be considered equivalent, for > some definition of equivalence. > > In GVN, VN[I] is based upon I and all of I's operands, transitively. > Therefore if VN[I] == VN[J], I and J were computed in the same way and J > can be used instead of I. > > In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*, > transitively. Therefore if VN[I] == VN[J], all of the uses of I are the > same as all of the uses of J. Therefore I and J can be merged. So this: > > [ %a = add %I, 1 ] [ %c = add %J, 1 ] > [ %b = sub %a, 5 ] [ %d = sub %c, 5 ] > [ ret %b ] [ ret %d ] > > Can be transformed into this: > > [] [] > \ / > [ %IJ = phi %I, %J ] > [ %a = add %IJ, 1 ] > [ %b = sub %a, 5 ] > [ ret %b ] > > In GVNSink, we step backwards across all instructions in all incoming > basic blocks (see LockstepReverseIterator). This produces, conceptually, a > sequence of sets of instructions. > > Each of these sets is analyzed in turn (see analyzeInstructionForSinking) > - the value number of each contained instruction is calculated and used to > determine if the entire set is mergeable or not (all values have the same > VN). > > If not all instructions in the set have the same VN, the set is pruned to > just those with the same VN (or we stop if all instructions had different > VNs). If this set passes extra legality checks, a > SinkingInstructionCandidate is created. This contains pertinent data for > the cost heuristic: > > 1) Number of instructions from each incoming block being sunk > 2) Number of blocks to pick instructions from (this can be a subset of > all the incoming blocks!) > 3) How many PHIs would be generated in total > > After all candidates have been examined, they are sorted based on a > heuristic and the best candidate picked and applied. > > The use of value numbering in this algorithm is what allows us to do this > in somewhat-linear time with respect to the number of instructions > inspected. > > Now, to solve 3), I think the component that needs swapping out is > LockstepReverseIterator. This currently iterates backwards in lockstep > across a set of blocks. To handle permutations, I think we could use a > priority queue that feeds the main algorithm instructions from the input > blocks in a slightly sorted order. >To solve 3, I think what we need are not only the CSE from GVN, but also some kind of instruction reordering, i.e. "scheduling". I think the idea of introducing a "priority queue" and processing the instruction in a different order than they appear in the BB, is also a kind of "implicit scheduling". maybe what we can do is introduce a "canonical order" by considering data dependency (SSA def-use chain), memory dependencies (Memory SSA), and other dependencies to order the instruction such that: 1. Load instructions are move toward the beginning of a BB, such that GVNHosit and/or CFGSimplify have better chance to hoist and merge the loads 2. Store instructions are move toward the end of a BB, such that GVNSink and InstructionCombine (it is merging the stores today) have a better chance to sink and merge the stores. 3. With the same type of instructions, i.e. load/stores, without dependencies between them, we can sort/schedule them based on the GVN of the pointer operand. In sort, the "canonical ordering" will sort instructions in BBs like: BB0: br BB1 or BB2 BB1: x = ld ptr1 st x ptr2 y = ld ptr3 st y ptr4 br BB3 BB2: x1 = ld ptr3 st x1 ptr4 y1 = ld ptr1 st y1 ptr2 br BB3 BB3: to the following: BB0: br BB1 or BB2 BB1: x = ld ptr1 y = ld ptr3 st x ptr2 st y ptr4 br BB3 BB2: y1 = ld ptr1 x1 = ld ptr3 st y1 ptr2 st x1 ptr4 br BB3 BB3: Now we can use the lock step iterator to merge these instructions.: BB0: x = ld ptr1 y = ld ptr3 br BB1 or BB2 BB1: br BB3 BB2: br BB3 BB3: st y ptr2 st x ptr4 And we can make this "canonical ordering" independent from the other passes. Such that the users can choose to pay extra compile time run the "canonical ordering" pass and get more merge from GVN or not. Of course, this "canonical ordering" may not always work, e.g.: BB0: br BB1 or BB2 BB1: x = ld ptr1 st x ptr2 y = ld ptr3 st y ptr4 br BB3 BB2: x1 = ld ptr0 st x1 ptr5 y1 = ld ptr1 st y1 ptr2 br BB3 BB3: May be reordered to the following: BB0: br BB1 or BB2 BB1: x = ld ptr1 y = ld ptr3 st x ptr2 st y ptr4 br BB3 BB2: x1 = ld ptr0 y1 = ld ptr1 st y1 ptr2 st x1 ptr5 br BB3 BB3: And now the lock step iterator doesn't work. So there may be no universal "canonical order", and we need to consider the context when we reorder/assign priority in the priority queue to solve problem 3. Thanks Hongbin> > For example, perhaps it might order sequences of stores by their target > operand (assuming alias analysis says it can), ensuring sequences of stores > in different blocks appear in the same order? > > Anyway, I hope this has helped understand a bit more about what GVNSink > does, can do and how to modify it if you want to solve your problem. > > Cheers, > > James > > > > On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote: > >> I have a patch i'm working on to split newgvn into a proper analysis. >> It works (and happy to share it), it's just the interface i'm using is a >> little ugly (haven't decided what the right API is), but let me know what >> GVNSink needs from life and i'll make sure it's there :) >> >> >> On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >>> It's basically ready to commit; the reviewers were fairly happy with it. >>> It needs rebasing on top of NewGVN and any bugs that shakes out fixed, but >>> that's about it. >>> >>> I want to get around to it soon-ish, but I've wanted that for a while! >>> >>> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com> wrote: >>> >>>> Hi James, >>>> >>>> I have an ad-hoc solution in mind to solve this problem. >>>> But if it can be solved under the framework of GVN"Sink", it is even >>>> better. >>>> >>>> any plan on your https://reviews.llvm.org/D24805? >>>> >>>> Thanks >>>> Hongbin >>>> >>>> >>>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <james at jamesmolloy.co.uk> >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> Yes, I can see why that would not work. >>>>> >>>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In >>>>> particular it can't reason about memory ordering and aliasing. In >>>>> unswitch1(), it can identify that the stores correlate because the >>>>> correlating stores appear in the same relative source order. In unswitch2() >>>>> they have been permuted, and the algorithm cannot deal with this. This >>>>> requires some kind of value numbering to do efficiently. >>>>> >>>>> The GVNSink pass that I really should get around to updating should >>>>> solve this, eventually! >>>>> >>>>> James >>>>> >>>>> >>>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev < >>>>> llvm-dev at lists.llvm.org> wrote: >>>>> >>>>>> Looks like this do not work: >>>>>> >>>>>> // Type your code here, or load an example. >>>>>> int a[10]; >>>>>> >>>>>> void unswitch2(int i, int x, int y0, int y1) { >>>>>> if (x) { >>>>>> a[i] = y0; >>>>>> a[i + 1] = y1; >>>>>> } else { >>>>>> a[i + 1] = y0; >>>>>> a[i] = y1; >>>>>> } >>>>>> } >>>>>> >>>>>> https://godbolt.org/g/Ldd5qV >>>>>> >>>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <etherzhhb at gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Looks like inst combine do the job >>>>>>> >>>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano <davide at freebsd.org >>>>>>> > wrote: >>>>>>> >>>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev >>>>>>>> <llvm-dev at lists.llvm.org> wrote: >>>>>>>> > Hi, >>>>>>>> > >>>>>>>> > Is there a pass in LLVM that can optimize: >>>>>>>> > >>>>>>>> > if (x) >>>>>>>> > a[i] = y0; >>>>>>>> > else >>>>>>>> > a[i] = y1; >>>>>>>> > >>>>>>>> > to >>>>>>>> > >>>>>>>> > a[i] = x ? y0 : y1? >>>>>>>> > >>>>>>>> > I tried opt (3.9) with -O3 but looks like such an optimization do >>>>>>>> not >>>>>>>> > happened. >>>>>>>> > >>>>>>>> >>>>>>>> The same IR at -O3 for both cases on this example. >>>>>>>> https://godbolt.org/g/Tk2MM8 >>>>>>>> >>>>>>>> -- >>>>>>>> Davide >>>>>>>> >>>>>>>> "There are no solved problems; there are only problems that are more >>>>>>>> or less solved" -- Henri Poincare >>>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> LLVM Developers mailing list >>>>>> llvm-dev at lists.llvm.org >>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>>> >>>>> >>>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >>> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170428/4e49b51c/attachment-0001.html>
This is normally known as global code motion, and there are numerous algorithms that do it On Fri, Apr 28, 2017, 7:48 PM Hongbin Zheng <etherzhhb at gmail.com> wrote:> Hi James, > > Thanks a lot for the explanations, they make sense. > > > On Fri, Apr 28, 2017 at 7:42 AM, James Molloy <james at jamesmolloy.co.uk> > wrote: > >> Hi Danny, >> >> Thanks for that :) However I've just updated the prototype patch to >> NewGVN and it didn't need any API changes - all I rely on is GVNExpression. >> >> Hongbin, >> >> I wanted to explain a little about what GVNSink can currently do, what it >> was designed for and hopefully how to make it handle your testcase. >> >> *Background* >> >> Common code sinking is more difficult to efficently do than one might >> expect. Prior to October 2016, LLVM was able to handle simple cases of this >> within SimplifyCFG, for example: >> >> if (a) { >> b += 42; >> } else { >> b += 42; >> } >> // -> b += 42 >> >> It could even sink operations where a phi was required: >> >> if (a) { >> b += 42; >> } else { >> b += 64; >> } >> // -> b += phi(42, 64) >> >> However it had a simplistic execution model and so: >> 1) Restricted itself to blocks with only TWO incoming arcs >> 2) Would only ever produce one PHI and then bail out >> >> Generally, the factors that make the search space for a good generic >> sinking optimization large are: >> >> 1) The number of incoming edges to a block. >> 2) Non-monotonicity of heuristic - for example it might not be >> worthwhile sinking 2 instructions but may be worthwhile sinking 3 >> instructions. >> 3) Permutations of otherwise similar instructions in incoming blocks. >> >> I'll describe them all in a bit more detail: >> >> *1) The number of incoming edges to a block* >> >> There are many constructs that can create a block with more than two >> input edges. An obvious example is a switch: >> >> switch (a) { >> case 1: b += 32; break; >> case 2: b += 43; break; >> default: break; >> } >> >> Another is a chain of if-elses: >> >> if (a) { >> b += 32; >> } else if (c) { >> b += 43; >> } else { >> b += 65; >> } >> >> Note that those two examples are subtly different. In the latter example, >> every branch performs an addition to 'b'. We can therefore sink all the >> adds to the end and perform only one add: >> >> b += phi(32, 43, 65); >> >> However in the former case things aren't so simple. The default case >> never does any add, so we have to speculate the add instruction in that >> case. For adds, this is simple - adds don't have sideeffects and we can >> speculate by adding with an identity (0): >> >> b += phi(32, 43, 0); >> >> However what if 'b' was a global variable? In this case, each case of the >> switch will have a load/add/store sequence: >> >> switch(a) --------------- >> / \ \ >> | | | >> [t1 = load @b] [t2 = load @b] | >> [t1 = add t1, 32] [t2 = add t2, 43] | >> [store @b, t1] [store @b, t2] | >> \ / / >> v v v------------- >> [ ] >> >> In most cases, we cannot be sure that in the default case @b is allowed >> to be written to. Speculating a store in this case could be invalid. >> Therefore, if we want to sink the store (and thus the add and load) we need >> to invent a new block to sink them to. >> >> switch(a) --------------- >> / \ \ >> | | | >> [t1 = load @b] [t2 = load @b] | >> [t1 = add t1, 32] [t2 = add t2, 43] | >> \ / | >> [ p = phi(t1, t2) ] / >> [ store @b, p ] / >> | +------------ >> v v >> [ ] >> >> Creating this new block allows the switch to be turned into a trivial one >> that can be lowered to a bit test or jump table. >> >> There is an even more complex case that appears quite a bit: >> >> if (a) { >> switch (b) { >> case 1: c += 42; break; >> case 2: c += 2; break; >> default: c += 3; break; >> } >> } else { >> switch (d) { >> case 1: e -= 2; break; >> case 2: e -= 6; break; >> default: e -= 32; break; >> } >> } >> >> We canonicalize this CFG to one cleanup block with six incoming arcs. In >> this case, it takes some nontrivial analysis to determine what to do, as it >> requires partitioning the incoming arcs into sets, performing different >> sinking operations on each (in this case two disjoint sets, but there are >> cases where the sets are not disjoint and some cost heuristic needs to be >> used). >> >> *2) Non-monotonicity of heuristic* >> >> What I'm trying to say here is the greedy algorithms work best when the >> heuristic they are computing is monotonic. Consider this: >> >> if (a) { >> tmp1 = *b; >> tmp2 = *c; >> tmp3 = tmp1 + tmp2; >> *d = tmp3; >> } else { >> tmp4 = *b; >> tmp5 = *c; >> tmp6 = tmp4 + tmp5; >> *e = tmp6; >> } >> >> If we proceed sequentially from the bottom of these blocks, trying to >> determine if sinking them is worthwhile: >> >> Step 1: [*e = tmp6], [*d = tmp3] >> - This is the same operation, a store, but it requires two PHIs - one >> for (e,d) and one for (tmp6, tmp3). >> - This is NOT worthwhile. We are constraining register allocation for >> only one instruction commonned. >> >> Step 2: [tmp6 = tmp4 + tmp5], [tmp3 = tmp1 + tmp2] >> - Again, the operation is the same, but now we need another two PHIs, >> plus one PHI from Step 1 (the phi required for (tmp6, tmp3) is now no >> longer required). >> - Three PHIs required for two instructions sunk. NOT worth it. >> >> Step 3: [tmp5 = *c], [tmp2 = *c] >> - These can be sunk, and have the same arguments so no PHIs required. >> We still need two of the PHIs from step 2 (the (tmp5, tmp2) phi now is no >> longer required) >> - At this point we've sunk three instructions for two PHIs. Maybe it's >> worth it? >> >> Step 4: [tmp4 = *b], [tmp1 = *b] >> - Same as step 3, but now the (tmp4, tmp1) PHI from step 2 goes away too >> - So now we're sinking 4 instructions for one PHI. >> - Now it's definitely worth it! >> >> Whatever heuristic one chooses, in a case like this it will get worse, >> then maybe get better again. So we can't simply bail out when the heuristic >> reaches some threshold, because maybe it will improve as we analyze more >> instructions! >> >> *3) Permutations of otherwise similar instructions* >> >> This is the case you describe. Stores (for example) that look similar but >> have been permuted: >> >> if (a) { >> *b = 52; >> *c = 43; >> } else { >> *c = 87; >> *b = 54; >> } >> >> If we approach this bottom-up, we will compare [*b = 54], [*c = 43], then >> [*c = 87], [*b = 52] and conclude that two instructions can be sunk, but 4 >> PHIs are required (because the target of the store and value are different >> in each case). If we looked at [*b = 54], [*b = 52], perhaps we would have >> seen that we can sink this with only *two* PHIs required and that might >> change our decision. >> >> This is generally quite difficult and especially shows up with memory >> instructions like this (if c and b alias, then permuting them when sinking >> would be illegal. So we need alias analysis here). >> >> *What GVNSink attempts to solve* >> >> The current implementation of GVNSink was meant to solve 1) and 2), but >> NOT 3). 3) wasn't within scope when I was writing GVNSink, but I think 3) >> is doable for some cases within the framework of GVNSink. >> >> *How GVNSink works* >> >> GVNSink works by computing a value number for each IR instruction, >> similar to GVN. However, GVNSink's value numbering works slightly from >> GVN's. >> >> To recap, the value number (VN) of an instruction is a function of the >> instruction. If VN[I] == VN[J], I and J can be considered equivalent, for >> some definition of equivalence. >> >> In GVN, VN[I] is based upon I and all of I's operands, transitively. >> Therefore if VN[I] == VN[J], I and J were computed in the same way and J >> can be used instead of I. >> >> In GVNSink, we flip this on its head. VN[I] is based upon I's *uses*, >> transitively. Therefore if VN[I] == VN[J], all of the uses of I are the >> same as all of the uses of J. Therefore I and J can be merged. So this: >> >> [ %a = add %I, 1 ] [ %c = add %J, 1 ] >> [ %b = sub %a, 5 ] [ %d = sub %c, 5 ] >> [ ret %b ] [ ret %d ] >> >> Can be transformed into this: >> >> [] [] >> \ / >> [ %IJ = phi %I, %J ] >> [ %a = add %IJ, 1 ] >> [ %b = sub %a, 5 ] >> [ ret %b ] >> >> In GVNSink, we step backwards across all instructions in all incoming >> basic blocks (see LockstepReverseIterator). This produces, conceptually, a >> sequence of sets of instructions. >> >> Each of these sets is analyzed in turn (see analyzeInstructionForSinking) >> - the value number of each contained instruction is calculated and used to >> determine if the entire set is mergeable or not (all values have the same >> VN). >> >> If not all instructions in the set have the same VN, the set is pruned to >> just those with the same VN (or we stop if all instructions had different >> VNs). If this set passes extra legality checks, a >> SinkingInstructionCandidate is created. This contains pertinent data for >> the cost heuristic: >> >> 1) Number of instructions from each incoming block being sunk >> 2) Number of blocks to pick instructions from (this can be a subset of >> all the incoming blocks!) >> 3) How many PHIs would be generated in total >> >> After all candidates have been examined, they are sorted based on a >> heuristic and the best candidate picked and applied. >> >> The use of value numbering in this algorithm is what allows us to do this >> in somewhat-linear time with respect to the number of instructions >> inspected. >> >> Now, to solve 3), I think the component that needs swapping out is >> LockstepReverseIterator. This currently iterates backwards in lockstep >> across a set of blocks. To handle permutations, I think we could use a >> priority queue that feeds the main algorithm instructions from the input >> blocks in a slightly sorted order. >> > > To solve 3, I think what we need are not only the CSE from GVN, but also > some kind of instruction reordering, i.e. "scheduling". I think the idea of > introducing a "priority queue" and processing the instruction in a > different order than they appear in the BB, is also a kind of "implicit > scheduling". > > maybe what we can do is introduce a "canonical order" by considering data > dependency (SSA def-use chain), memory dependencies (Memory SSA), and other > dependencies to order the instruction such that: > 1. Load instructions are move toward the beginning of a BB, such that > GVNHosit and/or CFGSimplify have better chance to hoist and merge the loads > 2. Store instructions are move toward the end of a BB, such that GVNSink > and InstructionCombine (it is merging the stores today) have a better > chance to sink and merge the stores. > 3. With the same type of instructions, i.e. load/stores, without > dependencies between them, we can sort/schedule them based on the GVN of > the pointer operand. > > In sort, the "canonical ordering" will sort instructions in BBs like: > > BB0: > br BB1 or BB2 > > BB1: > x = ld ptr1 > st x ptr2 > y = ld ptr3 > st y ptr4 > br BB3 > > BB2: > x1 = ld ptr3 > st x1 ptr4 > y1 = ld ptr1 > st y1 ptr2 > br BB3 > > BB3: > > to the following: > BB0: > br BB1 or BB2 > > BB1: > x = ld ptr1 > y = ld ptr3 > st x ptr2 > st y ptr4 > br BB3 > > BB2: > y1 = ld ptr1 > x1 = ld ptr3 > st y1 ptr2 > st x1 ptr4 > br BB3 > > BB3: > > Now we can use the lock step iterator to merge these instructions.: > BB0: > x = ld ptr1 > y = ld ptr3 > br BB1 or BB2 > > BB1: > br BB3 > > BB2: > br BB3 > > BB3: > st y ptr2 > st x ptr4 > > And we can make this "canonical ordering" independent from the other > passes. > > Such that the users can choose to pay extra compile time run the > "canonical ordering" pass and get more merge from GVN or not. > > Of course, this "canonical ordering" may not always work, e.g.: > > BB0: > br BB1 or BB2 > > BB1: > x = ld ptr1 > st x ptr2 > y = ld ptr3 > st y ptr4 > br BB3 > > BB2: > x1 = ld ptr0 > st x1 ptr5 > y1 = ld ptr1 > st y1 ptr2 > br BB3 > > BB3: > > May be reordered to the following: > > BB0: > br BB1 or BB2 > > BB1: > x = ld ptr1 > y = ld ptr3 > st x ptr2 > st y ptr4 > br BB3 > > BB2: > x1 = ld ptr0 > y1 = ld ptr1 > st y1 ptr2 > st x1 ptr5 > br BB3 > > BB3: > > And now the lock step iterator doesn't work. > > So there may be no universal "canonical order", and we need to consider > the context when we reorder/assign priority in the priority queue to solve > problem 3. > > Thanks > Hongbin > > >> >> For example, perhaps it might order sequences of stores by their target >> operand (assuming alias analysis says it can), ensuring sequences of stores >> in different blocks appear in the same order? >> >> Anyway, I hope this has helped understand a bit more about what GVNSink >> does, can do and how to modify it if you want to solve your problem. >> >> Cheers, >> >> James >> >> >> >> On Wed, 26 Apr 2017 at 18:16 Daniel Berlin <dberlin at dberlin.org> wrote: >> >>> I have a patch i'm working on to split newgvn into a proper analysis. >>> It works (and happy to share it), it's just the interface i'm using is a >>> little ugly (haven't decided what the right API is), but let me know what >>> GVNSink needs from life and i'll make sure it's there :) >>> >>> >>> On Wed, Apr 26, 2017 at 10:09 AM, James Molloy via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>>> It's basically ready to commit; the reviewers were fairly happy with >>>> it. It needs rebasing on top of NewGVN and any bugs that shakes out fixed, >>>> but that's about it. >>>> >>>> I want to get around to it soon-ish, but I've wanted that for a while! >>>> >>>> On Wed, 26 Apr 2017 at 16:50, Hongbin Zheng <etherzhhb at gmail.com> >>>> wrote: >>>> >>>>> Hi James, >>>>> >>>>> I have an ad-hoc solution in mind to solve this problem. >>>>> But if it can be solved under the framework of GVN"Sink", it is even >>>>> better. >>>>> >>>>> any plan on your https://reviews.llvm.org/D24805? >>>>> >>>>> Thanks >>>>> Hongbin >>>>> >>>>> >>>>> On Wed, Apr 26, 2017 at 2:13 AM, James Molloy <james at jamesmolloy.co.uk >>>>> > wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Yes, I can see why that would not work. >>>>>> >>>>>> The sinking algorithm in SimplifyCFG isn't particularly clever. In >>>>>> particular it can't reason about memory ordering and aliasing. In >>>>>> unswitch1(), it can identify that the stores correlate because the >>>>>> correlating stores appear in the same relative source order. In unswitch2() >>>>>> they have been permuted, and the algorithm cannot deal with this. This >>>>>> requires some kind of value numbering to do efficiently. >>>>>> >>>>>> The GVNSink pass that I really should get around to updating should >>>>>> solve this, eventually! >>>>>> >>>>>> James >>>>>> >>>>>> >>>>>> On Wed, 26 Apr 2017 at 08:19 Hongbin Zheng via llvm-dev < >>>>>> llvm-dev at lists.llvm.org> wrote: >>>>>> >>>>>>> Looks like this do not work: >>>>>>> >>>>>>> // Type your code here, or load an example. >>>>>>> int a[10]; >>>>>>> >>>>>>> void unswitch2(int i, int x, int y0, int y1) { >>>>>>> if (x) { >>>>>>> a[i] = y0; >>>>>>> a[i + 1] = y1; >>>>>>> } else { >>>>>>> a[i + 1] = y0; >>>>>>> a[i] = y1; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> https://godbolt.org/g/Ldd5qV >>>>>>> >>>>>>> On Tue, Apr 25, 2017 at 10:22 PM, Hongbin Zheng <etherzhhb at gmail.com >>>>>>> > wrote: >>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Looks like inst combine do the job >>>>>>>> >>>>>>>> On Tue, Apr 25, 2017 at 9:36 PM, Davide Italiano < >>>>>>>> davide at freebsd.org> wrote: >>>>>>>> >>>>>>>>> On Tue, Apr 25, 2017 at 9:24 PM, Hongbin Zheng via llvm-dev >>>>>>>>> <llvm-dev at lists.llvm.org> wrote: >>>>>>>>> > Hi, >>>>>>>>> > >>>>>>>>> > Is there a pass in LLVM that can optimize: >>>>>>>>> > >>>>>>>>> > if (x) >>>>>>>>> > a[i] = y0; >>>>>>>>> > else >>>>>>>>> > a[i] = y1; >>>>>>>>> > >>>>>>>>> > to >>>>>>>>> > >>>>>>>>> > a[i] = x ? y0 : y1? >>>>>>>>> > >>>>>>>>> > I tried opt (3.9) with -O3 but looks like such an optimization >>>>>>>>> do not >>>>>>>>> > happened. >>>>>>>>> > >>>>>>>>> >>>>>>>>> The same IR at -O3 for both cases on this example. >>>>>>>>> https://godbolt.org/g/Tk2MM8 >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Davide >>>>>>>>> >>>>>>>>> "There are no solved problems; there are only problems that are >>>>>>>>> more >>>>>>>>> or less solved" -- Henri Poincare >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> LLVM Developers mailing list >>>>>>> llvm-dev at lists.llvm.org >>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>>>> >>>>>> >>>>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> llvm-dev at lists.llvm.org >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>> >>>> >>>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170429/5a2ebd7e/attachment.html>