Wei Mi via llvm-dev
2017-Aug-02 23:00 UTC
[llvm-dev] [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass
On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <mbraun at apple.com> wrote:> So to write this in a more condensed form, you have: > > %v0 = ... > %v1 = and %v0, 255 > %v2 = and %v1, 31 > use %v1 > use %v2 > > and transform this to > %v0 = ... > %v1 = and %v0, 255 > %v2 = and %v0, 31 > ... > > This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end. > > In my experience it is best to restrict those sort of transformation to situations where you can prove there aren't multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users > will match the same pattern). > > - MatthiasThanks Matthias, yes, the transformation in instcombine is of non-root node pattern. What I am worried about to add the restriction that %v1 has to have only one use is that such change may regress some benchmarks. For the same case, after we add the limitation, we will save a move in BB1, but may add another move in BB2, because %v2 and %v1 are both alive after %v2 = and %v1, 31. Which choice is better depends on the hotness of BB1 and BB2, so the result of such transformation in instcombine is like flipping a coin. Currently we don't have a pass to use BFI to do such transformation in a meaningful way. Maybe we need such a pass, and with that pass, we can disable the transformations in instcombine without regressing certain benchmarks. BB1:> %v0 = ... > %v1 = and %v0, 255BB2:> %v2 = and %v1, 31 > use %v1 > use %v2Wei.> >> On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> Hi, >> >> We recently found a testcase showing that simplifications in >> instcombine sometimes change the instruction without reducing the >> instruction cost, but causing problems in TwoAddressInstruction pass. >> And it looks like the problem is generic and other simplification may >> have the same issue. I want to get some ideas about what is the best >> way to fix such kind of problem. >> >> The testcase: >> ----------------------------- a.ll ---------------------------------- >> @a = global i64 0, align 8 >> @b = global i32 0, align 8 >> >> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr { >> entry: >> %t1 = load i16, i16* %arrayidx, align 2 >> %conv.i = zext i16 %t1 to i32 >> %and.i = and i32 %conv.i, 255 >> %and7.i = and i32 %conv.i, 1792 >> %t3 = zext i32 %and7.i to i64 >> %t4 = load i64, i64* @a, align 8 >> %add.i = add i64 %t4, %t3 >> %cmp1 = icmp eq i64 %add.i, 1 >> %cmp2 = icmp ult i32 %and.i, 101 >> %bool = and i1 %cmp1, %cmp2 >> br i1 %bool, label %if.then, label %if.else, !prof !0 >> >> if.then: ; preds = %entry >> %r1 = trunc i64 %add.i to i32 >> br label %return >> >> if.else: ; preds = %entry >> %r2 = and i32 %and.i, 31 >> store i32 %and.i, i32* @b, align 8 >> br label %return >> >> return: ; preds = %if.else, %if.then >> %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] >> ret i32 %ret >> } >> >> !0 = !{!"branch_weights", i32 2000, i32 1} >> ------------------------------------------------------------------------- >> >> For the snippet: >> %and.i = and i32 %conv.i, 255 >> ... >> %r2 = and i32 %and.i, 31 >> >> Look at %r2 in block %if.else, it is computed by two "and" operations. >> Both InstCombiner::SimplifyAssociativeOrCommutative and >> InstCombiner::SimplifyDemandedUseBits can replace %r2 = and i32 >> %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has many >> other uses and it won't become dead after the simplifications, those >> simplifications won't simplify instruction cost, but they will change >> the live range of variables: >> Before instcombine, %conv.i is last used at %and7.i = and i32 %conv.i, >> 1792, so on architecture like x86, llvm won't generate extra mov in >> TwoAddressInstruction pass for it. After instcombine, %conv.i's live >> range is extended, both %and7.i and %conv.i are alive after the "and" >> instruction, so TwoAddressInstruction pass will generate an extra mov >> for %and7.i = and i32 %conv.i. >> >> *** After instcombine: *** >> ~/workarea/llvm-r309240/rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat b.ll >> @a = global i64 0, align 8 >> @b = global i32 0, align 8 >> >> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr { >> entry: >> %t1 = load i16, i16* %arrayidx, align 2 >> %conv.i = zext i16 %t1 to i32 >> %and.i = and i32 %conv.i, 255 >> %and7.i = and i32 %conv.i, 1792 ============> %conv.i >> is alive since its live range is extended. >> %t3 = zext i32 %and7.i to i64 >> We need a extra mov for this two address instruction on x86 >> %t4 = load i64, i64* @a, align 8 >> %add.i = add i64 %t4, %t3 >> %cmp1 = icmp eq i64 %add.i, 1 >> %cmp2 = icmp ult i32 %and.i, 101 >> %bool = and i1 %cmp1, %cmp2 >> br i1 %bool, label %if.then, label %if.else >> >> if.then: ; preds = %entry >> %r1 = trunc i64 %add.i to i32 >> br label %return >> >> if.else: ; preds = %entry >> %r2 = and i32 %conv.i, 31 >> ===============> %and.i is replaced by %conv.i. >> store i32 %and.i, i32* @b, align 8 >> br label %return >> >> return: ; preds = %if.else, %if.then >> %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] >> ret i32 %ret >> } >> >> *** asm code without instcombine: *** >> ~/workarea/llvm-r309240/rbuild1/bin/llc < a.ll >> # BB#0: # %entry >> movzwl (%rdi), %ecx >> movzbl %cl, %eax >> andl $1792, %ecx # imm = 0x700 >> addq a(%rip), %rcx >> cmpq $1, %rcx >> jne .LBB0_3 >> >> *** asm code with instcombine: *** >> ~/workarea/llvm-r309240/rbuild1/bin/llc < b.ll >> # BB#0: # %entry >> movzwl (%rdi), %eax >> movzbl %al, %ecx >> movl %eax, %edx =====> One extra move >> instruction in the hot path >> andl $1792, %edx # imm = 0x700 >> addq a(%rip), %rdx >> cmpq $1, %rdx >> jne .LBB0_3 >> >> I am not sure if it is possible to tackle the problem in >> simplification or instcombine directly, so I explored the possibility >> to do the reversal transformation in CodeGen, but I found that not >> quite easy either because of two things: >> * Seems to me Peephole pass or TwoAddrInstruction pass are the proper >> places to do the reversal transformation. We need Live Interval >> information to justify the transformation, however currently live >> Interval information is not ready in both places. >> >> * The pattern matching looks quite ad hoc on machine IR. I need to >> figure out we can replace %vreg0 in "AND32ri8 %vreg0<tied0>, 31" with >> %vreg1 by looking at the copy chain starting from %vreg9<def> = COPY >> %vreg0 to %vreg1<def> = MOVZX32rr8 %vreg9 first, and at the same time, >> after replacing vreg0 with %vreg1, vreg0 becomes dead at the other >> AND32ri and we can save an instruction there. In addition, replace >> %vreg0 with %vreg1 may increase an extra move before "AND32ri8 >> %vreg0<tied0>, 31", so we still need to check "AND32ri8 %vreg0<tied0>, >> 31" is colder than "AND32ri %vreg0<tied0>, 1792". >> All these efforts are just handling a specific pattern, and if the >> pattern changes a little bit, they won't work. >> BB_i: >> %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0 >> %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9 >> %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792, >> %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0 >> ... >> BB_j: >> %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31, >> %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0 >> MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1; >> mem:ST4[@b](align=8) GR32:%vreg1 >> >> Any suggestions are welcomed. >> >> Thanks, >> Wei. >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >
Matthias Braun via llvm-dev
2017-Aug-02 23:07 UTC
[llvm-dev] [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass
> On Aug 2, 2017, at 4:00 PM, Wei Mi <wmi at google.com> wrote: > > On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <mbraun at apple.com <mailto:mbraun at apple.com>> wrote: >> So to write this in a more condensed form, you have: >> >> %v0 = ... >> %v1 = and %v0, 255 >> %v2 = and %v1, 31 >> use %v1 >> use %v2 >> >> and transform this to >> %v0 = ... >> %v1 = and %v0, 255 >> %v2 = and %v0, 31 >> ... >> >> This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end. >> >> In my experience it is best to restrict those sort of transformation to situations where you can prove there aren't multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users >> will match the same pattern). >> >> - Matthias > > Thanks Matthias, yes, the transformation in instcombine is of non-root > node pattern. What I am worried about to add the restriction that %v1 > has to have only one use is that such change may regress some > benchmarks. For the same case, after we add the limitation, we will > save a move in BB1, but may add another move in BB2, because %v2 and > %v1 are both alive after %v2 = and %v1, 31. Which choice is better > depends on the hotness of BB1 and BB2, so the result of such > transformation in instcombine is like flipping a coin. Currently we > don't have a pass to use BFI to do such transformation in a meaningful > way. Maybe we need such a pass, and with that pass, we can disable the > transformations in instcombine without regressing certain benchmarks.Yes this strategy will regress some situations and also goes against the LLVM spirit of perform as much normalization as possible in the middle end. On the other hand eagerly transform all such opportunities has more potential to regress things than to improve things in my experience, especially since we currently have no way of reverting this transformation. So at the very least changing the strategy would be a very interesting experiment and at least in another compiler I worked on we saw a huge number of improvements and nearly no regressions when switching to this strategy of not transforming in the presence of multiple users on a non-root node. - Matthias> > BB1: >> %v0 = ... >> %v1 = and %v0, 255 > > BB2: >> %v2 = and %v1, 31 >> use %v1 >> use %v2 > > Wei. > >> >>> On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <llvm-dev at lists.llvm.org> wrote: >>> >>> Hi, >>> >>> We recently found a testcase showing that simplifications in >>> instcombine sometimes change the instruction without reducing the >>> instruction cost, but causing problems in TwoAddressInstruction pass. >>> And it looks like the problem is generic and other simplification may >>> have the same issue. I want to get some ideas about what is the best >>> way to fix such kind of problem. >>> >>> The testcase: >>> ----------------------------- a.ll ---------------------------------- >>> @a = global i64 0, align 8 >>> @b = global i32 0, align 8 >>> >>> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr { >>> entry: >>> %t1 = load i16, i16* %arrayidx, align 2 >>> %conv.i = zext i16 %t1 to i32 >>> %and.i = and i32 %conv.i, 255 >>> %and7.i = and i32 %conv.i, 1792 >>> %t3 = zext i32 %and7.i to i64 >>> %t4 = load i64, i64* @a, align 8 >>> %add.i = add i64 %t4, %t3 >>> %cmp1 = icmp eq i64 %add.i, 1 >>> %cmp2 = icmp ult i32 %and.i, 101 >>> %bool = and i1 %cmp1, %cmp2 >>> br i1 %bool, label %if.then, label %if.else, !prof !0 >>> >>> if.then: ; preds = %entry >>> %r1 = trunc i64 %add.i to i32 >>> br label %return >>> >>> if.else: ; preds = %entry >>> %r2 = and i32 %and.i, 31 >>> store i32 %and.i, i32* @b, align 8 >>> br label %return >>> >>> return: ; preds = %if.else, %if.then >>> %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] >>> ret i32 %ret >>> } >>> >>> !0 = !{!"branch_weights", i32 2000, i32 1} >>> ------------------------------------------------------------------------- >>> >>> For the snippet: >>> %and.i = and i32 %conv.i, 255 >>> ... >>> %r2 = and i32 %and.i, 31 >>> >>> Look at %r2 in block %if.else, it is computed by two "and" operations. >>> Both InstCombiner::SimplifyAssociativeOrCommutative and >>> InstCombiner::SimplifyDemandedUseBits can replace %r2 = and i32 >>> %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has many >>> other uses and it won't become dead after the simplifications, those >>> simplifications won't simplify instruction cost, but they will change >>> the live range of variables: >>> Before instcombine, %conv.i is last used at %and7.i = and i32 %conv.i, >>> 1792, so on architecture like x86, llvm won't generate extra mov in >>> TwoAddressInstruction pass for it. After instcombine, %conv.i's live >>> range is extended, both %and7.i and %conv.i are alive after the "and" >>> instruction, so TwoAddressInstruction pass will generate an extra mov >>> for %and7.i = and i32 %conv.i. >>> >>> *** After instcombine: *** >>> ~/workarea/llvm-r309240/rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat b.ll >>> @a = global i64 0, align 8 >>> @b = global i32 0, align 8 >>> >>> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr { >>> entry: >>> %t1 = load i16, i16* %arrayidx, align 2 >>> %conv.i = zext i16 %t1 to i32 >>> %and.i = and i32 %conv.i, 255 >>> %and7.i = and i32 %conv.i, 1792 ============> %conv.i >>> is alive since its live range is extended. >>> %t3 = zext i32 %and7.i to i64 >>> We need a extra mov for this two address instruction on x86 >>> %t4 = load i64, i64* @a, align 8 >>> %add.i = add i64 %t4, %t3 >>> %cmp1 = icmp eq i64 %add.i, 1 >>> %cmp2 = icmp ult i32 %and.i, 101 >>> %bool = and i1 %cmp1, %cmp2 >>> br i1 %bool, label %if.then, label %if.else >>> >>> if.then: ; preds = %entry >>> %r1 = trunc i64 %add.i to i32 >>> br label %return >>> >>> if.else: ; preds = %entry >>> %r2 = and i32 %conv.i, 31 >>> ===============> %and.i is replaced by %conv.i. >>> store i32 %and.i, i32* @b, align 8 >>> br label %return >>> >>> return: ; preds = %if.else, %if.then >>> %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] >>> ret i32 %ret >>> } >>> >>> *** asm code without instcombine: *** >>> ~/workarea/llvm-r309240/rbuild1/bin/llc < a.ll >>> # BB#0: # %entry >>> movzwl (%rdi), %ecx >>> movzbl %cl, %eax >>> andl $1792, %ecx # imm = 0x700 >>> addq a(%rip), %rcx >>> cmpq $1, %rcx >>> jne .LBB0_3 >>> >>> *** asm code with instcombine: *** >>> ~/workarea/llvm-r309240/rbuild1/bin/llc < b.ll >>> # BB#0: # %entry >>> movzwl (%rdi), %eax >>> movzbl %al, %ecx >>> movl %eax, %edx =====> One extra move >>> instruction in the hot path >>> andl $1792, %edx # imm = 0x700 >>> addq a(%rip), %rdx >>> cmpq $1, %rdx >>> jne .LBB0_3 >>> >>> I am not sure if it is possible to tackle the problem in >>> simplification or instcombine directly, so I explored the possibility >>> to do the reversal transformation in CodeGen, but I found that not >>> quite easy either because of two things: >>> * Seems to me Peephole pass or TwoAddrInstruction pass are the proper >>> places to do the reversal transformation. We need Live Interval >>> information to justify the transformation, however currently live >>> Interval information is not ready in both places. >>> >>> * The pattern matching looks quite ad hoc on machine IR. I need to >>> figure out we can replace %vreg0 in "AND32ri8 %vreg0<tied0>, 31" with >>> %vreg1 by looking at the copy chain starting from %vreg9<def> = COPY >>> %vreg0 to %vreg1<def> = MOVZX32rr8 %vreg9 first, and at the same time, >>> after replacing vreg0 with %vreg1, vreg0 becomes dead at the other >>> AND32ri and we can save an instruction there. In addition, replace >>> %vreg0 with %vreg1 may increase an extra move before "AND32ri8 >>> %vreg0<tied0>, 31", so we still need to check "AND32ri8 %vreg0<tied0>, >>> 31" is colder than "AND32ri %vreg0<tied0>, 1792". >>> All these efforts are just handling a specific pattern, and if the >>> pattern changes a little bit, they won't work. >>> BB_i: >>> %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0 >>> %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9 >>> %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792, >>> %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0 >>> ... >>> BB_j: >>> %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31, >>> %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0 >>> MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1; >>> mem:ST4[@b](align=8) GR32:%vreg1 >>> >>> Any suggestions are welcomed. >>> >>> Thanks, >>> Wei. >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170802/a430275a/attachment.html>
Chandler Carruth via llvm-dev
2017-Aug-02 23:29 UTC
[llvm-dev] [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass
On Wed, Aug 2, 2017 at 4:07 PM Matthias Braun via llvm-dev < llvm-dev at lists.llvm.org> wrote:> On Aug 2, 2017, at 4:00 PM, Wei Mi <wmi at google.com> wrote: > > On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <mbraun at apple.com> wrote: > > So to write this in a more condensed form, you have: > > %v0 = ... > %v1 = and %v0, 255 > %v2 = and %v1, 31 > use %v1 > use %v2 > > and transform this to > %v0 = ... > %v1 = and %v0, 255 > %v2 = and %v0, 31 > ... > > This is a classical problem with instruction combining. When you replace a > non-root node of a pattern (in this case %v2 would be the root pattern we > match, but we actually replace %v1 with %v0 as part of the replacement) and > that non-root node has multiple users then it becomes very hard to predict > whether the transformations pays off in the end. > > In my experience it is best to restrict those sort of transformation to > situations where you can prove there aren't multiple users on the > intermediate/non-root node that we replace. (Or if you want to get fancier > that all users > will match the same pattern). > > - Matthias > > > Thanks Matthias, yes, the transformation in instcombine is of non-root > node pattern. What I am worried about to add the restriction that %v1 > has to have only one use is that such change may regress some > benchmarks. For the same case, after we add the limitation, we will > save a move in BB1, but may add another move in BB2, because %v2 and > %v1 are both alive after %v2 = and %v1, 31. Which choice is better > depends on the hotness of BB1 and BB2, so the result of such > transformation in instcombine is like flipping a coin. Currently we > don't have a pass to use BFI to do such transformation in a meaningful > way. Maybe we need such a pass, and with that pass, we can disable the > transformations in instcombine without regressing certain benchmarks. > > Yes this strategy will regress some situations and also goes against the > LLVM spirit of perform as much normalization as possible in the middle end. > > On the other hand eagerly transform all such opportunities has more > potential to regress things than to improve things in my experience, > especially since we currently have no way of reverting this transformation. >There are many cases where we have no way of reverting this transformation, but it seems like this is not one of those cases. It seems like we could, potentially very late (not sure how late, maybe in remat itself? maybe earlier...), recognize that we can reduce the live value set by rewriting %v2 in terms of %v1 (in your reduced example). Even something as simple as demanded-bits should be able to tell that %v0 and %v1 are interchangeable. But maybe there is something about this live-set-reduction transform that is hard? I've not thought about it much.> So at the very least changing the strategy would be a very interesting > experiment and at least in another compiler I worked on we saw a huge > number of improvements and nearly no regressions when switching to this > strategy of not transforming in the presence of multiple users on a > non-root node. >None of the above would invalidate the utility of this experiment of course. I would guess we want to restrict the experiment to combines where the rewrite doesn't substantially improve our ability to reason about the computed value (IE, something we can see through transparently like an and seems like a no-brainer). My guess at why this would hurt LLVM, if it does, is due to the recursion depth limit and the degree to which we rely on lazy "up" walks with that depth limit rather than forward-propagation.> > - Matthias > > > BB1: > > %v0 = ... > %v1 = and %v0, 255 > > > BB2: > > %v2 = and %v1, 31 > use %v1 > use %v2 > > > Wei. > > > On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <llvm-dev at lists.llvm.org> > wrote: > > Hi, > > We recently found a testcase showing that simplifications in > instcombine sometimes change the instruction without reducing the > instruction cost, but causing problems in TwoAddressInstruction pass. > And it looks like the problem is generic and other simplification may > have the same issue. I want to get some ideas about what is the best > way to fix such kind of problem. > > The testcase: > ----------------------------- a.ll ---------------------------------- > @a = global i64 0, align 8 > @b = global i32 0, align 8 > > define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) > local_unnamed_addr { > entry: > %t1 = load i16, i16* %arrayidx, align 2 > %conv.i = zext i16 %t1 to i32 > %and.i = and i32 %conv.i, 255 > %and7.i = and i32 %conv.i, 1792 > %t3 = zext i32 %and7.i to i64 > %t4 = load i64, i64* @a, align 8 > %add.i = add i64 %t4, %t3 > %cmp1 = icmp eq i64 %add.i, 1 > %cmp2 = icmp ult i32 %and.i, 101 > %bool = and i1 %cmp1, %cmp2 > br i1 %bool, label %if.then, label %if.else, !prof !0 > > if.then: ; preds = %entry > %r1 = trunc i64 %add.i to i32 > br label %return > > if.else: ; preds = %entry > %r2 = and i32 %and.i, 31 > store i32 %and.i, i32* @b, align 8 > br label %return > > return: ; preds = %if.else, > %if.then > %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] > ret i32 %ret > } > > !0 = !{!"branch_weights", i32 2000, i32 1} > ------------------------------------------------------------------------- > > For the snippet: > %and.i = and i32 %conv.i, 255 > ... > %r2 = and i32 %and.i, 31 > > Look at %r2 in block %if.else, it is computed by two "and" operations. > Both InstCombiner::SimplifyAssociativeOrCommutative and > InstCombiner::SimplifyDemandedUseBits can replace %r2 = and i32 > %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has many > other uses and it won't become dead after the simplifications, those > simplifications won't simplify instruction cost, but they will change > the live range of variables: > Before instcombine, %conv.i is last used at %and7.i = and i32 %conv.i, > 1792, so on architecture like x86, llvm won't generate extra mov in > TwoAddressInstruction pass for it. After instcombine, %conv.i's live > range is extended, both %and7.i and %conv.i are alive after the "and" > instruction, so TwoAddressInstruction pass will generate an extra mov > for %and7.i = and i32 %conv.i. > > *** After instcombine: *** > ~/workarea/llvm-r309240/rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat > b.ll > @a = global i64 0, align 8 > @b = global i32 0, align 8 > > define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) > local_unnamed_addr { > entry: > %t1 = load i16, i16* %arrayidx, align 2 > %conv.i = zext i16 %t1 to i32 > %and.i = and i32 %conv.i, 255 > %and7.i = and i32 %conv.i, 1792 ============> %conv.i > is alive since its live range is extended. > %t3 = zext i32 %and7.i to i64 > We need a extra mov for this two address instruction on x86 > %t4 = load i64, i64* @a, align 8 > %add.i = add i64 %t4, %t3 > %cmp1 = icmp eq i64 %add.i, 1 > %cmp2 = icmp ult i32 %and.i, 101 > %bool = and i1 %cmp1, %cmp2 > br i1 %bool, label %if.then, label %if.else > > if.then: ; preds = %entry > %r1 = trunc i64 %add.i to i32 > br label %return > > if.else: ; preds = %entry > %r2 = and i32 %conv.i, 31 > ===============> %and.i is replaced by %conv.i. > store i32 %and.i, i32* @b, align 8 > br label %return > > return: ; preds = %if.else, > %if.then > %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] > ret i32 %ret > } > > *** asm code without instcombine: *** > ~/workarea/llvm-r309240/rbuild1/bin/llc < a.ll > # BB#0: # %entry > movzwl (%rdi), %ecx > movzbl %cl, %eax > andl $1792, %ecx # imm = 0x700 > addq a(%rip), %rcx > cmpq $1, %rcx > jne .LBB0_3 > > *** asm code with instcombine: *** > ~/workarea/llvm-r309240/rbuild1/bin/llc < b.ll > # BB#0: # %entry > movzwl (%rdi), %eax > movzbl %al, %ecx > movl %eax, %edx =====> One extra move > instruction in the hot path > andl $1792, %edx # imm = 0x700 > addq a(%rip), %rdx > cmpq $1, %rdx > jne .LBB0_3 > > I am not sure if it is possible to tackle the problem in > simplification or instcombine directly, so I explored the possibility > to do the reversal transformation in CodeGen, but I found that not > quite easy either because of two things: > * Seems to me Peephole pass or TwoAddrInstruction pass are the proper > places to do the reversal transformation. We need Live Interval > information to justify the transformation, however currently live > Interval information is not ready in both places. > > * The pattern matching looks quite ad hoc on machine IR. I need to > figure out we can replace %vreg0 in "AND32ri8 %vreg0<tied0>, 31" with > %vreg1 by looking at the copy chain starting from %vreg9<def> = COPY > %vreg0 to %vreg1<def> = MOVZX32rr8 %vreg9 first, and at the same time, > after replacing vreg0 with %vreg1, vreg0 becomes dead at the other > AND32ri and we can save an instruction there. In addition, replace > %vreg0 with %vreg1 may increase an extra move before "AND32ri8 > %vreg0<tied0>, 31", so we still need to check "AND32ri8 %vreg0<tied0>, > 31" is colder than "AND32ri %vreg0<tied0>, 1792". > All these efforts are just handling a specific pattern, and if the > pattern changes a little bit, they won't work. > BB_i: > %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0 > %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9 > %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792, > %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0 > ... > BB_j: > %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31, > %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0 > MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1; > mem:ST4[@b](align=8) GR32:%vreg1 > > Any suggestions are welcomed. > > Thanks, > Wei. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170802/9bdf9f79/attachment-0001.html>
Wei Mi via llvm-dev
2017-Aug-02 23:59 UTC
[llvm-dev] [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass
On Wed, Aug 2, 2017 at 4:07 PM, Matthias Braun <mbraun at apple.com> wrote:> > On Aug 2, 2017, at 4:00 PM, Wei Mi <wmi at google.com> wrote: > > On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <mbraun at apple.com> wrote: > > So to write this in a more condensed form, you have: > > %v0 = ... > %v1 = and %v0, 255 > %v2 = and %v1, 31 > use %v1 > use %v2 > > and transform this to > %v0 = ... > %v1 = and %v0, 255 > %v2 = and %v0, 31 > ... > > This is a classical problem with instruction combining. When you replace a > non-root node of a pattern (in this case %v2 would be the root pattern we > match, but we actually replace %v1 with %v0 as part of the replacement) and > that non-root node has multiple users then it becomes very hard to predict > whether the transformations pays off in the end. > > In my experience it is best to restrict those sort of transformation to > situations where you can prove there aren't multiple users on the > intermediate/non-root node that we replace. (Or if you want to get fancier > that all users > will match the same pattern). > > - Matthias > > > Thanks Matthias, yes, the transformation in instcombine is of non-root > node pattern. What I am worried about to add the restriction that %v1 > has to have only one use is that such change may regress some > benchmarks. For the same case, after we add the limitation, we will > save a move in BB1, but may add another move in BB2, because %v2 and > %v1 are both alive after %v2 = and %v1, 31. Which choice is better > depends on the hotness of BB1 and BB2, so the result of such > transformation in instcombine is like flipping a coin. Currently we > don't have a pass to use BFI to do such transformation in a meaningful > way. Maybe we need such a pass, and with that pass, we can disable the > transformations in instcombine without regressing certain benchmarks. > > Yes this strategy will regress some situations and also goes against the > LLVM spirit of perform as much normalization as possible in the middle end. > > On the other hand eagerly transform all such opportunities has more > potential to regress things than to improve things in my experience, > especially since we currently have no way of reverting this transformation. > So at the very least changing the strategy would be a very interesting > experiment and at least in another compiler I worked on we saw a huge number > of improvements and nearly no regressions when switching to this strategy of > not transforming in the presence of multiple users on a non-root node. > > - MatthiasThanks for sharing the experience! I will try the same and see how the performance looks like. Wei.> > > BB1: > > %v0 = ... > %v1 = and %v0, 255 > > > BB2: > > %v2 = and %v1, 31 > use %v1 > use %v2 > > > Wei. > > > On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <llvm-dev at lists.llvm.org> > wrote: > > Hi, > > We recently found a testcase showing that simplifications in > instcombine sometimes change the instruction without reducing the > instruction cost, but causing problems in TwoAddressInstruction pass. > And it looks like the problem is generic and other simplification may > have the same issue. I want to get some ideas about what is the best > way to fix such kind of problem. > > The testcase: > ----------------------------- a.ll ---------------------------------- > @a = global i64 0, align 8 > @b = global i32 0, align 8 > > define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) > local_unnamed_addr { > entry: > %t1 = load i16, i16* %arrayidx, align 2 > %conv.i = zext i16 %t1 to i32 > %and.i = and i32 %conv.i, 255 > %and7.i = and i32 %conv.i, 1792 > %t3 = zext i32 %and7.i to i64 > %t4 = load i64, i64* @a, align 8 > %add.i = add i64 %t4, %t3 > %cmp1 = icmp eq i64 %add.i, 1 > %cmp2 = icmp ult i32 %and.i, 101 > %bool = and i1 %cmp1, %cmp2 > br i1 %bool, label %if.then, label %if.else, !prof !0 > > if.then: ; preds = %entry > %r1 = trunc i64 %add.i to i32 > br label %return > > if.else: ; preds = %entry > %r2 = and i32 %and.i, 31 > store i32 %and.i, i32* @b, align 8 > br label %return > > return: ; preds = %if.else, > %if.then > %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] > ret i32 %ret > } > > !0 = !{!"branch_weights", i32 2000, i32 1} > ------------------------------------------------------------------------- > > For the snippet: > %and.i = and i32 %conv.i, 255 > ... > %r2 = and i32 %and.i, 31 > > Look at %r2 in block %if.else, it is computed by two "and" operations. > Both InstCombiner::SimplifyAssociativeOrCommutative and > InstCombiner::SimplifyDemandedUseBits can replace %r2 = and i32 > %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has many > other uses and it won't become dead after the simplifications, those > simplifications won't simplify instruction cost, but they will change > the live range of variables: > Before instcombine, %conv.i is last used at %and7.i = and i32 %conv.i, > 1792, so on architecture like x86, llvm won't generate extra mov in > TwoAddressInstruction pass for it. After instcombine, %conv.i's live > range is extended, both %and7.i and %conv.i are alive after the "and" > instruction, so TwoAddressInstruction pass will generate an extra mov > for %and7.i = and i32 %conv.i. > > *** After instcombine: *** > ~/workarea/llvm-r309240/rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat > b.ll > @a = global i64 0, align 8 > @b = global i32 0, align 8 > > define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) > local_unnamed_addr { > entry: > %t1 = load i16, i16* %arrayidx, align 2 > %conv.i = zext i16 %t1 to i32 > %and.i = and i32 %conv.i, 255 > %and7.i = and i32 %conv.i, 1792 ============> %conv.i > is alive since its live range is extended. > %t3 = zext i32 %and7.i to i64 > We need a extra mov for this two address instruction on x86 > %t4 = load i64, i64* @a, align 8 > %add.i = add i64 %t4, %t3 > %cmp1 = icmp eq i64 %add.i, 1 > %cmp2 = icmp ult i32 %and.i, 101 > %bool = and i1 %cmp1, %cmp2 > br i1 %bool, label %if.then, label %if.else > > if.then: ; preds = %entry > %r1 = trunc i64 %add.i to i32 > br label %return > > if.else: ; preds = %entry > %r2 = and i32 %conv.i, 31 > ===============> %and.i is replaced by %conv.i. > store i32 %and.i, i32* @b, align 8 > br label %return > > return: ; preds = %if.else, > %if.then > %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ] > ret i32 %ret > } > > *** asm code without instcombine: *** > ~/workarea/llvm-r309240/rbuild1/bin/llc < a.ll > # BB#0: # %entry > movzwl (%rdi), %ecx > movzbl %cl, %eax > andl $1792, %ecx # imm = 0x700 > addq a(%rip), %rcx > cmpq $1, %rcx > jne .LBB0_3 > > *** asm code with instcombine: *** > ~/workarea/llvm-r309240/rbuild1/bin/llc < b.ll > # BB#0: # %entry > movzwl (%rdi), %eax > movzbl %al, %ecx > movl %eax, %edx =====> One extra move > instruction in the hot path > andl $1792, %edx # imm = 0x700 > addq a(%rip), %rdx > cmpq $1, %rdx > jne .LBB0_3 > > I am not sure if it is possible to tackle the problem in > simplification or instcombine directly, so I explored the possibility > to do the reversal transformation in CodeGen, but I found that not > quite easy either because of two things: > * Seems to me Peephole pass or TwoAddrInstruction pass are the proper > places to do the reversal transformation. We need Live Interval > information to justify the transformation, however currently live > Interval information is not ready in both places. > > * The pattern matching looks quite ad hoc on machine IR. I need to > figure out we can replace %vreg0 in "AND32ri8 %vreg0<tied0>, 31" with > %vreg1 by looking at the copy chain starting from %vreg9<def> = COPY > %vreg0 to %vreg1<def> = MOVZX32rr8 %vreg9 first, and at the same time, > after replacing vreg0 with %vreg1, vreg0 becomes dead at the other > AND32ri and we can save an instruction there. In addition, replace > %vreg0 with %vreg1 may increase an extra move before "AND32ri8 > %vreg0<tied0>, 31", so we still need to check "AND32ri8 %vreg0<tied0>, > 31" is colder than "AND32ri %vreg0<tied0>, 1792". > All these efforts are just handling a specific pattern, and if the > pattern changes a little bit, they won't work. > BB_i: > %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0 > %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9 > %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792, > %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0 > ... > BB_j: > %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31, > %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0 > MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1; > mem:ST4[@b](align=8) GR32:%vreg1 > > Any suggestions are welcomed. > > Thanks, > Wei. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >
Apparently Analagous Threads
- [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass
- [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass
- [LLVMdev] RegisterCoalescing pass crashes with ImplicitDef registers
- [LLVMdev] RegisterCoalescing Pass seems to ignore part of CFG.
- [LLVMdev] RegisterCoalescing Pass seems to ignore part of CFG.