Hal Finkel
2015-Mar-12 21:34 UTC
[LLVMdev] Question about shouldMergeGEPs in InstructionCombining
Hi Mark, It is not clear to me at all that preventing the merging is the right solution. There are a large number of analysis, including alias analysis, and optimizations that use GetUnderlyingObject, and related routines to search back through GEPs. They only do this up to some small finite depth (six, IIRC). So reducing the GEP depth is likely the right solution for InstCombine (which has the job of canonicalizing the IR). We should, however, pull these apart somewhere, and probably in some way that is address-mode aware. I'd recommend trying to split non-free (via the addressing-mode) loop-invariant parts of GEPs out of loops in CodeGenPrep. -Hal ----- Original Message -----> From: "Mark Heffernan" <meheff at google.com> > To: "Francois Pichet" <pichet2000 at gmail.com> > Cc: "Hal Finkel" <hfinkel at anl.gov>, "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu> > Sent: Thursday, March 12, 2015 4:19:56 PM > Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining > > > > > On Thu, Mar 12, 2015 at 2:03 PM, Francois Pichet < > pichet2000 at gmail.com > wrote: > > > > > I think it would make sense for (1) and (2). I am not sure if (3) is > feasible in instcombine . (I am not too familiar with LoopInfo) > > > LoopInfo should be available for at least one of the instcombine > invocations. In the -O2 pipeline it looks like instcombine is called > 7 times and it should have loopinfo for the third invocation (called > immediate after some loop passes). Here's output from > --debug-pass=Structure: > > > > Pass Arguments: -tti -no- aa -tbaa -scoped-noalias > -assumption-cache-tracker -targetlibinfo -basicaa -verify > -simplifycfg -domtree -sroa -early-cse -lower-expect > Target Transform Information > No Alias Analysis (always returns 'may' alias) > Type-Based Alias Analysis > Scoped NoAlias Alias Analysis > Assumption Cache Tracker > Target Library Information > Basic Alias Analysis (stateless AA impl) > FunctionPass Manager > Module Verifier > Simplify the CFG > Dominator Tree Construction > SROA > Early CSE > Lower 'expect' Intrinsics > Pass Arguments: -targetlibinfo -tti -no-aa -tbaa -scoped-noalias > -assumption-cache-tracker -basicaa -verify-di -ipsccp -globalopt > -deadargelim -domtree -instcombine -simplifycfg -basiccg -prune-eh > -inline-cost -i > nline -functionattrs -sroa -domtree -early-cse -lazy-value-info > -jump-threadi ng -correlated-propagation -simplifycfg -domtree > -instcombine -tailcallelim -simplifycfg -reassociate -domtree -loops > -loop-simplify -lc > ssa -loop-rotate -licm -loop-unswitch -instcombine -scalar-evolution > -loop-simplify -lcssa -indvars -loop-idiom -loop-deletion > -loop-unroll -memdep -mldst-motion -domtree -memdep -gvn -memdep > -memcpyopt -sccp -dom > tree -bdce -instcombine -lazy-value-info -jump-threading > -correlated-propagation -domtree -memdep -dse -loops -loop-simplify > -lcssa -licm -adce -simplifycfg -domtree -instcombine -barrier > -domtree -loops -loop-sim > plify -lcssa -loop-rotate -branch-prob -block-freq -scalar-evolution > -loop-accesses -loop-vectorize -instcombine -scalar-evolution > -slp-vectorizer -simplifycfg -domtree -instcombine -loops > -loop-simplify -lcssa -s > calar-evolution -loop-unroll -alignment-from-assumptions > -strip-dead-prototypes -globaldce -constmerge -verify -verify-di > Target Library Information > Tar get Transform Information > No Alias Analysis (always returns 'may' alias) > Type-Based Alias Analysis > Scoped NoAlias Alias Analysis > Assumption Cache Tracker > Basic Alias Analysis (stateless AA impl) > ModulePass Manager > Debug Info Verifier > Interprocedural Sparse Conditional Constant Propagation > Global Variable Optimizer > Dead Argument Elimination > FunctionPass Manager > Dominator Tree Construction > Combine redundant instructions > Simplify the CFG > CallGraph Construction > Call Graph SCC Pass Manager > Remove unused exception handling info > Inline Cost Analysis > Function Integration/Inlining > Deduce function attributes > FunctionPass Manager > SROA > Dominator Tree Construction > Early CSE > Lazy Value Information Analysis > Jump Threading > Value Propagation > Simplify the CFG > Dominator Tree Construction > Combine redundant instructions > Tail Call Elimination > Simplify the CFG > Reassociate expressions > Dominator Tree Construction > Natural Loop Information > Canonicalize natural loops > Loop-Closed SSA Form Pass > Loop Pass Manager > Rotate Loops > Loop Invariant Code Motion > Unswitch loops > > Combine redundant instructions > Scalar Evolution Analysis > Canonicalize natural loops > Loop-Closed SSA Form Pass > Loop Pass Manager > Induction Variable Simplification > Recognize loop idioms > Delete dead loops > Unroll loops > Memory Dependence Analysis > MergedLoadStoreMotion > Dominator Tree Construction > Memory Dependence Analysis > Global Value Numbering > Memory Dependence Analysis > MemCpy Optimization > Sparse Conditional Constant Propagation > Dominator Tree Construction > Bit-Tracking Dead Code Elimination > Combine redundant instructions > Lazy Value Information Analysis > Jump Threading > Value Propagation > Dominator Tree Construction > Memory Dependence Analysis > Dead Store Elimination > Natural Loop Information > Canonicalize natural loops > Loop-Closed SSA Form Pass > Loop Pass Manager > Loop Invariant Code Motion > Aggressive Dead Code Elimination > Simplify the CFG > Dominator Tree Construction > Combine redundant instructions > A No-Op Barrier Pass > FunctionPass Manager > Dominator Tree Construction > Natural Loop Information > Canonicalize natural loops > Loop-Closed SSA Form Pass > Loop Pass Manager > Rotate Loops > Branch Probability Analysis > Block Frequency Analysis > Scalar Evolution Analysis > Loop Access Analysis > Loop Vectorization > Combine redundant instructions > Scalar Evolution Analysis > SLP Vectorizer > Simplify the CFG > Dominator Tree Construction > Combine redundant instructions > Natural Loop Information > Canonicalize natural loops > Loop-Closed SSA Form Pass > Scalar Evolution Analysis > Loop Pass Manager > Unroll loops > Alignment from assumptions > > Strip Unused Function Prototypes > Dead Global Elimination > Merge Duplicate Global Constants > FunctionPass Manager > Module Verifier > Debug Info Verifier Bitcode Writer > > > > > > > Mark > > > > > > > > For the Octasic's Opus platform, I modified shouldMergeGEPs in our > fork to: > > > > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() && > !Src.hasOneUse()) > return false; > > > return Src.hasAllConstantIndices(); // was return false; > > > > Following that change, I noticed some performance gain for a few > specific tests and no regression at all in our (admittedly limited) > benchmarks suite. > > > Regards, > Francois Pichet, Octasic. > > > > > On Thu, Mar 12, 2015 at 4:14 PM, Mark Heffernan < meheff at google.com > > wrote: > > > > Coincidentally, I just ran into this same issue on some of our > benchmarks for the NVPTX backend. You have something like this > before instcombine: > > > %tmp = getelementptr inbounds i32, i32* %input, i64 %offset > > loop: > > %loop_variant = ... > %ptr = getelementptr inbounds i32, i32* %tmp, i64 %loop_variant > > > Which gets transformed to: > > > > loop: > > %loop_variant = ... > %sum = add nsw i64 %loop_variant, %offset > > %ptr = getelementptr inbounds i32, i32* %input, i64 %sum > > > > The merge essentially reassociates the loop-variant term > (%loop_variant) and loop-invariant terms (%input and %offset) in > such a way that LICM can't remove it. > > > > > One idea is to only perform this style of gep merge if at least one > of the following conditions is true: > (1) both index terms in the GEP are constant. In this case no new add > instruction is created, instead the constants are folded. > (2) the GEPs are in the same BB. > (3) LoopInfo is available, and we know we're not creating a new > instruction in a (deeper) loop. > > > > What do you think? > > > Mark > > > >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Mark Heffernan
2015-Mar-13 17:13 UTC
[LLVMdev] Question about shouldMergeGEPs in InstructionCombining
On Thu, Mar 12, 2015 at 2:34 PM, Hal Finkel <hfinkel at anl.gov> wrote:> It is not clear to me at all that preventing the merging is the right > solution. There are a large number of analysis, including alias analysis, > and optimizations that use GetUnderlyingObject, and related routines to > search back through GEPs. They only do this up to some small finite depth > (six, IIRC). So reducing the GEP depth is likely the right solution for > InstCombine (which has the job of canonicalizing the IR). > > We should, however, pull these apart somewhere, and probably in some way > that is address-mode aware. I'd recommend trying to split non-free (via the > addressing-mode) loop-invariant parts of GEPs out of loops in CodeGenPrep. >Thanks, Hal. I'll have a look at CGP to see how this might be done. It's a little more complicated than just pulling the GEP apart, there needs to be a loop-invariant-aware reassociation to undo the damage done by the initial merge. Mark> > -Hal > > ----- Original Message ----- > > From: "Mark Heffernan" <meheff at google.com> > > To: "Francois Pichet" <pichet2000 at gmail.com> > > Cc: "Hal Finkel" <hfinkel at anl.gov>, "LLVM Developers Mailing List" < > llvmdev at cs.uiuc.edu> > > Sent: Thursday, March 12, 2015 4:19:56 PM > > Subject: Re: [LLVMdev] Question about shouldMergeGEPs in > InstructionCombining > > > > > > > > > > On Thu, Mar 12, 2015 at 2:03 PM, Francois Pichet < > > pichet2000 at gmail.com > wrote: > > > > > > > > > > I think it would make sense for (1) and (2). I am not sure if (3) is > > feasible in instcombine . (I am not too familiar with LoopInfo) > > > > > > LoopInfo should be available for at least one of the instcombine > > invocations. In the -O2 pipeline it looks like instcombine is called > > 7 times and it should have loopinfo for the third invocation (called > > immediate after some loop passes). Here's output from > > --debug-pass=Structure: > > > > > > > > Pass Arguments: -tti -no- aa -tbaa -scoped-noalias > > -assumption-cache-tracker -targetlibinfo -basicaa -verify > > -simplifycfg -domtree -sroa -early-cse -lower-expect > > Target Transform Information > > No Alias Analysis (always returns 'may' alias) > > Type-Based Alias Analysis > > Scoped NoAlias Alias Analysis > > Assumption Cache Tracker > > Target Library Information > > Basic Alias Analysis (stateless AA impl) > > FunctionPass Manager > > Module Verifier > > Simplify the CFG > > Dominator Tree Construction > > SROA > > Early CSE > > Lower 'expect' Intrinsics > > Pass Arguments: -targetlibinfo -tti -no-aa -tbaa -scoped-noalias > > -assumption-cache-tracker -basicaa -verify-di -ipsccp -globalopt > > -deadargelim -domtree -instcombine -simplifycfg -basiccg -prune-eh > > -inline-cost -i > > nline -functionattrs -sroa -domtree -early-cse -lazy-value-info > > -jump-threadi ng -correlated-propagation -simplifycfg -domtree > > -instcombine -tailcallelim -simplifycfg -reassociate -domtree -loops > > -loop-simplify -lc > > ssa -loop-rotate -licm -loop-unswitch -instcombine -scalar-evolution > > -loop-simplify -lcssa -indvars -loop-idiom -loop-deletion > > -loop-unroll -memdep -mldst-motion -domtree -memdep -gvn -memdep > > -memcpyopt -sccp -dom > > tree -bdce -instcombine -lazy-value-info -jump-threading > > -correlated-propagation -domtree -memdep -dse -loops -loop-simplify > > -lcssa -licm -adce -simplifycfg -domtree -instcombine -barrier > > -domtree -loops -loop-sim > > plify -lcssa -loop-rotate -branch-prob -block-freq -scalar-evolution > > -loop-accesses -loop-vectorize -instcombine -scalar-evolution > > -slp-vectorizer -simplifycfg -domtree -instcombine -loops > > -loop-simplify -lcssa -s > > calar-evolution -loop-unroll -alignment-from-assumptions > > -strip-dead-prototypes -globaldce -constmerge -verify -verify-di > > Target Library Information > > Tar get Transform Information > > No Alias Analysis (always returns 'may' alias) > > Type-Based Alias Analysis > > Scoped NoAlias Alias Analysis > > Assumption Cache Tracker > > Basic Alias Analysis (stateless AA impl) > > ModulePass Manager > > Debug Info Verifier > > Interprocedural Sparse Conditional Constant Propagation > > Global Variable Optimizer > > Dead Argument Elimination > > FunctionPass Manager > > Dominator Tree Construction > > Combine redundant instructions > > Simplify the CFG > > CallGraph Construction > > Call Graph SCC Pass Manager > > Remove unused exception handling info > > Inline Cost Analysis > > Function Integration/Inlining > > Deduce function attributes > > FunctionPass Manager > > SROA > > Dominator Tree Construction > > Early CSE > > Lazy Value Information Analysis > > Jump Threading > > Value Propagation > > Simplify the CFG > > Dominator Tree Construction > > Combine redundant instructions > > Tail Call Elimination > > Simplify the CFG > > Reassociate expressions > > Dominator Tree Construction > > Natural Loop Information > > Canonicalize natural loops > > Loop-Closed SSA Form Pass > > Loop Pass Manager > > Rotate Loops > > Loop Invariant Code Motion > > Unswitch loops > > > > Combine redundant instructions > > Scalar Evolution Analysis > > Canonicalize natural loops > > Loop-Closed SSA Form Pass > > Loop Pass Manager > > Induction Variable Simplification > > Recognize loop idioms > > Delete dead loops > > Unroll loops > > Memory Dependence Analysis > > MergedLoadStoreMotion > > Dominator Tree Construction > > Memory Dependence Analysis > > Global Value Numbering > > Memory Dependence Analysis > > MemCpy Optimization > > Sparse Conditional Constant Propagation > > Dominator Tree Construction > > Bit-Tracking Dead Code Elimination > > Combine redundant instructions > > Lazy Value Information Analysis > > Jump Threading > > Value Propagation > > Dominator Tree Construction > > Memory Dependence Analysis > > Dead Store Elimination > > Natural Loop Information > > Canonicalize natural loops > > Loop-Closed SSA Form Pass > > Loop Pass Manager > > Loop Invariant Code Motion > > Aggressive Dead Code Elimination > > Simplify the CFG > > Dominator Tree Construction > > Combine redundant instructions > > A No-Op Barrier Pass > > FunctionPass Manager > > Dominator Tree Construction > > Natural Loop Information > > Canonicalize natural loops > > Loop-Closed SSA Form Pass > > Loop Pass Manager > > Rotate Loops > > Branch Probability Analysis > > Block Frequency Analysis > > Scalar Evolution Analysis > > Loop Access Analysis > > Loop Vectorization > > Combine redundant instructions > > Scalar Evolution Analysis > > SLP Vectorizer > > Simplify the CFG > > Dominator Tree Construction > > Combine redundant instructions > > Natural Loop Information > > Canonicalize natural loops > > Loop-Closed SSA Form Pass > > Scalar Evolution Analysis > > Loop Pass Manager > > Unroll loops > > Alignment from assumptions > > > > Strip Unused Function Prototypes > > Dead Global Elimination > > Merge Duplicate Global Constants > > FunctionPass Manager > > Module Verifier > > Debug Info Verifier Bitcode Writer > > > > > > > > > > > > > > Mark > > > > > > > > > > > > > > > > For the Octasic's Opus platform, I modified shouldMergeGEPs in our > > fork to: > > > > > > > > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() && > > !Src.hasOneUse()) > > return false; > > > > > > return Src.hasAllConstantIndices(); // was return false; > > > > > > > > Following that change, I noticed some performance gain for a few > > specific tests and no regression at all in our (admittedly limited) > > benchmarks suite. > > > > > > Regards, > > Francois Pichet, Octasic. > > > > > > > > > > On Thu, Mar 12, 2015 at 4:14 PM, Mark Heffernan < meheff at google.com > > > wrote: > > > > > > > > Coincidentally, I just ran into this same issue on some of our > > benchmarks for the NVPTX backend. You have something like this > > before instcombine: > > > > > > %tmp = getelementptr inbounds i32, i32* %input, i64 %offset > > > > loop: > > > > %loop_variant = ... > > %ptr = getelementptr inbounds i32, i32* %tmp, i64 %loop_variant > > > > > > Which gets transformed to: > > > > > > > > loop: > > > > %loop_variant = ... > > %sum = add nsw i64 %loop_variant, %offset > > > > %ptr = getelementptr inbounds i32, i32* %input, i64 %sum > > > > > > > > The merge essentially reassociates the loop-variant term > > (%loop_variant) and loop-invariant terms (%input and %offset) in > > such a way that LICM can't remove it. > > > > > > > > > > One idea is to only perform this style of gep merge if at least one > > of the following conditions is true: > > (1) both index terms in the GEP are constant. In this case no new add > > instruction is created, instead the constants are folded. > > (2) the GEPs are in the same BB. > > (3) LoopInfo is available, and we know we're not creating a new > > instruction in a (deeper) loop. > > > > > > > > What do you think? > > > > > > Mark > > > > > > > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150313/5cbc95f8/attachment.html>
Daniel Berlin
2015-Mar-13 17:38 UTC
[LLVMdev] Question about shouldMergeGEPs in InstructionCombining
On Fri, Mar 13, 2015 at 10:16 AM Mark Heffernan <meheff at google.com> wrote:> On Thu, Mar 12, 2015 at 2:34 PM, Hal Finkel <hfinkel at anl.gov> wrote: > >> It is not clear to me at all that preventing the merging is the right >> solution. There are a large number of analysis, including alias analysis, >> and optimizations that use GetUnderlyingObject, and related routines to >> search back through GEPs. They only do this up to some small finite depth >> (six, IIRC). So reducing the GEP depth is likely the right solution for >> InstCombine (which has the job of canonicalizing the IR). >> >> We should, however, pull these apart somewhere, and probably in some way >> that is address-mode aware. I'd recommend trying to split non-free (via the >> addressing-mode) loop-invariant parts of GEPs out of loops in CodeGenPrep. >> > > Thanks, Hal. I'll have a look at CGP to see how this might be done. It's > a little more complicated than just pulling the GEP apart, there needs to > be a loop-invariant-aware reassociation to undo the damage done by the > initial merge. >So, this is in fact, just using the ranks reassociation would normally use, to do splitting :) That is, even outside of geps, you end up with chains of operations where, if you moved the operands around, you would expose loop-invariant calculation. Reassociation builds a rank map ordered by RPO traversal of the CFG, and uses it to place operations at the same rank into the same expression. This guarantees deeper loops have higher ranks. For your purposes, if you have it calculated already, you could just use loop depth instead of RPO ordering, since you only care about invariantness (the downside to not using RPO here is that you may increase register pressure. The upside is it's easier to reason about the ranks for your purposes). Anyway, reassociate tries to place operations with the lowest ranks into leaf computations, since those are "the most loop invariant". You want to do exactly the same thing, with likely the same algorithm, splitting geps as necessary into "leaf geps" to place low-ranking operations in the same gep. The only heuristic part is "what is the lowest rank you want to split for". If you have stuff at rank 0 and stuff not at rank 0, you always want to split out the rank 0 stuff. rank 1 and rank 2, well, if you are using loop depth, it can only be invariant in a loop of rank 2, etc. You don't need to split if rank == highest rank in functions, since you are guaranteed it is not invariant in any loop in that case -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150313/42191b03/attachment.html>