On 10/27/10 8:34 PM, Eli Friedman wrote:
> On Wed, Oct 27, 2010 at 1:29 PM, Brian West <bnwest at rice.edu> wrote:
>> Here is the patch for the new Operator Strength Reduction optimization
>> pass that I have written. The bulk of the code is in
>>
>>   lib/Transforms/Scalar/OperatorStrengthReduce.cpp
>>
>> The algorithm finds reduction opportunities in both array accesses and
>> explicit multiplications within loops.
>>
>> Next, I plan on writing the regression tests.
>
> Comments:
> 1. Please don't top-post.
> 2. Instruction::getOpcodeName instead of your OpcodeToString function.

Can do. I did not see getOpcodeName() in lib/Target/CBackend/CBackend.cpp,
so I wrote my own. FWIW, I am only using OpcodeToString() for debug output.

> 3. LLVM already has a significant amount of infrastructure for loop
> passes; why does this pass have its own code for finding loops?

I saw the loop infrastructure for CFG loops. This algorithm finds loops in
the data flow (more precisely: strongly-connected components in the
SSA-graph), e.g.

  bb5:
    %20 = add nsw i32 %i.0, 1
    br label %bb6

  bb6:
    %i.0 = phi i32 [ 0, %entry ], [ %20, %bb5 ]

There is a data-flow loop between %20 and %i.0. The OSR paper has a nice
figure showing data-flow loops. Here is a small excerpt from the OSR paper:

  "OSR is driven by a simple depth-first search of the SSA-graph, using
  Tarjan's strongly-connected component finder. The SCC-finder lets OSR
  discover induction variables in topological order and process them as
  they are discovered. As the SCCs are discovered, they are processed by
  a set of mutually-recursive routines that test for induction variables
  and region constants, and then reduce the appropriate operations."

> 4. What's the use case for this pass? Can this pass improve
> performance of generated code beyond the existing -loop-reduce pass?
> If not, could it usefully act as a faster replacement for -loop-reduce
> for fast code generation?
> (Perhaps someone doing fast code generation with LLVM could comment on
> whether -loop-reduce is useful, and whether the performance is an issue.)

-loop-reduce (LSR) only looks at the implicit multiplication in array
accesses. The following code has an explicit multiplication within a loop
that LSR will not reduce:

  extern float loop4(float* a, int start, int stop) {
      int i;
      float sum;
      float *pEntry;
      char *p;
      unsigned long offset;
      const unsigned int float_size = sizeof(float);

      sum = 0.0;
      p = (char*)a;
      for (i = start-1; i < stop; ++i) {
          offset = i * float_size;
          pEntry = (float*)(p + offset);
          sum += *pEntry;
      }
      return sum;
  }

OSR also finds reductions in some esoteric array accesses that LSR misses.
I found these LSR misses while writing less-than-real-world code to break
OSR. I can include these examples, if requested.

For array accesses, OSR and LSR produce similar LLVM IR, and I am guessing
that their execution times should be similar. Empirically, the OSR
optimization is compile-time faster than LSR. I have also noticed that OSR
has more "analysis" requirements: Induction Variable Users, Natural Loop
Information, Canonicalize natural loops, and Scalar Evolution Analysis.
Both OSR and LSR require the Dominator Tree Construction analysis pass.

I did not mention in the original email (and should have) that OSR needs
-instcombine to be run after it for cleanup. Also, -licm, -reassociate,
-gvn and -sccp can be enabling optimizations for OSR.

> -Eli

Eli,

Thanks for your comments.

Brian West
On Thu, Oct 28, 2010 at 9:38 AM, Brian West <bnwest at rice.edu> wrote:
>> 3. LLVM already has a significant amount of infrastructure for loop
>> passes; why does this pass have its own code for finding loops?
>
> I saw the loop infrastructure for CFG loops. This algorithm finds loops
> in the data flow (more precisely: strongly-connected components in the
> SSA-graph), e.g.
>
>   bb5:
>     %20 = add nsw i32 %i.0, 1
>     br label %bb6
>
>   bb6:
>     %i.0 = phi i32 [ 0, %entry ], [ %20, %bb5 ]
>
> There is a data-flow loop between %20 and %i.0. The OSR paper has a nice
> figure showing data-flow loops. Here is a small excerpt from the OSR
> paper:
>
>   "OSR is driven by a simple depth-first search of the SSA-graph, using
>   Tarjan's strongly-connected component finder. The SCC-finder lets OSR
>   discover induction variables in topological order and process them as
>   they are discovered. As the SCCs are discovered, they are processed by
>   a set of mutually-recursive routines that test for induction variables
>   and region constants, and then reduce the appropriate operations."

Ah, okay, that makes sense. I didn't really read the algorithm too
carefully...

>> 4. What's the use case for this pass? Can this pass improve
>> performance of generated code beyond the existing -loop-reduce pass?
>> If not, could it usefully act as a faster replacement for -loop-reduce
>> for fast code generation? (Perhaps someone doing fast code generation
>> with LLVM could comment on whether -loop-reduce is useful, and whether
>> the performance is an issue.)
[...]

> For array accesses, OSR and LSR produce similar LLVM IR and I am
> guessing that their execution times should be similar.

For normal compilation, LSR runs rather late (with the code generation
passes), and has some target-specific knowledge. We really need
benchmarks here, I think.

> Empirically the OSR optimization is compile-time faster than LSR.
> I have also noticed that OSR has more "analysis" requirements: Induction
> Variable Users, Natural Loop Information, Canonicalize natural loops,
> and Scalar Evolution Analysis. Both OSR and LSR require the Dominator
> Tree Construction analysis pass.

My primary concern here is that if this code gets committed without
anyone interested in actually using it, it will just end up orphaned, so
there's no point to contributing it.

> I did not mention in the original email (and should have) that OSR needs
> -instcombine to be run after it for cleanup. Also -licm, -reassociate,
> -gvn and -sccp can be enabling optimizations for OSR.

Hmm... perhaps that could be partially fixed using the InstSimplify
infrastructure.

-Eli
Eli Friedman <eli.friedman <at> gmail.com> writes:
>> Empirically the OSR optimization is compile-time faster than LSR. I
>> have also noticed that OSR has more "analysis" requirements: Induction
>> Variable Users, Natural Loop Information, Canonicalize natural loops,
>> and Scalar Evolution Analysis. Both OSR and LSR require the Dominator
>> Tree Construction analysis pass.
>
> My primary concern here is that if this code gets committed without
> anyone interested in actually using it, it will just end up orphaned,
> so there's no point to contributing it.

This work is part of the PACE project (http://pace.rice.edu/). My team
has funding for another three years and is committed to maintaining this
code.

>> I did not mention in the original email (and should have) that OSR
>> needs -instcombine to be run after it for cleanup. Also -licm,
>> -reassociate, -gvn and -sccp can be enabling optimizations for OSR.
>
> Hmm... perhaps that could be partially fixed using the InstSimplify
> infrastructure.

OSR can produce induction variables that are not used. -instcombine finds
the unused induction variables and deletes them (which is super cool,
btw :)). I am unsure if InstructionSimplify can fix that.

> -Eli

Thanks,
Brian
Eli is right. We do need to see some benchmark numbers and understand how
the pass will fit into the target-independent optimizer. While we
encourage contributions, we typically don't commit new passes unless they
introduce new functionality that has active clients. It would also help
if you provided us with compile-time numbers.

Evan

On Oct 28, 2010, at 10:56 AM, Eli Friedman wrote:
> On Thu, Oct 28, 2010 at 9:38 AM, Brian West <bnwest at rice.edu> wrote:
>>> 3. LLVM already has a significant amount of infrastructure for loop
>>> passes; why does this pass have its own code for finding loops?
>>
>> I saw the loop infrastructure for CFG loops. This algorithm finds
>> loops in the data flow (more precisely: strongly-connected components
>> in the SSA-graph), e.g.
>>
>>   bb5:
>>     %20 = add nsw i32 %i.0, 1
>>     br label %bb6
>>
>>   bb6:
>>     %i.0 = phi i32 [ 0, %entry ], [ %20, %bb5 ]
>>
>> There is a data-flow loop between %20 and %i.0. The OSR paper has a
>> nice figure showing data-flow loops. Here is a small excerpt from the
>> OSR paper:
>>
>>   "OSR is driven by a simple depth-first search of the SSA-graph,
>>   using Tarjan's strongly-connected component finder. The SCC-finder
>>   lets OSR discover induction variables in topological order and
>>   process them as they are discovered. As the SCCs are discovered,
>>   they are processed by a set of mutually-recursive routines that test
>>   for induction variables and region constants, and then reduce the
>>   appropriate operations."
>
> Ah, okay, that makes sense. I didn't really read the algorithm too
> carefully...
>
>>> 4. What's the use case for this pass? Can this pass improve
>>> performance of generated code beyond the existing -loop-reduce pass?
>>> If not, could it usefully act as a faster replacement for
>>> -loop-reduce for fast code generation? (Perhaps someone doing fast
>>> code generation with LLVM could comment on whether -loop-reduce is
>>> useful, and whether the performance is an issue.)
> [...]
>> For array accesses, OSR and LSR produce similar LLVM IR and I am
>> guessing that their execution times should be similar.
>
> For normal compilation, LSR runs rather late (with the code generation
> passes), and has some target-specific knowledge. We really need
> benchmarks here, I think.
>
>> Empirically the OSR optimization is compile-time faster than LSR. I
>> have also noticed that OSR has more "analysis" requirements: Induction
>> Variable Users, Natural Loop Information, Canonicalize natural loops,
>> and Scalar Evolution Analysis. Both OSR and LSR require the Dominator
>> Tree Construction analysis pass.
>
> My primary concern here is that if this code gets committed without
> anyone interested in actually using it, it will just end up orphaned,
> so there's no point to contributing it.
>
>> I did not mention in the original email (and should have) that OSR
>> needs -instcombine to be run after it for cleanup. Also -licm,
>> -reassociate, -gvn and -sccp can be enabling optimizations for OSR.
>
> Hmm... perhaps that could be partially fixed using the InstSimplify
> infrastructure.
>
> -Eli
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu        http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev