Nicolas Capens
2008-May-20 12:03 UTC
[LLVMdev] Optimization passes organization and tradeoffs
Hi all, I'm getting more impressed by LLVM day by day, but what's a bit unclear to me now is the order of optimization passes, and their performance. I think I have a pretty solid understanding of what each pass does at a high level, but I couldn't find any documentation about how they interact at a lower level. I'd like to use LLVM for generating high-performance stream processing code at run-time. Obviously the resulting code should be as optimized as possible, but performing the optimizations themselves should also be very fast. The code I'm compiling is comparable to C (without any exception handling or garbage collection, so none of the related passes are needed). My first attempt at collecting useful optimizations looks like this: passManager->add(new TargetData(*executionEngine->getTargetData())); passManager->add(createScalarReplAggregatesPass()); // Convert to SSA form passManager->add(createSCCPPass()); // Propagate constants passManager->add(createInstructionCombiningPass()); // Peephole optimization passManager->add(createDeadStoreEliminationPass()); // Dead store elimination passManager->add(createAggressiveDCEPass()); // Aggressive dead code elimination passManager->add(createCFGSimplificationPass()); // Control-flow optimization I have several questions about this: 1) Does ScalarReplAggregates totally superscede PromoteMemoryToRegister? I think I need it to optimize small arrays, but what is the expected added complexity? 2) Does SCCP also eliminate multiplying/dividing by 1 and adding/subtracting 0? 3) Is it arbitrary where to place InstructionCombining? Is there a better order? 4) Is DeadStoreElimination still necessary when we have AggressiveDCE? 5) What are the tradeoffs between the different dead code elimination variants (why not always use the aggressive one)? 6) Is there a better place for CFGSimplification? Should I perform it at multiple points? Also, my code will frequently have vectors, that are either initialized to all 0.0 or 1.0. This offers a lot of opportunity for eliminating many multiplications and additions, but I was wondering which passes I need for this (probably a Reassociate pass, what else)? And I also wonder whether these passes actually work with vectors? Is there any other highly recommended pass for this kind of applications that I'm forgetting? Any passes that I better avoid due to poor gains and long optimization time? Sorry for the many question marks. :-) I don't expect there is an absolute definite answer to each of them, but some guidelines and insights would be very welcome and much appreciated. Thanks! Nicolas Capens -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20080520/b3cc9592/attachment.html>
David Greene
2008-May-20 15:57 UTC
[LLVMdev] Optimization passes organization and tradeoffs
On Tuesday 20 May 2008 07:03, Nicolas Capens wrote:> 1) Does ScalarReplAggregates totally superscede PromoteMemoryToRegister? INope, they are different. Mem2Reg is really important if you want register allocation.> think I need it to optimize small arrays, but what is the expected added > complexity?I shouldn't think it would be very expensive at all.> 2) Does SCCP also eliminate multiplying/dividing by 1 and > adding/subtracting 0?That's probably more the purview of instcombine.> 3) Is it arbitrary where to place InstructionCombining? Is there a better > order?Typically you'll want to place it after various kinds of propagation are done (for example, SCCP). You can run it multipe times.> 4) Is DeadStoreElimination still necessary when we have AggressiveDCE?Probably, but I'll let others give the definitive answer.> 5) What are the tradeoffs between the different dead code elimination > variants (why not always use the aggressive one)?Others can speak to this.> 6) Is there a better place for CFGSimplification? Should I perform it at > multiple points?I think once is probably enough. Earlier would probably be better as it will simplify later passes and potentially help them run faster.> Also, my code will frequently have vectors, that are either initialized to > all 0.0 or 1.0. This offers a lot of opportunity for eliminating many > multiplications and additions, but I was wondering which passes I need for > this (probably a Reassociate pass, what else)? And I also wonder whether > these passes actually work with vectors?I would assume they work with vector. Anything expression-related is good to capture these opportunities (reassociation, folding, instcombine, etc.).> Is there any other highly recommended pass for this kind of applications > that I'm forgetting? Any passes that I better avoid due to poor gains and > long optimization time?The most expensive optimizations are typically scheduling and register allocation. Regalloc is pretty important but it depends a lot on the machine architecture. If you have a very fast cache, it becomes less important. Scheduling is hit-or-miss. If your architecture is strictly in-order, it's pretty important. You can always play games like have the scheduler bail out if a basic block is too large or not scheduling bocks outside of loops. This can save significant compile time.> Sorry for the many question marks. :-) I don't expect there is an absolute > definite answer to each of them, but some guidelines and insights would be > very welcome and much appreciated.Phase ordering is one of the trickiest parts to tune. It's often highly code-dependent. If your target sources are limited in number, you might be able to get away with a funky ordering that would be disastrous on general-purpose integer codes, for example. It's often a matter of trial and error. Several people have done research on using genetic algorithms and other tricks to find an optimal phase ordering. A google search should turn up interesting stuff. Have a look at the opt tool sources to get an idea of where to start. -Dave
Chris Lattner
2008-May-20 18:28 UTC
[LLVMdev] Optimization passes organization and tradeoffs
On May 20, 2008, at 8:57 AM, David Greene wrote:> On Tuesday 20 May 2008 07:03, Nicolas Capens wrote: > >> 1) Does ScalarReplAggregates totally superscede >> PromoteMemoryToRegister? I > > Nope, they are different. Mem2Reg is really important if you want > register > allocation.Actually SROA does fully subsume Mem2Reg. It iterates between breaking up aggregates and promoting them to registers.>> think I need it to optimize small arrays, but what is the expected >> added >> complexity? > > I shouldn't think it would be very expensive at all.Agreed.>> 2) Does SCCP also eliminate multiplying/dividing by 1 and >> adding/subtracting 0? > > That's probably more the purview of instcombine.Right.>> 4) Is DeadStoreElimination still necessary when we have >> AggressiveDCE? > > Probably, but I'll let others give the definitive answer.They are orthogonal. DSE touches stores, AggressiveDCE does mostly scalars. ADCE basically assumes all stores are live.>> 5) What are the tradeoffs between the different dead code elimination >> variants (why not always use the aggressive one)? > > Others can speak to this.ADCE is actually quite expensive from the compile-time perspective, because it requires building post dominance information. I'd stick with a combination of "-dce -simplifycfg" instead of -adce.>> 6) Is there a better place for CFGSimplification? Should I perform >> it at >> multiple points? > > I think once is probably enough. Earlier would probably be better > as it will > simplify later passes and potentially help them run faster.I'd suggest running it earlier as you suggest, but then also late to clean up. It's a very fast pass.>> Also, my code will frequently have vectors, that are either >> initialized to >> all 0.0 or 1.0. This offers a lot of opportunity for eliminating many >> multiplications and additions, but I was wondering which passes I >> need for >> this (probably a Reassociate pass, what else)? And I also wonder >> whether >> these passes actually work with vectors? > > I would assume they work with vector. Anything expression-related > is good > to capture these opportunities (reassociation, folding, instcombine, > etc.).Note that a lot of optimizations don't apply to floating point values. E.g. "x+0.0" can't be eliminated (in general), because "-0.0+0.0" is 0.0, not -0.0. All the optimizations should tolerate vectors, but may miss certain optimizations on them. If you notice something (e.g. reassociation) that isn't being done aggressively with vectors, please file a bug. We do a lot of vector optimizations though, so we are probably only missing details, not whole classes of important optimizations.>> Is there any other highly recommended pass for this kind of >> applications >> that I'm forgetting? Any passes that I better avoid due to poor >> gains and >> long optimization time? >Another important optimization is GVN, which does common subexpression elimination (for example).>> Sorry for the many question marks. :-) I don't expect there is an >> absolute >> definite answer to each of them, but some guidelines and insights >> would be >> very welcome and much appreciated. > > Phase ordering is one of the trickiest parts to tune. It's often > highly > code-dependent. If your target sources are limited in number, you > might be > able to get away with a funky ordering that would be disastrous on > general-purpose integer codes, for example. It's often a matter of > trial and > error. Several people have done research on using genetic > algorithms and > other tricks to find an optimal phase ordering. A google search > should turn > up interesting stuff."What he said" :). Also, if you see specific problems in the generated code, let us know. -Chris
Nicolas Capens
2008-May-21 21:09 UTC
[LLVMdev] Optimization passes organization and tradeoffs
Hi David, Thanks for the info, but I'm not using any of the command line options. I'm dynamically generating code at run-time. So unless I'm missing some way to set these command line options at run-time I think I need another API for controlling register allocation and scheduling. I was also under the impression that linear scan register allocation is quite cheap, faster than graph coloring. Or does it use an advanced variation (with aggressive coalescing), while other register allocation algorithms are more straightforward? Cheers, Nicolas -----Original Message----- From: David Greene [mailto:dag at cray.com] Sent: Wednesday, 21 May, 2008 18:57 To: Nicolas Capens Subject: Re: [LLVMdev] Optimization passes organization and tradeoffs On Wednesday 21 May 2008 04:01, you wrote:> Thanks for the clarifications David! > > My main target is x86. How do I control register allocation andscheduling?> The docs talk about IR level optimization passes only (as far as I found).There's a -regalloc option to pick from various register allocators. Linear scan is the most effective but also the most expensive. -pre-RA-sched is the option to select schedulers. By default LLVM schedules to reduce register pressure. You can run opt -help and opt -help-hidden to see what available. -Dave
David Greene
2008-May-21 21:47 UTC
[LLVMdev] Optimization passes organization and tradeoffs
On Wednesday 21 May 2008 16:09, Nicolas Capens wrote:> Hi David, > > Thanks for the info, but I'm not using any of the command line options. I'm > dynamically generating code at run-time. So unless I'm missing some way to > set these command line options at run-time I think I need another API for > controlling register allocation and scheduling.You'll just have to pick one and instantiate that Pass to run. I was just pointing out the command-line options so you could try them out.> I was also under the impression that linear scan register allocation is > quite cheap, faster than graph coloring. Or does it use an advanced > variation (with aggressive coalescing), while other register allocation > algorithms are more straightforward?Yes, Linear Scan is cheaper than graph coloring, but it is still more expensive than the local spiller or simple spiller. -Dave
Maybe Matching Threads
- [LLVMdev] Optimization passes organization and tradeoffs
- [LLVMdev] Optimization passes organization and tradeoffs
- [LLVMdev] Optimization passes organization and tradeoffs
- [LLVMdev] Optimization passes organization and tradeoffs
- [LLVMdev] Optimization passes organization and tradeoffs