Nicolas Capens
2008-May-20 12:03 UTC
[LLVMdev] Optimization passes organization and tradeoffs
Hi all,

I'm getting more impressed by LLVM day by day, but what's still a bit unclear to me is the order of optimization passes and their performance. I think I have a pretty solid understanding of what each pass does at a high level, but I couldn't find any documentation about how they interact at a lower level.

I'd like to use LLVM for generating high-performance stream processing code at run-time. Obviously the resulting code should be as optimized as possible, but performing the optimizations themselves should also be very fast. The code I'm compiling is comparable to C (without any exception handling or garbage collection, so none of the related passes are needed). My first attempt at collecting useful optimizations looks like this:

passManager->add(new TargetData(*executionEngine->getTargetData()));
passManager->add(createScalarReplAggregatesPass()); // Convert to SSA form
passManager->add(createSCCPPass());                 // Propagate constants
passManager->add(createInstructionCombiningPass()); // Peephole optimization
passManager->add(createDeadStoreEliminationPass()); // Dead store elimination
passManager->add(createAggressiveDCEPass());        // Aggressive dead code elimination
passManager->add(createCFGSimplificationPass());    // Control-flow optimization

I have several questions about this:

1) Does ScalarReplAggregates totally supersede PromoteMemoryToRegister? I think I need it to optimize small arrays, but what is the expected added complexity?
2) Does SCCP also eliminate multiplying/dividing by 1 and adding/subtracting 0?
3) Is it arbitrary where to place InstructionCombining? Is there a better order?
4) Is DeadStoreElimination still necessary when we have AggressiveDCE?
5) What are the tradeoffs between the different dead code elimination variants (why not always use the aggressive one)?
6) Is there a better place for CFGSimplification? Should I perform it at multiple points?

Also, my code will frequently have vectors that are initialized to either all 0.0 or all 1.0. This offers a lot of opportunity for eliminating many multiplications and additions, but I was wondering which passes I need for this (probably a Reassociate pass; what else?). And do these passes actually work with vectors?

Is there any other highly recommended pass for this kind of application that I'm forgetting? Any passes I'd better avoid due to poor gains and long optimization time?

Sorry for the many question marks. :-) I don't expect there is an absolute definite answer to each of them, but some guidelines and insights would be very welcome and much appreciated.

Thanks!

Nicolas Capens
David Greene
2008-May-20 15:57 UTC
[LLVMdev] Optimization passes organization and tradeoffs
On Tuesday 20 May 2008 07:03, Nicolas Capens wrote:

> 1) Does ScalarReplAggregates totally supersede PromoteMemoryToRegister?

Nope, they are different. Mem2Reg is really important if you want register allocation.

> I think I need it to optimize small arrays, but what is the expected added complexity?

I shouldn't think it would be very expensive at all.

> 2) Does SCCP also eliminate multiplying/dividing by 1 and adding/subtracting 0?

That's probably more the purview of instcombine.

> 3) Is it arbitrary where to place InstructionCombining? Is there a better order?

Typically you'll want to place it after various kinds of propagation are done (for example, SCCP). You can run it multiple times.

> 4) Is DeadStoreElimination still necessary when we have AggressiveDCE?

Probably, but I'll let others give the definitive answer.

> 5) What are the tradeoffs between the different dead code elimination variants (why not always use the aggressive one)?

Others can speak to this.

> 6) Is there a better place for CFGSimplification? Should I perform it at multiple points?

I think once is probably enough. Earlier would probably be better, as it will simplify later passes and potentially help them run faster.

> Also, my code will frequently have vectors that are initialized to either all 0.0 or all 1.0. This offers a lot of opportunity for eliminating many multiplications and additions, but I was wondering which passes I need for this (probably a Reassociate pass; what else?). And do these passes actually work with vectors?

I would assume they work with vectors. Anything expression-related is good for capturing these opportunities (reassociation, folding, instcombine, etc.).

> Is there any other highly recommended pass for this kind of application that I'm forgetting? Any passes I'd better avoid due to poor gains and long optimization time?

The most expensive optimizations are typically scheduling and register allocation. Regalloc is pretty important, but it depends a lot on the machine architecture. If you have a very fast cache, it becomes less important. Scheduling is hit-or-miss: if your architecture is strictly in-order, it's pretty important. You can always play games like having the scheduler bail out if a basic block is too large, or not scheduling blocks outside of loops. This can save significant compile time.

> Sorry for the many question marks. :-) I don't expect there is an absolute definite answer to each of them, but some guidelines and insights would be very welcome and much appreciated.

Phase ordering is one of the trickiest parts to tune. It's often highly code-dependent. If your target sources are limited in number, you might be able to get away with a funky ordering that would be disastrous on general-purpose integer codes, for example. It's often a matter of trial and error. Several people have done research on using genetic algorithms and other tricks to find an optimal phase ordering; a Google search should turn up interesting stuff.

Have a look at the opt tool sources to get an idea of where to start.

-Dave
Chris Lattner
2008-May-20 18:28 UTC
[LLVMdev] Optimization passes organization and tradeoffs
On May 20, 2008, at 8:57 AM, David Greene wrote:

> On Tuesday 20 May 2008 07:03, Nicolas Capens wrote:
>
>> 1) Does ScalarReplAggregates totally supersede PromoteMemoryToRegister?
>
> Nope, they are different. Mem2Reg is really important if you want register allocation.

Actually, SROA does fully subsume Mem2Reg. It iterates between breaking up aggregates and promoting them to registers.

>> I think I need it to optimize small arrays, but what is the expected added complexity?
>
> I shouldn't think it would be very expensive at all.

Agreed.

>> 2) Does SCCP also eliminate multiplying/dividing by 1 and adding/subtracting 0?
>
> That's probably more the purview of instcombine.

Right.

>> 4) Is DeadStoreElimination still necessary when we have AggressiveDCE?
>
> Probably, but I'll let others give the definitive answer.

They are orthogonal. DSE touches stores; AggressiveDCE handles mostly scalars. ADCE basically assumes all stores are live.

>> 5) What are the tradeoffs between the different dead code elimination variants (why not always use the aggressive one)?
>
> Others can speak to this.

ADCE is actually quite expensive from the compile-time perspective, because it requires building post-dominance information. I'd stick with a combination of "-dce -simplifycfg" instead of -adce.

>> 6) Is there a better place for CFGSimplification? Should I perform it at multiple points?
>
> I think once is probably enough. Earlier would probably be better, as it will simplify later passes and potentially help them run faster.

I'd suggest running it earlier as you suggest, but then also late to clean up. It's a very fast pass.

>> Also, my code will frequently have vectors that are initialized to either all 0.0 or all 1.0. This offers a lot of opportunity for eliminating many multiplications and additions, but I was wondering which passes I need for this (probably a Reassociate pass; what else?). And do these passes actually work with vectors?
>
> I would assume they work with vectors. Anything expression-related is good for capturing these opportunities (reassociation, folding, instcombine, etc.).

Note that a lot of optimizations don't apply to floating point values. E.g. "x+0.0" can't be eliminated (in general), because "-0.0+0.0" is 0.0, not -0.0.

All the optimizations should tolerate vectors, but may miss certain optimizations on them. If you notice something (e.g. reassociation) that isn't being done aggressively with vectors, please file a bug. We do a lot of vector optimizations though, so we are probably only missing details, not whole classes of important optimizations.

>> Is there any other highly recommended pass for this kind of application that I'm forgetting? Any passes I'd better avoid due to poor gains and long optimization time?

Another important optimization is GVN, which does common subexpression elimination (for example).

>> Sorry for the many question marks. :-) I don't expect there is an absolute definite answer to each of them, but some guidelines and insights would be very welcome and much appreciated.
>
> Phase ordering is one of the trickiest parts to tune. It's often highly code-dependent. If your target sources are limited in number, you might be able to get away with a funky ordering that would be disastrous on general-purpose integer codes, for example. It's often a matter of trial and error. Several people have done research on using genetic algorithms and other tricks to find an optimal phase ordering; a Google search should turn up interesting stuff.

"What he said" :). Also, if you see specific problems in the generated code, let us know.

-Chris
Nicolas Capens
2008-May-21 21:09 UTC
[LLVMdev] Optimization passes organization and tradeoffs
Hi David,

Thanks for the info, but I'm not using any of the command line options. I'm dynamically generating code at run-time. So unless I'm missing some way to set these command line options at run-time, I think I need another API for controlling register allocation and scheduling.

I was also under the impression that linear scan register allocation is quite cheap, faster than graph coloring. Or does it use an advanced variation (with aggressive coalescing), while other register allocation algorithms are more straightforward?

Cheers,

Nicolas

-----Original Message-----
From: David Greene [mailto:dag at cray.com]
Sent: Wednesday, 21 May, 2008 18:57
To: Nicolas Capens
Subject: Re: [LLVMdev] Optimization passes organization and tradeoffs

On Wednesday 21 May 2008 04:01, you wrote:

> Thanks for the clarifications David!
>
> My main target is x86. How do I control register allocation and scheduling?
> The docs talk about IR level optimization passes only (as far as I found).

There's a -regalloc option to pick from various register allocators. Linear scan is the most effective but also the most expensive. -pre-RA-sched is the option to select schedulers. By default LLVM schedules to reduce register pressure. You can run opt -help and opt -help-hidden to see what's available.

-Dave
David Greene
2008-May-21 21:47 UTC
[LLVMdev] Optimization passes organization and tradeoffs
On Wednesday 21 May 2008 16:09, Nicolas Capens wrote:

> Hi David,
>
> Thanks for the info, but I'm not using any of the command line options. I'm dynamically generating code at run-time. So unless I'm missing some way to set these command line options at run-time, I think I need another API for controlling register allocation and scheduling.

You'll just have to pick one and instantiate that Pass to run. I was just pointing out the command-line options so you could try them out.

> I was also under the impression that linear scan register allocation is quite cheap, faster than graph coloring. Or does it use an advanced variation (with aggressive coalescing), while other register allocation algorithms are more straightforward?

Yes, Linear Scan is cheaper than graph coloring, but it is still more expensive than the local spiller or simple spiller.

-Dave