thr3ads.net - llvm dev - [LLVMdev] Where in the pass pipeline run the loop optimizer? [Dec 2014]

If this information is useful, please help other people find it:
Share via:

Tobias Grosser

2014-Dec-03 08:32 UTC

[LLVMdev] Where in the pass pipeline run the loop optimizer?

On 30.11.2014 15:05, Hongbin Zheng wrote:> What should us consider when we try to determinate the point to run Polly?
I think we could benefit from some community feedback here.

Some context:

Since a couple of weeks we focus on tuning Polly to minimize
compile time overhead. One important case for us is to minimize the 
compile time impact of Polly in case no loop optimizations are 
performed. One item that we would like to reconsider here is the 
position in the LLVM pass order where we execute Polly.

Currently, when '-O3 -polly' is run, we first schedule nine 
canonicalization passes (taken from the -O3 pass pipeline) to prepare 
the code for Polly, then run Polly itself and after this we run the full 
unmodified -O3 sequence.

The reason for this was that we wanted full control on how we 
canonicalize the code for Polly and we also wanted to run the _entire_
LLVM optimization chain to be sure the code is well optimized after. The 
reason for this choice was mostly to be able to experiment
with loop optimizations without being constrained by compile time or
pass order concerns.

This pass order is obviously not optimal due to the inherent compile 
time increase it implies and also because we do not currently run the 
inliner in our canonicalizer.

Hence, we would like to place the Polly optimizer inside the standard 
-O3 optimization sequence. The question is now what would be the optimal 
location.

Here some considerations:

1) After the IR has been canonicalized

This should remove the need for most of our -polly-canonicalize passes
and eliminate the corresponding compile time cost.

2) After inlining has been performed.

Inlining is essential to optimize C++ heavy code such as boost::ublas.

3) Before the Loop/SLP vectorizer

We want to ensure that Polly is run at a position, where the vectorizer
can continue to optimize the code generated by Polly (and to benefit 
from the parallelism, alias information and loop optimizations we 
performed).

4) Minimize interactions with the inliner

At best Polly is run after all inlining has happened, such that Polly 
always sees the largest possible loop nest. If we optimize too early
we may optimize code that later is inlined in a larger loop nest where
the earlier optimization hinders the optimization of the larger loop nest.

Some open questions:

5) Run it before / after LICM?

LICM may make more loop bounds analyzable by making the parameters in 
the loop bounds loop invariant.

On the other side, running it may introduce unnecessary scalar data 
dependences, which need to be eliminated again (Polly does not do this
yet).

6) Run it before loop rotate / loop unswitch?

Both of these passes introduce copies of code which Polly does not 
really benefit from. (Polly may e.g prove that a loop is executed at
least once, such that the copy introduced by loop-rotate is not 
necessary any more). Hence running them before Polly may unnecessarily 
complicating the loop structure. On the other side, those passes may be 
helpful for LICM.

I currently have been considering two locations:

Option 1: Before the loop optimization passes

============================================
   MPM.add(createReassociatePass()); // Reassociate expressions

<< Add Polly before the loop optimization passes >>

// Or better call it EP_LoopCanonicalizerStart?
+ addExtensionsToPM(EP_LoopOptimizerStart, MPM);


   MPM.add(createLoopRotatePass()); // Rotate Loop
   MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3));
   MPM.add(createInstructionCombiningPass());
   MPM.add(createIndVarSimplifyPass());        // Canonicalize indvars
   MPM.add(createLoopIdiomPass());             // Recognize idioms like
                                                  memset.
   MPM.add(createLoopDeletionPass());          // Delete dead loops


Option 2: Right before the loop vectorizer
====================================
   MPM.add(createBarrierNoopPass());

<< Add Polly at the beginning of the target specific optimizations
>>
+ addExtensionsToPM(EP_TargetOptsStart, MPM);

   // Re-rotate loops in all our loop nests. These may have fallout out
   // of rotated form due to GVN or other transformations, and the
   // vectorizer relies on the rotated form.
   if (ExtraVectorizerPasses)
     MPM.add(createLoopRotatePass());

   MPM.add(createLoopVectorizePass(DisableUnrollLoops,
         LoopVectorize));

Originally I was thinking of 'Option 1'. It is after the scalar
optimization passes, but before LICM and LoopRotate. However, this 
location seems still to be still somehow a canonicalization phase. 
'Option 2' places Polly after all canonicalization (is the inliner 
somehow finished at this point?) and right at the beginning of the 
target specific optimizations. It seems somehow preferable to me at the 
moment.

Does anybody have opinions on this?

Cheers,
Tobias

Johannes Doerfert

2014-Dec-03 08:43 UTC

head link

[LLVMdev] Where in the pass pipeline run the loop optimizer?

Maybe a stupid question, but what does lnt say when you run it on both
positions (if possible)?

On 12/03, Tobias Grosser wrote:> On 30.11.2014 15:05, Hongbin Zheng wrote:
> >What should us consider when we try to determinate the point to run
Polly?
> 
> I think we could benefit from some community feedback here.
> 
> Some context:
> 
> Since a couple of weeks we focus on tuning Polly to minimize
> compile time overhead. One important case for us is to minimize the compile
> time impact of Polly in case no loop optimizations are performed. One item
> that we would like to reconsider here is the position in the LLVM pass
order
> where we execute Polly.
> 
> Currently, when '-O3 -polly' is run, we first schedule nine
canonicalization
> passes (taken from the -O3 pass pipeline) to prepare the code for Polly,
> then run Polly itself and after this we run the full unmodified -O3
> sequence.
> 
> The reason for this was that we wanted full control on how we canonicalize
> the code for Polly and we also wanted to run the _entire_
> LLVM optimization chain to be sure the code is well optimized after. The
> reason for this choice was mostly to be able to experiment
> with loop optimizations without being constrained by compile time or
> pass order concerns.
> 
> This pass order is obviously not optimal due to the inherent compile time
> increase it implies and also because we do not currently run the inliner in
> our canonicalizer.
> 
> Hence, we would like to place the Polly optimizer inside the standard -O3
> optimization sequence. The question is now what would be the optimal
> location.
> 
> Here some considerations:
> 
> 1) After the IR has been canonicalized
> 
> This should remove the need for most of our -polly-canonicalize passes
> and eliminate the corresponding compile time cost.
> 
> 2) After inlining has been performed.
> 
> Inlining is essential to optimize C++ heavy code such as boost::ublas.
> 
> 3) Before the Loop/SLP vectorizer
> 
> We want to ensure that Polly is run at a position, where the vectorizer
> can continue to optimize the code generated by Polly (and to benefit from
> the parallelism, alias information and loop optimizations we performed).
> 
> 4) Minimize interactions with the inliner
> 
> At best Polly is run after all inlining has happened, such that Polly
always
> sees the largest possible loop nest. If we optimize too early
> we may optimize code that later is inlined in a larger loop nest where
> the earlier optimization hinders the optimization of the larger loop nest.
> 
> Some open questions:
> 
> 5) Run it before / after LICM?
> 
> LICM may make more loop bounds analyzable by making the parameters in the
> loop bounds loop invariant.
> 
> On the other side, running it may introduce unnecessary scalar data
> dependences, which need to be eliminated again (Polly does not do this
> yet).
> 
> 6) Run it before loop rotate / loop unswitch?
> 
> Both of these passes introduce copies of code which Polly does not really
> benefit from. (Polly may e.g prove that a loop is executed at
> least once, such that the copy introduced by loop-rotate is not necessary
> any more). Hence running them before Polly may unnecessarily complicating
> the loop structure. On the other side, those passes may be helpful for
LICM.
> 
> I currently have been considering two locations:
> 
> Option 1: Before the loop optimization passes
> 
> ============================================> 
>   MPM.add(createReassociatePass()); // Reassociate expressions
> 
> << Add Polly before the loop optimization passes >>
> 
> // Or better call it EP_LoopCanonicalizerStart?
> + addExtensionsToPM(EP_LoopOptimizerStart, MPM);
> 
> 
>   MPM.add(createLoopRotatePass()); // Rotate Loop
>   MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3));
>   MPM.add(createInstructionCombiningPass());
>   MPM.add(createIndVarSimplifyPass());        // Canonicalize indvars
>   MPM.add(createLoopIdiomPass());             // Recognize idioms like
>                                                  memset.
>   MPM.add(createLoopDeletionPass());          // Delete dead loops
> 
> 
> Option 2: Right before the loop vectorizer
> ====================================> 
>   MPM.add(createBarrierNoopPass());
> 
> << Add Polly at the beginning of the target specific optimizations
>>
> + addExtensionsToPM(EP_TargetOptsStart, MPM);
> 
>   // Re-rotate loops in all our loop nests. These may have fallout out
>   // of rotated form due to GVN or other transformations, and the
>   // vectorizer relies on the rotated form.
>   if (ExtraVectorizerPasses)
>     MPM.add(createLoopRotatePass());
> 
>   MPM.add(createLoopVectorizePass(DisableUnrollLoops,
>         LoopVectorize));
> 
> Originally I was thinking of 'Option 1'. It is after the scalar
> optimization passes, but before LICM and LoopRotate. However, this location
> seems still to be still somehow a canonicalization phase. 'Option
2' places
> Polly after all canonicalization (is the inliner somehow finished at this
> point?) and right at the beginning of the target specific optimizations. It
> seems somehow preferable to me at the moment.
> 
> Does anybody have opinions on this?
> 
> Cheers,
> Tobias
> 
> 
> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-- 

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.26

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 213 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141203/8c726950/attachment.sig>

llvm dev - Dec 2014 - [LLVMdev] Where in the pass pipeline run the loop optimizer?

[LLVMdev] Where in the pass pipeline run the loop optimizer?

[LLVMdev] Where in the pass pipeline run the loop optimizer?