Hi,

polly is run very early and schedules the following passes before it runs:

/// @brief Schedule a set of canonicalization passes to prepare for Polly
///
/// The set of optimization passes was partially taken/copied from the
/// set of default optimization passes in LLVM. It is used to bring the code
/// into a canonical form that simplifies the analysis and optimization passes
/// of Polly. The set of optimization passes scheduled here is probably not yet
/// optimal. TODO: Optimize the set of canonicalization passes.
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
  PM.add(llvm::createInstructionCombiningPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());

  if (!SCEVCodegen)
    PM.add(polly::createIndVarSimplifyPass());

  PM.add(polly::createCodePreparationPass());
  PM.add(polly::createRegionSimplifyPass());
}

Sergei was saying that on some benchmarks PromoteMemoryToRegister was causing
performance regressions when it is scheduled that early, whether or not Polly
itself is run.

Another remark is that these passes apply to all the functions, transforming
them without considering whether they contain loops or whether Polly could
improve anything.

That brings the question: why do we run Polly that early? Could we move Polly
down after all these passes have been scheduled by LLVM's scalar optimizer?

Thanks,
Sebastian
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
On 04/17/2013 05:53 PM, Sebastian Pop wrote:
> polly is run very early and schedules the following passes before it runs:
>
> /// @brief Schedule a set of canonicalization passes to prepare for Polly
> [...]
>   PM.add(polly::createCodePreparationPass());
>   PM.add(polly::createRegionSimplifyPass());

Right.

> Sergei was saying that on some benchmarks PromoteMemoryToRegister was
> causing performance regressions when it is run with and without Polly and
> scheduled that early.

Are you saying these passes add compile time overhead or rather that they
cause problems with the performance of the compiled binary?

I assume when talking about regressions, you compare against a compilation
with Polly disabled.

> Another remark is that these passes apply to all the functions,
> transforming them without considering whether they contain loops or
> whether Polly could improve anything.

True.

> That brings the question: why do we run Polly that early? Could we move
> Polly down after all these passes have been scheduled by LLVM's scalar
> optimizer?

For Polly we have basically two constraints:

1) We want to detect scops in the IR on which we run Polly.

This means the IR needs to be canonicalized enough to allow scalar
evolution & Co to work.

2) The IR generated by Polly should be well optimized through LLVM.

This means we not only need to perform the optimizations that would have
been necessary for the input code, but we also want to take advantage of
optimization opportunities that show up after Polly regenerated code.

When I generated the pass ordering, I did not spend a large amount of time
minimizing it. I rather assumed that, to be sure the LLVM-IR is well
optimized after Polly, it would be good to just run all LLVM passes over
the output of Polly. So I just placed Polly at the very beginning. Now, to
enable Polly to detect reasonably sized scops, I scheduled a set of
canonicalization passes before Polly (taken from the beginning of the -O3
sequence).

In terms of scop coverage and quality of the generated code this seems to
be a good choice, but it obviously will increase the compile time compared
to a run without Polly. What we could aim for is to run Polly at the
beginning of the loop transformations, e.g. by adding an extension point
'EP_LoopOptimizerStart', meaning before vectorization, before loop
invariant code motion and before the loop idiom recognition. However, we
would then need to evaluate what cleanup passes we need to run after
Polly. For the classical code generation strategy we probably need a
couple of scalar cleanups; with the SCEV-based code generation, there is
normally a lot less to do.

If you can find a pass ordering that does not regress too much on the
performance and scop coverage of the current one, but that has Polly
integrated in the normal pass chain just before the loop passes, that
would be a great improvement.

Thanks,
Tobias
----- Original Message -----
> From: "Tobias Grosser" <tobias at grosser.es>
> To: "Sebastian Pop" <spop at codeaurora.org>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Wednesday, April 17, 2013 12:45:26 PM
> Subject: Re: [LLVMdev] [polly] pass ordering
>
> [...]
>
> If you can find a pass ordering that does not regress too much on the
> performance and scop coverage of the current one, but that has Polly
> integrated in the normal pass chain just before the loop passes, that
> would be a great improvement.

I thought that, when we discussed this in November, the goal was to have
Polly scheduled to run just prior to the loop vectorizer, etc. That way we
could split the analysis off and it could be (optionally) reused by the
vectorization passes without being invalidated by other transforms.

 -Hal

> Thanks,
> Tobias
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Tobias Grosser wrote:
> On 04/17/2013 05:53 PM, Sebastian Pop wrote:
> > polly is run very early and schedules the following passes before it runs:
> > [...]
>
> Are you saying these passes add compile time overhead or rather that
> they cause problems with the performance of the compiled binary?

Sergei was looking at the performance of the generated code (not compile
time), and yes, he looked at the impact of -O3 with the pre-passes of Polly
as scheduled now vs. plain -O3.

> This means the IR needs to be canonicalized enough to allow scalar
> evolution & Co to work.

Right. Sergei has also pointed out that PromoteMemoryToRegister is needed
that early because otherwise SCEV would not be able to recognize induction
variables allocated on the stack. If we schedule Polly in the LNO, this
constraint would be satisfied.

> 2) The IR generated by Polly should be well optimized through LLVM
>
> [...]
>
> For the classical code generation strategy we probably need a couple of
> scalar cleanups; with the SCEV-based code generation, there is normally
> a lot less to do.

Right: let's try to see whether with SCEV codegen we can have better
performance when scheduling Polly in the LNO.

Sebastian
--
Qualcomm Innovation Center, Inc.
is a member of Code Aurora Forum, hosted by The Linux Foundation