Hi Tobias, At 2013-09-19 22:59:25,"Tobias Grosser" <tobias at grosser.es> wrote:>On 09/19/2013 04:46 PM, Star Tan wrote: >> Hi Tobias, >> >> >> I am trying to move Polly later. >> >> >> LLVM provides some predefined ExtensionPointTy: >> EP_EarlyAsPossible, >> EP_ModuleOptimizerEarly, >> EP_LoopOptimizerEnd, >> EP_ScalarOptimizerLate, >> ... >> >> >> Currently Polly uses "EP_EarlyAsPossible" to run as early as possible. As what you suggested: >>> Instead of removing canonicalization passes, I believe we may want to >>> move Polly to a later place in the pass manager. Possibly at the >>> beginning of the loop optimizer right before PM.add(createLoopRotatePass()); >> I want to move it to the point immediate after someone Loop optimization pass, e.g. MPM.add(createLoopRotatePass()). However no predefined ExtensionPointTy is available for this purpose. Instead, the "EP_ModuleOptimizerEarly" would move Polly before all loop optimization passes. >> >> >> In my option, there are two solutions: one is to use "EP_ModuleOptimizerEarly" (only modify the tool/polly/lib/RegisterPasses.cpp) to move Polly before all loop optimization passes; the other is to add a new ExtensionPointTy, e.g. "EP_LoopRotateEnd" and move Polly exactly immediate after the "LoopRotate" pass (need to modify tool/polly/lib/RegisterPasses.cpp, include/llvm/Transforms/IPO/PassManagerBuilder.h and lib/Transforms/IPO/PassManagerBuilder.cpp). We can use the second way to investigate other points to start Polly. >> >> Is my understanding correct? Do you have any further suggestion? > >Yes, fully correct. I would play with solution two. Why would you add >Polly after the loop rotate pass?Forgot it. I just wanted to give an immature example to show how to move Polly to other points.>I would possibly add it at the beginning of the loop optimizer (right >before the loop rotation pass) possibly experimenting with a new >extension point called EP_LoopOptimizerBegin. >Thanks for your advice. I have tried to move Polly immediately before the loop rotation pass. The temporary patch file can be reached on http://llvm.org/bugs/attachment.cgi?id=11262. First of all, I evaluated Polly's canonicalization passes based on this patch file. Results are posted on http://188.40.87.11:8000, which contains two new runs: PollyNoGen_loop (run 56): move polly later but keep all polly canonicalization passes; PollyNoGen_loop_nocan(run57): move polly later and delete all polly canonicalization passes; Comparison is shown on http://188.40.87.11:8000/db_default/v4/nts/57?compare_to=56&baseline=56. It seems that even we move polly later, those canonicalization passes still have significant impact on the final execution-time performance. A very bad news is that Polly would produce wrong code if we move Polly right immediately before the loop rotate pass. Many benchmarks in LLVM testsuite would fail in execution. I constructed a relatively simple testcase for this bug as shown on http://llvm.org/bugs/show_bug.cgi?id=17323. The key problem is Polly would produce wrong basic blocks like this: polly.loop_header27: ; preds = %vector.body, %polly.loop_header27 br label %polly.loop_header27 For your information, the source code is posted on http://llvm.org/bugs/attachment.cgi?id=11263 and the LLVM IR code is posted on http://llvm.org/bugs/attachment.cgi?id=11264 I have not yet found out why Polly produces such wrong code. However, maybe I should also investigate other points where we can move Polly. Do you have any suggestion? Best, Star Tan -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130922/eda59b06/attachment.html>
Here is an update about moving Polly later. 1. Why does Polly generate incorrect code when we move Polly immediately after the loop rotating pass? It is mainly caused by a wrong polly merge block. When Polly detects a valid loop for Polyhedral transformations, it usually introduces a new basic block "polly.merge_new_and_old" after the original loop exit block. This new basic block contains a branch instruction that decides whether to exit or re-enter the loop like this: polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25 br label %polly.merge_new_and_old polly.merge_new_and_old: ; preds = %polly.loop_exit28, %for.body.i ... %cmp = icmp slt i32 %2, 0 ... br i1 %cmp, label %for.body, label %for.end12 for.end12: ; preds = %loop.exit ... ret i32 0 Unfortunately, the new basic block "polly.merge_new_and_old" is marked as unreachable after a serial of immediate Loop optimizations in the following pass order: Polly - Create LLVM-IR from SCoPs Loop Pass Manager Canonicalize natural loops Loop-Closed SSA Form Pass Rotate Loops Loop Invariant Code Motion Loop-Closed SSA Form Pass Unswitch loops Combine redundant instructions After those loop optimition passes, the "Combine redundant instructions" would treat the basic block "merge_new_and_old" unreachable and transfer the block "polly.loop_exit28" into an infinity loop: polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25 br label polly.loop_exit28 Actually, I have no idea why this happens, but experiments show that this problem can be addressed by adding a pass "MPM.add(createCFGSimplificationPass())" after Polly. It seems it is necessary to run this pass after Polly code generation. 2. Where should we move Polly to? There are many choices to move Polly later. Tobias suggests to move it immediately after the loop rotate pass (the first loop optimization pass), but unfortunately Polly would generate incorrect code in that case (http://llvm.org/bugs/show_bug.cgi?id=17323). By investigating those Polly canonicalization passes (polly/lib/RegisterPasses.cpp): PM.add(llvm::createPromoteMemoryToRegisterPass()); PM.add(llvm::createInstructionCombiningPass()); PM.add(llvm::createCFGSimplificationPass()); PM.add(llvm::createTailCallEliminationPass()); PM.add(llvm::createCFGSimplificationPass()); PM.add(llvm::createReassociatePass()); PM.add(llvm::createLoopRotatePass()); PM.add(llvm::createInstructionCombiningPass()); We can see that some of them are also called in common cases (lib/Transforms/IPO/PassManagerBuilder.cpp:181): MPM.add(createEarlyCSEPass()); MPM.add(createJumpThreadingPass()); MPM.add(createCorrelatedValuePropagationPass()); MPM.add(createCFGSimplificationPass()); // also called in Polly MPM.add(createInstructionCombiningPass()); // also called in Polly MPM.add(createTailCallEliminationPass()); // also called in Polly MPM.add(createCFGSimplificationPass()); // also called in Polly MPM.add(createReassociatePass()); // also called in Polly MPM.add(createLoopRotatePass()); // also called in Polly MPM.add(createLICMPass()); MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3)); The initial idea is to move Polly immediately after LoopRotatePass, which will maximum the reuse of canonicalization passes and we need only keep one canonicalization pass "createPromoteMemoryToRegisterPass" in Polly. Unfortunately, it would lead to errors as shown in previous analysis (see bug http://llvm.org/bugs/show_bug.cgi?id=17323). We need to run "createCFGSimplificationPass" to ensure the correct code. The second possible solution is to move Polly immediately before "createCFGSimplificationPass", and then remove all Polly canonicalization passes except the "createTailCallEliminationPass" and "createCFGSimplificationPass"! Unfortunately, it would still trigger an assert error when compiling some benchmarks in LLVM test-suite (see http://llvm.org/bugs/show_bug.cgi?id=17351). Similarly I also investigated other points to move Polly, such as before "MPM.add(createTailCallEliminationPass())" or before "MPM.add(createCorrelatedValuePropagationPass())", but all these cases would trigger some unexpected compile error or execution error. I think we need more efforts to investigate why Polly cannot be placed in these points. At last, I also investigated the basic solution that move Polly immediately before "CallGraph SCC passes" which is a very early stage in module level optimizations. However, there are still some pre-executed basic passes (DeadArgElimination/InstructionCombining/CFGSimplification), which may reduce Polly canonicalization passes . Preliminary testing shows no extra error happens in LLVM test-suite and detailed evaluation is running. Hope we can get detailed results after several hours's running. Best, Star Tan At 2013-09-22 13:02:38,"Star Tan" <tanmx_star at yeah.net> wrote: Hi Tobias,At 2013-09-19 22:59:25,"Tobias Grosser" <tobias at grosser.es> wrote:>On 09/19/2013 04:46 PM, Star Tan wrote:>> Hi Tobias, >> >> >> I am trying to move Polly later. >> >> >> LLVM provides some predefined ExtensionPointTy: >> EP_EarlyAsPossible, >> EP_ModuleOptimizerEarly, >> EP_LoopOptimizerEnd, >> EP_ScalarOptimizerLate, >> ... >> >> >> Currently Polly uses "EP_EarlyAsPossible" to run as early as possible. As what you suggested: >>> Instead of removing canonicalization passes, I believe we may want to >>> move Polly to a later place in the pass manager. Possibly at the >>> beginning of the loop optimizer right before PM.add(createLoopRotatePass()); >> I want to move it to the point immediate after someone Loop optimization pass, e.g. MPM.add(createLoopRotatePass()). However no predefined ExtensionPointTy is available for this purpose. Instead, the "EP_ModuleOptimizerEarly" would move Polly before all loop optimization passes. >> >> >> In my option, there are two solutions: one is to use "EP_ModuleOptimizerEarly" (only modify the tool/polly/lib/RegisterPasses.cpp) to move Polly before all loop optimization passes; the other is to add a new ExtensionPointTy, e.g. "EP_LoopRotateEnd" and move Polly exactly immediate after the "LoopRotate" pass (need to modify tool/polly/lib/RegisterPasses.cpp, include/llvm/Transforms/IPO/PassManagerBuilder.h and lib/Transforms/IPO/PassManagerBuilder.cpp). We can use the second way to investigate other points to start Polly. >> >> Is my understanding correct? Do you have any further suggestion? > >Yes, fully correct. I would play with solution two. Why would you add >Polly after the loop rotate pass? > >I would possibly add it at the beginning of the loop optimizer (right >before the loop rotation pass) possibly experimenting with a new >extension point called EP_LoopOptimizerBegin. > >Cheers, >Tobias >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130925/ed6c476d/attachment.html>
On 09/25/2013 04:55 AM, Star Tan wrote:> Here is an update about moving Polly later.Hi star tan, thanks for your report.> > 1. Why does Polly generate incorrect code when we move Polly immediately after the loop rotating pass? > > It is mainly caused by a wrong polly merge block. When Polly detects a valid loop for Polyhedral transformations, it usually introduces a new basic block "polly.merge_new_and_old" after the original loop exit block. This new basic block contains a branch instruction that decides whether to exit or re-enter the loop like this: > > polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25 > br label %polly.merge_new_and_old > polly.merge_new_and_old: ; preds = %polly.loop_exit28, %for.body.i > ... > %cmp = icmp slt i32 %2, 0 > ... > br i1 %cmp, label %for.body, label %for.end12 > for.end12: ; preds = %loop.exit > ... > ret i32 0 > > Unfortunately, the new basic block "polly.merge_new_and_old" is marked as unreachable after a serial of immediate Loop optimizations in the following pass order: > Polly - Create LLVM-IR from SCoPs > Loop Pass Manager > Canonicalize natural loops > Loop-Closed SSA Form Pass > Rotate Loops > Loop Invariant Code Motion > Loop-Closed SSA Form Pass > Unswitch loops > Combine redundant instructions > > After those loop optimition passes, the "Combine redundant instructions" would treat the basic block "merge_new_and_old" unreachable and transfer the block "polly.loop_exit28" into an infinity loop: > polly.loop_exit28: ; preds = %polly.stmt.for.body.i, %polly.loop_if25 > br label polly.loop_exit28 > > Actually, I have no idea why this happens, but experiments show that this problem can be addressed by adding a pass "MPM.add(createCFGSimplificationPass())" after Polly. It seems it is necessary to run this pass after Polly code generation.Any pass order should yield correct code. If this is not the case there is either a bug in Polly or we expose a bug in LLVM. We should track this down. Our path ordering decisions should not be driven by us trying to avoid triggering this bug. As you already created a bug report, we should work on fixing it. The next step would be to reduce the set of passes involved. You can do this by calling -debug-pass=Arguments> 2. Where should we move Polly to? > > There are many choices to move Polly later. Tobias suggests to move it immediately after the loop rotate pass (the first loop optimization pass), but unfortunately Polly would generate incorrect code in that case (http://llvm.org/bugs/show_bug.cgi?id=17323). > > By investigating those Polly canonicalization passes (polly/lib/RegisterPasses.cpp): > PM.add(llvm::createPromoteMemoryToRegisterPass()); > PM.add(llvm::createInstructionCombiningPass()); > PM.add(llvm::createCFGSimplificationPass()); > PM.add(llvm::createTailCallEliminationPass()); > PM.add(llvm::createCFGSimplificationPass()); > PM.add(llvm::createReassociatePass()); > PM.add(llvm::createLoopRotatePass()); > PM.add(llvm::createInstructionCombiningPass()); > > We can see that some of them are also called in common cases (lib/Transforms/IPO/PassManagerBuilder.cpp:181): > > MPM.add(createEarlyCSEPass()); > MPM.add(createJumpThreadingPass()); > MPM.add(createCorrelatedValuePropagationPass()); > MPM.add(createCFGSimplificationPass()); // also called in Polly > MPM.add(createInstructionCombiningPass()); // also called in Polly > MPM.add(createTailCallEliminationPass()); // also called in Polly > MPM.add(createCFGSimplificationPass()); // also called in Polly > MPM.add(createReassociatePass()); // also called in Polly > MPM.add(createLoopRotatePass()); // also called in Polly > MPM.add(createLICMPass()); > MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3)); > > The initial idea is to move Polly immediately after LoopRotatePass, which will maximum the reuse of canonicalization passes and we need only keep one canonicalization pass "createPromoteMemoryToRegisterPass" in Polly. Unfortunately, it would lead to errors as shown in previous analysis (see bug http://llvm.org/bugs/show_bug.cgi?id=17323). We need to run "createCFGSimplificationPass" to ensure the correct code.In fact, I propose to move it _before_ the LoopRotate Pass. I think that is also what you tried, no?> The second possible solution is to move Polly immediately before "createCFGSimplificationPass", and then remove all Polly canonicalization passes except the "createTailCallEliminationPass" and "createCFGSimplificationPass"! Unfortunately, it would still trigger an assert error when compiling some benchmarks in LLVM test-suite (see http://llvm.org/bugs/show_bug.cgi?id=17351).I think this is too early, as most of the canonicalization is not yet done. We probably don't need to investigate this bug immediately, but it would be nice if we could make it reproducible without your changes to polly. For this please run the command with -debug-pass=Arguments and replace the -O3 with the actual set of commands that is needed to trigger the bug. If you can reduce the set of commands using bugpoint, that would be even better.> Similarly I also investigated other points to move Polly, such as before "MPM.add(createTailCallEliminationPass())" or before "MPM.add(createCorrelatedValuePropagationPass())", but all these cases would trigger some unexpected compile error or execution error. I think we need more efforts to investigate why Polly cannot be placed in these points.This is unfortunate. I wonder if they all have the same root cause. In any case, I would focus on moving Polly _before_ the loop rotate pass and would start with fixing the blocking bug. Tobias
Star Tan wrote:> I have not yet found out why Polly produces such wrong code.IMO this is because Polly does not recognize a given number of patterns that occur after the scalar opts have transformed the code. There are a given number of assumptions that we took for granted and these do not hold down the pipeline.> However, maybe I should also investigate other points where we can move > Polly. Do you have any suggestion?Yep, try to move Polly just after function inlining, that's better than the current place, and will extend the loop coverage in many cases. Sebastian -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
At 2013-09-27 04:22:13,"Sebastian Pop" <spop at codeaurora.org> wrote:>Star Tan wrote:>> I have not yet found out why Polly produces such wrong code. > >IMO this is because Polly does not recognize a given number of patterns that >occur after the scalar opts have transformed the code. There are a given number >of assumptions that we took for granted and these do not hold down the pipeline. > >> However, maybe I should also investigate other points where we can move >> Polly. Do you have any suggestion? > >Yep, try to move Polly just after function inlining, that's better than the >current place, and will extend the loop coverage in many cases. >Do you mean we should move it after "MPM.add(Inliner);"? Inliner pass is an early pass, so moving Polly after Inliner may not reduce the Polly's canonicalization passes. On the contrary, it may make the Polly more complex. However, as what you said, moving Polly after the inlinerg pass will extend the loop coverage and thus improve the performance of Polly! I'll try it. Thanks for your suggestion, Star Tan -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130927/d4e63640/attachment.html>