Tobias Grosser
2013-May-02 09:38 UTC
[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead
On 04/30/2013 04:13 PM, Star Tan wrote:> Hi all,[...]> How could I find out where the time is spent on between two adjacent Polly passes? Can anyone give me some advice?Hi Star Tan, I propose to do the performance analysis using the 'opt' tool and optimizing LLVM-IR, instead of running it from within clang. For the 'opt' tool there are two commands that should help you: 1) -debug-pass=Structure or -debug-pass=Details This should give you the list of passes that is executed. You can compare the list to see at which point additional passes are scheduled. 2) -time-passes This gives you the time spent in the different passes. These two commands may remove the need for a Polly specific profiling infrastructure. Also, if you talk about performance issues you see, it would be great if you could attach the .ll file you use as well as the exact command line you profile. Thanks, Tobias
Star Tan
2013-May-03 09:39 UTC
[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead
Dear Tobias, Thank you very much for your very helpful advice. Yes, -debug-pass and -time-passes are two very useful and powerful options when evaluating the compile-time of each compiler pass. They are exactly what I need! With these options, I can step into details of the compile-time overhead of each pass. I have finished some preliminary testing based on two randomly selected files from PolyBench and MediaBench. Results are listed on https://gist.github.com/tanstar/5508153 . Thanks again for your timely advice and help. Best regards, Star Tan. At 2013-05-02 17:38:22,"Tobias Grosser" <tobias at grosser.es> wrote:>On 04/30/2013 04:13 PM, Star Tan wrote: >> Hi all, >[...] >> How could I find out where the time is spent on between two adjacent Polly passes? Can anyone give me some advice? > >Hi Star Tan, > >I propose to do the performance analysis using the 'opt' tool and >optimizing LLVM-IR, instead of running it from within clang. For the >'opt' tool there are two commands that should help you: > >1) -debug-pass=Structure or -debug-pass=Details > >This should give you the list of passes that is executed. You can >compare the list to see at which point additional passes are scheduled. > >2) -time-passes > >This gives you the time spent in the different passes. > >These two commands may remove the need for a Polly specific profiling >infrastructure. Also, if you talk about performance issues you see, it >would be great if you could attach the .ll file you use as well as the >exact command line you profile. > >Thanks, >Tobias > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130503/7804c63b/attachment.html>
Tobias Grosser
2013-May-03 11:05 UTC
[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead
On 05/03/2013 11:39 AM, Star Tan wrote:> Dear Tobias, > > > Thank you very much for your very helpful advice. > > > Yes, -debug-pass and -time-passes are two very useful and powerful > options when evaluating the compile-time of each compiler pass. They > are exactly what I need! With these options, I can step into details > of the compile-time overhead of each pass. I have finished some > preliminary testing based on two randomly selected files from > PolyBench and MediaBench. Results are listed on > https://gist.github.com/tanstar/5508153 .Great. Thanks for the quick update and the nice formatting of your proposal.> In this project, I try to reduce Polly compile-time overhead by > revising Polly passes. For this purpose, I firstly try to find out > where the compile-time overhead comes from. When Polly optimizes a > program, it takes the following steps: > > * step1 - Polly canonicalization: prepare basic information and > complete some transformations. > * step2 - LLVM IR to Polly description: detect valid scops and > translates them into polyhedral representation. > * step3 - Polly optimization: analyze and optimize polyhedral scops. > * step4 - Polly description to LLVM-IR: translates the polyhedral > description back into LLVM-IR.I may make sense to list in the attachments which passes exactly belong to these different steps.> Based on these results, I plan to revise all canonicalization passes, > code generation passes and optimization passes in Polly. Then, I will > try my best to reduce the compile-time overhead for each part of them. > > In the following 12 weeks, my schedule for this project includes the following stages: > * Stage1 (1 week): set up a Polly performance tester and integrate the automation testing environment. > * Stage2 (1 week): collect and analyze the detailed profiling information for Polly compile process; > * Stage3 (3 weeks): revise and improve some expensive Polly passes or LLVM passes that Polly depends on; > * Stage4 (3 weeks): revise canonicalization passes of Polly and try to remove most of canonicalization passes; > * Stage5 (1 week): revise the pass ordering problem and let Polly bail out if it cannot benefit programs; > * Stage6 (1 week): revise and improve Polly code generation passes; > * Stage7 (1 week): revise and improve Polly optimization passes; > * Stage8 (1 week): revise other parts of Polly.I believe that before we should probably move Polly rather early into the default -O3 optimization chain. You may take some time to profile Polly and to understand it better, but we should not do extensive optimizations before moving it. As long as we can ensure the number of scops detected remains the same, this should be good enough. We can then spend the time optimizing the Polly integration in the normal -O3 chain.> ### 3. Project Details > > #### Stage1 -- Set up a Polly performance tester to track the compile time. [Week 1] > > The first important work is to pick criteria to evaluate my work in > this project. This project targets to reduce compile-time overhead, > but at the same time it must not degrade the performance. To simplify > the testing, I will use the number of scops that optimized by Polly as > the performance criteria. As a result, our Polly performance tester > should contains two parts: > > * Code performance: the number of scops optimized by Polly should not > be degraded. > * Compile-time Overhead: both the total compile-time overhead and the > percentage of each Polly pass are both important. > > Furthermore, I currently use some scripts to collect the compile-time > statistics, but it still requires some tricky and manual work. My plan > is to set up an automation testing environment and integrate it into > Polly.Yes, this is nice. It would be great if we could get some hardware to run such tests regularly. I will check if we can find a sponsor for this?> **Week 3**: revise the "Polly - Calculate dependence" pass. Since this > pass leads to significant compile-time overhead in both PolyBench and > MediaBench, I will investigate the details of this pass and try to > reduce the compile-time overhead. > **Week 4**: revise the "Polly - Detect static control parts (SCoPs)" > and "Polly - Create polyhedral description of Scops" pass. As shown in > table 3 and table 4, these two passes are expensive. However, they are > also very important to Polly because all Polly optimizations are based > on scops detected in these two passes. I will step into the details of > the two passes and make sure no performance loss happens no matter how > much compile-time overhead is reduced. > **Week 5**: revise other top ten expensive Polly passes. For example, > "Combine redundant instructions" pass runs several times because of > Polly and it can causes a large compile-time overhead. I will try to > eliminate or reduce the run of this pass. > > In this stage, I will also step into some existing bugs related to > those expensive passes discussed above. For example, the bug > llvm.org/PR15643 is related to the "Polly - Calculate dependence" > pass. Although it has brought some discussions in LLVM/Polly mail > list, there is still no feasible solution. I will try my best to fix > up such bugs when I step into the details of those passes.I actually pointed out a solution. We need to change Polly or recanonicalize the SCEVs we analyse as the current way the SCEV are canonicalized makes Polly introduce different parameters for expressions {p + 1, +, 1}_<l> and {p + 2, +, 1}_<l> in case the loop 'l' is not part of the scop.> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with > 2GB DDR2 memory. Each benchmark is run multiple times and data are > collected using ministat (https://github.com/codahale/ministat). > Polly is built with Cloog in Release+Assert mode. All Polly, LLVM and > Cloog source code are checked out from SVN at April 24, 2013.Can you give the svn revision / git hash for the checkouts of LLVM, Polly, isl, cloog and polybench.> Table 1 and table 2 show the compile-time overhead of Polly. Five cases are tested: > (alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly) > * **clang**: clang -O3 > * **pBasic**: clang -O3 -load LLVMPolly.so > * **pNoGen**: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generatorr=none > * **pNoOpt**: pollycc -O3 -mllvm -polly-optimizer=none > * **pOpt**: pollycc -O3You probably want to add a couple of additional parameters here to ensure we detect all scops in polybench. I would add -mllvm -polly-ignore-aliasing and -DPOLYBENCH_USE_SCALAR_LB You can use '-mllvm -debug-only=polly-cloog' to see if the kernels are properly detected. I get the following output: alias polly-clang alias polly-clang='~/Projekte/polly/build/bin/clang -Xclang -load -Xclang ~/Projekte/polly/build/lib/LLVMPolly.so' grosser at tobilaptop:~/Projekte/polybench$ polly-clang -O3 -mllvm -polly -mllvm -debug-only=polly-cloog linear-algebra/kernels/gemm/gemm.c -I utilities/ utilities/polybench.c -mllvm -polly-ignore-aliasing -DPOLYBENCH_USE_SCALAR_LB :: init_array : entry.split => for.end56 if ((nj >= 1) && (nk >= 1) && (p_1 >= 1) && (p_4 >= 1)) { for (c2=0;c2<=p_4-1;c2+=32) { for (c3=max(-32*floord(p_1-12*p_4+10,32)-32*p_4,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) { for (c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_4-1);c4++) { for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) { Stmt_for_body41(c4,-20*c4-c5); } } } } } if ((ni >= 1) && (nk >= 1) && (p_0 >= 1) && (p_4 >= 1)) { for (c2=0;c2<=p_0-1;c2+=32) { for (c3=max(-32*floord(-12*p_0+p_4+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_4+10,32)-640);c3<=-20*c2;c3+=32) { for (c4=max(ceild(-c3-p_4-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) { for (c5=max(c3,-20*c4-p_4+1);c5<=min(-20*c4,c3+31);c5++) { Stmt_for_body18(c4,-20*c4-c5); } } } } } if ((ni >= 1) && (nj >= 1) && (p_0 >= 1) && (p_1 >= 1)) { for (c2=0;c2<=p_0-1;c2+=32) { for (c3=max(-32*floord(-12*p_0+p_1+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32) { for (c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++) { for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) { Stmt_for_body3(c4,-20*c4-c5); } } } } } Stmt_entry_split(); :: kernel_gemm : for.cond1.preheader => for.end28 for (c2=0;c2<=1023;c2+=32) { for (c3=0;c3<=1023;c3+=32) { for (c4=c2;c4<=c2+31;c4++) { for (c5=c3;c5<=c3+31;c5++) { Stmt_for_body3(c4,c5); } } } } for (c2=0;c2<=1023;c2+=32) { for (c3=0;c3<=1023;c3+=32) { for (c4=0;c4<=1023;c4+=32) { for (c5=c2;c5<=c2+31;c5++) { for (c6=c3;c6<=c3+31;c6++) { for (c7=c4;c7<=c4+31;c7++) { Stmt_for_body8(c5,c6,c7); } } } } } } :: polybench_flush_cache : for.inc => for.end for (c1=0;c1<=4194559;c1++) { Stmt_for_inc(c1); } As a first step, we should ensure that Polly is fast for these flags. As subsequent steps we should then ensure that the absence of these flags does not increase compile-time.> Table 3 and table 4 show the compile-time overhead of top 15 passes in > polly-opt. we first generate LLVM IR files by the command "clang -O3 > -emit-llvm XX.c -o XX.ll", then we run opt to collect the compile time > of each pass using the command "opt -basicaa -load LLVMPolly.so -O3 > -polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -time-passes"You want to extract the .ll file with 'clang -O0'. Otherwise you run polly on code that is already -O3 optimized, making the runs not comparable to the clang integrated ones and also unrealistic as they do not reflect what we do when running Polly from within clang. It would be interesting to understand the doitgen results better. The time in the optimizer is only 0.408 seconds, whereas the increase from pBasic to pOpt is 0.897 - 0.151 = 0.746 seconds. This seems surprising. Is this because of running polly on -O3 optimized code, is Polly producing bigger .ll files which yield to longer object file emmission time or is there a measurement problem? That's it for now. Thanks for the update! Cheers, Tobi
Possibly Parallel Threads
- [LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead
- [LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead
- [LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead
- [LLVMdev] [Polly] Analysis of the expensive compile-time overhead of Polly Dependence pass
- [LLVMdev] [Polly] Analysis of the expensive compile-time overhead of Polly Dependence pass