thr3ads.net - llvm dev - [LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead [May 2013]

If this information is useful, please help other people find it:
Share via:

Tobias Grosser

2013-May-02 09:38 UTC

[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

On 04/30/2013 04:13 PM, Star Tan wrote:> Hi all,
[...]> How could I find out where the time is spent on between two adjacent Polly
passes? Can anyone give me some advice?
Hi Star Tan,

I propose to do the performance analysis using the 'opt' tool and 
optimizing LLVM-IR, instead of running it from within clang. For the 
'opt' tool there are two commands that should help you:

1) -debug-pass=Structure or -debug-pass=Details

This should give you the list of passes that is executed. You can 
compare the list to see at which point additional passes are scheduled.

2) -time-passes

This gives you the time spent in the different passes.

These two commands may remove the need for a Polly specific profiling 
infrastructure. Also, if you talk about performance issues you see, it 
would be great if you could attach the .ll file you use as well as the 
exact command line you profile.

Thanks,
Tobias

Star Tan

2013-May-03 09:39 UTC

head link

[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Dear Tobias,

Thank you very much for your very helpful advice.  

Yes, -debug-pass and -time-passes are two very useful and powerful options when
evaluating the compile-time of each compiler pass. They are exactly what I need!
With these options, I can step into details of the compile-time overhead of each
pass. I have finished some preliminary testing based on two randomly selected
files from PolyBench and MediaBench. Results are listed on
https://gist.github.com/tanstar/5508153 .

Thanks again for your timely advice and help.

Best regards,
Star Tan.
At 2013-05-02 17:38:22,"Tobias Grosser" <tobias at grosser.es>
wrote:>On 04/30/2013 04:13 PM, Star Tan wrote:
>> Hi all,
>[...]
>> How could I find out where the time is spent on between two adjacent
Polly passes? Can anyone give me some advice?
>
>Hi Star Tan,
>
>I propose to do the performance analysis using the 'opt' tool and 
>optimizing LLVM-IR, instead of running it from within clang. For the 
>'opt' tool there are two commands that should help you:
>
>1) -debug-pass=Structure or -debug-pass=Details
>
>This should give you the list of passes that is executed. You can 
>compare the list to see at which point additional passes are scheduled.
>
>2) -time-passes
>
>This gives you the time spent in the different passes.
>
>These two commands may remove the need for a Polly specific profiling 
>infrastructure. Also, if you talk about performance issues you see, it 
>would be great if you could attach the .ll file you use as well as the 
>exact command line you profile.
>
>Thanks,
>Tobias
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130503/7804c63b/attachment.html>

Tobias Grosser

2013-May-03 11:05 UTC

head link

[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

On 05/03/2013 11:39 AM, Star Tan wrote:> Dear Tobias,
>
>
> Thank you very much for your very helpful advice.
>
>
> Yes, -debug-pass and -time-passes are two very useful and powerful
> options when evaluating the compile-time of each compiler pass. They
> are exactly what I need! With these options, I can step into details
> of the compile-time overhead of each pass. I have finished some
> preliminary testing based on two randomly selected files from
> PolyBench and MediaBench. Results are listed on
> https://gist.github.com/tanstar/5508153 .
Great. Thanks for the quick update and the nice formatting of your
proposal.
> In this project, I try to reduce Polly compile-time overhead by
> revising Polly passes.  For this purpose, I firstly try to find out
> where the compile-time overhead comes from. When Polly optimizes a
> program, it takes the following steps:
>
> * step1 - Polly canonicalization: prepare basic information and
> complete some transformations.
> * step2 - LLVM IR to Polly description: detect valid scops and
> translates them into polyhedral representation.
> * step3 - Polly optimization: analyze and optimize polyhedral scops.
> * step4 - Polly description to LLVM-IR: translates the polyhedral
> description back into LLVM-IR.
I may make sense to list in the attachments which passes exactly belong
to these different steps.
> Based on these results, I plan to revise all canonicalization passes,
> code generation passes and optimization passes in Polly. Then, I will
> try my best to reduce the compile-time overhead for each part of them.
>
> In the following 12 weeks, my schedule for this project includes the
following stages:
> * Stage1 (1 week): set up a Polly performance tester and integrate the
automation testing environment.
> * Stage2 (1 week): collect and analyze the detailed profiling information
for Polly compile process;
> * Stage3 (3 weeks): revise and improve some expensive Polly passes or LLVM
passes that Polly depends on;
> * Stage4 (3 weeks): revise canonicalization passes of Polly and try to
remove most of canonicalization passes;
> * Stage5 (1 week): revise the pass ordering problem and let Polly bail out
if it cannot benefit programs;
> * Stage6 (1 week): revise and improve Polly code generation passes;
> * Stage7 (1 week): revise and improve Polly optimization passes;
> * Stage8 (1 week): revise other parts of Polly.
I believe that before we should probably move Polly rather early into
the default -O3 optimization chain. You may take some time to profile
Polly and to understand it better, but we should not do extensive
optimizations before moving it. As long as we can ensure the number of
scops detected remains the same, this should be good enough. We can then
spend the time optimizing the Polly integration in the normal -O3 chain.
> ### 3. Project Details
>
> #### Stage1 -- Set up a Polly performance tester to track the compile time.
[Week 1]
>
> The first important work is to pick criteria to evaluate my work in
> this project. This project targets to reduce compile-time overhead,
> but at the same time it must not degrade the performance. To simplify
> the testing, I will use the number of scops that optimized by Polly as
> the performance criteria. As a result, our Polly performance tester
> should contains two parts:
>
> * Code performance: the number of scops optimized by Polly should not
> be degraded.
> * Compile-time Overhead: both the total compile-time overhead and the
> percentage of each Polly pass are both important.
>
> Furthermore, I currently use some scripts to collect the compile-time
> statistics, but it still requires some tricky and manual work. My plan
> is to set up an automation testing environment and integrate it into
> Polly.
Yes, this is nice. It would be great if we could get some hardware to
run such tests regularly. I will check if we can find a sponsor for
this?
> **Week 3**: revise the "Polly - Calculate dependence" pass. Since
this
> pass leads to significant compile-time overhead in both PolyBench and
> MediaBench, I will investigate the details of this pass and try to
> reduce the compile-time overhead.
> **Week 4**: revise the "Polly - Detect static control parts
(SCoPs)"
> and "Polly - Create polyhedral description of Scops" pass. As
shown in
> table 3 and table 4, these two passes are expensive. However, they are
> also very important to Polly because all Polly optimizations are based
> on scops detected in these two passes. I will step into the details of
> the two passes and make sure no performance loss happens no matter how
> much compile-time overhead is reduced.
> **Week 5**: revise other top ten expensive Polly passes. For example,
> "Combine redundant instructions" pass runs several times because
of
> Polly and it can causes a large compile-time overhead. I will try to
> eliminate or reduce the run of this pass.
>
> In this stage, I will also step into some existing bugs related to
> those expensive passes discussed above. For example, the bug
> llvm.org/PR15643 is related to the "Polly - Calculate dependence"
> pass. Although it has brought some discussions in LLVM/Polly mail
> list, there is still no feasible solution. I will try my best to fix
> up such bugs when I step into the details of those passes.
I actually pointed out a solution. We need to change Polly or
recanonicalize the SCEVs we analyse as the current way the SCEV are
canonicalized makes Polly introduce different parameters for expressions
{p + 1, +, 1}_<l> and {p + 2, +, 1}_<l> in case the loop 'l'
is not part
of the scop.
> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
> 2GB DDR2 memory. Each benchmark is run multiple times and data are
> collected using ministat (https://github.com/codahale/ministat).
> Polly is built with Cloog in Release+Assert mode. All Polly, LLVM and
> Cloog source code are checked out from SVN at April 24, 2013.
Can you give the svn revision / git hash for the checkouts of LLVM,
Polly, isl, cloog and polybench.
> Table 1 and table 2 show the compile-time overhead of Polly. Five cases are
tested:
> (alias pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly)
> * **clang**:  clang -O3
> * **pBasic**: clang -O3 -load LLVMPolly.so
> * **pNoGen**: pollycc -O3 -mllvm -polly-optimizer=none -mllvm
-polly-code-generatorr=none
> * **pNoOpt**: pollycc -O3 -mllvm -polly-optimizer=none
> * **pOpt**:   pollycc -O3
You probably want to add a couple of additional parameters here to
ensure we detect all scops in polybench. I would add -mllvm
-polly-ignore-aliasing and -DPOLYBENCH_USE_SCALAR_LB

You can use '-mllvm -debug-only=polly-cloog' to see if the kernels
are properly detected. I get the following output:

alias polly-clang
alias polly-clang='~/Projekte/polly/build/bin/clang -Xclang -load 
-Xclang ~/Projekte/polly/build/lib/LLVMPolly.so'
grosser at tobilaptop:~/Projekte/polybench$ polly-clang -O3 -mllvm -polly 
-mllvm -debug-only=polly-cloog linear-algebra/kernels/gemm/gemm.c -I 
utilities/ utilities/polybench.c  -mllvm -polly-ignore-aliasing 
-DPOLYBENCH_USE_SCALAR_LB
:: init_array : entry.split => for.end56
if ((nj >= 1) && (nk >= 1) && (p_1 >= 1) &&
(p_4 >= 1)) {
   for (c2=0;c2<=p_4-1;c2+=32) {
     for 
(c3=max(-32*floord(p_1-12*p_4+10,32)-32*p_4,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32)
{
       for 
(c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_4-1);c4++)
{
         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body41(c4,-20*c4-c5);
         }
       }
     }
   }
}
if ((ni >= 1) && (nk >= 1) && (p_0 >= 1) &&
(p_4 >= 1)) {
   for (c2=0;c2<=p_0-1;c2+=32) {
     for 
(c3=max(-32*floord(-12*p_0+p_4+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_4+10,32)-640);c3<=-20*c2;c3+=32)
{
       for 
(c4=max(ceild(-c3-p_4-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++)
{
         for (c5=max(c3,-20*c4-p_4+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body18(c4,-20*c4-c5);
         }
       }
     }
   }
}
if ((ni >= 1) && (nj >= 1) && (p_0 >= 1) &&
(p_1 >= 1)) {
   for (c2=0;c2<=p_0-1;c2+=32) {
     for 
(c3=max(-32*floord(-12*p_0+p_1+10,32)-32*p_0,-32*c2-32*floord(-12*c2+p_1+10,32)-640);c3<=-20*c2;c3+=32)
{
       for 
(c4=max(ceild(-c3-p_1-30,20),c2);c4<=min(min(floord(-c3,20),c2+31),p_0-1);c4++)
{
         for (c5=max(c3,-20*c4-p_1+1);c5<=min(-20*c4,c3+31);c5++) {
           Stmt_for_body3(c4,-20*c4-c5);
         }
       }
     }
   }
}
Stmt_entry_split();
:: kernel_gemm : for.cond1.preheader => for.end28
for (c2=0;c2<=1023;c2+=32) {
   for (c3=0;c3<=1023;c3+=32) {
     for (c4=c2;c4<=c2+31;c4++) {
       for (c5=c3;c5<=c3+31;c5++) {
         Stmt_for_body3(c4,c5);
       }
     }
   }
}
for (c2=0;c2<=1023;c2+=32) {
   for (c3=0;c3<=1023;c3+=32) {
     for (c4=0;c4<=1023;c4+=32) {
       for (c5=c2;c5<=c2+31;c5++) {
         for (c6=c3;c6<=c3+31;c6++) {
           for (c7=c4;c7<=c4+31;c7++) {
             Stmt_for_body8(c5,c6,c7);
           }
         }
       }
     }
   }
}
:: polybench_flush_cache : for.inc => for.end
for (c1=0;c1<=4194559;c1++) {
   Stmt_for_inc(c1);
}

As a first step, we should ensure that Polly is fast for these flags. As
subsequent steps we should then ensure that the absence of these flags
does not increase compile-time.
> Table 3 and table 4 show the compile-time overhead of top 15 passes in
> polly-opt. we first generate LLVM IR files by the command "clang -O3
> -emit-llvm XX.c -o XX.ll", then we run opt to collect the compile time
> of each pass using the command "opt -basicaa  -load LLVMPolly.so -O3
> -polly -polly-codegen-scev XX.ll -S -o XX.opt.ll -time-passes"
You want to extract the .ll file with 'clang -O0'. Otherwise you run
polly on code that is already -O3 optimized, making the runs not
comparable to the clang integrated ones and also unrealistic as they do
not reflect what we do when running Polly from within clang.

It would be interesting to understand the doitgen results better. The
time in the optimizer is only 0.408 seconds, whereas the increase from
pBasic to pOpt is 0.897 - 0.151 = 0.746 seconds. This seems surprising.
Is this because of running polly on -O3 optimized code, is Polly
producing bigger .ll files which yield to longer object file emmission
time or is there a measurement problem?

That's it for now. Thanks for the update!

Cheers,
Tobi

Possibly Parallel Threads

Search for more apparently analagous threads

llvm dev - May 2013 - [LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Possibly Parallel Threads