thr3ads.net - llvm dev - [LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead [May 2013]


tanmx_star

2013-Apr-26 03:08 UTC

[LLVMdev] [Polly] GSoC Proposal: Reducing LLVM-Polly Compiling overhead

Hi all,

I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly Compiling
overhead" (https://gist.github.com/tanstar/5441808).  I think the pass
ordering problem you discussed earlier can also be investigated in this project!

Is there any comment or advice about my proposal?  I appreciate all your help
and advice.

Thanks,
Star Tan
Proposal: https://gist.github.com/tanstar/5441808
>Hi all,
>
>This is Star Tan, who proposed a project to reduce the Polly compiling
>overhead several days ago.  Since then, I have kept moving forward on this
>project. By now, I am much more familiar with Polly and LLVM. Thanks to the
>help from the Polly and LLVM groups, I have also contributed some LLVM-Polly
>patches for this project, such as r179673, r179586, r179159, r179158, r179157
>and r178530. I am confident that I am in a good position to propose a GSoC
>project.
>
>I have written a GSoC proposal draft at
>https://gist.github.com/tanstar/5441808. I am pleased to answer any questions
>about this project. Any advice or comment would be appreciated.
>
>Thank you very much!
>
>Best regards,
>Star Tan.

Star Tan

2013-Apr-30 14:13 UTC


Hi all,

When I was profiling Polly, I found some results that I could not explain. I
need your help.

To profile Polly in detail, I added timers that measure the compile-time
overhead of each part of Polly. Attached is the patch file for this purpose.
A timer is inserted into each runOnRegion/runOnScop/runOnFunction/runOnLoop
function in Polly. Since these functions usually account for most of the
compile time, I expected the timers to capture most of Polly's compile-time
overhead.  Unfortunately, this is not true. My experimental results show that
the time captured by those timers accounts for less than half of the total
Polly compile time.
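
For readers who have not seen the attached patch, the idea of wrapping each
run* entry point in a timer can be sketched as below. This is a minimal,
standalone illustration using only the C++ standard library; it is not the
code from PollyProfile.patch and it does not use LLVM's own timing
facilities. The names RegionTimes and ScopedTimer are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not part of LLVM or Polly): accumulates wall-clock
// seconds per named region, e.g. one entry per runOn* function.
class RegionTimes {
public:
  void add(const std::string &name, double seconds) {
    assert(seconds >= 0.0 && "elapsed time cannot be negative");
    totals_[name] += seconds;
  }

  // Seconds recorded for one region; 0.0 if the region was never timed.
  double total(const std::string &name) const {
    auto it = totals_.find(name);
    return it == totals_.end() ? 0.0 : it->second;
  }

  // Seconds recorded over all regions. Comparing this with the total
  // compile time shows how much time the timers did not capture.
  double sum() const {
    double s = 0.0;
    for (const auto &entry : totals_)
      s += entry.second;
    return s;
  }

private:
  std::map<std::string, double> totals_;
};

// Hypothetical RAII guard: starts timing on construction and adds the
// elapsed time to the sink when it goes out of scope.
class ScopedTimer {
  using Clock = std::chrono::steady_clock;

public:
  ScopedTimer(RegionTimes &sink, std::string name)
      : sink_(sink), name_(std::move(name)), start_(Clock::now()) {}

  ~ScopedTimer() {
    std::chrono::duration<double> elapsed = Clock::now() - start_;
    sink_.add(name_, elapsed.count());
  }

  ScopedTimer(const ScopedTimer &) = delete;
  ScopedTimer &operator=(const ScopedTimer &) = delete;

private:
  RegionTimes &sink_;
  std::string name_;
  Clock::time_point start_;
};
```

Under these assumptions, a pass would create a ScopedTimer as the first
statement of its run* function. Any time spent outside those functions (for
example in analyses that the pass manager schedules between two passes) is
not recorded, which is consistent with the gap described above.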

For example, when compiling doitgen.c from PolyBench with Polly, the total
Polly compile-time overhead is about 0.7 seconds, but the overhead captured by
our timers is only about 0.2 seconds.  A lot of compile time is consumed by
LLVM code outside Polly.  For example, RegisterPasses.cpp shows that
PM.add(polly::createIslScheduleOptimizerPass()) is immediately followed by
PM.add(polly::createCodeGenerationPass()), but our profiling shows that 0.4
seconds elapse between the two passes (ScheduleOptimizer and CodeGeneration).  I
have checked that the CodeGeneration pass only depends on a few LLVM passes, as
follows:
    AU.addRequired<CloogInfo>();
    AU.addRequired<Dependences>();
    AU.addRequired<DominatorTree>();
    AU.addRequired<RegionInfo>();
    AU.addRequired<ScalarEvolution>();
    AU.addRequired<ScopDetection>();
    AU.addRequired<ScopInfo>();
    AU.addRequired<DataLayout>();

How can I find out where the time is spent between two adjacent Polly
passes? Can anyone give me some advice?

Thank you!

Best Regards
Star Tan.

PS: I have updated my GSoC proposal. You can access it at
https://gist.github.com/tanstar/5441808 or at
https://gist.github.com/tanstar/5441808/raw/c041e001300e3502403eb4071e9556a6eb2b7fd5/%5BGSoc2013%5D%5BLLVM-Polly%5D+Reducing+Polly+Compiling+Overhead
I will submit the proposal in the next few days. Any comments or advice would
be appreciated. Thank you!

-------------- next part --------------
A non-text attachment was scrubbed...
Name: PollyProfile.patch
Type: application/octet-stream
Size: 19944 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130430/b76d2123/attachment.obj>

Tobias Grosser

2013-May-02 09:38 UTC


On 04/30/2013 04:13 PM, Star Tan wrote:
> Hi all,
> [...]
> How could I find out where the time is spent on between two adjacent Polly
> passes? Can anyone give me some advice?

Hi Star Tan,

I propose to do the performance analysis using the 'opt' tool on LLVM-IR,
instead of running Polly from within clang. For the 'opt' tool there are two
options that should help you:

1) -debug-pass=Structure or -debug-pass=Details

This should give you the list of passes that are executed. You can
compare the lists to see at which point additional passes are scheduled.

2) -time-passes

This gives you the time spent in the different passes.

These two options may remove the need for a Polly-specific profiling
infrastructure. Also, if you talk about performance issues you see, it
would be great if you could attach the .ll file you use as well as the
exact command line you profile.

Thanks,
Tobias

Tobias Grosser

2013-May-02 11:12 UTC


On 04/26/2013 05:08 AM, tanmx_star wrote:
> Hi all,

Hi,

thanks for the update and sorry for the delay in reviewing. I just had a 
look at your proposal.

> I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly
> Compiling overhead" (https://gist.github.com/tanstar/5441808).  I think the
> pass ordering problem you discussed earlier can also be investigated in this
> project!

Yes, figuring out the optimal pass ordering sequence is very good.

> Is there any comment or advice about my proposal?  I appreciate all your
> help and advice.

> 1. Summary:
>
> LLVM-Polly is a promising polyhedral optimizer for data-locality and
> parallelism, which takes advantage of multi-cores, cache hierarchies,
> short vector instructions as well as dedicated accelerators. However,
> Polly analysis and optimization can lead to significant compiling
> overhead, which makes it much less attractive for LLVM users. I argue
> that maintaining fast compiling time when Polly is enabled is very
> important, especially if we want to think of enabling Polly in default.
> Based on this assumption, I try to reduce Polly compiling overhead in
> this project.
Sounds good.

> 2. Motivation:
>
> LLVM is an incredible open-source project. It has been widely in C/C++
                                          You miss a verb here ^^^
> compilers, high-level synthesis compilers, virtual machines, optimizing
> tools, etc. As a graduate student, I am going to work on compiler
> analysis and optimization, especially on program vectorization and
> parallelization. I find Polly is a very useful and powerful polyhedral
> optimizer. I would like to use this tool and contribute to this project.
>
> When I was using Polly tool, I found that Polly optimization can lead to
No need for 'tool' here  ^^^
> significant compiling overhead. On average, polly optimization will
> increase the compiling time by 393% for PolyBench benchmarks and by 53%
> for MediaBench benchmarks compared with clang. That means if you want to
> gain from Polly, you have to pay 4 times extra compiling overhead. Even
> if you do not want to gain much from Polly, you still have to pay 53%
> compiling overhead. Such expensive compiling overhead would make the
> Polly much less attractive to LLVM users.
Good point.
> In this project, I try to reduce Polly compiling overhead by removing
I would call it 'compile-time overhead' instead of 'compiling
overhead'.
> unnecessary passes and improving critical passes.  For this purpose, I
> firstly try to find out where the compiling overhead comes from. When
> Polly optimizes a program, it takes the following steps: 1) Polly
> canonicalization: prepare some basic information and do some basic
> transformation, such as loop-simplify and region-simplify.  2) LLVM-IR
> to Polly description: detect polly scops and translates the detected
> scops into a polyhedral representation.  3) Polly optimization: analyze
> and optimize polyhedral scops.  4) Polly description to LLVM-IR:
> translates the polyhedral description back into new LLVM-IR.
>
> In attched table 1 and 2, pBasic shows the overhead of loading the
     attached
> LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
> the overhead of step 1), 2) and 4). So the compiling overhead of Polly
> can be divided into three parts:
> PolyBench: canonicalization(13%-1%=12%), code generation(248%-13%=235%)
> and optimization(393%-248%=145%)
> MediaBench: canonicalization(9%-1%=8%), code generation(43%-9%=34%) and
> optimization(53%-43%=10%)
Thanks for adding numbers for pNoGen. Having only 10% runtime increase 
if Polly is not used is a good sign, especially for the amount of 
canonicalization passes we run. This makes me confident we can get it to 
an even smaller number.
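
The per-stage split quoted above is a difference of cumulative overheads:
each configuration (pLoad, pNoGen, pNoOpt, full Polly) adds one more stage,
so subtracting neighbouring averages isolates that stage. A small sketch of
the arithmetic follows; the struct and function names are hypothetical, and
the "canonicalization" share here covers steps 1) and 2) of the proposal,
since pNoGen runs both.

```cpp
#include <cassert>

// Cumulative compile-time overhead relative to plain "clang -O3", in percent.
// Field names follow the configurations in the proposal; the type itself is
// a hypothetical helper for this illustration.
struct CumulativeOverhead {
  double load;   // pLoad/pBasic: only LLVMPolly.so is loaded
  double noGen;  // pNoGen: steps 1) and 2), no optimizer, no code generation
  double noOpt;  // pNoOpt: steps 1), 2) and 4), no optimizer
  double full;   // full Polly: all four steps
};

struct StageOverhead {
  double canonicalization;  // noGen - load
  double codeGeneration;    // noOpt - noGen
  double optimization;      // full - noOpt
};

// Splits cumulative overheads into per-stage contributions by subtracting
// neighbouring configurations.
StageOverhead splitByStage(const CumulativeOverhead &c) {
  assert(c.load <= c.noGen && c.noGen <= c.noOpt && c.noOpt <= c.full &&
         "configurations are expected to be cumulative");
  return {c.noGen - c.load, c.noOpt - c.noGen, c.full - c.noOpt};
}
```

With the rounded PolyBench averages (1, 13, 248, 393) this gives 12, 235 and
145 percentage points; with the MediaBench averages (1, 9, 43, 53) it gives
8, 34 and 10, matching the figures in the proposal.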

The other numbers are large, but there are likely ways to improve on 
this significantly. Also, it would be good to show at least for one 
benchmark which passes the different numbers actually contain. (You can 
use -debug-pass=Structure for this). E.g. the code generation time looks
rather large. I suppose most of the time is not actually spent in code 
generation, but also in the LLVM passes such as common subexpression 
elimination that have more LLVM-IR to work on or clean up after Polly 
was run.

Also, I believe the names of your columns, and the command line options
given above are a little out of sync. I could e.g. not find a 
description for pBasic
> Based on these results, I plan to reduce Polly compiling overhead by the
> following steps: First, I will try to remove unnecessary
> canonicalization passes to reduce canonicalization time; Second, I will
> try to remove or rewrite expensive analysis passes to reduce
> optimization overhead; Third, I will try to improve the code generation
> passes to reduce code generation overhead. Another interesting work is
> to let the polly bail out early, which can be very helpful to save
> compiling overhead if Polly cannot benefit the program.
OK, this sounds like a reasonable approach. Some more points may be 
worth adding:

1) It is important to pick criteria you can evaluate your work on

It is a good start that you identified two benchmarks. Especially 
looking into non-polybench code is very valuable. You should make sure 
that you evaluate your work throughout the project to see the benefit
of your changes. In fact, it may even be worthwhile to set up a Polly 
performance tester to track the compile time with Polly enabled and how
your changes influence it.

2) Add some specific bug reports you are planning to look into

This bug report shows a large performance problem in Polly that is 
mainly due to creating a very difficult dependency analysis problem:
llvm.org/PR15643

There was a larger discussion on the Polly mailing list that discusses 
this bug.
> 3. Details about the project:
>
> StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
>
> Polly relies on some canonicalization passes to simplify the following
> analysis and optimization. Canonicalization passes include
> loop-simplify, region-simplify, Induction variable canonicalization and
> block independent. For example, region-simplify pass is run to simplify
> the region to single entry and single exit edge before -polly-detect.
> However, such approach will introduce unnecessary modifications that
> increase compile time even in the cases where Polly cannot optimize the
> code.
>
> A first step is to remove -region-simplify pass. For this purpose, I
> have modified the scop detection pass and polly code generation pass to
> allow scops with multiple entry edges and multiple exit edges. Details
> can be referred to the following patch files: (Thanks for all the help
> from Polly group)
> r179673: Remove unneeded RegionSimplify pass
> r179586: Support SCoPs with multiple entry edges
> r179159: Support SCoPs with multiple exit edges
> r179158: Codegen: Replace region exit and entries recursively
> r179157: RegionInfo: Add helpers to replace entry/exit recursively
> r178530: ScopDetection: Use isTopLevelRegion
>
> In this project, I plan to spend two weeks to reduce canonicalization
> overhead.
It was a good idea to write down what you plan to do each week.
> Week 1:  Profile the compiling overhead of each canonicalization pass,
> including PromoteMemoryToRegisterPass, CFGSimplificationPass,
> ReassociatePass, LoopRotatePass, InstructionCombiningPass,
> IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass.
> Week 2:  Remove or improve one or two most expensive canonicalization
> passes. I will also try to revise the pass ordering to move some expensive
> canonicalization passes later.
Instead of speeding up the canonicalization passes your focus should 
really be integrating Polly into the -O3 pass chain without the need to 
have any additional canonicalization passes. This part is not so much 
about the patch itself that implements it. It rather requires careful 
analysis how the number of detected scops changes when moving Polly.
At the moment we optimized for optimal scop coverage while neglecting 
compile time. Now we want both, optimal scop coverage and good compile time.

Another point that can be mentioned is removing the need for induction
variable canonicalization. We currently do this using the -polly-indvars 
pass. However, the new option -polly-codegen-scev enables us to remove 
this pass entirely. This could also be an interesting performance
problem as -polly-codegen-scev produces a lot cleaner LLVM-IR at code 
generation time, which may take more time to generate but it may also
require less time to be cleaned up. This could also be interesting to 
investigate.


> StageII -- Remove or rewrite expensive analysis passes for compiling
> performance. [Week 3-5]
>
> There are many optimization libraries for Polly, such as ScopLib, Pluto,
> ISL and Jason optimization. To balance the tradeoff between code
           JSON
> performance and compiling overhead, I will profile each optimization
> library and try to improve some of these libraries to reduce compiling
> overhead.
The only relevant one is currently isl. It may in some cases be useful 
to compare against Pluto so. No need to optimize scoplib or JSON.
> Week 3:  Profile the compiling overhead of each Polly optimization
> library, including ScopLib, Pluto, ISL and Jason.
Instead of profiling per library, I would rather profile per Polly pass
using --time-passes

You could do this later for several programs, but it would be good to 
have this already today for a single program to get an idea where time 
is spent and what needs optimization.
> Week 4:  Profile the compiling overhead of each optimization pass for one
> or two libraries (such as ISL and ScopLib). For example, ISL optimization
> provides many optimization passes, such as dependence simplify, schedule
> optimization, and various loop transformation.
> Week 5:  Remove some expensive optimization passes and rewrite some
> critical but expensive optimization passes.


> StageIII -- Improve code generation passes for compiling performance.
> [Week 6-9]
>
> Our evalutions show that polly code generation passes are very
> expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
> code generation passes can increase the compiling time by 500% or more
> (See table 1). My plan is to improve various code generation passes.
Can you verify your numbers here. You report for ludcmp the following:

            clang   pBasic  pNoOpt  pNoGen  pOPt
ludcmp.c    0.157   0.1602  0.2002  1.0761  1.3175

            pBasic% pNoGen% pNoOpt% pOpt%
            2.04%   27.52%  585.41% 739.17%

I have the feeling the headings of the pNoGen% and pNoOpt% columns have 
been switched accidentally. At least from the numbers above, I see an 
increase from 0.16 to 0.20 for code generation, which is far from being 
a 500% increase. On the other side, the optimization itself seems to add 
a larger amount of time as well as the code generation of the optimized 
code. O
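
One way to check which time column each percentage belongs to is to
recompute the relative overhead from the raw ludcmp.c times quoted above. The
sketch below assumes the percentages are (time - clang time) / clang time;
that formula is an inference from the numbers, not something stated in the
proposal, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Relative compile-time overhead of `seconds` over `baselineSeconds`,
// in percent. Assumed formula: (t - baseline) / baseline * 100.
double overheadPercent(double baselineSeconds, double seconds) {
  assert(baselineSeconds > 0.0 && "baseline must be a positive time");
  return (seconds - baselineSeconds) / baselineSeconds * 100.0;
}

// True if a recomputed percentage agrees with a value that was printed
// with two decimals (tolerance: half of the last printed digit).
bool matchesReported(double computed, double reported) {
  return std::fabs(computed - reported) < 0.005;
}
```

Under that assumption, with the clang time of 0.157 s as the baseline, the
0.1602 s column reproduces 2.04%, the 0.2002 s column reproduces 27.52%, the
1.0761 s column reproduces 585.41% and the 1.3175 s column reproduces
739.17%. So the percentages are in the same left-to-right order as the time
columns, while the pNoOpt/pNoGen labels on the two header rows are in
opposite order.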

> Week 6:  Profile the compiling overhead of each Polly code generation
> pass, especially for ISL code generation.
> Week 7:  Remove unnecessary analysis for code generation. Currently, the
> Polly code generation pass depends on a lot of analysis passes such as
> DominatorTree, IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection and
> ScopInfo. I will try to remove some of the expensive analysis passes.
Those passes add little overhead to the code generation. In fact the
analysis is normally already available, such that these analysis
requirements are for free. They have been added here mainly to allow the
code generation to update them, such that we do not need to spend time
rebuilding them later.

> Week 8-9: Rewrite some expensive functions for Polly code generation based
> on profiling information.
This is still very vague. I propose to
> StageIV -- Let Polly bail out early. [Week 10]
>
> Week 10: Add support in canicalization step or optimization step to
       Typo ----->        canonicalization
> allow Polly boil out early if it cannot benefit programs.
> StageV -- Improve other parts. [Week 11-12]
>
> Week 11: Improve other parts of Polly. Especially, I will focus on some
> expensive helper functions such as TempScop analysis. This helper
> function is critical and expensive.
How do you know TempScop is expensive?
> Week 12: Integrate all improvements
> and evaluate the whole Polly with multiple benchmarks.
I think the only way to do this project is to continuously evaluate your 
changes on Polybench and mediabench and to directly integrate them
into the svn repository. This should be made clear at the beginning and
I believe it is very fine to spend more time on the individual steps,
such that we can make sure the changes are properly evaluated and 
integrated.
> 4. Profit for LLVM users and Polly users
>
> This project can benefit both LLVM users and Polly users. For LLVM
> users, our project will make the Polly more acceptable if it can
> provides extra performance gains within little extra compiling overhead.
> For Polly users, this project will make the Polly more powerful by
> significantly reducing compiling overhead and improving code quality.
Nice.

You could make your goals more concrete by saying that we want to show that
by enabling Polly we can significantly optimize the polybench
benchmarks, while at the same time no prohibitively large compile time
increase can be seen for mediabench. Reaching this goal would be a great
step forward.
> [Attachments]
>
> Our evaluation is based on Intel Pentium Dual CPU T2390(1.86GHz) with
> 2GB DDR2 memory. Each benchmark is run multiple times and data are
> collected using ministat (https://github.com/codahale/ministat). Results
> are shown in table 1 and table 2.  Five cases are tested: (alias
> pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly")
> *clang:  clang -O3
> *pLoad:  clang -O3 -load LLVMPolly.so
> *pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm -polly-code-generator=none
> *pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none
> *polly:  pollycc -O3
>
> Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10
> times)
>
>               clang   pBasic  pNoOpt  pNoGen  pOPt    pBasic%  pNoGen%  pNoOpt%  pOpt%
> 2mm.c         0.1521  0.1593  0.1711  0.3235  0.7247  4.73%    12.49%   112.69%  376.46%
> atax.c        0.1386  0.1349  0.1449  0.2066  0.313   0.00%    0.00%    49.06%   125.83%
> covariance.c  0.1498  0.1517  0.1526  0.3561  0.7706  1.27%    1.87%    137.72%  414.42%
> gemver.c      0.1562  0.1587  0.1724  0.2674  0.3936  1.60%    10.37%   71.19%   151.99%
> instrument.c  0.1062  0.1075  0.1124  0.123   0.1216  0.00%    5.84%    15.82%   14.50%
> ludcmp.c      0.157   0.1602  0.2002  1.0761  1.3175  2.04%    27.52%   585.41%  739.17%
> 3mm.c         0.1529  0.1559  0.1826  0.4134  1.0436  1.96%    19.42%   170.37%  582.54%
> bicg.c        0.1244  0.1268  0.1353  0.1977  0.2828  1.93%    8.76%    58.92%   127.33%
> doitgen.c     0.1492  0.1505  0.1644  0.3325  0.8971  0.00%    10.19%   122.86%  501.27%
> gesummv.c     0.1224  0.1279  0.134   0.1999  0.2937  4.49%    9.48%    63.32%   139.95%
> jacobi.c      0.1444  0.1506  0.1592  0.3912  0.8494  0.00%    10.25%   170.91%  488.23%
> seidel.c      0.1337  0.1353  0.1462  0.6299  0.9155  0.00%    9.35%    371.13%  584.74%
> adi.c         0.1593  0.1621  0.1835  1.4375  1.849   1.76%    15.19%   802.39%  1060.70%
> correlation.c 0.1579  0.1596  0.1802  0.3393  0.6337  1.08%    14.12%   114.88%  301.33%
> gemm.c        0.1407  0.1432  0.1576  0.2421  0.4477  1.78%    12.01%   72.07%   218.20%
> gramschmidt.c 0.1331  0.1349  0.1509  0.3069  0.4138  0.00%    13.37%   130.58%  210.89%
> lu.c          0.1419  0.1443  0.1581  0.3156  0.3943  1.69%    11.42%   122.41%  177.87%
> average                                               1.26%    13.22%   248.47%  393.80%
To improve readability, it may be worth ensuring this fits into 80 
columns. You may be able to reduce the number of digits used here.

You could probably increase the readability of your proposal further if 
you use markdown. See here for an example of how a markdown file looks 
at github: https://gist.github.com/micmcg/976172 and here the raw version
https://gist.github.com/micmcg/976172/raw/70f1e0db278340bd8167c98fb880979b4571e847/gistfile1.md

You basically need to use the file ending '.md' and you can then use 
markdown syntax to format your text. The very same syntax will also 
improve the readability of the proposal on the mailing list.

All the best,

Tobias

Star Tan

2013-May-03 11:40 UTC


Dear Tobias,


Thanks for your timely reply. Your advice is really helpful.


I have updated the proposal at https://gist.github.com/tanstar/5508153. Major
differences include:
(1)  Added table 3 and table 4 to show the compile-time overhead of the top 15
hot passes;
(2)  Described a new schedule for this project. The new schedule pays more
attention to reducing the compile-time overhead of hot passes and includes
eight stages;
(3)  Enriched the proposal with concrete work plans for each stage;
(4)  Rewrote the proposal using markdown to make it more readable.


At 2013-05-02 19:12:37,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 04/26/2013 05:08 AM, tanmx_star wrote:
>> Hi all,
>
>Hi,
>
>thanks for the update and sorry for the delay in reviewing. I just had a 
>look at your proposal.
>
>
>> I have updated my GSoC proposal: "FastPolly: Reducing LLVM-Polly
>> Compiling overhead" (https://gist.github.com/tanstar/5441808).  I think the
>> pass ordering problem you discussed earlier can also be investigated in this
>> project!
>
>Yes, figuring out the optimal pass ordering sequence is very good.
>
>> Is there any comment or advice about my proposal?  I appreciate all
>> your help and advice.
>
>
>> 1. Summary:
>>
>> LLVM-Polly is a promising polyhedral optimizer for data-locality and
>> parallelism, which takes advantage of multi-cores, cache hierarchies,
>> short vector instructions as well as dedicated accelerators. However,
>> Polly analysis and optimization can lead to significant compiling
>> overhead, which makes it much less attractive for LLVM users. I argue
>> that maintaining fast compiling time when Polly is enabled is very
>> important, especially if we want to think of enabling Polly in default.
>> Based on this assumption, I try to reduce Polly compiling overhead in
>> this project.
>
>Sounds good.
>
>
>> 2. Motivation:
>>
>> LLVM is an incredible open-source project. It has been widely in C/C++
>
>                                          You miss a verb here ^^^
>
>> compilers, high-level synthesis compilers, virtual machines, optimizing
>> tools, etc. As a graduate student, I am going to work on compiler
>> analysis and optimization, especially on program vectorization and
>> parallelization. I find Polly is a very useful and powerful polyhedral
>> optimizer. I would like to use this tool and contribute to this
>> project.
>>
>> When I was using Polly tool, I found that Polly optimization can lead
>> to
>No need for 'tool' here  ^^^
>
>> significant compiling overhead. On average, polly optimization will
>> increase the compiling time by 393% for PolyBench benchmarks and by 53%
>> for MediaBench benchmarks compared with clang. That means if you want
>> to
>> gain from Polly, you have to pay 4 times extra compiling overhead. Even
>> if you do not want to gain much from Polly, you still have to pay 53%
>> compiling overhead. Such expensive compiling overhead would make the
>> Polly much less attractive to LLVM users.
>
>Good point.
>
>> In this project, I try to reduce Polly compiling overhead by removing
>
>I would call it 'compile-time overhead' instead of 'compiling
>overhead'.
>
>> unnecessary passes and improving critical passes.  For this purpose, I
>> firstly try to find out where the compiling overhead comes from. When
>> Polly optimizes a program, it takes the following steps: 1) Polly
>> canonicalization: prepare some basic information and do some basic
>> transformation, such as loop-simplify and region-simplify.  2) LLVM-IR
>> to Polly description: detect polly scops and translates the detected
>> scops into a polyhedral representation.  3) Polly optimization: analyze
>> and optimize polyhedral scops.  4) Polly description to LLVM-IR:
>> translates the polyhedral description back into new LLVM-IR.
>>
>> In attched table 1 and 2, pBasic shows the overhead of loading the
>      attached
>
>> LLVMPolly.so; pNoGen shows the overhead of step 1) and 2); pNoOpt shows
>> the overhead of step 1), 2) and 4). So the compiling overhead of Polly
>> can be divided into three parts:
>> PolyBench: canonicalization(13%-1%=12%), code generation(248%-13%=235%)
>> and optimization(393%-248%=145%)
>> MediaBench: canonicalization(9%-1%=8%), code generation(43%-9%=34%) and
>> optimization(53%-43%=10%)
>
>Thanks for adding numbers for pNoGen. Having only 10% runtime increase 
>if Polly is not used is a good sign, especially for the amount of 
>canonicalization passes we run. This makes me confident we can get it to 
>an even smaller number.
>
>The other numbers are large, but there are likely ways to improve on 
>this significantly. Also, it would be good to show at least for one 
>benchmark which passes the different numbers actually contain. (You can 
>use -debug-pass=Structure for this). E.g. the code generation time looks
>rather large. I suppose most of the time is not actually spent in code 
>generation, but also in the LLVM passes such as common subexpression 
>elimination that have more LLVM-IR to work on or clean up after Polly 
>was run.
>
>Also, I believe the names of your columns, and the command line options
>given above are a little out of sync. I could e.g. not find a
>description for pBasic

Sorry, pBasic means pLoad. I have added the description for pBasic in the new
proposal.

>
>> Based on these results, I plan to reduce Polly compiling overhead by the
>> following steps: First, I will try to remove unnecessary
>> canonicalization passes to reduce canonicalization time; Second, I will
>> try to remove or rewrite expensive analysis passes to reduce
>> optimization overhead; Third, I will try to improve the code generation
>> passes to reduce code generation overhead. Another interesting work is
>> to let the polly bail out early, which can be very helpful to save
>> compiling overhead if Polly cannot benefit the program.
>
>OK, this sounds like a reasonable approach. Some more points may be 
>worth adding:
>
>1) It is important to pick criteria you can evaluate your work on
>
>It is a good start that you identified two benchmarks. Especially 
>looking into non-polybench code is very valuable. You should make sure 
>that you evaluate your work throughout the project to see the benefit
>of your changes. In fact, it may even be worthwhile to set up a Polly 
>performance tester to track the compile time with Polly enabled and how
>your changes influence it.

Yes, you are right. Picking criteria for the continuous evaluation is very
important prerequisite work. I have added an extra stage (stage1) for this
work. In my opinion, "number of scops optimized by Polly" can be used as the
performance criterion, while "total compile-time overhead" and
"compile-time overhead of each Polly pass" can be used as the compile-time
overhead criteria.  I will set up the testing environment and integrate it
into the Polly SVN repository as soon as possible.

>
>2) Add some specific bug reports you are planning to look into
>
>This bug report shows a large performance problem in Polly that is 
>mainly due to creating a very difficult dependency analysis problem:
>llvm.org/PR15643
>
>There was a larger discussion on the Polly mailing list that discusses
>this bug.

I have added work plans of this kind to stage3 in the new proposal.

>
>> 3. Details about the project:
>>
>> StageI -- Remove unnecessary canonicalization transformation. [Week 1-2]
>>
>> Polly relies on some canonicalization passes to simplify the following
>> analysis and optimization. Canonicalization passes include
>> loop-simplify, region-simplify, Induction variable canonicalization and
>> block independent. For example, region-simplify pass is run to simplify
>> the region to single entry and single exit edge before -polly-detect.
>> However, such approach will introduce unnecessary modifications that
>> increase compile time even in the cases where Polly cannot optimize the
>> code.
>>
>> A first step is to remove -region-simplify pass. For this purpose, I
>> have modified the scop detection pass and polly code generation pass to
>> allow scops with multiple entry edges and multiple exit edges. Details
>> can be referred to the following patch files: (Thanks for all the help
>> from Polly group)
>
>> r179673: Remove unneeded RegionSimplify pass
>> r179586: Support SCoPs with multiple entry edges
>> r179159: Support SCoPs with multiple exit edges
>> r179158: Codegen: Replace region exit and entries recursively
>> r179157: RegionInfo: Add helpers to replace entry/exit recursively
>> r178530: ScopDetection: Use isTopLevelRegion
>>
>> In this project, I plan to spend two weeks to reduce canonicalization
>> overhead.
>
>It was a good idea to write down what you plan to do each week.
>
>> Week 1:  Profile the compiling overhead of each canonicalization pass,
>> including PromoteMemoryToRegisterPass, CFGSimplificationPass,
>> ReassociatePass, LoopRotatePass, InstructionCombiningPass,
>> IndVarSimplifyPass, CodePreparationPass and LoopSimplifyPass.
>> Week 2:  Remove or improve one or two most expensive canonicalization
>> passes. I will also try to revise the pass ordering to move some expensive
>> canonicalization passes later.
>
>Instead of speeding up the canonicalization passes, your focus should
>really be on integrating Polly into the -O3 pass chain without the need
>for any additional canonicalization passes. This part is not so much
>about the patch itself that implements it. It rather requires careful
>analysis of how the number of detected scops changes when moving Polly.
>So far we have optimized for optimal scop coverage while neglecting
>compile time. Now we want both, optimal scop coverage and good compile time.
>
>Another point that can be mentioned is removing the need for induction
>variable canonicalization. We currently do this using the -polly-indvars 
>pass. However, the new option -polly-codegen-scev enables us to remove 
>this pass entirely. This could also be an interesting performance
>problem as -polly-codegen-scev produces a lot cleaner LLVM-IR at code 
>generation time, which may take more time to generate but it may also
>require less time to be cleaned up. This could also be interesting to 
>investigate.
>Work plans for such work are added to stage4 in the new proposal.
You are right, it would be great if we could completely remove the
canonicalization passes. I will try to remove -polly-indvars first, and then
investigate the other canonicalization passes.
>
>> StageII -- Remove or rewrite expensive analysis passes for compiling
>> performance. [Week 3-5]
>>
>> There are many optimization libraries for Polly, such as ScopLib, Pluto,
>> ISL and Jason optimization. To balance the tradeoff between code
>           JSON
>> performance and compiling overhead, I will profile each optimization
>> library and try to improve some of these libraries to reduce compiling
>> overhead.
>
>The only relevant one is currently isl. It may in some cases be useful
>to compare against Pluto, though. No need to optimize scoplib or JSON.
Yes, you are right. However, it seems Polly uses CLooG to generate code by
default, which is much slower than ISL. Do you mean we will use ISL as the
default in the future?
>
>> Week 3:  Profile the compiling overhead of each Polly optimization
>> library, including ScopLib, Pluto, ISL and Jason.
>
>Instead of profiling per library, I would rather profile per Polly pass
>using --time-passes
>
>You could do this later for several programs, but it would be good to
>have this already today for a single program to get an idea where time
>is spent and what needs optimization.
Yes, the new proposal pays more attention to profiling and improving each
Polly pass. "--time-passes" is really a very useful option!
>
>> Week 4: Profile the compiling overhead of each optimization pass for one
>> or two libraries (such as ISL and ScopLib). For example, ISL optimization
>> provides many optimization passes, such as dependence simplification,
>> schedule optimization, and various loop transformations.
>> Week 5: Remove some expensive optimization passes and rewrite some
>> critical but expensive optimization passes.
>
>
>
>
>> StageIII -- Improve code generation passes for compiling performance.
>> [Week 6-9]
>>
>> Our evaluations show that Polly code generation passes are very
>> expensive, especially for some benchmarks like ludcmp.c and adi.c. Polly
>> code generation passes can increase the compiling time by 500% or more
>> (See table 1). My plan is to improve various code generation passes.
>
>Can you verify your numbers here. You report for ludcmp the following:
>
>	   	clang	pBasic	pNoOpt	pNoGen	pOPt
>ludcmp.c	0.157	0.1602	0.2002	1.0761	1.3175
>
>			pBasic%	pNoGen%	pNoOpt%	pOpt%
>			2.04%	27.52%	585.41%	739.17%
>
>I have the feeling the headings of the pNoGen% and pNoOpt% columns have 
>been switched accidentally. At least from the numbers above, I see an 
>increase from 0.16 to 0.20 for code generation, which is far from being 
>a 500% increase. On the other side, the optimization itself seems to add 
>a larger amount of time as well as the code generation of the optimized 
>code.
Sorry, the right order should be "clang pBasic pNoGen pNoOpt pOPt pBasic%
pNoGen% pNoOpt% pOpt%". I have fixed this problem in the new proposal.
>
>> Week 6: Profile the compiling overhead of each Polly code generation
>> pass, especially for ISL code generation.
>> Week 7: Remove unnecessary analysis for code generation. Currently, the
>> Polly code generation pass depends on a lot of analysis passes such as
>> DominatorTree, IslAstInfo, RegionInfo, ScalarEvolution, ScopDetection and
>> ScopInfo. I will try to remove some of the expensive analysis passes.
>
>Those passes add little overhead to the code generation. In fact the 
>analysis is normally already available, such that these analysis 
>requirements are for free. They have been added here mainly to allow the
>code generation to update them, such that we do not need to spend time 
>rebuilding them later.
>
>> Week 8-9: Rewrite some
>> expensive functions for Polly code generation based on profiling
>> information.
>
>This is still very vague. I propose to
>
>> StageIV -- Let Polly bail out early. [Week 10]
>>
>> Week 10: Add support in canicalization step or optimization step to
>        Typo ----->        canonicalization
>
>> allow Polly to bail out early if it cannot benefit programs.
>
>
>> StageV -- Improve other parts. [Week 11-12]
>>
>> Week 11: Improve other parts of Polly. Especially, I will focus on some
>> expensive helper functions such as TempScop analysis. This helper
>> function is critical and expensive.
>
>How do you know TempScop is expensive?
>
>> Week 12: Integrate all improvements
>> and evaluate the whole Polly with multiple benchmarks.
>
>I think the only way to do this project is to continuously evaluate your 
>changes on Polybench and mediabench and to directly integrate them
>into the svn repository. This should be made clear at the beginning and
>I believe it is very fine to spend more time on the individual steps,
>such that we can make sure the changes are properly evaluated and 
>integrated.
Yes, I will set up the environment as soon as possible and integrate it into
the Polly SVN repository. I have already finished some scripts, and I think
this work can be done within the next week.
>
>> 4. Profit for LLVM users and Polly users
>>
>> This project can benefit both LLVM users and Polly users. For LLVM
>> users, our project will make Polly more acceptable if it can
>> provide extra performance gains with little extra compiling overhead.
>> For Polly users, this project will make Polly more powerful by
>> significantly reducing compiling overhead and improving code quality.
>
>Nice.
>
>You could make your goals more concrete saying that we want to show that
>by enabling Polly we can significantly optimize the polybench
>benchmarks, while at the same time no prohibitively large compile time 
>increase can be seen for mediabench. Reaching this goal would be a great
>step forward.
>
>> [Attachments]
>>
>> Our evaluation is based on an Intel Pentium Dual CPU T2390 (1.86GHz) with
>> 2GB DDR2 memory. Each benchmark is run multiple times and data are
>> collected using ministat (https://github.com/codahale/ministat). Results
>> are shown in table 1 and table 2. Five cases are tested (alias
>> pollycc="clang -O3 -load LLVMPolly.so -mllvm -polly"):
>>
>> *clang:  clang -O3
>> *pLoad:  clang -O3 -load LLVMPolly.so
>> *pNoGen: pollycc -O3 -mllvm -polly-optimizer=none -mllvm
>>          -polly-code-generator=none
>> *pNoOpt: pollycc -O3 -mllvm -polly-optimizer=none
>> *polly:  pollycc -O3
>>
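The measurement procedure described above (run each configuration several times and collect the wall-clock times) can be sketched roughly as follows. This is only a minimal illustration, not the script used for the proposal: `time_command` is a hypothetical helper, the summary uses Python's `statistics` module as a stand-in for ministat, and the command passed in is a placeholder that would be replaced by one of the five compiler invocations.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=10):
    """Run `cmd` (a list of arguments) `runs` times.

    Returns the wall-clock duration of each run in seconds.
    Hypothetical helper for illustration only.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken build is not
        # silently counted as a (very fast) compile.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (median, standard deviation) of the collected samples."""
    return statistics.median(samples), statistics.stdev(samples)


# Placeholder command so the sketch runs without a compiler installed;
# in practice this would be e.g. ["clang", "-O3", "-c", "ludcmp.c"].
placeholder_cmd = [sys.executable, "-c", "pass"]
samples = time_command(placeholder_cmd, runs=3)
median, spread = summarize(samples)
print(f"median {median:.4f}s, stdev {spread:.4f}s over {len(samples)} runs")
```

Using the median rather than the mean makes the summary less sensitive to a single slow run, which matters when the differences between configurations are only a few milliseconds.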
>> Table 1: Compile time for PolyBench (Seconds, each benchmark is run 10
>> times)
>>
>>               clang   pBasic  pNoOpt  pNoGen  pOPt    pBasic%  pNoGen%  pNoOpt%  pOpt%
>> 2mm.c         0.1521  0.1593  0.1711  0.3235  0.7247  4.73%    12.49%   112.69%  376.46%
>> atax.c        0.1386  0.1349  0.1449  0.2066  0.313   0.00%    0.00%    49.06%   125.83%
>> covariance.c  0.1498  0.1517  0.1526  0.3561  0.7706  1.27%    1.87%    137.72%  414.42%
>> gemver.c      0.1562  0.1587  0.1724  0.2674  0.3936  1.60%    10.37%   71.19%   151.99%
>> instrument.c  0.1062  0.1075  0.1124  0.123   0.1216  0.00%    5.84%    15.82%   14.50%
>> ludcmp.c      0.157   0.1602  0.2002  1.0761  1.3175  2.04%    27.52%   585.41%  739.17%
>> 3mm.c         0.1529  0.1559  0.1826  0.4134  1.0436  1.96%    19.42%   170.37%  582.54%
>> bicg.c        0.1244  0.1268  0.1353  0.1977  0.2828  1.93%    8.76%    58.92%   127.33%
>> doitgen.c     0.1492  0.1505  0.1644  0.3325  0.8971  0.00%    10.19%   122.86%  501.27%
>> gesummv.c     0.1224  0.1279  0.134   0.1999  0.2937  4.49%    9.48%    63.32%   139.95%
>> jacobi.c      0.1444  0.1506  0.1592  0.3912  0.8494  0.00%    10.25%   170.91%  488.23%
>> seidel.c      0.1337  0.1353  0.1462  0.6299  0.9155  0.00%    9.35%    371.13%  584.74%
>> adi.c         0.1593  0.1621  0.1835  1.4375  1.849   1.76%    15.19%   802.39%  1060.70%
>> correlation.c 0.1579  0.1596  0.1802  0.3393  0.6337  1.08%    14.12%   114.88%  301.33%
>> gemm.c        0.1407  0.1432  0.1576  0.2421  0.4477  1.78%    12.01%   72.07%   218.20%
>> gramschmidt.c 0.1331  0.1349  0.1509  0.3069  0.4138  0.00%    13.37%   130.58%  210.89%
>> lu.c          0.1419  0.1443  0.1581  0.3156  0.3943  1.69%    11.42%   122.41%  177.87%
>> average                                               1.26%    13.22%   248.47%  393.80%
>
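A row of Table 1 can be cross-checked by recomputing the percentage columns from the timing columns. The sketch below does this for the ludcmp.c row only, assuming each percentage is the increase relative to the plain `clang -O3` time. That is an assumption, although it does reproduce the four reported values for this row. `overhead_percent` is a hypothetical helper name.

```python
def overhead_percent(baseline, measured):
    """Compile-time increase of `measured` over `baseline`, in percent.

    Hypothetical helper; assumes overhead is reported relative to clang -O3.
    """
    return round((measured - baseline) / baseline * 100.0, 2)


# ludcmp.c timings in seconds, in the order they appear in the table row.
clang_seconds = 0.157
polly_seconds = [0.1602, 0.2002, 1.0761, 1.3175]

overheads = [overhead_percent(clang_seconds, t) for t in polly_seconds]
print(overheads)  # [2.04, 27.52, 585.41, 739.17]
```

The second timing (0.2002 s) yields 27.52% and the third (1.0761 s) yields 585.41%, so for this row the percentages follow the same left-to-right order as the timings, and only the column labels disagree between the two halves of the header.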
>To improve readability, it may be worth ensuring this fits into 80 
>columns. You may be able to reduce the number of digits used here.
>
>You could probably increase the readability of your proposal further if 
>you use markdown. See here for an example of how a markdown file looks 
>at github: https://gist.github.com/micmcg/976172 and here the raw version
>https://gist.github.com/micmcg/976172/raw/70f1e0db278340bd8167c98fb880979b4571e847/gistfile1.md
>
>You basically need to use the file ending '.md' and you can then use
>markdown syntax to format your text. The very same syntax will also 
>improve the readability of the proposal on the mailing list.
Thank you so much for your very helpful advice. I have rewritten the proposal
using markdown. This tool is really interesting and powerful.
>
>All the best,
>
>Tobias
>
Best regards,
Star Tan
