thr3ads.net - llvm dev - [LLVMdev] [Polly]GSoC Proposal: Reducing LLVM-Polly Compiling overhead [Mar 2013]

Tobias Grosser

2013-Mar-20 13:06 UTC

[LLVMdev] [Polly]GSoC Proposal: Reducing LLVM-Polly Compiling overhead

On 03/19/2013 11:02 AM, Star Tan wrote:
> Dear Tobias Grosser,
>
> Today I have rebuilt the LLVM-Polly in Release mode. The configuration of
> my own testing machine is: Intel Pentium Dual CPU T2390(1.86GHz) with 2GB DDR2
> memory.
> I evaluated the Polly using PolyBench and Mediabench. It takes too long
> time to evaluate the whole LLVM-testsuite, so I just choose the Mediabench from
> LLVM-testsuite.
OK. This is a good baseline.
> The preliminary results of Polly compiling overhead is listed as follows:
>
> Table 1: Compiling time overhead of Polly for PolyBench.
>
> | | Clang (second) | Polly-load (second) | Polly-optimize (second) | Polly-load penalty | Polly-optimize penalty |
> | 2mm.c | 0.155 | 0.158 | 0.75 | 1.9% | 383.9% |
> | correlation.c | 0.132 | 0.133 | 0.319 | 0.8% | 141.7% |
> | gesummv.c | 0.152 | 0.157 | 0.794 | 3.3% | 422.4% |
> | ludcmp.c | 0.157 | 0.159 | 0.391 | 1.3% | 149.0% |
> | 3mm.c | 0.103 | 0.109 | 0.122 | 5.8% | 18.4% |
> | covariance.c | 0.16 | 0.163 | 1.346 | 1.9% | 741.3% |
This is a very large slowdown. On my system I get

0.06 sec for Polly-load
0.09 sec for Polly-optimize

What exact version of Polybench did you use? What compiler
flags did you use to compile the benchmark?
Also, did you run the executables several times? How large is the
standard deviation of the results? (You can use a tool like ministat to 
calculate these values [1])
> | gramschmidt.c | 0.159 | 0.167 | 1.023 | 5.0% | 543.4% |
> | seidel.c | 0.125 | 0.13 | 0.285 | 4.0% | 128.0% |
> | adi.c | 0.155 | 0.156 | 0.953 | 0.6% | 514.8% |
> | doitgen.c | 0.124 | 0.128 | 0.298 | 3.2% | 140.3% |
> | instrument.c | 0.149 | 0.151 | 0.837 | 1.3% | 461.7% |
This number is surprising. In your last numbers you reported 
Polly-optimize as taking 0.495 sec in debug mode. The time you now
report for the release mode is almost twice as much. Can you verify
this number please?
> | atax.c | 0.135 | 0.136 | 0.917 | 0.7% | 579.3% |
> | gemm.c | 0.161 | 0.162 | 1.839 | 0.6% | 1042.2% |
This number also looks fishy. In debug mode you reported 1.327 seconds
for Polly-optimize. This is again faster than in release mode.
> | jacobi-2d-imper.c | 0.16 | 0.161 | 0.649 | 0.6% | 305.6% |
> | bicg.c | 0.149 | 0.152 | 0.444 | 2.0% | 198.0% |
> | gemver.c | 0.135 | 0.136 | 0.416 | 0.7% | 208.1% |
> | lu.c | 0.143 | 0.148 | 0.398 | 3.5% | 178.3% |
> | Average | | | | 2.20% | 362.15% |
Otherwise, those numbers look like a good start. Maybe you can put them
on some website/wiki/document where you can extend them as you proceed 
with benchmarking.
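
The question above about repeated runs and standard deviation can be made concrete with a small sketch. It is not the script used in this thread; the helper names `time_command` and `summarize` are hypothetical, and the sample timings in the demo are made up for illustration. It runs a command several times and reports the mean and sample standard deviation, which is roughly what ministat would summarize for one data set.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock time of each run in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of timings."""
    mean = statistics.mean(samples)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measured data.
    demo = [0.75, 0.78, 0.74, 0.77, 0.76]
    mean, stdev = summarize(demo)
    print(f"mean={mean:.3f}s stdev={stdev:.3f}s")
```

A standard deviation that is large compared to the difference between two configurations suggests that more runs are needed before the penalty numbers can be trusted.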
> Table 2: Compiling time overhead of Polly for Mediabench (Selected from
> LLVM-testsuite).
> | | Clang (second) | Polly-load (second) | Polly-optimize (second) | Polly-load penalty | Polly-optimize penalty |
> | adpcm | 0.18 | 0.187 | 0.218 | 3.9% | 21.1% |
> | g721 | 0.538 | 0.538 | 0.803 | 0.0% | 49.3% |
> | gsm | 2.869 | 2.936 | 4.789 | 2.3% | 66.9% |
> | mpeg2 | 3.026 | 3.072 | 4.662 | 1.5% | 54.1% |
> | jpeg | 13.083 | 13.248 | 22.488 | 1.3% | 71.9% |
> | Average | | | | 1.80% | 52.65% |

I ran jpeg myself to verify these numbers on my machine. I got:

A: -O3
B: -O3 -load LLVMPolly.so
C: -O3 -load LLVMPolly.so -mllvm -polly
D: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
E: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
    -mllvm -polly-code-generator=none

|      |  A  |  B  |  C  |  D  |  E  |
| jpeg | 5.1 | 5.2 | 8.0 | 7.9 | 5.5 |

The overhead between A and C is similar to the one you report. Hence, 
the numbers seem to be correct.

I also added two more runs, D and E, to figure out where the slowdown
comes from. As you can see, most of the slowdown disappears when we
do not do code generation. This means either that the Polly code
generation itself is slow or that the LLVM passes afterwards need more
time due to the code we generated (it contains many opportunities for
scalar simplifications). It would be interesting to see if this holds
for the other benchmarks and to investigate the actual reasons for the
slowdown. It is also interesting to see that just running Polly
without applying optimizations does not slow down the compilation a lot.
Does this also hold for other benchmarks?
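
The reasoning about runs A to E amounts to taking differences between neighbouring configurations. The sketch below works through that arithmetic for the jpeg numbers above, assuming they are in seconds like the earlier tables. The function name `breakdown` and the labels are hypothetical, and the attribution is only as fine-grained as the flags allow: the "codegen" difference cannot separate Polly's own code generation from extra work in the later LLVM passes.

```python
def breakdown(t):
    """Split the A-to-C compile-time increase into four differences."""
    return {
        "load": t["B"] - t["A"],       # loading LLVMPolly.so only
        "base": t["E"] - t["B"],       # Polly enabled, optimizer and code generator off
        "codegen": t["D"] - t["E"],    # code generation plus extra work in later passes
        "optimizer": t["C"] - t["D"],  # the Polly optimizer itself
    }


if __name__ == "__main__":
    # jpeg compile times for configurations A to E, as reported above.
    jpeg = {"A": 5.1, "B": 5.2, "C": 8.0, "D": 7.9, "E": 5.5}
    parts = breakdown(jpeg)
    total = jpeg["C"] - jpeg["A"]
    for name, seconds in parts.items():
        share = 100 * seconds / total
        print(f"{name:10s} {seconds:5.2f}  {share:5.1f}% of the A-to-C increase")
```

With these inputs the D minus E difference is about 2.4 out of a total increase of about 2.9, which matches the observation that most of the slowdown disappears without code generation.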
> As shown in these two tables, Polly can significantly increase the
> compiling time when it indeed works for the Polybench. On average, Polly will
> increase the compiling time by 4.5X for Polybench. Even for the Mediabench, in
> which Polly does not actually improve the efficiency of generated code, it still
> increases the compiling time by 1.5X.
> Based on the above observation, I think we should not only reduce the Polly
> analysis and optimization time, but also make it bail out early when it cannot
> improve the efficiency of generated code. That is very important when Polly is
> enabled in default for LLVM users.
Bailing out early is definitely something we can think about.

To get started here, you could e.g. look into the jpeg benchmark and 
investigate on which files Polly is spending a lot of time, where 
exactly the time is spent and what kind of SCoPs Polly is optimizing. In 
case we do not expect any benefit, we may skip code generation entirely.
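
One way to start the per-file investigation suggested above is to compare per-file compile times with and without Polly and sort by the difference. This is a minimal sketch under the assumption that per-file timings have already been collected into two dictionaries; the function name `rank_by_overhead`, the file names, and the numbers are all placeholders, not data from the jpeg benchmark.

```python
def rank_by_overhead(base_times, polly_times):
    """Return (file, extra_seconds) pairs, largest extra compile time first."""
    extra = {
        name: polly_times[name] - base_times[name]
        for name in base_times
        if name in polly_times  # skip files that were not timed in both runs
    }
    return sorted(extra.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-file compile times in seconds; file names are placeholders.
    base = {"file_a.c": 0.40, "file_b.c": 0.90, "file_c.c": 0.25}
    polly = {"file_a.c": 0.45, "file_b.c": 2.10, "file_c.c": 0.26}
    for name, seconds in rank_by_overhead(base, polly):
        print(f"{name}: +{seconds:.2f}s")
```

The files at the top of the list are the ones worth inspecting first for the kind of SCoPs Polly detects in them.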

Thanks again for your interesting analysis.

Cheers,
Tobi

[1] https://github.com/codahale/ministat

tanmx_star

2013-Mar-23 16:23 UTC


Dear Tobias,

Sorry for the late reply. 

I have checked the experiment and found that some of the data was mismatched
because of incorrect manual copy and paste, so I have written a shell script to
collect the data automatically. The newest data is listed in the attached file.

Tobias, I have made a simple HTML page (attached polly-compiling-overhead.html)
to show the experimental data and my plans for this project. I think a public
webpage would be helpful for our further discussion. If possible, could you put it
on the Polly website (either a public link or a temporary webpage)?
I think I will try to remove unnecessary code transformations for
canonicalization in the next step.

Thank you very much for your warm help.

Best Regards,
Star Tan


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130324/3a85931c/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: polly-compiling-overhead.html
Type: application/octet-stream
Size: 8687 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130324/3a85931c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: polly_build.sh
Type: application/octet-stream
Size: 1177 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130324/3a85931c/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: polly_compile.sh
Type: application/octet-stream
Size: 1213 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130324/3a85931c/attachment-0002.obj>

Tobias Grosser

2013-Mar-23 17:37 UTC


On 03/23/2013 05:23 PM, tanmx_star wrote:
> Dear Tobias,
>
> Sorry for the late reply.
>
> I have checked the experiment and found that some of the data was mismatched
> because of incorrect manual copy and paste, so I have written a shell script to
> collect the data automatically. The newest data is listed in the attached file.
Yes, automating those experiments to make them reproducible is a very
good idea. I have not yet verified the numbers, but I will as soon as your
script is online.

Two comments:

o Can you also run with the following flags:

D: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
E: -O3 -load LLVMPolly.so -mllvm -polly -mllvm -polly-optimizer=none
   -mllvm -polly-code-generator=none

o Some numbers are again fishy:

adi: In your previous report you reported 0.953 seconds; the website now
     says 1.839 seconds.

ludcmp: In your previous report you reported 0.391 seconds; the website
        now says 1.346 seconds.

instrument: It seems you rounded the previous numbers to one significant
            digit and calculated the performance difference from the
            rounded numbers. I would prefer that you use the
            original numbers and round only when
            displaying/printing the results.
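
The rounding issue can be illustrated with the instrument.c values from the first report (0.149 s for Clang, 0.837 s for Polly-optimize). The sketch below is only an illustration; the helper names `penalty_percent` and `one_significant_digit` are hypothetical. It computes the penalty once from the raw numbers and once from inputs rounded to one significant digit, and rounds the result only when printing.

```python
def penalty_percent(baseline, measured):
    """Relative compile-time overhead of `measured` over `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


def one_significant_digit(x):
    """Round x to one significant digit (used here only to show the pitfall)."""
    return float(f"{x:.1g}")


if __name__ == "__main__":
    # instrument.c values from the first report, in seconds.
    clang, polly_opt = 0.149, 0.837

    from_raw = penalty_percent(clang, polly_opt)
    from_rounded = penalty_percent(one_significant_digit(clang),
                                   one_significant_digit(polly_opt))

    # Round only here, when printing.
    print(f"from raw numbers:     {from_raw:.1f}%")
    print(f"from rounded numbers: {from_rounded:.1f}%")
```

The raw inputs give roughly 461.7%, as in the first table, while the rounded inputs (0.1 and 0.8) give roughly 700%, so rounding before the division changes the reported penalty substantially.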
> Tobias, I have made a simple HTML page (attached
> polly-compiling-overhead.html) to show the experimental data and my plans for
> this project. I think a public webpage would be helpful for our further
> discussion. If possible, could you put it on the Polly website (either a public
> link or a temporary webpage)?
Yes, I believe a website is a very good start to illustrate your
findings and to organize the information that you got. For now I propose
to host it yourself as I expect it to change often and you waiting for
me to add changes just adds overhead (there are plenty of free hosting
services). We can later move it to the Polly website.

Some comments on the content:

- Just put your name as the person who runs the project

I appreciate that you put my name on the top, but this is work you
started and that you will use as a summer of code project application.
So you should be the only person mentioned there.

- Cite properly

Also, as this will later become an application, I believe it is
necessary to make clear what part of the document comes from you and
which part was something you got from reviews/external sources.
Specifically, if you copy a larger text from one of my emails, please
mark it accordingly.

- Typo

'memeory'
> I think I will try to remove unnecessary code transformations for
> canonicalization in the next step.
Are you referring to the region simplification change I proposed
earlier? I believe this is a good change to work on, as it is simple,
self-contained, and also a conceptual cleanup.

After this patch, I believe it is necessary to get more details about
your performance numbers to understand better where your work will be
beneficial.

All the best,
Tobi
