thr3ads.net - llvm dev - [llvm-dev] llvm (the middle-end) is getting slower, December edition [Dec 2016]

If this information is useful, please help other people find it:
Share via:

Davide Italiano via llvm-dev

2016-Dec-17 21:35 UTC

[llvm-dev] llvm (the middle-end) is getting slower, December edition

First of all, sorry for the long mail.
Inspired by the excellent analysis Rui did for lld, I decided to do
the same for llvm.
I'm personally very interested in build-time for LTO configuration,
with particular attention to the time spent in the optimizer.
Rafael did something similar back in March, so this can be considered
as an update. This tries to include a more accurate high-level
analysis of where llvm is spending CPU cycles.
Here I present 2 cases: clang building itself with `-flto` (Full), and
clang building an internal codebase which I'm going to refer as
`game7`.
It's a mid-sized program (it's actually a game), more or less of the
size of clang, which we use internally as benchmark to track
compile-time/runtime improvements/regression.
I picked two random revisions of llvm: trunk (December 16th 2016) and
trunk (June 2nd 2016), so, roughly, 6 months period.
My setup is a Mac Pro running Linux (NixOS).
These are the numbers I collected (including the output of -mllvm -time-passes).
For clang:

June 2nd:
real    22m9.278s
user    21m30.410s
sys     0m38.834s
  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
24.3%)  X86 DAG->DAG Instruction Selection
  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
7.7%)  Global Value Numbering
  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
5.0%)  Function Integration/Inlining
  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
4.7%)  Combine redundant instructions
  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
4.3%)  Combine redundant instructions
  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
4.1%)  Loop Strength Reduction
  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
3.8%)  Greedy Register Allocator
  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
3.0%)  Induction Variable Simplification
  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
2.9%)  Combine redundant instructions
  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
2.7%)  Combine redundant instructions
  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
2.0%)  Combine redundant instructions

Dec 16th:
real    27m34.922s
user    26m53.489s
sys     0m41.533s

  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
19.3%)  X86 DAG->DAG Instruction Selection
  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
12.5%)  Function Integration/Inlining
  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
6.8%)  Global Value Numbering
  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
5.7%)  Combine redundant instructions
  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
5.0%)  Combine redundant instructions
  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
3.5%)  Combine redundant instructions
  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
3.3%)  Greedy Register Allocator
  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
3.3%)  Combine redundant instructions
  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
3.1%)  Loop Strength Reduction
  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
2.6%)  Induction Variable Simplification
  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
2.4%)  Combine redundant instructions

so, llvm is around 20% slower than it used to be.

For our internal codebase the situation seems slightly worse:

`game7`

June 2nd:

Total Execution Time: 464.3920 seconds (463.8986 wall clock)

  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
20.3%)  X86 DAG->DAG Instruction Selection
  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
9.4%)  X86 Assembly / Object Emitter
  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
7.6%)  Function Integration/Inlining
  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
6.1%)  Global Value Numbering
  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
4.8%)  Combine redundant instructions
  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
4.2%)  Post RA top-down list latency scheduler
  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
3.5%)  Combine redundant instructions

Dec 16th:

Total Execution Time: 861.0898 seconds (860.5808 wall clock)

  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
15.2%)  Combine redundant instructions
  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
11.7%)  Combine redundant instructions
  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
11.6%)  X86 DAG->DAG Instruction Selection
  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
8.1%)  Combine redundant instructions
  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
7.8%)  Function Integration/Inlining
  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
6.8%)  Global Value Numbering
  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
6.4%)  Combine redundant instructions

so, using LTO, lld takes 2x to build what it used to take (and all the
extra time seems spent in the optimizer).

As an (extra) experiment, I decided to take the unoptimized output of
game7 (via lld -save-temps) and pass to -opt -O2. That shows another
significant regression (with different characteristics).

June 2nd:
time opt -O2
real    6m23.016s
user   6m20.900s
sys     0m2.113s

35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
 Function Integration/Inlining
33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
 Global Value Numbering
27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
 Bitcode Writer
25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
 Combine redundant instructions
19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
 Combine redundant instructions
18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
 Combine redundant instructions
16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
 Combine redundant instructions
13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
 Combine redundant instructions
13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
 Combine redundant instructions

Dec 16th:

real    20m10.734s
user    20m8.523s
sys     0m2.197s

  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
17.5%)  Value Propagation
  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
15.1%)  Value Propagation
  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
7.7%)  Combine redundant instructions
  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
6.1%)  Combine redundant instructions
  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
6.1%)  Combine redundant instructions
  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
5.6%)  Combine redundant instructions
  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
5.5%)  Combine redundant instructions
  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
5.2%)  Function Integration/Inlining
  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
4.6%)  Combine redundant instructions
  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
4.2%)  Combine redundant instructions
  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
4.0%)  Global Value Numbering

I don't have an infrastructure to measure the runtime performance
benefits/regression of clang, but I have for `game7`.
I wasn't able to notice any fundamental speedup (at least, not
something that justifies a 2x build-time).

tl;dr:
There are quite a few things to notice:
1) GVN used to be the top pass in the middle-end, in some cases, and
pretty much always in the top-3. This is not the case anymore, but
it's still a pass where we spend a lot of time. This is being worked
on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
some hope that will be sorted out (or at least there's a plan for it).
2) For clang, we spend 35% more time inside instcombine, and for game7
instcombine seems to largely dominate the amount of time we spend
optimizing IR. I tried to bisect (which is not easy considering the
test takes a long time to run), but I wasn't able to identify a single
point in time responsible for the regression. It seems to be an
additive effect. My wild (or not so wild) guess is that every day
somebody adds a matcher of two because that improves their testcase,
and at some point all this things add up. I'll try to do some
additional profiling but I guess large part of our time is spent
solving bitwise-domain dataflow problems (ComputeKnownBits et
similia). Once GVN will be in a more stable state, I plan to
experiment with caching results.
3) Something peculiar is that we spend 2x time in the inliner. I'm not
familiar with the inliner, IIRC there were some changes to threshold
recently, so any help here will be appreciated (both in reproducing
the results and with analysis).
4) For the last testcase (opt -O2 on large unoptimized chunk of
bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
think it's not as lazy as it claims to be (or at least, the way we use
it). This doesn't show up in a full LTO run because we don't run CVP
as part of the default LTO pipeline, but the case shows how LVI can be
a bottleneck for large TUs (or at least how -O2 compile time degraded
on large TUs). I haven't thought about the problem very carefully, but
there seems to be some progress on this front (
https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
original bitcode file but I can probably do some profiling on it as
well.

As next steps I'll try to get a more detailed analysis of the
problems. In particular, try to do what Rui did for lld but with more
coarse granularity (every week) to have a chart of the compile time
trend for these cases over the last 6 months, and post here.

I think (I know) some people are aware of the problems I outline in
this e-mail. But apparently not everybody. We're in a situation where
compile time is increasing without real control. I'm happy that Apple
is doing a serious effort to track build-time, so hopefully things
will improve. There are, although, some cases (like adding matchers in
instcombine or knobs) where the compile time regression is hard to
track until it's too late. LLVM as a project tries not to stop people
trying to get things done and that's great, but from time to time it's
good to take a step back and re-evaluate approaches.
The purpose of this e-mail was to outline where we regressed, for
those interested.

Thanks for your time, and of course, feedback welcome!

-- 
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

Andrew Kelley via llvm-dev

2016-Dec-17 21:41 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

Davide,

Thank you for this analysis.

As a front-end developer using LLVM I have one piece of feedback:

I'm not really concerned about compilation time for -O2 or -O3, but I am
extremely interested in minimizing compiliation time for -O0.

Thank you for all your hard work,
Andrew (http://ziglang.org/)

On Sat, Dec 17, 2016 at 4:35 PM, Davide Italiano via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of
the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm
> -time-passes).
> For clang:
>
> June 2nd:
> real    22m9.278s
> user    21m30.410s
> sys     0m38.834s
>   Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>   289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> 24.3%)  X86 DAG->DAG Instruction Selection
>   97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> 7.7%)  Global Value Numbering
>   62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> 5.0%)  Function Integration/Inlining
>   58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> 4.7%)  Combine redundant instructions
>   53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> 4.3%)  Combine redundant instructions
>   51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> 4.1%)  Loop Strength Reduction
>   47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> 3.8%)  Greedy Register Allocator
>   36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> 3.0%)  Induction Variable Simplification
>   37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> 2.9%)  Combine redundant instructions
>   34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> 2.7%)  Combine redundant instructions
>   25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> 2.0%)  Combine redundant instructions
>
> Dec 16th:
> real    27m34.922s
> user    26m53.489s
> sys     0m41.533s
>
>   287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> 19.3%)  X86 DAG->DAG Instruction Selection
>   197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> 12.5%)  Function Integration/Inlining
>   106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> 6.8%)  Global Value Numbering
>   89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> 5.7%)  Combine redundant instructions
>   79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> 5.0%)  Combine redundant instructions
>   55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> 3.5%)  Combine redundant instructions
>   51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> 3.3%)  Greedy Register Allocator
>   52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> 3.3%)  Combine redundant instructions
>   49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> 3.1%)  Loop Strength Reduction
>   41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> 2.6%)  Induction Variable Simplification
>   38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> 2.4%)  Combine redundant instructions
>
> so, llvm is around 20% slower than it used to be.
>
> For our internal codebase the situation seems slightly worse:
>
> `game7`
>
> June 2nd:
>
> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
>
>   88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> 20.3%)  X86 DAG->DAG Instruction Selection
>   27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> 9.4%)  X86 Assembly / Object Emitter
>   34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> 7.6%)  Function Integration/Inlining
>   27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> 6.1%)  Global Value Numbering
>   22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> 4.8%)  Combine redundant instructions
>   19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> 4.2%)  Post RA top-down list latency scheduler
>   15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> 3.5%)  Combine redundant instructions
>
> Dec 16th:
>
> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
>
>   135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> 15.2%)  Combine redundant instructions
>   103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> 11.7%)  Combine redundant instructions
>   97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> 11.6%)  X86 DAG->DAG Instruction Selection
>   72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> 8.1%)  Combine redundant instructions
>   69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> 7.8%)  Function Integration/Inlining
>   60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> 6.8%)  Global Value Numbering
>   56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> 6.4%)  Combine redundant instructions
>
> so, using LTO, lld takes 2x to build what it used to take (and all the
> extra time seems spent in the optimizer).
>
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
>
> June 2nd:
> time opt -O2
> real    6m23.016s
> user   6m20.900s
> sys     0m2.113s
>
> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
>  Function Integration/Inlining
> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
>  Global Value Numbering
> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
>  Bitcode Writer
> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
>  Combine redundant instructions
> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
>  Combine redundant instructions
> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
>  Combine redundant instructions
> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
>  Combine redundant instructions
> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
>  Combine redundant instructions
> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
>  Combine redundant instructions
>
> Dec 16th:
>
> real    20m10.734s
> user    20m8.523s
> sys     0m2.197s
>
>   208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> 17.5%)  Value Propagation
>   179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> 15.1%)  Value Propagation
>   92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> 7.7%)  Combine redundant instructions
>   72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> 6.1%)  Combine redundant instructions
>   72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> 6.1%)  Combine redundant instructions
>   66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> 5.6%)  Combine redundant instructions
>   65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> 5.5%)  Combine redundant instructions
>   61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> 5.2%)  Function Integration/Inlining
>   54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> 4.6%)  Combine redundant instructions
>   50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> 4.2%)  Combine redundant instructions
>   47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> 4.0%)  Global Value Numbering
>
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
>
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run
CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
>
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
>
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time
it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
>
> Thanks for your time, and of course, feedback welcome!
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/00a09acf/attachment.html>

Michael Gottesman via llvm-dev

2016-Dec-18 01:19 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of
the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm
-time-passes).
> For clang:
> 
> June 2nd:
> real    22m9.278s
> user    21m30.410s
> sys     0m38.834s
>  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> 24.3%)  X86 DAG->DAG Instruction Selection
>  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> 7.7%)  Global Value Numbering
>  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> 5.0%)  Function Integration/Inlining
>  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> 4.7%)  Combine redundant instructions
>  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> 4.3%)  Combine redundant instructions
>  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> 4.1%)  Loop Strength Reduction
>  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> 3.8%)  Greedy Register Allocator
>  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> 3.0%)  Induction Variable Simplification
>  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> 2.9%)  Combine redundant instructions
>  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> 2.7%)  Combine redundant instructions
>  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> 2.0%)  Combine redundant instructions
> 
> Dec 16th:
> real    27m34.922s
> user    26m53.489s
> sys     0m41.533s
> 
>  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> 19.3%)  X86 DAG->DAG Instruction Selection
>  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> 12.5%)  Function Integration/Inlining
>  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> 6.8%)  Global Value Numbering
>  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> 5.7%)  Combine redundant instructions
>  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> 5.0%)  Combine redundant instructions
>  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> 3.5%)  Combine redundant instructions
>  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> 3.3%)  Greedy Register Allocator
>  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> 3.3%)  Combine redundant instructions
>  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> 3.1%)  Loop Strength Reduction
>  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> 2.6%)  Induction Variable Simplification
>  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> 2.4%)  Combine redundant instructions
> 
> so, llvm is around 20% slower than it used to be.
> 
> For our internal codebase the situation seems slightly worse:
> 
> `game7`
> 
> June 2nd:
> 
> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
> 
>  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> 20.3%)  X86 DAG->DAG Instruction Selection
>  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> 9.4%)  X86 Assembly / Object Emitter
>  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> 7.6%)  Function Integration/Inlining
>  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> 6.1%)  Global Value Numbering
>  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> 4.8%)  Combine redundant instructions
>  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> 4.2%)  Post RA top-down list latency scheduler
>  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> 3.5%)  Combine redundant instructions
> 
> Dec 16th:
> 
> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
> 
>  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> 15.2%)  Combine redundant instructions
>  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> 11.7%)  Combine redundant instructions
>  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> 11.6%)  X86 DAG->DAG Instruction Selection
>  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> 8.1%)  Combine redundant instructions
>  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> 7.8%)  Function Integration/Inlining
>  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> 6.8%)  Global Value Numbering
>  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> 6.4%)  Combine redundant instructions
> 
> so, using LTO, lld takes 2x to build what it used to take (and all the
> extra time seems spent in the optimizer).
> 
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
> 
> June 2nd:
> time opt -O2
> real    6m23.016s
> user   6m20.900s
> sys     0m2.113s
> 
> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
> Function Integration/Inlining
> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
> Global Value Numbering
> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
> Bitcode Writer
> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
> Combine redundant instructions
> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
> Combine redundant instructions
> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
> Combine redundant instructions
> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
> Combine redundant instructions
> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
> Combine redundant instructions
> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
> Combine redundant instructions
> 
> Dec 16th:
> 
> real    20m10.734s
> user    20m8.523s
> sys     0m2.197s
> 
>  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> 17.5%)  Value Propagation
>  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> 15.1%)  Value Propagation
>  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> 7.7%)  Combine redundant instructions
>  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> 6.1%)  Combine redundant instructions
>  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> 6.1%)  Combine redundant instructions
>  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> 5.6%)  Combine redundant instructions
>  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> 5.5%)  Combine redundant instructions
>  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> 5.2%)  Function Integration/Inlining
>  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> 4.6%)  Combine redundant instructions
>  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> 4.2%)  Combine redundant instructions
>  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> 4.0%)  Global Value Numbering
> 
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
> 
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.
We have seen a similar thing when compiling the swift standard library. We have
talked about making a small simple instcombine pass that doesn't iterate to
a fixed point (IIRC). IIRC Andy/Mark looked at this (so my memory might be
wrong).
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run
CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
> 
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
> 
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time
it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
> 
> Thanks for your time, and of course, feedback welcome!
> 
> -- 
> Davide
> 
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Davide Italiano via llvm-dev

2016-Dec-18 02:32 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at freebsd.org>
wrote:
[...]
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
>
Just to provide numbers (using Sean's Mathematica template, thanks),
here's a plot of the CDF of the frames in a particular level/scene.
The curves pretty much match, and the picture in the middle shows a
relative difference of 0.4% between Jun and Dec (which could be very
possibly be in the noise).
On the same scene, the difference between -O3 and -flto is 12%, FWIW.
So it seems that at least for this particular case all the compile
time didn't buy any runtime improvement.

-- 
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
-------------- next part --------------
A non-text attachment was scrubbed...
Name: flto_regression.png
Type: image/png
Size: 131534 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/12c8a067/attachment-0001.png>

Mehdi Amini via llvm-dev

2016-Dec-18 02:39 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of
the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm
-time-passes).
> For clang:
> 
> June 2nd:
> real    22m9.278s
> user    21m30.410s
> sys     0m38.834s
>  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> 24.3%)  X86 DAG->DAG Instruction Selection
>  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> 7.7%)  Global Value Numbering
>  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> 5.0%)  Function Integration/Inlining
>  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> 4.7%)  Combine redundant instructions
>  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> 4.3%)  Combine redundant instructions
>  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> 4.1%)  Loop Strength Reduction
>  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> 3.8%)  Greedy Register Allocator
>  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> 3.0%)  Induction Variable Simplification
>  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> 2.9%)  Combine redundant instructions
>  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> 2.7%)  Combine redundant instructions
>  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> 2.0%)  Combine redundant instructions
> 
> Dec 16th:
> real    27m34.922s
> user    26m53.489s
> sys     0m41.533s
> 
>  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> 19.3%)  X86 DAG->DAG Instruction Selection
>  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> 12.5%)  Function Integration/Inlining
>  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> 6.8%)  Global Value Numbering
>  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> 5.7%)  Combine redundant instructions
>  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> 5.0%)  Combine redundant instructions
>  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> 3.5%)  Combine redundant instructions
>  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> 3.3%)  Greedy Register Allocator
>  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> 3.3%)  Combine redundant instructions
>  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> 3.1%)  Loop Strength Reduction
>  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> 2.6%)  Induction Variable Simplification
>  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> 2.4%)  Combine redundant instructions
> 
> so, llvm is around 20% slower than it used to be.
> 
> For our internal codebase the situation seems slightly worse:
> 
> `game7`
> 
> June 2nd:
> 
> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
> 
>  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> 20.3%)  X86 DAG->DAG Instruction Selection
>  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> 9.4%)  X86 Assembly / Object Emitter
>  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> 7.6%)  Function Integration/Inlining
>  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> 6.1%)  Global Value Numbering
>  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> 4.8%)  Combine redundant instructions
>  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> 4.2%)  Post RA top-down list latency scheduler
>  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> 3.5%)  Combine redundant instructions
> 
> Dec 16th:
> 
> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
> 
>  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> 15.2%)  Combine redundant instructions
>  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> 11.7%)  Combine redundant instructions
>  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> 11.6%)  X86 DAG->DAG Instruction Selection
>  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> 8.1%)  Combine redundant instructions
>  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> 7.8%)  Function Integration/Inlining
>  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> 6.8%)  Global Value Numbering
>  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> 6.4%)  Combine redundant instructions
> 
> so, using LTO, lld takes 2x to build what it used to take (and all the
> extra time seems spent in the optimizer).
> 
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
> 
> June 2nd:
> time opt -O2
> real    6m23.016s
> user   6m20.900s
> sys     0m2.113s
> 
> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
> Function Integration/Inlining
> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
> Global Value Numbering
> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
> Bitcode Writer
> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
> Combine redundant instructions
> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
> Combine redundant instructions
> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
> Combine redundant instructions
> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
> Combine redundant instructions
> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
> Combine redundant instructions
> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
> Combine redundant instructions
> 
> Dec 16th:
> 
> real    20m10.734s
> user    20m8.523s
> sys     0m2.197s
> 
>  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> 17.5%)  Value Propagation
>  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> 15.1%)  Value Propagation
>  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> 7.7%)  Combine redundant instructions
>  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> 6.1%)  Combine redundant instructions
>  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> 6.1%)  Combine redundant instructions
>  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> 5.6%)  Combine redundant instructions
>  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> 5.5%)  Combine redundant instructions
>  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> 5.2%)  Function Integration/Inlining
>  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> 4.6%)  Combine redundant instructions
>  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> 4.2%)  Combine redundant instructions
>  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> 4.0%)  Global Value Numbering
> 
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
> 
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression.
An efficient way to bisect this is to:

1) dump the IR right before instcombine, and then run only opt -instcombine and
confirm the regression shows up.
2) Then reduce the input: you should be able to single out a single function
ultimately. (Maybe with bugpoint or with -opt-bisect-limit)
3) With a single function that shows the regression, it should be fairly easy to
plot the time to run opt -inst-combine for almost every revision between June
and now.
 
— 
Mehdi


> It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run
CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
> 
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
> 
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time
it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
> 
> Thanks for your time, and of course, feedback welcome!
> 
> -- 
> Davide
> 
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Davide Italiano via llvm-dev

2016-Dec-18 02:53 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Sat, Dec 17, 2016 at 6:39 PM, Mehdi Amini <mehdi.amini at apple.com>
wrote:>
>> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev
at lists.llvm.org> wrote:
>>
>> First of all, sorry for the long mail.
>> Inspired by the excellent analysis Rui did for lld, I decided to do
>> the same for llvm.
>> I'm personally very interested in build-time for LTO configuration,
>> with particular attention to the time spent in the optimizer.
>> Rafael did something similar back in March, so this can be considered
>> as an update. This tries to include a more accurate high-level
>> analysis of where llvm is spending CPU cycles.
>> Here I present 2 cases: clang building itself with `-flto` (Full), and
>> clang building an internal codebase which I'm going to refer as
>> `game7`.
>> It's a mid-sized program (it's actually a game), more or less
of the
>> size of clang, which we use internally as benchmark to track
>> compile-time/runtime improvements/regression.
>> I picked two random revisions of llvm: trunk (December 16th 2016) and
>> trunk (June 2nd 2016), so, roughly, 6 months period.
>> My setup is a Mac Pro running Linux (NixOS).
>> These are the numbers I collected (including the output of -mllvm
-time-passes).
>> For clang:
>>
>> June 2nd:
>> real    22m9.278s
>> user    21m30.410s
>> sys     0m38.834s
>>  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>>  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
>> 24.3%)  X86 DAG->DAG Instruction Selection
>>  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
>> 7.7%)  Global Value Numbering
>>  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
>> 5.0%)  Function Integration/Inlining
>>  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
>> 4.7%)  Combine redundant instructions
>>  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
>> 4.3%)  Combine redundant instructions
>>  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
>> 4.1%)  Loop Strength Reduction
>>  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
>> 3.8%)  Greedy Register Allocator
>>  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
>> 3.0%)  Induction Variable Simplification
>>  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
>> 2.9%)  Combine redundant instructions
>>  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
>> 2.7%)  Combine redundant instructions
>>  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
>> 2.0%)  Combine redundant instructions
>>
>> Dec 16th:
>> real    27m34.922s
>> user    26m53.489s
>> sys     0m41.533s
>>
>>  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
>> 19.3%)  X86 DAG->DAG Instruction Selection
>>  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
>> 12.5%)  Function Integration/Inlining
>>  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
>> 6.8%)  Global Value Numbering
>>  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
>> 5.7%)  Combine redundant instructions
>>  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
>> 5.0%)  Combine redundant instructions
>>  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
>> 3.5%)  Combine redundant instructions
>>  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
>> 3.3%)  Greedy Register Allocator
>>  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
>> 3.3%)  Combine redundant instructions
>>  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
>> 3.1%)  Loop Strength Reduction
>>  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
>> 2.6%)  Induction Variable Simplification
>>  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
>> 2.4%)  Combine redundant instructions
>>
>> so, llvm is around 20% slower than it used to be.
>>
>> For our internal codebase the situation seems slightly worse:
>>
>> `game7`
>>
>> June 2nd:
>>
>> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
>>
>>  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
>> 20.3%)  X86 DAG->DAG Instruction Selection
>>  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
>> 9.4%)  X86 Assembly / Object Emitter
>>  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
>> 7.6%)  Function Integration/Inlining
>>  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
>> 6.1%)  Global Value Numbering
>>  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
>> 4.8%)  Combine redundant instructions
>>  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
>> 4.2%)  Post RA top-down list latency scheduler
>>  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
>> 3.5%)  Combine redundant instructions
>>
>> Dec 16th:
>>
>> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
>>
>>  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
>> 15.2%)  Combine redundant instructions
>>  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
>> 11.7%)  Combine redundant instructions
>>  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
>> 11.6%)  X86 DAG->DAG Instruction Selection
>>  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
>> 8.1%)  Combine redundant instructions
>>  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
>> 7.8%)  Function Integration/Inlining
>>  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
>> 6.8%)  Global Value Numbering
>>  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
>> 6.4%)  Combine redundant instructions
>>
>> so, using LTO, lld takes 2x to build what it used to take (and all the
>> extra time seems spent in the optimizer).
>>
>> As an (extra) experiment, I decided to take the unoptimized output of
>> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
>> significant regression (with different characteristics).
>>
>> June 2nd:
>> time opt -O2
>> real    6m23.016s
>> user   6m20.900s
>> sys     0m2.113s
>>
>> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
>> Function Integration/Inlining
>> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
>> Global Value Numbering
>> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
>> Bitcode Writer
>> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
>> Combine redundant instructions
>> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
>> Combine redundant instructions
>> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
>> Combine redundant instructions
>> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
>> Combine redundant instructions
>> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
>> Combine redundant instructions
>> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
>> Combine redundant instructions
>>
>> Dec 16th:
>>
>> real    20m10.734s
>> user    20m8.523s
>> sys     0m2.197s
>>
>>  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
>> 17.5%)  Value Propagation
>>  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
>> 15.1%)  Value Propagation
>>  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
>> 7.7%)  Combine redundant instructions
>>  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
>> 6.1%)  Combine redundant instructions
>>  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
>> 6.1%)  Combine redundant instructions
>>  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
>> 5.6%)  Combine redundant instructions
>>  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
>> 5.5%)  Combine redundant instructions
>>  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
>> 5.2%)  Function Integration/Inlining
>>  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
>> 4.6%)  Combine redundant instructions
>>  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
>> 4.2%)  Combine redundant instructions
>>  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
>> 4.0%)  Global Value Numbering
>>
>> I don't have an infrastructure to measure the runtime performance
>> benefits/regression of clang, but I have for `game7`.
>> I wasn't able to notice any fundamental speedup (at least, not
>> something that justifies a 2x build-time).
>>
>> tl;dr:
>> There are quite a few things to notice:
>> 1) GVN used to be the top pass in the middle-end, in some cases, and
>> pretty much always in the top-3. This is not the case anymore, but
>> it's still a pass where we spend a lot of time. This is being
worked
>> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so
there's
>> some hope that will be sorted out (or at least there's a plan for
it).
>> 2) For clang, we spend 35% more time inside instcombine, and for game7
>> instcombine seems to largely dominate the amount of time we spend
>> optimizing IR. I tried to bisect (which is not easy considering the
>> test takes a long time to run), but I wasn't able to identify a
single
>> point in time responsible for the regression.
>
> An efficient way to bisect this is to:
>
> 1) dump the IR right before instcombine, and then run only opt -instcombine
and confirm the regression shows up.
> 2) Then reduce the input: you should be able to single out a single
function ultimately. (Maybe with bugpoint or with -opt-bisect-limit)
> 3) With a single function that shows the regression, it should be fairly
easy to plot the time to run opt -inst-combine for almost every revision between
June and now.
>
I tried 1) and I'm able to reproduce the increase in compile time. 2)
is on my todolist. I plan to use (and I can see how you can use)
bugpoint or delta (with `ulimit`), but it's not entirely clear to me
how to reduce using -opt-bisect-limit. As far as I know that just runs
passes up to a given point of the pipeline, while here the regression
shows up also with a single pass, i.e. opt -instcombine. Can you
please elaborate?

-- 
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

Sean Silva via llvm-dev

2016-Dec-18 03:19 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Sat, Dec 17, 2016 at 5:19 PM, Michael Gottesman via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> > On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >
> > First of all, sorry for the long mail.
> > Inspired by the excellent analysis Rui did for lld, I decided to do
> > the same for llvm.
> > I'm personally very interested in build-time for LTO
configuration,
> > with particular attention to the time spent in the optimizer.
> > Rafael did something similar back in March, so this can be considered
> > as an update. This tries to include a more accurate high-level
> > analysis of where llvm is spending CPU cycles.
> > Here I present 2 cases: clang building itself with `-flto` (Full), and
> > clang building an internal codebase which I'm going to refer as
> > `game7`.
> > It's a mid-sized program (it's actually a game), more or less
of the
> > size of clang, which we use internally as benchmark to track
> > compile-time/runtime improvements/regression.
> > I picked two random revisions of llvm: trunk (December 16th 2016) and
> > trunk (June 2nd 2016), so, roughly, 6 months period.
> > My setup is a Mac Pro running Linux (NixOS).
> > These are the numbers I collected (including the output of -mllvm
> -time-passes).
> > For clang:
> >
> > June 2nd:
> > real    22m9.278s
> > user    21m30.410s
> > sys     0m38.834s
> >  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
> >  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> > 24.3%)  X86 DAG->DAG Instruction Selection
> >  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> > 7.7%)  Global Value Numbering
> >  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> > 5.0%)  Function Integration/Inlining
> >  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> > 4.7%)  Combine redundant instructions
> >  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> > 4.3%)  Combine redundant instructions
> >  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> > 4.1%)  Loop Strength Reduction
> >  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> > 3.8%)  Greedy Register Allocator
> >  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> > 3.0%)  Induction Variable Simplification
> >  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> > 2.9%)  Combine redundant instructions
> >  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> > 2.7%)  Combine redundant instructions
> >  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> > 2.0%)  Combine redundant instructions
> >
> > Dec 16th:
> > real    27m34.922s
> > user    26m53.489s
> > sys     0m41.533s
> >
> >  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> > 19.3%)  X86 DAG->DAG Instruction Selection
> >  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> > 12.5%)  Function Integration/Inlining
> >  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> > 6.8%)  Global Value Numbering
> >  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> > 5.7%)  Combine redundant instructions
> >  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> > 5.0%)  Combine redundant instructions
> >  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> > 3.5%)  Combine redundant instructions
> >  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> > 3.3%)  Greedy Register Allocator
> >  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> > 3.3%)  Combine redundant instructions
> >  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> > 3.1%)  Loop Strength Reduction
> >  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> > 2.6%)  Induction Variable Simplification
> >  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> > 2.4%)  Combine redundant instructions
> >
> > so, llvm is around 20% slower than it used to be.
> >
> > For our internal codebase the situation seems slightly worse:
> >
> > `game7`
> >
> > June 2nd:
> >
> > Total Execution Time: 464.3920 seconds (463.8986 wall clock)
> >
> >  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> > 20.3%)  X86 DAG->DAG Instruction Selection
> >  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> > 9.4%)  X86 Assembly / Object Emitter
> >  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> > 7.6%)  Function Integration/Inlining
> >  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> > 6.1%)  Global Value Numbering
> >  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> > 4.8%)  Combine redundant instructions
> >  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> > 4.2%)  Post RA top-down list latency scheduler
> >  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> > 3.5%)  Combine redundant instructions
> >
> > Dec 16th:
> >
> > Total Execution Time: 861.0898 seconds (860.5808 wall clock)
> >
> >  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> > 15.2%)  Combine redundant instructions
> >  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> > 11.7%)  Combine redundant instructions
> >  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> > 11.6%)  X86 DAG->DAG Instruction Selection
> >  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> > 8.1%)  Combine redundant instructions
> >  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> > 7.8%)  Function Integration/Inlining
> >  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> > 6.8%)  Global Value Numbering
> >  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> > 6.4%)  Combine redundant instructions
> >
> > so, using LTO, lld takes 2x to build what it used to take (and all the
> > extra time seems spent in the optimizer).
> >
> > As an (extra) experiment, I decided to take the unoptimized output of
> > game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> > significant regression (with different characteristics).
> >
> > June 2nd:
> > time opt -O2
> > real    6m23.016s
> > user   6m20.900s
> > sys     0m2.113s
> >
> > 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
> > Function Integration/Inlining
> > 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
> > Global Value Numbering
> > 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
> > Bitcode Writer
> > 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
> > Combine redundant instructions
> > 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
> > Combine redundant instructions
> > 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
> > Combine redundant instructions
> > 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
> > Combine redundant instructions
> > 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
> > Combine redundant instructions
> > 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
> > Combine redundant instructions
> >
> > Dec 16th:
> >
> > real    20m10.734s
> > user    20m8.523s
> > sys     0m2.197s
> >
> >  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> > 17.5%)  Value Propagation
> >  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> > 15.1%)  Value Propagation
> >  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> > 7.7%)  Combine redundant instructions
> >  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> > 6.1%)  Combine redundant instructions
> >  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> > 6.1%)  Combine redundant instructions
> >  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> > 5.6%)  Combine redundant instructions
> >  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> > 5.5%)  Combine redundant instructions
> >  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> > 5.2%)  Function Integration/Inlining
> >  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> > 4.6%)  Combine redundant instructions
> >  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> > 4.2%)  Combine redundant instructions
> >  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> > 4.0%)  Global Value Numbering
> >
> > I don't have an infrastructure to measure the runtime performance
> > benefits/regression of clang, but I have for `game7`.
> > I wasn't able to notice any fundamental speedup (at least, not
> > something that justifies a 2x build-time).
> >
> > tl;dr:
> > There are quite a few things to notice:
> > 1) GVN used to be the top pass in the middle-end, in some cases, and
> > pretty much always in the top-3. This is not the case anymore, but
> > it's still a pass where we spend a lot of time. This is being
worked
> > on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so
there's
> > some hope that will be sorted out (or at least there's a plan for
it).
> > 2) For clang, we spend 35% more time inside instcombine, and for game7
> > instcombine seems to largely dominate the amount of time we spend
> > optimizing IR. I tried to bisect (which is not easy considering the
> > test takes a long time to run), but I wasn't able to identify a
single
> > point in time responsible for the regression. It seems to be an
> > additive effect. My wild (or not so wild) guess is that every day
> > somebody adds a matcher of two because that improves their testcase,
> > and at some point all this things add up. I'll try to do some
> > additional profiling but I guess large part of our time is spent
> > solving bitwise-domain dataflow problems (ComputeKnownBits et
> > similia). Once GVN will be in a more stable state, I plan to
> > experiment with caching results.
>
> We have seen a similar thing when compiling the swift standard library. We
> have talked about making a small simple instcombine pass that doesn't
> iterate to a fixed point (IIRC). IIRC Andy/Mark looked at this (so my
> memory might be wrong).
>
I would expect that if the slowdown was an increase in time spent iterating
to a fixed point, we would also be matching more patterns and simplifying
the code more, but the small performance delta on game7 doesn't seem to
corroborate that.

One issue with instcombine is that its matching time (even if it doesn't
find anything) is still O(# of patterns tested). That is very consistent
with the slowdown model where as new checks are added, it gradually gets
slower. Approaches like Nuno's Alive avoids this overhead by building a
matching automaton. From the above, it sounds like we're spending 10's
of
percents of time in instcombine, so it may be worth it.

Without having thought about it too closely, my gut reaction is that I
don't think it's a good idea to make an "instcombine V2";
incrementally
improving the one we already have (e.g. moving a large portion of its
matches into an automaton-based matcher; or conditionalizing (deleting?)
checks that aren't worth it) seems like a better long-term approach.

-- Sean Silva


>
> > 3) Something peculiar is that we spend 2x time in the inliner. I'm
not
> > familiar with the inliner, IIRC there were some changes to threshold
> > recently, so any help here will be appreciated (both in reproducing
> > the results and with analysis).
> > 4) For the last testcase (opt -O2 on large unoptimized chunk of
> > bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> > think it's not as lazy as it claims to be (or at least, the way we
use
> > it). This doesn't show up in a full LTO run because we don't
run CVP
> > as part of the default LTO pipeline, but the case shows how LVI can be
> > a bottleneck for large TUs (or at least how -O2 compile time degraded
> > on large TUs). I haven't thought about the problem very carefully,
but
> > there seems to be some progress on this front (
> > https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> > original bitcode file but I can probably do some profiling on it as
> > well.
> >
> > As next steps I'll try to get a more detailed analysis of the
> > problems. In particular, try to do what Rui did for lld but with more
> > coarse granularity (every week) to have a chart of the compile time
> > trend for these cases over the last 6 months, and post here.
> >
> > I think (I know) some people are aware of the problems I outline in
> > this e-mail. But apparently not everybody. We're in a situation
where
> > compile time is increasing without real control. I'm happy that
Apple
> > is doing a serious effort to track build-time, so hopefully things
> > will improve. There are, although, some cases (like adding matchers in
> > instcombine or knobs) where the compile time regression is hard to
> > track until it's too late. LLVM as a project tries not to stop
people
> > trying to get things done and that's great, but from time to time
it's
> > good to take a step back and re-evaluate approaches.
> > The purpose of this e-mail was to outline where we regressed, for
> > those interested.
> >
> > Thanks for your time, and of course, feedback welcome!
> >
> > --
> > Davide
> >
> > "There are no solved problems; there are only problems that are
more
> > or less solved" -- Henri Poincare
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/570669cd/attachment-0001.html>

Sean Silva via llvm-dev

2016-Dec-18 03:25 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of
the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm
> -time-passes).
> For clang:
>
> June 2nd:
> real    22m9.278s
> user    21m30.410s
> sys     0m38.834s
>   Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>   289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> 24.3%)  X86 DAG->DAG Instruction Selection
>   97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> 7.7%)  Global Value Numbering
>   62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> 5.0%)  Function Integration/Inlining
>   58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> 4.7%)  Combine redundant instructions
>   53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> 4.3%)  Combine redundant instructions
>   51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> 4.1%)  Loop Strength Reduction
>   47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> 3.8%)  Greedy Register Allocator
>   36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> 3.0%)  Induction Variable Simplification
>   37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> 2.9%)  Combine redundant instructions
>   34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> 2.7%)  Combine redundant instructions
>   25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> 2.0%)  Combine redundant instructions
>
> Dec 16th:
> real    27m34.922s
> user    26m53.489s
> sys     0m41.533s
>
>   287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> 19.3%)  X86 DAG->DAG Instruction Selection
>   197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> 12.5%)  Function Integration/Inlining
>   106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> 6.8%)  Global Value Numbering
>   89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> 5.7%)  Combine redundant instructions
>   79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> 5.0%)  Combine redundant instructions
>   55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> 3.5%)  Combine redundant instructions
>   51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> 3.3%)  Greedy Register Allocator
>   52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> 3.3%)  Combine redundant instructions
>   49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> 3.1%)  Loop Strength Reduction
>   41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> 2.6%)  Induction Variable Simplification
>   38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> 2.4%)  Combine redundant instructions
>
> so, llvm is around 20% slower than it used to be.
>
> For our internal codebase the situation seems slightly worse:
>
> `game7`
>
> June 2nd:
>
> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
>
>   88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> 20.3%)  X86 DAG->DAG Instruction Selection
>   27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> 9.4%)  X86 Assembly / Object Emitter
>   34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> 7.6%)  Function Integration/Inlining
>   27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> 6.1%)  Global Value Numbering
>   22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> 4.8%)  Combine redundant instructions
>   19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> 4.2%)  Post RA top-down list latency scheduler
>   15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> 3.5%)  Combine redundant instructions
>
> Dec 16th:
>
> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
>
>   135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> 15.2%)  Combine redundant instructions
>   103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> 11.7%)  Combine redundant instructions
>   97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> 11.6%)  X86 DAG->DAG Instruction Selection
>   72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> 8.1%)  Combine redundant instructions
>   69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> 7.8%)  Function Integration/Inlining
>   60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> 6.8%)  Global Value Numbering
>   56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> 6.4%)  Combine redundant instructions
>
> so, using LTO, lld takes 2x to build what it used to take (and all the
> extra time seems spent in the optimizer).
>
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
>
> June 2nd:
> time opt -O2
> real    6m23.016s
> user   6m20.900s
> sys     0m2.113s
>
> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
>  Function Integration/Inlining
> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
>  Global Value Numbering
> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
>  Bitcode Writer
> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
>  Combine redundant instructions
> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
>  Combine redundant instructions
> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
>  Combine redundant instructions
> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
>  Combine redundant instructions
> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
>  Combine redundant instructions
> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
>  Combine redundant instructions
>
> Dec 16th:
>
> real    20m10.734s
> user    20m8.523s
> sys     0m2.197s
>
>   208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> 17.5%)  Value Propagation
>   179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> 15.1%)  Value Propagation
>   92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> 7.7%)  Combine redundant instructions
>   72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> 6.1%)  Combine redundant instructions
>   72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> 6.1%)  Combine redundant instructions
>   66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> 5.6%)  Combine redundant instructions
>   65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> 5.5%)  Combine redundant instructions
>   61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> 5.2%)  Function Integration/Inlining
>   54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> 4.6%)  Combine redundant instructions
>   50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> 4.2%)  Combine redundant instructions
>   47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> 4.0%)  Global Value Numbering
>
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
>
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run
CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
>
LVI is one of those analyses with quadratic runtime, but has a cutoff to
its search depth so that it is technically not quadratic. So increased
inlining could easily exacerbate it more than non-"quadratic" passes.
(increased inlining would also cause a general slowdown too).

-- Sean Silva

>
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
>
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time
it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
>
> Thanks for your time, and of course, feedback welcome!
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/71d1d768/attachment.html>

Sean Silva via llvm-dev

2016-Dec-18 03:41 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at
freebsd.org>
> wrote:
> [...]
>
> > I don't have an infrastructure to measure the runtime performance
> > benefits/regression of clang, but I have for `game7`.
> > I wasn't able to notice any fundamental speedup (at least, not
> > something that justifies a 2x build-time).
> >
>
> Just to provide numbers (using Sean's Mathematica template, thanks),
> here's a plot of the CDF of the frames in a particular level/scene.
> The curves pretty much match, and the picture in the middle shows a
> relative difference of 0.4% between Jun and Dec (which could be very
> possibly be in the noise).
>
With 5-10 runs per binary you should be able to reliably measure down to
0.5% difference on game7 (50 usec difference per frame). With just one run
per binary like you have here I would not trust a 0.4% difference.

-- Sean Silva

> On the same scene, the difference between -O3 and -flto is 12%, FWIW.
> So it seems that at least for this particular case all the compile
> time didn't buy any runtime improvement.
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/3590d910/attachment.html>

Davide Italiano via llvm-dev

2016-Dec-18 03:48 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Dec 17, 2016 7:41 PM, "Sean Silva" <chisophugis at gmail.com>
wrote:



On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at
freebsd.org>
> wrote:
> [...]
>
> > I don't have an infrastructure to measure the runtime performance
> > benefits/regression of clang, but I have for `game7`.
> > I wasn't able to notice any fundamental speedup (at least, not
> > something that justifies a 2x build-time).
> >
>
> Just to provide numbers (using Sean's Mathematica template, thanks),
> here's a plot of the CDF of the frames in a particular level/scene.
> The curves pretty much match, and the picture in the middle shows a
> relative difference of 0.4% between Jun and Dec (which could be very
> possibly be in the noise).
>
With 5-10 runs per binary you should be able to reliably measure down to
0.5% difference on game7 (50 usec difference per frame). With just one run
per binary like you have here I would not trust a 0.4% difference.


Yeah, I agree. This was mostly a sanity check to understand if there was a
significant improvement at runtime. I ran the same test 7 times in the last
our but the difference is always in the noise, FWIW.


-- Sean Silva

> On the same scene, the difference between -O3 and -flto is 12%, FWIW.
> So it seems that at least for this particular case all the compile
> time didn't buy any runtime improvement.
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/2648cf20/attachment.html>

Daniel Berlin via llvm-dev

2016-Dec-18 04:39 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

>
>
>>
> LVI is one of those analyses with quadratic runtime, but has a cutoff to
> its search depth so that it is technically not quadratic. So increased
> inlining could easily exacerbate it more than non-"quadratic"
passes.
> (increased inlining would also cause a general slowdown too).
>
>LVI is only quadratic because of the way we've built it (it's actually
worse than quadratic,but let's just normalize to "quadratic").
Non-lazy versions are not quadratic, and you could likely build an
incremental lazy version that is also not quadratic in practice.
At some point, none of this will change unless people hold the line
somewhere. Compilers usually get slower 0.1% at a time, not in huge leaps
and bounds.
Without people saying "If we really want to get this case, in a thing not
really designed to get that case sanely, we need to stop and think about
it", it doesn't get thought about.

Obviously, you can go too far into that extreme, but i think we are still
too far on one side of that one (at least, in most places in LLVM)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/3549f873/attachment.html>

Philip Reames via llvm-dev

2016-Dec-18 19:10 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On 12/17/2016 01:35 PM, Davide Italiano via llvm-dev
wrote:> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of
the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm
-time-passes).
> For clang:
>
> June 2nd:
> real    22m9.278s
> user    21m30.410s
> sys     0m38.834s
>    Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>    289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> 24.3%)  X86 DAG->DAG Instruction Selection
>    97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> 7.7%)  Global Value Numbering
>    62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> 5.0%)  Function Integration/Inlining
>    58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> 4.7%)  Combine redundant instructions
>    53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> 4.3%)  Combine redundant instructions
>    51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> 4.1%)  Loop Strength Reduction
>    47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> 3.8%)  Greedy Register Allocator
>    36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> 3.0%)  Induction Variable Simplification
>    37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> 2.9%)  Combine redundant instructions
>    34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> 2.7%)  Combine redundant instructions
>    25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> 2.0%)  Combine redundant instructions
>
> Dec 16th:
> real    27m34.922s
> user    26m53.489s
> sys     0m41.533s
>
>    287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> 19.3%)  X86 DAG->DAG Instruction Selection
>    197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> 12.5%)  Function Integration/Inlining
>    106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> 6.8%)  Global Value Numbering
>    89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> 5.7%)  Combine redundant instructions
>    79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> 5.0%)  Combine redundant instructions
>    55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> 3.5%)  Combine redundant instructions
>    51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> 3.3%)  Greedy Register Allocator
>    52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> 3.3%)  Combine redundant instructions
>    49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> 3.1%)  Loop Strength Reduction
>    41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> 2.6%)  Induction Variable Simplification
>    38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> 2.4%)  Combine redundant instructions
>
> so, llvm is around 20% slower than it used to be.
>
> For our internal codebase the situation seems slightly worse:
>
> `game7`
>
> June 2nd:
>
> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
>
>    88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> 20.3%)  X86 DAG->DAG Instruction Selection
>    27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> 9.4%)  X86 Assembly / Object Emitter
>    34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> 7.6%)  Function Integration/Inlining
>    27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> 6.1%)  Global Value Numbering
>    22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> 4.8%)  Combine redundant instructions
>    19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> 4.2%)  Post RA top-down list latency scheduler
>    15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> 3.5%)  Combine redundant instructions
>
> Dec 16th:
>
> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
>
>    135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> 15.2%)  Combine redundant instructions
>    103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> 11.7%)  Combine redundant instructions
>    97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> 11.6%)  X86 DAG->DAG Instruction Selection
>    72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> 8.1%)  Combine redundant instructions
>    69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> 7.8%)  Function Integration/Inlining
>    60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> 6.8%)  Global Value Numbering
>    56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> 6.4%)  Combine redundant instructions
>
> so, using LTO, lld takes 2x to build what it used to take (and all the
> extra time seems spent in the optimizer).
>
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
>
> June 2nd:
> time opt -O2
> real    6m23.016s
> user   6m20.900s
> sys     0m2.113s
>
> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
>   Function Integration/Inlining
> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
>   Global Value Numbering
> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
>   Bitcode Writer
> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
>   Combine redundant instructions
> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
>   Combine redundant instructions
> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
>   Combine redundant instructions
> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
>   Combine redundant instructions
> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
>   Combine redundant instructions
> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
>   Combine redundant instructions
>
> Dec 16th:
>
> real    20m10.734s
> user    20m8.523s
> sys     0m2.197s
>
>    208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> 17.5%)  Value Propagation
>    179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> 15.1%)  Value Propagation
>    92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> 7.7%)  Combine redundant instructions
>    72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> 6.1%)  Combine redundant instructions
>    72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> 6.1%)  Combine redundant instructions
>    66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> 5.6%)  Combine redundant instructions
>    65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> 5.5%)  Combine redundant instructions
>    61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> 5.2%)  Function Integration/Inlining
>    54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> 4.6%)  Combine redundant instructions
>    50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> 4.2%)  Combine redundant instructions
>    47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> 4.0%)  Global Value NumberingI'd really like to see a profile which broke down the time spent in 
Value Propagation and LVI.  As the person who has touched both most 
recently, I am probably the responsible party.  At the same time, there 
are a number of known fixes we can apply depending on the way this 
particular compile time problem exhibits.  I'm happy to take a look at 
this particular issue if you can given me enough information to analyze it.
>
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
>
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run
CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
>
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
>
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time
it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
>
> Thanks for your time, and of course, feedback welcome!
>

Davide Italiano via llvm-dev

2016-Dec-18 19:45 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Sun, Dec 18, 2016 at 11:10 AM, Philip Reames
<listmail at philipreames.com> wrote:> On 12/17/2016 01:35 PM, Davide Italiano via llvm-dev wrote:
>>
[...]>
> I'd really like to see a profile which broke down the time spent in
Value
> Propagation and LVI.  As the person who has touched both most recently, I
am
> probably the responsible party.  At the same time, there are a number of
> known fixes we can apply depending on the way this particular compile time
> problem exhibits.  I'm happy to take a look at this particular issue if
you
> can given me enough information to analyze it.
>
I'll try to reduce the testcase and do some profiling on it, then come
back to you.
Something to keep in mind, please note I'm not pointing fingers. The
main motivation behind this e-mail is showing up that multiple passes
in the compiler seem to have regressed on large testcases. Many people
don't notice because they are (un)lucky enough to not run with LTO.
There are probably/maybe other regressions uncovered by my
non-exahustive analysis. I wasn't particularly pleased after a fair
amount of time working on GVN to clean up technical-debt to realize
that the compiler got slower in other areas.

-- 
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

via llvm-dev

2016-Dec-18 20:55 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
From our own offline regression testing, one of the biggest culprits in our
experience is Instcombine’s known bits calculation. A number of new known bits
checks have been added in the past few years (e.g. to infer nuw, nsw, etc on
various instructions) and the cost adds up quite a lot, because *the cost is
paid even if Instcombine does nothing*, since it’s a significant cost on
visiting every relevant instruction.

This IME is one of the greatest ways performance gets lost: a tiny bit at a
time, whenever a new combine/transformation is added that is *expensive to test
for*. The test has to be done every time no matter what (and instcombine gets
called a lot!), so the cost adds up.

—escha

Andrew Trick via llvm-dev

2016-Dec-18 21:53 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

> On Dec 17, 2016, at 5:19 PM, Michael Gottesman <mgottesman at
apple.com> wrote:
> 
>> 2) For clang, we spend 35% more time inside instcombine, and for game7
>> instcombine seems to largely dominate the amount of time we spend
>> optimizing IR. I tried to bisect (which is not easy considering the
>> test takes a long time to run), but I wasn't able to identify a
single
>> point in time responsible for the regression. It seems to be an
>> additive effect. My wild (or not so wild) guess is that every day
>> somebody adds a matcher of two because that improves their testcase,
>> and at some point all this things add up. I'll try to do some
>> additional profiling but I guess large part of our time is spent
>> solving bitwise-domain dataflow problems (ComputeKnownBits et
>> similia). Once GVN will be in a more stable state, I plan to
>> experiment with caching results.
> 
> We have seen a similar thing when compiling the swift standard library. We
have talked about making a small simple instcombine pass that doesn't
iterate to a fixed point (IIRC). IIRC Andy/Mark looked at this (so my memory
might be wrong).

In that case we wanted a lighter-weight instcombine cleanup pass to run right
after LLVM IR generation. But, in general it makes sense to separate cleanup and
canonicalization vs. optimization. I’ve always been strongly in favor of
splitting InstCombine into a set of cheap, canonical transforms that run
frequently, vs. optimizations that can run once later in the pipeline. It’s
particularly annoying when somewhat target-specific codegen optimization get
thrown into InstCombine—I’m not how prevalent that still is. It would be a lot
of work to go through all the patterns and figure out what they’re needed for.
But it might make sense to try reducing the number of times InstCombine is run,
replace the earlier runs with an EarlyInstCombine and gradually move the most
important canonical transforms into EarlyInstCombine as they’re needed.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161218/b187011e/attachment.html>

Sean Silva via llvm-dev

2016-Dec-18 23:55 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Sat, Dec 17, 2016 at 7:48 PM, Davide Italiano <davide at freebsd.org>
wrote:
>
>
> On Dec 17, 2016 7:41 PM, "Sean Silva" <chisophugis at
gmail.com> wrote:
>
>
>
> On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at
freebsd.org>
>> wrote:
>> [...]
>>
>> > I don't have an infrastructure to measure the runtime
performance
>> > benefits/regression of clang, but I have for `game7`.
>> > I wasn't able to notice any fundamental speedup (at least, not
>> > something that justifies a 2x build-time).
>> >
>>
>> Just to provide numbers (using Sean's Mathematica template,
thanks),
>> here's a plot of the CDF of the frames in a particular level/scene.
>> The curves pretty much match, and the picture in the middle shows a
>> relative difference of 0.4% between Jun and Dec (which could be very
>> possibly be in the noise).
>>
>
> With 5-10 runs per binary you should be able to reliably measure down to
> 0.5% difference on game7 (50 usec difference per frame). With just one run
> per binary like you have here I would not trust a 0.4% difference.
>
>
> Yeah, I agree. This was mostly a sanity check to understand if there was a
> significant improvement at runtime. I ran the same test 7 times in the last
> our but the difference is always in the noise, FWIW.
>
The reason to add more runs is that you can measure performance differences
smaller than the run to run variation for the same binary. Most of the
time, if you just plot 10 runs of each binary all on the same plot you can
easily eyeball it to see that one is consistently faster. If you want to
formalize this you could do something like:

1. pair up each run of one binary with another run from the other binary
2. see which (say) median is higher to get a boolean value of which is
faster in that pair. (you can be fancier and compute an integral over e.g.
the 20-80%'ile frame times; or you can investigate the tail by looking at
the 99'th percentile)
3. treat that as a coin toss (i.e., as a bernoulli random variable with
parameter p; a "fair coin" would be p=0.5)
4. Use bayesian methods to propagate the "coin toss" observations
backward
to infer a distribution on the possible values of the parameter p of the
bernoulli random variable.

Step 4 is actually fairly simple; an example of how to do it is here:
http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_PyMC3.ipynb#Example:-Mandatory-coin-flip-example
Notice (in the picture on that page) how the prior assumption that p might
be anywhere in [0,1] with uniform probability is refined as we incorporate
the result of each coin flip (that's really all the "bayesian"
means;
Bayes' theorem tells you how to incorporate the new evidence to update your
estimated distribution (often called the "posterior" distribution)).
As we
incorporate more coin tosses, we refine that initial distribution. For a
fair coin, after enough tosses, the posterior distribution becomes very
concentrated around p=0.5

For the "coin flip", it's just a matter of plugging into a closed
form
formula to get the distribution, but for more sophisticated models there is
no closed form formula.
This is where MCMC libraries (like it shows how to use in that
"Probabilistic Programming and Bayesian Methods for Hackers" book)
come
into play.
E.g. note that in step 2 you are losing a lot of information actually (you
reduce two frame time CDF's (thousands of data points each) to a single bit
of information). Using an MCMC library you can have fine-grained control
over the amount of detail in the model and it can propagate your
observations back to the parameters of your model in a continuous and
information-preserving way (to the extent that you design your model to
preserve information; there are tradeoffs between model accuracy and
computation time obviously).

(step 1 also loses information, btw. You also lose information when you
start looking at the frame time distribution because you lose information
about the ordering of the frames)

-- Sean Silva

>
> -- Sean Silva
>
>
>> On the same scene, the difference between -O3 and -flto is 12%, FWIW.
>> So it seems that at least for this particular case all the compile
>> time didn't buy any runtime improvement.
>>
>> --
>> Davide
>>
>> "There are no solved problems; there are only problems that are
more
>> or less solved" -- Henri Poincare
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161218/d7940132/attachment.html>

Matthias Braun via llvm-dev

2016-Dec-19 22:40 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

Jut wanted to mention http://lab.llvm.org:8080/green/view/Compile%20Time/
<http://lab.llvm.org:8080/green/view/Compile%20Time/> again. We have setup
continuous compiletime tracking of the CTMark testsuite in 4 popular flag
combinations there (and yes it’s another example showing the slowdowns). It also
allows has a feature to track regressions!

- Matthias
> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of
the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm
-time-passes).
> For clang:
> 
> June 2nd:
> real    22m9.278s
> user    21m30.410s
> sys     0m38.834s
>  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> 24.3%)  X86 DAG->DAG Instruction Selection
>  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> 7.7%)  Global Value Numbering
>  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> 5.0%)  Function Integration/Inlining
>  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> 4.7%)  Combine redundant instructions
>  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> 4.3%)  Combine redundant instructions
>  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> 4.1%)  Loop Strength Reduction
>  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> 3.8%)  Greedy Register Allocator
>  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> 3.0%)  Induction Variable Simplification
>  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> 2.9%)  Combine redundant instructions
>  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> 2.7%)  Combine redundant instructions
>  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> 2.0%)  Combine redundant instructions
> 
> Dec 16th:
> real    27m34.922s
> user    26m53.489s
> sys     0m41.533s
> 
>  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> 19.3%)  X86 DAG->DAG Instruction Selection
>  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> 12.5%)  Function Integration/Inlining
>  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> 6.8%)  Global Value Numbering
>  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> 5.7%)  Combine redundant instructions
>  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> 5.0%)  Combine redundant instructions
>  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> 3.5%)  Combine redundant instructions
>  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> 3.3%)  Greedy Register Allocator
>  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> 3.3%)  Combine redundant instructions
>  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> 3.1%)  Loop Strength Reduction
>  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> 2.6%)  Induction Variable Simplification
>  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> 2.4%)  Combine redundant instructions
> 
> so, llvm is around 20% slower than it used to be.
> 
> For our internal codebase the situation seems slightly worse:
> 
> `game7`
> 
> June 2nd:
> 
> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
> 
>  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> 20.3%)  X86 DAG->DAG Instruction Selection
>  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> 9.4%)  X86 Assembly / Object Emitter
>  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> 7.6%)  Function Integration/Inlining
>  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> 6.1%)  Global Value Numbering
>  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> 4.8%)  Combine redundant instructions
>  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> 4.2%)  Post RA top-down list latency scheduler
>  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> 3.5%)  Combine redundant instructions
> 
> Dec 16th:
> 
> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
> 
>  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> 15.2%)  Combine redundant instructions
>  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> 11.7%)  Combine redundant instructions
>  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> 11.6%)  X86 DAG->DAG Instruction Selection
>  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> 8.1%)  Combine redundant instructions
>  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> 7.8%)  Function Integration/Inlining
>  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> 6.8%)  Global Value Numbering
>  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> 6.4%)  Combine redundant instructions
> 
> so, using LTO, lld takes 2x to build what it used to take (and all the
> extra time seems spent in the optimizer).
> 
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
> 
> June 2nd:
> time opt -O2
> real    6m23.016s
> user   6m20.900s
> sys     0m2.113s
> 
> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
> Function Integration/Inlining
> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
> Global Value Numbering
> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
> Bitcode Writer
> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
> Combine redundant instructions
> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
> Combine redundant instructions
> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
> Combine redundant instructions
> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
> Combine redundant instructions
> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
> Combine redundant instructions
> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
> Combine redundant instructions
> 
> Dec 16th:
> 
> real    20m10.734s
> user    20m8.523s
> sys     0m2.197s
> 
>  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> 17.5%)  Value Propagation
>  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> 15.1%)  Value Propagation
>  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> 7.7%)  Combine redundant instructions
>  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> 6.1%)  Combine redundant instructions
>  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> 6.1%)  Combine redundant instructions
>  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> 5.6%)  Combine redundant instructions
>  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> 5.5%)  Combine redundant instructions
>  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> 5.2%)  Function Integration/Inlining
>  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> 4.6%)  Combine redundant instructions
>  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> 4.2%)  Combine redundant instructions
>  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> 4.0%)  Global Value Numbering
> 
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
> 
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run
CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
> 
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
> 
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time
it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
> 
> Thanks for your time, and of course, feedback welcome!
> 
> -- 
> Davide
> 
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161219/e32294b0/attachment.html>

Mikhail Zolotukhin via llvm-dev

2016-Dec-19 23:55 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

> On Dec 17, 2016, at 7:48 PM, Davide Italiano via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> 
> 
> On Dec 17, 2016 7:41 PM, "Sean Silva" <chisophugis at
gmail.com <mailto:chisophugis at gmail.com>> wrote:
> 
> 
> On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <llvm-dev
at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at freebsd.org
<mailto:davide at freebsd.org>> wrote:
> [...]
> 
> > I don't have an infrastructure to measure the runtime performance
> > benefits/regression of clang, but I have for `game7`.
> > I wasn't able to notice any fundamental speedup (at least, not
> > something that justifies a 2x build-time).
> >
> 
> Just to provide numbers (using Sean's Mathematica template, thanks),
> here's a plot of the CDF of the frames in a particular level/scene.
> The curves pretty much match, and the picture in the middle shows a
> relative difference of 0.4% between Jun and Dec (which could be very
> possibly be in the noise).
> 
> With 5-10 runs per binary you should be able to reliably measure down to
0.5% difference on game7 (50 usec difference per frame). With just one run per
binary like you have here I would not trust a 0.4% difference.
> 
> Yeah, I agree. This was mostly a sanity check to understand if there was a
significant improvement at runtime. I ran the same test 7 times in the last our
but the difference is always in the noise, FWIW.FWIW, opt has an option '-run-twice', which might be helpful if compile
time of a particular unit is too small (it would be nice to change it to
-run-n-times though to make it more flexible).

Michael> 
> -- Sean Silva
>  
> On the same scene, the difference between -O3 and -flto is 12%, FWIW.
> So it seems that at least for this particular case all the compile
> time didn't buy any runtime improvement.
> 
> --
> Davide
> 
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161219/b5ae2d18/attachment.html>

Mikhail Zolotukhin via llvm-dev

2016-Dec-20 00:24 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

Hi Davide,

Thanks for the analysis, it's really interesting! And I'm really glad
that we now put more and more attention at the compile time!

Just recently I've been looking into historical compile time data as well,
and have had similar conclusions. The regressions you've found are probably
caused by:
1) r289813 and r289855 - new matchers in InstCombine
2) r286814 and r288024 - changes in Inlining cost model

You probably can try reverting them locally to check if my hypothesis is
correct.

I also looked at earlier data, and on top of the before-mentioned issues I found
the following causes of compile-time degradations (and sometimes improvements):
1) Changes in SCEV (r251049, r248637).
2) Changes in LoopUnrolling cost-model: adding a new unroll analyzer, changing
Os thresholds, etc.
3) Adding new passes (e.g. LoopLoadElimination).

Not everything is recoverable, but there are still some opportunities laying
here. For instance, I have some ideas on how to mitigate the compile time effect
of the SCEV changes, and hopefully soon I'll submit a patch for it.

Changes in unrolling and inlining thresholds are different by their nature: we
basically just changed thresholds under which the optimizations kick in. That
inevitably will have negative impacts on some tests, while hopefully improving
all tests in average. I think it's reasonable to ignore such
'regressions' provided that no test regressed by too much.

Adding a new pass also most likely increases compile time. Here all we can do
is: 1) discuss the tradeoffs before we add it, 2) provide knobs to turn it
off/tune. Making light versions of passes (like the ones proposed in this thread
for InstCombine) also sounds like a good idea. E.g. I thought about profiling
InstCombine on some codebase to detect most common matchers and disable other
matchers under some flag.

Thanks,
Michael

> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of
the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm
-time-passes).
> For clang:
> 
> June 2nd:
> real    22m9.278s
> user    21m30.410s
> sys     0m38.834s
>  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
>  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (
> 24.3%)  X86 DAG->DAG Instruction Selection
>  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (
> 7.7%)  Global Value Numbering
>  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (
> 5.0%)  Function Integration/Inlining
>  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (
> 4.7%)  Combine redundant instructions
>  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (
> 4.3%)  Combine redundant instructions
>  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (
> 4.1%)  Loop Strength Reduction
>  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (
> 3.8%)  Greedy Register Allocator
>  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (
> 3.0%)  Induction Variable Simplification
>  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (
> 2.9%)  Combine redundant instructions
>  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (
> 2.7%)  Combine redundant instructions
>  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (
> 2.0%)  Combine redundant instructions
> 
> Dec 16th:
> real    27m34.922s
> user    26m53.489s
> sys     0m41.533s
> 
>  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (
> 19.3%)  X86 DAG->DAG Instruction Selection
>  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (
> 12.5%)  Function Integration/Inlining
>  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (
> 6.8%)  Global Value Numbering
>  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (
> 5.7%)  Combine redundant instructions
>  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (
> 5.0%)  Combine redundant instructions
>  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (
> 3.5%)  Combine redundant instructions
>  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (
> 3.3%)  Greedy Register Allocator
>  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (
> 3.3%)  Combine redundant instructions
>  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (
> 3.1%)  Loop Strength Reduction
>  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (
> 2.6%)  Induction Variable Simplification
>  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (
> 2.4%)  Combine redundant instructions
> 
> so, llvm is around 20% slower than it used to be.
> 
> For our internal codebase the situation seems slightly worse:
> 
> `game7`
> 
> June 2nd:
> 
> Total Execution Time: 464.3920 seconds (463.8986 wall clock)
> 
>  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (
> 20.3%)  X86 DAG->DAG Instruction Selection
>  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (
> 9.4%)  X86 Assembly / Object Emitter
>  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (
> 7.6%)  Function Integration/Inlining
>  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (
> 6.1%)  Global Value Numbering
>  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (
> 4.8%)  Combine redundant instructions
>  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (
> 4.2%)  Post RA top-down list latency scheduler
>  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (
> 3.5%)  Combine redundant instructions
> 
> Dec 16th:
> 
> Total Execution Time: 861.0898 seconds (860.5808 wall clock)
> 
>  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (
> 15.2%)  Combine redundant instructions
>  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (
> 11.7%)  Combine redundant instructions
>  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (
> 11.6%)  X86 DAG->DAG Instruction Selection
>  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (
> 8.1%)  Combine redundant instructions
>  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (
> 7.8%)  Function Integration/Inlining
>  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (
> 6.8%)  Global Value Numbering
>  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (
> 6.4%)  Combine redundant instructions
> 
> so, using LTO, lld takes 2x to build what it used to take (and all the
> extra time seems spent in the optimizer).
> 
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
> 
> June 2nd:
> time opt -O2
> real    6m23.016s
> user   6m20.900s
> sys     0m2.113s
> 
> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)
> Function Integration/Inlining
> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)
> Global Value Numbering
> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)
> Bitcode Writer
> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)
> Combine redundant instructions
> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)
> Combine redundant instructions
> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)
> Combine redundant instructions
> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)
> Combine redundant instructions
> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)
> Combine redundant instructions
> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)
> Combine redundant instructions
> 
> Dec 16th:
> 
> real    20m10.734s
> user    20m8.523s
> sys     0m2.197s
> 
>  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (
> 17.5%)  Value Propagation
>  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (
> 15.1%)  Value Propagation
>  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (
> 7.7%)  Combine redundant instructions
>  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (
> 6.1%)  Combine redundant instructions
>  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (
> 6.1%)  Combine redundant instructions
>  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (
> 5.6%)  Combine redundant instructions
>  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (
> 5.5%)  Combine redundant instructions
>  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (
> 5.2%)  Function Integration/Inlining
>  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (
> 4.6%)  Combine redundant instructions
>  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (
> 4.2%)  Combine redundant instructions
>  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (
> 4.0%)  Global Value Numbering
> 
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
> 
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run
CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
> 
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
> 
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time
it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
> 
> Thanks for your time, and of course, feedback welcome!
> 
> -- 
> Davide
> 
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Davide Italiano via llvm-dev

2016-Dec-21 16:07 UTC

head link

[llvm-dev] llvm (the middle-end) is getting slower, December edition

On Mon, Dec 19, 2016 at 4:24 PM, Mikhail Zolotukhin
<mzolotukhin at apple.com> wrote:> Hi Davide,
>
> Thanks for the analysis, it's really interesting! And I'm really
glad that we now put more and more attention at the compile time!
>
> Just recently I've been looking into historical compile time data as
well, and have had similar conclusions. The regressions you've found are
probably caused by:
> 1) r289813 and r289855 - new matchers in InstCombine
> 2) r286814 and r288024 - changes in Inlining cost model
>
Haven't looked at 2) yet, but I can confirm for 1). Sanjay/Ehsan, can
you please explain what's the motivation behind the new
transformations you introduced? I'm tempted to ask a revert, but I'd
like to understand the motivations first.

-- 
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

Seemingly Similar Threads

Search for more possibly parallel threads

llvm dev - Dec 2016 - llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

[llvm-dev] llvm (the middle-end) is getting slower, December edition

Seemingly Similar Threads