Davide Italiano via llvm-dev
2016-Dec-17 21:35 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
First of all, sorry for the long mail.

Inspired by the excellent analysis Rui did for lld, I decided to do the same for llvm. I'm personally very interested in build time for the LTO configuration, with particular attention to the time spent in the optimizer. Rafael did something similar back in March, so this can be considered an update. This tries to include a more accurate high-level analysis of where llvm is spending CPU cycles.

Here I present 2 cases: clang building itself with `-flto` (Full), and clang building an internal codebase which I'm going to refer to as `game7`. It's a mid-sized program (it's actually a game), more or less the size of clang, which we use internally as a benchmark to track compile-time/runtime improvements/regressions.

I picked two random revisions of llvm: trunk (December 16th 2016) and trunk (June 2nd 2016), so, roughly, a 6-month period. My setup is a Mac Pro running Linux (NixOS).

These are the numbers I collected (including the output of -mllvm -time-passes). In the -time-passes rows the columns are user time, system time, user+system time and wall clock time, in seconds, followed by the pass name.
For clang:

June 2nd:
real 22m9.278s
user 21m30.410s
sys 0m38.834s
Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)
289.8102 ( 23.5%)   18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 ( 24.3%)  X86 DAG->DAG Instruction Selection
 97.2730 (  7.9%)    0.7656 (  2.2%)   98.0386 (  7.7%)   98.0010 (  7.7%)  Global Value Numbering
 62.4091 (  5.1%)    0.4779 (  1.4%)   62.8870 (  4.9%)   62.8665 (  5.0%)  Function Integration/Inlining
 58.6923 (  4.8%)    0.4767 (  1.4%)   59.1690 (  4.7%)   59.1323 (  4.7%)  Combine redundant instructions
 53.9602 (  4.4%)    0.6163 (  1.8%)   54.5765 (  4.3%)   54.5409 (  4.3%)  Combine redundant instructions
 51.0470 (  4.1%)    0.5703 (  1.6%)   51.6173 (  4.1%)   51.5425 (  4.1%)  Loop Strength Reduction
 47.4067 (  3.8%)    1.3040 (  3.7%)   48.7106 (  3.8%)   48.7034 (  3.8%)  Greedy Register Allocator
 36.7463 (  3.0%)    0.8133 (  2.3%)   37.5597 (  3.0%)   37.4612 (  3.0%)  Induction Variable Simplification
 37.0125 (  3.0%)    0.2699 (  0.8%)   37.2824 (  2.9%)   37.2478 (  2.9%)  Combine redundant instructions
 34.2071 (  2.8%)    0.2737 (  0.8%)   34.4808 (  2.7%)   34.4487 (  2.7%)  Combine redundant instructions
 25.6627 (  2.1%)    0.3215 (  0.9%)   25.9842 (  2.0%)   25.9509 (  2.0%)  Combine redundant instructions

Dec 16th:
real 27m34.922s
user 26m53.489s
sys 0m41.533s

287.5683 ( 18.5%)   19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 ( 19.3%)  X86 DAG->DAG Instruction Selection
197.9211 ( 12.7%)    0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 ( 12.5%)  Function Integration/Inlining
106.9669 (  6.9%)    0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (  6.8%)  Global Value Numbering
 89.7571 (  5.8%)    0.4840 (  1.3%)   90.2411 (  5.7%)   90.2067 (  5.7%)  Combine redundant instructions
 79.0456 (  5.1%)    0.7534 (  2.0%)   79.7990 (  5.0%)   79.7630 (  5.0%)  Combine redundant instructions
 55.6393 (  3.6%)    0.3116 (  0.8%)   55.9509 (  3.5%)   55.9187 (  3.5%)  Combine redundant instructions
 51.8663 (  3.3%)    1.4090 (  3.7%)   53.2754 (  3.3%)   53.2684 (  3.3%)  Greedy Register Allocator
 52.5721 (  3.4%)    0.3021 (  0.8%)   52.8743 (  3.3%)   52.8416 (  3.3%)  Combine redundant instructions
 49.0593 (  3.2%)    0.6101 (  1.6%)   49.6694 (  3.1%)   49.5904 (  3.1%)  Loop Strength Reduction
 41.2602 (  2.7%)    0.9608 (  2.5%)   42.2209 (  2.7%)   42.1122 (  2.6%)  Induction Variable Simplification
 38.1438 (  2.5%)    0.3486 (  0.9%)   38.4923 (  2.4%)   38.4603 (  2.4%)  Combine redundant instructions

so, llvm is around 20% slower than it used to be.

For our internal codebase the situation seems slightly worse:

`game7`

June 2nd:

Total Execution Time: 464.3920 seconds (463.8986 wall clock)

 88.0204 ( 20.3%)    6.0310 ( 20.0%)   94.0514 ( 20.3%)   94.0473 ( 20.3%)  X86 DAG->DAG Instruction Selection
 27.4382 (  6.3%)   16.2437 ( 53.9%)   43.6819 (  9.4%)   43.6823 (  9.4%)  X86 Assembly / Object Emitter
 34.9581 (  8.1%)    0.5274 (  1.8%)   35.4855 (  7.6%)   35.4679 (  7.6%)  Function Integration/Inlining
 27.8556 (  6.4%)    0.3419 (  1.1%)   28.1975 (  6.1%)   28.1824 (  6.1%)  Global Value Numbering
 22.1479 (  5.1%)    0.2258 (  0.7%)   22.3737 (  4.8%)   22.3593 (  4.8%)  Combine redundant instructions
 19.2346 (  4.4%)    0.3639 (  1.2%)   19.5985 (  4.2%)   19.5870 (  4.2%)  Post RA top-down list latency scheduler
 15.8085 (  3.6%)    0.2675 (  0.9%)   16.0760 (  3.5%)   16.0614 (  3.5%)  Combine redundant instructions

Dec 16th:

Total Execution Time: 861.0898 seconds (860.5808 wall clock)

135.7207 ( 15.7%)    0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 ( 15.2%)  Combine redundant instructions
103.6609 ( 12.0%)    0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 ( 11.7%)  Combine redundant instructions
 97.1083 ( 11.3%)    6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 ( 11.6%)  X86 DAG->DAG Instruction Selection
 72.6125 (  8.4%)    0.1701 (  0.5%)   72.7826 (  8.1%)   72.7678 (  8.1%)  Combine redundant instructions
 69.2144 (  8.0%)    0.6060 (  1.9%)   69.8204 (  7.8%)   69.8007 (  7.8%)  Function Integration/Inlining
 60.7837 (  7.1%)    0.3783 (  1.2%)   61.1620 (  6.8%)   61.1455 (  6.8%)  Global Value Numbering
 56.5650 (  6.6%)    0.1980 (  0.6%)   56.7630 (  6.4%)   56.7476 (  6.4%)  Combine redundant instructions

so, using LTO, lld takes 2x to build what it used to take (and all the extra time seems spent in the optimizer).
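For readers who want to check the headline figures, here is a minimal sketch that recomputes the slowdowns from the totals reported above. The helper names (`to_seconds`, `slowdown`) are illustrative and not part of any LLVM tooling; only the timings themselves come from the reports.

```python
# Timings copied from the reports above; helper names are illustrative only.

def to_seconds(minutes, seconds):
    """Convert a `time`-style "XmY.Zs" reading into seconds."""
    return minutes * 60 + seconds

clang_june = to_seconds(22, 9.278)    # real 22m9.278s (June 2nd)
clang_dec = to_seconds(27, 34.922)    # real 27m34.922s (Dec 16th)
game7_june = 464.3920                 # Total Execution Time, June 2nd
game7_dec = 861.0898                  # Total Execution Time, Dec 16th

def slowdown(before, after):
    """Relative increase of `after` over `before`, as a fraction."""
    return (after - before) / before

clang_slowdown = slowdown(clang_june, clang_dec)
game7_ratio = game7_dec / game7_june

print(f"clang LTO build: {clang_slowdown:.1%} slower (wall clock)")
print(f"game7 -time-passes total: {game7_ratio:.2f}x the June figure")
```

With these inputs the clang build comes out a little under 25% slower and the game7 total at roughly 1.85x, which is in line with the rough "around 20%" and "2x" estimates quoted above.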
As an (extra) experiment, I decided to take the unoptimized output of game7 (via lld -save-temps) and pass it to opt -O2. That shows another significant regression (with different characteristics).

June 2nd:
time opt -O2
real 6m23.016s
user 6m20.900s
sys 0m2.113s

 35.9071 ( 10.0%)    0.7996 ( 10.9%)   36.7066 ( 10.0%)   36.6900 ( 10.1%)  Function Integration/Inlining
 33.4045 (  9.3%)    0.4053 (  5.5%)   33.8098 (  9.3%)   33.7919 (  9.3%)  Global Value Numbering
 27.1053 (  7.6%)    0.5940 (  8.1%)   27.6993 (  7.6%)   27.6995 (  7.6%)  Bitcode Writer
 25.6492 (  7.2%)    0.2491 (  3.4%)   25.8984 (  7.1%)   25.8805 (  7.1%)  Combine redundant instructions
 19.2686 (  5.4%)    0.2956 (  4.0%)   19.5642 (  5.4%)   19.5471 (  5.4%)  Combine redundant instructions
 18.6697 (  5.2%)    0.2625 (  3.6%)   18.9323 (  5.2%)   18.9148 (  5.2%)  Combine redundant instructions
 16.1294 (  4.5%)    0.2320 (  3.2%)   16.3614 (  4.5%)   16.3434 (  4.5%)  Combine redundant instructions
 13.5476 (  3.8%)    0.3945 (  5.4%)   13.9421 (  3.8%)   13.9295 (  3.8%)  Combine redundant instructions
 13.1746 (  3.7%)    0.1767 (  2.4%)   13.3512 (  3.7%)   13.3405 (  3.7%)  Combine redundant instructions

Dec 16th:

real 20m10.734s
user 20m8.523s
sys 0m2.197s

208.8113 ( 17.6%)    0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 ( 17.5%)  Value Propagation
179.6863 ( 15.2%)    0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 ( 15.1%)  Value Propagation
 92.0158 (  7.8%)    0.2674 (  3.0%)   92.2832 (  7.7%)   92.2613 (  7.7%)  Combine redundant instructions
 72.3330 (  6.1%)    0.6026 (  6.7%)   72.9356 (  6.1%)   72.9210 (  6.1%)  Combine redundant instructions
 72.2505 (  6.1%)    0.2167 (  2.4%)   72.4672 (  6.1%)   72.4539 (  6.1%)  Combine redundant instructions
 66.6765 (  5.6%)    0.3482 (  3.9%)   67.0247 (  5.6%)   67.0040 (  5.6%)  Combine redundant instructions
 65.5029 (  5.5%)    0.4092 (  4.5%)   65.9121 (  5.5%)   65.8913 (  5.5%)  Combine redundant instructions
 61.8355 (  5.2%)    0.8150 (  9.0%)   62.6505 (  5.2%)   62.6315 (  5.2%)  Function Integration/Inlining
 54.9184 (  4.6%)    0.3359 (  3.7%)   55.2543 (  4.6%)   55.2332 (  4.6%)  Combine redundant instructions
 50.2597 (  4.2%)    0.2187 (  2.4%)   50.4784 (  4.2%)   50.4654 (  4.2%)  Combine redundant instructions
 47.2597 (  4.0%)    0.3719 (  4.1%)   47.6316 (  4.0%)   47.6105 (  4.0%)  Global Value Numbering

I don't have an infrastructure to measure the runtime performance benefits/regressions of clang, but I have one for `game7`. I wasn't able to notice any fundamental speedup (at least, not something that justifies a 2x build time).

tl;dr:
There are quite a few things to notice:
1) GVN used to be the top pass in the middle-end in some cases, and pretty much always in the top 3. This is not the case anymore, but it's still a pass where we spend a lot of time. This is being worked on by Daniel Berlin and me (https://reviews.llvm.org/D26224), so there's some hope that it will be sorted out (or at least there's a plan for it).
2) For clang, we spend 35% more time inside instcombine, and for game7 instcombine seems to largely dominate the amount of time we spend optimizing IR. I tried to bisect (which is not easy considering the test takes a long time to run), but I wasn't able to identify a single point in time responsible for the regression. It seems to be an additive effect. My wild (or not so wild) guess is that every day somebody adds a matcher or two because that improves their testcase, and at some point all these things add up. I'll try to do some additional profiling, but I guess a large part of our time is spent solving bitwise-domain dataflow problems (ComputeKnownBits et similia). Once GVN is in a more stable state, I plan to experiment with caching results.
3) Something peculiar is that we spend 2x time in the inliner. I'm not familiar with the inliner; IIRC there were some changes to the threshold recently, so any help here will be appreciated (both in reproducing the results and with analysis).
4) For the last testcase (opt -O2 on a large unoptimized chunk of bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I think it's not as lazy as it claims to be (or at least, not the way we use it). This doesn't show up in a full LTO run because we don't run CVP as part of the default LTO pipeline, but the case shows how LVI can be a bottleneck for large TUs (or at least how -O2 compile time degraded on large TUs). I haven't thought about the problem very carefully, but there seems to be some progress on this front (https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the original bitcode file, but I can probably do some profiling on it as well.

As next steps I'll try to get a more detailed analysis of the problems. In particular, I'll try to do what Rui did for lld, but with coarser granularity (every week), to get a chart of the compile time trend for these cases over the last 6 months, and post it here.

I think (I know) some people are aware of the problems I outline in this e-mail. But apparently not everybody. We're in a situation where compile time is increasing without real control. I'm happy that Apple is making a serious effort to track build time, so hopefully things will improve. There are, however, some cases (like adding matchers in instcombine, or knobs) where the compile time regression is hard to track until it's too late. LLVM as a project tries not to stop people trying to get things done, and that's great, but from time to time it's good to take a step back and re-evaluate approaches.

The purpose of this e-mail was to outline where we regressed, for those interested.

Thanks for your time, and of course, feedback welcome!

--
Davide

"There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare
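Tracking the trend week by week needs per-pass totals, and because a pass such as instcombine runs several times in the pipeline, its rows have to be summed. The following is a rough sketch of how that could be done in Python; it assumes rows shaped like the ones quoted in the reports above (four "seconds (percent%)" columns followed by the pass name), and `wall_time_by_pass` is a made-up helper name, not an existing LLVM script.

```python
import re
from collections import defaultdict

# One "seconds ( percent%)" column, e.g. "98.0386 (  7.7%)".
FIELD = r"(\d+\.\d+)\s*\(\s*\d+\.\d+%\)"
# Four such columns (user, system, user+system, wall) and then the pass name.
ROW = re.compile(r"^\s*" + r"\s+".join([FIELD] * 4) + r"\s+(\S.*?)\s*$")

def wall_time_by_pass(report):
    """Sum the wall-clock column per pass name over all matching rows."""
    totals = defaultdict(float)
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            wall_seconds = float(match.group(4))
            totals[match.group(5)] += wall_seconds
    return dict(totals)

# Three rows copied from the June 2nd clang report above.
sample = """
  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (  7.7%)  Global Value Numbering
  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (  4.7%)  Combine redundant instructions
  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (  4.3%)  Combine redundant instructions
"""

totals = wall_time_by_pass(sample)
for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{seconds:10.4f}  {name}")
```

Lines that do not match the row shape (the `real`/`user`/`sys` readings, blank lines, prose) are simply skipped. A summary row with the same column layout would be counted like any other pass, so it may need to be filtered out by name.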
Andrew Kelley via llvm-dev
2016-Dec-17 21:41 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
Davide,

Thank you for this analysis. As a front-end developer using LLVM I have one piece of feedback: I'm not really concerned about compilation time for -O2 or -O3, but I am extremely interested in minimizing compilation time for -O0.

Thank you for all your hard work,
Andrew (http://ziglang.org/)

On Sat, Dec 17, 2016 at 4:35 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
[...]
Michael Gottesman via llvm-dev
2016-Dec-18 01:19 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > First of all, sorry for the long mail. > Inspired by the excellent analysis Rui did for lld, I decided to do > the same for llvm. > I'm personally very interested in build-time for LTO configuration, > with particular attention to the time spent in the optimizer. > Rafael did something similar back in March, so this can be considered > as an update. This tries to include a more accurate high-level > analysis of where llvm is spending CPU cycles. > Here I present 2 cases: clang building itself with `-flto` (Full), and > clang building an internal codebase which I'm going to refer as > `game7`. > It's a mid-sized program (it's actually a game), more or less of the > size of clang, which we use internally as benchmark to track > compile-time/runtime improvements/regression. > I picked two random revisions of llvm: trunk (December 16th 2016) and > trunk (June 2nd 2016), so, roughly, 6 months period. > My setup is a Mac Pro running Linux (NixOS). > These are the numbers I collected (including the output of -mllvm -time-passes). 
> For clang: > > June 2nd: > real 22m9.278s > user 21m30.410s > sys 0m38.834s > Total Execution Time: 1270.4795 seconds (1269.1330 wall clock) > 289.8102 ( 23.5%) 18.8891 ( 53.7%) 308.6993 ( 24.3%) 308.6906 ( > 24.3%) X86 DAG->DAG Instruction Selection > 97.2730 ( 7.9%) 0.7656 ( 2.2%) 98.0386 ( 7.7%) 98.0010 ( > 7.7%) Global Value Numbering > 62.4091 ( 5.1%) 0.4779 ( 1.4%) 62.8870 ( 4.9%) 62.8665 ( > 5.0%) Function Integration/Inlining > 58.6923 ( 4.8%) 0.4767 ( 1.4%) 59.1690 ( 4.7%) 59.1323 ( > 4.7%) Combine redundant instructions > 53.9602 ( 4.4%) 0.6163 ( 1.8%) 54.5765 ( 4.3%) 54.5409 ( > 4.3%) Combine redundant instructions > 51.0470 ( 4.1%) 0.5703 ( 1.6%) 51.6173 ( 4.1%) 51.5425 ( > 4.1%) Loop Strength Reduction > 47.4067 ( 3.8%) 1.3040 ( 3.7%) 48.7106 ( 3.8%) 48.7034 ( > 3.8%) Greedy Register Allocator > 36.7463 ( 3.0%) 0.8133 ( 2.3%) 37.5597 ( 3.0%) 37.4612 ( > 3.0%) Induction Variable Simplification > 37.0125 ( 3.0%) 0.2699 ( 0.8%) 37.2824 ( 2.9%) 37.2478 ( > 2.9%) Combine redundant instructions > 34.2071 ( 2.8%) 0.2737 ( 0.8%) 34.4808 ( 2.7%) 34.4487 ( > 2.7%) Combine redundant instructions > 25.6627 ( 2.1%) 0.3215 ( 0.9%) 25.9842 ( 2.0%) 25.9509 ( > 2.0%) Combine redundant instructions > > Dec 16th: > real 27m34.922s > user 26m53.489s > sys 0m41.533s > > 287.5683 ( 18.5%) 19.7048 ( 52.3%) 307.2731 ( 19.3%) 307.2648 ( > 19.3%) X86 DAG->DAG Instruction Selection > 197.9211 ( 12.7%) 0.5104 ( 1.4%) 198.4314 ( 12.5%) 198.4091 ( > 12.5%) Function Integration/Inlining > 106.9669 ( 6.9%) 0.8316 ( 2.2%) 107.7984 ( 6.8%) 107.7633 ( > 6.8%) Global Value Numbering > 89.7571 ( 5.8%) 0.4840 ( 1.3%) 90.2411 ( 5.7%) 90.2067 ( > 5.7%) Combine redundant instructions > 79.0456 ( 5.1%) 0.7534 ( 2.0%) 79.7990 ( 5.0%) 79.7630 ( > 5.0%) Combine redundant instructions > 55.6393 ( 3.6%) 0.3116 ( 0.8%) 55.9509 ( 3.5%) 55.9187 ( > 3.5%) Combine redundant instructions > 51.8663 ( 3.3%) 1.4090 ( 3.7%) 53.2754 ( 3.3%) 53.2684 ( > 3.3%) Greedy Register Allocator > 52.5721 ( 3.4%) 
0.3021 ( 0.8%) 52.8743 ( 3.3%) 52.8416 ( > 3.3%) Combine redundant instructions > 49.0593 ( 3.2%) 0.6101 ( 1.6%) 49.6694 ( 3.1%) 49.5904 ( > 3.1%) Loop Strength Reduction > 41.2602 ( 2.7%) 0.9608 ( 2.5%) 42.2209 ( 2.7%) 42.1122 ( > 2.6%) Induction Variable Simplification > 38.1438 ( 2.5%) 0.3486 ( 0.9%) 38.4923 ( 2.4%) 38.4603 ( > 2.4%) Combine redundant instructions > > so, llvm is around 20% slower than it used to be. > > For our internal codebase the situation seems slightly worse: > > `game7` > > June 2nd: > > Total Execution Time: 464.3920 seconds (463.8986 wall clock) > > 88.0204 ( 20.3%) 6.0310 ( 20.0%) 94.0514 ( 20.3%) 94.0473 ( > 20.3%) X86 DAG->DAG Instruction Selection > 27.4382 ( 6.3%) 16.2437 ( 53.9%) 43.6819 ( 9.4%) 43.6823 ( > 9.4%) X86 Assembly / Object Emitter > 34.9581 ( 8.1%) 0.5274 ( 1.8%) 35.4855 ( 7.6%) 35.4679 ( > 7.6%) Function Integration/Inlining > 27.8556 ( 6.4%) 0.3419 ( 1.1%) 28.1975 ( 6.1%) 28.1824 ( > 6.1%) Global Value Numbering > 22.1479 ( 5.1%) 0.2258 ( 0.7%) 22.3737 ( 4.8%) 22.3593 ( > 4.8%) Combine redundant instructions > 19.2346 ( 4.4%) 0.3639 ( 1.2%) 19.5985 ( 4.2%) 19.5870 ( > 4.2%) Post RA top-down list latency scheduler > 15.8085 ( 3.6%) 0.2675 ( 0.9%) 16.0760 ( 3.5%) 16.0614 ( > 3.5%) Combine redundant instructions > > Dec 16th: > > Total Execution Time: 861.0898 seconds (860.5808 wall clock) > > 135.7207 ( 15.7%) 0.2484 ( 0.8%) 135.9692 ( 15.2%) 135.9531 ( > 15.2%) Combine redundant instructions > 103.6609 ( 12.0%) 0.4566 ( 1.4%) 104.1175 ( 11.7%) 104.1014 ( > 11.7%) Combine redundant instructions > 97.1083 ( 11.3%) 6.9183 ( 21.8%) 104.0266 ( 11.6%) 104.0181 ( > 11.6%) X86 DAG->DAG Instruction Selection > 72.6125 ( 8.4%) 0.1701 ( 0.5%) 72.7826 ( 8.1%) 72.7678 ( > 8.1%) Combine redundant instructions > 69.2144 ( 8.0%) 0.6060 ( 1.9%) 69.8204 ( 7.8%) 69.8007 ( > 7.8%) Function Integration/Inlining > 60.7837 ( 7.1%) 0.3783 ( 1.2%) 61.1620 ( 6.8%) 61.1455 ( > 6.8%) Global Value Numbering > 56.5650 ( 6.6%) 0.1980 ( 0.6%) 
56.7630 ( 6.4%) 56.7476 ( > 6.4%) Combine redundant instructions > > so, using LTO, lld takes 2x to build what it used to take (and all the > extra time seems spent in the optimizer). > > As an (extra) experiment, I decided to take the unoptimized output of > game7 (via lld -save-temps) and pass to -opt -O2. That shows another > significant regression (with different characteristics). > > June 2nd: > time opt -O2 > real 6m23.016s > user 6m20.900s > sys 0m2.113s > > 35.9071 ( 10.0%) 0.7996 ( 10.9%) 36.7066 ( 10.0%) 36.6900 ( 10.1%) > Function Integration/Inlining > 33.4045 ( 9.3%) 0.4053 ( 5.5%) 33.8098 ( 9.3%) 33.7919 ( 9.3%) > Global Value Numbering > 27.1053 ( 7.6%) 0.5940 ( 8.1%) 27.6993 ( 7.6%) 27.6995 ( 7.6%) > Bitcode Writer > 25.6492 ( 7.2%) 0.2491 ( 3.4%) 25.8984 ( 7.1%) 25.8805 ( 7.1%) > Combine redundant instructions > 19.2686 ( 5.4%) 0.2956 ( 4.0%) 19.5642 ( 5.4%) 19.5471 ( 5.4%) > Combine redundant instructions > 18.6697 ( 5.2%) 0.2625 ( 3.6%) 18.9323 ( 5.2%) 18.9148 ( 5.2%) > Combine redundant instructions > 16.1294 ( 4.5%) 0.2320 ( 3.2%) 16.3614 ( 4.5%) 16.3434 ( 4.5%) > Combine redundant instructions > 13.5476 ( 3.8%) 0.3945 ( 5.4%) 13.9421 ( 3.8%) 13.9295 ( 3.8%) > Combine redundant instructions > 13.1746 ( 3.7%) 0.1767 ( 2.4%) 13.3512 ( 3.7%) 13.3405 ( 3.7%) > Combine redundant instructions > > Dec 16th: > > real 20m10.734s > user 20m8.523s > sys 0m2.197s > > 208.8113 ( 17.6%) 0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 ( > 17.5%) Value Propagation > 179.6863 ( 15.2%) 0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 ( > 15.1%) Value Propagation > 92.0158 ( 7.8%) 0.2674 ( 3.0%) 92.2832 ( 7.7%) 92.2613 ( > 7.7%) Combine redundant instructions > 72.3330 ( 6.1%) 0.6026 ( 6.7%) 72.9356 ( 6.1%) 72.9210 ( > 6.1%) Combine redundant instructions > 72.2505 ( 6.1%) 0.2167 ( 2.4%) 72.4672 ( 6.1%) 72.4539 ( > 6.1%) Combine redundant instructions > 66.6765 ( 5.6%) 0.3482 ( 3.9%) 67.0247 ( 5.6%) 67.0040 ( > 5.6%) Combine redundant instructions > 65.5029 ( 5.5%) 0.4092 ( 
4.5%) 65.9121 ( 5.5%) 65.8913 ( 5.5%) Combine redundant instructions
> 61.8355 ( 5.2%) 0.8150 ( 9.0%) 62.6505 ( 5.2%) 62.6315 ( 5.2%) Function Integration/Inlining
> 54.9184 ( 4.6%) 0.3359 ( 3.7%) 55.2543 ( 4.6%) 55.2332 ( 4.6%) Combine redundant instructions
> 50.2597 ( 4.2%) 0.2187 ( 2.4%) 50.4784 ( 4.2%) 50.4654 ( 4.2%) Combine redundant instructions
> 47.2597 ( 4.0%) 0.3719 ( 4.1%) 47.6316 ( 4.0%) 47.6105 ( 4.0%) Global Value Numbering
>
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
>
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher of two because that improves their testcase,
> and at some point all this things add up. I'll try to do some
> additional profiling but I guess large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN will be in a more stable state, I plan to
> experiment with caching results.

We have seen a similar thing when compiling the swift standard library. We have talked about making a small simple instcombine pass that doesn't iterate to a fixed point (IIRC). IIRC Andy/Mark looked at this (so my memory might be wrong).

> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it). This doesn't show up in a full LTO run because we don't run CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
>
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
>
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is doing a serious effort to track build-time, so hopefully things
> will improve. There are, although, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
>
> Thanks for your time, and of course, feedback welcome!
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
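As a quick sanity check on the headline figures, the slowdown ratios can be recomputed from the wall-clock numbers quoted in the original post. The sketch below is only illustrative: the helper names `parse_time` and `slowdown` are made up for this example, and the inputs are copied from the quoted `time` and `-time-passes` output. It gives roughly 1.24x for the clang LTO build, 1.85x for `game7`, and 3.16x for the standalone `opt -O2` run, which is in line with the "around 20%" and "2x" summaries.

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time`-style duration such as '22m9.278s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def slowdown(before: float, after: float) -> float:
    """Ratio of the newer measurement to the older one (1.0 means unchanged)."""
    return after / before


# Figures quoted in the thread: June 2nd 2016 versus December 16th 2016.
clang_lto = slowdown(parse_time("22m9.278s"), parse_time("27m34.922s"))  # `real`
game7_lto = slowdown(464.3920, 861.0898)  # "Total Execution Time", in seconds
game7_opt = slowdown(parse_time("6m23.016s"), parse_time("20m10.734s"))  # `real`

# The two "Value Propagation" rows in the December opt -O2 run (total column).
cvp_share = 17.5 + 15.1

for name, ratio in [
    ("clang -flto", clang_lto),
    ("game7 -flto", game7_lto),
    ("game7 opt -O2", game7_opt),
]:
    print(f"{name}: {ratio:.2f}x")
print(f"CVP share of opt -O2: {cvp_share:.1f}%")
```

The last line adds up the two Value Propagation rows, which is where the "33% of its time in CVP" figure comes from.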
Davide Italiano via llvm-dev
2016-Dec-18 02:32 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at freebsd.org> wrote:
[...]
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).

Just to provide numbers (using Sean's Mathematica template, thanks), here's a plot of the CDF of the frames in a particular level/scene. The curves pretty much match, and the picture in the middle shows a relative difference of 0.4% between Jun and Dec (which could very possibly be in the noise). On the same scene, the difference between -O3 and -flto is 12%, FWIW. So it seems that, at least for this particular case, all the extra compile time didn't buy any runtime improvement.

--
Davide

"There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare

[Attachment: flto_regression.png (image/png, 131534 bytes) <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161217/12c8a067/attachment-0001.png>]
Mehdi Amini via llvm-dev
2016-Dec-18 02:39 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
[...]
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression.

An efficient way to bisect this is to:

1) Dump the IR right before instcombine, then run only opt -instcombine and confirm the regression shows up.
2) Then reduce the input: you should ultimately be able to single out one function (maybe with bugpoint or with -opt-bisect-limit).
3) With a single function that shows the regression, it should be fairly easy to plot the time to run opt -instcombine for almost every revision between June and now.

-- Mehdi

[...]
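Step 3 of Mehdi's suggestion could be scripted along the following lines. This is a minimal sketch, not something taken from the thread: it assumes one `opt` binary has already been built per revision, that the reduced input lives in a single IR file, and that the 2016-era flags `-instcombine` and `-disable-output` are accepted by each build. The function names and the dictionary layout are hypothetical.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of cmd."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken build is not
        # silently recorded as a very fast run.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best


def time_revisions(opt_binaries, input_ir, repeats=3, timer=time_command):
    """Map each revision label to the best time of `opt -instcombine` on input_ir.

    opt_binaries is a hypothetical {label: path-to-opt} dictionary.
    """
    results = {}
    for label, opt_path in opt_binaries.items():
        cmd = [opt_path, "-instcombine", "-disable-output", input_ir]
        results[label] = timer(cmd, repeats)
    return results
```

Calling `time_revisions({"2016-06-02": "/path/to/june/opt", "2016-12-16": "/path/to/dec/opt"}, "reduced.ll")` (paths made up) would return one timing per revision, ready to plot. Taking the minimum over a few runs is one simple way to damp noise; other statistics may be more appropriate.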
Davide Italiano via llvm-dev
2016-Dec-18 02:53 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Sat, Dec 17, 2016 at 6:39 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
[...]
>> 2) For clang, we spend 35% more time inside instcombine, and for game7
>> instcombine seems to largely dominate the amount of time we spend
>> optimizing IR. I tried to bisect (which is not easy considering the
>> test takes a long time to run), but I wasn't able to identify a single
>> point in time responsible for the regression.
>
> An efficient way to bisect this is to:
>
> 1) dump the IR right before instcombine, and then run only opt -instcombine and confirm the regression shows up.
> 2) Then reduce the input: you should be able to single out a single function ultimately. (Maybe with bugpoint or with -opt-bisect-limit)
> 3) With a single function that shows the regression, it should be fairly easy to plot the time to run opt -instcombine for almost every revision between June and now.

I tried 1) and I'm able to reproduce the increase in compile time. 2) is on my to-do list. I plan to use (and I can see how you can use) bugpoint or delta (with `ulimit`), but it's not entirely clear to me how to reduce using -opt-bisect-limit. As far as I know, that just runs passes up to a given point of the pipeline, while here the regression also shows up with a single pass, i.e. opt -instcombine. Can you please elaborate?

--
Davide

"There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare
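For the delta-style reduction Davide mentions, the reducer needs a predicate that says whether a candidate input still shows the slowdown. The sketch below is one hedged way to write it in Python rather than with `ulimit`: a candidate counts as "interesting" when `opt -instcombine` is still running after a time limit. The function names, the default limit, and the convention that exit code 0 means "interesting" are assumptions for illustration and should be checked against the reducer actually used.

```python
import subprocess
import sys


def exceeds_time_limit(cmd, limit_seconds):
    """Return True if cmd is still running after limit_seconds.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit_seconds,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def interestingness_exit_code(candidate, opt_path="opt", limit_seconds=10.0):
    """Exit code for a reducer: 0 if the candidate is still slow, 1 otherwise.

    A candidate that opt rejects quickly (for example, malformed IR) finishes
    before the limit and is therefore reported as not interesting.
    """
    cmd = [opt_path, "-instcombine", "-disable-output", candidate]
    return 0 if exceeds_time_limit(cmd, limit_seconds) else 1
```

A wrapper script would end with something like `sys.exit(interestingness_exit_code(sys.argv[1]))`. The limit has to sit between the fast and slow timings observed in step 1, otherwise the reduction either stops immediately or accepts everything.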
Sean Silva via llvm-dev
2016-Dec-18 03:19 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Sat, Dec 17, 2016 at 5:19 PM, Michael Gottesman via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > > On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > > > First of all, sorry for the long mail. > > Inspired by the excellent analysis Rui did for lld, I decided to do > > the same for llvm. > > I'm personally very interested in build-time for LTO configuration, > > with particular attention to the time spent in the optimizer. > > Rafael did something similar back in March, so this can be considered > > as an update. This tries to include a more accurate high-level > > analysis of where llvm is spending CPU cycles. > > Here I present 2 cases: clang building itself with `-flto` (Full), and > > clang building an internal codebase which I'm going to refer as > > `game7`. > > It's a mid-sized program (it's actually a game), more or less of the > > size of clang, which we use internally as benchmark to track > > compile-time/runtime improvements/regression. > > I picked two random revisions of llvm: trunk (December 16th 2016) and > > trunk (June 2nd 2016), so, roughly, 6 months period. > > My setup is a Mac Pro running Linux (NixOS). > > These are the numbers I collected (including the output of -mllvm > -time-passes). 
> > For clang: > > > > June 2nd: > > real 22m9.278s > > user 21m30.410s > > sys 0m38.834s > > Total Execution Time: 1270.4795 seconds (1269.1330 wall clock) > > 289.8102 ( 23.5%) 18.8891 ( 53.7%) 308.6993 ( 24.3%) 308.6906 ( > > 24.3%) X86 DAG->DAG Instruction Selection > > 97.2730 ( 7.9%) 0.7656 ( 2.2%) 98.0386 ( 7.7%) 98.0010 ( > > 7.7%) Global Value Numbering > > 62.4091 ( 5.1%) 0.4779 ( 1.4%) 62.8870 ( 4.9%) 62.8665 ( > > 5.0%) Function Integration/Inlining > > 58.6923 ( 4.8%) 0.4767 ( 1.4%) 59.1690 ( 4.7%) 59.1323 ( > > 4.7%) Combine redundant instructions > > 53.9602 ( 4.4%) 0.6163 ( 1.8%) 54.5765 ( 4.3%) 54.5409 ( > > 4.3%) Combine redundant instructions > > 51.0470 ( 4.1%) 0.5703 ( 1.6%) 51.6173 ( 4.1%) 51.5425 ( > > 4.1%) Loop Strength Reduction > > 47.4067 ( 3.8%) 1.3040 ( 3.7%) 48.7106 ( 3.8%) 48.7034 ( > > 3.8%) Greedy Register Allocator > > 36.7463 ( 3.0%) 0.8133 ( 2.3%) 37.5597 ( 3.0%) 37.4612 ( > > 3.0%) Induction Variable Simplification > > 37.0125 ( 3.0%) 0.2699 ( 0.8%) 37.2824 ( 2.9%) 37.2478 ( > > 2.9%) Combine redundant instructions > > 34.2071 ( 2.8%) 0.2737 ( 0.8%) 34.4808 ( 2.7%) 34.4487 ( > > 2.7%) Combine redundant instructions > > 25.6627 ( 2.1%) 0.3215 ( 0.9%) 25.9842 ( 2.0%) 25.9509 ( > > 2.0%) Combine redundant instructions > > > > Dec 16th: > > real 27m34.922s > > user 26m53.489s > > sys 0m41.533s > > > > 287.5683 ( 18.5%) 19.7048 ( 52.3%) 307.2731 ( 19.3%) 307.2648 ( > > 19.3%) X86 DAG->DAG Instruction Selection > > 197.9211 ( 12.7%) 0.5104 ( 1.4%) 198.4314 ( 12.5%) 198.4091 ( > > 12.5%) Function Integration/Inlining > > 106.9669 ( 6.9%) 0.8316 ( 2.2%) 107.7984 ( 6.8%) 107.7633 ( > > 6.8%) Global Value Numbering > > 89.7571 ( 5.8%) 0.4840 ( 1.3%) 90.2411 ( 5.7%) 90.2067 ( > > 5.7%) Combine redundant instructions > > 79.0456 ( 5.1%) 0.7534 ( 2.0%) 79.7990 ( 5.0%) 79.7630 ( > > 5.0%) Combine redundant instructions > > 55.6393 ( 3.6%) 0.3116 ( 0.8%) 55.9509 ( 3.5%) 55.9187 ( > > 3.5%) Combine redundant instructions > > 51.8663 ( 3.3%) 
1.4090 ( 3.7%) 53.2754 ( 3.3%) 53.2684 ( > > 3.3%) Greedy Register Allocator > > 52.5721 ( 3.4%) 0.3021 ( 0.8%) 52.8743 ( 3.3%) 52.8416 ( > > 3.3%) Combine redundant instructions > > 49.0593 ( 3.2%) 0.6101 ( 1.6%) 49.6694 ( 3.1%) 49.5904 ( > > 3.1%) Loop Strength Reduction > > 41.2602 ( 2.7%) 0.9608 ( 2.5%) 42.2209 ( 2.7%) 42.1122 ( > > 2.6%) Induction Variable Simplification > > 38.1438 ( 2.5%) 0.3486 ( 0.9%) 38.4923 ( 2.4%) 38.4603 ( > > 2.4%) Combine redundant instructions > > > > so, llvm is around 20% slower than it used to be. > > > > For our internal codebase the situation seems slightly worse: > > > > `game7` > > > > June 2nd: > > > > Total Execution Time: 464.3920 seconds (463.8986 wall clock) > > > > 88.0204 ( 20.3%) 6.0310 ( 20.0%) 94.0514 ( 20.3%) 94.0473 ( > > 20.3%) X86 DAG->DAG Instruction Selection > > 27.4382 ( 6.3%) 16.2437 ( 53.9%) 43.6819 ( 9.4%) 43.6823 ( > > 9.4%) X86 Assembly / Object Emitter > > 34.9581 ( 8.1%) 0.5274 ( 1.8%) 35.4855 ( 7.6%) 35.4679 ( > > 7.6%) Function Integration/Inlining > > 27.8556 ( 6.4%) 0.3419 ( 1.1%) 28.1975 ( 6.1%) 28.1824 ( > > 6.1%) Global Value Numbering > > 22.1479 ( 5.1%) 0.2258 ( 0.7%) 22.3737 ( 4.8%) 22.3593 ( > > 4.8%) Combine redundant instructions > > 19.2346 ( 4.4%) 0.3639 ( 1.2%) 19.5985 ( 4.2%) 19.5870 ( > > 4.2%) Post RA top-down list latency scheduler > > 15.8085 ( 3.6%) 0.2675 ( 0.9%) 16.0760 ( 3.5%) 16.0614 ( > > 3.5%) Combine redundant instructions > > > > Dec 16th: > > > > Total Execution Time: 861.0898 seconds (860.5808 wall clock) > > > > 135.7207 ( 15.7%) 0.2484 ( 0.8%) 135.9692 ( 15.2%) 135.9531 ( > > 15.2%) Combine redundant instructions > > 103.6609 ( 12.0%) 0.4566 ( 1.4%) 104.1175 ( 11.7%) 104.1014 ( > > 11.7%) Combine redundant instructions > > 97.1083 ( 11.3%) 6.9183 ( 21.8%) 104.0266 ( 11.6%) 104.0181 ( > > 11.6%) X86 DAG->DAG Instruction Selection > > 72.6125 ( 8.4%) 0.1701 ( 0.5%) 72.7826 ( 8.1%) 72.7678 ( > > 8.1%) Combine redundant instructions > > 69.2144 ( 8.0%) 0.6060 ( 1.9%) 
69.8204 ( 7.8%) 69.8007 ( > > 7.8%) Function Integration/Inlining > > 60.7837 ( 7.1%) 0.3783 ( 1.2%) 61.1620 ( 6.8%) 61.1455 ( > > 6.8%) Global Value Numbering > > 56.5650 ( 6.6%) 0.1980 ( 0.6%) 56.7630 ( 6.4%) 56.7476 ( > > 6.4%) Combine redundant instructions > > > > so, using LTO, lld takes 2x to build what it used to take (and all the > > extra time seems spent in the optimizer). > > > > As an (extra) experiment, I decided to take the unoptimized output of > > game7 (via lld -save-temps) and pass to -opt -O2. That shows another > > significant regression (with different characteristics). > > > > June 2nd: > > time opt -O2 > > real 6m23.016s > > user 6m20.900s > > sys 0m2.113s > > > > 35.9071 ( 10.0%) 0.7996 ( 10.9%) 36.7066 ( 10.0%) 36.6900 ( 10.1%) > > Function Integration/Inlining > > 33.4045 ( 9.3%) 0.4053 ( 5.5%) 33.8098 ( 9.3%) 33.7919 ( 9.3%) > > Global Value Numbering > > 27.1053 ( 7.6%) 0.5940 ( 8.1%) 27.6993 ( 7.6%) 27.6995 ( 7.6%) > > Bitcode Writer > > 25.6492 ( 7.2%) 0.2491 ( 3.4%) 25.8984 ( 7.1%) 25.8805 ( 7.1%) > > Combine redundant instructions > > 19.2686 ( 5.4%) 0.2956 ( 4.0%) 19.5642 ( 5.4%) 19.5471 ( 5.4%) > > Combine redundant instructions > > 18.6697 ( 5.2%) 0.2625 ( 3.6%) 18.9323 ( 5.2%) 18.9148 ( 5.2%) > > Combine redundant instructions > > 16.1294 ( 4.5%) 0.2320 ( 3.2%) 16.3614 ( 4.5%) 16.3434 ( 4.5%) > > Combine redundant instructions > > 13.5476 ( 3.8%) 0.3945 ( 5.4%) 13.9421 ( 3.8%) 13.9295 ( 3.8%) > > Combine redundant instructions > > 13.1746 ( 3.7%) 0.1767 ( 2.4%) 13.3512 ( 3.7%) 13.3405 ( 3.7%) > > Combine redundant instructions > > > > Dec 16th: > > > > real 20m10.734s > > user 20m8.523s > > sys 0m2.197s > > > > 208.8113 ( 17.6%) 0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 ( > > 17.5%) Value Propagation > > 179.6863 ( 15.2%) 0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 ( > > 15.1%) Value Propagation > > 92.0158 ( 7.8%) 0.2674 ( 3.0%) 92.2832 ( 7.7%) 92.2613 ( > > 7.7%) Combine redundant instructions > > 72.3330 ( 6.1%) 0.6026 ( 6.7%) 
72.9356 ( 6.1%) 72.9210 ( > > 6.1%) Combine redundant instructions > > 72.2505 ( 6.1%) 0.2167 ( 2.4%) 72.4672 ( 6.1%) 72.4539 ( > > 6.1%) Combine redundant instructions > > 66.6765 ( 5.6%) 0.3482 ( 3.9%) 67.0247 ( 5.6%) 67.0040 ( > > 5.6%) Combine redundant instructions > > 65.5029 ( 5.5%) 0.4092 ( 4.5%) 65.9121 ( 5.5%) 65.8913 ( > > 5.5%) Combine redundant instructions > > 61.8355 ( 5.2%) 0.8150 ( 9.0%) 62.6505 ( 5.2%) 62.6315 ( > > 5.2%) Function Integration/Inlining > > 54.9184 ( 4.6%) 0.3359 ( 3.7%) 55.2543 ( 4.6%) 55.2332 ( > > 4.6%) Combine redundant instructions > > 50.2597 ( 4.2%) 0.2187 ( 2.4%) 50.4784 ( 4.2%) 50.4654 ( > > 4.2%) Combine redundant instructions > > 47.2597 ( 4.0%) 0.3719 ( 4.1%) 47.6316 ( 4.0%) 47.6105 ( > > 4.0%) Global Value Numbering > > > > I don't have an infrastructure to measure the runtime performance > > benefits/regression of clang, but I have for `game7`. > > I wasn't able to notice any fundamental speedup (at least, not > > something that justifies a 2x build-time). > > > > tl;dr: > > There are quite a few things to notice: > > 1) GVN used to be the top pass in the middle-end, in some cases, and > > pretty much always in the top-3. This is not the case anymore, but > > it's still a pass where we spend a lot of time. This is being worked > > on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's > > some hope that will be sorted out (or at least there's a plan for it). > > 2) For clang, we spend 35% more time inside instcombine, and for game7 > > instcombine seems to largely dominate the amount of time we spend > > optimizing IR. I tried to bisect (which is not easy considering the > > test takes a long time to run), but I wasn't able to identify a single > > point in time responsible for the regression. It seems to be an > > additive effect. My wild (or not so wild) guess is that every day > > somebody adds a matcher of two because that improves their testcase, > > and at some point all this things add up. 
I'll try to do some > > additional profiling but I guess large part of our time is spent > > solving bitwise-domain dataflow problems (ComputeKnownBits et > > similia). Once GVN will be in a more stable state, I plan to > > experiment with caching results.
>
> We have seen a similar thing when compiling the swift standard library. We
> have talked about making a small simple instcombine pass that doesn't
> iterate to a fixed point (IIRC). IIRC Andy/Mark looked at this (so my
> memory might be wrong).

I would expect that if the slowdown were an increase in time spent iterating to a fixed point, we would also be matching more patterns and simplifying the code more, but the small performance delta on game7 doesn't seem to corroborate that.

One issue with instcombine is that its matching time (even if it doesn't find anything) is still O(# of patterns tested). That is very consistent with a slowdown model where, as new checks are added, it gradually gets slower. Approaches like Nuno's Alive avoid this overhead by building a matching automaton. From the above, it sounds like we're spending tens of percent of our time in instcombine, so it may be worth it.

Without having thought about it too closely, my gut reaction is that I don't think it's a good idea to make an "instcombine V2"; incrementally improving the one we already have (e.g. moving a large portion of its matches into an automaton-based matcher; or conditionalizing (deleting?) checks that aren't worth it) seems like a better long-term approach.

-- Sean Silva

> > 3) Something peculiar is that we spend 2x time in the inliner. I'm not > > familiar with the inliner, IIRC there were some changes to threshold > > recently, so any help here will be appreciated (both in reproducing > > the results and with analysis). > > 4) For the last testcase (opt -O2 on large unoptimized chunk of > > bitcode) llvm spends 33% of its time in CVP, and very likely in LVI.
I > > think it's not as lazy as it claims to be (or at least, the way we use > > it). This doesn't show up in a full LTO run because we don't run CVP > > as part of the default LTO pipeline, but the case shows how LVI can be > > a bottleneck for large TUs (or at least how -O2 compile time degraded > > on large TUs). I haven't thought about the problem very carefully, but > > there seems to be some progress on this front ( > > https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the > > original bitcode file but I can probably do some profiling on it as > > well. > > > > As next steps I'll try to get a more detailed analysis of the > > problems. In particular, try to do what Rui did for lld but with more > > coarse granularity (every week) to have a chart of the compile time > > trend for these cases over the last 6 months, and post here. > > > > I think (I know) some people are aware of the problems I outline in > > this e-mail. But apparently not everybody. We're in a situation where > > compile time is increasing without real control. I'm happy that Apple > > is doing a serious effort to track build-time, so hopefully things > > will improve. There are, although, some cases (like adding matchers in > > instcombine or knobs) where the compile time regression is hard to > > track until it's too late. LLVM as a project tries not to stop people > > trying to get things done and that's great, but from time to time it's > > good to take a step back and re-evaluate approaches. > > The purpose of this e-mail was to outline where we regressed, for > > those interested. > > > > Thanks for your time, and of course, feedback welcome! 
> > > > -- > > Davide > > > > "There are no solved problems; there are only problems that are more > > or less solved" -- Henri Poincare > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
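To make the matching-cost argument in the reply above a bit more concrete, here is a minimal sketch. It is a toy model, not InstCombine's or Alive's actual code: the `Pattern` class, the tuple-shaped "instructions", and both matcher functions are hypothetical names invented for illustration. It only counts how many patterns get tested for one instruction when every pattern is tried in turn, versus when patterns are first bucketed by root opcode (a much cruder device than a real matching automaton, but it shows the same trend).

```python
from collections import defaultdict


class Pattern:
    """Hypothetical rewrite pattern: a root opcode plus a predicate."""

    def __init__(self, name, opcode, predicate):
        self.name = name
        self.opcode = opcode
        self.predicate = predicate


def make_patterns(n_per_opcode):
    """Build n_per_opcode toy patterns for each of 8 opcodes."""
    opcodes = ["add", "sub", "mul", "and", "or", "xor", "shl", "lshr"]
    patterns = []
    for op in opcodes:
        for k in range(n_per_opcode):
            # Each pattern fires only when the second operand equals 1000 + k.
            patterns.append(
                Pattern(f"{op}_{k}", op, lambda inst, c=1000 + k: inst[2] == c)
            )
    return patterns


def match_linear(inst, patterns):
    """Try every pattern in order. Returns (matched name or None, tests run)."""
    tests = 0
    for p in patterns:
        tests += 1
        if p.opcode == inst[0] and p.predicate(inst):
            return p.name, tests
    return None, tests


def build_index(patterns):
    """Bucket patterns by root opcode once, up front."""
    index = defaultdict(list)
    for p in patterns:
        index[p.opcode].append(p)
    return index


def match_indexed(inst, index):
    """Only test patterns whose root opcode matches the instruction."""
    tests = 0
    for p in index.get(inst[0], []):
        tests += 1
        if p.predicate(inst):
            return p.name, tests
    return None, tests


if __name__ == "__main__":
    inst = ("xor", "x", 7)  # an instruction that no pattern matches
    for n in (5, 50):
        patterns = make_patterns(n)
        _, linear_tests = match_linear(inst, patterns)
        _, indexed_tests = match_indexed(inst, build_index(patterns))
        print(f"{n} patterns/opcode: linear={linear_tests} indexed={indexed_tests}")
```

In this toy, the cost of a failed match grows with the total number of patterns in the linear scheme, but only with the number of patterns for that opcode in the indexed one. Whether real instcombine time is dominated by this effect is exactly what the thread says still needs profiling.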
Sean Silva via llvm-dev
2016-Dec-18 03:25 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano via llvm-dev < llvm-dev at lists.llvm.org> wrote:> First of all, sorry for the long mail. > Inspired by the excellent analysis Rui did for lld, I decided to do > the same for llvm. > I'm personally very interested in build-time for LTO configuration, > with particular attention to the time spent in the optimizer. > Rafael did something similar back in March, so this can be considered > as an update. This tries to include a more accurate high-level > analysis of where llvm is spending CPU cycles. > Here I present 2 cases: clang building itself with `-flto` (Full), and > clang building an internal codebase which I'm going to refer as > `game7`. > It's a mid-sized program (it's actually a game), more or less of the > size of clang, which we use internally as benchmark to track > compile-time/runtime improvements/regression. > I picked two random revisions of llvm: trunk (December 16th 2016) and > trunk (June 2nd 2016), so, roughly, 6 months period. > My setup is a Mac Pro running Linux (NixOS). > These are the numbers I collected (including the output of -mllvm > -time-passes). 
> For clang: > > June 2nd: > real 22m9.278s > user 21m30.410s > sys 0m38.834s > Total Execution Time: 1270.4795 seconds (1269.1330 wall clock) > 289.8102 ( 23.5%) 18.8891 ( 53.7%) 308.6993 ( 24.3%) 308.6906 ( > 24.3%) X86 DAG->DAG Instruction Selection > 97.2730 ( 7.9%) 0.7656 ( 2.2%) 98.0386 ( 7.7%) 98.0010 ( > 7.7%) Global Value Numbering > 62.4091 ( 5.1%) 0.4779 ( 1.4%) 62.8870 ( 4.9%) 62.8665 ( > 5.0%) Function Integration/Inlining > 58.6923 ( 4.8%) 0.4767 ( 1.4%) 59.1690 ( 4.7%) 59.1323 ( > 4.7%) Combine redundant instructions > 53.9602 ( 4.4%) 0.6163 ( 1.8%) 54.5765 ( 4.3%) 54.5409 ( > 4.3%) Combine redundant instructions > 51.0470 ( 4.1%) 0.5703 ( 1.6%) 51.6173 ( 4.1%) 51.5425 ( > 4.1%) Loop Strength Reduction > 47.4067 ( 3.8%) 1.3040 ( 3.7%) 48.7106 ( 3.8%) 48.7034 ( > 3.8%) Greedy Register Allocator > 36.7463 ( 3.0%) 0.8133 ( 2.3%) 37.5597 ( 3.0%) 37.4612 ( > 3.0%) Induction Variable Simplification > 37.0125 ( 3.0%) 0.2699 ( 0.8%) 37.2824 ( 2.9%) 37.2478 ( > 2.9%) Combine redundant instructions > 34.2071 ( 2.8%) 0.2737 ( 0.8%) 34.4808 ( 2.7%) 34.4487 ( > 2.7%) Combine redundant instructions > 25.6627 ( 2.1%) 0.3215 ( 0.9%) 25.9842 ( 2.0%) 25.9509 ( > 2.0%) Combine redundant instructions > > Dec 16th: > real 27m34.922s > user 26m53.489s > sys 0m41.533s > > 287.5683 ( 18.5%) 19.7048 ( 52.3%) 307.2731 ( 19.3%) 307.2648 ( > 19.3%) X86 DAG->DAG Instruction Selection > 197.9211 ( 12.7%) 0.5104 ( 1.4%) 198.4314 ( 12.5%) 198.4091 ( > 12.5%) Function Integration/Inlining > 106.9669 ( 6.9%) 0.8316 ( 2.2%) 107.7984 ( 6.8%) 107.7633 ( > 6.8%) Global Value Numbering > 89.7571 ( 5.8%) 0.4840 ( 1.3%) 90.2411 ( 5.7%) 90.2067 ( > 5.7%) Combine redundant instructions > 79.0456 ( 5.1%) 0.7534 ( 2.0%) 79.7990 ( 5.0%) 79.7630 ( > 5.0%) Combine redundant instructions > 55.6393 ( 3.6%) 0.3116 ( 0.8%) 55.9509 ( 3.5%) 55.9187 ( > 3.5%) Combine redundant instructions > 51.8663 ( 3.3%) 1.4090 ( 3.7%) 53.2754 ( 3.3%) 53.2684 ( > 3.3%) Greedy Register Allocator > 52.5721 ( 3.4%) 
0.3021 ( 0.8%) 52.8743 ( 3.3%) 52.8416 ( > 3.3%) Combine redundant instructions > 49.0593 ( 3.2%) 0.6101 ( 1.6%) 49.6694 ( 3.1%) 49.5904 ( > 3.1%) Loop Strength Reduction > 41.2602 ( 2.7%) 0.9608 ( 2.5%) 42.2209 ( 2.7%) 42.1122 ( > 2.6%) Induction Variable Simplification > 38.1438 ( 2.5%) 0.3486 ( 0.9%) 38.4923 ( 2.4%) 38.4603 ( > 2.4%) Combine redundant instructions > > so, llvm is around 20% slower than it used to be. > > For our internal codebase the situation seems slightly worse: > > `game7` > > June 2nd: > > Total Execution Time: 464.3920 seconds (463.8986 wall clock) > > 88.0204 ( 20.3%) 6.0310 ( 20.0%) 94.0514 ( 20.3%) 94.0473 ( > 20.3%) X86 DAG->DAG Instruction Selection > 27.4382 ( 6.3%) 16.2437 ( 53.9%) 43.6819 ( 9.4%) 43.6823 ( > 9.4%) X86 Assembly / Object Emitter > 34.9581 ( 8.1%) 0.5274 ( 1.8%) 35.4855 ( 7.6%) 35.4679 ( > 7.6%) Function Integration/Inlining > 27.8556 ( 6.4%) 0.3419 ( 1.1%) 28.1975 ( 6.1%) 28.1824 ( > 6.1%) Global Value Numbering > 22.1479 ( 5.1%) 0.2258 ( 0.7%) 22.3737 ( 4.8%) 22.3593 ( > 4.8%) Combine redundant instructions > 19.2346 ( 4.4%) 0.3639 ( 1.2%) 19.5985 ( 4.2%) 19.5870 ( > 4.2%) Post RA top-down list latency scheduler > 15.8085 ( 3.6%) 0.2675 ( 0.9%) 16.0760 ( 3.5%) 16.0614 ( > 3.5%) Combine redundant instructions > > Dec 16th: > > Total Execution Time: 861.0898 seconds (860.5808 wall clock) > > 135.7207 ( 15.7%) 0.2484 ( 0.8%) 135.9692 ( 15.2%) 135.9531 ( > 15.2%) Combine redundant instructions > 103.6609 ( 12.0%) 0.4566 ( 1.4%) 104.1175 ( 11.7%) 104.1014 ( > 11.7%) Combine redundant instructions > 97.1083 ( 11.3%) 6.9183 ( 21.8%) 104.0266 ( 11.6%) 104.0181 ( > 11.6%) X86 DAG->DAG Instruction Selection > 72.6125 ( 8.4%) 0.1701 ( 0.5%) 72.7826 ( 8.1%) 72.7678 ( > 8.1%) Combine redundant instructions > 69.2144 ( 8.0%) 0.6060 ( 1.9%) 69.8204 ( 7.8%) 69.8007 ( > 7.8%) Function Integration/Inlining > 60.7837 ( 7.1%) 0.3783 ( 1.2%) 61.1620 ( 6.8%) 61.1455 ( > 6.8%) Global Value Numbering > 56.5650 ( 6.6%) 0.1980 ( 0.6%) 
56.7630 ( 6.4%) 56.7476 ( > 6.4%) Combine redundant instructions > > so, using LTO, lld takes 2x to build what it used to take (and all the > extra time seems spent in the optimizer). > > As an (extra) experiment, I decided to take the unoptimized output of > game7 (via lld -save-temps) and pass to -opt -O2. That shows another > significant regression (with different characteristics). > > June 2nd: > time opt -O2 > real 6m23.016s > user 6m20.900s > sys 0m2.113s > > 35.9071 ( 10.0%) 0.7996 ( 10.9%) 36.7066 ( 10.0%) 36.6900 ( 10.1%) > Function Integration/Inlining > 33.4045 ( 9.3%) 0.4053 ( 5.5%) 33.8098 ( 9.3%) 33.7919 ( 9.3%) > Global Value Numbering > 27.1053 ( 7.6%) 0.5940 ( 8.1%) 27.6993 ( 7.6%) 27.6995 ( 7.6%) > Bitcode Writer > 25.6492 ( 7.2%) 0.2491 ( 3.4%) 25.8984 ( 7.1%) 25.8805 ( 7.1%) > Combine redundant instructions > 19.2686 ( 5.4%) 0.2956 ( 4.0%) 19.5642 ( 5.4%) 19.5471 ( 5.4%) > Combine redundant instructions > 18.6697 ( 5.2%) 0.2625 ( 3.6%) 18.9323 ( 5.2%) 18.9148 ( 5.2%) > Combine redundant instructions > 16.1294 ( 4.5%) 0.2320 ( 3.2%) 16.3614 ( 4.5%) 16.3434 ( 4.5%) > Combine redundant instructions > 13.5476 ( 3.8%) 0.3945 ( 5.4%) 13.9421 ( 3.8%) 13.9295 ( 3.8%) > Combine redundant instructions > 13.1746 ( 3.7%) 0.1767 ( 2.4%) 13.3512 ( 3.7%) 13.3405 ( 3.7%) > Combine redundant instructions > > Dec 16th: > > real 20m10.734s > user 20m8.523s > sys 0m2.197s > > 208.8113 ( 17.6%) 0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 ( > 17.5%) Value Propagation > 179.6863 ( 15.2%) 0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 ( > 15.1%) Value Propagation > 92.0158 ( 7.8%) 0.2674 ( 3.0%) 92.2832 ( 7.7%) 92.2613 ( > 7.7%) Combine redundant instructions > 72.3330 ( 6.1%) 0.6026 ( 6.7%) 72.9356 ( 6.1%) 72.9210 ( > 6.1%) Combine redundant instructions > 72.2505 ( 6.1%) 0.2167 ( 2.4%) 72.4672 ( 6.1%) 72.4539 ( > 6.1%) Combine redundant instructions > 66.6765 ( 5.6%) 0.3482 ( 3.9%) 67.0247 ( 5.6%) 67.0040 ( > 5.6%) Combine redundant instructions > 65.5029 ( 5.5%) 0.4092 ( 
4.5%) 65.9121 ( 5.5%) 65.8913 ( > 5.5%) Combine redundant instructions > 61.8355 ( 5.2%) 0.8150 ( 9.0%) 62.6505 ( 5.2%) 62.6315 ( > 5.2%) Function Integration/Inlining > 54.9184 ( 4.6%) 0.3359 ( 3.7%) 55.2543 ( 4.6%) 55.2332 ( > 4.6%) Combine redundant instructions > 50.2597 ( 4.2%) 0.2187 ( 2.4%) 50.4784 ( 4.2%) 50.4654 ( > 4.2%) Combine redundant instructions > 47.2597 ( 4.0%) 0.3719 ( 4.1%) 47.6316 ( 4.0%) 47.6105 ( > 4.0%) Global Value Numbering > > I don't have an infrastructure to measure the runtime performance > benefits/regression of clang, but I have for `game7`. > I wasn't able to notice any fundamental speedup (at least, not > something that justifies a 2x build-time). > > tl;dr: > There are quite a few things to notice: > 1) GVN used to be the top pass in the middle-end, in some cases, and > pretty much always in the top-3. This is not the case anymore, but > it's still a pass where we spend a lot of time. This is being worked > on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's > some hope that will be sorted out (or at least there's a plan for it). > 2) For clang, we spend 35% more time inside instcombine, and for game7 > instcombine seems to largely dominate the amount of time we spend > optimizing IR. I tried to bisect (which is not easy considering the > test takes a long time to run), but I wasn't able to identify a single > point in time responsible for the regression. It seems to be an > additive effect. My wild (or not so wild) guess is that every day > somebody adds a matcher of two because that improves their testcase, > and at some point all this things add up. I'll try to do some > additional profiling but I guess large part of our time is spent > solving bitwise-domain dataflow problems (ComputeKnownBits et > similia). Once GVN will be in a more stable state, I plan to > experiment with caching results. > 3) Something peculiar is that we spend 2x time in the inliner. 
I'm not > familiar with the inliner, IIRC there were some changes to threshold > recently, so any help here will be appreciated (both in reproducing > the results and with analysis). > 4) For the last testcase (opt -O2 on large unoptimized chunk of > bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I > think it's not as lazy as it claims to be (or at least, the way we use > it). This doesn't show up in a full LTO run because we don't run CVP > as part of the default LTO pipeline, but the case shows how LVI can be > a bottleneck for large TUs (or at least how -O2 compile time degraded > on large TUs). I haven't thought about the problem very carefully, but > there seems to be some progress on this front ( > https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the > original bitcode file but I can probably do some profiling on it as > well.

LVI is one of those analyses with quadratic runtime, but it has a cutoff on its search depth, so it is technically not quadratic. Increased inlining could therefore easily hurt it more than it hurts non-"quadratic" passes. (Increased inlining would also cause a general slowdown.)

-- Sean Silva

> As next steps I'll try to get a more detailed analysis of the > problems. In particular, try to do what Rui did for lld but with more > coarse granularity (every week) to have a chart of the compile time > trend for these cases over the last 6 months, and post here. > > I think (I know) some people are aware of the problems I outline in > this e-mail. But apparently not everybody. We're in a situation where > compile time is increasing without real control. I'm happy that Apple > is doing a serious effort to track build-time, so hopefully things > will improve. There are, although, some cases (like adding matchers in > instcombine or knobs) where the compile time regression is hard to > track until it's too late.
LLVM as a project tries not to stop people > trying to get things done and that's great, but from time to time it's > good to take a step back and re-evaluate approaches. > The purpose of this e-mail was to outline where we regressed, for > those interested. > > Thanks for your time, and of course, feedback welcome! > > -- > Davide > > "There are no solved problems; there are only problems that are more > or less solved" -- Henri Poincare
Sean Silva via llvm-dev
2016-Dec-18 03:41 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at freebsd.org> wrote:
> [...]
> > I don't have an infrastructure to measure the runtime performance
> > benefits/regression of clang, but I have for `game7`.
> > I wasn't able to notice any fundamental speedup (at least, not
> > something that justifies a 2x build-time).
>
> Just to provide numbers (using Sean's Mathematica template, thanks),
> here's a plot of the CDF of the frames in a particular level/scene.
> The curves pretty much match, and the picture in the middle shows a
> relative difference of 0.4% between Jun and Dec (which could very
> possibly be in the noise).

With 5-10 runs per binary you should be able to reliably measure down to a 0.5% difference on game7 (a 50 usec difference per frame). With just one run per binary, as you have here, I would not trust a 0.4% difference.

-- Sean Silva

> On the same scene, the difference between -O3 and -flto is 12%, FWIW.
> So it seems that at least for this particular case all the compile
> time didn't buy any runtime improvement.
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
Davide Italiano via llvm-dev
2016-Dec-18 03:48 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Dec 17, 2016 7:41 PM, "Sean Silva" <chisophugis at gmail.com> wrote:
> On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at freebsd.org> wrote:
> > [...]
> > > I don't have an infrastructure to measure the runtime performance
> > > benefits/regression of clang, but I have for `game7`.
> > > I wasn't able to notice any fundamental speedup (at least, not
> > > something that justifies a 2x build-time).
> >
> > Just to provide numbers (using Sean's Mathematica template, thanks),
> > here's a plot of the CDF of the frames in a particular level/scene.
> > The curves pretty much match, and the picture in the middle shows a
> > relative difference of 0.4% between Jun and Dec (which could very
> > possibly be in the noise).
>
> With 5-10 runs per binary you should be able to reliably measure down to
> 0.5% difference on game7 (50 usec difference per frame). With just one run
> per binary like you have here I would not trust a 0.4% difference.

Yeah, I agree. This was mostly a sanity check to understand whether there was a significant improvement at runtime. I ran the same test 7 times in the last hour, but the difference is always in the noise, FWIW.

> -- Sean Silva
>
> > On the same scene, the difference between -O3 and -flto is 12%, FWIW.
> > So it seems that at least for this particular case all the compile
> > time didn't buy any runtime improvement.
> >
> > --
> > Davide
> >
> > "There are no solved problems; there are only problems that are more
> > or less solved" -- Henri Poincare
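The point about repeated runs can be sketched as a small check. This is a minimal illustration under stated assumptions, not the Mathematica template mentioned above: the function name `compare_runs`, the threshold of two standard errors, and any numbers fed to it are all hypothetical. It takes one summary timing per run (for example, mean frame time) for each binary and reports whether the observed difference is larger than the run-to-run noise.

```python
import math
import statistics


def compare_runs(before, after, z=2.0):
    """Compare two sets of per-run timings (one number per run).

    Returns the relative difference of the means, the standard error of
    that difference, and whether the difference is within z standard
    errors (i.e. plausibly noise). Needs at least two runs per binary,
    because a single run gives no estimate of variance.
    """
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two runs per binary to estimate noise")

    mean_before = statistics.mean(before)
    mean_after = statistics.mean(after)

    # Standard error of the difference of two independent sample means.
    std_error = math.sqrt(
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    )

    diff = mean_after - mean_before
    return {
        "relative_diff": diff / mean_before,
        "std_error": std_error,
        "in_noise": abs(diff) <= z * std_error,
    }


if __name__ == "__main__":
    # Made-up frame times in milliseconds, five runs per binary.
    june = [10.00, 10.10, 9.90, 10.00, 10.05]
    december = [10.02, 10.08, 9.95, 10.03, 10.00]
    print(compare_runs(june, december))
```

With a single run per binary the function refuses to answer, which mirrors the objection above: without a variance estimate there is no way to tell a 0.4% difference from noise.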
Daniel Berlin via llvm-dev
2016-Dec-18 04:39 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
> LVI is one of those analyses with quadratic runtime, but has a cutoff to
> its search depth so that it is technically not quadratic. So increased
> inlining could easily exacerbate it more than non-"quadratic" passes.
> (increased inlining would also cause a general slowdown too).

LVI is only quadratic because of the way we've built it (it's actually worse than quadratic, but let's just normalize to "quadratic"). Non-lazy versions are not quadratic, and you could likely build an incremental lazy version that is also not quadratic in practice.

At some point, none of this will change unless people hold the line somewhere. Compilers usually get slower 0.1% at a time, not in huge leaps and bounds. Without people saying "If we really want to get this case, in a thing not really designed to get that case sanely, we need to stop and think about it", it doesn't get thought about.

Obviously, you can go too far into that extreme, but I think we are still too far on one side of that one (at least, in most places in LLVM).
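The "quadratic because of the way we've built it" remark can be illustrated with a toy model. To be clear about what is assumed: this is not LVI's algorithm, and `chain` and `count_visits` are hypothetical helpers. The sketch only counts how many basic blocks get visited when each query walks backwards through predecessors from scratch, when a depth cutoff is applied, and when results are shared between queries.

```python
def chain(n):
    """Toy CFG: n blocks in a straight line, block i has predecessor i - 1."""
    return {i: ([i - 1] if i > 0 else []) for i in range(n)}


def count_visits(preds, queries, use_cache=False, max_depth=None):
    """Count block visits over a sequence of backward-walking queries.

    use_cache: blocks solved by an earlier query are never revisited.
    max_depth: stop walking once a path is this many blocks long.
    """
    solved = set()
    visits = 0
    for block in queries:
        # Without a cache, every query starts with no prior knowledge.
        seen = solved if use_cache else set()
        stack = [(block, 1)]
        while stack:
            current, depth = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            visits += 1
            if max_depth is None or depth < max_depth:
                stack.extend((p, depth + 1) for p in preds[current])
    return visits


if __name__ == "__main__":
    cfg = chain(100)
    queries = list(range(100))  # one query per block
    print("from scratch:", count_visits(cfg, queries))
    print("depth cutoff:", count_visits(cfg, queries, max_depth=10))
    print("shared cache:", count_visits(cfg, queries, use_cache=True))
```

On the 100-block chain the from-scratch walk does work proportional to n squared, the cutoff caps each query at a constant (at the price of giving up on long paths), and the shared cache does work proportional to n. A real analysis also has to invalidate cached facts when the IR changes, which this toy ignores entirely.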
Philip Reames via llvm-dev
2016-Dec-18 19:10 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On 12/17/2016 01:35 PM, Davide Italiano via llvm-dev wrote:
> [...]
> As an (extra) experiment, I decided to take the unoptimized output of
> game7 (via lld -save-temps) and pass to -opt -O2. That shows another
> significant regression (with different characteristics).
> [...]
> Dec 16th:
>
> real 20m10.734s
> user 20m8.523s
> sys 0m2.197s
>
> 208.8113 ( 17.6%) 0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 ( 17.5%) Value Propagation
> 179.6863 ( 15.2%) 0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 ( 15.1%) Value Propagation
> [...]

I'd really like to see a profile that breaks down the time spent in Value Propagation and LVI. As the person who has touched both most recently, I am probably the responsible party. At the same time, there are a number of known fixes we can apply, depending on how this particular compile-time problem manifests. I'm happy to take a look at this particular issue if you can give me enough information to analyze it.

> > I don't have an infrastructure to measure the runtime performance > benefits/regression of clang, but I have for `game7`. > I wasn't able to notice any fundamental speedup (at least, not > something that justifies a 2x build-time). > > tl;dr: > There are quite a few things to notice: > 1) GVN used to be the top pass in the middle-end, in some cases, and > pretty much always in the top-3. This is not the case anymore, but > it's still a pass where we spend a lot of time. This is being worked > on by Daniel Berlin and me) https://reviews.llvm.org/D26224 so there's > some hope that will be sorted out (or at least there's a plan for it). > 2) For clang, we spend 35% more time inside instcombine, and for game7 > instcombine seems to largely dominate the amount of time we spend > optimizing IR. I tried to bisect (which is not easy considering the > test takes a long time to run), but I wasn't able to identify a single > point in time responsible for the regression. It seems to be an > additive effect.
My wild (or not so wild) guess is that every day > somebody adds a matcher of two because that improves their testcase, > and at some point all this things add up. I'll try to do some > additional profiling but I guess large part of our time is spent > solving bitwise-domain dataflow problems (ComputeKnownBits et > similia). Once GVN will be in a more stable state, I plan to > experiment with caching results. > 3) Something peculiar is that we spend 2x time in the inliner. I'm not > familiar with the inliner, IIRC there were some changes to threshold > recently, so any help here will be appreciated (both in reproducing > the results and with analysis). > 4) For the last testcase (opt -O2 on large unoptimized chunk of > bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I > think it's not as lazy as it claims to be (or at least, the way we use > it). This doesn't show up in a full LTO run because we don't run CVP > as part of the default LTO pipeline, but the case shows how LVI can be > a bottleneck for large TUs (or at least how -O2 compile time degraded > on large TUs). I haven't thought about the problem very carefully, but > there seems to be some progress on this front ( > https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the > original bitcode file but I can probably do some profiling on it as > well. > > As next steps I'll try to get a more detailed analysis of the > problems. In particular, try to do what Rui did for lld but with more > coarse granularity (every week) to have a chart of the compile time > trend for these cases over the last 6 months, and post here. > > I think (I know) some people are aware of the problems I outline in > this e-mail. But apparently not everybody. We're in a situation where > compile time is increasing without real control. I'm happy that Apple > is doing a serious effort to track build-time, so hopefully things > will improve. 
There are, although, some cases (like adding matchers in > instcombine or knobs) where the compile time regression is hard to > track until it's too late. LLVM as a project tries not to stop people > trying to get things done and that's great, but from time to time it's > good to take a step back and re-evaluate approaches. > The purpose of this e-mail was to outline where we regressed, for > those interested. > > Thanks for your time, and of course, feedback welcome! >
Davide Italiano via llvm-dev
2016-Dec-18 19:45 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Sun, Dec 18, 2016 at 11:10 AM, Philip Reames <listmail at philipreames.com> wrote:
> On 12/17/2016 01:35 PM, Davide Italiano via llvm-dev wrote:
>> [...]
>
> I'd really like to see a profile which broke down the time spent in Value
> Propagation and LVI. As the person who has touched both most recently, I am
> probably the responsible party. At the same time, there are a number of
> known fixes we can apply depending on the way this particular compile time
> problem exhibits. I'm happy to take a look at this particular issue if you
> can give me enough information to analyze it.
>

I'll try to reduce the testcase and do some profiling on it, then come back to you.

Something to keep in mind: I'm not pointing fingers. The main motivation behind this e-mail is to show that multiple passes in the compiler seem to have regressed on large testcases. Many people don't notice because they are (un)lucky enough not to run with LTO. There are probably other regressions that my non-exhaustive analysis did not uncover. I wasn't particularly pleased, after a fair amount of time working on GVN to clean up technical debt, to realize that the compiler got slower in other areas.

--
Davide

"There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare
via llvm-dev
2016-Dec-18 20:55 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.

From our own offline regression testing, one of the biggest culprits in our experience is Instcombine’s known bits calculation. A number of new known bits checks have been added in the past few years (e.g. to infer nuw, nsw, etc on various instructions) and the cost adds up quite a lot, because *the cost is paid even if Instcombine does nothing*, since it’s a significant cost on visiting every relevant instruction.

This IME is one of the greatest ways performance gets lost: a tiny bit at a time, whenever a new combine/transformation is added that is *expensive to test for*. The test has to be done every time no matter what (and instcombine gets called a lot!), so the cost adds up.

-- escha
Andrew Trick via llvm-dev
2016-Dec-18 21:53 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
> On Dec 17, 2016, at 5:19 PM, Michael Gottesman <mgottesman at apple.com> wrote:
>
>> 2) For clang, we spend 35% more time inside instcombine, and for game7
>> instcombine seems to largely dominate the amount of time we spend
>> optimizing IR. I tried to bisect (which is not easy considering the
>> test takes a long time to run), but I wasn't able to identify a single
>> point in time responsible for the regression. It seems to be an
>> additive effect. My wild (or not so wild) guess is that every day
>> somebody adds a matcher or two because that improves their testcase,
>> and at some point all these things add up. I'll try to do some
>> additional profiling but I guess a large part of our time is spent
>> solving bitwise-domain dataflow problems (ComputeKnownBits et
>> similia). Once GVN is in a more stable state, I plan to
>> experiment with caching results.
>
> We have seen a similar thing when compiling the swift standard library. We have talked about making a small simple instcombine pass that doesn't iterate to a fixed point (IIRC). IIRC Andy/Mark looked at this (so my memory might be wrong).

In that case we wanted a lighter-weight instcombine cleanup pass to run right after LLVM IR generation. But, in general it makes sense to separate cleanup and canonicalization vs. optimization.

I've always been strongly in favor of splitting InstCombine into a set of cheap, canonical transforms that run frequently, vs. optimizations that can run once later in the pipeline. It's particularly annoying when somewhat target-specific codegen optimizations get thrown into InstCombine; I'm not sure how prevalent that still is.

It would be a lot of work to go through all the patterns and figure out what they're needed for. But it might make sense to try reducing the number of times InstCombine is run, replace the earlier runs with an EarlyInstCombine, and gradually move the most important canonical transforms into EarlyInstCombine as they're needed.
-Andy
Sean Silva via llvm-dev
2016-Dec-18 23:55 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Sat, Dec 17, 2016 at 7:48 PM, Davide Italiano <davide at freebsd.org> wrote:

> On Dec 17, 2016 7:41 PM, "Sean Silva" <chisophugis at gmail.com> wrote:
>
> On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at freebsd.org>
>> wrote:
>> [...]
>>
>> > I don't have an infrastructure to measure the runtime performance
>> > benefits/regression of clang, but I have for `game7`.
>> > I wasn't able to notice any fundamental speedup (at least, not
>> > something that justifies a 2x build-time).
>> >
>>
>> Just to provide numbers (using Sean's Mathematica template, thanks),
>> here's a plot of the CDF of the frames in a particular level/scene.
>> The curves pretty much match, and the picture in the middle shows a
>> relative difference of 0.4% between Jun and Dec (which could
>> very possibly be in the noise).
>>
>
> With 5-10 runs per binary you should be able to reliably measure down to
> 0.5% difference on game7 (50 usec difference per frame). With just one run
> per binary like you have here I would not trust a 0.4% difference.
>
> Yeah, I agree. This was mostly a sanity check to understand if there was a
> significant improvement at runtime. I ran the same test 7 times in the last
> hour but the difference is always in the noise, FWIW.
>

The reason to add more runs is that you can measure performance differences smaller than the run to run variation for the same binary.

Most of the time, if you just plot 10 runs of each binary all on the same plot you can easily eyeball it to see that one is consistently faster. If you want to formalize this you could do something like:

1. Pair up each run of one binary with another run from the other binary.
2. See which (say) median is higher to get a boolean value of which is faster in that pair. (You can be fancier and compute an integral over e.g. the 20-80%'ile frame times; or you can investigate the tail by looking at the 99th percentile.)
3. Treat that as a coin toss (i.e., as a Bernoulli random variable with parameter p; a "fair coin" would be p=0.5).
4. Use Bayesian methods to propagate the "coin toss" observations backward to infer a distribution on the possible values of the parameter p of the Bernoulli random variable.

Step 4 is actually fairly simple; an example of how to do it is here:
http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_PyMC3.ipynb#Example:-Mandatory-coin-flip-example

Notice (in the picture on that page) how the prior assumption that p might be anywhere in [0,1] with uniform probability is refined as we incorporate the result of each coin flip. That's really all that "Bayesian" means: Bayes' theorem tells you how to incorporate the new evidence to update your estimated distribution (often called the "posterior" distribution). As we incorporate more coin tosses, we refine that initial distribution. For a fair coin, after enough tosses, the posterior distribution becomes very concentrated around p=0.5.

For the "coin flip", it's just a matter of plugging into a closed-form formula to get the distribution, but for more sophisticated models there is no closed-form formula. This is where MCMC libraries (like the one that "Probabilistic Programming and Bayesian Methods for Hackers" book shows how to use) come into play. E.g. note that in step 2 you are actually losing a lot of information (you reduce two frame time CDFs, thousands of data points each, to a single bit of information). Using an MCMC library you can have fine-grained control over the amount of detail in the model, and it can propagate your observations back to the parameters of your model in a continuous and information-preserving way (to the extent that you design your model to preserve information; there are tradeoffs between model accuracy and computation time, obviously).

(Step 1 also loses information, btw. You also lose information when you start looking at the frame time distribution, because you lose information about the ordering of the frames.)

-- Sean Silva

>
> -- Sean Silva
>
>> On the same scene, the difference between -O3 and -flto is 12%, FWIW.
>> So it seems that at least for this particular case all the compile
>> time didn't buy any runtime improvement.
>>
>> --
>> Davide
>>
>> "There are no solved problems; there are only problems that are more
>> or less solved" -- Henri Poincare
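
The closed-form "coin flip" case from steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not code from the thread: it assumes a uniform Beta(1, 1) prior, for which the posterior after `wins` successes in `trials` paired comparisons is Beta(1 + wins, 1 + trials - wins). The function names and the "9 of 10 pairs" data are hypothetical.

```python
import math


def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p, computed via log-gamma for stability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def posterior(wins, trials):
    """Parameters of the Beta posterior, starting from a uniform Beta(1, 1) prior."""
    return 1 + wins, 1 + trials - wins


def prob_p_greater_than_half(a, b, steps=20000):
    """P(p > 0.5) under Beta(a, b), by midpoint-rule integration over (0.5, 1)."""
    width = 0.5 / steps
    return sum(beta_pdf(0.5 + (i + 0.5) * width, a, b) for i in range(steps)) * width


# Hypothetical data: in 9 of 10 pairs, one binary had the lower median frame time.
a, b = posterior(wins=9, trials=10)
print(f"posterior mean of p: {a / (a + b):.3f}")
print(f"P(p > 0.5): {prob_p_greater_than_half(a, b):.3f}")
```

With an even split (say 5 wins out of 10) the posterior is symmetric around 0.5, so the same function reports no evidence either way.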
Matthias Braun via llvm-dev
2016-Dec-19 22:40 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
Just wanted to mention http://lab.llvm.org:8080/green/view/Compile%20Time/ again. We have set up continuous compile-time tracking of the CTMark testsuite in 4 popular flag combinations there (and yes, it's another example showing the slowdowns). It also has a feature to track regressions!

- Matthias

> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do
> the same for llvm.
> I'm personally very interested in build-time for LTO configuration,
> with particular attention to the time spent in the optimizer.
> Rafael did something similar back in March, so this can be considered
> as an update. This tries to include a more accurate high-level
> analysis of where llvm is spending CPU cycles.
> Here I present 2 cases: clang building itself with `-flto` (Full), and
> clang building an internal codebase which I'm going to refer as
> `game7`.
> It's a mid-sized program (it's actually a game), more or less of the
> size of clang, which we use internally as benchmark to track
> compile-time/runtime improvements/regression.
> I picked two random revisions of llvm: trunk (December 16th 2016) and
> trunk (June 2nd 2016), so, roughly, 6 months period.
> My setup is a Mac Pro running Linux (NixOS).
> These are the numbers I collected (including the output of -mllvm -time-passes).
> [...]
Mikhail Zolotukhin via llvm-dev
2016-Dec-19 23:55 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
> On Dec 17, 2016, at 7:48 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Dec 17, 2016 7:41 PM, "Sean Silva" <chisophugis at gmail.com> wrote:
>
> On Sat, Dec 17, 2016 at 6:32 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano <davide at freebsd.org> wrote:
> [...]
>
> > I don't have an infrastructure to measure the runtime performance
> > benefits/regression of clang, but I have for `game7`.
> > I wasn't able to notice any fundamental speedup (at least, not
> > something that justifies a 2x build-time).
> >
>
> Just to provide numbers (using Sean's Mathematica template, thanks),
> here's a plot of the CDF of the frames in a particular level/scene.
> The curves pretty much match, and the picture in the middle shows a
> relative difference of 0.4% between Jun and Dec (which could
> very possibly be in the noise).
>
> With 5-10 runs per binary you should be able to reliably measure down to 0.5% difference on game7 (50 usec difference per frame). With just one run per binary like you have here I would not trust a 0.4% difference.
>
> Yeah, I agree. This was mostly a sanity check to understand if there was a significant improvement at runtime. I ran the same test 7 times in the last hour but the difference is always in the noise, FWIW.

FWIW, opt has an option '-run-twice', which might be helpful if compile time of a particular unit is too small (it would be nice to change it to -run-n-times though to make it more flexible).

Michael

>
> -- Sean Silva
>
> On the same scene, the difference between -O3 and -flto is 12%, FWIW.
> So it seems that at least for this particular case all the compile
> time didn't buy any runtime improvement.
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
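
Until something like a `-run-n-times` option exists, repeated timing can be approximated from outside the tool. The following is a rough sketch under stated assumptions: `time_command` and `summarize` are hypothetical helper names, and the demo times a no-op Python process as a stand-in for a real compile command (for example an `opt -O2` invocation on a bitcode file of your choosing).

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run `cmd` `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Placeholder command: a no-op Python process stands in for the real
# compile step, e.g. ["opt", "-O2", "input.bc", "-o", "/dev/null"]
# (the file name here is hypothetical).
demo = time_command([sys.executable, "-c", "pass"], runs=3)
print(summarize(demo))
```

Reporting the median alongside min and max gives a quick feel for run-to-run variation before reaching for anything more formal.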
Mikhail Zolotukhin via llvm-dev
2016-Dec-20 00:24 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
Hi Davide,

Thanks for the analysis, it's really interesting! And I'm really glad that we now pay more and more attention to compile time!

Just recently I've been looking into historical compile time data as well, and have come to similar conclusions. The regressions you've found are probably caused by:
1) r289813 and r289855 - new matchers in InstCombine
2) r286814 and r288024 - changes in Inlining cost model

You can probably try reverting them locally to check if my hypothesis is correct.

I also looked at earlier data, and on top of the aforementioned issues I found the following causes of compile-time degradations (and sometimes improvements):
1) Changes in SCEV (r251049, r248637).
2) Changes in LoopUnrolling cost-model: adding a new unroll analyzer, changing Os thresholds, etc.
3) Adding new passes (e.g. LoopLoadElimination).

Not everything is recoverable, but there are still some opportunities lying here. For instance, I have some ideas on how to mitigate the compile time effect of the SCEV changes, and hopefully soon I'll submit a patch for it.

Changes in unrolling and inlining thresholds are different by their nature: we basically just changed thresholds under which the optimizations kick in. That inevitably will have negative impacts on some tests, while hopefully improving all tests on average. I think it's reasonable to ignore such 'regressions' provided that no test regressed by too much.

Adding a new pass also most likely increases compile time. Here all we can do is: 1) discuss the tradeoffs before we add it, 2) provide knobs to turn it off/tune.

Making light versions of passes (like the ones proposed in this thread for InstCombine) also sounds like a good idea. E.g. I thought about profiling InstCombine on some codebase to detect the most common matchers and disable other matchers under some flag.

Thanks,
Michael

> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> First of all, sorry for the long mail.
> Inspired by the excellent analysis Rui did for lld, I decided to do > the same for llvm. > I'm personally very interested in build-time for LTO configuration, > with particular attention to the time spent in the optimizer. > Rafael did something similar back in March, so this can be considered > as an update. This tries to include a more accurate high-level > analysis of where llvm is spending CPU cycles. > Here I present 2 cases: clang building itself with `-flto` (Full), and > clang building an internal codebase which I'm going to refer as > `game7`. > It's a mid-sized program (it's actually a game), more or less of the > size of clang, which we use internally as benchmark to track > compile-time/runtime improvements/regression. > I picked two random revisions of llvm: trunk (December 16th 2016) and > trunk (June 2nd 2016), so, roughly, 6 months period. > My setup is a Mac Pro running Linux (NixOS). > These are the numbers I collected (including the output of -mllvm -time-passes). > For clang: > > June 2nd: > real 22m9.278s > user 21m30.410s > sys 0m38.834s > Total Execution Time: 1270.4795 seconds (1269.1330 wall clock) > 289.8102 ( 23.5%) 18.8891 ( 53.7%) 308.6993 ( 24.3%) 308.6906 ( > 24.3%) X86 DAG->DAG Instruction Selection > 97.2730 ( 7.9%) 0.7656 ( 2.2%) 98.0386 ( 7.7%) 98.0010 ( > 7.7%) Global Value Numbering > 62.4091 ( 5.1%) 0.4779 ( 1.4%) 62.8870 ( 4.9%) 62.8665 ( > 5.0%) Function Integration/Inlining > 58.6923 ( 4.8%) 0.4767 ( 1.4%) 59.1690 ( 4.7%) 59.1323 ( > 4.7%) Combine redundant instructions > 53.9602 ( 4.4%) 0.6163 ( 1.8%) 54.5765 ( 4.3%) 54.5409 ( > 4.3%) Combine redundant instructions > 51.0470 ( 4.1%) 0.5703 ( 1.6%) 51.6173 ( 4.1%) 51.5425 ( > 4.1%) Loop Strength Reduction > 47.4067 ( 3.8%) 1.3040 ( 3.7%) 48.7106 ( 3.8%) 48.7034 ( > 3.8%) Greedy Register Allocator > 36.7463 ( 3.0%) 0.8133 ( 2.3%) 37.5597 ( 3.0%) 37.4612 ( > 3.0%) Induction Variable Simplification > 37.0125 ( 3.0%) 0.2699 ( 0.8%) 37.2824 ( 2.9%) 37.2478 ( > 2.9%) Combine 
redundant instructions > 34.2071 ( 2.8%) 0.2737 ( 0.8%) 34.4808 ( 2.7%) 34.4487 ( > 2.7%) Combine redundant instructions > 25.6627 ( 2.1%) 0.3215 ( 0.9%) 25.9842 ( 2.0%) 25.9509 ( > 2.0%) Combine redundant instructions > > Dec 16th: > real 27m34.922s > user 26m53.489s > sys 0m41.533s > > 287.5683 ( 18.5%) 19.7048 ( 52.3%) 307.2731 ( 19.3%) 307.2648 ( > 19.3%) X86 DAG->DAG Instruction Selection > 197.9211 ( 12.7%) 0.5104 ( 1.4%) 198.4314 ( 12.5%) 198.4091 ( > 12.5%) Function Integration/Inlining > 106.9669 ( 6.9%) 0.8316 ( 2.2%) 107.7984 ( 6.8%) 107.7633 ( > 6.8%) Global Value Numbering > 89.7571 ( 5.8%) 0.4840 ( 1.3%) 90.2411 ( 5.7%) 90.2067 ( > 5.7%) Combine redundant instructions > 79.0456 ( 5.1%) 0.7534 ( 2.0%) 79.7990 ( 5.0%) 79.7630 ( > 5.0%) Combine redundant instructions > 55.6393 ( 3.6%) 0.3116 ( 0.8%) 55.9509 ( 3.5%) 55.9187 ( > 3.5%) Combine redundant instructions > 51.8663 ( 3.3%) 1.4090 ( 3.7%) 53.2754 ( 3.3%) 53.2684 ( > 3.3%) Greedy Register Allocator > 52.5721 ( 3.4%) 0.3021 ( 0.8%) 52.8743 ( 3.3%) 52.8416 ( > 3.3%) Combine redundant instructions > 49.0593 ( 3.2%) 0.6101 ( 1.6%) 49.6694 ( 3.1%) 49.5904 ( > 3.1%) Loop Strength Reduction > 41.2602 ( 2.7%) 0.9608 ( 2.5%) 42.2209 ( 2.7%) 42.1122 ( > 2.6%) Induction Variable Simplification > 38.1438 ( 2.5%) 0.3486 ( 0.9%) 38.4923 ( 2.4%) 38.4603 ( > 2.4%) Combine redundant instructions > > so, llvm is around 20% slower than it used to be. 
> > For our internal codebase the situation seems slightly worse: > > `game7` > > June 2nd: > > Total Execution Time: 464.3920 seconds (463.8986 wall clock) > > 88.0204 ( 20.3%) 6.0310 ( 20.0%) 94.0514 ( 20.3%) 94.0473 ( > 20.3%) X86 DAG->DAG Instruction Selection > 27.4382 ( 6.3%) 16.2437 ( 53.9%) 43.6819 ( 9.4%) 43.6823 ( > 9.4%) X86 Assembly / Object Emitter > 34.9581 ( 8.1%) 0.5274 ( 1.8%) 35.4855 ( 7.6%) 35.4679 ( > 7.6%) Function Integration/Inlining > 27.8556 ( 6.4%) 0.3419 ( 1.1%) 28.1975 ( 6.1%) 28.1824 ( > 6.1%) Global Value Numbering > 22.1479 ( 5.1%) 0.2258 ( 0.7%) 22.3737 ( 4.8%) 22.3593 ( > 4.8%) Combine redundant instructions > 19.2346 ( 4.4%) 0.3639 ( 1.2%) 19.5985 ( 4.2%) 19.5870 ( > 4.2%) Post RA top-down list latency scheduler > 15.8085 ( 3.6%) 0.2675 ( 0.9%) 16.0760 ( 3.5%) 16.0614 ( > 3.5%) Combine redundant instructions > > Dec 16th: > > Total Execution Time: 861.0898 seconds (860.5808 wall clock) > > 135.7207 ( 15.7%) 0.2484 ( 0.8%) 135.9692 ( 15.2%) 135.9531 ( > 15.2%) Combine redundant instructions > 103.6609 ( 12.0%) 0.4566 ( 1.4%) 104.1175 ( 11.7%) 104.1014 ( > 11.7%) Combine redundant instructions > 97.1083 ( 11.3%) 6.9183 ( 21.8%) 104.0266 ( 11.6%) 104.0181 ( > 11.6%) X86 DAG->DAG Instruction Selection > 72.6125 ( 8.4%) 0.1701 ( 0.5%) 72.7826 ( 8.1%) 72.7678 ( > 8.1%) Combine redundant instructions > 69.2144 ( 8.0%) 0.6060 ( 1.9%) 69.8204 ( 7.8%) 69.8007 ( > 7.8%) Function Integration/Inlining > 60.7837 ( 7.1%) 0.3783 ( 1.2%) 61.1620 ( 6.8%) 61.1455 ( > 6.8%) Global Value Numbering > 56.5650 ( 6.6%) 0.1980 ( 0.6%) 56.7630 ( 6.4%) 56.7476 ( > 6.4%) Combine redundant instructions > > so, using LTO, lld takes 2x to build what it used to take (and all the > extra time seems spent in the optimizer). > > As an (extra) experiment, I decided to take the unoptimized output of > game7 (via lld -save-temps) and pass to -opt -O2. That shows another > significant regression (with different characteristics). 
>
> June 2nd:
> time opt -O2
> real 6m23.016s
> user 6m20.900s
> sys 0m2.113s
>
> 35.9071 ( 10.0%) 0.7996 ( 10.9%) 36.7066 ( 10.0%) 36.6900 ( 10.1%) Function Integration/Inlining
> 33.4045 ( 9.3%) 0.4053 ( 5.5%) 33.8098 ( 9.3%) 33.7919 ( 9.3%) Global Value Numbering
> 27.1053 ( 7.6%) 0.5940 ( 8.1%) 27.6993 ( 7.6%) 27.6995 ( 7.6%) Bitcode Writer
> 25.6492 ( 7.2%) 0.2491 ( 3.4%) 25.8984 ( 7.1%) 25.8805 ( 7.1%) Combine redundant instructions
> 19.2686 ( 5.4%) 0.2956 ( 4.0%) 19.5642 ( 5.4%) 19.5471 ( 5.4%) Combine redundant instructions
> 18.6697 ( 5.2%) 0.2625 ( 3.6%) 18.9323 ( 5.2%) 18.9148 ( 5.2%) Combine redundant instructions
> 16.1294 ( 4.5%) 0.2320 ( 3.2%) 16.3614 ( 4.5%) 16.3434 ( 4.5%) Combine redundant instructions
> 13.5476 ( 3.8%) 0.3945 ( 5.4%) 13.9421 ( 3.8%) 13.9295 ( 3.8%) Combine redundant instructions
> 13.1746 ( 3.7%) 0.1767 ( 2.4%) 13.3512 ( 3.7%) 13.3405 ( 3.7%) Combine redundant instructions
>
> Dec 16th:
>
> real 20m10.734s
> user 20m8.523s
> sys 0m2.197s
>
> 208.8113 ( 17.6%) 0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 ( 17.5%) Value Propagation
> 179.6863 ( 15.2%) 0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 ( 15.1%) Value Propagation
> 92.0158 ( 7.8%) 0.2674 ( 3.0%) 92.2832 ( 7.7%) 92.2613 ( 7.7%) Combine redundant instructions
> 72.3330 ( 6.1%) 0.6026 ( 6.7%) 72.9356 ( 6.1%) 72.9210 ( 6.1%) Combine redundant instructions
> 72.2505 ( 6.1%) 0.2167 ( 2.4%) 72.4672 ( 6.1%) 72.4539 ( 6.1%) Combine redundant instructions
> 66.6765 ( 5.6%) 0.3482 ( 3.9%) 67.0247 ( 5.6%) 67.0040 ( 5.6%) Combine redundant instructions
> 65.5029 ( 5.5%) 0.4092 ( 4.5%) 65.9121 ( 5.5%) 65.8913 ( 5.5%) Combine redundant instructions
> 61.8355 ( 5.2%) 0.8150 ( 9.0%) 62.6505 ( 5.2%) 62.6315 ( 5.2%) Function Integration/Inlining
> 54.9184 ( 4.6%) 0.3359 ( 3.7%) 55.2543 ( 4.6%) 55.2332 ( 4.6%) Combine redundant instructions
> 50.2597 ( 4.2%) 0.2187 ( 2.4%) 50.4784 ( 4.2%) 50.4654 ( 4.2%) Combine redundant instructions
> 47.2597 ( 4.0%) 0.3719 ( 4.1%) 47.6316 ( 4.0%) 47.6105 ( 4.0%) Global Value Numbering
>
> I don't have an infrastructure to measure the runtime performance
> benefits/regression of clang, but I have one for `game7`.
> I wasn't able to notice any fundamental speedup (at least, not
> something that justifies a 2x build-time).
>
> tl;dr:
> There are quite a few things to notice:
> 1) GVN used to be the top pass in the middle-end, in some cases, and
> pretty much always in the top-3. This is not the case anymore, but
> it's still a pass where we spend a lot of time. This is being worked
> on by Daniel Berlin and me (https://reviews.llvm.org/D26224), so there's
> some hope that will be sorted out (or at least there's a plan for it).
> 2) For clang, we spend 35% more time inside instcombine, and for game7
> instcombine seems to largely dominate the amount of time we spend
> optimizing IR. I tried to bisect (which is not easy considering the
> test takes a long time to run), but I wasn't able to identify a single
> point in time responsible for the regression. It seems to be an
> additive effect. My wild (or not so wild) guess is that every day
> somebody adds a matcher or two because that improves their testcase,
> and at some point all these things add up. I'll try to do some
> additional profiling but I guess a large part of our time is spent
> solving bitwise-domain dataflow problems (ComputeKnownBits et
> similia). Once GVN is in a more stable state, I plan to
> experiment with caching results.
> 3) Something peculiar is that we spend 2x time in the inliner. I'm not
> familiar with the inliner, IIRC there were some changes to threshold
> recently, so any help here will be appreciated (both in reproducing
> the results and with analysis).
> 4) For the last testcase (opt -O2 on large unoptimized chunk of
> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I
> think it's not as lazy as it claims to be (or at least, the way we use
> it).
> This doesn't show up in a full LTO run because we don't run CVP
> as part of the default LTO pipeline, but the case shows how LVI can be
> a bottleneck for large TUs (or at least how -O2 compile time degraded
> on large TUs). I haven't thought about the problem very carefully, but
> there seems to be some progress on this front (
> https://llvm.org/bugs/show_bug.cgi?id=10584). I can't share the
> original bitcode file but I can probably do some profiling on it as
> well.
>
> As next steps I'll try to get a more detailed analysis of the
> problems. In particular, try to do what Rui did for lld but with more
> coarse granularity (every week) to have a chart of the compile time
> trend for these cases over the last 6 months, and post here.
>
> I think (I know) some people are aware of the problems I outline in
> this e-mail. But apparently not everybody. We're in a situation where
> compile time is increasing without real control. I'm happy that Apple
> is making a serious effort to track build-time, so hopefully things
> will improve. There are, however, some cases (like adding matchers in
> instcombine or knobs) where the compile time regression is hard to
> track until it's too late. LLVM as a project tries not to stop people
> trying to get things done and that's great, but from time to time it's
> good to take a step back and re-evaluate approaches.
> The purpose of this e-mail was to outline where we regressed, for
> those interested.
>
> Thanks for your time, and of course, feedback welcome!
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
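
The same pass name shows up on several rows of a -time-passes report (one row per pass instance in the pipeline), so claims such as "33% of its time in CVP" require summing rows by name. The following is a minimal Python sketch of that bookkeeping, using the rows of the Dec 16th `opt -O2` table quoted above. The helper names (`wall_seconds_by_pass`, `parse_real_time`) are made up for illustration and are not part of any LLVM tooling, and since only the top rows were posted, the per-pass sums are lower bounds.

```python
import re
from collections import defaultdict

# Rows copied from the Dec 16th `opt -O2` table in the message above.
# Columns: user, system, user+system, wall (seconds and share), pass name.
DEC16_OPT_O2 = """
208.8113 ( 17.6%) 0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 ( 17.5%) Value Propagation
179.6863 ( 15.2%) 0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 ( 15.1%) Value Propagation
92.0158 ( 7.8%) 0.2674 ( 3.0%) 92.2832 ( 7.7%) 92.2613 ( 7.7%) Combine redundant instructions
72.3330 ( 6.1%) 0.6026 ( 6.7%) 72.9356 ( 6.1%) 72.9210 ( 6.1%) Combine redundant instructions
72.2505 ( 6.1%) 0.2167 ( 2.4%) 72.4672 ( 6.1%) 72.4539 ( 6.1%) Combine redundant instructions
66.6765 ( 5.6%) 0.3482 ( 3.9%) 67.0247 ( 5.6%) 67.0040 ( 5.6%) Combine redundant instructions
65.5029 ( 5.5%) 0.4092 ( 4.5%) 65.9121 ( 5.5%) 65.8913 ( 5.5%) Combine redundant instructions
61.8355 ( 5.2%) 0.8150 ( 9.0%) 62.6505 ( 5.2%) 62.6315 ( 5.2%) Function Integration/Inlining
54.9184 ( 4.6%) 0.3359 ( 3.7%) 55.2543 ( 4.6%) 55.2332 ( 4.6%) Combine redundant instructions
50.2597 ( 4.2%) 0.2187 ( 2.4%) 50.4784 ( 4.2%) 50.4654 ( 4.2%) Combine redundant instructions
47.2597 ( 4.0%) 0.3719 ( 4.1%) 47.6316 ( 4.0%) 47.6105 ( 4.0%) Global Value Numbering
"""

# Four "seconds ( percent%)" groups followed by the pass name.
ROW = re.compile(r"\s*" + r"([\d.]+)\s*\(\s*[\d.]+%\)\s*" * 4 + r"(.+)")


def wall_seconds_by_pass(report):
    """Sum the wall-clock column over every row that shares a pass name."""
    totals = defaultdict(float)
    for line in report.splitlines():
        match = ROW.fullmatch(line)
        if match:  # blank or non-table lines are skipped
            totals[match.group(5).strip()] += float(match.group(4))
    return dict(totals)


def parse_real_time(text):
    """Convert a `time`-style string such as '20m10.734s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


wall_by_pass = wall_seconds_by_pass(DEC16_OPT_O2)
total = parse_real_time("20m10.734s")  # `real` for the Dec 16th opt run
cvp_share = wall_by_pass["Value Propagation"] / total

# Print the listed passes, largest first, with their share of the run.
for name, secs in sorted(wall_by_pass.items(), key=lambda kv: -kv[1]):
    print(f"{secs:9.4f}s  {secs / total:6.1%}  {name}")
```

With these rows the two Value Propagation entries add up to roughly 389 seconds out of a 1211-second run, i.e. about 32%, which is in line with the 33% figure in point 4. Dividing by the `real` time rather than by the report's own total is a choice made here because the Dec 16th total execution time was not posted.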
Davide Italiano via llvm-dev
2016-Dec-21 16:07 UTC
[llvm-dev] llvm (the middle-end) is getting slower, December edition
On Mon, Dec 19, 2016 at 4:24 PM, Mikhail Zolotukhin <mzolotukhin at apple.com> wrote:
> Hi Davide,
>
> Thanks for the analysis, it's really interesting! And I'm really glad that we now pay more and more attention to compile time!
>
> Just recently I've been looking into historical compile time data as well, and came to similar conclusions. The regressions you've found are probably caused by:
> 1) r289813 and r289855 - new matchers in InstCombine
> 2) r286814 and r288024 - changes in Inlining cost model
>

Haven't looked at 2) yet, but I can confirm for 1).
Sanjay/Ehsan, can you please explain what's the motivation behind the
new transformations you introduced? I'm tempted to ask for a revert, but
I'd like to understand the motivations first.

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
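
The two suspects named above (InstCombine and the inliner) can be cross-checked against the game7 LTO tables from the first message by ranking passes on how much wall-clock time they gained between the two revisions. What follows is only a rough Python sketch: the `regressions` helper is hypothetical, the numbers are the wall-clock column of the rows that were posted, and passes that appear in just one of the two tables are left out, so the sums are lower bounds rather than full per-pass totals.

```python
# Wall-clock seconds per posted row of the game7 LTO tables (June 2nd vs
# Dec 16th). Only the top rows were posted, so every sum is a lower bound.
JUNE = {
    "X86 DAG->DAG Instruction Selection": [94.0473],
    "Function Integration/Inlining": [35.4679],
    "Global Value Numbering": [28.1824],
    "Combine redundant instructions": [22.3593, 16.0614],
}
DEC = {
    "X86 DAG->DAG Instruction Selection": [104.0181],
    "Function Integration/Inlining": [69.8007],
    "Global Value Numbering": [61.1455],
    "Combine redundant instructions": [135.9531, 104.1014, 72.7678, 56.7476],
}


def regressions(before, after):
    """Return (pass, old, new, delta, ratio) tuples, largest delta first."""
    rows = []
    for name in before.keys() & after.keys():  # passes present in both runs
        old, new = sum(before[name]), sum(after[name])
        rows.append((name, old, new, new - old, new / old))
    return sorted(rows, key=lambda row: row[3], reverse=True)


ranked = regressions(JUNE, DEC)
for name, old, new, delta, ratio in ranked:
    print(f"{name:40s} {old:8.2f}s -> {new:8.2f}s  (+{delta:7.2f}s, {ratio:.2f}x)")
```

On the posted rows, "Combine redundant instructions" comes out first by a wide margin and the inliner second at close to 2x, which matches points 2) and 3) of the original analysis.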