Sean Silva via llvm-dev
2016-Mar-09 01:47 UTC
[llvm-dev] llvm and clang are getting slower
On Tue, Mar 8, 2016 at 2:25 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On Mar 8, 2016, at 1:09 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Tue, Mar 8, 2016 at 10:42 AM, Richard Smith via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> On Tue, Mar 8, 2016 at 8:13 AM, Rafael Espíndola <llvm-dev at lists.llvm.org> wrote:
>> > I have just benchmarked building trunk llvm and clang in Debug,
>> > Release and LTO modes (see the attached script for the cmake lines).
>> >
>> > The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
>> > cases I used the system libgcc and libstdc++.
>> >
>> > For release builds there is a monotonic increase in each version: from
>> > 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
>> > 5.3.2 takes 205 minutes.
>> >
>> > Debug and LTO show an improvement in 3.7, but have regressed again in 3.8.
>>
>> I'm curious how these times divide across Clang and various parts of
>> LLVM; rerunning with -ftime-report and summing the numbers across all
>> compiles could be interesting.
>
> Based on the results I posted upthread about the relative time spent in
> the backend for debug vs. release, we can estimate this. To summarize:
>   10% of time spent in LLVM for Debug
>   33% of time spent in LLVM for Release
> (I'll abbreviate "in LLVM" as just "backend"; this is "backend" from
> clang's perspective.)
>
> Let's look at the difference between 3.5 and trunk.
>
> For debug, the user time jumps from 174m50.251s to 197m9.932s.
> That's {10490.3, 11829.9} seconds, respectively.
> For release, the corresponding numbers are {9826.71, 12714.3} seconds.
>
> debug35 = 10490.251
> debugTrunk = 11829.932
> debugTrunk/debug35 == 1.12771
> debugRatio = 1.12771
>
> release35 = 9826.705
> releaseTrunk = 12714.288
> releaseTrunk/release35 == 1.29385
> releaseRatio = 1.29385
>
> For simplicity, let's use a simple linear model for the distribution of
> slowdown between the frontend and backend: a constant-factor slowdown for
> the backend, and an independent constant-factor slowdown for the frontend.
> This gives the following linear system:
>   debugRatio   = .10 * backendRatio + (1 - .10) * frontendRatio
>   releaseRatio = .33 * backendRatio + (1 - .33) * frontendRatio
>
> Solving this linear system, the expected slowdown factors under this simple
> model are:
>   backendRatio  = 1.77783
>   frontendRatio = 1.05547
>
> Intuitively, backendRatio comes out larger in this comparison because we
> see the biggest slowdown in release (1.29 vs. 1.12), and in release we are
> spending a larger fraction of time in the backend (33% vs. 10%).
>
> Applying this same model across Rafael's data, we find the following
> (numbers have been rounded for clarity):
>
>   transition   backendRatio   frontendRatio
>   3.5->3.6     1.08           1.03
>   3.6->3.7     1.30           0.95
>   3.7->3.8     1.34           1.07
>   3.8->trunk   0.98           1.02
>
> Note that in Rafael's measurements LTO is pretty similar to Release from a
> CPU time (user time) standpoint. While the final LTO link takes a large
> amount of real time, it is single-threaded. Based on the real-time numbers,
> the LTO link was only spending about 20 minutes single-threaded (i.e. about
> 20 minutes of CPU time), which is pretty small compared to the 300-400
> minutes of total CPU time. It would be interesting to see the numbers for
> -O0 or -O1 per-TU together with LTO.
>
> Just a note about LTO being sequential: Rafael mentioned he was "building
> trunk llvm and clang".
> By default I believe it is ~56 link targets that can be run in parallel
> (provided you have enough RAM to avoid swapping).

D'oh! I was looking at the data wrong because I broke my Fundamental Rule of
Looking At Data, namely: don't look at raw numbers in a table, since you are
likely to misread them or form biases based on the order in which you look at
the data points; *always* visualize. There is a significant difference between
Release and LTO: about 2x, consistently.

[image: Inline image 3]

This is actually curious, because during the Release build we were spending
33% of CPU time in the backend (as clang sees it, i.e. the mid-level optimizer
and codegen). This data is inconsistent with LTO simply being another run
through the backend (which would add at most another 33% of CPU time). There
seems to be something nonlinear happening.
To make it worse, the LTO build runs approximately a full Release optimization
pipeline per-TU, so the actual LTO step should be seeing inlined/"cleaned up"
IR that is much smaller than what the per-TU optimizer sees, so naively it
should take *even less* than "another 33% of CPU time".
Yet we see a 1.5x-2x difference:

[image: Inline image 4]

-- Sean Silva
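(A minimal back-of-the-envelope sketch, in plain Python, that reproduces the
backendRatio/frontendRatio numbers quoted above and the "naive" +33%
expectation for LTO; the 10%/33% backend fractions and the user-time figures
are the ones given earlier in the thread, and the variable names are only
illustrative.)

    # Solve the 2x2 linear system for backendRatio/frontendRatio, then show
    # the naive expectation if LTO were just another run through the backend.
    debug_backend_frac = 0.10    # fraction of Debug build time spent in LLVM
    release_backend_frac = 0.33  # fraction of Release build time spent in LLVM

    debug_ratio = 11829.932 / 10490.251    # trunk vs. 3.5, Debug user time
    release_ratio = 12714.288 / 9826.705   # trunk vs. 3.5, Release user time

    d, r = debug_backend_frac, release_backend_frac
    backend_ratio = ((1 - d) * release_ratio - (1 - r) * debug_ratio) / (r - d)
    frontend_ratio = (debug_ratio - d * backend_ratio) / (1 - d)
    print(backend_ratio, frontend_ratio)   # ~1.78 and ~1.06, as quoted above

    # If LTO were simply another pass through the backend, the naive
    # expectation would be only +33% CPU time over Release:
    print(1 + release_backend_frac)        # ~1.33x, vs. the observed 1.5x-2x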
Xinliang David Li via llvm-dev
2016-Mar-09 20:38 UTC
[llvm-dev] llvm and clang are getting slower
The LTO time could be explained by a second-order effect: increased
dcache/dTLB pressure due to the larger memory footprint and poorer locality.

David

On Tue, Mar 8, 2016 at 5:47 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...] There is a significant difference between Release and LTO: about 2x,
> consistently. [...] This data is inconsistent with LTO simply being another
> run through the backend (which would add at most another 33% of CPU time).
> There seems to be something nonlinear happening. [...] Yet we see a 1.5x-2x
> difference.
Sean Silva via llvm-dev
2016-Mar-09 21:55 UTC
[llvm-dev] llvm and clang are getting slower
On Wed, Mar 9, 2016 at 12:38 PM, Xinliang David Li <xinliangli at gmail.com> wrote:
> The LTO time could be explained by a second-order effect: increased
> dcache/dTLB pressure due to the larger memory footprint and poorer locality.

Actually, thinking more about this, I was totally wrong. Mehdi said that we
LTO ~56 binaries. If we naively assume that each binary is like clang and
links in "everything", and that the LTO process takes CPU time equivalent to
"-O3 for every TU", then we would expect to pay the +33% again *for each
binary* (a total increase of >1800% vs. Release). Clearly that is not
happening, since the actual overhead is only 50%-100%, so we need a more
refined explanation.

There are a few factors that I can think of:
a) there are 56 binaries being LTO'd (this will tend to increase our estimate)
b) not all 56 binaries are the size of clang (this will tend to decrease our
   estimate)
c) per-TU processing only does mid-level optimizations and no codegen (this
   will tend to decrease our estimate)
d) IR seen during LTO has already been "cleaned up", so there is less overall
   size and fewer optimizations apply during the LTO process (this will tend
   to decrease our estimate)
e) comdat folding in the linker means that we only codegen each comdat
   function once (this will tend to decrease our estimate)

Starting from a (normalized) release build with
  releaseBackend = .33
  releaseFrontend = .67
  release = releaseBackend + releaseFrontend = 1
let us try to obtain
  LTO = (some expression involving releaseFrontend and releaseBackend) = 1.5-2

For starters, let us apply a), with the naive assumption that for each of the
numBinaries = 52 binaries we add the cost of releaseBackend (I just checked,
and 52 is the exact number for LLVM+Clang+LLD+clang-tools-extra, ignoring
symlinks). This gives
  LTO = release + 52 * releaseBackend = 18.16,
which is way high.

Let us apply b). A quick check gives 371,515,392 total bytes of text in a
release build across all 52 binaries (Mac, x86_64). Clang is 45,182,976 bytes
of text. So, using final text size in Release as an indicator of the total
code seen by the LTO process, we can use a coefficient of 1/8, i.e. the
average binary links in about avgTextFraction = 1/8 of "everything".
  LTO = release + 52 * (.125 * releaseBackend) = 3.14
We are still high.

For c), let us assume that half of releaseBackend is spent after the mid-level
optimizations, i.e. let codegenFraction = .5 be the fraction of releaseBackend
spent after the mid-level optimizations. We can discount this time from the
LTO build since it does not do that work per-TU.
  LTO = release + 52 * (.125 * releaseBackend) - (codegenFraction * releaseBackend) = 2.98
So this is not a significant reduction.

I don't have a reasonable a priori estimate for d) or e), but altogether they
reduce to a constant factor otherSavingsFraction that multiplies the second
term:
  LTO = release + 52 * (.125 * otherSavingsFraction * releaseBackend) - (codegenFraction * releaseBackend) =? 1.5-2x
Given the empirical data, this suggests that otherSavingsFraction must have a
value around 1/2, which seems reasonable.

For a moment I was rather surprised that we could have 52 binaries and see
only 2x, but this closer examination shows that between avgTextFraction = .125
and releaseBackend = .33, the "52" is brought under control.
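(A minimal sketch of the arithmetic above, in plain Python; the constants are
the ones used in this message, codegenFraction = .5 is the stated assumption,
and the final loop just solves for otherSavingsFraction from the observed
1.5x-2x range. Variable names are only illustrative.)

    release_backend = 0.33     # fraction of a Release build spent in the backend
    release = 1.0              # normalized total Release build cost
    num_binaries = 52          # LLVM+Clang+LLD+clang-tools-extra, ignoring symlinks
    avg_text_fraction = 1 / 8  # average binary links in ~1/8 of "everything"
    codegen_fraction = 0.5     # assumed fraction of backend time after mid-level opts

    # a) naive: every binary re-pays the full backend cost
    print(release + num_binaries * release_backend)                      # ~18.2

    # b) scale by the average text fraction
    print(release + num_binaries * avg_text_fraction * release_backend)  # ~3.14

    # c) discount the per-TU codegen work that the LTO build's per-TU compiles skip
    lto_c = (release + num_binaries * avg_text_fraction * release_backend
             - codegen_fraction * release_backend)
    print(lto_c)                                                          # ~2.98

    # d)+e) fold into otherSavingsFraction; solve for it from the observed range
    for observed in (1.5, 2.0):
        f = (observed - release + codegen_fraction * release_backend) / (
            num_binaries * avg_text_fraction * release_backend)
        print(observed, round(f, 2))  # roughly 0.31-0.54, i.e. "around 1/2"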
-- Sean Silva