Sean Silva via llvm-dev
2016-Mar-09 01:47 UTC
[llvm-dev] llvm and clang are getting slower
On Tue, Mar 8, 2016 at 2:25 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On Mar 8, 2016, at 1:09 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On Tue, Mar 8, 2016 at 10:42 AM, Richard Smith via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> On Tue, Mar 8, 2016 at 8:13 AM, Rafael Espíndola <llvm-dev at lists.llvm.org> wrote:
>> > I have just benchmarked building trunk llvm and clang in Debug,
>> > Release and LTO modes (see the attached script for the cmake lines).
>> >
>> > The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
>> > cases I used the system libgcc and libstdc++.
>> >
>> > For release builds there is a monotonic increase in each version: from
>> > 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
>> > 5.3.2 takes 205 minutes.
>> >
>> > Debug and LTO show an improvement in 3.7, but have regressed again in 3.8.
>>
>> I'm curious how these times divide across Clang and various parts of
>> LLVM; rerunning with -ftime-report and summing the numbers across all
>> compiles could be interesting.
>
> Based on the results I posted upthread about the relative time spent in
> the backend for debug vs. release, we can estimate this. To summarize:
>   10% of time spent in LLVM for Debug
>   33% of time spent in LLVM for Release
> (I'll abbreviate "in LLVM" as just "backend"; this is "backend" from
> clang's perspective.)
>
> Let's look at the difference between 3.5 and trunk.
>
> For debug, the user time jumps from 174m50.251s to 197m9.932s.
> That's {10490.3, 11829.9} seconds, respectively.
> For release, the corresponding numbers are {9826.71, 12714.3} seconds.
>
> debug35 = 10490.251
> debugTrunk = 11829.932
> debugTrunk/debug35 == 1.12771
> debugRatio = 1.12771
>
> release35 = 9826.705
> releaseTrunk = 12714.288
> releaseTrunk/release35 == 1.29385
> releaseRatio = 1.29385
>
> For simplicity, let's use a simple linear model for the distribution of
> slowdown between the frontend and backend: a constant-factor slowdown for
> the backend, and an independent constant-factor slowdown for the frontend.
> This gives the following linear system:
>   debugRatio   = .10 * backendRatio + (1 - .10) * frontendRatio
>   releaseRatio = .33 * backendRatio + (1 - .33) * frontendRatio
>
> Solving this linear system, the expected slowdown factors under this simple
> model are:
>   backendRatio  = 1.77783
>   frontendRatio = 1.05547
>
> Intuitively, backendRatio comes out larger in this comparison because we
> see the biggest slowdown in release (1.29 vs. 1.12), and in release we are
> spending a larger fraction of time in the backend (33% vs. 10%).
>
> Applying this same model across Rafael's data, we find the following
> (numbers have been rounded for clarity):
>
>   transition   backendRatio   frontendRatio
>   3.5->3.6     1.08           1.03
>   3.6->3.7     1.30           0.95
>   3.7->3.8     1.34           1.07
>   3.8->trunk   0.98           1.02
>
> Note that in Rafael's measurements LTO is pretty similar to Release from a
> CPU time (user time) standpoint. While the final LTO link takes a large
> amount of real time, it is single-threaded. Based on the real-time numbers,
> the LTO link was only spending about 20 minutes single-threaded (i.e. about
> 20 minutes of CPU time), which is pretty small compared to the 300-400
> minutes of total CPU time. It would be interesting to see the numbers for
> -O0 or -O1 per-TU together with LTO.
>
> Just a note about LTO being sequential: Rafael mentioned he was "building
> trunk llvm and clang".
> By default I believe it is ~56 link targets that can be run in parallel
> (provided you have enough RAM to avoid swapping).

D'oh! I was looking at the data wrong because I broke my Fundamental Rule of
Looking At Data, namely: don't look at raw numbers in a table, since you are
likely to misread them or form biases based on the order in which you look at
the data points; *always* visualize. There is a significant difference between
Release and LTO: about 2x, consistently.

[image: Inline image 3]

This is actually curious, because during the Release build we were spending
33% of CPU time in the backend (as clang sees it, i.e. the mid-level optimizer
and codegen). This data is inconsistent with LTO simply being another run
through the backend (which would add at most another 33% of CPU time). There
seems to be something nonlinear happening.
To make it worse, the LTO build runs approximately a full Release optimization
pipeline per-TU, so the actual LTO step should be seeing inlined/"cleaned up"
IR that is much smaller than what the per-TU optimizer sees, so naively it
should take *even less* than "another 33% of CPU time".
Yet we see a 1.5x-2x difference:

[image: Inline image 4]

-- Sean Silva
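(A minimal back-of-the-envelope sketch, in plain Python, that reproduces the
backendRatio/frontendRatio numbers quoted above and the "naive" +33%
expectation for LTO; the 10%/33% backend fractions and the user-time figures
are the ones given earlier in the thread, and the variable names are only
illustrative.)

    # Solve the 2x2 linear system for backendRatio/frontendRatio, then show
    # the naive expectation if LTO were just another run through the backend.
    debug_backend_frac = 0.10    # fraction of Debug build time spent in LLVM
    release_backend_frac = 0.33  # fraction of Release build time spent in LLVM

    debug_ratio = 11829.932 / 10490.251    # trunk vs. 3.5, Debug user time
    release_ratio = 12714.288 / 9826.705   # trunk vs. 3.5, Release user time

    d, r = debug_backend_frac, release_backend_frac
    backend_ratio = ((1 - d) * release_ratio - (1 - r) * debug_ratio) / (r - d)
    frontend_ratio = (debug_ratio - d * backend_ratio) / (1 - d)
    print(backend_ratio, frontend_ratio)   # ~1.78 and ~1.06, as quoted above

    # If LTO were simply another pass through the backend, the naive
    # expectation would be only +33% CPU time over Release:
    print(1 + release_backend_frac)        # ~1.33x, vs. the observed 1.5x-2x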
Xinliang David Li via llvm-dev
2016-Mar-09 20:38 UTC
[llvm-dev] llvm and clang are getting slower
The LTO time could be explained by a second-order effect: increased
dcache/dTLB pressure due to the larger memory footprint and poorer locality.

David

On Tue, Mar 8, 2016 at 5:47 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...] There is a significant difference between Release and LTO: about 2x,
> consistently. [...] This data is inconsistent with LTO simply being another
> run through the backend (which would add at most another 33% of CPU time).
> There seems to be something nonlinear happening. [...] Yet we see a 1.5x-2x
> difference.
Sean Silva via llvm-dev
2016-Mar-09 21:55 UTC
[llvm-dev] llvm and clang are getting slower
On Wed, Mar 9, 2016 at 12:38 PM, Xinliang David Li <xinliangli at gmail.com> wrote:
> The LTO time could be explained by a second-order effect: increased
> dcache/dTLB pressure due to the larger memory footprint and poorer locality.

Actually, thinking more about this, I was totally wrong. Mehdi said that we
LTO ~56 binaries. If we naively assume that each binary is like clang and
links in "everything", and that the LTO process takes CPU time equivalent to
"-O3 for every TU", then we would expect to pay the +33% again *for each
binary* (a total increase of >1800% vs. Release). Clearly that is not
happening, since the actual overhead is only 50%-100%, so we need a more
refined explanation.

There are a few factors that I can think of:
a) there are 56 binaries being LTO'd (this will tend to increase our estimate)
b) not all 56 binaries are the size of clang (this will tend to decrease our
   estimate)
c) per-TU processing only does mid-level optimizations and no codegen (this
   will tend to decrease our estimate)
d) IR seen during LTO has already been "cleaned up", so there is less overall
   size and fewer optimizations apply during the LTO process (this will tend
   to decrease our estimate)
e) comdat folding in the linker means that we only codegen each comdat
   function once (this will tend to decrease our estimate)

Starting from a (normalized) release build with
  releaseBackend = .33
  releaseFrontend = .67
  release = releaseBackend + releaseFrontend = 1
let us try to obtain
  LTO = (some expression involving releaseFrontend and releaseBackend) = 1.5-2

For starters, let us apply a), with the naive assumption that for each of the
numBinaries = 52 binaries we add the cost of releaseBackend (I just checked,
and 52 is the exact number for LLVM+Clang+LLD+clang-tools-extra, ignoring
symlinks). This gives
  LTO = release + 52 * releaseBackend = 18.16,
which is way high.

Let us apply b). A quick check gives 371,515,392 total bytes of text in a
release build across all 52 binaries (Mac, x86_64). Clang is 45,182,976 bytes
of text. So, using final text size in Release as an indicator of the total
code seen by the LTO process, we can use a coefficient of 1/8, i.e. the
average binary links in about avgTextFraction = 1/8 of "everything".
  LTO = release + 52 * (.125 * releaseBackend) = 3.14
We are still high.

For c), let us assume that half of releaseBackend is spent after the mid-level
optimizations, i.e. let codegenFraction = .5 be the fraction of releaseBackend
spent after the mid-level optimizations. We can discount this time from the
LTO build since it does not do that work per-TU.
  LTO = release + 52 * (.125 * releaseBackend) - (codegenFraction * releaseBackend) = 2.98
So this is not a significant reduction.

I don't have a reasonable a priori estimate for d) or e), but altogether they
reduce to a constant factor otherSavingsFraction that multiplies the second
term:
  LTO = release + 52 * (.125 * otherSavingsFraction * releaseBackend) - (codegenFraction * releaseBackend) =? 1.5-2x
Given the empirical data, this suggests that otherSavingsFraction must have a
value around 1/2, which seems reasonable.

For a moment I was rather surprised that we could have 52 binaries and see
only 2x, but this closer examination shows that between avgTextFraction = .125
and releaseBackend = .33, the "52" is brought under control.
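(A minimal sketch of the arithmetic above, in plain Python; the constants are
the ones used in this message, codegenFraction = .5 is the stated assumption,
and the final loop just solves for otherSavingsFraction from the observed
1.5x-2x range. Variable names are only illustrative.)

    release_backend = 0.33     # fraction of a Release build spent in the backend
    release = 1.0              # normalized total Release build cost
    num_binaries = 52          # LLVM+Clang+LLD+clang-tools-extra, ignoring symlinks
    avg_text_fraction = 1 / 8  # average binary links in ~1/8 of "everything"
    codegen_fraction = 0.5     # assumed fraction of backend time after mid-level opts

    # a) naive: every binary re-pays the full backend cost
    print(release + num_binaries * release_backend)                      # ~18.2

    # b) scale by the average text fraction
    print(release + num_binaries * avg_text_fraction * release_backend)  # ~3.14

    # c) discount the per-TU codegen work that the LTO build's per-TU compiles skip
    lto_c = (release + num_binaries * avg_text_fraction * release_backend
             - codegen_fraction * release_backend)
    print(lto_c)                                                          # ~2.98

    # d)+e) fold into otherSavingsFraction; solve for it from the observed range
    for observed in (1.5, 2.0):
        f = (observed - release + codegen_fraction * release_backend) / (
            num_binaries * avg_text_fraction * release_backend)
        print(observed, round(f, 2))  # roughly 0.31-0.54, i.e. "around 1/2"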
-- Sean Silva