Looking at the existing flow of passes for LTO, it appears that almost all passes are run on a per-file basis, before the call to the gold linker. I'm looking for people's feedback on whether there would be an advantage to deferring a number of these passes until the link stage. For example, I believe I saw a post a while back about postponing vectorization until the link stage. It seems to me that there could be an advantage to postponing (some) passes until the link stage, where the entire call graph is available. In general, what do people think about a different LTO flow in which more passes are postponed until the link stage?

Daniel Stewart

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
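For reference, the per-file vs. link-stage split being described corresponds roughly to the two PassManagerBuilder entry points sketched below. This is a minimal sketch against the ~3.5-era C++ API, not the actual clang or gold-plugin driver code; the helper names are hypothetical, and header locations and the populateLTOPassManager() signature have shifted between releases.

    // Rough sketch only: where the standard pipelines run today.
    #include "llvm/IR/LegacyPassManager.h"
    #include "llvm/Transforms/IPO.h"
    #include "llvm/Transforms/IPO/PassManagerBuilder.h"

    using namespace llvm;

    // Per translation unit, at compile time (before the gold plugin runs):
    // essentially the whole -O3 pipeline, including the loop and SLP
    // vectorizers, runs here on each file in isolation.
    static void buildPerFilePipeline(legacy::PassManagerBase &PM) {
      PassManagerBuilder PMB;
      PMB.OptLevel = 3;
      // clang normally also configures the inliner here.
      PMB.Inliner = createFunctionInliningPass(/*Threshold=*/275);
      PMB.populateModulePassManager(PM);
    }

    // At link time, over the merged module: only the comparatively small
    // LTO-specific pipeline runs, even though the whole call graph is
    // finally visible at this point.
    static void buildLinkTimePipeline(legacy::PassManagerBase &PM) {
      PassManagerBuilder PMB;
      PMB.populateLTOPassManager(PM, /*Internalize=*/true, /*RunInliner=*/true);
    }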
On Wed, Sep 17, 2014 at 6:46 AM, Daniel Stewart <stewartd at codeaurora.org> wrote:
> Looking at the existing flow of passes for LTO, it appears that almost all
> passes are run on a per-file basis, before the call to the gold linker.
> [...] In general, what do people think about the idea of a different flow
> of LTO where more passes are postponed until the linking stage?

I think there needs to be some amount of cleanup before cross-module inlining; otherwise you're going to lose a lot of inlining opportunities that you'd have had. It's a bit of a tradeoff.

I remember working up a pipeline with Chandler and Dan at one point, and I believe Bob was in on the discussion too. I don't have notes of the actual pipeline, so I'm adding them all to the thread to pipe up :)

-eric
Yes, that seems to be the consensus. -flto during the compile step should imply things like: no vectorization until after cross-module inlining, a reduced inlining threshold (only inline if it *reduces* code size), and other things.

On Wed, Sep 17, 2014 at 6:46 AM, Daniel Stewart <stewartd at codeaurora.org> wrote:
> Looking at the existing flow of passes for LTO, it appears that almost all
> passes are run on a per-file basis, before the call to the gold linker.
> [...]
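A sketch of what that suggestion might look like in terms of the PassManagerBuilder knobs that existed at the time; the function name and the specific threshold are hypothetical placeholders, not an existing API or an agreed-upon design.

    #include "llvm/IR/LegacyPassManager.h"
    #include "llvm/Transforms/IPO.h"
    #include "llvm/Transforms/IPO/PassManagerBuilder.h"

    using namespace llvm;

    // Hypothetical compile-step configuration when -flto is given: keep the
    // cleanup/canonicalization passes, but hold the vectorizers back for the
    // link step and make the per-TU inliner much more conservative.
    static void buildLTOCompileStepPipeline(legacy::PassManagerBase &PM) {
      PassManagerBuilder PMB;
      PMB.OptLevel = 3;
      PMB.LoopVectorize = false;  // defer loop vectorization to link time
      PMB.SLPVectorize = false;   // defer SLP vectorization to link time
      // Placeholder for "only inline if it is likely to *reduce* size":
      // a deliberately small threshold, roughly in -Os/-Oz territory.
      PMB.Inliner = createFunctionInliningPass(/*Threshold=*/25);
      PMB.populateModulePassManager(PM);
    }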
If the notes about that pipeline are still around, I'd love to hear about/look at them. I'd like to investigate some changes to LTO, but I certainly want to know what has already been discussed/discovered about the flow.

Daniel

From: Eric Christopher [mailto:echristo at gmail.com]
Sent: Wednesday, September 17, 2014 10:32 AM
To: Daniel Stewart; Chandler Carruth; Dan Gohman; Bob Wilson
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Postponing more passes in LTO

> [...] I remember working up a pipeline with Chandler and Dan at one point,
> and I believe Bob was in on the discussion too. I don't have notes of the
> actual pipeline, so I'm adding them all to the thread to pipe up :)
>
> -eric
I have done some preliminary investigation into postponing some of the passes to see what the resulting performance impact would be. This is a fairly crude attempt at moving passes around to see if there is any potential benefit. I have attached the patch I used to do the tests, in case anyone is interested.

Briefly, the patch allows two different flows, enabled with either -lto-new or -lto-new2. In the first case, the vectorization passes are postponed from the end of populateModulePassManager() to midway through addLTOOptimizationPasses(). In the second case, essentially the entire populateModulePassManager() pipeline is deferred until link time.

I ran SPEC2000/2006 on an ARM platform (Nexus 4), comparing four configurations (O3, O3 LTO, O3 LTO new, O3 LTO new 2). I have attached a PDF presenting the results. The first four columns have the SPEC result (ratio) for each configuration. The second set of columns is each result divided by the maximum of the four configurations; I used this to see how well or poorly a particular configuration did relative to the others.

In general, there appears to be some benefit to be gained in a couple of the benchmarks (spec2000/art, spec2006/milc) by postponing vectorization.

I just wanted to present this information to the community to see if there is interest in pursuing the idea of postponing passes.

Daniel

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Daniel Stewart
Sent: Wednesday, September 17, 2014 9:46 AM
To: llvmdev at cs.uiuc.edu
Subject: [LLVMdev] Postponing more passes in LTO

Looking at the existing flow of passes for LTO, it appears that almost all passes are run on a per-file basis, before the call to the gold linker. [...]

(Attachments: newflow.patch <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141215/354f7b70/attachment.obj>; Community LLVM on Nexus4.pdf <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141215/354f7b70/attachment.pdf>)
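The patch itself is in the attachment above; what follows is only an illustrative sketch of the first flow it describes, i.e. a flag that drops the vectorizers from the per-TU pipeline and re-adds them at LTO time. The helper names are hypothetical, and the real patch splices the vectorizers midway through addLTOOptimizationPasses() rather than appending them after the LTO pipeline as done here.

    #include "llvm/IR/LegacyPassManager.h"
    #include "llvm/Support/CommandLine.h"
    #include "llvm/Transforms/IPO/PassManagerBuilder.h"
    #include "llvm/Transforms/Vectorize.h"

    using namespace llvm;

    static cl::opt<bool> LTONewFlow(
        "lto-new", cl::init(false),
        cl::desc("Defer vectorization from the per-TU pipeline to LTO time"));

    // Hypothetical per-TU (compile step) helper.
    static void populatePerTUPasses(PassManagerBuilder &PMB,
                                    legacy::PassManagerBase &PM) {
      PMB.LoopVectorize = !LTONewFlow;  // skip the vectorizers per TU
      PMB.SLPVectorize = !LTONewFlow;   // when -lto-new is given
      PMB.populateModulePassManager(PM);
    }

    // Hypothetical link-time helper: run the vectorizers only after the
    // cross-module inliner and the LTO scalar cleanup have had their turn.
    static void populateLinkTimePasses(PassManagerBuilder &PMB,
                                       legacy::PassManagerBase &PM) {
      PMB.populateLTOPassManager(PM, /*Internalize=*/true, /*RunInliner=*/true);
      if (LTONewFlow) {
        PM.add(createLoopVectorizePass());
        PM.add(createSLPVectorizerPass());
      }
    }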
On Wed, Sep 17, 2014 at 6:46 AM, Daniel Stewart <stewartd at codeaurora.org> wrote:
> Looking at the existing flow of passes for LTO, it appears that almost all
> passes are run on a per-file basis, before the call to the gold linker.
> [...] In general, what do people think about the idea of a different flow
> of LTO where more passes are postponed until the linking stage?

AFAIK, we still mostly obey the per-TU optimization flags. E.g. if you pass -O3 for each TU, we will run the -O3 passes without really keeping in mind that we are doing LTO (or if we do, it is fairly minimal).

The per-TU optimization flags can have an enormous impact on the final binary size. Here are some data points I recently collected on one of our first-party games:

noLTO O3perTU: 71.1MiB (control)
LTO O3perTU: 71.8MiB (1% larger)
LTO O0perTU: 67.4MiB (5% smaller)
LTO O1perTU: 68.5MiB (4% smaller)
LTO OsperTU: 65.3MiB (8% smaller)

This is with a 3.4-based compiler, btw, but it is in keeping with what I observed last summer, so I assume the significant effect on binary size is still present today. FYI, these ELF sizes are also with no debug info.

Here is a visualization of those same binaries, broken down by text and data sections (as given by llvm-size; bss was not significantly affected, so it was omitted): http://i.imgur.com/Ie5Plgx.png

As you can see (and would expect), LTO does a good job of reducing the data size, since it can use whole-program analysis to eliminate data. This benefit does not depend on the per-TU optimization level, also as you would expect.

The text section, however, behaves differently. I'm still investigating, but I suspect any size regression is largely due to excessive inlining (as I think most people would expect). It is interesting to note that between the -Os LTO case and the -O3 LTO case, there is a text size difference of (20.7/14.3 - 1) ~ 45%. Also, looking at this again, I don't understand why I didn't do anything with -O2 (I'll eventually need to re-obtain these datasets with a ToT compiler anyway, and I will be sure to grab -O2 data); my experience is that Clang's -O3 is sufficiently similar to -O2 that I'm fairly confident this missing data will not significantly alter the findings of my preliminary analysis in the upcoming days.

For starters, here is a plot showing how much of the total text size is attributable to functions of each size, comparing -O3 noLTO with -O3 LTO: http://i.imgur.com/pfIo0sy.png [*]

To understand this plot, imagine taking all the functions in the binary and grouping them into a small number of buckets of similarly-sized functions. Each bar represents one bucket, and the height of the bar represents the total size of all the functions in that bucket. The width and position of the bucket indicate which range of sizes it corresponds to.
Although the general behavior is a shift of the distribution to the right (functions become larger with LTO), there is also an increase in the total area under the bars. This is perhaps best visualized by looking at the same plot, but with each bar indicating the cumulative total (imagine calling std::accumulate on the list of bar heights from the previous plot): http://i.imgur.com/q7Iq7AH.png

The overall text size regression adds up to nearly 25%.

[*] The two outliers in the non-LTO case are:
- the global initializers (_GLOBAL__I_a), whose size is significantly reduced by LTO from about 400k to 100k (this single function corresponds to the entire furthest-right bar). Note: the right-most bar for the LTO dataset (>100kB functions) is this function (slimmed down to about 100k) plus one other function that was subjected to an unusually large amount of inlining and grew from 2k to about 125k.
- an unusually large dead function that LTO was able to remove but that was not being removed before (this single function corresponds to the entire second-to-furthest-right bar).

-- Sean Silva
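The cumulative view described above is just a running sum over the per-bucket totals from the first plot. A tiny self-contained illustration with made-up bucket sizes (not the game data above), using std::partial_sum rather than a hand-rolled std::accumulate loop:

    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      // Total text bytes attributable to each function-size bucket
      // (hypothetical numbers).
      std::vector<double> bucketTotals = {1.2e6, 3.4e6, 5.1e6, 2.8e6, 0.4e6};
      std::vector<double> cumulative(bucketTotals.size());
      std::partial_sum(bucketTotals.begin(), bucketTotals.end(),
                       cumulative.begin());
      for (double c : cumulative)
        std::cout << c << '\n';  // the last value is the total text size
    }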
In the future, could you please do some sort of visualization of your data, or at least provide the raw data in a machine-readable format so that others can do so? It is incredibly easy to come to incorrect conclusions when looking at lists of numbers, because at any given moment you have a very localized view of the dataset and are prone to locally pattern-match and form a selection bias that corrupts your ability to make a proper decision in the context of the entire dataset. Even if you go on to look at the rest of the data, this selection bias limits your ability to come to a correct "global" conclusion.

Appropriate, reliable summary statistics can also help, but they are not a panacea. In using, say, 2 summary statistics (e.g. mean and standard deviation), one is discarding a large number of degrees of freedom from the dataset. This is fine if you have good reason to believe that these 2 degrees of freedom adequately explain the underlying dataset (e.g. there is a sound theoretical description of the phenomenon being measured that suggests it should follow a Gaussian distribution; hence mean and stddev completely characterize the distribution). However, in the world of programs and compiler optimizations, there is very rarely a good reason to believe that any particular dataset (e.g. SPEC benchmark results for a particular optimization) is explained by a handful of common summary statistics, and so looking only at summary statistics can often conceal important insights into the data (or even be actively misleading). This is especially true when looking across different programs (I die a little inside every time I see someone cite a SPEC geomean).

In compilers we are usually more interested in actually discovering *which parameters* are responsible for variation, rather than increasing confidence in the values of an a priori set of known parameters. E.g. if you are measuring the time it takes a laser beam to bounce off the moon and come back to you (in order to measure the distance to the moon), you have an a priori known set of parameters that well-characterize the data you obtain, based on your knowledge of the measurement apparatus, atmospheric dispersion, the fact that you know the moon is moving in an orbit, etc. You can perform repeated measurements with the apparatus to narrow in on the values of the parameters. In compilers, we rarely have such a simple and small set of parameters that are known to adequately characterize the data we are trying to understand; when investigating an optimization's results, we are almost always investigating a situation that would resemble (in the moon-bounce analogy) an unknown variation that turns out to be due to whether the lab assistant is leaning against the apparatus or not.

You're not going to find out that the lab assistant's pose is at fault by looking at your "repeatedly do them to increase confidence in the values" measurements (e.g. the actual moon-bounce measurements, or the average time for a particular SPEC benchmark to run); you find it by getting up, going to the room with the apparatus, and investigating all manner of things until you narrow in on the lab assistant's pose (usually this takes the form of having to dig around in assembly, extract kernels, time particular sub-things, profile things, look at how the code changes throughout the optimization pipeline, instrument things, etc.; there are tons of places for the root cause to hide).
If you have to remember one thing from that last paragraph, remember that not everything boils down to clicking "run" and getting a timing for SPEC. Often you need to take some time to narrow in on the damn lab assistant. Sometimes just the timing of a particular benchmark leads to a "lab assistant" situation (although hopefully this doesn't happen too often; it does happen, though: e.g. I have been in a situation where a benchmark surprisingly ran 50% faster on about 1 in 10 runs). When working across different programs, you are almost always in a "lab assistant" situation.

-- Sean Silva

On Mon, Dec 15, 2014 at 11:27 AM, Daniel Stewart <stewartd at codeaurora.org> wrote:
> I have done some preliminary investigation into postponing some of the
> passes to see what the resulting performance impact would be. [...]
>
> In general, there appears to be some benefit to be gained in a couple of
> the benchmarks (spec2000/art, spec2006/milc) by postponing vectorization.
> [...]
This looks really interesting. As my colleague Gao mentioned in his lightning talk on our LTO implementation at the last developer meeting, we're definitely interested in seeing if there are any potential gains to be had by deferring passes. I'd like to give it a try on some of our codebases, although with the Christmas break coming up and various other commitments after that in January, I'm not sure when I'll be able to look at it properly. If there is any other follow-up here, then please don't let that block anything from moving forward; otherwise, I'll do my best to reply here as soon as I possibly can after doing some experimentation!

Greg Bedwell
SN Systems - Sony Computer Entertainment Group

On 15 December 2014 at 19:27, Daniel Stewart <stewartd at codeaurora.org> wrote:
> I have done some preliminary investigation into postponing some of the
> passes to see what the resulting performance impact would be. This is a
> fairly crude attempt at moving passes around to see if there is any
> potential benefit. I have attached the patch I used to do the tests, in
> case anyone is interested. [...]