In the future could you please do some sort of visualization of your data, or at least provide the raw data in a machine-readable format so that others can do so?

It is incredibly easy to come to incorrect conclusions when looking at lists of numbers, because at any given moment you have a very localized view of the dataset and are prone to locally pattern-match and form a selection bias that corrupts your ability to make a proper decision in the context of the entire dataset. Even if you go on to look at the rest of the data, this selection bias limits your ability to come to a correct "global" conclusion.

Appropriate, reliable summary statistics can also help, but they are not a panacea. In using, say, 2 summary statistics (e.g. mean and standard deviation), one is discarding a large number of degrees of freedom from the dataset. This is fine if you have good reason to believe that these 2 degrees of freedom adequately explain the underlying dataset (e.g. there is a sound theoretical description of the phenomenon being measured that suggests it should follow a Gaussian distribution; hence mean and stddev completely characterize the distribution). However, in the world of programs and compiler optimizations, there is very rarely a good reason to believe that any particular dataset (e.g. SPEC benchmark results for a particular optimization) is explained by a handful of common summary statistics, so looking only at summary statistics can often conceal important insights into the data (or even be actively misleading). This is especially true when looking across different programs (I die a little inside every time I see someone cite a SPEC geomean).

In compilers we are usually more interested in actually discovering *which parameters* are responsible for variation, rather than increasing confidence in the values of an a priori set of known parameters. E.g. if you are measuring the time it takes a laser beam to bounce off the moon and come back to you (in order to measure the distance to the moon), you have an a priori known set of parameters that well-characterize the data you obtain, based on your knowledge of the measurement apparatus, atmospheric dispersion, the fact that you know the moon is moving in an orbit, etc. You can perform repeated measurements with the apparatus to narrow in on the values of those parameters. In compilers, we rarely have such a simple and small set of parameters that are known to adequately characterize the data we are trying to understand; when investigating an optimization's results, we are almost always in a situation that would resemble (in the moon-bounce analogy) an unknown variation that turns out to be due to whether the lab assistant is leaning against the apparatus or not. You're not going to find out that the lab assistant's pose is at fault by looking at your "repeat them to increase confidence in the values" measurements (e.g. the actual moon-bounce measurements, or the average time for a particular SPEC benchmark to run); you find it by getting up, going to the room with the apparatus, and investigating all manner of things until you narrow in on the lab assistant's pose. (Usually this takes the form of having to dig around in assembly, extract kernels, time particular sub-things, profile things, look at how the code changes throughout the optimization pipeline, instrument things, etc.; there are tons of places for the root cause to hide.)
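To make the geomean point concrete, here is a tiny sketch (Python, with made-up speedup numbers) of how a geomean can stay flat while individual benchmarks swing wildly:

    # Hypothetical per-benchmark time ratios (new_time / old_time) for two patches.
    # All numbers are invented for illustration.
    from math import prod

    def geomean(xs):
        return prod(xs) ** (1.0 / len(xs))

    patch_a = [1.00, 1.00, 1.00, 1.00]   # no change on any benchmark
    patch_b = [0.80, 1.25, 0.80, 1.25]   # big regressions and big wins that cancel

    print(geomean(patch_a))  # 1.0
    print(geomean(patch_b))  # 1.0 -- same geomean, very different story per benchmark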
If you have to remember something from the lab-assistant paragraph above, remember that not everything boils down to clicking "run" and getting a timing for SPEC. Often you need to take some time to narrow in on the damn lab assistant. Sometimes just the timing of a particular benchmark leads to a "lab assistant" situation (hopefully this doesn't happen too often, but it does happen: e.g. I have been in a situation where a benchmark surprisingly goes 50% faster on about 1 in 10 runs). When working across different programs, you are almost always in a "lab assistant" situation.

-- Sean Silva

On Mon, Dec 15, 2014 at 11:27 AM, Daniel Stewart <stewartd at codeaurora.org> wrote:

> I have done some preliminary investigation into postponing some of the passes to see what the resulting performance impact would be. This is a fairly crude attempt at moving passes around to see if there is any potential benefit. I have attached the patch I used to do the tests, in case anyone is interested.
>
> Briefly, the patch allows two different flows, selected with a flag of either –lto-new or –lto-new2. In the first case, the vectorization passes are postponed from the end of the populateModulePassManager() function to midway through addLTOOptimizationPasses(). In the second case, essentially the entire populateModulePassManager() function is deferred until link time.
>
> I ran spec2000/2006 on an ARM platform (Nexus 4), comparing 4 configurations (O3, O3 LTO, O3 LTO new, O3 LTO new 2). I have attached a PDF presenting the results from the test. The first 4 columns have the SPEC result (ratio) for the 4 configurations. The second set of columns is each result divided by the maximum result across the 4 configurations; I used this to see how well or poorly a particular configuration did in comparison to the others.
>
> In general, there appears to be some benefit to be gained in a couple of the benchmarks (spec2000/art, spec2006/milc) by postponing vectorization.
>
> I just wanted to present this information to the community to see if there is interest in pursuing the idea of postponing passes.
>
> Daniel
>
> *From:* llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] *On Behalf Of *Daniel Stewart
> *Sent:* Wednesday, September 17, 2014 9:46 AM
> *To:* llvmdev at cs.uiuc.edu
> *Subject:* [LLVMdev] Postponing more passes in LTO
>
> Looking at the existing flow of passes for LTO, it appears that almost all passes are run on a per-file basis, before the call to the gold linker. I'm looking to get people's feedback on whether there would be an advantage to waiting to run a number of these passes until the linking stage. For example, I believe I saw a post a little while back about postponing vectorization until the linking stage. It seems to me that there could be an advantage to postponing (some) passes until the linking stage, where the entire graph is available. In general, what do people think about the idea of a different flow of LTO where more passes are postponed until the linking stage?
>
> Daniel Stewart
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
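As a quick illustration of the "second set of columns" normalization described in the quoted message above, here is a small Python sketch with hypothetical SPEC ratios (the numbers are invented, not taken from the attached results):

    # Hypothetical SPEC ratios (higher is better) for one benchmark under the
    # four configurations from the message above. Numbers are made up.
    configs = {"O3": 10.2, "O3 LTO": 10.5, "O3 LTO new": 11.0, "O3 LTO new 2": 10.8}

    best = max(configs.values())
    normalized = {name: ratio / best for name, ratio in configs.items()}

    # 1.0 marks the best configuration for this benchmark; values below 1.0
    # show how far each other configuration falls short of it.
    for name, value in normalized.items():
        print(f"{name:13s} {value:.3f}")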
Also, a couple of notes about the use of standard deviation (and general points about benchmarking):

1. It is *vital* to keep in mind the distinction between a "sample statistic" and a "population statistic". A sample statistic is a statistic derived from the dataset of measured values, while a population statistic is a mathematical function of the underlying probability distribution of the phenomenon (some abstract "population" of possibilities) being sampled.

2. As you take more measurements, the standard deviation sample statistic converges to the standard deviation population statistic regardless of the underlying population distribution.

3. *Every* probability distribution has a defined standard deviation (it is just a mathematical function of the distribution). For a Gaussian distribution, the standard deviation (in conjunction with the mean) completely describes the probability distribution. This is emphatically *not* the case for all probability distributions [1]. In order to derive insight from a standard deviation sample statistic, you *must* a priori have a good reason to believe that the underlying probability distribution's standard deviation provides useful information about the actual distribution.

4. Any probability distribution can be made to have any standard deviation without changing its shape (technically, at most a linear transformation of the coordinate is needed). To pick a random example, for a highly bimodal distribution the standard deviation doesn't really give you any useful information about the overall shape (and it doesn't correspond exactly to the spacing between the modes, either).

5. The standard deviation is nothing more than a root-mean-square average. Basically, it is a very popular statistic because there is an *extremely powerful* way to model phenomena for which the root-mean-square average is the actual meaningful value that ties into the model. The theory I'm talking about is used extensively in almost every science and engineering discipline *outside of* software engineering (and the purely digital part of hardware) [2]. Since this modeling technique is rarely directly relevant in software, there is little reason to assume that the standard deviation is a good statistic (there may be a good reason, but I haven't seen one).

Okay, so I've cast doubt on the use of standard deviation (and actually an argument could be made against the mean too, for similar reasons). On a more constructive note, what *is* a good small set of parameters to summarize measurements of program run time? One interesting model that was brought to my attention by Nick Lewycky is based on the following observation: most "perturbations" of the run time of a fixed program binary on a fixed chip and fixed OS are due to externalities that contend for (or otherwise deprive the program of) execution resources and therefore almost always make it *slower*.

The model is then that there is a "true" execution time for the program, which is a completely deterministic function of the program, chip, and OS; intuitively, this is the time observed when the program is, e.g., able to execute by itself on a single execution unit of the CPU, with no interrupts knocking it off the CPU and no other threads using any of the execution resources of that CPU. On top of this, we assume that there is a set of perturbations that contend for execution resources and slow the program down.

For this model, it makes sense to approximate the "true" value by the maximum of the recorded values.
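A minimal sketch of this model (Python, made-up numbers). Note that, as clarified later in the thread, "maximum of the recorded values" means maximum performance, i.e. the *minimum* recorded time, which is what the sketch estimates:

    import random

    # Model: a deterministic "true" run time plus nonnegative perturbations
    # caused by external contention (interrupts, other threads, etc.).
    TRUE_TIME = 1.000  # seconds; invented for illustration

    def one_run():
        # Perturbations can only ever add time, never remove it. In this
        # synthetic setup most runs suffer at least some contention.
        hit = random.random() < 0.7
        contention = random.expovariate(1 / 0.02) if hit else 0.0
        return TRUE_TIME + contention

    for n in (3, 10, 100):
        samples = [one_run() for _ in range(n)]
        # The fastest observed run is the estimate of the "true" time under
        # this model, and it improves as more runs are taken.
        print(n, min(samples))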
If we want to use two numbers to summarize the data, it probably makes sense to look at a way of characterizing the extent of the variation from this "true" value. In the absence of a proper characterization of the perturbations (such as the distribution of when they occur and how much effect they have when they do), one simple solution is the minimum. Max and min might be *too* sensitive to the perturbations (especially the disk cache of the program itself on the first run), so a pair of percentiles might be a bit more reliable, such as the 10th percentile and the 90th percentile.

Next time I'm measuring program run time, I'm going to try out this model. It should be possible to look at a higher-degree-of-freedom representation of the dataset, like a histogram of the distribution, in order to evaluate how well this two-parameter characterization of the data works compared with the typical mean and stddev. Also, ultimately we only need as much precision here as is necessary for what we are doing with the summarized data, which typically is to alert us of a significant change. On a sufficiently quiet machine, the variance of the measurements might be significantly smaller than the thresholds that we consider "significant", so that the mean, median, max, min, 10%-ile, etc. are all so close that it does not matter which we use, and we can summarize the data with just a single number, which can be any one of them (and just trigger an error if the total variation exceeds a set amount).

However, the noisier the machine (and hence our data), the more important it is to properly model and analyze things to avoid coming to a faulty conclusion (note: reliable conclusions can be drawn from data with large amounts of noise as long as there is a good model which allows isolating the data we are interested in from the noise [3]).

Nick, I forget from when we talked, but did you guys ever settle on a model like this for your internal benchmarking?

[1] If you are familiar with Fourier theory, using only two statistics to describe a probability distribution is sort of like using only two Fourier components to describe a function. In order to conclude *anything* you must a priori know that the other components are not important!

[2] The modeling technique I'm talking about is the decomposition of square-integrable function spaces into an orthonormal basis (this is by no means the most general way of describing the idea). This is a far-reaching concept and pops up under many names. Things like "frequency domain", "Fourier transform", "Laplace transform", and "spectrum" are among the most elementary. More advanced ones are "wavelets", "Hilbert space", and "eigenfunctions of a linear operator". A smattering of use cases: determining the shape of the electron orbitals of hydrogen, designing the fuel control system of a jet engine to make sure the right amount of fuel is being provided at any given moment, designing the circuits sitting at the end of a communication line (or a cell phone antenna) that have to detect incoming signals with the least chance of error, analyzing the structure of molecules based on X-ray diffraction, and determining the chemical composition of the atmosphere of a planet orbiting another star.

[3] For example, if you know that there is Gaussian noise (even a very large amount) on top of an underlying value, then the underlying value is just the mean of the population (which will be a Gaussian distribution) and can be reliably determined from the mean sample statistic.
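To see how the min / 10th / 90th percentile summary above behaves next to the usual mean and stddev, here is a rough Python sketch on synthetic timings where a minority of runs are hit by a large perturbation (all numbers invented):

    import random
    import statistics

    random.seed(0)

    # Synthetic run times: a "true" time of 1.0s plus small jitter, with ~20%
    # of runs hit by a large perturbation (e.g. another process stealing the
    # core). The resulting distribution is roughly bimodal.
    samples = [1.0 + random.uniform(0.0, 0.01) +
               (random.uniform(0.3, 0.5) if random.random() < 0.2 else 0.0)
               for _ in range(200)]

    samples.sort()
    p10 = samples[int(0.10 * len(samples))]
    p90 = samples[int(0.90 * len(samples))]

    print("mean  ", statistics.mean(samples))   # pulled upward by the slow mode
    print("stddev", statistics.stdev(samples))  # large, but says little about shape
    print("min   ", samples[0])                 # close to the "true" time
    print("p10   ", p10)                        # robust stand-in for the fast mode
    print("p90   ", p90)                        # p10/p90 bracket the observed variation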
On Thu, Dec 18, 2014 at 4:10 AM, Sean Silva <chisophugis at gmail.com> wrote:
> [...]
On Thu, Dec 18, 2014 at 2:16 PM, Sean Silva <chisophugis at gmail.com> wrote:
> [...]
>
> For this model, it makes sense to approximate the "true" value by the maximum of the recorded values.

D'oh! Here I meant "maximum performance", which would be the *minimum* time!

-- Sean Silva
Sean Silva wrote:
> [...]
>
> Nick, I forget from when we talked, but did you guys ever settle on a model like this for your internal benchmarking?

We have not. Our current system (very quiesced machines) works well enough for our current purposes.

Nick
On Thu, Dec 18, 2014 at 2:16 PM, Sean Silva <chisophugis at gmail.com> wrote:
> [...]
>
> For this model, it makes sense to approximate the "true" value by the maximum of the recorded values. If we want to use two numbers to summarize the data, it probably makes sense to look at a way of characterizing the extent of the variation from this "true" value.
> [...]

I used to subscribe to this thinking. I no longer do, and I think it has serious problems. Let me try to explain with a hopefully entertaining analogy. =D

Back in physics class we would analyze the behavior of basic principles of physics as they would act upon a brave cadre of scientifically savvy penguins sliding down slopes under the effect of gravity. Naturally, these slopes were very slick ice, and the physics problems would be solved "neglecting friction"[1]. That made the analysis much more tractable and predictable. It also makes it really inaccurate in the real world.

Now, I'm not actually deriding this approach to physics. The results from neglecting friction are *very* informative as to the basic principles at work. In a very similar way, the model of a "true" execution time tells us much more about the principles at work than about the actual performance of the software.

The reality I have observed these days, especially in larger software systems, is that there is inherently no deterministic and representative execution to rely on for finding a "true" execution time. And too many of the random fluctuations are actually by-products of the program itself. Some examples:

- If the program runs uninterrupted on a single core, it may cause that core to be throttled due to the thermals generated. This can have the effect of lowering the overall performance more than what you would see if the program ran *just* under that level without repeated drops in frequency.

- Different optimizations can change the thermal profile by using different execution units. This is especially true for vector units.

- If you fully leverage the AVX unit of some Intel chips, my understanding is that regardless of the specific thermals, those units run at a lower clock!

- Naturally, things like ASLR or other sources of correct but non-deterministic layout of data in memory will cause fluctuations in the rate at which cache misses occur. This is probably the most common cause of non-external fluctuations in performance. Different generated code, register spill frequencies, etc., can dramatically change these cache collision rates.

Now, with all of these factors, we have to assume that in real-world scenarios, programs will not always run at the maximal speed. Instead, there is, much as you describe in your email, some distribution of performance that will be observed in practice. The interesting question is -- can generating different code from the compiler shift the distribution in a meaningful way that is not correlated with the fastest measurement (or the N fastest measurements)? And I think the above examples are all places where this can happen. =/

As a consequence, I am increasingly in favor of using very simple but aggregated statistics for measuring performance and acknowledging the lack of precision when it is present. We can still very often measure precisely those localized improvements provided by a specific change, even when very small. We just won't see a reflection of them in many cases when measuring large systems that have a high noise floor. Hopefully, we can at least observe good trends in the large system to confirm we are making progress, and at times we can make large improvements that actually rise above the noise floor.

Anyways, my somewhat rambling 2 cents on this. Sorry for the not entirely structured email. =]
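One way to act on the "simple aggregated statistics plus an acknowledged noise floor" idea might look like the sketch below (Python; the median/IQR choice and the 2% threshold are assumptions for illustration, not something proposed in the thread):

    import statistics

    def summarize(times):
        # Median plus interquartile range as a crude noise-floor estimate.
        q1, _, q3 = statistics.quantiles(times, n=4)
        return statistics.median(times), q3 - q1

    def compare(baseline, candidate, rel_threshold=0.02):
        base_med, base_iqr = summarize(baseline)
        cand_med, cand_iqr = summarize(candidate)
        delta = cand_med - base_med
        noise_floor = max(base_iqr, cand_iqr)
        # Only report a change when it rises above both the observed spread
        # and a minimum relative threshold.
        if abs(delta) <= noise_floor or abs(delta) / base_med < rel_threshold:
            return "within noise"
        return "faster" if delta < 0 else "slower"

    # Hypothetical timings (seconds) before and after a compiler change.
    print(compare([1.02, 1.01, 1.03, 1.40, 1.02, 1.01],
                  [0.99, 1.00, 1.38, 0.98, 1.00, 0.99]))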
On 12/18/2014 6:10 AM, Sean Silva wrote:
> [...]

Pass ordering is a design decision, based on assumptions about what each of the optimizations is going to do, what information it needs, what properties of the code will be created, preserved, or eliminated, and so on. What Daniel showed are the performance results from reordering passes while keeping their current implementations. This serves more as an indicator of the potential impact of such a change if it were made today.

You're right that relying only on the statistics may lead to incorrect conclusions, but I don't think that there is any meaningful information that could be extracted through a more detailed analysis of the provided data. The question is really whether changing the pass order could allow the optimizations to be more aggressive or accurate (e.g. through exposing them to a more complete view of the program), and whether the benefits could be expected to exceed the costs (such as increased compilation time). Whether the current implementations of particular optimizations would immediately benefit from such a change is not really that important in the long term.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation