thr3ads.net - llvm dev - [LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives [May 2015]

Chris Matthews

2015-May-15 21:24 UTC

[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives

tl;dr in low data situations we don’t look at past information, and that
increases the false positive regression rate.  We should look at the possibly
incorrect recent past runs to fix that.

Motivation: LNT’s current regression detection system has a false positive rate
that is too high to make it useful.  With test suites as large as the llvm
“test-suite” a single report will show hundreds of regressions.  The false
positive rate is so high that the reports are ignored because it is impossible
for a human to triage them; large performance problems are lost in the noise,
and small important regressions never even have a chance.  Later today I am
going to commit a new unit test to LNT with 40 of my favorite regression
patterns.  It has gems such as a flat but noisy line, a 5% regression in 5%
noise, bimodal, and a slow increase; we fail to classify most of these
correctly right now. They are not trick questions: all are obvious regressions
or non-regressions that are plainly visible. I want us to correctly classify
them all!

Some context: LNT’s regression detection algorithm, as I understand it:

detect(current run’s samples, last run’s samples) -> improve, regress or
unchanged.

    # when recovering from errors performance should not be counted
    Current or last run failed -> unchanged

    delta = min(current samples) - min(prev samples)

    # too small to measure
    delta < (confidence * machine noise threshold (0.0005s by default)) -> unchanged

    # too small to care
    delta % < 1% -> unchanged

    # too small to care
    delta < 0.01s -> unchanged

    if len(current samples) >= 4 && len(prev samples) >= 4
        Mann-Whitney U test -> possibly unchanged

    # multisample, confidence interval check
    if len(current samples) > 1
        check delta within samples' confidence interval -> if so,
        unchanged, else improve or regress

    # single sample, range check
    if len(current samples) == 1
        all % deltas above 1% -> improve or regress

The too small to care rules are newer inventions.
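
As a rough illustration of the rules above, the cheap filters and the
single-sample rule could be sketched in Python as follows. This is not LNT's
actual code: the function name, the parameter names, the confidence value of
1.0, and the use of the absolute value of delta are all assumptions made for
the example, and the multisample checks (Mann-Whitney U, confidence interval)
are left out.

```python
def classify_single_sample(current, previous, *, noise_threshold=0.0005,
                           confidence=1.0, min_pct=0.01, min_abs=0.01):
    """Sketch of the cheap filters plus the single-sample rule.

    `current` and `previous` are lists of timings in seconds. All names and
    defaults other than 0.0005s, 1% and 0.01s are hypothetical.
    """
    # A failed run is modelled here as a run without samples (assumption).
    if not current or not previous:
        return "unchanged"

    prev_min = min(previous)
    delta = min(current) - prev_min
    size = abs(delta)  # assumption: the thresholds apply to the magnitude

    # Too small to measure.
    if size < confidence * noise_threshold:
        return "unchanged"
    # Too small to care, relative to the previous value.
    if prev_min > 0 and size / prev_min < min_pct:
        return "unchanged"
    # Too small to care, in absolute time.
    if size < min_abs:
        return "unchanged"

    # Single-sample rule: anything that survives the filters is reported.
    return "regress" if delta > 0 else "improve"
```

With these defaults a 5% slowdown on a one-second test is reported, while a 5%
slowdown on a 0.1s test is dropped by the 0.01s filter.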

Effectiveness data: to see how well these rules work I ran a 14 machine, 7 day
report:

- 16773 run comparisons
- 13852 marked unchanged because of small % delta
- 2603 unchanged because of small delta
- 0 unchanged because of Mann Whitney U test
- 0 unchanged because of confidence interval
- 318 improved or regressed because single sample change over 1% 

Real regressions: probably 1 or 2, not that I will click 318 links to check for
sure… hence the motivation.
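
The counts above can be checked with a few lines of arithmetic. The sketch
below only restates the reported numbers; the figure of 2 real regressions is
the upper end of the "1 or 2" guess, not a measured value.

```python
comparisons = 16773
unchanged_small_pct = 13852
unchanged_small_abs = 2603
flagged = 318

# The three non-zero buckets account for every comparison.
accounted = unchanged_small_pct + unchanged_small_abs + flagged

flag_rate = flagged / comparisons   # share of comparisons that were reported
assumed_real = 2                    # upper end of the "1 or 2" guess
precision = assumed_real / flagged  # share of reports that were worth reading

print(f"accounted for: {accounted} of {comparisons}")
print(f"flagged: {flag_rate:.1%}, real among flagged: {precision:.1%}")
```

Only about 1.9% of comparisons are flagged, yet under these assumptions fewer
than 1 in 100 of the flagged items is a real change.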

Observations: Most of the work is done by dropping small deltas.  Confidence
intervals and Mann-Whitney U tests are the tests we want to be triggering;
however, they only work with many samples. Even with reruns, most tests end up
being a single sample.  LNT bots that are triggered after another build (unless
using the multisample feature) just have one sample at each rev.  Multisample is
not a good option because most runs already take a long time.

Even with a small amount of predictable noise, the len(current samples) == 1
rule will flag a lot of samples, especially if len(prev) > 1.  Reruns actually
make this worse by making it likely that we flag the next run after the run we
reran.  For instance, a flat line with 5% random noise flags all the time.
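
The flat-line case can be reproduced with a small simulation. The sketch below
assumes a one-second benchmark, uniform noise of plus or minus 5%, one sample
per run, and only the 1% and 0.01s filters; all of these are assumptions for
illustration, not properties of any particular LNT machine.

```python
import random


def flags(current, previous, min_pct=0.01, min_abs=0.01):
    """Single-sample rule: report any change of at least 1% and 0.01s."""
    delta = abs(current - previous)
    return delta >= min_abs and delta / previous >= min_pct


def flag_rate(runs=10_000, base=1.0, noise=0.05, seed=0):
    """Fraction of consecutive run pairs flagged on a flat, noisy series."""
    rng = random.Random(seed)
    series = [base * (1 + rng.uniform(-noise, noise)) for _ in range(runs)]
    hits = sum(flags(cur, prev) for prev, cur in zip(series, series[1:]))
    return hits / (runs - 1)


print(f"flagged on a flat line: {flag_rate():.0%}")
```

Nothing in the series ever changes, yet under these assumptions roughly four
out of five comparisons are reported as an improvement or a regression.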

Besides the Mann-Whitney U test, we are not using prev_samples in any sane
way.

Ideas: 

- Try to get more samples in as many places as possible.  Maybe --multisample=4
should be the default?  Make bots run more often (I have already done this on
green dragon).

- Use recent past run information to enhance single sample regression detection.
I think we should add a lookback window, and model the recent past.  I tried a
technique suggested by Mikhail Zolotukhin of computing delta as the smallest
difference between current and all the previous samples.  It was far more
effective.  Alternatively, we could try a confidence interval generated from
the previous samples, though that may not work on bimodal tests.

- Currently prev_samples is almost always just one other run, probably with only
one sample itself.  Let's give this more samples to work with. Start passing more
previous run data to all uses of the algorithm. In most places we intentionally
limit the computation to current=run and previous=run-1; let's do something like
previous=run-[1-10]. The risk in this approach is that regression noise in the
lookback window could trigger a false negative (we miss detecting a
regression).  I think this is acceptable since we already miss lots of them
because the reports are not actionable.

- Given the choice between false positive and false negative, let's err towards
false negative.  We need to have a manageable number of regressions detected or
else we can’t act on them.
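
The lookback idea from the list above can be sketched as follows. The function
name, the window parameter and the sample data are made up for the example; the
only part taken from the proposal is "delta is the smallest difference between
the current value and all previous samples in the window".

```python
def lookback_delta(current, history, window=10):
    """Smallest-magnitude signed difference between the current minimum and
    any sample from the last `window` runs.

    `current` is a list of samples for the new run; `history` is a list of
    runs, oldest first, each a list of samples. Names are hypothetical.
    """
    recent = [sample for run in history[-window:] for sample in run]
    if not recent:
        raise ValueError("no previous samples in the lookback window")
    cur = min(current)
    return min((cur - prev for prev in recent), key=abs)


# A noisy but flat history, one sample per run.
history = [[1.00], [1.05], [0.96], [1.04], [0.97]]

# Comparing against run-1 only sees 1.05 vs 0.97, an apparent 8% regression.
naive = min([1.05]) - min(history[-1])
# The lookback delta finds that 1.05 has been seen before.
windowed = lookback_delta([1.05], history)
print(naive, windowed)
```

A value well outside the recent range, such as 1.20, still yields a large
delta (about 0.15 against the closest previous sample), so a real step is not
hidden by the window. The false-negative risk mentioned above shows up when a
noisy outlier in the window happens to sit near the new level.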

Any objections to me implementing these ideas?

Sean Silva

2015-May-16 02:16 UTC

Is there a way to download the data off http://llvm.org/perf? I'd like to
help with this but I don't have a good dataset to analyze.

It definitely seems like the weakest part of the current and proposed
scheme is that it only looks at two runs. That is basically useless when
we're talking about only a handful of samples (<4???) per run. Since the
machine's noise can be modeled from run to run (also sample to sample, but
for simplicity just consider run to run) as a random process in the run
number, all the techniques from digital filtering come into play. From
looking at a couple of the graphs on LNT, the machine noise appears to be
almost exclusively at Nyquist (i.e. it alternates from sample to sample)
falling down to a bit at half Nyquist (I can analyze in more detail if I
can get my hands on the data). We probably want a lowpass differentiator at
about half Nyquist.
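
One very simple filter in the spirit of this suggestion is the mean of the two
newest runs minus the mean of the two runs before them. It cancels a constant
level and cancels sample-to-sample alternation exactly, while a sustained step
passes through at full height. It is picked here only for illustration: it is
not a tuned design and not necessarily the lowpass differentiator meant above.

```python
def smoothed_step(series):
    """Mean of the two newest points minus the mean of the two before them.

    Equivalent to convolving with the kernel [1, 1, -1, -1] / 2, which has
    zero response both to a constant level and to strict alternation.
    """
    return [
        (series[n] + series[n - 1]) / 2 - (series[n - 2] + series[n - 3]) / 2
        for n in range(3, len(series))
    ]


alternating = [10.0, 10.5] * 4         # pure run-to-run alternation
step = [10.0] * 4 + [11.0] * 4         # a sustained 10% regression

print(smoothed_step(alternating))
print(smoothed_step(step))
```

The cost is latency: the step only reaches its full height in the output one
run after it first appears.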

I would strongly recommend starting with a single benchmark on a single
machine and coming up with a detection routine just for it that is basically
100% accurate, then generalizing as appropriate so that you are getting
reliable coverage of a larger portion of the benchmarks. The machine's
noise is probably easiest to characterize and most generalizable across
runs.


-- Sean Silva


Chris Matthews

2015-May-16 03:46 UTC

The easiest way to get the data off http://llvm.org/perf is to
use the json APIs.  For many of the pages, if you pass &json=True, LNT will
give you a json reply.  For example, go to a run page, click the check boxes
next to a bunch of runs and click graph.  When all the run lines pop up in the
graph page, add &json=True and it will download data for those tests on those
machines.

For example, on the O3 tester, all the benchmarks in Multisource/Applications:

http://llvm.org/perf/db_default/v4/nts/graph?plot.1327=21.1327.0&plot.1053=21.1053.0&plot.1232=21.1232.0&plot.1483=21.1483.0&plot.1014=21.1014.0&plot.1138=21.1138.0&plot.1180=21.1180.0&plot.1288=21.1288.0&plot.1129=21.1129.0&plot.1425=21.1425.0&plot.1456=21.1456.0&plot.1038=21.1038.0&plot.1452=21.1452.0&plot.1166=21.1166.0&plot.1243=21.1243.0&plot.1116=21.1116.0&plot.1326=21.1326.0&plot.1279=21.1279.0&plot.1007=21.1007.0&plot.1394=21.1394.0&plot.1017=21.1017.0&plot.1443=21.1443.0&plot.1445=21.1445.0&plot.1197=21.1197.0&plot.1332=21.1332.0&json=True

I have scripts for converting the json data into Python Pandas format, though
the format is so simple you can really parse it with anything. I could
contribute them if anyone would find them helpful. I know there are also scripts
floating around for scraping all machines and runs from the LNT instance by
first looking up the run and machine list, then fetching all the tests for each.
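
A minimal sketch of fetching such a graph page as json from Python, using only
the standard library. The helper names are made up, and reading the plot
parameter as machine.test.field is an inference from the example URL above
rather than documented behaviour. The shape of the reply is not assumed here;
the result is just whatever the server returns, parsed as json.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example above; it may no longer be served.
BASE = "http://llvm.org/perf/db_default/v4/nts/graph"


def graph_json_url(machine_id, test_ids, field=0, base=BASE):
    """Build a graph URL with one plot.<test> parameter per test id."""
    params = [(f"plot.{t}", f"{machine_id}.{t}.{field}") for t in test_ids]
    params.append(("json", "True"))
    return f"{base}?{urlencode(params)}"


def fetch_graph_data(url, timeout=30):
    """Download the page and parse the reply as json."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = graph_json_url(21, [1327, 1053])
print(url)
# data = fetch_graph_data(url)  # requires network access to the LNT server
```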


Kristof Beyls

2015-May-18 15:02 UTC

Thanks for raising this, Chris!

I also think that improving the signal-to-noise ratio in the performance
reports produced by LNT is essential to make the performance-tracking
bots useful and effective.

Our experience, using LNT internally, has been that if the number of false
positives is low enough (fewer than about half a dozen per report or day),
the reports become usable, leaving only a little manual investigation work
to determine whether a particular change was significant or in the noise.
Yes, ideally the automated noise detection should be perfect; but even if
it's not perfect, it will already be a massive win.

I have some further ideas and remarks below.

Thanks,

Kristof

> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at
cs.uiuc.edu]
> On Behalf Of Chris Matthews
> Sent: 15 May 2015 22:25
> To: LLVM Developers Mailing List
> Subject: [LLVMdev] Proposal: change LNT’s regression detection algorithm
> and how it is used to reduce false positives
> 
> tl;dr in low data situations we don’t look at past information, and that
> increases the false positive regression rate.  We should look at the
> possibly incorrect recent past runs to fix that.
> 
> Motivation: LNT’s current regression detection system has false positive
> rate that is too high to make it useful.  With test suites as large as
> the llvm “test-suite” a single report will show hundreds of regressions.
> The false positive rate is so high the reports are ignored because it is
> impossible for a human to triage them, large performance problems are
> lost in the noise, small important regressions never even have a chance.
> Later today I am going to commit a new unit test to LNT with 40 of my
> favorite regression patterns.  It has gems such as flat but noisy line,
> 5% regression in 5% noise, bimodal, and a slow increase, we fail to
> classify most of these correctly right now. They are not trick
> questions, all are obvious regressions or non-regressions, that are
> plainly visible. I want us to correctly classify them all!
 

That's a great idea!

Out of all of the ideas in this email, I think this is the most important
one to implement first.

 
> Some context: LNTs regression detection algorithm as I understand it:
> 
> detect(current run’s samples, last runs samples) —> improve, regress or
> unchanged.
> 
>     # when recovering from errors performance should not be counted
>     Current or last run failed -> unchanged
> 
>     delta = min(current samples) - min(prev samples)
 

I am not convinced that "min" is the best way to define the delta.
It assumes that the "true" performance of code generated by llvm
is the fastest it was ever seen running. I think this isn't the correct way
to model e.g. programs with bimodal behaviour, nor programs with a normal
distribution. I'm afraid I don't have a better solution, but I think the
Mann-Whitney U test - which tries to determine whether the sample points
indicate different underlying distributions - is closer to what we
really ought to use to detect if a regression is "real". It models the fact
that a fixed program, when run multiple times, has a distribution of
performance. I think that using "min" makes too many broken assumptions about
what the distribution can look like.
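
A small made-up example of the difference: two runs of a bimodal program whose
minimum is identical, although the newer run spends most of its time in the
slow mode. The sketch computes only the Mann-Whitney U statistic by counting
pairs; turning U into a p-value (as a statistics library would) is left out,
and the sample values are invented for the illustration.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: the number of (x, y) pairs with x > y,
    counting ties as one half."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )


prev = [1.0, 1.0, 1.0, 1.5]   # mostly in the fast mode
cur = [1.0, 1.5, 1.5, 1.5]    # mostly in the slow mode

min_delta = min(cur) - min(prev)
u = mann_whitney_u(cur, prev)
no_shift = len(cur) * len(prev) / 2   # value of U expected without a shift

print(min_delta, u, no_shift)
```

The min-based delta is exactly zero, while U for the newer run is 12 against
an expected 8 with no shift. With only four samples per side that is a hint
rather than proof, which matches the point that these tests need enough data.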

 
> 
>     # too small to measure
>     delta <  (confidence*machine noise threshold (0.0005s by default)) -> unchanged
> 
>     # too small to care
>     delta % < 1% -> unchanged
> 
>     # too small to care
>     delta < 0.01s -> unchanged
> 
>     if len(current samples) >= 4 && len(prev samples) >= 4
>          Mann whitney U test -> possible unchanged
> 
>     #multisample, confidence interval check
>     if len(current samples) > 1
>            check delta within samples confidence interval -> if so,
> unchanged, else Improve, regress.
> 
>     # single sample,range check
>     if len(current samples) == 1
>         all % deltas above 1% improve or regress
> 
> 
> The too small to care rules are newer inventions.
> 
> Effectiveness data: to see how well these rules work I ran a 14 machine,
> 7 day report:
> 
> - 16773 run comparisons
> - 13852 marked unchanged because of small % delta
> - 2603 unchanged because of small delta
> - 0 unchanged because of Mann Whitney U test
> - 0 unchanged because of confidence interval
> - 318 improved or regressed because single sample change over 1%
> 
> Real regressions: probably 1 or 2, not that I will click 318 links to
> check for sure… hence the motivation.
> 
> Observations: Most of the work is done by dropping small deltas.
> Confidence intervals and Mann Whitney U tests are the tests we want to
> be triggering, however they only work with many samples. Even with
> reruns, most tests end up being a single sample.  LNT bots that a
> triggered after another build (unless using the multisample feature)
> just have one sample at each rev.  Multisample is not a good option
> because most runs already take a long time.
> 
> Even with a small amount of predictable noise, if len(current samples)
> == 1, will flag a lot of samples, especially if len(prev) > 1.  Reruns
> actually make this worse by making it likely that we flag the next run
> after the run we rerun.  For instance, a flat line with 5% random noise
> flags all the time.
> 
> Besides the Mann Whitney U test, we are not using prev_samples in any
> way sane way.
> 
> Ideas:
> 
> -Try and get more samples in as many places as possible.  Maybe
> --multisample=4 should be the default?  Make bots run more often (I have
> already done this on green dragon).
 

FWIW, the Cortex-A53 performance tracker I've set up recently uses
multisample=3. The Cortex-A53 is a slower/more energy-efficient core,
so it takes about 6 hours to do an LLVM rebuild + 3 runs of the LNT
benchmarks (see http://llvm.org/perf/db_default/v4/nts/machine/39).

 
> - Use recent past run information to enhance single sample regression
> detection.  I think we should add a lookback window, and model the
> recent past.  I tired a technique suggested by Mikhail Zolotukhin of
> computing delta as the smallest difference between current and all the
> previous samples.  It was far more effective.  Alternately we could try
> a confidence interval generated from previous, though that may not work
> on bimodal tests.
 

The noise levels per individual program are often dependent on the
micro-architecture of the core it runs on. Before setting up the Cortex-A53
performance tracking bot, I did a bit of analysis to find out what the noise
levels are per program across a Cortex-A53, a Cortex-A57 and a Core i7 CPU.
Below is an example of a chart for just one program, indicating that the noise
level is sometimes dependent on the micro-architecture of the core it runs on.
Whereas a Mann-Whitney U test, or something similar, would probably find, given
enough data points, what should be considered noise and what not, there may be
a way to run the test-suite in benchmark mode many times when a board gets set
up, and analyse the results of that. The idea is that this way, the noisiness
of the board as set up could be measured for fixed binaries, and that
information could be used when not enough sample points are available.
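
One way this calibration idea could look in code is sketched below. The
function names, the use of the coefficient of variation as the noise measure,
the factor k=3 and the sample timings are all assumptions made for the
example, not part of the proposal.

```python
from statistics import mean, stdev


def noise_profile(calibration):
    """Relative noise per program, from repeated runs of one fixed binary.

    `calibration` maps a program name to a list of timings in seconds.
    """
    return {name: stdev(times) / mean(times)
            for name, times in calibration.items()}


def exceeds_noise(program, previous, current, profile, k=3.0):
    """True if the relative change is larger than k times the measured noise."""
    change = abs(current - previous) / previous
    return change > k * profile[program]


# Hypothetical calibration data gathered when the board was set up.
calibration = {
    "steady": [1.00, 1.01, 0.99, 1.00],
    "jittery": [1.00, 1.10, 0.92, 1.05],
}
profile = noise_profile(calibration)

print(exceeds_noise("steady", 1.00, 1.05, profile))
print(exceeds_noise("jittery", 1.00, 1.05, profile))
```

The same 5% change is reported for the steady program and ignored for the
jittery one, even though each comparison uses a single sample per run.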

(FWIW: for this program, the noisiness seems to come from noisiness in the
number of branch mispredicts.)

BTW - graphs like the one below make me think that the LNT webUI should be
showing sample points by default instead of line graphs showing the minimum
execution time per build number.

[chart not preserved in this archive]

> - Currently prev_samples is almost always just one other run, probably
> with only one sample itself.  Lets give this more samples to work with.
> Start passing more previous run data to all uses of the algorithm, in
> most places we intentionally limit the computation to current=run and
> previous=run-1, lets do something like previous=run-[1-10]. The risk in
> this approach is that regression noise in the look back window could
> trigger a false negative (we miss detecting a regression).  I think this
> is acceptable since we already miss lots of them because the reports are
> not actionable.
> 
> - Given the choice between false positive and false negative, lets err
> towards false negative.  We need to have manageable number of
> regressions detected or else we can’t act on them.
 

This sounds like a good idea to me. Let's first make sure we have a working
system for (semi-?)automatically detecting at least a good portion of the
significant performance regressions. After that we can fine-tune to reduce
false negatives and catch a larger part of all significant performance
regressions.

 
> 
> Any objections to me implementing these ideas?
 

Absolutely not. Once implemented, we probably ought to have an idea about how
to test which combination of methods works best in practice. Could the
sample points you’re going to add to the LNT unit tests help in testing which
combination of methods works best?

 

I've got 2 further ideas, based on observations from the data coming from
the Cortex-A53 performance tracker that I added about 10 days ago - see
http://llvm.org/perf/db_default/v4/nts/machine/39.
I'll be posting patches for review for these soon:

1. About 20 of the 300-ish programs that get run in benchmark-only mode run
for less than 10 milliseconds. These 20 programs are one of the main sources
of noisiness. We should just not run these programs in benchmark-only mode.
Or, alternatively, we should make them run a bit longer, so that they are less
noisy.

2. The board I'm running the Cortex-A53 performance tracker on is a big.LITTLE
system with 2 Cortex-A57s and 4 Cortex-A53s. To build the benchmark binaries,
I'm using all cores, to make the turn-around time of the bot as fast as
possible. However, this leads to huge noise levels on the "compile_time"
metric, as sometimes a binary gets compiled on a Cortex-A53 and sometimes on a
Cortex-A57. For this board specifically, it just shouldn't be reporting
compile_time at all, since the numbers are meaningless for a
performance-tracking use case.
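
Idea 1 above amounts to a simple filter on measured execution time. In the
sketch below the function name and the program names are made up, and the 10
millisecond cut-off is the only number taken from the text.

```python
def drop_short_programs(exec_times, min_seconds=0.010):
    """Keep only programs that run for at least `min_seconds`.

    `exec_times` maps a program name to its execution time in seconds.
    """
    return {name: t for name, t in exec_times.items() if t >= min_seconds}


# Hypothetical timings; the names are placeholders, not real test-suite entries.
exec_times = {
    "tiny_kernel": 0.004,
    "small_loop": 0.009,
    "real_benchmark": 2.350,
}

kept = drop_short_programs(exec_times)
print(sorted(kept))
```

The alternative mentioned in the same item, making the short programs run
longer, keeps their coverage but needs a change to each benchmark rather than
a filter in the harness.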

 

 

Another thought: if we could reduce the overall run-time of the LNT run in
benchmark-only mode, we could run more "multi-samples" in the same amount of
time. I did a quick analysis on whether it would be worthwhile to invest effort
in making some of the long-running programs in the test-suite run shorter in
benchmarking mode. On the Cortex-A53 board, it shows that the 27 longest-running
programs out of the 300-ish consume about half the run-time. If we could easily
make these 27 programs run for an order of magnitude less time, we could almost
halve the total execution time of the test-suite, and hence run almost twice
the number of samples in the same amount of time. The longest-running programs
I’ve found are, sorted:

 

 0. 7.23% cumulative (7.23% - 417.15s this program) nts.MultiSource/Benchmarks/PAQ8p/paq8p.exec
 1. 13.74% cumulative (6.51% - 375.84s this program) nts.MultiSource/Benchmarks/SciMark2-C/scimark2.exec
 2. 18.83% cumulative (5.08% - 293.16s this program) nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.exec
 3. 21.60% cumulative (2.77% - 160.02s this program) nts.MultiSource/Benchmarks/mafft/pairlocalalign.exec
 4. 24.01% cumulative (2.41% - 138.98s this program) nts.SingleSource/Benchmarks/CoyoteBench/almabench.exec
 5. 26.32% cumulative (2.32% - 133.59s this program) nts.MultiSource/Applications/lua/lua.exec
 6. 28.26% cumulative (1.94% - 111.80s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.exec
 7. 30.11% cumulative (1.85% - 106.56s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.exec
 8. 31.60% cumulative (1.49% - 86.00s this program) nts.SingleSource/Benchmarks/CoyoteBench/huffbench.exec
 9. 32.75% cumulative (1.15% - 66.37s this program) nts.MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.exec
10. 33.90% cumulative (1.15% - 66.13s this program) nts.MultiSource/Applications/hexxagon/hexxagon.exec
11. 35.04% cumulative (1.14% - 65.98s this program) nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.exec
12. 36.14% cumulative (1.10% - 63.21s this program) nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.exec
13. 37.22% cumulative (1.08% - 62.35s this program) nts.SingleSource/Benchmarks/SmallPT/smallpt.exec
14. 38.30% cumulative (1.08% - 62.30s this program) nts.MultiSource/Benchmarks/nbench/nbench.exec
15. 39.37% cumulative (1.07% - 61.98s this program) nts.MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.exec
16. 40.40% cumulative (1.03% - 59.50s this program) nts.MultiSource/Applications/SPASS/SPASS.exec
17. 41.37% cumulative (0.97% - 55.74s this program) nts.MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.exec
18. 42.33% cumulative (0.96% - 55.40s this program) nts.SingleSource/Benchmarks/Misc/ReedSolomon.exec
19. 43.27% cumulative (0.94% - 54.34s this program) nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.exec
20. 44.21% cumulative (0.94% - 54.20s this program) nts.MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.exec
21. 45.12% cumulative (0.91% - 52.46s this program) nts.SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.exec

22. 46.01% cumulative (0.89% - 51.49s this program)
nts.MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.exec

23. 46.89% cumulative (0.88% - 50.66s this program)
nts.MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.exec

24. 47.73% cumulative (0.84% - 48.74s this program)
nts.MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.exec

25. 48.57% cumulative (0.84% - 48.43s this program)
nts.MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.exec

26. 49.40% cumulative (0.83% - 47.92s this program)
nts.SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.exec

27. 50.22% cumulative (0.81% - 46.92s this program)
nts.MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.exec

28. 51.03% cumulative (0.81% - 46.90s this program)
nts.MultiSource/Applications/minisat/minisat.exec

29. 51.81% cumulative (0.78% - 44.88s this program)
nts.MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.exec

…

 

 

For example, there seem to be a lot of TSVC benchmarks among the longest-running
ones. They all seem to take a command line parameter that defines the number of
iterations the main loop in the benchmark should run. Just tuning these, so
that all these benchmarks run for O(1s), would already make the overall
test-suite run significantly faster.

For the Polybench test cases: they print out lots of floating point numbers.
This should probably be changed in the makefile so they no longer dump the
matrices they work on. I’m not sure how big the impact on overall run time will
be for the Polybench benchmarks when doing this.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.png
Type: image/png
Size: 46772 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/21963a0d/attachment.png>

Kristof Beyls

2015-May-18 16:39 UTC


Thanks for raising this, Chris!

I also think that improving the signal-to-noise ratio in the performance
reports produced by LNT is essential to make the performance-tracking
bots useful and effective.

Our experience, using LNT internally, has been that if the number of false
positives is low enough (lower than about half a dozen per report or day),
the reports become usable, leaving only a little bit of manual investigation
work to decide whether a particular change was significant or in the noise.
Yes, ideally the automated noise detection should be perfect; but even if it's
not perfect, it will already be a massive win.

I have some further ideas and remarks below.

Thanks,

Kristof
> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of Chris Matthews
> Sent: 15 May 2015 22:25
> To: LLVM Developers Mailing List
> Subject: [LLVMdev] Proposal: change LNT’s regression detection algorithm
> and how it is used to reduce false positives
> 
> tl;dr in low data situations we don’t look at past information, and that
> increases the false positive regression rate.  We should look at the
> possibly incorrect recent past runs to fix that.
> 
> Motivation: LNT’s current regression detection system has false positive
> rate that is too high to make it useful.  With test suites as large as
> the llvm “test-suite” a single report will show hundreds of regressions.
> The false positive rate is so high the reports are ignored because it is
> impossible for a human to triage them, large performance problems are
> lost in the noise, small important regressions never even have a chance.
> Later today I am going to commit a new unit test to LNT with 40 of my
> favorite regression patterns.  It has gems such as flat but noisy line,
> 5% regression in 5% noise, bimodal, and a slow increase, we fail to
> classify most of these correctly right now. They are not trick
> questions, all are obvious regressions or non-regressions, that are
> plainly visible. I want us to correctly classify them all!
That's a great idea!
Out of all of the ideas in this email, I think this is the most important
one to implement first.
> Some context: LNTs regression detection algorithm as I understand it:
> 
> detect(current run’s samples, last runs samples) —> improve, regress or
> unchanged.
> 
>     # when recovering from errors performance should not be counted
>     Current or last run failed -> unchanged
> 
>     delta = min(current samples) - min(prev samples)
I am not convinced that "min" is the best way to define the delta. It assumes
that the "true" performance of code generated by llvm is the fastest it was
ever seen running. I think this isn't the correct way to model e.g. programs
with bimodal behaviour, nor programs with a normal distribution. I'm afraid I
don't have a better solution, but I think the Mann-Whitney U test, which tries
to determine if the sample points seem to indicate different underlying
distributions, is closer to what we really ought to use to detect if a
regression is "real". This way, it models that a fixed program, when run
multiple times, has a distribution of performance. I think that using "min"
makes too many broken assumptions about what the distribution can look like.
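
To make the Mann-Whitney U suggestion concrete, here is a minimal sketch in
Python. It is an illustration only, not LNT code: the function name
mann_whitney_u and the sample values in the usage note are hypothetical, and
the p-value uses the plain normal approximation (no tie or continuity
correction), which is only a rough guide for the very small sample counts
discussed in this thread.

```python
import math


def mann_whitney_u(prev_samples, cur_samples):
    """Return (U, p) for a two-sided Mann-Whitney U test.

    U counts the (prev, cur) pairs in which the previous sample is larger,
    with ties counted as one half. p comes from the normal approximation
    without tie or continuity correction (an assumption made for brevity).
    """
    n1, n2 = len(prev_samples), len(cur_samples)
    if n1 == 0 or n2 == 0:
        raise ValueError("both sample lists must be non-empty")

    u = 0.0
    for a in prev_samples:
        for b in cur_samples:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5

    # Under the null hypothesis U has mean n1*n2/2 and this variance.
    mean_u = n1 * n2 / 2.0
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sigma_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p
```

Calling mann_whitney_u([1.00, 1.01, 0.99, 1.02, 1.00], [1.10, 1.11, 1.09, 1.12, 1.10])
gives U = 0, because every current sample is slower than every previous one,
together with a small p-value. With only one or two samples per run the test
cannot reach significance at all, which is why more samples per run matter.
SciPy's scipy.stats.mannwhitneyu offers a more careful implementation where
SciPy is available.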
> Ideas:
> 
> -Try and get more samples in as many places as possible.  Maybe —
> multisample=4 should be the default?  Make bots run more often (I have
> already done this on green dragon).
FWIW, the Cortex-A53 performance tracker I've set up recently uses
multisample=3. The Cortex-A53 is a slower/more energy-efficient core,
so it takes about 6 hours to do an LLVM rebuild + 3 runs of the LNT
benchmarks (see http://llvm.org/perf/db_default/v4/nts/machine/39).
BTW, what is "green dragon"?
> - Use recent past run information to enhance single sample regression
> detection.  I think we should add a lookback window, and model the
> recent past.  I tired a technique suggested by Mikhail Zolotukhin of
> computing delta as the smallest difference between current and all the
> previous samples.  It was far more effective.  Alternately we could try
> a confidence interval generated from previous, though that may not work
> on bimodal tests.
The noise levels per individual program are often dependent on the
micro-architecture of the core it runs on. Before setting up the Cortex-A53
performance tracking bot, I did a bit of analysis to find out what the noise
levels are per program across a Cortex-A53, a Cortex-A57 and a Core i7 CPU.
Attached is an example of a chart for just one program, indicating that the
noise level is sometimes dependent on the micro-architecture of the core it
runs on. A Mann-Whitney U test, or similar, would probably find, given enough
data points, what should be considered noise and what not. However, there may
also be a way to run the test-suite in benchmark mode many times when a board
gets set up, and analyse the results of that. The idea is that this way, the
noisiness of the board as set up could be measured for fixed binaries, and
that information could be used when not enough sample points are available.
(FWIW: for this program, the noisiness seems to come from noisiness in the
number of branch mispredicts.)
BTW, graphs like the one attached make me think that the LNT webUI should be
showing sample points by default instead of line graphs showing the minimum
execution time per build number.
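
The board-calibration idea above could be sketched roughly as follows. This is
a hypothetical illustration, not an LNT feature: the function names, the
choice of (max - min) / median as the noise measure, and the 1% fallback for
programs without calibration data are all assumptions made for the example.

```python
import statistics


def noise_levels(calibration_runs):
    """Map each program to its relative spread, (max - min) / median,
    measured over repeated runs of the same fixed binary."""
    levels = {}
    for program, times in calibration_runs.items():
        med = statistics.median(times)
        levels[program] = (max(times) - min(times)) / med if med > 0 else 0.0
    return levels


def exceeds_noise(program, prev_time, cur_time, levels, default=0.01):
    """True when the relative change is larger than the calibrated noise
    for this program; `default` is used for programs never calibrated."""
    if prev_time <= 0:
        return False
    change = abs(cur_time - prev_time) / prev_time
    return change > levels.get(program, default)
```

With calibration data such as {"a": [1.0, 1.0, 1.1]}, program "a" gets a
noise level of roughly 10%, so a later change from 1.0s to 1.05s would be
treated as noise while a change to 1.2s would not. Because the levels are
stored per program and would be measured per board, this accommodates the
micro-architecture dependence described above.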


> - Currently prev_samples is almost always just one other run, probably
> with only one sample itself.  Lets give this more samples to work with.
> Start passing more previous run data to all uses of the algorithm, in
> most places we intentionally limit the computation to current=run and
> previous=run-1, lets do something like previous=run-[1-10]. The risk in
> this approach is that regression noise in the look back window could
> trigger a false negative (we miss detecting a regression).  I think this
> is acceptable since we already miss lots of them because the reports are
> not actionable.
> 
> - Given the choice between false positive and false negative, lets err
> towards false negative.  We need to have manageable number of
> regressions detected or else we can’t act on them.
This sounds like a good idea to me. Let's first make sure we have a working
system for (semi-?)automatically detecting at least a good portion of the
significant performance regressions. After that we can fine-tune to reduce
false negatives and catch a larger part of all significant performance
regressions.

> 
> Any objections to me implementing these ideas?
Absolutely not. Once implemented, we probably ought to have an idea about how
to test which combination of methods works best in practice. Could the
sample points you’re going to add to the LNT unit tests help in testing which
combination of methods works best?

I've got 2 further ideas, based on observations from the data coming from the
Cortex-A53 performance tracker that I added about 10 days ago - see
http://llvm.org/perf/db_default/v4/nts/machine/39.
I'll be posting patches for review for these soon:

1. About 20 of the 300-ish programs that get run in benchmark-only mode run
for less than 10 milliseconds. These 20 programs are one of the main sources
of noisiness. We should just not run these programs in benchmark-only mode.
Alternatively, we could make them run a bit longer, so that they are less
noisy.

2. The board I'm running the Cortex-A53 performance tracker on is a big.LITTLE
system with 2 Cortex-A57s and 4 Cortex-A53s. To build the benchmark binaries,
I'm using all cores, to make the turn-around time of the bot as fast as
possible. However, this leads to huge noise levels on the "compile_time"
metric, as sometimes a binary gets compiled on a Cortex-A53 and sometimes on a
Cortex-A57. For this board specifically, it just shouldn't be reporting
compile_time at all, since the numbers are meaningless for performance
tracking.
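
As a rough illustration of idea 1, the programs that run too briefly to
measure reliably could be identified from one run's execution times. The
function name and the program names in the usage note are hypothetical; only
the 10 millisecond floor comes from the text above.

```python
def split_by_runtime(exec_times, min_seconds=0.010):
    """Split {program: seconds} into (kept, skipped) using a run-time floor.

    Programs faster than `min_seconds` end up in `skipped`; these are the
    candidates to drop from benchmark-only mode or to lengthen.
    """
    kept = {p: t for p, t in exec_times.items() if t >= min_seconds}
    skipped = {p: t for p, t in exec_times.items() if t < min_seconds}
    return kept, skipped
```

For example, split_by_runtime({"long_prog": 4.2, "tiny_prog": 0.004}) keeps
long_prog and lists tiny_prog as skipped.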


Another thought: if we could reduce the overall run-time of the LNT run in
benchmark-only mode, we could run more "multi-samples" in the same amount of
time. I did a quick analysis on whether it would be worthwhile to invest effort
in making some of the long-running programs in the test-suite run shorter in
benchmarking mode. On the Cortex-A53 board, it shows that the 27 longest-running
programs out of the 300-ish consume about half the run-time. If we could easily
make these 27 programs run an order of magnitude faster, we could almost halve
the total execution time of the test-suite, and hence run twice the number of
samples in the same amount of time. The longest-running programs I’ve found
are, sorted by run time:

 0. 7.23% cumulative (7.23% - 417.15s this program) nts.MultiSource/Benchmarks/PAQ8p/paq8p.exec
 1. 13.74% cumulative (6.51% - 375.84s this program) nts.MultiSource/Benchmarks/SciMark2-C/scimark2.exec
 2. 18.83% cumulative (5.08% - 293.16s this program) nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.exec
 3. 21.60% cumulative (2.77% - 160.02s this program) nts.MultiSource/Benchmarks/mafft/pairlocalalign.exec
 4. 24.01% cumulative (2.41% - 138.98s this program) nts.SingleSource/Benchmarks/CoyoteBench/almabench.exec
 5. 26.32% cumulative (2.32% - 133.59s this program) nts.MultiSource/Applications/lua/lua.exec
 6. 28.26% cumulative (1.94% - 111.80s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.exec
 7. 30.11% cumulative (1.85% - 106.56s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.exec
 8. 31.60% cumulative (1.49% - 86.00s this program) nts.SingleSource/Benchmarks/CoyoteBench/huffbench.exec
 9. 32.75% cumulative (1.15% - 66.37s this program) nts.MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.exec
10. 33.90% cumulative (1.15% - 66.13s this program) nts.MultiSource/Applications/hexxagon/hexxagon.exec
11. 35.04% cumulative (1.14% - 65.98s this program) nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.exec
12. 36.14% cumulative (1.10% - 63.21s this program) nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.exec
13. 37.22% cumulative (1.08% - 62.35s this program) nts.SingleSource/Benchmarks/SmallPT/smallpt.exec
14. 38.30% cumulative (1.08% - 62.30s this program) nts.MultiSource/Benchmarks/nbench/nbench.exec
15. 39.37% cumulative (1.07% - 61.98s this program) nts.MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.exec
16. 40.40% cumulative (1.03% - 59.50s this program) nts.MultiSource/Applications/SPASS/SPASS.exec
17. 41.37% cumulative (0.97% - 55.74s this program) nts.MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.exec
18. 42.33% cumulative (0.96% - 55.40s this program) nts.SingleSource/Benchmarks/Misc/ReedSolomon.exec
19. 43.27% cumulative (0.94% - 54.34s this program) nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.exec
20. 44.21% cumulative (0.94% - 54.20s this program) nts.MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.exec
21. 45.12% cumulative (0.91% - 52.46s this program) nts.SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.exec
22. 46.01% cumulative (0.89% - 51.49s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.exec
23. 46.89% cumulative (0.88% - 50.66s this program) nts.MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.exec
24. 47.73% cumulative (0.84% - 48.74s this program) nts.MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.exec
25. 48.57% cumulative (0.84% - 48.43s this program) nts.MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.exec
26. 49.40% cumulative (0.83% - 47.92s this program) nts.SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.exec
27. 50.22% cumulative (0.81% - 46.92s this program) nts.MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.exec
28. 51.03% cumulative (0.81% - 46.90s this program) nts.MultiSource/Applications/minisat/minisat.exec
29. 51.81% cumulative (0.78% - 44.88s this program) nts.MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.exec
…
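
A table like the one above can be produced with a few lines of Python. The
sketch below is an assumption about how such an analysis might look, not the
script that generated these numbers; the function names and the input format
({program: seconds}) are hypothetical.

```python
def cumulative_share(exec_times):
    """Return [(program, seconds, share, cumulative_share)] sorted by
    descending run time; shares are fractions of the total run time."""
    total = sum(exec_times.values())
    if total <= 0:
        return []
    rows, running = [], 0.0
    ordered = sorted(exec_times.items(), key=lambda kv: kv[1], reverse=True)
    for program, seconds in ordered:
        running += seconds
        rows.append((program, seconds, seconds / total, running / total))
    return rows


def programs_covering(exec_times, fraction=0.5):
    """Smallest number of longest-running programs whose combined run
    time reaches `fraction` of the total."""
    for count, row in enumerate(cumulative_share(exec_times), start=1):
        if row[3] >= fraction:
            return count
    return len(exec_times)
```

programs_covering(times, 0.5) then answers the question asked here: how many
of the longest-running programs account for half of the total run time.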


For example, there seem to be a lot of TSVC benchmarks among the longest-running
ones. They all seem to take a command line parameter that defines the number of
iterations the main loop in the benchmark should run. Just tuning these, so
that all these benchmarks run for O(1s), would already make the overall
test-suite run significantly faster.

For the Polybench test cases: they print out lots of floating point numbers.
This should probably be changed in the makefile so they no longer dump the
matrices they work on. I’m not sure how big the impact on overall run time will
be for the Polybench benchmarks when doing this.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: reg_detect_noise_.png
Type: image/png
Size: 15759 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/59e8dec2/attachment.png>

Mikhail Zolotukhin

2015-May-18 18:24 UTC


Hi Chris and others!

I totally support any work in this direction.

In its current state, LNT’s regression detection system is too noisy, which
makes it almost impossible to use in some cases. If after each run a developer
gets a dozen ‘regressions’, none of which happens to be real, he/she won’t care
about such reports after a while. We clearly need to filter out as much noise as
we can, and as it turns out even the simplest techniques could help here. For
example, the technique I used (which you mentioned earlier) takes ~15 lines of
code to implement and filters out almost all noise in our internal data-sets.
It’d be really cool to have something more scientifically proven though :)
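
For readers who want to see the shape of that technique, here is a minimal
sketch based only on its description earlier in the thread ("computing delta
as the smallest difference between current and all the previous samples"). It
is not the actual implementation: the function names, the choice of
min(previous_samples) as the base for the percentage, and the 1% threshold
are assumptions made for illustration.

```python
def smallest_delta(current_samples, previous_samples):
    """Signed difference of smallest magnitude between any current
    sample and any sample in the lookback window."""
    if not current_samples or not previous_samples:
        raise ValueError("need at least one sample on each side")
    return min(
        (cur - prev for cur in current_samples for prev in previous_samples),
        key=abs,
    )


def classify(current_samples, previous_samples, min_pct=0.01):
    """Return "regress", "improve" or "unchanged" for timing samples
    (larger is slower), ignoring changes smaller than `min_pct`."""
    delta = smallest_delta(current_samples, previous_samples)
    base = min(previous_samples)  # assumed reference for the percentage
    if base <= 0 or abs(delta) / base < min_pct:
        return "unchanged"
    return "regress" if delta > 0 else "improve"
```

The effect is that a new sample only counts as a change if it differs from
every sample in the window. A test that has recently produced both 1.00s and
1.20s will not flag a new 1.195s sample, whereas comparing against the
previous run alone might.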

One thing to add from me: I think we should try to do our best under the
assumption that we don’t have enough samples. Of course, the more data we have
the better, but in many cases we can’t (or don’t want to) increase the number
of samples, since it dramatically increases testing time. That’s not to
discourage anyone from increasing the number of samples, or adding techniques
relying on a significant number of samples, but rather to try mining as many
‘samples’ as possible from the data we have. E.g. I absolutely agree with your
idea to pass more than 1 previous run.

Thanks,
Michael

> On May 18, 2015, at 9:39 AM, Kristof Beyls <kristof.beyls at arm.com>
wrote:
> 
> Thanks for raising this, Chris!
> 
> I also think that improving the signal-to-noise ratio in the performance
> reports produced by LNT are essential to make the performance-tracking
> bots useful and effective.
> 
> Our experience, using LNT internally, has been that if the number of false
> positives are low enough (lower than about half a dozen per report or day),
> they become useable, leaving only a little bit of manual investigation work
> to detect if a particular change was significant or in the noise. Yes,
ideally
> the automated noise detection should be perfect; but even if it's not
perfect,
> it will already be a massive win.
> 
> I have some further ideas and remarks below.
> 
> Thanks,
> 
> Kristof
> 
>> -----Original Message-----
>> From: llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at
cs.uiuc.edu> [mailto:llvmdev-bounces at cs.uiuc.edu
<mailto:llvmdev-bounces at cs.uiuc.edu>]
>> On Behalf Of Chris Matthews
>> Sent: 15 May 2015 22:25
>> To: LLVM Developers Mailing List
>> Subject: [LLVMdev] Proposal: change LNT’s regression detection
algorithm
>> and how it is used to reduce false positives
>> 
>> tl;dr in low data situations we don’t look at past information, and
that
>> increases the false positive regression rate.  We should look at the
>> possibly incorrect recent past runs to fix that.
>> 
>> Motivation: LNT’s current regression detection system has false
positive
>> rate that is too high to make it useful.  With test suites as large as
>> the llvm “test-suite” a single report will show hundreds of
regressions.
>> The false positive rate is so high the reports are ignored because it
is
>> impossible for a human to triage them, large performance problems are
>> lost in the noise, small important regressions never even have a
chance.
>> Later today I am going to commit a new unit test to LNT with 40 of my
>> favorite regression patterns.  It has gems such as flat but noisy line,
>> 5% regression in 5% noise, bimodal, and a slow increase, we fail to
>> classify most of these correctly right now. They are not trick
>> questions, all are obvious regressions or non-regressions, that are
>> plainly visible. I want us to correctly classify them all!
> 
> That's a great idea!
> Out of all of the ideas in this email, I think this is the most important
> one to implement first.
> 
>> Some context: LNTs regression detection algorithm as I understand it:
>> 
>> detect(current run’s samples, last runs samples) —> improve, regress
or
>> unchanged.
>> 
>>    # when recovering from errors performance should not be counted
>>    Current or last run failed -> unchanged
>> 
>>    delta = min(current samples) - min(prev samples)
> 
> I am not convinced that "min" is the best way to define the
delta.
> It makes the assumption that the "true" performance of code
generated by llvm
> is the fastest it was ever seen running. I think this isn't the correct
way
> to model e.g. programs with bimodal behaviour, nor programs with a normal
> distribution. I'm afraid I don't have a better solution, but I
think the
> Mann Whitney U test - which tries to determine if the sample points seem
> to indicate different underlying distributions - is closer to what we
> really ought to use to detect if a regression is "real". This
way, it models
> that a fixed program, when run multiple times, has a distribution of
> performance. I think that using "min" makes too many broken
assumptions on
> what the distribution can look like.
> 
>> Ideas:
>> 
>> -Try and get more samples in as many places as possible.  Maybe —
>> multisample=4 should be the default?  Make bots run more often (I have
>> already done this on green dragon).
> 
> FWIW, the Cortex-A53 performance tracker I've set up recently uses
> multisample=3. The Cortex-A53 is a slower/more energy-efficient core,
> so it takes about 6 hours to do a LLVM rebuild + 3 runs of the LNT
> benchmarks (see http://llvm.org/perf/db_default/v4/nts/machine/39
<http://llvm.org/perf/db_default/v4/nts/machine/39>).
> BTW, what is "green dragon"?
> 
>> - Use recent past run information to enhance single sample regression
>> detection.  I think we should add a lookback window, and model the
>> recent past.  I tired a technique suggested by Mikhail Zolotukhin of
>> computing delta as the smallest difference between current and all the
>> previous samples.  It was far more effective.  Alternately we could try
>> a confidence interval generated from previous, though that may not work
>> on bimodal tests.
> 
> The noise levels per individual program are often dependent on the
> micro-architecture of the core it runs on. Before setting up the Cortex-A53
> performance tracking bot, I've done a bit of analysis to find out what
the noise
> levels are per program across a Cortex-A53, a Cortex-A57 and a Core i7 CPU.
In
> attachment is an example of a chart for just one program, indicating that
the noise level is
> sometimes dependent on the micro-architecture of the core it runs on.
Whereas a
> Mann-Withney U - or similar - test would probably find - given enough data
> points - what should be considered noise and what not; there may be a way
to
> run the test-suite in benchmark mode many times when a board gets set up,
and analyse
> the results of that. The idea is that this way, the noisiness of the board
as setup
> for fixed binaries could be measured, and that information could be used
when not
> enough sample points are available.
> (FWIW: for this program, the noisiness seems to come from noisiness in the
number
> of branch mispredicts).
> BTW – graphs like the one in attachment make me think that the LNT webUI
should be showing
> sample points by default instead of line graphs showing the minimum
execution time
> per build number.
> 
> 
> 
>> - Currently prev_samples is almost always just one other run, probably
>> with only one sample itself.  Lets give this more samples to work with.
>> Start passing more previous run data to all uses of the algorithm, in
>> most places we intentionally limit the computation to current=run and
>> previous=run-1, lets do something like previous=run-[1-10]. The risk in
>> this approach is that regression noise in the look back window could
>> trigger a false negative (we miss detecting a regression).  I think
this
>> is acceptable since we already miss lots of them because the reports
are
>> not actionable.
>> 
>> - Given the choice between false positive and false negative, lets err
>> towards false negative.  We need to have manageable number of
>> regressions detected or else we can’t act on them.
> 
> This sounds like a good idea to me. Let's first make sure we have a
working
> system of (semi-?)automatically detecting at least a good portion of the
> significant performance regression. After that we can fine tune to reduce
> false negatives to catch a larger part of all significant performance
> regressions.
> 
> 
>> 
>> Any objections to me implementing these ideas?
> 
> Absolutely not. Once implemented, we probably ought to have an idea about
how
> to test which combination of methods works best in practice. Could the
> sample points you’re going to add to the LNT unit tests help in testing
which
> combination of methods work best?
> 
> I've got 2 further ideas, based on observations from the data coming
from the
> Cortex-A53 performance tracker that I added about 10 days ago - see
> http://llvm.org/perf/db_default/v4/nts/machine/39
<http://llvm.org/perf/db_default/v4/nts/machine/39>.
> I'll be posting patches for review for these soon:
> 
> 1. About 20 of the 300-ish programs that get run in benchmark-only mode run
> for less than 10 milliseconds. These 20 programs are one of the main
sources
> of noisiness. We should just not run these programs in benchmark-only mode.
> Or - alternatively we should make them run a bit longer, so that they are
less
> noisy.
> 
> 2. The board I'm running the Cortex-A53 performance tracker on is a
big.LITTLE
> system with 2 Cortex-A57s and 4 Cortex-A53s. To build the benchmark
binaries,
> I'm using all cores, to make the turn-around time of the bot as fast as
possible.
> However, this leads to huge noise levels on the "compile_time"
metric, as sometimes
> a binary gets compiled on a Cortex-A53 and sometimes on a Cortex-A57. For
this
> board specifically, it just shouldn't be reporting compile_time at all,
since the
> numbers are meaningless from a performance-tracking use case.
> 
> 
> Another thought: if we could reduce the overall run-time of the LNT run in
> benchmark-only mode, we could run more "multi-samples" in the
same amount of
> time. I did a quick analysis on whether it would be worthwhile to invest
effort
> in making some of the long-running programs in the test-suite run shorter
in
> benchmarking mode. On the Cortex-A53 board, it shows that the 27
longest-running
> programs out of the 300-ish consume about half the run-time. If we could
easily
> make these 27 programs run an order-of-magnitude less long, we could almost
halve
> the total execution time of the test-suite, and hence run twice the number
of
> samples in the same amount of time. The longest running programs I’ve found
are,
> sorted:
> 
>  0. 7.23% cumulative (7.23% - 417.15s this program)
nts.MultiSource/Benchmarks/PAQ8p/paq8p.exec
>  1. 13.74% cumulative (6.51% - 375.84s this program)
nts.MultiSource/Benchmarks/SciMark2-C/scimark2.exec
>  2. 18.83% cumulative (5.08% - 293.16s this program)
nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.exec
>  3. 21.60% cumulative (2.77% - 160.02s this program)
nts.MultiSource/Benchmarks/mafft/pairlocalalign.exec
>  4. 24.01% cumulative (2.41% - 138.98s this program)
nts.SingleSource/Benchmarks/CoyoteBench/almabench.exec
>  5. 26.32% cumulative (2.32% - 133.59s this program)
nts.MultiSource/Applications/lua/lua.exec
>  6. 28.26% cumulative (1.94% - 111.80s this program)
nts.MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.exec
>  7. 30.11% cumulative (1.85% - 106.56s this program)
nts.MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.exec
>  8. 31.60% cumulative (1.49% - 86.00s this program)
nts.SingleSource/Benchmarks/CoyoteBench/huffbench.exec
>  9. 32.75% cumulative (1.15% - 66.37s this program)
nts.MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.exec
> 10. 33.90% cumulative (1.15% - 66.13s this program)
nts.MultiSource/Applications/hexxagon/hexxagon.exec
> 11. 35.04% cumulative (1.14% - 65.98s this program)
nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.exec
> 12. 36.14% cumulative (1.10% - 63.21s this program)
nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.exec
> 13. 37.22% cumulative (1.08% - 62.35s this program)
nts.SingleSource/Benchmarks/SmallPT/smallpt.exec
> 14. 38.30% cumulative (1.08% - 62.30s this program)
nts.MultiSource/Benchmarks/nbench/nbench.exec
> 15. 39.37% cumulative (1.07% - 61.98s this program)
nts.MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.exec
> 16. 40.40% cumulative (1.03% - 59.50s this program)
nts.MultiSource/Applications/SPASS/SPASS.exec
> 17. 41.37% cumulative (0.97% - 55.74s this program)
nts.MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.exec
> 18. 42.33% cumulative (0.96% - 55.40s this program)
nts.SingleSource/Benchmarks/Misc/ReedSolomon.exec
> 19. 43.27% cumulative (0.94% - 54.34s this program)
nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.exec
> 20. 44.21% cumulative (0.94% - 54.20s this program)
nts.MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.exec
> 21. 45.12% cumulative (0.91% - 52.46s this program)
nts.SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.exec
> 22. 46.01% cumulative (0.89% - 51.49s this program)
nts.MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.exec
> 23. 46.89% cumulative (0.88% - 50.66s this program)
nts.MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.exec
> 24. 47.73% cumulative (0.84% - 48.74s this program)
nts.MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.exec
> 25. 48.57% cumulative (0.84% - 48.43s this program)
nts.MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.exec
> 26. 49.40% cumulative (0.83% - 47.92s this program)
nts.SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.exec
> 27. 50.22% cumulative (0.81% - 46.92s this program)
nts.MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.exec
> 28. 51.03% cumulative (0.81% - 46.90s this program)
nts.MultiSource/Applications/minisat/minisat.exec
> 29. 51.81% cumulative (0.78% - 44.88s this program)
nts.MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.exec
> …
> 
> 
> For example, there seem to be a lot of TSVC benchmarks among the
> longest-running ones. They all seem to take a command line parameter that
> defines the number of iterations the main loop in the benchmark should run.
> Just tuning these, so that all of these benchmarks run for O(1s), would
> already make the overall test-suite run significantly faster.
>
> For the Polybench test cases: they print out lots of floating point numbers.
> This should probably be changed in the makefile so they don’t dump the
> matrices they work on anymore. I’m not sure how big the impact of this will
> be on the overall run time of the Polybench benchmarks.
> 
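The ranking quoted above and the TSVC suggestion can be sketched in a few lines of Python. This is a minimal sketch, not LNT code: the function names `cumulative_share` and `capped_total`, the input format (a mapping from benchmark name to run time in seconds), and the timings in the usage note are assumptions for illustration. `capped_total` also assumes that tuning an iteration count simply brings a long benchmark down to about `cap` seconds and leaves shorter ones alone.

```python
from typing import Dict, List, Tuple


def cumulative_share(times: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Rank benchmarks by run time, longest first.

    Returns (name, percent of total, cumulative percent) tuples.
    """
    total = sum(times.values())
    if total <= 0:
        return []
    rows = []
    running = 0.0
    for name, seconds in sorted(times.items(), key=lambda kv: kv[1], reverse=True):
        running += seconds
        rows.append((name, 100.0 * seconds / total, 100.0 * running / total))
    return rows


def capped_total(times: Dict[str, float], cap: float = 1.0) -> float:
    """Estimate the total run time if no benchmark ran longer than `cap` seconds."""
    return sum(min(seconds, cap) for seconds in times.values())
```

With hypothetical timings `{"a": 50.0, "b": 30.0, "c": 20.0}`, `cumulative_share` ranks `a`, `b`, `c` with cumulative shares of 50%, 80% and 100%, and `capped_total` with the default 1-second cap estimates 3.0 seconds instead of 100.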

Chris Matthews

2015-May-20 00:04 UTC

[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives

I agree. Fixed in r237748.
> On May 18, 2015, at 9:39 AM, Kristof Beyls <kristof.beyls at arm.com> wrote:
>
> BTW, graphs like the one in attachment make me think that the LNT webUI
> should be showing sample points by default instead of line graphs showing
> the minimum execution time per build number.
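
The difference between the two views can be illustrated with a small sketch. It is not the LNT webUI code and not the change made in r237748; the names `min_per_build` and `sample_points` and the data layout (build number mapped to a list of sample times in seconds) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def min_per_build(samples: Dict[int, List[float]]) -> List[Tuple[int, float]]:
    """One point per build: the minimum execution time (the line-graph view)."""
    return [(build, min(times)) for build, times in sorted(samples.items()) if times]


def sample_points(samples: Dict[int, List[float]]) -> List[Tuple[int, float]]:
    """One point per sample: every measurement of every build (the scatter view)."""
    return [(build, t) for build, times in sorted(samples.items()) for t in times]
```

With hypothetical bimodal samples such as `{1: [1.0, 1.5, 1.0], 2: [1.5, 1.0, 1.5]}`, `min_per_build` yields a flat line at 1.0 for both builds, while `sample_points` keeps all six measurements, so the two modes stay visible.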