Chris Matthews
2013-Jun-27 18:14 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
There are a few things we have looked at with LNT runs, so I will share the insights we have had so far. A lot of the problems we see are artificially created by our test protocols rather than by the compiler changes themselves. I have been doing a lot of large-sample runs of single benchmarks to characterize them better. Some key points:

1) Some benchmarks are bi-modal or multi-modal; single means won't describe these well.
2) Some runs are pretty noisy and sometimes have very large single-sample spikes.
3) Most benchmarks don't regress most of the time.
4) Compile time is a pretty stable metric; execution time is not always.

and depending on what you are using LNT for:

5) A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples).
6) We also need to catch long, slow regressions.
7) Some of the "benchmarks" are really just correctness tests, and were not designed with repeatable measurement in mind.

As it stands now, we really can't detect small regressions or slow regressions, and there are a lot of false positives.

There are two things I am working on right now to help make regression detection more reliable: adaptive sampling and cluster-based regression flagging.

First, we need more samples per revision. But we really don't have time to do --multisample=10 since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply put, it reruns benchmarks which are reported as regressed or improved. The idea here being: if it's going to be flagged as a regression or improvement, get more data on those specific benchmarks to make sure that is the case. Adaptive sampling should reduce the false-positive regression flagging rate we see. We are able to do this based on LNT's provisional commit system. After a run, we submit all the results, but don't commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.

As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is this can characterize multi-modal results, be resilient to short spikes and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.

Probably obvious anyways, but: since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is "stable" produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.
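[Editorial note: to make the adaptive-sampling control flow concrete, here is a minimal sketch in Python. The helpers run_once and is_flagged are caller-supplied stand-ins for taking one timing sample and for the server-side history check after a provisional submit; they are not LNT's actual API, and the thresholds are placeholders.]

    import random
    import statistics

    def adaptive_run(benchmarks, run_once, is_flagged, extra_samples=5):
        """Take one sample of every benchmark, then rerun only the ones that
        look like regressions or improvements.  run_once(name) returns a
        single timing sample; is_flagged(name, samples) stands in for the
        server-side check against history after a provisional submit.  Both
        are caller-supplied because this is a sketch of the control flow,
        not LNT's actual API."""
        results = {name: [run_once(name)] for name in benchmarks}
        suspects = [n for n, s in results.items() if is_flagged(n, s)]
        for name in suspects:
            results[name].extend(run_once(name) for _ in range(extra_samples))
        return results

    # Toy usage: bench_b is genuinely ~25% slower than the made-up 1.0s
    # baseline, so it gets flagged and resampled; bench_a almost never does.
    run_once = lambda name: random.gauss(1.25 if name == "bench_b" else 1.0, 0.05)
    is_flagged = lambda name, s: abs(statistics.mean(s) - 1.0) > 3 * 0.05
    print(adaptive_run(["bench_a", "bench_b"], run_once, is_flagged))

The point of the scheme is simply that extra samples are spent only on the benchmarks the server already suspects, rather than multisampling everything.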
Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 27, 2013, at 9:41 AM, Bob Wilson <bob.wilson at apple.com> wrote:

> On Jun 27, 2013, at 9:27 AM, Renato Golin <renato.golin at linaro.org> wrote:
>
>> On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote:
>> We are looking for a good way/value to show the reliability of individual results in the UI. Do you have some experience, what a good measure of the reliability of test results is?
>>
>> Hi Tobi,
>>
>> I had a look at this a while ago, but never got around to actually work on it. My idea was to never use point-changes as indication of progress/regressions, unless there was a significant change (2/3 sigma). What we should do is to compare the current moving-average with the past moving averages (of K runs) with both last-avg and the (N-K)th moving-average (to make sure previous values included in the current moving average are not toning it down/up), and keep the biggest difference as the final result.
>>
>> We should also compare the current mov-avg with M non-overlapping mov-avgs before, and calculate if we're monotonically increasing, decreasing or if there is a difference of 2/3 sigma between the current mov-avg (N) and the (N-M)th mov-avg. That would give us an idea on the trends of each test.
>
> Chris Matthews has recently been working on implementing something similar to that. Chris, can you share some details?
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
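[Editorial note: the moving-average comparison described in the quoted message above might look roughly like the following sketch. The window size and the 2-sigma threshold are placeholders, not values taken from LNT or from the thread.]

    import statistics

    def moving_average_change(samples, k=5, sigmas=2.0):
        """Compare the moving average of the newest k samples against the
        moving average of the k samples just before them, and report a
        change only when it exceeds `sigmas` standard deviations of the
        older window.  Window size and threshold are arbitrary here."""
        if len(samples) < 2 * k:
            return None  # not enough history yet
        older, newer = samples[-2 * k:-k], samples[-k:]
        delta = statistics.mean(newer) - statistics.mean(older)
        sigma = statistics.stdev(older) or 1e-9
        return delta if abs(delta) > sigmas * sigma else 0.0

    # A ~10% slowdown in the last five runs shows up as a positive delta:
    print(moving_average_change([1.00, 1.01, 0.99, 1.02, 1.00,
                                 1.10, 1.12, 1.09, 1.11, 1.13], k=5))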
Renato Golin
2013-Jun-27 19:04 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Hi Chris,

Amazing that someone is finally looking at this with a proper background. You're much better equipped than I am to deal with it, so I'll trust your judgement, as I haven't paid much attention to benchmarks, more to correctness. Some comments inline.

On 27 June 2013 19:14, Chris Matthews <chris.matthews at apple.com> wrote:

> 1) Some benchmarks are bi-modal or multi-modal, single means won't describe these well

True. My idea was to have a moving-"measurement", with the basic one being the average, but others applied as well. It's possible that k-means can give you that, but I haven't understood what your vector space and distance measures will be, so I can only guess.

> 2) Some runs are pretty noisy and sometimes have very large single sample spikes
> 3) Most benchmarks don't regress most of the time

Most ARM benchmarks regress all the time, because both the signal and the noise are in milliseconds, where machine and OS interference play a crucial part. But they don't regress with time, and they keep their average AND deviation forever. So, if you can filter the noise on *all* benchmarks, it'd be great for ARM testing.

> 5) A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples)
> 6) We also need to catch long slow regressions

Yup. Moving peak and trend.

> 7) Some of the "benchmarks" are really just correctness tests, and were not designed with repeatable measurement in mind.

Yes. It would be great to move them to Application, and *not* time their execution. Benchmarks are specifically designed to test execution time; applications aren't. If we think an application is so important that we want to measure it, we should actively change it into a benchmark, making sure it actually performs the core functionality in a repeatable way and with enough confidence that noise isn't playing a part in the numbers. Just throwing it in and timing its execution will create a school of red herrings.

> After a run, we submit all the results, but don't commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.

This is a great idea, and I think it could improve things at a much lower cost. It won't replace decent benchmarking strategies on the software level, but it will reduce the noise, hopefully enough to allow other analyses to be successful at an early stage.

> As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is this can characterize multi-modal results, be resilient to short spikes and detect long term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regression with.

I'd like to understand that better (mostly for personal education). But it can be offline, if the rest of the list is not interested...

> You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8 core machine trigger hundreds of regressions in LNT.

I have seen this too.
I think LNT has two modes, test and benchmark (not sure how to switch between them): one tries to use all available cores (which makes benchmarks unstable) and the other runs everything on a single core. I think we could assume that, for tests, we can use as much juice as we have available, and for benchmarks, we could use fewer than the total number of cores (the practical number can vary depending on the arch). It's better to re-run some benchmarks 10 times using 8 CPUs than to use only one...

cheers,
--renato
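[Editorial note: on the k-means question Renato raises above (what the vector space and distance measure would be), one plausible reading, and only a guess rather than the prototype described in this thread, is to cluster each benchmark's raw execution times in one dimension, so the "distance" is simply the difference in seconds. A stdlib-only sketch with made-up thresholds:]

    import random
    import statistics

    def kmeans_1d(samples, k=2, iters=50):
        """Tiny 1-D Lloyd's algorithm: the 'vectors' are just raw execution
        times and the distance is the absolute difference in seconds."""
        centers = random.sample(samples, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for x in samples:
                nearest = min(range(k), key=lambda i: abs(x - centers[i]))
                clusters[nearest].append(x)
            centers = [statistics.mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        return sorted(centers)

    def looks_regressed(history, new_sample, k=2, tolerance=0.05):
        """Flag a new sample only if it is not near any mode already seen in
        the history (the 5% band is a made-up placeholder)."""
        return all(abs(new_sample - c) / c > tolerance
                   for c in kmeans_1d(history, k))

    # A bimodal benchmark: most runs take ~1.0s, some take ~1.3s.  A 1.31s
    # sample matches the upper mode and is not flagged; a 1.6s sample is.
    history = [1.00, 1.01, 0.99, 1.30, 1.29, 1.02, 1.31, 1.00]
    print(looks_regressed(history, 1.31), looks_regressed(history, 1.6))

A new sample that lands near an existing mode is treated as normal even if it is far from the overall mean, which is how a bimodal benchmark avoids being flagged every time it flips between modes.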
Chris Matthews
2013-Jun-27 21:11 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Just forwarding this to the list, my original reply was bounced.

On Jun 27, 2013, at 11:14 AM, Chris Matthews <chris.matthews at apple.com> wrote:

> There are a few things we have looked at with LNT runs, so I will share the insights we have had so far. A lot of the problems we see are artificially created by our test protocols rather than by the compiler changes themselves. I have been doing a lot of large-sample runs of single benchmarks to characterize them better. Some key points:
>
> 1) Some benchmarks are bi-modal or multi-modal; single means won't describe these well.
> 2) Some runs are pretty noisy and sometimes have very large single-sample spikes.
> 3) Most benchmarks don't regress most of the time.
> 4) Compile time is a pretty stable metric; execution time is not always.
>
> and depending on what you are using LNT for:
>
> 5) A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples).
> 6) We also need to catch long, slow regressions.
> 7) Some of the "benchmarks" are really just correctness tests, and were not designed with repeatable measurement in mind.
>
> As it stands now, we really can't detect small regressions or slow regressions, and there are a lot of false positives.
>
> There are two things I am working on right now to help make regression detection more reliable: adaptive sampling and cluster-based regression flagging.
>
> First, we need more samples per revision. But we really don't have time to do --multisample=10 since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply put, it reruns benchmarks which are reported as regressed or improved. The idea here being: if it's going to be flagged as a regression or improvement, get more data on those specific benchmarks to make sure that is the case. Adaptive sampling should reduce the false-positive regression flagging rate we see. We are able to do this based on LNT's provisional commit system. After a run, we submit all the results, but don't commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.
>
> As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is this can characterize multi-modal results, be resilient to short spikes and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.
>
> Probably obvious anyways, but: since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is "stable" produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.
David Tweed
2013-Jun-28 09:28 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
| First, we need more samples per revision. But we really don't have time to do --multisample=10 since that takes far too long. The patch I am working on now and will submit soon implements client-side adaptive sampling based on server history. Simply, it reruns benchmarks which are reported as regressed or improved. The idea here being, if it's going to be flagged as a regression or improvement, get more data on those specific benchmarks to make sure that is the case. Adaptive sampling should reduce the false positive regression flagging rate we see. We are able to do this based on LNT's provisional commit system. After a run, we submit all the results, but don't commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.
|
| As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is this can characterize multi-modal results, be resilient to short spikes and detect long term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regression with.

Basic question: I'm imagining the volume of data being dealt with isn't that large (as statistical datasets go), and you're discarding old values anyway (since we care whether we're regressing with respect to now rather than LLVM 1.1), so can't you just build a kernel density estimator of the "baseline" runtime and then estimate the probability that samples from a given codebase are going to be "slower" than the baseline? I suppose the drawback to not explicitly modelling the modes (with all their complications and tunings) is that you can't attempt to determine when a value is bigger than a lower cluster, even though it's smaller than the bigger cluster, and estimate whether that is evidence of a slowdown within the small-cluster regime. Still, that seems a bit complicated to do automatically.

(Incidentally, responding to the earlier email below: I think you don't really want to compare moving averages, but instead use some statistical test to quantify whether the set of points within the "earlier window" is statistically significantly higher than the "later window"; all moving averages do is smear out useful information, which can be useful if you've just got far too many data points, but otherwise it doesn't really help.)

Cheers,
Dave

| Probably obvious anyways but: since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is "stable" produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8 core machine trigger hundreds of regressions in LNT.
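[Editorial note: a rough illustration of the kernel-density idea David asks about above: build a Gaussian KDE over the baseline samples and ask how much of the baseline's probability mass lies below the new samples. The Silverman-style bandwidth, the 0.95 cut-off, and the toy data are all made-up placeholders, not anything from LNT.]

    import math
    import statistics

    def kde_cdf(baseline, x, bandwidth=None):
        """P(baseline sample <= x) under a Gaussian kernel density estimate
        of the baseline runtimes.  The bandwidth defaults to a crude
        Silverman-style rule; all of this is a sketch, not LNT code."""
        if bandwidth is None:
            sd = statistics.stdev(baseline) or 1e-9
            bandwidth = 1.06 * sd * len(baseline) ** -0.2
        phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        return statistics.mean(phi((x - b) / bandwidth) for b in baseline)

    def probably_slower(baseline, new_samples, threshold=0.95):
        """Flag when the median new sample sits in the upper tail of the
        baseline density, i.e. it is very likely slower than baseline."""
        return kde_cdf(baseline, statistics.median(new_samples)) > threshold

    baseline = [1.00, 1.02, 0.99, 1.01, 1.30, 1.00, 1.31, 0.98]  # bimodal
    print(probably_slower(baseline, [1.29, 1.31, 1.30]))  # False: a known mode
    print(probably_slower(baseline, [1.55, 1.60, 1.58]))  # True: upper tail

Because the baseline density keeps both modes, samples that merely land in the slower mode are not flagged, which is the behaviour the k-means approach is also aiming for.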
Renato Golin
2013-Jun-28 09:43 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 28 June 2013 10:28, David Tweed <david.tweed at arm.com> wrote:

> (Incidentally, responding to the earlier email below, I think you don't really want to compare moving averages but use some statistical test to quantify if the separation between the set of points within the "earlier window" are statistically significantly higher than the "later window"; all moving averages do is smear out useful information which can be useful if you've just got far too many data points, but otherwise it doesn't really help.)

When your data is explicitly grouped, I'd agree with you. But all I can see from my results are hardware and OS flukes in the millisecond range, with no distinct modal signal in them. Chris said he knows of some; I haven't looked deep enough, so I trust his judgement. What I don't want is to be treating noise groups as signal, that's all.

I think we probably need a few different approaches, depending on the benchmark, with moving averages being the simplest, which is why I suggested we implement it first. Sometimes, smoothing the line is all you need... ;)

cheers,
--renato
David Tweed
2013-Jun-28 13:06 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
| I think we probably need a few different approaches, depending on the benchmark, with moving averages being the simplest, and why I suggested we implement it first. Sometimes, smoothing the line is all you need... ;)

That's a viewpoint; another one is that statisticians might well have very good reasons why they spend so long coming up with statistical tests in order to create the most powerful tests so they can deal with marginal quantities of data.

Cheers,
Dave
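[Editorial note: one concrete instance of the kind of test David is pointing at is a one-sided Mann-Whitney U comparison of the earlier and later windows of samples. This sketch assumes scipy is available; the 0.05 threshold and the toy data are arbitrary, and it is an illustration of the approach rather than anything from LNT.]

    from scipy.stats import mannwhitneyu  # assumes scipy is available

    def window_regressed(earlier, later, alpha=0.05):
        """One-sided, nonparametric check that the 'later' window of
        execution times is significantly higher (slower) than the
        'earlier' one.  No normality assumption, so noisy or bimodal
        benchmarks are fine; the alpha cut-off here is arbitrary."""
        stat, p = mannwhitneyu(later, earlier, alternative='greater')
        return p < alpha

    earlier = [1.00, 1.01, 0.99, 1.02, 1.00, 1.01, 0.98, 1.00]
    later   = [1.06, 1.08, 1.05, 1.07, 1.09, 1.06, 1.07, 1.08]
    print(window_regressed(earlier, later))  # True: a clear ~6% slowdown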
Renato Golin
2013-Jun-28 13:28 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 28 June 2013 14:06, David Tweed <david.tweed at arm.com> wrote:

> That's a viewpoint; another one is that statisticians might well have very good reasons why they spend so long coming up with statistical tests in order to create the most powerful tests so they can deal with marginal quantities of data.

87.35% of all statistics are made up, 55.12% of them could have been done a lot simpler and a lot quicker, and only 1.99% (AER) actually make your life better.

I'm glad that Chris already has working solutions, and I'd be happy to see them go live before any professional statistician has had a look at them. ;)

cheers,
--renato
Chris Matthews
2013-Jun-28 18:45 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
I should describe the cost of false negatives and false positives, since I think it matters for how this problem is approached. A false negative means we miss a real regression --- we don't want that. A false positive means somebody has to spend time looking at and reproducing a regression that isn't there --- bad too. Given this tradeoff, I think we want to tend towards false positives (over false negatives) strictly as a matter of compiler quality, but if we can throw more data at the problem to reduce false positives, that is good.

I have discussed the classification problem before with people off-list. The problem we face is that the space is pretty big for manual classification; at worst it is: number of benchmarks * number of architectures * sets of flags * metrics collected. Perhaps some sensible defaults could overcome that. Also, to classify well you probably need a lot of samples as a baseline.

There certainly are lots of statistical tests for small data. As far as I know, though, they rely more heavily on assumptions that in our case would have to be proven. That said, I'd never object to a professional's opinion on this problem!

Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 28, 2013, at 6:28 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 28 June 2013 14:06, David Tweed <david.tweed at arm.com> wrote:
> That's a viewpoint; another one is that statisticians might well have very good reasons why they spend so long coming up with statistical tests in order to create the most powerful tests so they can deal with marginal quantities of data.
>
> 87.35% of all statistics are made up, 55.12% of them could have been done a lot simpler, a lot quicker and only 1.99% (AER) actually make your life better.
>
> I'm glad that Chris already has working solutions, and I'd be happy to see them go live before any professional statistician had a look at it. ;)
>
> cheers,
> --renato
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev