thr3ads.net - llvm dev - [LLVMdev] [LNT] Question about results reliability in LNT infrustructure [Jun 2013]

Renato Golin

2013-Jun-27 16:27 UTC

[LLVMdev] [LNT] Question about results reliability in LNT infrustructure

On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote:
> We are looking for a good way/value to show the reliability of individual
> results in the UI. Do you have some experience, what a good measure of the
> reliability of test results is?
>
Hi Tobi,

I had a look at this a while ago, but never got around to actually working on
it. My idea was to never use point changes as an indication of
progress/regressions, unless there was a significant change (2 or 3 sigma).
What we should do is compare the current moving average (of K runs) with
past moving averages: both the previous one and the (N-K)th one (to make
sure that earlier values included in the current moving average are not
toning the change down or up), and keep the biggest difference as the
final result.
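
A minimal sketch of the comparison described above, in Python. The function names are hypothetical, and two details are assumptions rather than part of the proposal: "sigma" is taken to be the standard deviation of the samples that precede the current window, and the (N-K)th moving average is read as the window that ends K runs before the latest one, so it shares no samples with the current window.

```python
from statistics import mean, stdev

def moving_average(samples, end, k):
    """Mean of the k samples ending at index `end` (inclusive)."""
    return mean(samples[end - k + 1:end + 1])

def biggest_change(samples, k):
    """Compare the current k-run moving average with the previous one
    (shifted by one run) and with the one ending k runs earlier, and
    return whichever difference is larger in magnitude."""
    if len(samples) < 2 * k:
        raise ValueError("need at least 2*k samples")
    n = len(samples) - 1
    current = moving_average(samples, n, k)
    last = moving_average(samples, n - 1, k)
    older = moving_average(samples, n - k, k)
    return max(current - last, current - older, key=abs)

def is_significant(samples, k, sigmas=2.0):
    """Assumption: noise is estimated from the samples before the
    current window; a change counts if it exceeds `sigmas` times that."""
    baseline = samples[:-k]
    if len(baseline) < 2:
        raise ValueError("need at least two samples before the current window")
    return abs(biggest_change(samples, k)) > sigmas * stdev(baseline)
```

Taking the larger of the two differences is what keeps a step change from being hidden: the previous moving average already contains most of the new values, while the non-overlapping one does not.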

We should also compare the current mov-avg with the M non-overlapping mov-avgs
before it, and work out whether we're monotonically increasing, decreasing, or
whether there is a difference of 2 or 3 sigma between the current mov-avg (N)
and the (N-M)th mov-avg. That would give us an idea of the trend of each test.
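
The trend check could be sketched along these lines. The names and the returned labels are hypothetical, and the noise estimate (the standard deviation of the oldest window, which needs K of at least 2) is an assumption; the message does not say which sigma to use.

```python
from statistics import mean, stdev

def window_means(samples, k, m):
    """Means of the last m+1 non-overlapping windows of k samples, oldest first."""
    need = k * (m + 1)
    if len(samples) < need:
        raise ValueError("need at least k*(m+1) samples")
    recent = samples[-need:]
    return [mean(recent[i:i + k]) for i in range(0, need, k)]

def classify_trend(samples, k, m, sigmas=2.0):
    """Label the trend across the current window and the m windows before it."""
    means = window_means(samples, k, m)
    steps = [b - a for a, b in zip(means, means[1:])]
    if all(s > 0 for s in steps):
        return "increasing"
    if all(s < 0 for s in steps):
        return "decreasing"
    # Assumption: noise is the spread of the oldest window only, so a
    # later shift does not inflate the estimate it is compared against.
    oldest = samples[-k * (m + 1):][:k]
    if abs(means[-1] - means[0]) > sigmas * stdev(oldest):
        return "shifted"
    return "stable"
```

With a small M, a strictly monotonic run of window means can occur by chance on noisy data, so in practice the monotonic case would probably need its own minimum step size.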

cheers,
--renato

Bob Wilson

2013-Jun-27 16:41 UTC

Chris Matthews has recently been working on implementing something similar to
that.  Chris, can you share some details?

Chris Matthews

2013-Jun-27 18:14 UTC

There are a few things we have looked at with LNT runs, so I will share the
insights we have had so far. A lot of the problems we have are artificially
created by our test protocols rather than by the compiler changes themselves.  I
have been doing a lot of large-sample runs of single benchmarks to characterize
them better.  Some key points:

1) Some benchmarks are bi-modal or multi-modal; a single mean won’t describe
these well
2) Some runs are pretty noisy and sometimes have very large single-sample spikes
3) Most benchmarks don’t regress most of the time
4) Compile time is a pretty stable metric; execution time is not always

and depending on what you are using LNT for:

5) A regression is not really something to worry about unless it lasts for a
while (some number of revisions or days or samples)
6) We also need to catch long slow regressions
7) Some of the “benchmarks” are really just correctness tests, and were not
designed with repeatable measurement in mind.

As it stands now, we really can’t detect small or slow regressions, and there
are a lot of false positives.

There are two things I am working on right now to help make regression detection
more reliable: adaptive sampling and cluster based regression flagging.

First, we need more samples per revision. But we really don’t have time to do
--multisample=10, since that takes far too long.  The patch I am working on now,
which I will submit soon, implements client-side adaptive sampling based on
server history.  Simply put, it reruns benchmarks which are reported as regressed
or improved.  The idea is that if a benchmark is going to be flagged as a
regression or improvement, we get more data on that specific benchmark to make
sure that is the case.  Adaptive sampling should reduce the false-positive
regression flagging rate we see.  We are able to do this thanks to LNT’s
provisional commit system.  After a run, we submit all the results, but don’t
commit them. The server reports the regressions, then we rerun the regressing
benchmarks more times.  This gives us more data in the places where we need it
most.  This has made a big difference on my local test machine.
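
The control flow described above can be sketched as follows. This is not the patch itself and it uses no LNT API: `run_once` and `report_changed` are hypothetical callables standing in for "run one benchmark" and "submit provisionally and read back what the server reports as changed", and `extra_runs` is an assumed parameter.

```python
def adaptive_sample(benchmarks, run_once, report_changed, extra_runs=3):
    """Run every benchmark once, then rerun only the ones reported as changed.

    run_once(name) -> float        hypothetical: one timing for a benchmark
    report_changed(samples) -> iterable of names
                                   hypothetical: stand-in for the server's
                                   report on a provisional submission
    """
    # First pass: a single sample for every benchmark.
    samples = {name: [run_once(name)] for name in benchmarks}
    # Ask which benchmarks look regressed or improved against history.
    changed = report_changed(samples)
    # Second pass: spend the extra run time only where it is needed.
    for name in changed:
        for _ in range(extra_runs):
            samples[name].append(run_once(name))
    return samples
```

The cost scales with the number of flagged benchmarks rather than with the size of the suite, which is the point of sampling adaptively instead of multisampling everything.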

As far as regression flagging goes, I have been working on a k-means
discovery/clustering based approach to first come up with a set of means in the
dataset, then characterize newer data based on that.  My hope is this can
characterize multi-modal results, be resilient to short spikes and detect long
term motion in the dataset.  I have this prototyped in LNT, but I am still
trying to work out the best criteria to flag regression with.
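
As an illustration only, the mean-discovery step could look like the plain one-dimensional k-means (Lloyd's algorithm) below. This is not the LNT prototype: the function names, the fixed `k`, and the deterministic initialisation are all assumptions, and the sketch deliberately stops short of a flagging criterion, since that is the open question.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on scalars; returns the cluster centers, sorted."""
    ordered = sorted(values)
    # Assumption: deterministic start, centers spread across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

def distance_to_nearest_mode(value, centers):
    """How far a new sample lies from the closest known mode."""
    return min(abs(value - c) for c in centers)
```

For a bi-modal benchmark, a new sample near either center is close to a known mode, whereas comparing it against a single overall mean would make both modes look like outliers.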

Probably obvious anyways but: since the LNT data is only as good as the setup it
is run on, the other thing that has helped us is coming up with a set of best
practices for running the benchmarks on a machine.  A machine which is “stable”
produces much better results, but achieving this is more complex than not playing
Starcraft while LNT is running.  You have to make sure power management is not
mucking with clock rates, and that none of the magic
backup/indexing/updating/networking/screensaver stuff on your machine is
running.  In practice, I have seen a process using 50% of the CPU on 1 core of 8
move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8 core
machine trigger hundreds of regressions in LNT.

Chris Matthews
chris.matthews@.com
(408) 783-6335
