similar to: [LLVMdev] [LNT] Question about results reliability in LNT infrustructure

Displaying 20 results from an estimated 10000 matches similar to: "[LLVMdev] [LNT] Question about results reliability in LNT infrustructure"

2013 Jul 01
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 06/23/2013 11:12 PM, Star Tan wrote: > Hi all, > > > When we compare two tests, each of which is run with three samples, how would LNT show whether the comparison is reliable or not? > > > I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer data status. For example, if abs(self.delta) <= (self.stddev *
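For readers skimming the thread, the check being described can be sketched roughly as follows (Python; only `delta` and `stddev` come from the snippet itself -- the confidence multiplier and the return labels are illustrative assumptions, not LNT's actual code):

    # Hypothetical sketch of the delta-vs-noise check described above.
    CONFIDENCE_FACTOR = 2.576   # assumed multiplier, roughly a 99% band

    def classify_change(delta, stddev):
        """Flag a change only when it exceeds the observed noise level."""
        if abs(delta) <= stddev * CONFIDENCE_FACTOR:
            return "UNCHANGED"                      # within the noise band
        return "REGRESSED" if delta > 0 else "IMPROVED"

    print(classify_change(delta=0.02, stddev=0.05))  # -> UNCHANGED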
2013 Jun 27
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 06/23/2013 11:12 PM, Star Tan wrote: > Hi all, > > > When we compare two tests, each of which is run with three samples, how would LNT show whether the comparison is reliable or not? > > > I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer data status. For example, if abs(self.delta) <= (self.stddev *
2013 Jun 28
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com> wrote: > Given this tradeoff I think we want to tend towards false positives (over > false negatives) strictly as a matter of compiler quality. > False hits are not binary, but (at least) two-dimensional. You can't say it's better to have any amount of false positives than any amount of false negatives
2013 Jun 28
2
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
I should describe the cost of false negatives and false positives, since I think it matters for how this problem is approached. False negatives mean we miss a real regression --- we don’t want that. False positives mean somebody has to spend some time looking at and reproducing the regression when there is not one --- bad too. Given this tradeoff I think we want to tend towards false positives
2013 Jun 27
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
Just forwarding this to the list, my original reply was bounced. On Jun 27, 2013, at 11:14 AM, Chris Matthews <chris.matthews at apple.com> wrote: > There are a few things we have looked at with LNT runs, so I will share the insights we have had so far. A lot of the problems we have are artificially created by our test protocols instead of the compiler changes themselves. I have been
2013 Jun 27
2
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote: > We are looking for a good way/value to show the reliability of individual > results in the UI. Do you have some experience, what a good measure of the > reliability of test results is? > Hi Tobi, I had a look at this a while ago, but never got around to actually working on it. My idea was to never use
2013 Jun 27
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On Jun 27, 2013, at 9:27 AM, Renato Golin <renato.golin at linaro.org> wrote: > On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote: > We are looking for a good way/value to show the reliability of individual results in the UI. Do you have some experience, what a good measure of the reliability of test results is? > > Hi Tobi, > > I had a look at
2013 Jun 27
7
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
There are a few things we have looked at with LNT runs, so I will share the insights we have had so far. A lot of the problems we have are artificially created by our test protocols instead of the compiler changes themselves. I have been doing a lot of large sample runs of single benchmarks to characterize them better. Some key points: 1) Some benchmarks are bi-modal or multi-modal, single
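To make the bi-modal point concrete, here is a toy simulation (not LNT code; all timings are invented) showing how a benchmark with two timing modes turns single-sample comparisons into phantom regressions, while a min-of-N summary stays stable:

    import random

    random.seed(0)  # deterministic toy data

    # A benchmark whose runtime is bi-modal: ~1.00s normally, ~1.15s when it
    # lands in the slow mode, with a little gaussian jitter on top.
    def sample_runtime():
        mode = 1.15 if random.random() < 0.3 else 1.00
        return random.gauss(mode, 0.01)

    # One sample against one sample can straddle the two modes and look like
    # a ~15% regression even though nothing changed.
    single_before, single_after = sample_runtime(), sample_runtime()
    print("single-sample delta: %+.3f s" % (single_after - single_before))

    # The min over several samples per run is far less sensitive to which
    # mode each individual sample happened to hit.
    before = [sample_runtime() for _ in range(10)]
    after = [sample_runtime() for _ in range(10)]
    print("min-of-10 delta:     %+.3f s" % (min(after) - min(before)))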
2013 Jun 30
3
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 06/28/2013 01:19 PM, Renato Golin wrote: > On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com> > wrote: > >> Given this tradeoff I think we want to tend towards false positives >> (over false negatives) strictly as a matter of compiler quality. >> > > False hits are not binary, but (at least) two-dimensional. You can't > say it's
2013 Jun 27
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
Hi Chris, Amazing that someone is finally looking at that with a proper background. You're much better equipped than I am to deal with that, so I'll trust you on your judgements, as I haven't paid much attention to benchmarks, more to correctness. Some comments inline. On 27 June 2013 19:14, Chris Matthews <chris.matthews at apple.com> wrote: > 1) Some benchmarks are bi-modal
2013 Jul 01
2
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 1 July 2013 02:02, Chris Matthews <chris.matthews at apple.com> wrote: > One thing that LNT is doing to help “smooth” the results for you is by > presenting the min of the data at a particular revision, which (hopefully) > is approximating the actual runtime without noise. > That's an interesting idea, as you said, if you run multiple times on every revision. On ARM,
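The per-revision min that Chris describes amounts to something like the following minimal sketch (the revision names, timings, and data layout are invented for illustration):

    # Sketch of "report the min of the samples at each revision".
    samples = {
        "r184001": [1.43, 1.38, 1.41],
        "r184002": [1.39, 1.55, 1.40],   # one noisy outlier; min hides it
        "r184003": [1.58, 1.61, 1.57],   # a real slowdown; min still shows it
    }

    smoothed = {rev: min(times) for rev, times in samples.items()}
    for rev in sorted(smoothed):
        print(rev, "%.2f s" % smoothed[rev])

As Renato notes, this only helps if you actually have several samples at every revision.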
2013 Jul 01
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
This is probably another area where a bit of dynamic behavior could help. When we find a regressions, kick off some runs to bisect back to where it manifests. This is what we would be doing manually anyway. We could just search back with the set of regressing benchmarks, meaning the whole suite does not have to be run (unless it is a global regression). There are situations where we see commit
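The "search back with the set of regressing benchmarks" idea is essentially a bisection over revisions. A minimal sketch, assuming a hypothetical is_regressed(rev) helper that stands in for "build this revision and re-run only the affected benchmarks":

    # Sketch of bisecting a performance regression; `is_regressed` is a
    # hypothetical stand-in, not an existing LNT API.
    def bisect_regression(revisions, is_regressed):
        """Return the first revision that shows the slowdown, assuming the
        first revision in the list is good and the last one is bad."""
        lo, hi = 0, len(revisions) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if is_regressed(revisions[mid]):
                hi = mid           # regression is at mid or earlier
            else:
                lo = mid + 1       # regression is after mid
        return revisions[lo]

    # Toy usage: pretend revisions r5 and later carry the slowdown.
    revs = ["r%d" % i for i in range(10)]
    print(bisect_regression(revs, lambda rev: int(rev[1:]) >= 5))  # -> r5

For a 10-commit window this costs about ceil(log2(10)) = 4 extra runs, which matches the "3-4 additional test runs" estimate that comes up later in the thread.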
2013 Jun 28
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 28 June 2013 10:28, David Tweed <david.tweed at arm.com> wrote: > (Incidentally, responding to the earlier email below, I think you don't > really want to compare moving averages but use some statistical test to > quantify if the separation between the set of points within the "earlier > window" are statistically significantly higher than the "later
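One concrete way to phrase David's suggestion, sketched with SciPy (the window contents and the 0.01 threshold are made up; Mann-Whitney is chosen here only because it does not assume normally distributed run times):

    from scipy import stats

    # Test whether the "later window" is significantly slower than the
    # "earlier window"; all numbers are illustrative.
    earlier = [1.41, 1.39, 1.42, 1.40, 1.38, 1.41, 1.40, 1.39]
    later   = [1.47, 1.45, 1.49, 1.46, 1.48, 1.47, 1.45, 1.50]

    stat, p = stats.mannwhitneyu(earlier, later, alternative="less")
    if p < 0.01:
        print("later window is significantly slower (p=%.4f)" % p)
    else:
        print("no significant separation (p=%.4f)" % p)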
2013 Jul 01
1
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On Jun 30, 2013, at 6:02 PM, Chris Matthews <chris.matthews at apple.com> wrote: > This is probably another area where a bit of dynamic behavior could help. When we find a regressions, kick off some runs to bisect back to where it manifests. This is what we would be doing manually anyway. We could just search back with the set of regressing benchmarks, meaning the whole suite does not
2013 Jun 28
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 28 June 2013 14:06, David Tweed <david.tweed at arm.com> wrote: > That's a viewpoint; another one is that statisticians might well have very > good reasons why they spend so long coming up with statistical tests in > order to create the most powerful tests so they can deal with marginal > quantities of data. > 87.35% of all statistics are made up, 55.12% of them could
2013 Jun 30
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
Hi Tobias, > I trust your knowledge about statistics, but am wondering why ministat (and > its t-test) is promoted as a statistically sane tool for benchmarking > results. I do not know... Ask the author of ministat? > Is the use of the t-test for benchmark results a bad idea in > general? Not in general. But one should be aware of the assumptions of the underlying theory.
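For reference, the kind of two-sample comparison ministat performs can be approximated in a couple of lines of SciPy; this sketch uses Welch's t-test rather than ministat's pooled Student's t, and the sample values are invented:

    from scipy import stats

    # Two-sample comparison of baseline vs. candidate run times; the data
    # and the 95% confidence level are illustrative.
    baseline  = [1.02, 1.01, 1.03, 1.02, 1.04]
    candidate = [1.06, 1.05, 1.07, 1.06, 1.08]

    t, p = stats.ttest_ind(baseline, candidate, equal_var=False)
    print("difference is %ssignificant at 95%% (p=%.4f)"
          % ("" if p < 0.05 else "not ", p))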
2013 Jun 30
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
> Getting 10 samples at different commits will give you similar accuracy if > behaviour doesn't change, and you can rely on 10-point blocks before and > after each change to have the same result. Right. But this way you will have a 10-commit delay. So, you will need 3-4 additional test runs to pinpoint the offending commit in the worst case. > This is why I proposed something like
2013 Jun 30
6
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 30 June 2013 10:14, Anton Korobeynikov <anton at korobeynikov.info> wrote: > 1. Increasing sample size to at least 5-10 > That's not feasible on slower systems. A single data point takes 1 hour on the fastest ARM board I can get (Chromebook). Getting 10 samples at different commits will give you similar accuracy if behaviour doesn't change, and you can rely on 10-point
2013 Jul 02
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 07/01/2013 09:41 AM, Renato Golin wrote: > On 1 July 2013 02:02, Chris Matthews <chris.matthews at apple.com> wrote: > >> One thing that LNT is doing to help “smooth” the results for you is by >> presenting the min of the data at a particular revision, which (hopefully) >> is approximating the actual runtime without noise. >> > > That's an
2013 Jun 30
0
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
Hi Tobi, First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :) > The statistical test ministat is performing seems simple and pretty > standard. Is there any reason we could not do something similar? Or are we > doing it already and it just does not work as expected? The main problem with such tests is that we cannot trust them, unless: 1. The data has the
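The message is cut off here, so the list itself is not recoverable, but one assumption commonly raised for the t-test is that the samples come from (roughly) normal distributions. A quick illustrative check with made-up data (not a reconstruction of Anton's actual list):

    from scipy import stats

    # Shapiro-Wilk normality check on a bi-modal-looking set of run times;
    # the samples and the 0.05 threshold are illustrative.
    samples = [1.02, 1.01, 1.03, 1.02, 1.18, 1.01, 1.02, 1.19]
    w, p = stats.shapiro(samples)
    if p < 0.05:
        print("normality rejected (p=%.3f); a plain t-test may mislead" % p)
    else:
        print("no evidence against normality (p=%.3f)" % p)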