Tobias Grosser
2013-Jun-30 02:10 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/28/2013 01:19 PM, Renato Golin wrote:
> On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com> wrote:
>
>> Given this tradeoff I think we want to tend towards false positives
>> (over false negatives) strictly as a matter of compiler quality.
>
> False hits are not binary, but (at least) two-dimensional. You can't
> say it's better to have any amount of false positives than any amount
> of false negatives (pretty much like the NSA spying on *everybody* to
> avoid *any* false negative). Nor can you say that N false positives
> are the same as N false negatives, because a false hit can be huge in
> itself, or not.
>
> What we have today is a huge amount of false positives and very few
> (or no) false negatives. But even the real positives that we could
> spot despite this amount of noise, we don't, because people don't
> normally look at regressions. If I had to skim through the
> regressions on every build, I'd do nothing else.
>
> Given the proportion, I'd rather have a few small false negatives and
> considerably reduce the number of false positives with a hammer
> approach, and only later try to nail down the options and do some
> fine tuning, than do the fine tuning now while still nobody cares
> about any result because they're not trustworthy.
>
>> That said, I'd never object to a professional's opinion on this
>> problem!
>
> Absolutely! And David can help you a lot there. But I wouldn't try to
> get it perfect before we get it acceptable.

Wow. Thanks a lot for the insights into what LNT is currently doing and what people are planning for the future. It seems there is a lot of interesting stuff on the way.

I agree with Renato that the major problem right now is not that we miss regressions because we fail to detect them, but that we miss them because nobody looks at the results due to the large amount of noise.

To make this more concrete, I want to point you to the experiments that Star Tan has run. He hosted his LNT results here [1]. One of the top changes in the reports is a 150% compile-time increase for SingleSource/UnitTests/2003-07-10-SignConversions.c. Looking at the data of the original run, we get:

~$ cat /tmp/data-before
0.0120
0.0080
0.0200
~$ cat /tmp/data-after
0.0200
0.0240
0.0200

It seems there is a lot of noise involved. Still, LNT reports this result without recognizing that the results for this benchmark are unreliable. In contrast, the ministat [2] tool is perfectly capable of recognizing that those samples are insufficient to prove any statistical difference at 90% confidence:
======================================================================
$ ./src/ministat -c 90 /tmp/data-before /tmp/data-after
x /tmp/data-before
+ /tmp/data-after
[ASCII box plot omitted]
    N           Min           Max        Median           Avg        Stddev
x   3         0.008          0.02         0.012   0.013333333  0.0061101009
+   3          0.02         0.024          0.02   0.021333333  0.0023094011
No difference proven at 90.0% confidence
======================================================================

Running ministat on the results reported for MultiSource/Benchmarks/7zip/7zip-benchmark, we can prove a difference even at 99.5% confidence:

======================================================================
$ ./src/ministat -c 99.5 /tmp/data2-before /tmp/data2-after
x /tmp/data2-before
+ /tmp/data2-after
[ASCII box plot omitted]
    N           Min           Max        Median           Avg        Stddev
x   3        45.084        45.344        45.336     45.254667    0.14785579
+   3        48.152         48.36        48.152     48.221333    0.12008886
Difference at 99.5% confidence
        2.96667 +/- 0.788842
        6.55549% +/- 1.74312%
        (Student's t, pooled s = 0.13469)
======================================================================

The statistical test ministat is performing seems simple and pretty standard. Is there any reason we could not do something similar? Or are we doing it already and it just does not work as expected?

Filtering and sorting the results by confidence seems very interesting to me. In fact, I would rather look first at the performance changes reported with 99.5% confidence than at the ones that could not even be proven at 90% confidence.

Cheers,
Tobias

[1] http://188.40.87.11:8000/db_default/v4/nts/3
[2] https://github.com/codahale/ministat

-------------- next part --------------
0.0120
0.0080
0.0200
-------------- next part --------------
0.0200
0.0240
0.0200
-------------- next part --------------
45.0840
45.3440
45.3360
-------------- next part --------------
48.1520
48.3600
48.1520
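For anyone who wants to reproduce the comparison ministat makes without building it, the following Python sketch runs the same kind of two-sample Student's t-test on the values quoted above. It relies on scipy.stats rather than anything in LNT or ministat itself, so treat it purely as an illustration of the test, not as the tooling discussed in this thread.

======================================================================
# Illustrative sketch only: two-sample Student's t-test (pooled variance),
# the same kind of test ministat performs, applied to the thread's data.
from scipy import stats

# Compile times (seconds) for 2003-07-10-SignConversions.c
before = [0.0120, 0.0080, 0.0200]
after = [0.0200, 0.0240, 0.0200]

# Execution times (seconds) for 7zip-benchmark
before_7zip = [45.0840, 45.3440, 45.3360]
after_7zip = [48.1520, 48.3600, 48.1520]

def report(name, a, b, alpha):
    # Two-sided t-test with equal-variance (pooled) assumption.
    t, p = stats.ttest_ind(a, b, equal_var=True)
    verdict = "difference" if p < alpha else "no difference proven"
    print(f"{name}: t={t:.3f}, p={p:.4f} -> {verdict} "
          f"at {(1 - alpha) * 100:.1f}% confidence")

report("SignConversions compile time", before, after, alpha=0.10)
report("7zip-benchmark execution time", before_7zip, after_7zip, alpha=0.005)
======================================================================

As with ministat's own output above, the SignConversions samples should come out as not provable at 90% confidence, while the 7zip samples show a clear difference.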
Anton Korobeynikov
2013-Jun-30 09:14 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Hi Tobi,

First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :)

> The statistical test ministat is performing seems simple and pretty
> standard. Is there any reason we could not do something similar? Or are we
> doing it already and it just does not work as expected?

The main problem with this sort of test is that we cannot trust it unless:

1. The data has a normal distribution
2. The sample size is large (say, > 50)

Here we have only 3 points, and no, I won't trust ministat's t-test and its normal-approximation based confidence bounds. They are *too short* (i.e. the real confidence level is not 99.5%, but actually 40-50%, for example).

I'd ask for:

1. Increasing the sample size to at least 5-10
2. Using the Wilcoxon/Mann-Whitney test

What do you think?

--
With best regards, Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University
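To illustrate Anton's point about sample size, here is a small Python sketch (using scipy.stats; illustrative only, not LNT code) that runs the Mann-Whitney test he suggests on the 7zip samples from the thread. With only three runs per side there are just 20 possible orderings of the pooled data, so even perfectly separated samples cannot reach a two-sided exact p-value below 0.1 — which is exactly why more samples per run are needed.

======================================================================
# Illustrative sketch only: Mann-Whitney U test on the thread's 7zip samples.
from itertools import combinations
from scipy import stats

before = [45.0840, 45.3440, 45.3360]   # 7zip-benchmark, before
after = [48.1520, 48.3600, 48.1520]    # 7zip-benchmark, after

u, p = stats.mannwhitneyu(before, after, alternative="two-sided")
print(f"U={u}, p={p:.3f}")

# With 3 vs 3 observations there are only C(6, 3) = 20 ways to interleave
# the two samples, so even perfect separation cannot yield a two-sided
# exact p-value below 2/20 = 0.1 -- hence the request for 5-10 samples.
print("possible arrangements:", len(list(combinations(range(6), 3))))
======================================================================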
Tobias Grosser
2013-Jun-30 16:19 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/30/2013 02:14 AM, Anton Korobeynikov wrote:
> Hi Tobi,
>
> First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :)
>
>> The statistical test ministat is performing seems simple and pretty
>> standard. Is there any reason we could not do something similar? Or are we
>> doing it already and it just does not work as expected?
>
> The main problem with this sort of test is that we cannot trust it unless:
> 1. The data has a normal distribution
> 2. The sample size is large (say, > 50)
>
> Here we have only 3 points, and no, I won't trust ministat's t-test
> and its normal-approximation based confidence bounds. They are *too
> short* (i.e. the real confidence level is not 99.5%, but actually
> 40-50%, for example).

Hi Anton,

I trust your knowledge of statistics, but I am wondering why ministat (and its t-test) is promoted as a statistically sound tool for benchmarking results. Is the use of the t-test for benchmark results a bad idea in general? Would ministat be a better tool if it implemented the Wilcoxon/Mann-Whitney test?

> I'd ask for:
>
> 1. Increasing the sample size to at least 5-10
> 2. Using the Wilcoxon/Mann-Whitney test

Reading about Wilcoxon/Mann-Whitney, it seems to be a more robust test that frees us from the normality assumption. As its implementation also does not look overly complicated, it may be a good choice.

Regarding the number of samples: I think the most important point is that we get some measure of confidence by which we can sort our results and make it visible in the UI. For different use cases we can then adapt the number of samples based on the required confidence and the amount of noise/lost regressions we can accept. This may also be a good use for the adaptive sampling that Chris suggested.

Is there anything stopping us from implementing such a test and exposing its results in the UI?

Cheers,
Tobi
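As a rough picture of what "sort the results by confidence" could look like, here is a hypothetical Python sketch: it attaches a Mann-Whitney p-value to each benchmark and orders the report by it. The data structures and names are invented for illustration and do not correspond to LNT's actual API.

======================================================================
# Hypothetical sketch (not LNT's API): attach a p-value to each benchmark
# and sort the report so the most trustworthy changes come first.
from scipy import stats

# Invented per-benchmark samples: name -> (before runs, after runs)
results = {
    "SignConversions compile time": ([0.0120, 0.0080, 0.0200],
                                     [0.0200, 0.0240, 0.0200]),
    "7zip-benchmark exec time": ([45.084, 45.344, 45.336],
                                 [48.152, 48.360, 48.152]),
}

rows = []
for name, (before, after) in results.items():
    _, p = stats.mannwhitneyu(before, after, alternative="two-sided")
    delta = (sum(after) / len(after)) / (sum(before) / len(before)) - 1.0
    rows.append((p, name, delta))

# Smallest p-value (highest confidence) first; a UI could also hide rows
# whose p-value exceeds a chosen threshold.
for p, name, delta in sorted(rows):
    print(f"p={p:.3f}  {delta:+.1%}  {name}")
======================================================================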
Renato Golin
2013-Jun-30 18:30 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 30 June 2013 10:14, Anton Korobeynikov <anton at korobeynikov.info> wrote:
> 1. Increasing the sample size to at least 5-10

That's not feasible on slower systems. A single data point takes one hour on the fastest ARM board I can get (a Chromebook).

Getting 10 samples at different commits will give you similar accuracy if behaviour doesn't change, and you can rely on 10-point blocks before and after each change having the same result. What won't happen is that one commit makes it truly faster and the very next one slow again (or slow, then fast). So all we need to determine is, for each commit, whether that was the one that made all subsequent runs slower/faster, and we can get that from several commits after the culprit, since the probability that another (unrelated) commit changes the behaviour is small.

This is why I proposed something like moving averages. Not because it's the best statistical model, but because it works around a concrete problem we have.

I don't care which model/tool you use, as long as it doesn't mean I'll have to wait 10 hours for a result, or sift through hundreds of commits every time I see a regression in performance. What that will do, for sure, is make me ignore small regressions, since they won't be worth the massive work of finding the real culprit.

If I had a team of 10 people just to look at regressions all day long, I'd ask them to build a proper statistical model and go do more interesting things...

cheers,
--renato
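A minimal sketch of the moving-average idea Renato describes, assuming one measurement per commit: compare the mean of a small window of runs before and after each commit and flag commits where the level shifts. The function, window size, and threshold are hypothetical, not LNT's or Renato's actual implementation.

======================================================================
# Hypothetical sketch of moving-average regression detection.
def flag_regressions(times, window=5, threshold=0.05):
    """times: one measured run per commit, in commit order.
    Flags commit indices where the mean of the next `window` runs differs
    from the mean of the previous `window` runs by more than `threshold`.
    The flagged commit with the largest shift is the most likely culprit."""
    flagged = []
    for i in range(window, len(times) - window + 1):
        before = sum(times[i - window:i]) / window
        after = sum(times[i:i + window]) / window
        change = (after - before) / before
        if abs(change) > threshold:
            flagged.append((i, change))
    return flagged

# Invented series: noisy around 10s, then a real ~8% slowdown at commit 12.
series = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1,
          10.0, 9.9, 10.8, 10.9, 10.7, 10.8, 10.9, 10.8, 10.7, 10.9]

# Commits around index 12 get flagged; the largest shift points at the culprit.
print(flag_regressions(series))
======================================================================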