Chris Matthews
2013-Jun-28 18:45 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
I should describe the cost of false negatives and false positives, since I think it matters for how this problem is approached. False negatives mean we miss a real regression --- we don't want that. False positives mean somebody has to spend time looking at and reproducing a regression that isn't there --- bad too. Given this tradeoff, I think we want to lean towards false positives (over false negatives), strictly as a matter of compiler quality; but if we can throw more data at the problem to reduce false positives, that is good.

I have discussed the classification problem before with people off list. The problem we face is that the space for manual classification is pretty big --- at worst: number of benchmarks * number of architectures * sets of flags * metrics collected. Perhaps some sensible defaults could overcome that. Also, to classify well you probably need a lot of samples as a baseline.

There certainly are lots of statistical tests for small data. As far as I know, though, they rely more heavily on assumptions that in our case would have to be proven.

That said, I'd never object to a professional's opinion on this problem!

Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 28, 2013, at 6:28 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 28 June 2013 14:06, David Tweed <david.tweed at arm.com> wrote:
>> That's a viewpoint; another one is that statisticians might well have
>> very good reasons why they spend so long coming up with statistical
>> tests: to create the most powerful tests, so they can deal with
>> marginal quantities of data.
>
> 87.35% of all statistics are made up, 55.12% of them could have been
> done a lot simpler and a lot quicker, and only 1.99% (AER) actually
> make your life better.
>
> I'm glad that Chris already has working solutions, and I'd be happy to
> see them go live before any professional statistician has had a look
> at them. ;)
>
> cheers,
> --renato
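To make the "tests for small data" point concrete, here is a minimal sketch (an illustration only, not LNT code; the timing samples are made up) contrasting the usual Student's t-test with the nonparametric Mann-Whitney U test. The nonparametric test drops the normality assumption Chris alludes to, but pays for it in power: with only three samples per side, the exact Mann-Whitney test cannot produce a two-sided p-value below 0.1, however different the samples look.

# Illustrative sketch only, not LNT code. Sample values are made up.
from scipy import stats

baseline  = [0.011, 0.009, 0.020]  # timing samples before a change
candidate = [0.019, 0.024, 0.021]  # timing samples after it

# Student's t-test assumes roughly normal noise with equal variances.
t_stat, t_p = stats.ttest_ind(baseline, candidate, equal_var=True)

# Mann-Whitney U only assumes independent samples, no normality, but
# with n=3 per side the exact test can never reach p < 0.1 (two-sided).
u_stat, u_p = stats.mannwhitneyu(baseline, candidate,
                                 alternative='two-sided')

print("t-test:       t=%.3f p=%.3f" % (t_stat, t_p))
print("Mann-Whitney: U=%.1f p=%.3f" % (u_stat, u_p))

This is exactly the assumptions-versus-power tradeoff: the test that needs the least proving about our noise distribution is also the one that needs the most samples before it can say anything at all.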
Renato Golin
2013-Jun-28 20:19 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com> wrote:
> Given this tradeoff I think we want to tend towards false positives (over
> false negatives) strictly as a matter of compiler quality.

False hits are not binary, but (at least) two-dimensional. You can't say it's better to have any number of false positives than any number of false negatives (pretty much like the NSA spying on *everybody* to avoid *any* false negative). Nor can you say that N false positives cost the same as N false negatives, because a single false hit can be huge in itself, or not.

What we have today is a huge amount of false positives and very few (or no) false negatives. But we miss even the real positives that we could spot through all this noise, because people don't normally look at the regressions. If I had to skim through the regressions on every build, I'd do nothing else.

Given that proportion, I'd rather accept a few small false negatives and considerably reduce the number of false positives with a hammer approach, and only later nail down the options and do some fine tuning, than do the fine tuning now, while nobody trusts the results enough to look at them.

> That said, I'd never object to a professional's opinion on this problem!

Absolutely! And David can help you a lot there. But I wouldn't try to get it perfect before we get it acceptable.

cheers,
--renato
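A toy sketch of the "two-dimensional" point (every name and number below is made up, purely for illustration): triage has to weigh how confident we are that a change is real against how big it is, for instance by ranking reports by the product of the two.

# Toy illustration of two-dimensional triage; all values are made up.
reports = [
    ("huge-but-noisy.c",  150.0, 0.55),   # (benchmark, delta %, confidence)
    ("tiny-but-real.c",     2.0, 0.999),
    ("big-and-real.cpp",   40.0, 0.97),
]

# Rank by expected cost: regression size weighted by the probability
# that the regression is real, so a huge-but-uncertain report does not
# automatically outrank a smaller change we are nearly sure about.
for bench, delta, conf in sorted(reports, key=lambda r: r[1] * r[2],
                                 reverse=True):
    print("%-20s %6.1f%% at %5.1f%% confidence" % (bench, delta, conf * 100))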
Tobias Grosser
2013-Jun-30 02:10 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/28/2013 01:19 PM, Renato Golin wrote:
> [...]
> What we have today is a huge amount of false positives and very few
> (or no) false negatives. But we miss even the real positives that we
> could spot through all this noise, because people don't normally look
> at the regressions.
> [...]
> But I wouldn't try to get it perfect before we get it acceptable.

Wow. Thanks a lot for the insights into what LNT is currently doing and what people are planning for the future. It seems there is a lot of interesting stuff on the way.

I agree with Renato that the major problem is currently not that we miss regressions because we fail to detect them, but that we miss them because nobody looks at the results, due to the large amount of noise.

To make this more concrete, I want to point you to the experiments that Star Tan has run. He hosted his LNT results here [1]. One of the top changes in the report is a 150% compile-time increase for SingleSource/UnitTests/2003-07-10-SignConversions.c. Looking at the data of the original run, we get:

~$ cat /tmp/data-before
0.0120
0.0080
0.0200
~$ cat /tmp/data-after
0.0200
0.0240
0.0200

There is clearly a lot of noise involved, yet LNT reports this result without recognizing that the measurements for this benchmark are unreliable. In contrast, the ministat [2] tool correctly concludes that these samples are insufficient to prove any statistical difference at 90% confidence:
======================================================================
$ ./src/ministat -c 90 /tmp/data-before /tmp/data-after
x /tmp/data-before
+ /tmp/data-after
    [ASCII distribution plot omitted]
    N         Min         Max      Median          Avg        Stddev
x   3       0.008        0.02       0.012  0.013333333  0.0061101009
+   3        0.02       0.024        0.02  0.021333333  0.0023094011
No difference proven at 90.0% confidence
======================================================================

Running ministat on the results reported for MultiSource/Benchmarks/7zip/7zip-benchmark, we can prove a difference even at 99.5% confidence:

======================================================================
$ ./src/ministat -c 99.5 /tmp/data2-before /tmp/data2-after
x /tmp/data2-before
+ /tmp/data2-after
    [ASCII distribution plot omitted]
    N         Min         Max      Median          Avg        Stddev
x   3      45.084      45.344      45.336    45.254667    0.14785579
+   3      48.152       48.36      48.152    48.221333    0.12008886
Difference at 99.5% confidence
        2.96667 +/- 0.788842
        6.55549% +/- 1.74312%
        (Student's t, pooled s = 0.13469)
======================================================================

The statistical test ministat performs is simple and pretty standard. Is there any reason we could not do something similar? Or are we doing it already and it just does not work as expected?

Filtering and sorting the results by confidence also seems very interesting to me. I would much rather look first at the performance changes reported with 99.5% confidence than at the ones that could not even be proven with 90% confidence.

Cheers,
Tobias

[1] http://188.40.87.11:8000/db_default/v4/nts/3
[2] https://github.com/codahale/ministat
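For what it's worth, ministat's check is easy to reproduce. Below is a minimal sketch (an illustration, not code from LNT or ministat) that runs the same pooled-variance Student's t-test on the two data sets quoted above; it agrees with ministat's verdicts, finding no provable difference at 90% confidence for the SignConversions samples and a clear difference for 7zip-benchmark. Sorting benchmarks by the resulting p-value would give exactly the "filter by confidence" view Tobias asks for.

# Minimal sketch of ministat's check: a pooled-variance Student's t-test.
# Illustration only, not code from LNT or ministat.
from scipy import stats

runs = {
    "SignConversions.c": ([0.012, 0.008, 0.020],
                          [0.020, 0.024, 0.020]),
    "7zip-benchmark":    ([45.084, 45.344, 45.336],
                          [48.152, 48.360, 48.152]),
}

for name, (before, after) in runs.items():
    # equal_var=True selects the pooled-variance test ministat implements
    # (its output says "Student's t, pooled s").
    t_stat, p = stats.ttest_ind(before, after, equal_var=True)
    verdict = "difference" if p < 0.10 else "no difference proven"
    print("%s: t=%.2f p=%.4f -> %s at 90%% confidence"
          % (name, t_stat, p, verdict))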