Chris Matthews
2013-Jun-28  18:45 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
I should describe the cost of false negatives and false positives, since I think it matters for how this problem is approached. A false negative means we miss a real regression; we don’t want that. A false positive means somebody has to spend time looking at and reproducing a regression that is not there; bad too. Given this tradeoff I think we want to tend towards false positives (over false negatives) strictly as a matter of compiler quality, but if we can throw more data at the problem to reduce false positives, that is good.

I have discussed the classification problem before with people off list. The problem that we face is that the space is pretty big for manual classification, at worst: number of benchmarks * number of architectures * sets of flags * metrics collected. Perhaps some sensible defaults could overcome that. Also, to classify well, you probably need a lot of samples as a baseline.

There certainly are lots of tests for small data. As far as I know, though, they rely more heavily on assumptions that in our case would have to be proven. That said, I’d never object to a professional’s opinion on this problem!

Chris Matthews

On Jun 28, 2013, at 6:28 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 28 June 2013 14:06, David Tweed <david.tweed at arm.com> wrote:
> That's a viewpoint; another one is that statisticians might well have very good reasons why they spend so long coming up with statistical tests in order to create the most powerful tests so they can deal with marginal quantities of data.
>
> 87.35% of all statistics are made up, 55.12% of them could have been done a lot simpler, a lot quicker and only 1.99% (AER) actually make your life better.
>
> I'm glad that Chris already has working solutions, and I'd be happy to see them go live before any professional statistician had a look at it. ;)
>
> cheers,
> --renato
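To get a feel for how quickly the classification space described above grows, here is a small sketch. Every name and count in it is a hypothetical placeholder chosen for illustration; none of them come from LNT or from the thread.

```python
from itertools import product

# Hypothetical dimensions, for illustration only (not LNT's real configuration).
benchmarks = [f"bench{i}" for i in range(500)]
architectures = ["x86_64", "armv7", "aarch64"]
flag_sets = ["-O0", "-O2", "-O3", "-Os"]
metrics = ["compile_time", "exec_time"]


def classification_space(*dimensions):
    """Return every combination that would need its own manual classification."""
    return list(product(*dimensions))


space = classification_space(benchmarks, architectures, flag_sets, metrics)
print(len(space))  # 500 * 3 * 4 * 2 = 12000
```

Even with these modest made-up numbers the worst case is five-digit, which is presumably why sensible defaults look more attractive than classifying each combination by hand.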
Renato Golin
2013-Jun-28  20:19 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com> wrote:

> Given this tradeoff I think we want to tend towards false positives (over
> false negatives) strictly as a matter of compiler quality.

False hits are not binary, but (at least) two-dimensional. You can't say it's better to have any amount of false positives than any amount of false negatives (pretty much like the NSA spying on *everybody* to avoid *any* false negative). You also can't say that N false positives is the same as N false negatives, because a false hit can be huge in itself, or not.

What we have today is a huge amount of false positives and very few (or no) false negatives. But even the real positives that we could spot even with this amount of noise, we don't, because people don't normally look at regressions. If I had to skim through the regressions on every build, I'd do nothing else.

Given the proportion, I'd rather have a few small false positives and reduce considerably the number of false positives with a hammer approach, and only later try to nail down the options and do some fine tuning, than doing the fine tuning now while still nobody cares about any result because they're not trustworthy.

> That said, I’d never object to a professional’s opinion on this problem!

Absolutely! And David can help you a lot, there. But I wouldn't try to get it perfect before we get it acceptable.

cheers,
--renato
Tobias Grosser
2013-Jun-30  02:10 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 06/28/2013 01:19 PM, Renato Golin wrote:
> On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com>
> wrote:
>
>> Given this tradeoff I think we want to tend towards false positives
>> (over false negatives) strictly as a matter of compiler quality.
>
> False hits are not binary, but (at least) two-dimensional. You can't
> say it's better to have any amount of false positives than any amount
> of false negatives (pretty much like the NSA spying on *everybody* to
> avoid *any* false negative). You can't also say that N
> false-positives is the same as N false-negatives, because a false-hit
> can be huge in itself, or not.
>
> What we have today is a huge amount of false positives and very few
> (or none) false negatives. But even the real positives that we could
> spot even with this amount of noise, we don't, because people don't
> normally look at regressions. If I had to skim through the
> regressions on every build, I'd do nothing else.
>
> Given the proportion, I'd rather have a few small false positives
> and reduce considerably the number of false positives with a hammer
> approach, and only later try to nail down the options and do some
> fine tuning, than doing the fine tuning now while still nobody cares
> about any result because they're not trust-worthy.
>
>> That said, I’d never object to a professional’s opinion on this
>> problem!
>
> Absolutely! And David can help you a lot, there. But I wouldn't try
> to get it perfect before we get it acceptable.

Wow. Thanks a lot for the insights into what LNT is currently doing and what people are planning for the future. It seems there is a lot of interesting stuff on the way.

I agree with Renato that one of the major problems is currently not missing regressions because we do not detect them, but missing them because nobody looks at the results due to the large amount of noise.

To make this more concrete I want to point you to the experiments that Star Tan has run. He hosted his LNT results here [1]. One of the top changes in the reports is a 150% compile time increase for SingleSource/UnitTests/2003-07-10-SignConversions.c. Looking at the data of the original run, we get:

~$ cat /tmp/data-before
0.0120
0.0080
0.0200

~$ cat /tmp/data-after
0.0200
0.0240
0.0200

It seems there is a lot of noise involved. Still, LNT is reporting this result without understanding that the results for this benchmark are unreliable.

In contrast, the ministat [2] tool is perfectly capable of understanding that those results are insufficient to prove any statistical difference at 90% confidence.

======================================================================
$ ./src/ministat -c 90 /tmp/data-before /tmp/data-after
x /tmp/data-before
+ /tmp/data-after
[ASCII distribution plot; layout not recoverable]
    N           Min           Max        Median           Avg        Stddev
x   3         0.008          0.02         0.012   0.013333333  0.0061101009
+   3          0.02         0.024          0.02   0.021333333  0.0023094011
No difference proven at 90.0% confidence
======================================================================

Running ministat on the results reported for MultiSource/Benchmarks/7zip/7zip-benchmark we can prove a difference even at 99.5% confidence:

======================================================================
$ ./src/ministat -c 99.5 /tmp/data2-before /tmp/data2-after
x /tmp/data2-before
+ /tmp/data2-after
[ASCII distribution plot; layout not recoverable]
    N           Min           Max        Median           Avg        Stddev
x   3        45.084        45.344        45.336     45.254667    0.14785579
+   3        48.152         48.36        48.152     48.221333    0.12008886
Difference at 99.5% confidence
        2.96667 +/- 0.788842
        6.55549% +/- 1.74312%
        (Student's t, pooled s = 0.13469)
======================================================================

The statistical test ministat is performing seems simple and pretty standard. Is there any reason we could not do something similar? Or are we doing it already and it just does not work as expected?

Filtering and sorting the results by confidence seems very interesting to me. In fact, I would rather look first at the performance changes reported with 99.5% confidence than at the ones that could not even be proven with 90% confidence.

Cheers,
Tobias

[1] http://188.40.87.11:8000/db_default/v4/nts/3
[2] https://github.com/codahale/ministat

-------------- next part --------------
0.0120
0.0080
0.0200
-------------- next part --------------
0.0200
0.0240
0.0200
-------------- next part --------------
45.0840
45.3440
45.3360
-------------- next part --------------
48.1520
48.3600
48.1520
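As a rough illustration of the kind of check discussed in the message above, the following is a minimal sketch of a pooled-variance two-sample Student's t-test applied to the quoted samples. It is not LNT's code and not ministat's implementation; the function name `pooled_t_test` and the lookup table are assumptions made for this example, and the critical values are the usual two-sided table entries for 4 degrees of freedom only.

```python
import math
from statistics import mean, variance

# Two-sided critical values of Student's t for 4 degrees of freedom
# (two samples of three runs each), from a standard t table.
T_CRITICAL_DF4 = {90.0: 2.132, 95.0: 2.776, 99.0: 4.604}


def pooled_t_test(before, after, confidence=95.0):
    """Pooled-variance two-sample t-test (hypothetical helper, df = 4 only).

    Returns (difference of means, t statistic, pooled stddev, significant).
    Assumes the samples are not all identical (non-zero pooled variance).
    """
    n1, n2 = len(before), len(after)
    df = n1 + n2 - 2
    if df != 4:
        raise ValueError("this sketch only carries critical values for df = 4")
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    diff = mean(after) - mean(before)
    t_stat = diff / std_err
    significant = abs(t_stat) > T_CRITICAL_DF4[confidence]
    return diff, t_stat, math.sqrt(pooled_var), significant


# Samples quoted in the message above.
sign_conversions = ([0.0120, 0.0080, 0.0200], [0.0200, 0.0240, 0.0200])
seven_zip = ([45.0840, 45.3440, 45.3360], [48.1520, 48.3600, 48.1520])

noisy = pooled_t_test(*sign_conversions, confidence=90.0)
clear = pooled_t_test(*seven_zip, confidence=99.0)
print(noisy)  # difference not proven at 90%
print(clear)  # difference proven at 99%
```

Under these assumptions the SignConversions samples fall just short of the 90% threshold while the 7zip samples clear the 99% threshold by a wide margin, which agrees with the ministat verdicts quoted above. Sorting reported changes by the absolute t statistic would be one simple way to approximate the "sort by confidence" idea.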