Tobias Grosser
2013-Jun-30 02:10 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/28/2013 01:19 PM, Renato Golin wrote:
> On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com> wrote:
>
>> Given this tradeoff I think we want to tend towards false positives
>> (over false negatives) strictly as a matter of compiler quality.
>
> False hits are not binary, but (at least) two-dimensional. You can't
> say it's better to have any amount of false positives than any amount
> of false negatives (pretty much like the NSA spying on *everybody* to
> avoid *any* false negative). You also can't say that N false positives
> are the same as N false negatives, because a single false hit can be
> huge in itself, or not.
>
> What we have today is a huge number of false positives and very few
> (or no) false negatives. But even the real positives that we could
> spot despite this amount of noise go unnoticed, because people don't
> normally look at regressions. If I had to skim through the
> regressions on every build, I'd do nothing else.
>
> Given the proportion, I'd rather keep a few small false positives and
> reduce the number of false positives considerably with a hammer
> approach, and only later try to nail down the options and do some
> fine tuning, than do the fine tuning now while nobody cares about any
> result because the results are not trustworthy.
>
>> That said, I'd never object to a professional's opinion on this
>> problem!
>
> Absolutely! And David can help you a lot there. But I wouldn't try
> to get it perfect before we get it acceptable.

Wow. Thanks a lot for the insights into what LNT is currently doing and
what people are planning for the future. It seems there is a lot of
interesting stuff on the way.

I agree with Renato that the major problem currently is not that we miss
regressions because we fail to detect them, but that we miss them because
nobody looks at the results, due to the large amount of noise.

To make this more concrete, I want to point you to the experiments that
Star Tan has run. He hosted his LNT results here [1]. One of the top
changes in the reports is a 150% compile time increase for
SingleSource/UnitTests/2003-07-10-SignConversions.c. Looking at the data
of the original run, we get:

~$ cat /tmp/data-before
0.0120
0.0080
0.0200
~$ cat /tmp/data-after
0.0200
0.0240
0.0200

There is clearly a lot of noise involved. Still, LNT reports this result
without recognizing that the measurements for this benchmark are
unreliable. In contrast, the ministat [2] tool is perfectly capable of
recognizing that these samples are insufficient to prove any statistical
difference at 90% confidence.
======================================================================
$ ./src/ministat -c 90 /tmp/data-before /tmp/data-after
x /tmp/data-before
+ /tmp/data-after
[ASCII box plot omitted]
    N           Min           Max        Median           Avg        Stddev
x   3         0.008          0.02         0.012   0.013333333  0.0061101009
+   3          0.02         0.024          0.02   0.021333333  0.0023094011
No difference proven at 90.0% confidence
======================================================================

Running ministat on the results reported for
MultiSource/Benchmarks/7zip/7zip-benchmark, we can prove a difference even
at 99.5% confidence:

======================================================================
$ ./src/ministat -c 99.5 /tmp/data2-before /tmp/data2-after
x /tmp/data2-before
+ /tmp/data2-after
[ASCII box plot omitted]
    N           Min           Max        Median           Avg        Stddev
x   3        45.084        45.344        45.336     45.254667    0.14785579
+   3        48.152         48.36        48.152     48.221333    0.12008886
Difference at 99.5% confidence
        2.96667 +/- 0.788842
        6.55549% +/- 1.74312%
        (Student's t, pooled s = 0.13469)
======================================================================

The statistical test ministat performs seems simple and pretty standard.
Is there any reason we could not do something similar? Or are we doing it
already and it just does not work as expected?

Filtering and sorting the results by confidence seems very interesting to
me. In fact, I would rather look first at the performance changes reported
with 99.5% confidence than at the ones that could not even be proven at
90% confidence.

Cheers,
Tobias

[1] http://188.40.87.11:8000/db_default/v4/nts/3
[2] https://github.com/codahale/ministat

-------------- next part --------------
0.0120
0.0080
0.0200
-------------- next part --------------
0.0200
0.0240
0.0200
-------------- next part --------------
45.0840
45.3440
45.3360
-------------- next part --------------
48.1520
48.3600
48.1520
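As a rough illustration of the check being discussed, here is a minimal
Python sketch of the same pooled-variance Student's t-test that ministat
runs, applied to the samples quoted above. It is not LNT code; it assumes
SciPy is installed, and the helper name is purely illustrative.

from scipy import stats

def proven_difference(before, after, confidence=0.90):
    """Return True if the two samples differ at the given confidence level."""
    # Two-sample Student's t-test with pooled variance, as ministat uses.
    t_stat, p_value = stats.ttest_ind(before, after, equal_var=True)
    return p_value < (1.0 - confidence)

# The samples quoted above.
sign_conversions_before = [0.0120, 0.0080, 0.0200]
sign_conversions_after  = [0.0200, 0.0240, 0.0200]
sevenzip_before = [45.084, 45.344, 45.336]
sevenzip_after  = [48.152, 48.360, 48.152]

# Expected to agree with ministat: no difference proven at 90% confidence
# for the noisy compile-time samples, a clear difference at 99.5% for 7zip.
print(proven_difference(sign_conversions_before, sign_conversions_after, 0.90))
print(proven_difference(sevenzip_before, sevenzip_after, 0.995))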
Anton Korobeynikov
2013-Jun-30 09:14 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Hi Tobi,

First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :)

> The statistical test ministat performs seems simple and pretty
> standard. Is there any reason we could not do something similar? Or are
> we doing it already and it just does not work as expected?

The main problem with tests of this sort is that we cannot trust them
unless:

1. The data has a normal distribution, or
2. The sample size is large (say, > 50).

Here we have only 3 points, and no, I won't trust ministat's t-test and
its normal-approximation based confidence bounds. They are *too short*
(i.e. the real confidence level is not 99.5%, but actually 40-50%, for
example).

I'd ask for:

1. Increasing the sample size to at least 5-10.
2. Doing the Wilcoxon/Mann-Whitney test.

What do you think?

--
With best regards, Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University
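To make the suggestion concrete, here is a minimal sketch (again assuming
SciPy; not LNT code) of the Wilcoxon/Mann-Whitney rank-sum test applied to
the 7zip samples quoted earlier:

from scipy.stats import mannwhitneyu

before = [45.084, 45.344, 45.336]   # 7zip samples quoted earlier
after  = [48.152, 48.360, 48.152]

u_stat, p_value = mannwhitneyu(before, after, alternative='two-sided')
print(p_value)

Note that with only 3 samples per side an exact rank test has C(6,3) = 20
possible arrangements, so the smallest achievable two-sided p-value is
2/20 = 0.1: even a perfectly clean regression can never be flagged at 95%
confidence. That is another argument for the 5-10 samples requested above.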
Tobias Grosser
2013-Jun-30 16:19 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/30/2013 02:14 AM, Anton Korobeynikov wrote:
> Hi Tobi,
>
> First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :)
>
>> The statistical test ministat performs seems simple and pretty
>> standard. Is there any reason we could not do something similar? Or are
>> we doing it already and it just does not work as expected?
>
> The main problem with tests of this sort is that we cannot trust them
> unless:
> 1. The data has a normal distribution, or
> 2. The sample size is large (say, > 50).
>
> Here we have only 3 points, and no, I won't trust ministat's t-test and
> its normal-approximation based confidence bounds. They are *too short*
> (i.e. the real confidence level is not 99.5%, but actually 40-50%, for
> example).

Hi Anton,

I trust your knowledge of statistics, but I am wondering why ministat (and
its t-test) is promoted as a statistically sane tool for benchmark
results. Is using the t-test for benchmark results a bad idea in general?
Would ministat be a better tool if it implemented the
Wilcoxon/Mann-Whitney test?

> I'd ask for:
>
> 1. Increasing the sample size to at least 5-10.
> 2. Doing the Wilcoxon/Mann-Whitney test.

Reading about the Wilcoxon/Mann-Whitney test, it seems to be a more robust
test that frees us from the normality assumption. As its implementation
also does not look overly complicated, it may be a good choice.

Regarding the number of samples: I think the most important point is that
we get some measure of confidence by which we can sort our results and
make it visible in the UI. For different use cases we can then adapt the
number of samples based on the required confidence and the amount of
noise/lost regressions we can accept. This may also be a great use for the
adaptive sampling that Chris suggested.

Is there anything stopping us from implementing such a test and exposing
its results in the UI?

Cheers,
Tobi
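As a purely hypothetical sketch of what "sort results by confidence" could
look like, the snippet below runs a Mann-Whitney test per benchmark and
orders the report by p-value. The data layout and function names are
invented for illustration, not LNT's actual API; SciPy is assumed.

from scipy.stats import mannwhitneyu

def rank_changes(results):
    """results: dict mapping benchmark name -> (before_samples, after_samples)."""
    ranked = []
    for name, (before, after) in results.items():
        _, p_value = mannwhitneyu(before, after, alternative='two-sided')
        # Relative change of the mean, purely for display.
        delta = (sum(after) / len(after)) / (sum(before) / len(before)) - 1.0
        ranked.append((p_value, name, delta))
    # Smallest p-value first: the changes we can be most confident about.
    return sorted(ranked)

results = {
    '7zip-benchmark': ([45.084, 45.344, 45.336], [48.152, 48.360, 48.152]),
    '2003-07-10-SignConversions': ([0.012, 0.008, 0.020], [0.020, 0.024, 0.020]),
}
for p_value, name, delta in rank_changes(results):
    print(f'{name}: {delta:+.1%} (p = {p_value:.3f})')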
Renato Golin
2013-Jun-30 18:30 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 30 June 2013 10:14, Anton Korobeynikov <anton at korobeynikov.info> wrote:
> 1. Increasing the sample size to at least 5-10.

That's not feasible on slower systems. A single data point takes 1 hour on
the fastest ARM board I can get (a Chromebook).

Getting 10 samples at different commits will give you similar accuracy if
the behaviour doesn't change, and you can rely on 10-point blocks before
and after each change showing the same result. What won't happen is one
commit making it truly faster and the very next one making it slow again
(or slow/fast). So all we need to determine, for each commit, is whether
it was the one that made all subsequent runs slower/faster, and that we
can get from the several commits after the culprit, since the probability
that another (unrelated) commit changes the behaviour again is small.

This is why I proposed something like moving averages. Not because it's
the best statistical model, but because it works around a concrete problem
we have. I don't care which model/tool you use, as long as it doesn't mean
I'll have to wait 10 hours for a result, or sift through hundreds of
commits every time I see a regression in performance. What that would do,
for sure, is make me ignore small regressions, since they wouldn't be
worth the massive work of finding the real culprit.

If I had a team of 10 people just to look at regressions all day long, I'd
ask them to build a proper statistical model and go do more interesting
things...

cheers,
--renato
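A rough sketch of the moving-average idea, under the assumption of one
sample per commit: compare the windowed average of the runs before each
commit with the window after it and flag commits where the average shifts.
The window size and threshold below are arbitrary illustrations, not a
tuned proposal.

def flag_step_changes(times, window=5, threshold=0.06):
    """times: per-commit execution times, oldest first.
    Return indices of commits where the windowed average shifts by more
    than `threshold` (relative)."""
    suspects = []
    for i in range(window, len(times) - window + 1):
        before = sum(times[i - window:i]) / window
        after = sum(times[i:i + window]) / window
        if abs(after - before) / before > threshold:
            suspects.append(i)
    return suspects

# Example: a step from ~45s to ~48s starting at commit index 6.
history = [45.1, 45.3, 45.2, 45.3, 45.1, 45.2,
           48.2, 48.1, 48.3, 48.2, 48.1, 48.2]
# Expected to flag index 6, where the windowed average jumps by ~6.5%.
print(flag_step_changes(history))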