Tobias Grosser
2013-Jun-30 16:19 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 06/30/2013 02:14 AM, Anton Korobeynikov wrote:
> Hi Tobi,
>
> First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :)
>
>> The statistical test ministat is performing seems simple and pretty
>> standard. Is there any reason we could not do something similar? Or are we
>> doing it already and it just does not work as expected?
>
> The main problem with this sort of test is that we cannot trust it, unless:
> 1. The data has the normal distribution
> 2. The sample size is large (say, > 50)
>
> Here we have only 3 points and, no, I won't trust the ministat's
> t-test and normal-approximation based confidence bounds. They are *too
> short* (= the real confidence level is not 99.5%, but actually 40-50%,
> for example).

Hi Anton,

I trust your knowledge about statistics, but am wondering why ministat (and its t-test) is promoted as a statistically sane tool for benchmarking results. Is the use of the t-test for benchmark results a bad idea in general? Would ministat be a better tool if it implemented the Wilcoxon/Mann-Whitney test?

> I'd ask for:
>
> 1. Increasing sample size to at least 5-10
> 2. Do the Wilcoxon/Mann-Whitney test

Reading about the Wilcoxon/Mann-Whitney test, it seems to be a more robust test that frees us from the normal-approximation assumption. As its implementation also does not look overly complicated, it may be a good choice.

Regarding the number of samples: I think the most important point is that we get some measurement of confidence by which we can sort our results and make it visible in the UI. For different use cases we can adapt the number of samples based on the required confidence and the amount of noise and lost regressions we can accept. This may also be a great use for the adaptive sampling that Chris suggested.

Is there anything stopping us from implementing such a test and exposing its results in the UI?

Cheers,
Tobi
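For illustration, here is a minimal sketch of an exact two-sided Wilcoxon/Mann-Whitney test in plain Python. It is not LNT or ministat code, and the function names `mann_whitney_u` and `exact_p_value` are hypothetical. It assumes small samples, because it enumerates every way of splitting the pooled measurements into two groups, and it counts ties as half a win.

```python
from itertools import combinations


def mann_whitney_u(a, b):
    """U statistic for sample a: number of pairs (x, y) with x < y.

    Ties contribute 0.5, so U is always a multiple of 0.5.
    """
    return sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in a for y in b)


def exact_p_value(a, b):
    """Two-sided exact p-value, by enumerating all relabelings.

    Under the null hypothesis every split of the pooled data into
    groups of size len(a) and len(b) is equally likely. The p-value
    is the fraction of splits whose U is at least as far from its
    null mean as the observed U.
    """
    pooled = list(a) + list(b)
    n, m = len(a), len(b)
    mean_u = n * m / 2.0
    observed = abs(mann_whitney_u(a, b) - mean_u)

    indices = range(len(pooled))
    total = 0
    extreme = 0
    for chosen in combinations(indices, n):
        chosen_set = set(chosen)
        xs = [pooled[i] for i in chosen]
        ys = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(mann_whitney_u(xs, ys) - mean_u) >= observed:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Made-up timings, chosen so the two groups are fully separated.
    print(exact_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
    print(exact_p_value([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

With 3 runs per side, even perfectly separated samples give a two-sided p-value of 2/20 = 0.1, so the test can never reach the 5% level. With 5 runs per side the smallest attainable value is 2/252, about 0.008, which fits the request to raise the sample size to at least 5-10.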
Anton Korobeynikov
2013-Jun-30 19:05 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
Hi Tobias,

> I trust your knowledge about statistics, but am wondering why ministat (and
> its t-test) is promoted as a statistically sane tool for benchmarking
> results.

I do not know... Ask the author of ministat?

> Is the use of the t-test for benchmark results a bad idea in
> general?

No, not in general. But one should be aware of the assumptions of the underlying theory. The t-test is fine as long as our data follows the normal distribution (and hence the test would be exact) or the sample size is large (then we have the asymptotic normality of the mean due to the CLT).

> Would ministat be a better tool if it implemented the
> Wilcoxon/Mann-Whitney test?

The precision would be much better for small sample sizes (say, in the range 10-50).

But in any case, never trust anyone who claims he can reliably estimate the variance from 3 data points.

> Is there anything stopping us from implementing such a test and exposing its
> results in the UI?

I do not think so...

--
With best regards, Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University
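The warning about estimating variance from 3 data points can be checked with a small simulation. The sketch below is an illustration only, under assumptions that are not from the thread: the data is drawn from a standard normal distribution (true variance 1, the best case for the t-test), and the helper names `sample_variance` and `variance_spread` are hypothetical. It reports the 5th and 95th percentiles of the unbiased variance estimate across many simulated experiments.

```python
import random


def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_spread(n, trials=2000, seed=0):
    """5th and 95th percentile of the variance estimate from n points.

    Each trial draws n values from N(0, 1), so the true variance is 1.
    The seed is fixed so that repeated runs give the same numbers.
    """
    rng = random.Random(seed)
    estimates = sorted(
        sample_variance([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return estimates[int(0.05 * trials)], estimates[int(0.95 * trials)]


if __name__ == "__main__":
    for n in (3, 10, 50):
        low, high = variance_spread(n)
        print(f"n={n:2d}: 5th pct = {low:.2f}, 95th pct = {high:.2f}")
```

For normal data the estimate from 3 points follows a scaled chi-squared distribution with 2 degrees of freedom, so its 5th and 95th percentiles sit near 0.05 and 3.0 times the true variance, a spread of well over an order of magnitude. With 50 points the same percentiles are roughly 0.7 and 1.35.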
Chris Matthews
2013-Jul-01 01:04 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
I think we need to be using tests with the fewest assumptions possible. I don’t think there are many assumptions that would hold for all the benchmarks.

Chris Matthews
chris.matthews at apple.com
phone: 36335
Renato Golin
2013-Jul-01 07:37 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 30 June 2013 20:05, Anton Korobeynikov <anton at korobeynikov.info> wrote:
> But in any case, never trust someone who will claim he can reliably
> estimate the variance from 3 data points.

One cannot stress this enough. ;)

cheers,
--renato