Tobias Grosser
2014-Jan-07 18:06 UTC
[LLVMdev] New -O3 Performance tester - Use hardware to get reliable numbers
Hi,

I would like to announce a new set of LNT -O3 performance testers.

In a discussion titled "Question about results reliability in LNT infrustructure", Anton suggested that one way to get statistically reliable test results from the LNT infrastructure is to use a larger sample size (5-10) together with a more robust statistical test (Wilcoxon/Mann-Whitney). Another requirement for making the performance results from our testers useful is a per-commit performance run.

I have set up 4 identical machines* that publicly report LNT results for 'clang -O3' at:

http://llvm.org/perf/db_default/v4/nts/machine/34

On average, each run currently covers a group of 3-5 commits. As most commits do not impact performance, this granularity seems to be enough to track down performance regressions/changes easily.

The results reported so far seem to provide sufficient information to catch performance changes. Specifically, when the aggregation function is set to median, most runs are shown to have no performance impact, e.g.:

http://llvm.org/perf/db_default/v4/nts/19939?num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=median&compare_to=19934&submit=Update

A couple of runs still report performance differences, but the performance graphs of the affected test cases make it very clear that those are false positives caused by test case noise.

Here comes the point of this mail. I am currently not sure when I will find time to improve the LNT infrastructure to take advantage of the data provided. So if someone else would like to have a look and, e.g., add the Wilcoxon/Mann-Whitney test, this would be highly appreciated.

I also have a couple more machines. Hence, once the LNT infrastructure is in place, we can use them to increase the reliability of the results even more.

Cheers,
Tobias

* They also have sufficiently close performance characteristics when running the LNT tests for the same version.
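To make the suggested Wilcoxon/Mann-Whitney comparison concrete, here is a minimal sketch in plain Python. It is not LNT code: the function names are hypothetical, and it uses the normal approximation with tie and continuity corrections, which is only rough for the 5-10 samples mentioned above (an exact test or a statistics library would be preferable in practice).

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(before, after):
    """Two-sided Mann-Whitney U test (normal approximation).

    `before` and `after` are execution-time samples of one benchmark for
    two compiler revisions. Returns (U, p_value).
    """
    n1, n2 = len(before), len(after)
    combined = list(before) + list(after)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)

    # Variance of U with the standard correction for tied values.
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u, 1.0  # all samples identical: no evidence of a change

    # Continuity-corrected z score, clamped at zero.
    z = max((abs(u - n1 * n2 / 2.0) - 0.5) / math.sqrt(variance), 0.0)
    p_value = math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))
    return u, p_value
```

With five made-up samples per revision, `mann_whitney_u([1.00, 1.01, 1.02, 1.03, 1.04], [1.20, 1.21, 1.22, 1.23, 1.24])` gives U = 0 and a p-value below 0.05, whereas two samples drawn from the same noisy values give a p-value of 1.0 and would not be flagged.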
Sean Silva
2014-Jan-08 01:48 UTC
[LLVMdev] New -O3 Performance tester - Use hardware to get reliable numbers
On Tue, Jan 7, 2014 at 11:06 AM, Tobias Grosser <tobias at grosser.es> wrote:

> We currently catch in average groups of 3-5 commits. As most commits
> obviously do not impact performance this seems to be enough to track down
> performance regressions/changes easily.

If possible, I think it would be a good idea to filter out commits that don't affect code generation. This would allow machine resources to be better used.

Is there some way we can easily filter commits based on whether they affect code generation or not? Would it be reliable enough to check if the commit touches any of our integration tests?

As a rough estimate:

    sean:~/pg/llvm/llvm % git log --oneline --since='1 month ago' | wc -l
    706
    sean:~/pg/llvm/llvm % git log --oneline --since='1 month ago' ./test | wc -l
    317

So it seems that, if this is reasonable, we can effectively double our performance testing coverage by filtering like this.

-- Sean Silva
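The filtering idea above could be sketched roughly as follows. This is only an illustration, not part of LNT or of any existing buildbot setup: all function names are hypothetical, the `test/` prefix is the heuristic proposed in this mail, and the log format (`commit <sha>` followed by the touched paths) is an assumption chosen for easy parsing.

```python
import subprocess


def read_git_log(repo, since="1 month ago"):
    """Run git in `repo` and return one 'commit <sha>' line per commit,
    each followed by the paths that commit touched."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--name-only",
         "--pretty=format:commit %H", "--since=" + since],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_name_only_log(text):
    """Parse the output of read_git_log into {sha: [paths]} (newest first)."""
    commits = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("commit "):
            current = line.split(" ", 1)[1]
            commits[current] = []
        elif current is not None:
            commits[current].append(line)
    return commits


def touches_tests(paths, test_prefix="test/"):
    """Heuristic from this thread: a commit that touches the integration
    tests is assumed to be one that may affect code generation."""
    return any(path.startswith(test_prefix) for path in paths)


def commits_worth_benchmarking(commits):
    """Keep only the commits that touch at least one path under test/."""
    return [sha for sha, paths in commits.items() if touches_tests(paths)]
```

Calling `commits_worth_benchmarking(parse_name_only_log(read_git_log(".")))` in an LLVM checkout would list the candidate commits. With the numbers quoted above, 317 of 706 commits (about 45%) would be kept, which is where the "effectively double" estimate comes from.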
Tobias Grosser
2014-Jan-08 02:02 UTC
[LLVMdev] New -O3 Performance tester - Use hardware to get reliable numbers
On 01/08/2014 02:48 AM, Sean Silva wrote:

> If possible, I think it would be a good idea to filter out commits that
> don't affect code generation. This would allow machine resources to be
> better used.
>
> Is there some way we can easily filter commits based on whether they affect
> code generation or not? Would it be reliable enough to check if the commit
> touches any of our integration tests?
>
> As a rough estimate:
>
> sean:~/pg/llvm/llvm % git log --oneline --since='1 month ago' | wc -l
> 706
> sean:~/pg/llvm/llvm % git log --oneline --since='1 month ago' ./test | wc -l
> 317
>
> So it seems like if this is reasonable we can effectively double our
> performance testing coverage by filtering like this.

Hi Sean,

this is a very interesting idea, though I have no idea whether checking for 'test/' will be enough.

If we keep the performance tester running for a while, we can probably validate this assumption by checking whether runs that do not contain changes to the integration tests showed performance changes (and what kind of changes).

As said before, I would be glad to get help with further improvements on the software side.

Cheers,
Tobias
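The validation described above could look roughly like this once enough runs have accumulated. It is a sketch under assumptions: the `Run` record and its fields are hypothetical and do not correspond to the LNT database schema, and the list of changed benchmarks is assumed to come from whatever regression detection is in use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """One performance run covering a group of commits (hypothetical record)."""
    run_id: int
    changed_paths: List[str]  # all paths touched by the commits in this run
    changed_benchmarks: List[str] = field(default_factory=list)  # flagged tests


def contains_test_changes(run, test_prefix="test/"):
    """True if any commit in the run touched the integration tests."""
    return any(path.startswith(test_prefix) for path in run.changed_paths)


def counterexamples(runs):
    """Runs that changed performance although no commit touched test/.

    Each such run is evidence against filtering on test/ alone; its
    changed_benchmarks show what kind of change would have been missed.
    """
    return [
        run for run in runs
        if run.changed_benchmarks and not contains_test_changes(run)
    ]
```

An empty result over a long enough period would support the filter. Any run that is returned should first be checked for noise before it is counted against the assumption.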
Diego Novillo
2014-Jan-08 14:58 UTC
[LLVMdev] New -O3 Performance tester - Use hardware to get reliable numbers
On Tue, Jan 7, 2014 at 8:48 PM, Sean Silva <chisophugis at gmail.com> wrote:

> sean:~/pg/llvm/llvm % git log --oneline --since='1 month ago' | wc -l
> 706
> sean:~/pg/llvm/llvm % git log --oneline --since='1 month ago' ./test | wc -l
> 317

Wouldn't this also catch commits to code generation that added tests as well?

Diego.