Dear all,

Following the Benchmarking BOF at the 2013 US dev meeting, I'd like to propose some improvements to the LNT performance tracking software.

The most significant issue with the current implementation is that the report is filled with extremely noisy values, which makes it hard to notice performance improvements or regressions.

After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each one.

- Increase the execution time of the benchmarks so they run long enough to avoid noisy results
  Currently there are two options for running benchmarks, namely small and large problem size. I propose adding a third option: adaptive. In adaptive mode, benchmarks scale their problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness cannot be verified for some benchmarks; the solution is to check correctness on a separate board using the small problem size. (A rough sketch of the calibration idea follows this message.)
  LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
  Test suite: [PATCH 1/2] Add support for adaptive problem size
              [PATCH 2/2] A subset of test suite programs modified for adaptive

- Show and graph total compile time
  There is no obvious way to scale up the compile time of individual benchmarks, so the total time is the best we can do to minimize error.
  LNT: [PATCH 1/3] Add Total to run view and graph plot

- Only show performance changes with high confidence in the summary report
  To investigate the correlation between a program's run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this, we can hide low-confidence results from the summary report; they remain available and are marked in colour in the detailed report for anyone interested. (A sketch of this calculation also follows below.)
  LNT: [PATCH 3/3] Ignore tests with very short run time

- Make sure the board has low background noise
  Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
  LNT: benchmark.sh

In my prototype implementation, the summary report becomes much more useful. There are almost no noisy readings, while small regressions are still detectable for long-running benchmark programs. The implementation is backwards compatible with older databases.

Screenshots from a sample run are attached.

Thanks for reading!
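As a rough illustration of the adaptive problem-size idea above: not the actual test-suite patch (the real changes live in the benchmarks themselves), just a Python sketch in which the names, the probe size, and the 10-second target handling are all hypothetical.

    # Hypothetical sketch of adaptive problem sizing: time a small probe
    # run, then scale the iteration count so the total run time lands near
    # a 10 second target. Not the actual test-suite patch.
    import time

    TARGET_SECONDS = 10.0

    def measure(workload, iterations):
        """Time `iterations` repetitions of `workload` and return seconds."""
        start = time.perf_counter()
        for _ in range(iterations):
            workload()
        return time.perf_counter() - start

    def adaptive_iterations(workload, probe_iterations=1000):
        """Estimate how many iterations reach roughly TARGET_SECONDS."""
        probe_time = measure(workload, probe_iterations)
        per_iteration = probe_time / probe_iterations
        return max(probe_iterations, int(TARGET_SECONDS / per_iteration))

    if __name__ == "__main__":
        # Toy stand-in for a benchmark kernel.
        workload = lambda: sum(i * i for i in range(1000))
        n = adaptive_iterations(workload)
        print("running %d iterations" % n)
        print("took %.2fs" % measure(workload, n))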
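And a minimal sketch of the "minimal difference for statistical significance" calculation under the normal-distribution model mentioned above; the z value and the example numbers are illustrative assumptions, not values taken from the patch.

    # Hypothetical sketch: if run times are modelled as normally distributed,
    # the smallest difference in means that is statistically significant at a
    # given confidence level can be estimated from the sample variances.
    import math

    Z_95 = 1.96  # two-sided 95% confidence (assumed threshold)

    def minimal_significant_difference(stddev_a, n_a, stddev_b, n_b, z=Z_95):
        """Smallest mean difference distinguishable from noise (normal model)."""
        return z * math.sqrt(stddev_a ** 2 / n_a + stddev_b ** 2 / n_b)

    def is_significant(mean_a, stddev_a, n_a, mean_b, stddev_b, n_b, z=Z_95):
        """True if the observed change exceeds the minimal significant difference."""
        return abs(mean_a - mean_b) > minimal_significant_difference(
            stddev_a, n_a, stddev_b, n_b, z)

    # Example: a 2% change on a noisy 0.05s test vs. on a stable 10s test.
    print(is_significant(0.050, 0.004, 3, 0.051, 0.004, 3))  # False: hidden
    print(is_significant(10.00, 0.02, 3, 10.20, 0.02, 3))    # True: reported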
Hi Yi Kong,

thanks for working on this. I think there is a lot we can improve here. I copied Mingxing Tan, who has worked on a couple of patches in this area before, and Chris, who is maintaining LNT.

On 30/04/2014 00:49, Yi Kong wrote:
> Dear all,
>
> Following the Benchmarking BOF at the 2013 US dev meeting, I'd like to propose some improvements to the LNT performance tracking software.
>
> The most significant issue with the current implementation is that the report is filled with extremely noisy values, which makes it hard to notice performance improvements or regressions.

Right.

> After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each one.
>
> - Increase the execution time of the benchmarks so they run long enough to avoid noisy results
>   Currently there are two options for running benchmarks, namely small and large problem size. I propose adding a third option: adaptive. In adaptive mode, benchmarks scale their problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness cannot be verified for some benchmarks; the solution is to check correctness on a separate board using the small problem size.
>   LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
>   Test suite: [PATCH 1/2] Add support for adaptive problem size
>               [PATCH 2/2] A subset of test suite programs modified for adaptive

I think it will be easier to review such patches one by one on the commit mailing lists, especially as this one is a little larger.

In general, I see such changes as a second step. First, we want to have a system in place that allows us to reliably detect if a benchmark is noisy or not; second, we want to increase the number of benchmarks that are not noisy and where we can use the results.

> - Show and graph total compile time
>   There is no obvious way to scale up the compile time of individual benchmarks, so the total time is the best we can do to minimize error.
>   LNT: [PATCH 1/3] Add Total to run view and graph plot

I did not see the effect of these changes in your images and also honestly do not fully understand what you are doing. What is the total compile time? Don't we already show the compile time in the run view? How is the total time different from this compile time? Maybe you can answer this in a separate patch email.

> - Only show performance changes with high confidence in the summary report
>   To investigate the correlation between a program's run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this, we can hide low-confidence results from the summary report; they remain available and are marked in colour in the detailed report for anyone interested.
>   LNT: [PATCH 3/3] Ignore tests with very short run time

I think this is the most important point, which we should address first. In fact, I would prefer to go even further and actually compute the confidence and make the confidence we require an option. This allows us to understand both how stable/noisy a machine is and how well the other changes you propose work in practice.

We had a longer discussion here on llvmdev named 'Questions about results reliability in LNT infrustructure'.
Anton suggested doing the following:

 1. Get 5-10 samples per run
 2. Do the Wilcoxon/Mann-Whitney test

I already set up -O3 buildbots that provide 10 runs per commit, and the noise for them is very low:

http://llvm.org/perf/db_default/v4/nts/25151?num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=median&compare_to=25149&submit=Update

If you are interested in performance data to test your changes, you can extract the results from the LLVM buildmaster at:

http://lab.llvm.org:8011/builders/polly-perf-O3/builds/2942/steps/lnt.nightly-test/logs/report.json

with 2942 being one of the latest successful builds. By going backwards or forwards you should find other builds, provided they have been successful.

There should be a standard function for the Wilcoxon/Mann-Whitney test in scipy, so in case you are interested, adding these reliability numbers as a first step seems to be a simple and purely beneficial commit.

> - Make sure the board has low background noise
>   Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
>   LNT: benchmark.sh

I am a little sceptical about this. Machines should generally not be noisy. However, if for some reason there is noise on the machine, the noise is as likely to appear during this pre-noise test as during the actual benchmark runs, maybe during both, but maybe also only during the benchmark. So I am afraid we might often run into the situation where this test says OK but the later run is still suffering from noise. I would probably prefer to make the previous point of reporting reliability work well, and then we can see for each test/benchmark whether there was noise involved or not.

All the best,
Tobias
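For reference, a minimal sketch of the suggested check using scipy's mannwhitneyu (available in scipy.stats; the 'alternative' keyword needs a reasonably recent scipy). The sample data and threshold below are made up for illustration, not values from LNT.

    # Hypothetical sketch of the suggested reliability check: given the
    # per-run samples for a test before and after a change, use the
    # Mann-Whitney U test from scipy to decide whether the difference is
    # statistically significant. Sample data below is made up.
    from scipy.stats import mannwhitneyu

    baseline  = [1.02, 1.01, 1.03, 1.02, 1.04, 1.01, 1.02, 1.03, 1.02, 1.01]
    candidate = [1.08, 1.07, 1.09, 1.08, 1.07, 1.08, 1.09, 1.08, 1.07, 1.08]

    stat, p_value = mannwhitneyu(baseline, candidate, alternative='two-sided')

    ALPHA = 0.05  # required confidence: 95% (illustrative choice)
    if p_value < ALPHA:
        print("significant change (p=%.4f)" % p_value)
    else:
        print("difference not distinguishable from noise (p=%.4f)" % p_value)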
On 30 April 2014 07:50, Tobias Grosser <tobias at grosser.es> wrote:
> In general, I see such changes as a second step. First, we want to have a
> system in place that allows us to reliably detect if a benchmark is noisy or
> not; second, we want to increase the number of benchmarks that are not noisy
> and where we can use the results.

I personally use the test-suite for correctness, not performance, and would not like to have its run time increased by any means. As discussed in the BoF last year, I'd appreciate it if we could separate the test run from the benchmark run before we make any changes. I want to have a separate benchmark bot for the subset that makes sense to use as benchmarks, but I don't want the noise of the rest.

> 1. Get 5-10 samples per run
> 2. Do the Wilcoxon/Mann-Whitney test

5-10 samples on an ARM board is not feasible. Currently it takes 1 hour to run the whole set; making it run for 5-10 hours would reduce its value to zero.

> I am a little sceptical about this. Machines should generally not be noisy.

ARM machines work at a much lower power level than Intel ones. The scheduler is a lot more aggressive and the quality of the peripherals is *a lot* worse. Even if you set up the board for benchmarks (fix the scheduler, put everything up to 11), the quality of the external hardware (USB, SD, eMMC, etc.) and their drivers do a lot of damage to any meaningful number you may extract, if the moon is full and Jupiter is in Sagittarius. So...

> However, if for some reason there is noise on the machine, the noise is as
> likely to appear during this pre-noise test as during the actual benchmark
> runs, maybe during both, but maybe also only during the benchmark. So I am
> afraid we might often run into the situation where this test says OK but the
> later run is still suffering from noise.

...this is not entirely true, on ARM. We may be getting server quality hardware for AArch64 any time now, but it's very unlikely that we'll *ever* get quality 32-bit test boards.

cheers,
--renato
On Apr 29, 2014, at 3:49 PM, Yi Kong <Yi.Kong at arm.com> wrote:
> [...]
> - Only show performance changes with high confidence in the summary report
>   To investigate the correlation between a program's run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this, we can hide low-confidence results from the summary report; they remain available and are marked in colour in the detailed report for anyone interested.
>   LNT: [PATCH 3/3] Ignore tests with very short run time

I think this is harder than it sounds. I just looked through some results from today, and I found a benchmark that showed a real regression of 0.01s in a benchmark running ~0.05s. That would have been filtered out by your patch. Do you have some intuition that tests with short run times are where the noise is coming from? I feel like it is not a problem that is unique to short runs, but rather one specific to particular benchmarks.

> - Make sure the board has low background noise
>   Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
>   LNT: benchmark.sh

I wrote a very similar Python script for checking system baselines. I think this is a great idea. My script ran several non-compiler-related tasks which *should* be stable on any machine; by "should" I mean they are long running and intentionally only test one aspect of the system. I did not gate results on these runs, but instead submitted the results to LNT and then allowed it to report on any anomalies it detected. So far this process has detected some problems on our testing machines. If there is interest, I can share that script.
> In my prototype implementation, the summary report becomes much more useful. There are almost no noisy readings, while small regressions are still detectable for long-running benchmark programs. The implementation is backwards compatible with older databases.
>
> Screenshots from a sample run are attached.
>
> Thanks for reading!
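Not the baseline script mentioned above, but a rough Python sketch of what such a pre-run check could look like; the reference time, tolerance, and workload here are all placeholder assumptions.

    # Hypothetical sketch of a pre-run baseline check, in the spirit of
    # benchmark.sh and the baseline script described above: time a fixed
    # CPU-bound workload, compare against a reference captured at machine
    # set-up, and refuse to start the benchmark run if the machine looks noisy.
    import sys
    import time

    REFERENCE_SECONDS = 2.50   # measured once during machine set-up (assumed)
    MAX_DEVIATION = 0.05       # abort/defer if more than 5% off the reference

    def baseline_workload():
        """Fixed CPU-bound task that should take a stable amount of time."""
        total = 0
        for i in range(5000000):
            total += i * i
        return total

    def measure_baseline(repeats=3):
        """Return the best (lowest-noise) of several timings."""
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            baseline_workload()
            times.append(time.perf_counter() - start)
        return min(times)

    if __name__ == "__main__":
        measured = measure_baseline()
        deviation = abs(measured - REFERENCE_SECONDS) / REFERENCE_SECONDS
        if deviation > MAX_DEVIATION:
            print("baseline off by %.1f%%, deferring run" % (deviation * 100))
            sys.exit(1)
        print("baseline OK (%.2fs), starting benchmark run" % measured)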
Hi Chris,

I think it would definitely be useful to share your script, so that we don't need to reinvent the wheel.

Thanks!

Kristof

> > - Make sure the board has low background noise
> >   Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
> >   LNT: benchmark.sh
>
> I wrote a very similar Python script for checking system baselines. I think this is a great idea. My script ran several non-compiler-related tasks which *should* be stable on any machine; by "should" I mean they are long running and intentionally only test one aspect of the system. I did not gate results on these runs, but instead submitted the results to LNT and then allowed it to report on any anomalies it detected. So far this process has detected some problems on our testing machines. If there is interest, I can share that script.