Mikhail Zolotukhin
2015-May-18 18:24 UTC
[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives
Hi Chris and others! I totally support any work in this direction. In the current state LNT’s regression detection system is too noisy, which makes it almost impossible to use in some cases. If after each run a developer gets a dozen of ‘regressions’, none of which happens to be real, he/she won’t care about such reports after a while. We clearly need to filter out as much noise as we can - and as it turns out even simplest techniques could help here. For example, the technique I used (which you mentioned earlier) takes ~15 lines of code to implement and filters out almost all noise in our internal data-sets. It’d be really cool to have something more scientifically-proven though:) One thing to add from me - I think we should try to do our best in assumption that we don’t have enough samples. Of course, the more data we have - the better, but in many cases we can’t (or we don’t want) to increase number os samples, since it dramatically increases testing time. That’s not to discourage anyone from increasing number of samples, or adding techniques relying on a significant number of samples, but rather to try mining as many ‘samples’ as possible from the data we have - e.g. I absolutely agree with your idea to pass more than 1 previous run. Thanks, Michael> On May 18, 2015, at 9:39 AM, Kristof Beyls <kristof.beyls at arm.com> wrote: > > Thanks for raising this, Chris! > > I also think that improving the signal-to-noise ratio in the performance > reports produced by LNT are essential to make the performance-tracking > bots useful and effective. > > Our experience, using LNT internally, has been that if the number of false > positives are low enough (lower than about half a dozen per report or day), > they become useable, leaving only a little bit of manual investigation work > to detect if a particular change was significant or in the noise. Yes, ideally > the automated noise detection should be perfect; but even if it's not perfect, > it will already be a massive win. > > I have some further ideas and remarks below. > > Thanks, > > Kristof > >> -----Original Message----- >> From: llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at cs.uiuc.edu> [mailto:llvmdev-bounces at cs.uiuc.edu <mailto:llvmdev-bounces at cs.uiuc.edu>] >> On Behalf Of Chris Matthews >> Sent: 15 May 2015 22:25 >> To: LLVM Developers Mailing List >> Subject: [LLVMdev] Proposal: change LNT’s regression detection algorithm >> and how it is used to reduce false positives >> >> tl;dr in low data situations we don’t look at past information, and that >> increases the false positive regression rate. We should look at the >> possibly incorrect recent past runs to fix that. >> >> Motivation: LNT’s current regression detection system has false positive >> rate that is too high to make it useful. With test suites as large as >> the llvm “test-suite” a single report will show hundreds of regressions. >> The false positive rate is so high the reports are ignored because it is >> impossible for a human to triage them, large performance problems are >> lost in the noise, small important regressions never even have a chance. >> Later today I am going to commit a new unit test to LNT with 40 of my >> favorite regression patterns. It has gems such as flat but noisy line, >> 5% regression in 5% noise, bimodal, and a slow increase, we fail to >> classify most of these correctly right now. They are not trick >> questions, all are obvious regressions or non-regressions, that are >> plainly visible. I want us to correctly classify them all! > > That's a great idea! > Out of all of the ideas in this email, I think this is the most important > one to implement first. > >> Some context: LNTs regression detection algorithm as I understand it: >> >> detect(current run’s samples, last runs samples) —> improve, regress or >> unchanged. >> >> # when recovering from errors performance should not be counted >> Current or last run failed -> unchanged >> >> delta = min(current samples) - min(prev samples) > > I am not convinced that "min" is the best way to define the delta. > It makes the assumption that the "true" performance of code generated by llvm > is the fastest it was ever seen running. I think this isn't the correct way > to model e.g. programs with bimodal behaviour, nor programs with a normal > distribution. I'm afraid I don't have a better solution, but I think the > Mann Whitney U test - which tries to determine if the sample points seem > to indicate different underlying distributions - is closer to what we > really ought to use to detect if a regression is "real". This way, it models > that a fixed program, when run multiple times, has a distribution of > performance. I think that using "min" makes too many broken assumptions on > what the distribution can look like. > >> Ideas: >> >> -Try and get more samples in as many places as possible. Maybe — >> multisample=4 should be the default? Make bots run more often (I have >> already done this on green dragon). > > FWIW, the Cortex-A53 performance tracker I've set up recently uses > multisample=3. The Cortex-A53 is a slower/more energy-efficient core, > so it takes about 6 hours to do a LLVM rebuild + 3 runs of the LNT > benchmarks (see http://llvm.org/perf/db_default/v4/nts/machine/39 <http://llvm.org/perf/db_default/v4/nts/machine/39>). > BTW, what is "green dragon"? > >> - Use recent past run information to enhance single sample regression >> detection. I think we should add a lookback window, and model the >> recent past. I tired a technique suggested by Mikhail Zolotukhin of >> computing delta as the smallest difference between current and all the >> previous samples. It was far more effective. Alternately we could try >> a confidence interval generated from previous, though that may not work >> on bimodal tests. > > The noise levels per individual program are often dependent on the > micro-architecture of the core it runs on. Before setting up the Cortex-A53 > performance tracking bot, I've done a bit of analysis to find out what the noise > levels are per program across a Cortex-A53, a Cortex-A57 and a Core i7 CPU. In > attachment is an example of a chart for just one program, indicating that the noise level is > sometimes dependent on the micro-architecture of the core it runs on. Whereas a > Mann-Withney U - or similar - test would probably find - given enough data > points - what should be considered noise and what not; there may be a way to > run the test-suite in benchmark mode many times when a board gets set up, and analyse > the results of that. The idea is that this way, the noisiness of the board as setup > for fixed binaries could be measured, and that information could be used when not > enough sample points are available. > (FWIW: for this program, the noisiness seems to come from noisiness in the number > of branch mispredicts). > BTW – graphs like the one in attachment make me think that the LNT webUI should be showing > sample points by default instead of line graphs showing the minimum execution time > per build number. > > > >> - Currently prev_samples is almost always just one other run, probably >> with only one sample itself. Lets give this more samples to work with. >> Start passing more previous run data to all uses of the algorithm, in >> most places we intentionally limit the computation to current=run and >> previous=run-1, lets do something like previous=run-[1-10]. The risk in >> this approach is that regression noise in the look back window could >> trigger a false negative (we miss detecting a regression). I think this >> is acceptable since we already miss lots of them because the reports are >> not actionable. >> >> - Given the choice between false positive and false negative, lets err >> towards false negative. We need to have manageable number of >> regressions detected or else we can’t act on them. > > This sounds like a good idea to me. Let's first make sure we have a working > system of (semi-?)automatically detecting at least a good portion of the > significant performance regression. After that we can fine tune to reduce > false negatives to catch a larger part of all significant performance > regressions. > > >> >> Any objections to me implementing these ideas? > > Absolutely not. Once implemented, we probably ought to have an idea about how > to test which combination of methods works best in practice. Could the > sample points you’re going to add to the LNT unit tests help in testing which > combination of methods work best? > > I've got 2 further ideas, based on observations from the data coming from the > Cortex-A53 performance tracker that I added about 10 days ago - see > http://llvm.org/perf/db_default/v4/nts/machine/39 <http://llvm.org/perf/db_default/v4/nts/machine/39>. > I'll be posting patches for review for these soon: > > 1. About 20 of the 300-ish programs that get run in benchmark-only mode run > for less than 10 milliseconds. These 20 programs are one of the main sources > of noisiness. We should just not run these programs in benchmark-only mode. > Or - alternatively we should make them run a bit longer, so that they are less > noisy. > > 2. The board I'm running the Cortex-A53 performance tracker on is a big.LITTLE > system with 2 Cortex-A57s and 4 Cortex-A53s. To build the benchmark binaries, > I'm using all cores, to make the turn-around time of the bot as fast as possible. > However, this leads to huge noise levels on the "compile_time" metric, as sometimes > a binary gets compiled on a Cortex-A53 and sometimes on a Cortex-A57. For this > board specifically, it just shouldn't be reporting compile_time at all, since the > numbers are meaningless from a performance-tracking use case. > > > Another thought: if we could reduce the overall run-time of the LNT run in > benchmark-only mode, we could run more "multi-samples" in the same amount of > time. I did a quick analysis on whether it would be worthwhile to invest effort > in making some of the long-running programs in the test-suite run shorter in > benchmarking mode. On the Cortex-A53 board, it shows that the 27 longest-running > programs out of the 300-ish consume about half the run-time. If we could easily > make these 27 programs run an order-of-magnitude less long, we could almost halve > the total execution time of the test-suite, and hence run twice the number of > samples in the same amount of time. The longest running programs I’ve found are, > sorted: > > 0. 7.23% cumulative (7.23% - 417.15s this program) nts.MultiSource/Benchmarks/PAQ8p/paq8p.exec > 1. 13.74% cumulative (6.51% - 375.84s this program) nts.MultiSource/Benchmarks/SciMark2-C/scimark2.exec > 2. 18.83% cumulative (5.08% - 293.16s this program) nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.exec > 3. 21.60% cumulative (2.77% - 160.02s this program) nts.MultiSource/Benchmarks/mafft/pairlocalalign.exec > 4. 24.01% cumulative (2.41% - 138.98s this program) nts.SingleSource/Benchmarks/CoyoteBench/almabench.exec > 5. 26.32% cumulative (2.32% - 133.59s this program) nts.MultiSource/Applications/lua/lua.exec > 6. 28.26% cumulative (1.94% - 111.80s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.exec > 7. 30.11% cumulative (1.85% - 106.56s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.exec > 8. 31.60% cumulative (1.49% - 86.00s this program) nts.SingleSource/Benchmarks/CoyoteBench/huffbench.exec > 9. 32.75% cumulative (1.15% - 66.37s this program) nts.MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.exec > 10. 33.90% cumulative (1.15% - 66.13s this program) nts.MultiSource/Applications/hexxagon/hexxagon.exec > 11. 35.04% cumulative (1.14% - 65.98s this program) nts.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.exec > 12. 36.14% cumulative (1.10% - 63.21s this program) nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.exec > 13. 37.22% cumulative (1.08% - 62.35s this program) nts.SingleSource/Benchmarks/SmallPT/smallpt.exec > 14. 38.30% cumulative (1.08% - 62.30s this program) nts.MultiSource/Benchmarks/nbench/nbench.exec > 15. 39.37% cumulative (1.07% - 61.98s this program) nts.MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.exec > 16. 40.40% cumulative (1.03% - 59.50s this program) nts.MultiSource/Applications/SPASS/SPASS.exec > 17. 41.37% cumulative (0.97% - 55.74s this program) nts.MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.exec > 18. 42.33% cumulative (0.96% - 55.40s this program) nts.SingleSource/Benchmarks/Misc/ReedSolomon.exec > 19. 43.27% cumulative (0.94% - 54.34s this program) nts.MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.exec > 20. 44.21% cumulative (0.94% - 54.20s this program) nts.MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.exec > 21. 45.12% cumulative (0.91% - 52.46s this program) nts.SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.exec > 22. 46.01% cumulative (0.89% - 51.49s this program) nts.MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.exec > 23. 46.89% cumulative (0.88% - 50.66s this program) nts.MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.exec > 24. 47.73% cumulative (0.84% - 48.74s this program) nts.MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.exec > 25. 48.57% cumulative (0.84% - 48.43s this program) nts.MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.exec > 26. 49.40% cumulative (0.83% - 47.92s this program) nts.SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.exec > 27. 50.22% cumulative (0.81% - 46.92s this program) nts.MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.exec > 28. 51.03% cumulative (0.81% - 46.90s this program) nts.MultiSource/Applications/minisat/minisat.exec > 29. 51.81% cumulative (0.78% - 44.88s this program) nts.MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.exec > … > > > For example, there seem to be a lot of TSVC benchmarks in the longest running ones. > They all seem to take a command line parameter to define the number of iterations the main > loop in the benchmark should be running. Just tuning these, so all these benchmarks runs > O(1s) would make the overall test-suite already run significantly faster. > > For the Polybench test cases: they print out lots of floating point numbers – this > probably should be changed in the makefile so they don’t dump the matrices they work > on anymore. I’m not sure how big the impact will be on overall run time for the Polybench > benchmarks when doing this. > > <reg_detect_noise_.png>_______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/45ba1fca/attachment.html>
Sean Silva
2015-May-19 03:28 UTC
[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives
On Mon, May 18, 2015 at 11:24 AM, Mikhail Zolotukhin <mzolotukhin at apple.com> wrote:> Hi Chris and others! > > I totally support any work in this direction. > > In the current state LNT’s regression detection system is too noisy, which > makes it almost impossible to use in some cases. If after each run a > developer gets a dozen of ‘regressions’, none of which happens to be real, > he/she won’t care about such reports after a while. We clearly need to > filter out as much noise as we can - and as it turns out even simplest > techniques could help here. For example, the technique I used (which you > mentioned earlier) takes ~15 lines of code to implement and filters out > almost all noise in our internal data-sets. It’d be really cool to have > something more scientifically-proven though:) > > One thing to add from me - I think we should try to do our best in > assumption that we don’t have enough samples. Of course, the more data we > have - the better, but in many cases we can’t (or we don’t want) to > increase number os samples, since it dramatically increases testing time. >Why not just start out with only a few samples, then collect more for benchmarks that appear to have changed? -- Sean Silva> That’s not to discourage anyone from increasing number of samples, or > adding techniques relying on a significant number of samples, but rather to > try mining as many ‘samples’ as possible from the data we have - e.g. I > absolutely agree with your idea to pass more than 1 previous run. > > Thanks, > Michael > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/3ba578ef/attachment.html>
Chris Matthews
2015-May-19 04:02 UTC
[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives
The reruns flag already does that. It helps a bit, but only as long as the the benchmark is flagged as regressed.> On May 18, 2015, at 8:28 PM, Sean Silva <chisophugis at gmail.com> wrote: > > > > On Mon, May 18, 2015 at 11:24 AM, Mikhail Zolotukhin <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote: > Hi Chris and others! > > I totally support any work in this direction. > > In the current state LNT’s regression detection system is too noisy, which makes it almost impossible to use in some cases. If after each run a developer gets a dozen of ‘regressions’, none of which happens to be real, he/she won’t care about such reports after a while. We clearly need to filter out as much noise as we can - and as it turns out even simplest techniques could help here. For example, the technique I used (which you mentioned earlier) takes ~15 lines of code to implement and filters out almost all noise in our internal data-sets. It’d be really cool to have something more scientifically-proven though:) > > One thing to add from me - I think we should try to do our best in assumption that we don’t have enough samples. Of course, the more data we have - the better, but in many cases we can’t (or we don’t want) to increase number os samples, since it dramatically increases testing time. > > Why not just start out with only a few samples, then collect more for benchmarks that appear to have changed? > > -- Sean Silva > > That’s not to discourage anyone from increasing number of samples, or adding techniques relying on a significant number of samples, but rather to try mining as many ‘samples’ as possible from the data we have - e.g. I absolutely agree with your idea to pass more than 1 previous run. > > Thanks, > Michael > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150518/bbcc16bf/attachment.html>