Chris Matthews
2013-Jun-27 18:14 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
There are a few things we have looked at with LNT runs, so I will share the insights we have had so far. A lot of the problems we have are artificially created by our test protocols rather than by the compiler changes themselves. I have been doing a lot of large-sample runs of single benchmarks to characterize them better. Some key points:

1) Some benchmarks are bi-modal or multi-modal; a single mean won't describe these well.
2) Some runs are pretty noisy and sometimes have very large single-sample spikes.
3) Most benchmarks don't regress most of the time.
4) Compile time is a pretty stable metric; execution time is not always.

and depending on what you are using LNT for:

5) A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples).
6) We also need to catch long, slow regressions.
7) Some of the "benchmarks" are really just correctness tests, and were not designed with repeatable measurement in mind.

As it stands now, we really can't detect small regressions or slow regressions, and there are a lot of false positives.

There are two things I am working on right now to help make regression detection more reliable: adaptive sampling and cluster-based regression flagging.

First, we need more samples per revision. But we really don't have time to do --multisample=10, since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply, it reruns benchmarks which are reported as regressed or improved. The idea being: if it is going to be flagged as a regression or improvement, get more data on those specific benchmarks to make sure that is the case. Adaptive sampling should reduce the false-positive regression flagging rate we see. We are able to do this based on LNT's provisional commit system. After a run, we submit all the results, but don't commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.

As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is that this can characterize multi-modal results, be resilient to short spikes, and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.

Probably obvious anyway, but since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is "stable" produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark by +5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.
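To make the adaptive-sampling flow concrete, here is a minimal sketch of the client-side loop described above. The submit_provisional/get_flagged_benchmarks/commit calls and the run_benchmark helper are hypothetical stand-ins, not LNT's real client or server API; only the flow (provisional submit, rerun what was flagged, then commit) comes from the description.

# Hypothetical sketch of the client-side adaptive sampling loop described
# above.  submit_provisional/get_flagged_benchmarks/commit and run_benchmark
# are stand-ins, not LNT's real interfaces; only the overall flow is taken
# from the description: provisional submit, rerun what was flagged, commit.

MAX_RERUNS = 4  # extra samples per flagged benchmark (assumed value)

def run_with_adaptive_sampling(benchmarks, server, run_benchmark):
    # First pass: one sample of everything, submitted but NOT committed.
    results = {name: [run_benchmark(name)] for name in benchmarks}
    run_id = server.submit_provisional(results)

    # Ask the server which benchmarks it would flag as regressed or improved.
    flagged = server.get_flagged_benchmarks(run_id)

    # Second pass: spend the extra time only on the flagged benchmarks.
    for name in flagged:
        results[name].extend(run_benchmark(name) for _ in range(MAX_RERUNS))

    # Resubmit with the extra samples and commit for real this time.
    final_run_id = server.submit_provisional(results)
    server.commit(final_run_id)
    return final_run_id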
Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 27, 2013, at 9:41 AM, Bob Wilson <bob.wilson at apple.com> wrote:

> On Jun 27, 2013, at 9:27 AM, Renato Golin <renato.golin at linaro.org> wrote:
>
>> On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote:
>>> We are looking for a good way/value to show the reliability of individual results in the UI. Do you have any experience with what a good measure of the reliability of test results is?
>>
>> Hi Tobi,
>>
>> I had a look at this a while ago, but never got around to actually working on it. My idea was to never use point changes as an indication of progress/regressions unless there was a significant change (2/3 sigma). What we should do is compare the current moving average (of K runs) with both the last average and the (N-K)th moving average (to make sure previous values included in the current moving average are not toning it down/up), and keep the biggest difference as the final result.
>>
>> We should also compare the current moving average with M non-overlapping moving averages before it, and calculate whether we're monotonically increasing, decreasing, or whether there is a difference of 2/3 sigma between the current moving average (N) and the (N-M)th one. That would give us an idea of the trends of each test.
>
> Chris Matthews has recently been working on implementing something similar to that. Chris, can you share some details?
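A minimal sketch of the moving-average comparison Renato proposes in the quoted message above. K, M and the "few sigma" threshold follow his description; the function name, data layout and the default of 3 sigma are only illustrative, not anything LNT actually does.

# Sketch of the quoted idea: compare the newest K-sample moving average
# against M older, non-overlapping K-sample windows and flag only when the
# difference exceeds a few sigma (2 or 3) of the older data.
from statistics import mean, stdev

def flag_change(samples, K=5, M=3, n_sigma=3.0):
    """samples: execution times for one benchmark, oldest first."""
    if len(samples) < K * (M + 1):
        return False  # not enough history to say anything yet
    current = mean(samples[-K:])
    # M non-overlapping K-sample windows immediately before the current one.
    older = [samples[-(i + 2) * K : -(i + 1) * K] for i in range(M)]
    baseline = [x for window in older for x in window]
    sigma = stdev(baseline) or 1e-12  # guard against perfectly flat data
    return abs(current - mean(baseline)) > n_sigma * sigma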
Renato Golin
2013-Jun-27 19:04 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Hi Chris,

Amazing that someone is finally looking at this with a proper background. You're much better equipped than I am to deal with it, so I'll trust your judgement, as I haven't paid much attention to benchmarks, more to correctness. Some comments inline.

On 27 June 2013 19:14, Chris Matthews <chris.matthews at apple.com> wrote:

> 1) Some benchmarks are bi-modal or multi-modal; a single mean won't describe these well.

True. My idea was to have a moving "measurement", with the basic one being the average, but others applied as well. It's possible that k-means can give you that, but I haven't understood what your vector space and distance measures will be, to guess.

> 2) Some runs are pretty noisy and sometimes have very large single-sample spikes.
> 3) Most benchmarks don't regress most of the time.

Most ARM benchmarks regress all the time, because both the signal and the noise are in milliseconds, where machine and OS interference play a crucial part. But they don't regress with time, and they keep their average AND deviation forever. So, if you can filter the noise on *all* benchmarks, it'd be great for ARM testing.

> 5) A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples).
> 6) We also need to catch long, slow regressions.

Yup. Moving peak and trend.

> 7) Some of the "benchmarks" are really just correctness tests, and were not designed with repeatable measurement in mind.

Yes. It would be great to move them to Application, and *not* time execution. Benchmarks are specifically designed to test execution time; applications aren't. If we think an application is so important that we want to measure it, we should actively change it into a benchmark, making sure it's actually performing the core functionality in a repeatable way and with enough confidence that noise isn't playing a part in the numbers. Just throwing it in and timing execution will create a school of red herrings.

> After a run, we submit all the results, but don't commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.

This is a great idea, and I think it could improve things at a much lower cost. It won't replace decent benchmarking strategies at the software level, but it will reduce the noise, hopefully enough to allow other analyses to be successful at an early stage.

> As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is that this can characterize multi-modal results, be resilient to short spikes, and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.

I'd like to understand that better (mostly for personal education). But it can be offline, if the rest of the list is not interested...

> You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark by +5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.

I have seen this too.

I think LNT has two modes, test and benchmark (I'm not sure how to switch between them): one tries to use all possible cores (unstable benchmarks) and the other runs on a single core all the way. I think we could assume that, for tests, we can use as much juice as we have available, and that for benchmarks, we could use less than the total number of cores (the practical number can vary depending on the arch). It's better to re-run some benchmarks 10 times but use 8 CPUs than to use only one...

cheers,
--renato
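Mostly to make the "vector space and distance measures" question above concrete: in the simplest reading, k-means could run directly on the one-dimensional set of timing samples for a single benchmark, with plain absolute distance. A rough sketch of that reading follows (an illustration only, not the prototype Chris describes):

# Rough sketch of 1-D k-means over one benchmark's timing samples, to find
# the modes of a bi- or multi-modal benchmark.  Plain absolute distance on
# execution times is the assumed "vector space"; illustration only.
import random

def kmeans_1d(samples, k=2, iterations=50, seed=0):
    random.seed(seed)
    centers = random.sample(samples, k)
    for _ in range(iterations):
        # Assign every sample to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (empty clusters stay put).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# A bi-modal benchmark: most runs near 1.0s, some near 1.3s.
samples = [1.01, 0.99, 1.02, 1.31, 1.00, 1.29, 1.02, 1.30, 0.98, 1.01]
print(kmeans_1d(samples, k=2))  # roughly [1.00, 1.30]

A new sample would then be attributed to its nearest mode, and only a sustained shift of a mode (or a new mode appearing) treated as a regression, rather than any single slow run.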
Chris Matthews
2013-Jun-27 21:11 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Just forwarding this to the list, my original reply was bounced.
David Tweed
2013-Jun-28 09:28 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
| First, we need more samples per revision. But we really don't have time to do --multisample=10, since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply, it reruns benchmarks which are reported as regressed or improved. The idea being: if it is going to be flagged as a regression or improvement, get more data on those specific benchmarks to make sure that is the case. Adaptive sampling should reduce the false-positive regression flagging rate we see. We are able to do this based on LNT's provisional commit system. After a run, we submit all the results, but don't commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.

| As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is that this can characterize multi-modal results, be resilient to short spikes, and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.

Basic question: I'm imagining the volume of data being dealt with isn't that large (as statistical datasets go), and you're discarding old values anyway (since we care whether we're regressing with respect to now rather than LLVM 1.1), so can't you just build a kernel density estimator of the "baseline" runtime and then estimate the probability that samples from a given codebase are happening "slower" than the baseline?

I suppose the drawback to not explicitly modelling the modes (with all its complications and tunings) is that you can't attempt to determine when a value is bigger than the lower cluster, even though it's smaller than the bigger cluster, and estimate whether that is evidence of a slowdown within the small-cluster regime. Still, that seems a bit complicated to do automatically.

(Incidentally, responding to the earlier email from Renato: I think you don't really want to compare moving averages, but rather use some statistical test to quantify whether the points within the "earlier window" are statistically significantly higher than those in the "later window"; all moving averages do is smear out useful information, which can be useful if you've just got far too many data points, but otherwise it doesn't really help.)

Cheers,
Dave
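As a rough illustration of the kernel-density suggestion, here is a sketch that fits a KDE to the baseline samples and asks how improbable it is for a baseline run to be at least as slow as the new ones; the function name, data and the 0.01 threshold are made up for the example.

# Sketch of the kernel-density suggestion: fit a KDE to the baseline runtimes
# and ask how improbable it is for a baseline run to be at least as slow as
# the new samples.  Illustration only; thresholds are arbitrary.
import numpy as np
from scipy.stats import gaussian_kde

def slower_than_baseline(baseline, new_samples, threshold=0.01):
    kde = gaussian_kde(baseline)
    # P(baseline >= x) for each new sample, via the KDE's CDF.
    tail_probs = [1.0 - kde.integrate_box_1d(-np.inf, x) for x in new_samples]
    # Flag when the new samples sit, on average, far out in the slow tail.
    return float(np.mean(tail_probs)) < threshold

baseline = np.random.normal(1.00, 0.02, size=200)  # ~1.0s with mild noise
regressed = np.random.normal(1.08, 0.02, size=5)   # ~8% slower
print(slower_than_baseline(baseline, regressed))   # almost certainly True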
Renato Golin
2013-Jun-28 09:43 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 28 June 2013 10:28, David Tweed <david.tweed at arm.com> wrote:

> (Incidentally, responding to the earlier email from Renato: I think you don't really want to compare moving averages, but rather use some statistical test to quantify whether the points within the "earlier window" are statistically significantly higher than those in the "later window"; all moving averages do is smear out useful information, which can be useful if you've just got far too many data points, but otherwise it doesn't really help.)

When your data is explicitly grouped, I'd agree with you. But all I can see from my results are hardware and OS flukes in the millisecond range, with no distinct modal signal in them. Chris said he knows of some; I haven't looked deeply enough, so I trust his judgement. What I don't want is to be treating noise groups as signal, that's all.

I think we probably need a few different approaches, depending on the benchmark, with moving averages being the simplest, which is why I suggested we implement them first. Sometimes, smoothing the line is all you need... ;)

cheers,
--renato
David Tweed
2013-Jun-28 13:06 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
| I think we probably need a few different approaches, depending on the benchmark, with moving averages being the simplest, which is why I suggested we implement them first. Sometimes, smoothing the line is all you need... ;)

That's a viewpoint; another one is that statisticians might well have very good reasons why they spend so long coming up with statistical tests: in order to create the most powerful tests, so they can deal with marginal quantities of data.

Cheers,
Dave
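For what the "statistical test on the two windows" idea from the earlier mail could look like in practice, here is a minimal sketch using a nonparametric Mann-Whitney U test; the function name, the window contents and the alpha value are illustrative only, not anything LNT does today.

# Sketch of the "test the two windows" idea: flag a regression when the later
# window of execution times is statistically significantly slower than the
# earlier one.  Illustration only.
from scipy.stats import mannwhitneyu

def looks_like_regression(earlier, later, alpha=0.01):
    """earlier/later: execution-time samples for one benchmark."""
    # One-sided test: are the 'later' samples drawn from a slower distribution?
    _, p_value = mannwhitneyu(earlier, later, alternative='less')
    return p_value < alpha

# Made-up numbers: a ~5% slowdown buried in millisecond-level noise.
earlier = [1.00, 1.02, 0.99, 1.01, 1.00, 1.03, 0.98, 1.01]
later   = [1.05, 1.07, 1.04, 1.06, 1.08, 1.05, 1.04, 1.06]
print(looks_like_regression(earlier, later))  # True for this data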
Renato Golin
2013-Jun-28 13:28 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 28 June 2013 14:06, David Tweed <david.tweed at arm.com> wrote:

> That's a viewpoint; another one is that statisticians might well have very good reasons why they spend so long coming up with statistical tests: in order to create the most powerful tests, so they can deal with marginal quantities of data.

87.35% of all statistics are made up, 55.12% of them could have been done a lot simpler and a lot quicker, and only 1.99% (AER) actually make your life better.

I'm glad that Chris already has working solutions, and I'd be happy to see them go live before any professional statistician has had a look at them. ;)

cheers,
--renato
Chris Matthews
2013-Jun-28 18:45 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
I should describe the cost of false negatives and false positives, since I think it matters for how this problem is approached. A false negative means we miss a real regression --- we don't want that. A false positive means somebody has to spend time looking at and reproducing a regression when there is not one --- bad too. Given this tradeoff, I think we want to tend towards false positives (over false negatives) strictly as a matter of compiler quality, but if we can throw more data at the problem to reduce false positives, that is good.

I have discussed the classification problem before with people off-list. The problem we face is that the space is pretty big for manual classification; at worst it is: number of benchmarks * number of architectures * sets of flags * metrics collected. Perhaps some sensible defaults could overcome that. Also, to classify well, you probably need a lot of samples as a baseline.

There certainly are lots of tests for small data. As far as I know, though, they rely more heavily on assumptions that in our case would have to be proven. That said, I'd never object to a professional's opinion on this problem!

Chris Matthews
chris.matthews@.com
(408) 783-6335