Renato Golin
2013-Jun-30 18:30 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 30 June 2013 10:14, Anton Korobeynikov <anton at korobeynikov.info> wrote:

> 1. Increasing sample size to at least 5-10

That's not feasible on slower systems. A single data point takes 1 hour on the fastest ARM board I can get (Chromebook). Getting 10 samples at different commits will give you similar accuracy if behaviour doesn't change, and you can rely on 10-point blocks before and after each change to have the same result.

What won't happen is one commit makes it truly faster and the very next slow again (or slow/fast), so all we need to measure is for each commit, if that was the one that made all next runs slower/faster, and that we can get with several commits after the culprit, since the probability that another (unrelated) commit will change the behaviour is small.

This is why I proposed something like moving averages. Not because it's the best statistical model, but because it works around a concrete problem we have. I don't care which model/tool you use, as long as it doesn't mean I'll have to wait 10 hours for a result, or sift through hundreds of commits every time I see a regression in performance. What that will do, for sure, is make me ignore small regressions, since they won't be worth the massive work to find the real culprit.

If I had a team of 10 people just to look at regressions all day long, I'd ask them to make a proper statistical model and go do more interesting things...

cheers,
--renato
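The "10-point blocks before and after each change" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not anything LNT implements: `largest_shift` is a hypothetical helper, there is one timing per commit, and the data below is synthetic.

```python
from statistics import mean


def largest_shift(times, block=10):
    """Find the commit whose 'after' block differs most from its 'before' block.

    times[i] is the single measured runtime at the i-th commit (assumption:
    one sample per commit). For each candidate commit i, compare the mean of
    the `block` runs before i with the mean of the `block` runs starting at i.
    Returns (index, relative_change), or None if there are too few points.
    """
    best = None
    for i in range(block, len(times) - block + 1):
        before = mean(times[i - block:i])
        after = mean(times[i:i + block])
        change = (after - before) / before
        if best is None or abs(change) > abs(best[1]):
            best = (i, change)
    return best


# Synthetic series: ten noisy runs around 1.00s, then ten around 1.10s.
times = [1.00, 1.02, 0.98, 1.01, 0.99] * 2 + [1.10, 1.12, 1.08, 1.11, 1.09] * 2
index, change = largest_shift(times, block=5)
print(index, round(change, 3))  # index 10 is where the synthetic step was placed
```

The candidate with the largest shift is only a suspect. The sketch relies on the assumption stated in the message: an unrelated commit is unlikely to change behaviour inside the same block.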
Anton Korobeynikov
2013-Jun-30 19:08 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
> Getting 10 samples at different commits will give you similar accuracy if
> behaviour doesn't change, and you can rely on 10-point blocks before and
> after each change to have the same result.

Right. But this way you will have a 10-commit delay. So, you will need 3-4 additional test runs to pinpoint the offending commit in the worst case.

> This is why I proposed something like moving averages.

A moving average will "smooth" the result. So, only really big changes will be caught by it.

--
With best regards, Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University
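The "3-4 additional test runs" figure is consistent with a binary search over the 10 candidate commits. A small sketch of that arithmetic, assuming (optimistically, given the noise discussed in this thread) that each extra run gives a reliable slow/fast verdict; `worst_case_bisect_runs` is a hypothetical name:

```python
def worst_case_bisect_runs(candidates):
    """Worst-case number of runs to isolate one culprit among `candidates`
    commits when each run halves the remaining range.

    Equivalent to ceil(log2(candidates)), computed with integers only.
    """
    if candidates < 1:
        raise ValueError("need at least one candidate commit")
    return (candidates - 1).bit_length()


for n in (5, 10, 50):
    print(n, "candidate commits ->", worst_case_bisect_runs(n), "runs at most")
```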
Chris Matthews
2013-Jul-01 01:02 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
This is probably another area where a bit of dynamic behavior could help. When we find a regression, kick off some runs to bisect back to where it manifests. This is what we would be doing manually anyway. We could just search back with the set of regressing benchmarks, meaning the whole suite does not have to be run (unless it is a global regression).

There are situations where we see commits which make things slower then faster again, but so far those seem to be from experimental features being switched on then off.

The problem with the moving averages is they really don’t behave well when the benchmark is naturally bimodal. One thing that LNT is doing to help “smooth” the results for you is by presenting the min of the data at a particular revision, which (hopefully) is approximating the actual runtime without noise. That works well with a lot of samples per revision, but not for across revisions, where we really need the smoothing. One way to explore this is to turn

Ignoring small regressions is an interesting problem. Do it too many times, slowness creeps in. But you are correct, no one wants to fix a small regression. There is a bit of a value computation that we are all doing when we watch the results, which is not explicit in the software or documentation right now. Mine is along the lines of: small regression in important benchmarks with certain flags matters, and bigger regressions in less important benchmarks and flags matter too, etc.

We also lack any way to coordinate or annotate regressions, that is a whole separate problem though.

Another idea I have been toying with is building a "change of interest" model, where we can explicitly tag particular revisions as impacting performance, then test them preferentially. That could allow the effort to be focused to revisions where it might best have an effect. I don’t know if that would play out well in reality though.
Chris Matthews chris.matthews@.com (408) 783-6335
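The "bisect back with only the regressing benchmarks" suggestion could look something like the sketch below. It is not LNT code: `bisect_regression` and the `is_regressed` callback are hypothetical, and the search assumes that once a revision is slow, all later revisions stay slow. That assumption fails in the on-then-off cases mentioned above.

```python
def bisect_regression(revisions, is_regressed):
    """Return the first regressed revision in an ordered list.

    revisions[0] is assumed known-good and revisions[-1] known-bad.
    is_regressed(rev) is a hypothetical callback that builds `rev`, runs only
    the benchmarks that regressed, and returns True if they are still slow.
    """
    if len(revisions) < 2:
        raise ValueError("need a known-good and a known-bad revision")
    lo, hi = 0, len(revisions) - 1  # lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_regressed(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]


# Toy usage: pretend revision 107 introduced the slowdown.
revisions = list(range(100, 111))
print(bisect_regression(revisions, lambda rev: rev >= 107))
```

Because the callback only runs the affected benchmarks, each probe should cost a fraction of a full suite run, which is the point of the proposal.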
James Courtier-Dutton
2013-Jul-01 05:51 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On Jun 30, 2013 8:12 PM, "Anton Korobeynikov" <anton at korobeynikov.info> wrote:

> > Getting 10 samples at different commits will give you similar accuracy if
> > behaviour doesn't change, and you can rely on 10-point blocks before and
> > after each change to have the same result.
>
> Right. But this way you will have 10-commits delay. So, you will need
> 3-4 additional test runs to pinpoint the offending commit in the worst
> case.
>
> > This is why I proposed something like moving averages.
>
> Moving average will "smooth" the result. So, only really big changes
> will be caught by it.

Like any result in statistics, the result should be quoted together with a +/- figure derived from the statistical method used. Generally, low sample size means high +/-.

Another option is to take a deterministic approach to measurement. The code should execute the same CPU instructions every time it is run, so some method to measure just these instructions should be attempted. Maybe processing qemu logs when llvm is run inside qemu might give a possible solution?

James
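One way to attach a +/- figure to a timing is a 95% confidence interval on the mean, sketched below. The choice of method is an assumption made here for illustration: it treats samples as independent and roughly normal, which a bimodal benchmark would violate. `mean_with_margin` is a hypothetical helper, and the table holds standard two-sided 95% Student's t critical values.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
       6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_with_margin(samples):
    """Return (mean, margin) so the result can be quoted as mean +/- margin."""
    n = len(samples)
    if not 2 <= n <= 10:
        raise ValueError("this sketch only tabulates 2 to 10 samples")
    margin = T95[n - 1] * stdev(samples) / sqrt(n)
    return mean(samples), margin


few = [1.00, 1.04, 0.96]
many = [1.00, 1.04, 0.96, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 1.00]
for label, data in (("3 samples", few), ("10 samples", many)):
    m, margin = mean_with_margin(data)
    print(f"{label}: {m:.3f} +/- {margin:.3f}")
```

With similar spread, the three-sample margin comes out several times wider than the ten-sample one, which is the "low sample size means high +/-" point.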
Renato Golin
2013-Jul-01 07:47 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 30 June 2013 20:08, Anton Korobeynikov <anton at korobeynikov.info> wrote:

> > Getting 10 samples at different commits will give you similar accuracy if
> > behaviour doesn't change, and you can rely on 10-point blocks before and
> > after each change to have the same result.
> Right. But this way you will have 10-commits delay. So, you will need
> 3-4 additional test runs to pinpoint the offending commit in the worst
> case.

Well, 10 was an example, but yes, you'll always have N commit-groups delay. My assumption is that some (say 5) commit-groups delay is not a critical issue if it happens once in a while, as opposed to having to examine every hike on a range of several dozen commits.

> > This is why I proposed something like moving averages.
> Moving average will "smooth" the result. So, only really big changes
> will be caught by it.

Absolutely. Smoothing is bad, but it's better than what we have, and at least it would catch big regressions. Today, not even the big ones are being caught. You don't have to throw away the original data-points, you just run a moving average to pinpoint big changes, where the confidence that a regression occurred is high. In parallel, you can still use the same data-points to do more refined analysis, and even cross-reference multiple analyses' data to give you even more confidence.

Anton and David, I could not agree with you more on what's necessary to have a good analysis, I just wish we had something cruder but sooner while we develop the perfect statistical model. I believe Chris is doing that now. So, whatever is wrong with his analysis, let's just wait and see how it turns out, and how we can improve further. For now, anything will be an improvement.

cheers,
--renato
Renato Golin
2013-Jul-01 16:41 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On 1 July 2013 02:02, Chris Matthews <chris.matthews at apple.com> wrote:

> One thing that LNT is doing to help “smooth” the results for you is by
> presenting the min of the data at a particular revision, which (hopefully)
> is approximating the actual runtime without noise.

That's an interesting idea, as you said, if you run multiple times on every revision.

On ARM, every run takes *at least* 1h, other architectures might be a lot worse. It'd be very important on those architectures if you could extract point information from group data, and min doesn't fit in that model. You could take min from a group of runs, but again, that's no different than moving averages. Though, "moving mins" might make more sense than "moving averages" for the reasons you gave.

Also, on tests that take as long as noise to run (0.010s or less on A15), the minimum is not relevant, since runtime will flatten everything under 0.010 onto 0.010, making your test always report 0.010, even when there are regressions.

I really cannot see how you can statistically enhance data in a scenario where the measuring rod is larger than the signal. We need to change the wannabe-benchmarks to behave like proper benchmarks, and move everything else into "Applications" for correctness and specifically NOT time them. Less is more.

> That works well with a lot of samples per revision, but not for across
> revisions, where we really need the smoothing. One way to explore this is
> to turn

I was really looking forward to hearing the end of that sentence... ;)

> We also lack any way to coordinate or annotate regressions, that is a whole
> separate problem though.

Yup. I'm having visions of tag clouds, bugzilla integration, cross-architectural regression detection, etc. But I'll ignore that for now, let's solve one big problem at a time. ;)

cheers,
--renato
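Both points in this message can be illustrated with a toy model. Everything here is an assumption for illustration: `reported_time` models the timer as a simple floor at 0.010s (real timers may round differently), and `moving_min` is a hypothetical trailing-window minimum over one sample per commit.

```python
def reported_time(true_time, floor=0.010):
    """Toy timer model: anything faster than `floor` is reported as `floor`."""
    return max(true_time, floor)


def moving_min(times, window=5):
    """Trailing minimum over the last `window` commits (fewer at the start)."""
    return [min(times[max(0, i - window + 1):i + 1]) for i in range(len(times))]


# A test that doubles from 4ms to 8ms looks unchanged under this timer model.
print(reported_time(0.004), reported_time(0.008))

# For longer tests, a moving min ignores one-off slow outliers (1.07 here)
# but follows a sustained shift once the whole window has moved.
series = [1.05, 1.00, 1.07, 1.20, 1.22, 1.21]
print(moving_min(series, window=3))
```

Under this model no amount of post-processing recovers the 2x regression in the short test, since the information is lost at measurement time.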
Jakob Stoklund Olesen
2013-Jul-01 18:13 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrustructure
On Jun 30, 2013, at 6:02 PM, Chris Matthews <chris.matthews at apple.com> wrote:

> This is probably another area where a bit of dynamic behavior could help. When we find a regression, kick off some runs to bisect back to where it manifests. This is what we would be doing manually anyway. We could just search back with the set of regressing benchmarks, meaning the whole suite does not have to be run (unless it is a global regression).
>
> There are situations where we see commits which make things slower then faster again, but so far those seem to be from experimental features being switched on then off.

This is an interesting paper: people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf

"However, caches and branch predictors make performance dependent on machine-specific parameters and the exact layout of code, stack frames, and heap objects. A single binary constitutes just one sample from the space of program layouts, regardless of the number of runs. Since compiler optimizations and code changes also alter layout, it is currently impossible to distinguish the impact of an optimization from that of its layout effects."

"We find that the performance impact of -O3 over -O2 optimizations is indistinguishable from random noise."

Thanks,
/jakob