All,

I'm curious to know if anyone is interested in tracking performance
(compile-time and/or execution-time) from a community perspective? This is a
much loftier goal than just supporting build bots. If so, I'd be happy to
propose a BOF at the upcoming Dev Meeting.

Chad
That is a great idea! :)

> On Aug 1, 2014, at 4:04 PM, Chad Rosier <mcrosier at codeaurora.org> wrote:
>
> All,
> I'm curious to know if anyone is interested in tracking performance
> (compile-time and/or execution-time) from a community perspective? This
> is a much loftier goal than just supporting build bots. If so, I'd be
> happy to propose a BOF at the upcoming Dev Meeting.
>
> Chad
On 2 August 2014 00:04, Chad Rosier <mcrosier at codeaurora.org> wrote:
> I'm curious to know if anyone is interested in tracking performance
> (compile-time and/or execution-time) from a community perspective? This
> is a much loftier goal than just supporting build bots. If so, I'd be
> happy to propose a BOF at the upcoming Dev Meeting.

Hi Chad,

I'm not sure I'll be at the US dev meeting this year, but we had a
performance BoF last year and I think we should have another, at least to
check the progress that has been made and to plan ahead. I'm sure Kristof,
Tobias and others will be very glad to see it, too.

If memory serves me well (it doesn't), this is the list of things we agreed
on doing, and their progress:

1. Performance-specific test-suite: a group of specific benchmarks that
should be tracked with the LNT infrastructure. Hal proposed to look at this,
and other people helped implement it. Last I heard there was some way of
running it, but I'm not sure how to do it. I'd love to have this as a
buildbot, though, so we can track its progress.

2. Statistical analysis of the LNT data. A lot of work has been put into
this and I believe it's a lot better. Anton, Yi and others have been
discussing and submitting many patches to make the LNT reporting
infrastructure more stable, less prone to noise and more useful all round.
It's not perfect yet, but it's a lot better than last year's.

Some other things happened since then that are also worth mentioning...

3. The LNT website got really unstable (Internal Server Error every other
day). This is the reason I stopped submitting results to it, since it would
make my bot fail. And because I still don't have a performance test-suite
bot, I don't really care much for the results. But with the noise reduction,
it'd be really interesting to monitor the progress, even of the full
test-suite; right now, though, I can't afford to have random failures. This
seriously needs looking into and would be a good topic for this BoF.

4. Big-endian results got in, and the infrastructure is now able to hold
"golden standard" results for both endiannesses. That's done and working
(AFAIK).

5. Renovation of the tests/benchmarks. The tests and benchmarks in the
test-suite are getting really old. One good example is the ClamAV
anti-virus, which is not just old: its results are bogus and cooked, which
makes it hard to tell signal from noise. Other benchmarks have such short
run times that they're almost pointless. Someone needs to go through the
things we test/benchmark and make sure they're valid and meaningful. This is
probably similar to, but more extensive than, item 1.

About non-test-suite benchmarking... I have been running some closed-source
benchmarks, but since we can't share any data on them, getting historical
relative results is almost pointless. I don't think we, as a community,
should worry about keeping open scores for them. Also, since almost everyone
is running them behind closed doors and fixing the bugs with reduced test
cases, I think that's the best deal we can get.

I've also tried a few other benchmarks, like running the ImageMagick
libraries, or Phoronix, and I have to say they're not really that great at
spotting regressions. ImageMagick will take a lot of work to make it into a
meaningful benchmark, and Phoronix is not really ready to be a compiler
benchmark (it only compiles once, with the system compiler, so you have to
heavily hack the scripts). If you're up to it, maybe you could hack those
into a nice package, but it won't be easy.
I know people have done it internally, like I did, but none of these scripts
are ready to be left out in the open, since they're either very ugly (like
mine) or contain private information...

Hope that helps...

cheers,
--renato
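[As an aside, here is a minimal sketch of the kind of noise-aware comparison
item 2 refers to: judging a change by comparing sample distributions rather
than single runs. This is not LNT's actual code; it assumes SciPy is
available, and the thresholds and timing samples are invented for
illustration.]

from statistics import median
from scipy.stats import mannwhitneyu  # assumption: SciPy is installed

def is_significant_regression(prev_samples, curr_samples,
                              alpha=0.05, min_delta=0.01):
    # Require both a practically meaningful slowdown (relative median delta
    # above min_delta) and statistical significance, so a single noisy
    # sample does not trigger a report.
    if median(curr_samples) <= median(prev_samples) * (1.0 + min_delta):
        return False
    # One-sided test: are the current times drawn from a slower distribution?
    _, p_value = mannwhitneyu(curr_samples, prev_samples, alternative='greater')
    return p_value < alpha

# Invented example: five timing samples (seconds) per revision.
prev = [1.000, 1.005, 1.010, 1.015, 1.020]
curr = [1.090, 1.095, 1.100, 1.105, 1.110]
print(is_significant_regression(prev, curr))   # True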
On 2 August 2014 00:40, Renato Golin <renato.golin at linaro.org> wrote:
> If memory serves me well (it doesn't), this is the list of things we
> agreed on doing, and their progress:
>
> 1. Performance-specific test-suite: a group of specific benchmarks
> that should be tracked with the LNT infrastructure. Hal proposed to
> look at this, and other people helped implement it. Last I heard there
> was some way of running it, but I'm not sure how to do it. I'd love to
> have this as a buildbot, though, so we can track its progress.

We have this in LNT now; it can be activated using `--benchmarking-only`.
It's about 50% faster than a full run and massively reduces the number of
false positives.

Chris has also posted a patch to rerun tests which the server said changed.
I haven't tried it yet, but it looks really promising.

> 2. Statistical analysis of the LNT data. A lot of work has been put
> into this and I believe it's a lot better. Anton, Yi and others have
> been discussing and submitting many patches to make the LNT reporting
> infrastructure more stable, less prone to noise and more useful all
> round. It's not perfect yet, but it's a lot better than last year's.

There's definitely lots of room for improvement. I'm going to propose some
more once we've solved the LNT stability issues.

> 3. The LNT website got really unstable (Internal Server Error every other
> day). This is the reason I stopped submitting results to it, since it
> would make my bot fail. And because I still don't have a performance
> test-suite bot, I don't really care much for the results. But with the
> noise reduction, it'd be really interesting to monitor the progress,
> even of the full test-suite; right now, though, I can't afford to have
> random failures. This seriously needs looking into and would be a good
> topic for this BoF.

We are now testing PostgreSQL as the database backend on the public perf
server, replacing the SQLite db. Hopefully this will improve stability and
system performance.

Also being discussed is moving the LNT server to a PaaS service, as that
offers higher availability and saves a lot of maintenance work. However,
this will need the community to provide or fund the hosting service.

-Yi
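[As an aside, a rough sketch of a nightly driver around the benchmark-only
subset described above might look like the following. The sandbox, compiler
and test-suite paths, the server URL, and the report location inside the
sandbox are all placeholder assumptions; only `--benchmarking-only` comes
from the thread itself.]

import glob
import subprocess

SANDBOX = "/home/buildbot/lnt-sandbox"        # assumption: scratch directory
CLANG = "/home/buildbot/install/bin/clang"    # assumption: compiler under test
TEST_SUITE = "/home/buildbot/src/test-suite"  # assumption: test-suite checkout
PERF_SERVER = "http://example.org/perf/submitRun"  # assumption: server URL

def run_nightly():
    # Run only the benchmark-quality subset of the test-suite.
    subprocess.check_call([
        "lnt", "runtest", "nt",
        "--sandbox", SANDBOX,
        "--cc", CLANG,
        "--test-suite", TEST_SUITE,
        "--benchmarking-only",
    ])
    # Submit the most recent report as a separate step, so a flaky server
    # does not turn the benchmark run itself into a buildbot failure.
    # (The report path may differ between LNT versions.)
    reports = sorted(glob.glob(SANDBOX + "/*/report.json"))
    if reports:
        subprocess.call(["lnt", "submit", PERF_SERVER, reports[-1]])

if __name__ == "__main__":
    run_nightly()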
Hi Chad,

I'm definitely interested and would have proposed such a BOF myself if you
hadn't beaten me to it :)

I think the BOF on the same topic last year was very productive in
identifying the most needed changes to enable tracking performance from a
community perspective. I think that by now most of the suggestions made at
that BOF have been implemented, and as the rest of the thread shows, we'll
hopefully soon have a few more performance tracking bots that produce useful
(i.e. low-noise) data.

I think it'll definitely be worthwhile to have a similar BOF this year.

Thanks,

Kristof

> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of Chad Rosier
> Sent: 02 August 2014 00:04
> To: llvmdev at cs.uiuc.edu
> Subject: [LLVMdev] Dev Meeting BOF: Performance Tracking
>
> All,
> I'm curious to know if anyone is interested in tracking performance
> (compile-time and/or execution-time) from a community perspective? This
> is a much loftier goal than just supporting build bots. If so, I'd be
> happy to propose a BOF at the upcoming Dev Meeting.
>
> Chad
On 04/08/2014 11:28, Kristof Beyls wrote:
> Hi Chad,
>
> I'm definitely interested and would have proposed such a BOF myself if
> you hadn't beaten me to it :)
>
> I think the BOF on the same topic last year was very productive in
> identifying the most needed changes to enable tracking performance
> from a community perspective. I think that by now most of the
> suggestions made at that BOF have been implemented, and as the rest
> of the thread shows, we'll hopefully soon have a few more performance
> tracking bots that produce useful (i.e. low-noise) data.
>
> I think it'll definitely be worthwhile to have a similar BOF this year.

There is little for me to add, except that I would also be interested in
such a BoF.

Cheers,
Tobias
Kristof,

> Hi Chad,
>
> I'm definitely interested and would have proposed such a BOF myself if
> you hadn't beaten me to it :)

Given you have much more context than I, I would be very happy to work
together on this BOF. :)

> I think the BOF on the same topic last year was very productive in
> identifying the most needed changes to enable tracking performance
> from a community perspective. I think that by now most of the
> suggestions made at that BOF have been implemented, and as the rest
> of the thread shows, we'll hopefully soon have a few more performance
> tracking bots that produce useful (i.e. low-noise) data.

I'll grep through the dev/commits list to get up to speed. For everyone's
reference, here are the notes from last year:

http://llvm.org/devmtg/2013-11/slides/BenchmarkBOFNotes.html

Kristof, feel free to comment further on these, if you feel so inclined.

> I think it'll definitely be worthwhile to have a similar BOF this year.

I'll start working on some notes.

Chad
Kristof,

Unfortunately, our merge process is less than ideal. It has vastly improved
over the past few months (years, I hear), but we still have times where we
bring in days/weeks worth of commits en masse. To that end, I've set up a
nightly performance run against the community branch, but it's still an
overwhelming amount of work to track/report/bisect regressions. As you
guessed, this is what motivated my initial email.

> On 5 August 2014 10:30, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>> The biggest problem that we were trying to solve this year was to produce
>> data without too much noise. I think with Renato hopefully setting up
>> a chromebook (Cortex-A15) soon there will finally be an ARM architecture
>> board producing useful data and pushing it into the central database.
>
> I haven't got around to finishing that work (at least not reporting to
> Perf anyway) because of the instability issues.
>
> I think getting Perf stable is priority 0 right now in the LLVM
> benchmarking field.

I agree 110%; we don't want the bots crying wolf. Otherwise, real issues
will fall on deaf ears.

>> I think this should be the main topic of the BoF this year: now that we
>> can produce useful data, what do we do with the data to actually improve
>> LLVM?
>
> With the benchmark LNT reporting meaningful results and warning users
> of spikes, I think we have at least the base covered.

I haven't used LNT in well over a year, but I recall Daniel Dunbar and I
having many discussions on how LNT could be improved. (Forgive me if any of
my suggestions have already been addressed. I'm playing catch-up at the
moment.)

> Further improvements I can think of would be to:
>
> * Allow Perf/LNT to fix a set of "golden standards" based on past releases
> * Mark the levels of those standards on every graph as coloured horizontal
>   lines
> * Add warning systems when the current values deviate from any past
>   golden standard

I agree. IIRC, there's functionality to set a baseline run to compare
against. Unfortunately, I think this is too coarse. It would be great if the
golden standard could be set on a per-benchmark basis. Thus, upward trending
benchmarks can have their standard updated while other benchmarks remain
static.

> * Allow Perf/LNT to report on differences between two distinct bots
> * Create GCC buildbots with the same configurations/architectures and
>   compare them to LLVM's
> * Mark golden standards for GCC releases, too, as a visual aid (no
>   warnings)
>
> * Implement trend detection (gradual decrease of performance) and
>   historical comparisons (against older releases)
> * Implement warning systems to the admin (not users) for such trends

Would it be useful to detect upward trends as well? Per my comment above, it
would be great to update the golden standard so we're always moving in the
right direction.

> * Improve spike detection to wait one or two more builds to make sure
>   the spike was an actual regression, but then email the original blame
>   list, not the current build's one.

I recall Daniel and I discussing this issue. IIRC, we considered an eager
approach where the current build would rerun the benchmark to verify the
spikes. However, I like the lazy detection approach you're suggesting. This
avoids long-running builds when there are real regressions.

> * Implement this feature on all warnings (previous runs, golden
>   standards, GCC comparisons)
>
> * Renovate the list of tests and benchmarks, extending their run times
>   dynamically instead of running them multiple times, getting the times
>   for the core functionality instead of whole-program timing, etc.

Could we create a minimal test-suite that includes only benchmarks that are
known to have little variance and run times greater than some decided-upon
threshold? With that in place we could begin the performance tracking (and
hopefully adoption) sooner.

> I agree with Kristof that, with the world of benchmarks being what it
> is, focusing on test-suite buildbots will probably give the best
> return on investment for the community.
>
> cheers,
> --renato

Kristof/All, I would be more than happy to contribute to this BOF in any way
I can.

Chad

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by
The Linux Foundation
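[As an aside, a minimal sketch of the "lazy" spike confirmation discussed
above might look like the following. This is not LNT's implementation: the
Build record, the 5% threshold, and the sample history are invented for
illustration. A spike at build N is only reported once the next couple of
builds confirm it, and the report is attributed to build N's blame list
rather than the latest build's.]

from collections import namedtuple

# score: benchmark run time in seconds; blame_list: committers since prior build
Build = namedtuple("Build", ["revision", "score", "blame_list"])

def confirmed_regressions(history, threshold=1.05, confirm_builds=2):
    """Yield (suspect_build, ratio) for spikes confirmed by later builds."""
    for i in range(1, len(history) - confirm_builds):
        baseline = history[i - 1].score
        suspect = history[i]
        if suspect.score <= baseline * threshold:
            continue  # no spike at this build
        followups = history[i + 1 : i + 1 + confirm_builds]
        # Only report if every follow-up build is still slow: a one-off
        # noisy sample gets dropped instead of spamming the blame list.
        if all(b.score > baseline * threshold for b in followups):
            yield suspect, suspect.score / baseline

history = [
    Build("r214000", 10.0, ["alice"]),
    Build("r214010", 10.1, ["bob"]),
    Build("r214020", 11.2, ["carol", "dave"]),  # spike introduced here
    Build("r214030", 11.3, ["erin"]),
    Build("r214040", 11.2, ["frank"]),
]

for build, ratio in confirmed_regressions(history):
    print("regression at %s (%.0f%% slower); notify: %s"
          % (build.revision, (ratio - 1) * 100, ", ".join(build.blame_list)))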
Hi Chad,

> I recall Daniel and I discussing this issue. IIRC, we considered an eager
> approach where the current build would rerun the benchmark to verify the
> spikes. However, I like the lazy detection approach you're suggesting.
> This avoids long-running builds when there are real regressions.

I think the real issue behind this one is that it would change LNT from
being a passive system to an active system. Currently the LNT tests can be
run in any way one wishes, so long as a report is produced. Similarly, we
can add other benchmarks to the report, which we currently do internally to
avoid putting things like EEMBC into LNT's build system.

With an "eager" approach as you mention, LNT would have to know how to ssh
onto certain boxen, run the command and get the result back. Which would be
a ton of work to do well!

Cheers,

James
On 5 August 2014 15:41, Chad Rosier <mcrosier at codeaurora.org> wrote:
> I agree. IIRC, there's functionality to set a baseline run to compare
> against. Unfortunately, I think this is too coarse. It would be great if
> the golden standard could be set on a per-benchmark basis. Thus, upward
> trending benchmarks can have their standard updated while other benchmarks
> remain static.

Having multiple "golden standards" showing as coloured lines would give the
visual impression of mostly the highest score, no matter which release that
was. Programmatically, it'd also allow us to enquire about the "best golden
standard" and always compare against it.

I think the historical values are important to show a graph of the progress
of releases, as well as the current revision, so you know how that
fluctuated in the past few years as well as in the past few weeks.

> Would it be useful to detect upward trends as well? Per my comment above,
> it would be great to update the golden standard so we're always moving in
> the right direction.

Upward trends are nice to know, but the "current standard" can be the
highest average of a set of N points since the last golden standard, and for
that we don't explicitly need to be tracking upward trends. If the last
moving average is NOT the current standard, then we must have detected a
downward slope since then.

> Could we create a minimal test-suite that includes only benchmarks that
> are known to have little variance and run times greater than some
> decided-upon threshold? With that in place we could begin the performance
> tracking (and hopefully adoption) sooner.

That's done. I haven't tested it yet because of the failures in Perf.

In the beginning, we could start with the set of golden/current/previous
standards for the benchmark-specific results, not the whole test-suite. As
we progress towards more stability, we can implement that for all, but still
allow configurations to only warn (user/admin) on the restricted set, to
avoid extra noise on noisy targets (like ARM).

cheers,
--renato
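[As an aside, a rough sketch of the per-benchmark bookkeeping described
above: the "current standard" is the best N-point moving average seen since
the last golden standard, and a drop of the latest window below it suggests
a downward slope. Scores here are higher-is-better, and the data, window
size and tolerance are invented for illustration.]

def moving_averages(scores, n):
    # Averages over every window of n consecutive scores.
    return [sum(scores[i:i + n]) / n for i in range(len(scores) - n + 1)]

def check_benchmark(scores, golden, n=4, tolerance=0.02):
    """Return (current_standard, warning) for one benchmark's score history."""
    windows = moving_averages(scores, n)
    current_standard = max(windows + [golden])
    latest = windows[-1]
    # If the most recent window is not the best one, performance has slipped
    # since the current standard was set.
    warning = latest < current_standard * (1.0 - tolerance)
    return current_standard, warning

# Example: scores since the last golden standard (set at 100.0).
history = [101.0, 102.0, 103.0, 102.5, 98.0, 97.5, 97.0, 96.5]
standard, regressed = check_benchmark(history, golden=100.0)
print("current standard: %.2f, downward trend: %s" % (standard, regressed))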
My experience from leading BOFs at other conferences is more talk than
action. So I suggest a different setup for this topic: how about having a
working group meeting with participants who can commit time to work on this
topic? The group meets for some time (TBD, during the conference of course),
discusses and brainstorms the options, and - as a first immediate outcome -
proposes a road forward in a 5-10 min report-out talk.

There might be other topics that could benefit from the working group format
as well, so we could have a separate report-out session at the conference.

Cheers
Gerolf

On Aug 1, 2014, at 4:04 PM, Chad Rosier <mcrosier at codeaurora.org> wrote:

> All,
> I'm curious to know if anyone is interested in tracking performance
> (compile-time and/or execution-time) from a community perspective? This
> is a much loftier goal than just supporting build bots. If so, I'd be
> happy to propose a BOF at the upcoming Dev Meeting.
>
> Chad
On 20 August 2014 00:24, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
> My experience from leading BOFs at other conferences is more talk than
> action. So I suggest a different setup for this topic: how about having a
> working group meeting with participants who can commit time to work on
> this topic?

Mine too, but in this case I have to say it wasn't at all what happened.

It started with a 10-minute description of what we had and why it was bad,
followed by a 40-minute discussion on what to do and how. There were about
80 people in the room, all actively involved in defining actions and actors.
In the end we had clear goals with clear owners, and we have implemented
every single one of them to date. I have to say, I've never seen that happen
before!

Furthermore, the "working group" was about the 80 people in the room anyway,
and they all helped in one way or another.

So, for any other discussion, I'd agree with you. For this one, I think we
should stick to what's working. :)

cheers,
--renato