Back during the LLVM developers' meeting, I talked with some of you about a proposal to "validate" llvm. Now that 2.4 is almost out the door, it seems a good time to start that discussion.

I've written up a detailed proposal and attached it to this message. The goal is to ease LLVM use by third parties. We've got considerable experience with LLVM and the community development model here and see its benefits as well as its challenges. This proposal attempts to address what I feel is the main challenge: testing and stability of LLVM between releases.

Please take a look and send feedback to the list. I'd like to get the process moving early in the 2.5 cycle.

Thanks for your input and support.

-Dave

-------------- next part --------------
LLVM Validation Proposal
------------------------

*Motivation*

LLVM Top of Trunk (ToT) is fairly unstable. It is common for tests to break on a daily basis. This makes tracking the LLVM trunk difficult and often undesirable for those needing stability. Such needs come from a variety of situations: integrating LLVM with other components requires a stable base to isolate bugs; researchers want a stable platform on which to run experiments; developers want to know when they've broken something and don't want the testing noise that random breakage introduces; some users keep private LLVM repositories where they do project-specific development and want to know when it is "safe" to merge from upstream so as to introduce as few new bugs as possible.

Often those in the situations described above can't limit themselves to the latest stable release. LLVM gains important new features rapidly and users want to stay up-to-date, both to access those features and to stay current so as to make merges from upstream simpler. Six months between releases is a long time to wait, and patches from release to release are extremely large and likely to conflict with local changes.

*Solution*

One way to meet the needs identified above is to regularly "validate" LLVM as passing all of its tests. A "validation run" is a testing run of LLVM at some revision. A "validation result" is the outcome of the testing run (pass/fail, etc.). A "validation" is a validation run that meets the stability requirements (e.g. passing all tests).

Validations can be expressed as tags of trunk, and users can check out a tag or svn switch over to it as desired. Validations should be kept around to maintain history. For example, users may not want to update to the latest-and-greatest validated LLVM but still want a safe spot to advance to beyond where they currently are. Keeping all validation tags allows this. Since svn tags are cheap, this should not impose a repository burden.

*Implementation*

The biggest issue to deal with is testing. LLVM testing resources are already thin, and validation requires that the testing process be somewhat more formalized. Some user interested in a particular target (say, x86 or PPC) should claim responsibility for validating LLVM on that target. This would involve running all of the LLVM target-independent tests as well as the tests for the specific target desired. The identified tester for each target will perform the validation by tagging the revision tested if validation is successful.

This setup allows testing to be distributed among those most interested in validations. It also relaxes the validation requirements somewhat by allowing LLVM to be validated against one target even though tests of another target may fail.
This also relieves the tester from having to provide working platforms on which to run all of the LLVM target-specific tests. Those doing validations are collectively referred to as "validators."

All validation results should be announced on llvmdev and llvm-testresults. It is also useful to announce validation run failures on these lists to keep the community informed of progress. A list of failing tests would be most helpful in such messages.

In addition to the independent target validators, there should be at least one user responsible for validating all of LLVM (a "comprehensive validation"), in which every single LLVM test (including all target-specific tests) passes. This is a much higher benchmark, and such a validation provides more confidence to end users about the stability of the validated revision. It is also a much higher testing burden, so users undertaking such validations should have access to all of the necessary machine resources, including hardware for all supported target platforms. It is not clear whether any LLVM user currently possesses this capacity.

Alternatively, we could schedule per-target validation runs so that they occur on the same revision (e.g. every 200 commits). If all targets pass validation, then we can consider LLVM comprehensively validated. This would eliminate the need for one "super-validator" with access to all the necessary computing resources. These regular validation runs will often fail. Such failures should be noted in messages to llvmdev and llvm-testresults. For the purposes of this proposal, the second scheme ("distributed comprehensive validation") should be preferred, as it seems more practical and less resource-intensive for any one individual or organization.

The various targets as well as the comprehensive validation are collectively referred to as "validation targets." There should be one official validator for each validation target, though multiple users may contribute testing for a particular validation target. It is up to the validator to decide which testing runs are suitable for validation.

The LLVM community must agree on a set of tests that constitutes a validation. This proposal suggests that all publicly available LLVM tests (for each validation target) should be required to pass. This includes:

* Tests in llvm/test
* Tests in llvm-test, excluding external tests such as SPEC

If the LLVM community can identify validators with access to the external tests, those should be included as well. Note that this testing regimen requires validators to build and test llvm-gcc as well as the native LLVM tools.

The tags themselves should live under the tags tree. One possible tagging scheme looks like this:

    trunk
    tags
       ...
       RELEASE_21
       RELEASE_22
       RELEASE_23
       validated
          ALL
             development_24
                r54582
          X86
             development_24
                r53215
                r54100
                r54582
          X86-64 (? maybe covered by x86)
             ...
          PowerPC
             ...
          Alpha
             ...
          Sparc
             ...
          Mips
             ...
          ARM
             ...
          CellSPU
             ...
          IA64
             ...
    branches
       ...
       release_21
       release_22
       release_23

This scheme organizes validations by release cycle so users can more quickly find what they're looking for. Note that a validated revision could appear under more than one validation target.

Validation tags are read-only; thus they live under the tags directory. The read-only nature of these tags could, and probably should, be enforced using svn hooks. This would require help from the LLVM repository maintainers, but it is not a requirement for this proposal to move forward.

Some testing infrastructure enhancements can make validation easier and/or more precise.
For example, all X86 and X86-64 tests are lumped under X86, but some users are not interested in 32-bit X86. Separating the running of 32-bit and 64-bit X86 tests would allow validation to be distributed to a greater extent. Again, such enhancements are not required for this proposal to move forward.

Validations should occur at least weekly. As a release approaches, validations should occur more frequently, perhaps twice a week, to help ensure that the ToT remains stable. More frequent validations should begin one month before release. If the "validation run every N commits" approach to comprehensive validation is taken, validators must do validation runs according to those requirements as well. Such validation runs need not always result in a validation (i.e. the testing can fail). Validators are free to validate even more often than these requirements.

*Next Steps*

A validation coordinator should identify a validator for each validation target and keep the list current as responsibilities change. Users should begin signing up to be validators as soon as this or another proposal is accepted. It is not necessary to have a validator for each validation target before beginning validations.

Tools should be developed to ease the validation process. Such tools should be set up to run regularly under cron or some similar scheduling system. The tools should send validation run results to llvmdev and llvm-testresults.

*Outstanding Questions*

We need to address the following detail questions about the process. Some of these will be answered through trial and error.

1. When doing a distributed comprehensive validation (scheme 2), how often should those tests occur? The proposal throws out "every 200 commits" as an example, but is that the right timeframe?

2. Who can be the validation coordinator? I will offer my services for this role, at least starting out.

3. Who will be responsible for each validation? Cray commits to taking responsibility for validating x86.

4. Are there testers beyond the official validators who want to contribute resources? Who are they, and what's the process for plugging them in?

5. Is validating weekly the right frequency? What about when a release approaches?
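For concreteness, here is a rough sketch of what one of the cron-driven validation tools described above might look like for a single validation target. Everything in it is illustrative: the repository URL, working directory, list address, tag path, and test invocation are assumptions rather than an agreed-upon layout, and a real tool would run the full agreed test set rather than just "make check".

    #!/bin/sh
    # Hypothetical weekly X86 validation run (paths and names are examples only).
    REPO=http://llvm.org/svn/llvm-project
    REV=`svn info $REPO/llvm/trunk | sed -n 's/^Revision: //p'`
    WORK=$HOME/validate-r$REV

    svn checkout -q -r $REV $REPO/llvm/trunk $WORK/llvm || exit 1
    cd $WORK/llvm && ./configure --enable-optimized && make -j4 || exit 1

    # A failing run is still a "validation run"; report it to the lists.
    # (Simplistic check: DejaGnu only prints this line when the count is nonzero.)
    make check > $WORK/check.log 2>&1
    if grep -q "unexpected failures" $WORK/check.log; then
        mail -s "X86 validation run FAILED at r$REV" llvmdev@cs.uiuc.edu < $WORK/check.log
        exit 1
    fi

    # All tests passed: create the read-only validation tag for this target,
    # mirroring the example layout in the proposal (--parents needs svn >= 1.5).
    svn copy --parents -r $REV -m "X86 validation of r$REV" \
        $REPO/llvm/trunk \
        $REPO/llvm/tags/validated/X86/development_24/r$REV
    mail -s "X86 validation PASSED at r$REV" llvmdev@cs.uiuc.edu < $WORK/check.log

A user who wants to sit on validated revisions would then check out (or svn switch to) one of the resulting tags, e.g. tags/validated/X86/development_24/r54582, and only move forward when a newer tag appears.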
On Nov 10, 2008, at 12:59 PM, David Greene wrote:

> Back during the LLVM developers' meeting, I talked with some of you
> about a proposal to "validate" llvm. Now that 2.4 is almost out the
> door, it seems a good time to start that discussion.
>
> I've written up a detailed proposal and attached it to this message.
> The goal is to ease LLVM use by third parties. We've got considerable
> experience with LLVM and the community development model here and see
> its benefits as well as its challenges. This proposal attempts to
> address what I feel is the main challenge: testing and stability of
> LLVM between releases.
>
> Please take a look and send feedback to the list. I'd like to get the
> process moving early in the 2.5 cycle.

Interesting proposal. I think this could be a great thing. When the system is up and running, I can scrounge up some darwin validator machines for the machine pool.

-Chris
David Greene <dag at cray.com> writes:

> Back during the LLVM developers' meeting, I talked with some of you
> about a proposal to "validate" llvm. Now that 2.4 is almost out the
> door, it seems a good time to start that discussion.

I applaud your initiative. Discussing this issue is badly needed.

From my point of view, LLVM still has an academic/exploratory character that makes it incompatible with a long-term commitment from the POV of some industry users (those for whom LLVM would be a critical component), unless those users have enough resources for maintaining their own LLVM branch.

IMO a validation process based on running test suites is not enough. As you know very well, tests can demonstrate failures, but they cannot demonstrate correctness. An approach based on having stable (bug-fix only) and development branches is more adequate. This way, each user can devote work to validating LLVM for its own purposes, apply fixes to the stable branch, and then have some hope of reaching a point where LLVM is good enough, instead of an endless upgrade cycle where you fix known bugs while knowing that new ones are being introduced.

This conflicts with the current practice of going forward at full throttle, where it is not rare for developers to recommend using ToT just a few weeks after a release. Hopefully when clang matures, new requirements on medium-term stability will be enforced.

-- Oscar
On Monday 10 November 2008 15:49, Óscar Fuentes wrote:

> IMO a validation process based on running test suites is not enough. As

Not enough for some, I agree. For others, it helps a lot. It would help us tremendously, for example, but then, we do maintain our own branch.

> you know very well, tests can demonstrate failures, but they cannot
> demonstrate correctness. An approach based on having stable (bug-fix
> only) and development branches is more adequate. This way, each user
> can devote work to validating LLVM for its own purposes, apply fixes
> to the stable branch, and then have some hope of reaching a point
> where LLVM is good enough, instead of an endless upgrade cycle where
> you fix known bugs while knowing that new ones are being introduced.

A stable branch and a development branch would also help. You still need to validate the stable branch, however, so I think the proposal still applies regardless of how the repository is organized.

> This conflicts with the current practice of going forward at full
> throttle, where it is not rare for developers to recommend using ToT
> just a few weeks after a release.

Right. It would be a shift in the development process.

> Hopefully when clang matures, new requirements on medium-term
> stability will be enforced.

It's hard to "enforce" anything in the open source world. That's something that third parties just have to come to understand. So we should try to introduce processes that can help achieve what we want without depending on anyone else conforming to our idea of how development should happen.

-Dave
On Nov 10, 2008, at 12:59 PM, David Greene wrote:

> I've written up a detailed proposal and attached it to this message.

Yeah, this mirrors in many ways something I've thought about for a while now. Roughly, cron (or "while :; do") testers that figure out quality and create tags when that quality is met. Release branching can then just happen from the `prerelease' tag, and largely start from a known good quality.

People can then figure out what naming scheme they want and which tests they want to run, and contribute by testing and creating tags. The existence of specific combinations of tags at the same versions can be used to create rollup tags. One can imagine a C on x86 tag, a C++ on x86_64 tag, an llvm-gcc on mips tag, and so on. Though, in my version, I would have the cron job move their tag forward so that developers have a stable tag to select if they just want, say, C on ppc.

If someone wanted to play on sparc (to pick a less well maintained port that, at the top, might not work, but did at some point in time), they could start with the C on sparc tag and reproduce that working state. They wouldn't have to guess whether it builds or not; they'd just know it should (given the definition of the tag).

The prerelease tag could be comprised of the mips tag, the x86 tag, the llvm x86 tag, .... One could even include things like a "freebsd world build and boot and self-validate" tag. That might take a few days to run, but doing this to feed into a prerelease-style tag might be nice. I don't know just how useful this would be.
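A minimal sketch of the moving-tag idea Mike describes, assuming plain svn tag copies (the tag name, revision number, and URL are made up for illustration). Since svn has no atomic "move tag" operation, the cron job deletes the old copy and recreates it at the newly qualified revision:

    # Hypothetical: re-point the stable "C-on-ppc" tag at newly qualified r54582.
    REPO=http://llvm.org/svn/llvm-project/llvm
    svn delete -m "retire previous C-on-ppc tag" $REPO/tags/validated/C-on-ppc
    svn copy -r 54582 -m "C-on-ppc now points at r54582" \
        $REPO/trunk $REPO/tags/validated/C-on-ppc

Developers who just want "C on ppc that works" would keep a working copy of that single path and svn update against it whenever the tag moves.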
Lately our random C program generator has seemed quite successful at catching regressions in llvm-gcc that the test suite misses. I'd suggest running some fixed number of random programs as part of the validation suite. On a fastish quad core I can test about 25,000 programs in 24 hours. Our hacked valgrind (which looks for volatile miscompilations) is a bottleneck; leaving it out would speed up the process considerably.

We've never tested llvm-gcc for x86-64 using random testing; doing so would likely turn up a nice crop of bugs.

I just started a random test run of llvm-gcc 2.0-2.4 that should provide some interesting quantitative results comparing these compilers in terms of crashes, volatile miscompilations, and regular miscompilations. However, it may take a month or so to get statistical significance, since 2.3 and 2.4 have quite low failure rates.

John Regehr
On Monday 10 November 2008 19:11, Mike Stump wrote:

> creating tags. The existence of specific combinations of tags at the
> same versions can be used to create rollup tags. One can imagine a C

I'm not entirely sure what you mean here. By "versions" do you mean svn revisions?

> on x86 tag, a C++ on x86_64 tag, an llvm-gcc on mips tag, and so on.
> Though, in my version, I would have the cron job move their tag
> forward so that developers have a stable tag to select if they just
> want, say, C on ppc.

So you're saying have one tag that keeps the same name and indicates the highest validated revision? That makes sense to me.

> The prerelease tag could be comprised of the mips tag, the x86 tag,
> the llvm x86 tag, ....

Yep.

> One could even include things like a "freebsd world build and boot
> and self-validate" tag. That might take a few days to run, but doing
> this to feed into a prerelease-style tag might be nice.

llvm-gcc already does a bootstrap. Is that what you mean?

-Dave
On Nov 10, 2008, at 12:59 PM, David Greene wrote:

> Back during the LLVM developers' meeting, I talked with some of you
> about a proposal to "validate" llvm. Now that 2.4 is almost out the
> door, it seems a good time to start that discussion.
>
> I've written up a detailed proposal and attached it to this message.
> The goal is to ease LLVM use by third parties. We've got considerable
> experience with LLVM and the community development model here and see
> its benefits as well as its challenges. This proposal attempts to
> address what I feel is the main challenge: testing and stability of
> LLVM between releases.
>
> Please take a look and send feedback to the list. I'd like to get the
> process moving early in the 2.5 cycle.

Hi Dave,

Here are my opinions:

I like the idea of regular validation tagging. However, I think that it should be as automated as possible. I'm worried that validation testing will be pushed off of people's plates indefinitely, even for people who care deeply about a particular platform.

Here is the minimal set of tests that should go into a validation run. (All of these should be done in "Release" mode.)

* Regression Tests - This catches obvious errors, but not major ones.
* Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our second indication if something has gone horribly awry.
* Nightly Testsuite - A very good suite of tests; much more extensive than a simple bootstrap.
* LLVM-GCC Testsuite - Many thousands of great tests to test the many facets of the compiler. WAY too few people run these.

As far as I know, Dale's the only one who's been slogging through the LLVM-GCC testsuite, finding and fixing errors. I think that there are still many failures that should be addressed.

All four of the above should be run on at least a nightly basis (more frequently for some, like the regression tests). Each of these is automated, making that easy. If there are no regressions from the above four, we could tag that revision as being potentially "valid".

-bw
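For reference, a rough outline of how one might drive the four suites Bill lists. The object-directory variables are placeholders, and the exact invocations (especially for the nightly and llvm-gcc suites) are approximations for the 2.x-era build system, so check them against the testing documentation:

    # $LLVM_OBJ and $LLVMGCC_OBJ are placeholder paths to the object directories.

    # 1. Regression tests on a Release build of the LLVM tools.
    cd $LLVM_OBJ && make check

    # 2. Full three-stage bootstrap of llvm-gcc.
    cd $LLVMGCC_OBJ && make bootstrap

    # 3. Nightly test suite (llvm-test), e.g. via the nightly report target.
    cd $LLVM_OBJ/projects/llvm-test && make TEST=nightly report

    # 4. llvm-gcc (DejaGnu) testsuite; -k keeps going past individual failures.
    cd $LLVMGCC_OBJ && make -k check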
Bill Wendling <isanbard at gmail.com> writes:

[snip]

> All four of the above should be run on at least a nightly basis (more
> frequently for some, like the regression tests). Each of these is
> automated, making that easy. If there are no regressions from the
> above four, we could tag that revision as being potentially "valid".

If a new test case is created (coming from a bug report or a code review, not from adding a new feature) and it fails for a previously "valid" revision, is the tag removed?

-- Oscar
On Nov 11, 2008, at 10:33 PM, Bill Wendling wrote:

> * Regression Tests - This catches obvious errors, but not major ones.
> * Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our
>   second indication if something has gone horribly awry.
> * Nightly Testsuite - A very good suite of tests; much more extensive
>   than a simple bootstrap.
> * LLVM-GCC Testsuite - Many thousands of great tests to test the many
>   facets of the compiler. WAY too few people run these.

I'd add one more item here:

* Nightly Testsuite run using an llvm-gcc-built llvm!

If LLVM-GCC is a complex C program, then LLVM is a complex C++ program whose correctness can be validated using such test suite runs.

- Devang
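A sketch of the self-hosted run Devang suggests, assuming the standard autoconf CC/CXX override (the paths are placeholders). The idea is simply to make llvm-gcc the compiler that builds LLVM, then run the usual suites on the result:

    # Build LLVM with llvm-gcc/llvm-g++ instead of the system compiler.
    cd $LLVM_OBJ
    $LLVM_SRC/configure --enable-optimized CC=llvm-gcc CXX=llvm-g++
    make -j4 && make check

    # Then point the nightly suite at the llvm-gcc-built tools as usual.
    cd projects/llvm-test && make TEST=nightly report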
On Wednesday 12 November 2008 00:33, Bill Wendling wrote:

> > Please take a look and send feedback to the list. I'd like to get the
> > process moving early in the 2.5 cycle.
>
> Hi Dave,
>
> Here are my opinions:
>
> I like the idea of regular validation tagging. However, I think that
> it should be as automated as possible. I'm worried that validation
> testing will be pushed off of people's plates indefinitely, even for
> people who care deeply about a particular platform.

Yes, automation is key. I think it is very possible with this proposal.

> Here is the minimal set of tests that should go into a validation
> run. (All of these should be done in "Release" mode.)
>
> * Regression Tests - This catches obvious errors, but not major ones.

For clarity, you mean "make check" on the LLVM tools, right?

> * Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our
>   second indication if something has gone horribly awry.

Yep.

> * Nightly Testsuite - A very good suite of tests; much more extensive
>   than a simple bootstrap.

How does this differ from "make check" or llvm-test?

> * LLVM-GCC Testsuite - Many thousands of great tests to test the many
>   facets of the compiler. WAY too few people run these.

By this do you mean llvm-test or the testsuite that ships with gcc? To my knowledge, LLVM has never passed the gcc testsuite ("make check" on llvm-gcc).

> As far as I know, Dale's the only one who's been slogging through the
> LLVM-GCC testsuite, finding and fixing errors. I think that there are
> still many failures that should be addressed.

Depending on how you're defining the LLVM-GCC testsuite, I may also be running those tests regularly.

> All four of the above should be run on at least a nightly basis (more
> frequently for some, like the regression tests). Each of these is
> automated, making that easy. If there are no regressions from the
> above four, we could tag that revision as being potentially "valid".

Right. I would add one thing: we want to run these suites with Debug, Release, Release+Asserts, and Debug+ExpensiveChecks builds. No one but me seems to run the Debug+ExpensiveChecks tests, because I see things break regularly. It's a valuable tool for finding subtle C++ errors.

-Dave
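For concreteness, the build matrix Dave describes could be driven along these lines. This is only a sketch: the make-variable spellings (ENABLE_OPTIMIZED, DISABLE_ASSERTIONS, EXPENSIVE_CHECKS) and the flavor mapping are from memory of the 2.x autoconf/Makefile build and should be checked against the current build documentation.

    # Flavor mapping (approximate):
    #   Debug                 : no flags (the default build)
    #   Release               : ENABLE_OPTIMIZED=1 DISABLE_ASSERTIONS=1
    #   Release+Asserts       : ENABLE_OPTIMIZED=1
    #   Debug+ExpensiveChecks : EXPENSIVE_CHECKS=1
    for FLAVOR in "" \
                  "ENABLE_OPTIMIZED=1 DISABLE_ASSERTIONS=1" \
                  "ENABLE_OPTIMIZED=1" \
                  "EXPENSIVE_CHECKS=1"
    do
        make clean
        make -j4 $FLAVOR && make check $FLAVOR || echo "FAILED: ${FLAVOR:-Debug}"
    done

Each flavor builds into its own object subdirectory, so the "make clean" is mostly a belt-and-suspenders step; the point is simply that a validation run covers all four configurations rather than just one.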