Back during the LLVM developer's meeting, I talked with some of you about
a proposal to "validate" llvm. Now that 2.4 is almost out the door, it
seems a good time to start that discussion.

I've written up a detailed proposal and attached it to this message. The
goal is to ease LLVM use by third parties. We've got considerable
experience with LLVM and the community development model here and see its
benefits as well as challenges. This proposal attempts to address what I
feel is the main challenge: testing and stability of LLVM between
releases.

Please take a look and send feedback to the list. I'd like to get the
process moving early in the 2.4 cycle.
Thanks for your input and support.
-Dave
-------------- next part --------------
LLVM Validation Proposal
------------------------
*Motivation*
LLVM Top of Trunk (ToT) is fairly unstable. It is common for tests to
break on a daily basis. This makes tracking the LLVM trunk difficult and
often undesirable for those needing stability. Such needs come from a
variety of situations: integrating LLVM with other components requires a
stable base to isolate bugs; researchers want a stable platform on which
to run experiments; developers want to know when they've broken something
and don't want the testing noise that random breakage introduces; some
users keep private LLVM repositories where they do project-specific
development and want to know when it is "safe" to merge from upstream so
as to introduce as few new bugs as possible.
Often those in the situations described above can't limit themselves to
the latest stable release. LLVM gains important new features rapidly
and users want to stay up-to-date, both to access those features and to
stay current so as to make merges from upstream simpler. Six months
between releases is a long time to wait and patches from release to
release are extremely large and likely to conflict with local changes.
*Solution*
One way to meet the needs identified above is to regularly "validate"
LLVM as passing all of its tests. A "validation run" is a testing run of
LLVM at some revision. A "validation result" is the outcome of the
testing run (pass/fail, etc.). A "validation" is a validation run that
meets the stability requirements (e.g. passing all tests).

Validations can be expressed as tags of trunk, and users can check out a
validation tag or svn switch over to it as desired.
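
For illustration only (the repository URL below is a placeholder, and the
tag layout is the one sketched under Implementation), a user could switch
a working copy to the newest validation tag for a target with a small
script along these lines:

#!/usr/bin/env python
# Sketch only: switch a working copy to the newest validation tag for one
# target.  The repository URL is a placeholder; the tag layout matches the
# scheme proposed under Implementation.
import subprocess

REPO = "http://llvm.org/svn/llvm-project/llvm"   # placeholder URL
TARGET = "X86"
CYCLE = "development_24"

def newest_validated_revision():
    """Return the newest rNNNNN tag under tags/validated/TARGET/CYCLE."""
    url = "%s/tags/validated/%s/%s" % (REPO, TARGET, CYCLE)
    out = subprocess.check_output(["svn", "ls", url]).decode()
    revs = [line.strip().rstrip("/") for line in out.splitlines() if line.strip()]
    return max(revs, key=lambda r: int(r.lstrip("r")))

def switch_to_newest_validated(wc_path):
    rev = newest_validated_revision()
    tag_url = "%s/tags/validated/%s/%s/%s" % (REPO, TARGET, CYCLE, rev)
    subprocess.check_call(["svn", "switch", tag_url, wc_path])

if __name__ == "__main__":
    switch_to_newest_validated(".")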
Validations should be kept around to maintain history. For example, users
may not want to update to the latest-and-greatest validated LLVM but
still want a safe point further along than the revision they currently
use. Keeping all validation tags allows this. Since svn tags are cheap,
this should not impose a repository burden.
*Implementation*
The biggest issue to deal with is testing. LLVM testing resources are
already thin and validation requires that the testing process be
somewhat more formalized.
A user interested in a particular target (say, x86 or PPC) should claim
responsibility for validating LLVM on that target. This involves running
all of the LLVM target-independent tests as well as the tests for the
specific target. The identified tester for each target performs the
validation by tagging the revision tested if validation is successful.
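
A rough sketch of what a validator's driver might look like follows; the
repository URL, paths, and exact test invocations are illustrative only,
but the shape is: run the agreed test suites, and create the tag only if
everything passes.

#!/usr/bin/env python
# Sketch of a per-target validation run: run the agreed test suites and
# create a validation tag only if everything passes.  The URL, paths, and
# exact make invocations are illustrative, not prescriptive.
import subprocess, sys

REPO = "http://llvm.org/svn/llvm-project/llvm"   # placeholder URL
TARGET = "X86"
CYCLE = "development_24"

def run(cmd, cwd):
    print(">> " + " ".join(cmd))
    return subprocess.call(cmd, cwd=cwd) == 0

def validate(llvm_obj_dir, llvm_test_dir, revision):
    # "make check" for the llvm regression tests, plus a full llvm-test
    # run (stand-in commands; use whatever invocation the validation
    # target actually requires).
    ok = (run(["make", "check"], llvm_obj_dir) and
          run(["make"], llvm_test_dir))
    if not ok:
        print("validation run FAILED at r%d for %s" % (revision, TARGET))
        return False
    tag = "%s/tags/validated/%s/%s/r%d" % (REPO, TARGET, CYCLE, revision)
    subprocess.check_call(
        ["svn", "copy", "--parents", "-r", str(revision),
         "%s/trunk" % REPO, tag,
         "-m", "Validate %s at r%d" % (TARGET, revision)])
    print("validated r%d for %s" % (revision, TARGET))
    return True

if __name__ == "__main__":
    sys.exit(0 if validate("obj/llvm", "obj/llvm-test", 54582) else 1)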
This setup allows testing to be distributed among those most interested
in validations. It also relaxes the validation requirements somewhat by
allowing LLVM to be validated against one target even though tests of
another target may fail. It also relieves each tester from having to
provide working platforms on which to run all of the LLVM target-specific
tests.
Those doing validations are collectively referred to as "validators."
All validation results should be announced on llvmdev and
llvm-testresults. It is also useful to announce validation run failures
on these lists to keep the community informed of validation progress. A
list of failing tests would be most helpful in such messages.
In addition to the independent target validators, there should be at
least one user responsible for validating all of LLVM (a "comprehensive
validation"), in which every single LLVM test (including all
target-specific tests) passes. This is a much higher benchmark, and such
a validation provides more confidence to end users about the stability of
the validated revision. It is also a much higher testing burden, so users
undertaking such validations should have access to all of the necessary
machine resources, including hardware for all supported target platforms.
It is not clear whether any LLVM user currently possesses this capacity.
Alternatively, we could schedule per-target validation runs so that they
occur on the same revision (e.g. every 200 commits). If all targets pass
validation, then we can consider LLVM comprehensively validated. This
would eliminate the need for one "super-validator" with access to all the
necessary computing resources. These regular validation runs will often
fail. Such failures should be noted in messages to llvmdev and
llvm-testresults.
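
One simple way to pick the shared revision (assuming revision numbers are
used directly as the schedule, and using the 200-commit figure above only
as an example) is for every validator to test the most recent multiple of
the interval:

def comprehensive_validation_revision(head, interval=200):
    """The shared revision every validator tests for a distributed
    comprehensive validation: the most recent multiple of `interval`
    at or below HEAD.  The 200-commit interval is just the example
    figure used above."""
    return (head // interval) * interval

# With HEAD at r54582 and an interval of 200, everyone would test r54400.
assert comprehensive_validation_revision(54582) == 54400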
For the purposes of this proposal, the second scheme ("distributed
comprehensive validation") should be preferred, as it seems more
practical and less resource-intensive on any one individual or
organization.
The various targets as well as the comprehensive validation are
collectively referred to as "validation targets." There should be one
official validator for each validation target, though multiple users may
contribute testing for a particular validation target. It is up to the
validator to decide which testing runs are suitable for validation.
The LLVM community must agree on a set of tests that constitutes a
validation. This proposal suggests all publicly available LLVM tests
(for each validation target) should be required to pass. This includes:
* Tests in llvm/test
* Tests in llvm-test, excluding external tests such as SPEC
If the LLVM community can identify validators with access to the
external tests, those should be included as well.
Note that this testing regimen requires validators to build and test
llvm-gcc as well as the native LLVM tools.
The tags themselves should live under the tags tree. One possible
tagging scheme looks like this:
  trunk
  tags
    ...
    RELEASE_21
    RELEASE_22
    RELEASE_23
    validated
      ALL
        development_24
          r54582
      X86
        development_24
          r53215
          r54100
          r54582
      X86-64 (? maybe covered by X86)
        ...
      PowerPC
        ...
      Alpha
        ...
      Sparc
        ...
      Mips
        ...
      ARM
        ...
      CellSPU
        ...
      IA64
        ...
  branches
    ...
    release_21
    release_22
    release_23
This scheme organizes validations by release cycle so users can more
quickly find what they're looking for. Note that a validated revision
could appear under more than one validation target.
Validation tags are read-only; thus they live under the tags directory.
The read-only nature of these tags could, and probably should, be
enforced using svn hooks. This would require help from the LLVM
repository maintainers, but it is not a requirement for this proposal to
move forward.
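
For illustration, a pre-commit hook along these lines (a sketch only; the
exact layout and policy would be up to the repository maintainers) could
allow new validation tags to be created while rejecting changes to
existing ones:

#!/usr/bin/env python
# Sketch of a Subversion pre-commit hook keeping validation tags read-only:
# adding a new path under tags/validated/ is allowed, but modifying or
# deleting anything already there is rejected.
# Called from hooks/pre-commit as: pre-commit.py REPOS TXN
import subprocess, sys

def main(repos, txn):
    changed = subprocess.check_output(
        ["svnlook", "changed", "-t", txn, repos]).decode()
    for line in changed.splitlines():
        action, path = line[:4].strip(), line[4:].strip()
        if path.startswith("tags/validated/") and action != "A":
            sys.stderr.write(
                "Validation tags are read-only: %s %s\n" % (action, path))
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))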
Some testing infrastructure enhancements can make validation easier
and/or more precise. For example, all X86 and X86-64 tests are lumped
under X86. But some users are not interested in 32-bit X86. Separating
the running of 32-bit and 64-bit X86 tests allows validation to be
distributed to a greater extent. Again, such enhancements are not
required for this proposal to move forward.
Validations should occur at least weekly. As a release approaches,
validations should occur more frequently, perhaps twice a week. This will
help ensure that the ToT remains stable. More frequent validations should
begin one month before release. If the "validation run every N commits"
approach to comprehensive validation is taken, validators must do
validation runs according to those requirements as well. Such validation
runs need not always result in a validation (i.e. the testing can fail).
Validators are free to validate even more often than these requirements.
*Next Steps*
A validation coordinator should identify a validator for each validation
target and keep the list current as responsibilities change. Users
should begin signing up to be validators as soon as this or another
proposal is accepted. It is not necessary to have a validator for each
validation target before beginning validations.
Tools should be developed to ease the validation process. Such tools
should be set up to run regularly under cron or some similar scheduling
system. The tools should send validation run results to llvmdev and
llvm-testresults.
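
As a sketch of that reporting piece (the sender address, SMTP host, and
list addresses below are placeholders), the driver might mail a summary
after each run like this:

#!/usr/bin/env python
# Sketch: mail a validation-run summary to the lists.  The sender address,
# SMTP host, and list addresses are placeholders.
import smtplib
from email.mime.text import MIMEText

LISTS = ["llvmdev@cs.uiuc.edu", "llvm-testresults@cs.uiuc.edu"]  # placeholders

def mail_results(revision, target, passed, failing_tests,
                 sender="validator@example.org", smtp_host="localhost"):
    status = "PASSED" if passed else "FAILED"
    body = "Validation run of r%d for %s: %s\n" % (revision, target, status)
    if failing_tests:
        body += "\nFailing tests:\n" + "\n".join("  " + t for t in failing_tests)
    msg = MIMEText(body)
    msg["Subject"] = "[validation] %s r%d: %s" % (target, revision, status)
    msg["From"] = sender
    msg["To"] = ", ".join(LISTS)
    server = smtplib.SMTP(smtp_host)
    server.sendmail(sender, LISTS, msg.as_string())
    server.quit()

# Example: mail_results(54582, "X86", False, ["CodeGen/X86/foo.ll"])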
*Outstanding Questions*
We need to address the following detail questions about the process.
Some of these will be answered through trial-and-error.
1. When doing a distributed comprehensive validation (scheme 2), how
   often should those tests occur? The proposal throws out "every 200
   commits" as an example, but is that the right interval?
2. Who can be the validation coordinator? I will offer my services for
this role at least starting out.
3. Who will be responsible for each validation? Cray commits to taking
responsibility for validating x86.
4. Are there testers beyond the official validators that want to
contribute resources? Who are they and what's the process of
plugging them in?
5. Is validating weekly the right frequency? What about when a release
approaches?
On Nov 10, 2008, at 12:59 PM, David Greene wrote:

> Back during the LLVM developer's meeting, I talked with some of you
> about a proposal to "validate" llvm. Now that 2.4 is almost out the
> door, it seems a good time to start that discussion.
>
> I've written up a detailed proposal and attached it to this message.
> The goal is to ease LLVM use by third parties. We've got considerable
> experience with LLVM and the community development model here and see
> its benefits as well as challenges. This proposal attempts to address
> what I feel is the main challenge: testing and stability of LLVM
> between releases.
>
> Please take a look and send feedback to the list. I'd like to get the
> process moving early in the 2.4 cycle.

Interesting proposal. I think this could be a great thing. When the
system is up and running, I can scrounge up some Darwin validator
machines for the machine pool.

-Chris
David Greene <dag at cray.com> writes:

> Back during the LLVM developer's meeting, I talked with some of you
> about a proposal to "validate" llvm. Now that 2.4 is almost out the
> door, it seems a good time to start that discussion.

I applaud your initiative. Discussing this issue is badly needed.

From my point of view, LLVM still has an academic/exploratory character
that makes it incompatible with a long-term commitment from the POV of
some industry users (those for whom LLVM would be a critical component),
unless those users have enough resources for maintaining their own LLVM
branch.

IMO a validation process based on running test suites is not enough. As
you know very well, tests can demonstrate failures, but they cannot
demonstrate correctness. An approach based on having stable (bug-fix
only) and development branches is more adequate. This way, each user can
devote work to validating LLVM for its own purposes, apply fixes to the
stable branch, and then have some hope of reaching a point where LLVM is
good enough, instead of endlessly upgrading, fixing known bugs while
knowing that new ones are being introduced.

This conflicts with the current practice of going forward at full
throttle, where it is not rare for developers to recommend using ToT just
a few weeks after a release.

Hopefully when clang matures, new requirements on middle-term stability
will be enforced.

--
Oscar
On Monday 10 November 2008 15:49, Óscar Fuentes wrote:

> IMO a validation process based on running test suites is not enough.

Not enough for some, I agree. For others, it helps a lot. It would help
us tremendously, for example, but then, we do maintain our own branch.

> As you know very well, tests can demonstrate failures, but they cannot
> demonstrate correctness. An approach based on having stable (bug-fix
> only) and development branches is more adequate. This way, each user
> can devote work to validating LLVM for its own purposes, apply fixes
> to the stable branch, and then have some hope of reaching a point
> where LLVM is good enough, instead of endlessly upgrading, fixing
> known bugs while knowing that new ones are being introduced.

A stable and development branch would also help. You still need to
validate the stable branch, however. So I think the proposal still
applies regardless of how the repository is organized.

> This conflicts with the current practice of going forward at full
> throttle, where it is not rare for developers to recommend using ToT
> just a few weeks after a release.

Right. It would be a shift in development process.

> Hopefully when clang matures, new requirements on middle-term
> stability will be enforced.

It's hard to "enforce" anything in the open source world. That's
something that third parties just have to come to understand. So we
should try to introduce processes that can help achieve what we want
without depending on anyone else to conform to our idea of how
development should happen.

-Dave
On Nov 10, 2008, at 12:59 PM, David Greene wrote:

> I've written up a detailed proposal and attached it to this message.

Yeah, this mirrors in many ways something I've thought about for a while
now. Roughly, cron (or "while :; do") testers that figure out quality and
create tags when that quality is met. Release branching can then just
happen from the `prerelease' tag, and largely start from a known good
quality.

People can then figure out what naming scheme they want and which tests
they want to run, and contribute by testing and creating tags. The
existence of specific combinations of tags at the same versions can be
used to create rollup tags. One can imagine a C on x86 tag, a C++ on
x86_64 tag, an llvm-gcc on mips tag, and so on. Though, in my version, I
would have the cron job move their tag forward so that developers have a
stable tag to select if they just want, say, C on ppc.

If someone wanted to play on sparc (to pick a less well maintained port
that, at the top, might not work, but did at some point in time), they
could start with the C on sparc tag and reproduce that working state.
They wouldn't have to guess whether it builds or not; they'd just know it
should (given the definition of the tag).

The prerelease tag could be comprised of the mips tag, the x86 tag, the
llvm x86 tag, .... One could even include things like a FreeBSD world
build-and-boot self-validate tag. It might take a few days to run, but
doing this to feed into a prerelease-style tag might be nice. I don't
know just how useful this would be.
Lately our random C program generator has seemed quite successful at
catching regressions in llvm-gcc that the test suite misses. I'd suggest
running some fixed number of random programs as part of the validation
suite. On a fastish quad core I can test about 25,000 programs in 24
hours. Our hacked valgrind (which looks for volatile miscompilations) is
a bottleneck; leaving this out would speed up the process considerably.
We've never tested llvm-gcc for x64 using random testing; doing this
would likely turn up a nice crop of bugs.

I just started a random test run of llvm-gcc 2.0-2.4 that should provide
some interesting quantitative results comparing these compilers in terms
of crashes, volatile miscompilations, and regular miscompilations.
However, it may take a month or so to get statistical significance, since
2.3 and 2.4 have quite low failure rates.

John Regehr
On Monday 10 November 2008 19:11, Mike Stump wrote:

> creating tags. The existence of specific combinations of tags at the
> same versions can be used to create rollup tags. One can imagine a C

I'm not entirely sure what you mean here. By "versions" do you mean svn
revisions?

> on x86 tag, a C++ on x86_64 tag, an llvm-gcc on mips tag, and so on.
> Though, in my version, I would have the cron job move their tag
> forward so that developers have a stable tag to select if they just
> want, say, C on ppc.

So you're saying have one tag that keeps the same name and indicates the
highest validated revision? That makes sense to me.

> The prerelease tag could be comprised of the mips tag, the x86 tag,
> the llvm x86 tag, ....

Yep.

> One could even include things like a FreeBSD world build-and-boot
> self-validate tag. It might take a few days to run, but doing this to
> feed into a prerelease-style tag might be nice.

llvm-gcc already does a bootstrap. Is that what you mean?

-Dave
On Nov 10, 2008, at 12:59 PM, David Greene wrote:

> Back during the LLVM developer's meeting, I talked with some of you
> about a proposal to "validate" llvm. Now that 2.4 is almost out the
> door, it seems a good time to start that discussion.
>
> I've written up a detailed proposal and attached it to this message.
> The goal is to ease LLVM use by third parties. We've got considerable
> experience with LLVM and the community development model here and see
> its benefits as well as challenges. This proposal attempts to address
> what I feel is the main challenge: testing and stability of LLVM
> between releases.
>
> Please take a look and send feedback to the list. I'd like to get the
> process moving early in the 2.4 cycle.

Hi Dave,

Here are my opinions:

I like the idea of regular validation tagging. However, I think that it
should be as automated as possible. I'm worried that validation testing
will be pushed off of people's plates indefinitely, even for people who
care deeply about a particular platform.

Here is the minimal set of tests that should go into a validation test.
(All of these should be done in "Release" mode.)

* Regression Tests - This catches obvious errors, but not major ones.
* Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our
  second indication if something has gone horribly awry.
* Nightly Testsuite - A very good suite of tests; much more extensive
  than a simple bootstrap.
* LLVM-GCC Testsuite - Many thousands of great tests covering the many
  facets of the compiler. WAY too few people run these.

As far as I know, Dale's the only one who's been slogging through the
LLVM-GCC testsuite, finding and fixing errors. I think that there are
still many failures that should be addressed.

All four of the above should be run on at least a nightly basis (more
frequently for some, like the regression tests). Each of these is
automated, making that easy. If there are no regressions from the above
four, we could tag that revision as being potentially "valid".

-bw
Bill Wendling <isanbard at gmail.com> writes:

[snip]

> All four of the above should be run on at least a nightly basis (more
> frequently for some, like the regression tests). Each of these is
> automated, making that easy. If there are no regressions from the
> above four, we could tag that revision as being potentially "valid".

If a new test case is created (coming from a bug report or a code review,
not from adding a new feature) and it fails for a previously "valid"
revision, is the tag removed?

--
Oscar
On Nov 11, 2008, at 10:33 PM, Bill Wendling wrote:

> * Regression Tests - This catches obvious errors, but not major ones.
> * Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our
>   second indication if something has gone horribly awry.
> * Nightly Testsuite - A very good suite of tests; much more extensive
>   than a simple bootstrap.
> * LLVM-GCC Testsuite - Many thousands of great tests covering the many
>   facets of the compiler. WAY too few people run these.

I'd add one more item here:

* Nightly Testsuite run using an llvm-gcc-built llvm!

If LLVM-GCC is a complex C program, then LLVM is a complex C++ program
whose correctness can be validated using such test suite runs.

- Devang
On Wednesday 12 November 2008 00:33, Bill Wendling wrote:

> > Please take a look and send feedback to the list. I'd like to get
> > the process moving early in the 2.4 cycle.
>
> Hi Dave,
>
> Here are my opinions:
>
> I like the idea of regular validation tagging. However, I think that
> it should be as automated as possible. I'm worried that validation
> testing will be pushed off of people's plates indefinitely, even for
> people who care deeply about a particular platform.

Yes, automation is key. I think it is very possible to do with this
proposal.

> Here is the minimal set of tests that should go into a validation
> test. (All of these should be done in "Release" mode.)
>
> * Regression Tests - This catches obvious errors, but not major ones.

For clarity, you mean "make check" on the LLVM tools, right?

> * Full Bootstrap of LLVM-GCC - LLVM-GCC is a complex program. It's our
>   second indication if something has gone horribly awry.

Yep.

> * Nightly Testsuite - A very good suite of tests; much more extensive
>   than a simple bootstrap.

How does this differ from "make check" or llvm-test?

> * LLVM-GCC Testsuite - Many thousands of great tests covering the many
>   facets of the compiler. WAY too few people run these.

By this do you mean llvm-test or the testsuite that ships with gcc? To my
knowledge, LLVM has never passed the gcc testsuite ("make check" on
llvm-gcc).

> As far as I know, Dale's the only one who's been slogging through the
> LLVM-GCC testsuite, finding and fixing errors. I think that there are
> still many failures that should be addressed.

Depending on how you're defining LLVM-GCC, I may also be running those
tests regularly.

> All four of the above should be run on at least a nightly basis (more
> frequently for some, like the regression tests). Each of these is
> automated, making that easy. If there are no regressions from the
> above four, we could tag that revision as being potentially "valid".

Right.

I would add one thing. We want to run these suites with Debug, Release,
Release+Asserts, and Debug+ExpensiveChecks builds. No one but me seems to
run Debug+ExpensiveChecks tests, because I see things break regularly.
It's a valuable tool for finding subtle C++ errors.

-Dave