This note describes the first part of the Rotten Green Tests project.

"Rotten Green Tests" is the title of a paper presented at the 2019 International Conference on Software Engineering (ICSE). Stripped to its essentials, the paper describes a method to identify defects or oversights in executable tests. The method has two steps:

(a) Statically identify all "test assertions" in the test program.
(b) Dynamically determine whether these assertions are actually executed.

A test assertion that has been coded but is never executed is termed a "rotten green" test, because it allows the test to be green (i.e., pass) without actually enforcing the assertion. In many cases it is not immediately obvious, just by reading the code, that the test has a problem; the Rotten Green Test method helps identify these cases.

The paper describes using this method on projects coded in Pharo (which appears to be a Smalltalk descendant), so the specific tools are obviously not applicable to a C++ project such as LLVM. However, the concept transfers easily.

I applied these ideas to the Clang and LLVM unittests, because these are all executable tests that use the googletest infrastructure. In particular, all "test assertions" are easily identified because they use macros defined by googletest; by modifying these macros, it is feasible to keep track of all assertions and report whether they have been executed.

The mildly gory details can be saved for the code review and of course an LLVM Dev Meeting talk, but the basic idea is: each test-assertion macro statically allocates a struct identifying the source location of the macro, and includes an executable statement that records when the assertion is actually executed. Then, when the test program exits, we look for any records that were never marked as executed, and report them.

I've gotten this to work in three environments so far:
1) Linux, with gcc as the build compiler
2) Linux, with clang as the build compiler
3) Windows, with MSVC as the build compiler

The results are not identical across the three environments. Besides the obvious case that some tests simply don't run on both Linux and Windows, there are some subtleties that make the infrastructure work less well with gcc than with clang.

The infrastructure depends on certain practices in coding the tests.

First and foremost, it depends on tests being coded to use the googletest macros (EXPECT_* and ASSERT_*) to express individual test assertions. This is generally true in the unittests, although not as universal as might be hoped; ClangRenameTests, for example, buries a handful of test assertions inside helper methods. That is a plausible coding tactic, but it makes the RGT infrastructure less useful, because many separate tests funnel through the same EXPECT/ASSERT macros and RGT cannot discern whether any of those higher-level tests are rotten.

Secondly, "source location" is constrained to filename and line number (__FILE__ and __LINE__), so we can have at most one assertion per source line. This is generally not a problem, although I did need to recode one test that used macros to generate assertions (replacing the macros with a template). In certain cases, mainly involving nested macros, it also means gcc doesn't let us distinguish multiple assertions, for an obscure reason. But those situations are not very common.
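To make the basic idea concrete, here is a minimal sketch of how each assertion site could get a statically allocated, constant-initialized record plus an "executed" flag. This is illustrative only and is not the code in D97566: it assumes an ELF target (Linux with gcc or clang), relies on the GNU section attribute plus the linker-provided __start_/__stop_ symbols to enumerate the records, and the names Record, RGT_TRACK, and rgtReportRotten are invented for the sketch. A Windows/MSVC build would need a different way to place and walk the records, which this sketch does not attempt.

  #include <cstdio>

  namespace rgt_sketch {
  // One record per assertion site, identified by file and line.
  struct Record {
    const char *File;
    int Line;
    bool Executed;
  };
  } // namespace rgt_sketch

  // Each expansion constant-initializes one Record into the custom
  // "rgt_records" section, so the record is present in the binary whether
  // or not the statement ever runs; the assignment marks it executed when
  // control actually reaches the assertion site.
  #define RGT_TRACK()                                                        \
    do {                                                                     \
      static rgt_sketch::Record RgtRec __attribute__((                       \
          used, section("rgt_records"))) = {__FILE__, __LINE__, false};      \
      RgtRec.Executed = true;                                                \
    } while (0)

  // GNU linkers define these symbols bracketing the "rgt_records" section.
  extern "C" rgt_sketch::Record __start_rgt_records[];
  extern "C" rgt_sketch::Record __stop_rgt_records[];

  // Called after the tests finish: report every assertion site that was
  // compiled in but never reached, and return the count.
  inline int rgtReportRotten() {
    int Rotten = 0;
    for (rgt_sketch::Record *R = __start_rgt_records;
         R != __stop_rgt_records; ++R) {
      if (!R->Executed) {
        std::fprintf(stderr, "rotten assertion at %s:%d\n", R->File, R->Line);
        ++Rotten;
      }
    }
    return Rotten;
  }

In a scheme like this, a googletest assertion macro would expand RGT_TRACK() (or an equivalent) at the start of its existing body, and the test driver would call rgtReportRotten() after RUN_ALL_TESTS() and fold a nonzero count into the exit status.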
There are a noticeable number of false positives, with two primary sources. One is that googletest has a way to mark a test as DISABLED; such a test is still compiled, although never run, so all of its assertions show up as rotten. The other is the common LLVM practice of making environmental decisions at runtime rather than at compile time; for example, using something like 'if (isWindows())' rather than an #ifdef. I've recoded some of the easier cases to use #ifdef, in order to reduce the noise.

Some of the noise appears to be irreducible, which means that if we don't want bots to be constantly red, RGT reporting has to be off by default.

Well... actually... it is ON by default; however, I turn it off in lit. So, if you run `check-llvm` or use `llvm-lit` to run unittests, they won't report rotten green tests. However, if you run a program directly, it will report them (and cause the test program to exit with a failure status). This seemed like a reasonable balance that would make RGT useful while developing a test, without interfering with automation.

The overall results are quite satisfying; there are many true positives, generally representing coding errors within the tests. A half-dozen of the unittests have been fixed, with more to come, and the RGT patch itself is at: https://reviews.llvm.org/D97566

Thanks,
--paulr
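As an illustration of the second false-positive source described above, the following sketch contrasts the runtime platform check with the #ifdef recoding. Nothing here comes from the patch or from any existing unittest; the PathSketch tests and the getSeparator() and runningOnWindows() helpers are invented for the example.

  #include "gtest/gtest.h"

  // Invented stand-ins for the code under test and a runtime platform
  // check, defined here only to keep the sketch self-contained.
  static char getSeparator() {
  #ifdef _WIN32
    return '\\';
  #else
    return '/';
  #endif
  }

  static bool runningOnWindows() {
  #ifdef _WIN32
    return true;
  #else
    return false;
  #endif
  }

  // Runtime check: both EXPECTs are compiled into every build, so on a
  // Linux run the Windows-side assertion is never executed and RGT flags
  // it as rotten even though the test is green.
  TEST(PathSketch, SeparatorRuntimeCheck) {
    if (runningOnWindows())
      EXPECT_EQ('\\', getSeparator());
    else
      EXPECT_EQ('/', getSeparator());
  }

  // Compile-time check: only the assertion that can actually run is in
  // the binary, so there is nothing left over for RGT to report.
  TEST(PathSketch, SeparatorIfdef) {
  #ifdef _WIN32
    EXPECT_EQ('\\', getSeparator());
  #else
    EXPECT_EQ('/', getSeparator());
  #endif
  }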
David Blaikie via llvm-dev
2021-Feb-26 20:23 UTC
[llvm-dev] [cfe-dev] Rotten Green Tests project
Initial gut reaction: this is perhaps a big enough patch/divergence from upstream gtest that it should go into upstream gtest first, and then maybe we sync a more recent gtest into LLVM? Though I realize that's a bit of a big undertaking (either or both of those steps). How does this compare to other local patches to gtest we have?
Michael Kruse via llvm-dev
2021-Feb-26 21:01 UTC
[llvm-dev] [cfe-dev] Rotten Green Tests project
Could one also run llvm-cov/gcov and look for unexecuted lines? What is the advantage of this approach?

Michael
James Henderson via llvm-dev
2021-Mar-01 08:32 UTC
[llvm-dev] [cfe-dev] Rotten Green Tests project
The overall concept seems interesting to me. Anything that helps reduce problems in tests that could obscure bugs, etc., is worth serious consideration, in my opinion.

On Fri, 26 Feb 2021 at 18:47, via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> Well... actually... it is ON by default; however, I turn it off in
> lit. So, if you run `check-llvm` or use `llvm-lit` to run unittests,
> they won't report rotten green tests. However, if you run a program
> directly, it will report them (and cause the test program to exit with
> a failure status). This seemed like a reasonable balance that would
> make RGT useful while developing a test, without interfering with
> automation.

When writing googletest unit tests, I almost always run the test executable directly. This is because it's by far the easiest way to run the test and debug it in Visual Studio ("Set startup project" -> F5). I wouldn't be happy if this started showing false test failures in some form or other, unless someone can point to an equally simple way of doing the same thing.

James
Thanks, Michael!

> Could one also run llvm-cov/gcov and look for unexecuted lines? What
> is the advantage of this approach?

One could do that; however, it is quite clear no-one *has* done that. The advantage of this approach is that it is automatic, and happens while you are writing/modifying the test, instead of perhaps years later, if ever.

--paulr