This note describes the first part of the Rotten Green Tests project.

"Rotten Green Tests" is the title of a paper presented at the 2019 International Conference on Software Engineering (ICSE). Stripped to its essentials, the paper describes a method to identify defects or oversights in executable tests. The method has two steps:

(a) Statically identify all "test assertions" in the test program.
(b) Dynamically determine whether these assertions are actually executed.

A test assertion that has been coded but is never executed is termed a "rotten green" test, because it allows the test to be green (i.e., pass) without actually enforcing the assertion. In many cases it is not immediately obvious, just by reading the code, that the test has a problem; the Rotten Green Test method helps identify these cases.

The paper describes using this method on projects coded in Pharo (which appears to be a Smalltalk descendant), so the specific tools are obviously not applicable to a C++ project such as LLVM. However, the concept transfers easily.

I applied these ideas to the Clang and LLVM unittests, because these are all executable tests that use the googletest infrastructure. In particular, all "test assertions" are easily identified because they use macros defined by googletest; by modifying these macros, it is feasible to keep track of all assertions and report whether they have been executed.

The mildly gory details can be saved for the code review and of course an LLVM Dev Meeting talk, but the basic idea is: each test-assertion macro statically allocates a struct identifying the source location of the macro, and includes an executable statement that records when the assertion is actually executed. Then, when the test program exits, we look for any records that were never marked as executed, and report them.

I've gotten this to work in three environments so far:
1) Linux, with gcc as the build compiler
2) Linux, with clang as the build compiler
3) Windows, with MSVC as the build compiler

The results are not identical across the three environments. Besides the obvious case that some tests simply don't run on both Linux and Windows, there are some subtleties that make the infrastructure work less well with gcc than with clang.

The infrastructure depends on certain practices in coding the tests.

First and foremost, it depends on tests being coded to use the googletest macros (EXPECT_* and ASSERT_*) to express individual test assertions. This is generally true in the unittests, although not as universal as might be hoped; ClangRenameTests, for example, buries a handful of test assertions inside helper methods. That is a plausible coding tactic, but it makes the RGT infrastructure less useful, because many separate tests funnel through the same EXPECT/ASSERT macros and RGT cannot discern whether any of those higher-level tests are rotten.

Secondly, "source location" is constrained to filename and line number (__FILE__ and __LINE__), so we can have at most one assertion per source line. This is generally not a problem, although I did need to recode one test that used macros to generate assertions (replacing the macros with a template). In certain cases, mainly involving nested macros, it also means gcc doesn't let us distinguish multiple assertions, for an obscure reason. But those situations are not very common.
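To make the basic idea concrete, here is a minimal sketch of how each assertion site could get a statically allocated, constant-initialized record plus an "executed" flag. This is illustrative only and is not the code in D97566: it assumes an ELF target (Linux with gcc or clang), relies on the GNU section attribute plus the linker-provided __start_/__stop_ symbols to enumerate the records, and the names Record, RGT_TRACK, and rgtReportRotten are invented for the sketch. A Windows/MSVC build would need a different way to place and walk the records, which this sketch does not attempt.

  #include <cstdio>

  namespace rgt_sketch {
  // One record per assertion site, identified by file and line.
  struct Record {
    const char *File;
    int Line;
    bool Executed;
  };
  } // namespace rgt_sketch

  // Each expansion constant-initializes one Record into the custom
  // "rgt_records" section, so the record is present in the binary whether
  // or not the statement ever runs; the assignment marks it executed when
  // control actually reaches the assertion site.
  #define RGT_TRACK()                                                        \
    do {                                                                     \
      static rgt_sketch::Record RgtRec __attribute__((                       \
          used, section("rgt_records"))) = {__FILE__, __LINE__, false};      \
      RgtRec.Executed = true;                                                \
    } while (0)

  // GNU linkers define these symbols bracketing the "rgt_records" section.
  extern "C" rgt_sketch::Record __start_rgt_records[];
  extern "C" rgt_sketch::Record __stop_rgt_records[];

  // Called after the tests finish: report every assertion site that was
  // compiled in but never reached, and return the count.
  inline int rgtReportRotten() {
    int Rotten = 0;
    for (rgt_sketch::Record *R = __start_rgt_records;
         R != __stop_rgt_records; ++R) {
      if (!R->Executed) {
        std::fprintf(stderr, "rotten assertion at %s:%d\n", R->File, R->Line);
        ++Rotten;
      }
    }
    return Rotten;
  }

In a scheme like this, a googletest assertion macro would expand RGT_TRACK() (or an equivalent) at the start of its existing body, and the test driver would call rgtReportRotten() after RUN_ALL_TESTS() and fold a nonzero count into the exit status.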
There are a noticeable number of false positives, with two primary sources. One is that googletest has a way to mark a test as DISABLED; such a test is still compiled, although never run, so all of its assertions show up as rotten. The other is the common LLVM practice of making environmental decisions at runtime rather than at compile time; for example, using something like 'if (isWindows())' rather than an #ifdef. I've recoded some of the easier cases to use #ifdef, in order to reduce the noise.

Some of the noise appears to be irreducible, which means that if we don't want bots to be constantly red, RGT reporting has to be off by default.

Well... actually... it is ON by default; however, I turn it off in lit. So, if you run `check-llvm` or use `llvm-lit` to run unittests, they won't report rotten green tests. However, if you run a program directly, it will report them (and cause the test program to exit with a failure status). This seemed like a reasonable balance that would make RGT useful while developing a test, without interfering with automation.

The overall results are quite satisfying; there are many true positives, generally representing coding errors within the tests. A half-dozen of the unittests have been fixed, with more to come, and the RGT patch itself is at: https://reviews.llvm.org/D97566

Thanks,
--paulr
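As an illustration of the second false-positive source described above, the following sketch contrasts the runtime platform check with the #ifdef recoding. Nothing here comes from the patch or from any existing unittest; the PathSketch tests and the getSeparator() and runningOnWindows() helpers are invented for the example.

  #include "gtest/gtest.h"

  // Invented stand-ins for the code under test and a runtime platform
  // check, defined here only to keep the sketch self-contained.
  static char getSeparator() {
  #ifdef _WIN32
    return '\\';
  #else
    return '/';
  #endif
  }

  static bool runningOnWindows() {
  #ifdef _WIN32
    return true;
  #else
    return false;
  #endif
  }

  // Runtime check: both EXPECTs are compiled into every build, so on a
  // Linux run the Windows-side assertion is never executed and RGT flags
  // it as rotten even though the test is green.
  TEST(PathSketch, SeparatorRuntimeCheck) {
    if (runningOnWindows())
      EXPECT_EQ('\\', getSeparator());
    else
      EXPECT_EQ('/', getSeparator());
  }

  // Compile-time check: only the assertion that can actually run is in
  // the binary, so there is nothing left over for RGT to report.
  TEST(PathSketch, SeparatorIfdef) {
  #ifdef _WIN32
    EXPECT_EQ('\\', getSeparator());
  #else
    EXPECT_EQ('/', getSeparator());
  #endif
  }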
David Blaikie via llvm-dev
2021-Feb-26 20:23 UTC
[llvm-dev] [cfe-dev] Rotten Green Tests project
Initial gut reaction: this is perhaps a big enough patch/divergence from upstream gtest that it should go into upstream gtest first, and then maybe we sync a more recent gtest into LLVM? Though I realize that's a bit of a big undertaking (either or both of those steps). How does this compare to other local patches to gtest we have?
Michael Kruse via llvm-dev
2021-Feb-26 21:01 UTC
[llvm-dev] [cfe-dev] Rotten Green Tests project
Could one also run llvm-cov/gcov and look for unexecuted lines? What is the advantage of this approach?

Michael
James Henderson via llvm-dev
2021-Mar-01 08:32 UTC
[llvm-dev] [cfe-dev] Rotten Green Tests project
The overall concept seems interesting to me. Anything that helps reduce problems in tests that could obscure bugs, etc., is worth serious consideration, in my opinion.

On Fri, 26 Feb 2021 at 18:47, via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> Well... actually... it is ON by default; however, I turn it off in
> lit. So, if you run `check-llvm` or use `llvm-lit` to run unittests,
> they won't report rotten green tests. However, if you run a program
> directly, it will report them (and cause the test program to exit with
> a failure status). This seemed like a reasonable balance that would
> make RGT useful while developing a test, without interfering with
> automation.

When writing googletest unit tests, I almost always run the test executable directly. This is because it's by far the easiest way to run the test and debug it in Visual Studio ("Set startup project" -> F5). I wouldn't be happy if this started showing false test failures in some form or other, unless someone can point to an equally simple way of doing the same thing.

James
Thanks, Michael!

> Could one also run llvm-cov/gcov and look for unexecuted lines? What
> is the advantage of this approach?

One could do that; however, it is quite clear no-one *has* done that. The advantage of this approach is that it is automatic, and happens while you are writing/modifying the test, instead of perhaps years later, if ever.

--paulr