thr3ads.net - llvm dev - [llvm-dev] [RFC] Updating googletest to non-release tagged version [Mar 2018]

If this information is useful, please help other people find it:
Share via:

via llvm-dev

2018-Mar-20 20:30 UTC

[llvm-dev] [RFC] Updating googletest to non-release tagged version

>> In my particular case from https://reviews.llvm.org/D44560, I currently
>> test the following 3 different cases across the full set of DWARF
>> versions and formats:
>>  - Parsing a valid line table
>>  - Emitting an error if the stated prologue length is greater than the
>>    actual length
>>  - Emitting an error if the stated prologue length is shorter than the
>>    actual length
>> The first is just testing the positive cases for each possible input. I
>> guess a single test for DWARF64 could be written for versions 2-4 and
>> another for version 5 (where there is more stuff between the two length
>> fields so this becomes interesting), similarly for DWARF32. To
summarise,
>> I think that these cases are interesting: V5 + 32, V5 + 64, V2 + 32/64,
>> V3 + 32/64, V4 + 32/64. The biggest issue I have with cutting down to
>> just this set is that it makes the tests less specific, e.g. at a
glance,
>> what is important about the test - the fact that it is v4, or DWARF64,
>> or both independently, or the combination?
>>
>> Related aside: I've realised since earlier that there is scope for
>> version 2 tests, distinct from version 3: v2 tests test the lower
>> boundary on valid versions, and v3 the upper boundary on versions
>> without maximum_operations_per_instruction.
>>
>> The latter two test cases are important because a) the length field has
>> a different size for DWARF32/64 and therefore the prologue length needs
>> to be measured from a different point between the different formats,
and
>> b) the contents of the prologue are different in each of version 3, 4,
>> and 5, and thus the amount read will be different. We could test each
												|>> individual version, independently of the format, but it is
theoretically
>> possible for an error to sneak in whereby the two different failure
>> points cancel each other out. The benefit is admittedly small, but
these
>> tests are fast, so I don't think it hurts to have them.
>
> Still not sure I follow all of this - though perhaps the test design
> discussion might be better in its own thread. But the broad subject/topic/
> name of the stuff I'm interested in applying here is equivalence
> partitioning: https://en.wikipedia.org/wiki/Equivalence_partitioning -
> helps explain the general idea of reducing "test all combinations/
> permutations of cases" to "test representative cases from each
class".
>
> - Dave 
Equivalence classes are a fine way to reduce duplication in manually
written tests.  They tend to depend on a white-box understanding of the
system-under-test to define the equivalences reasonably.  LLVM regression
tests tend to be written very much white-box, as it's the developer who
is writing the test. :-)  The problem there is that as the code evolves,
the original equivalence classes may change, but that doesn't necessarily
make any tests fail, it just ends up with reduced coverage.  Without good
coverage analysis we can easily find tests no longer verifying everything
we wanted them to.

(And in fact back at HP I had an interesting chat with a product owner
who was very much a fan of combinatoric testing; he said every single
time he added a new dimension, he found new bugs, and didn't think he
could have predicted which instances would have found the bugs.)

That said, in the case at hand I think looking at the testing at the
granularity of "parsing a line table" is too coarse.  There are
several
different bits to it and we can reasonably anticipate that they will
remain relatively independent bits:
- detecting the 32/64 format
- extracting the version-dependent sets of fields accurately
- detecting mismatches between the stated length and the internally
  consistent length (e.g., the header says length X, but the file
  table has N entries which runs it to length > X).

To the extent that new DWARF versions have different fields, we should
test that the new fields are extracted and reported (the success cases).
If the new fields are 32/64-sensitive, those should be tested in both 
formats to make sure that sensitivity is handled (more success cases).
The notion of running past the stated length (conversely, finishing the
parsing before reaching the stated length) intuitively isn't either
version-sensitive or format-sensitive, i.e. all those cases look like
they should be part of the same equivalence class, so you don't need
to replicate those across all versions and formats.  With the caveat
that if white-box examination shows that some case takes a very
different path, it's worth replicating the test.  In particular the
v5 directory/file tables are very different from prior versions, so
it's worth having separate bad-length tests for pre-v5 and v5.

--paulr

James Henderson via llvm-dev

2018-Mar-22 10:59 UTC

head link

[llvm-dev] [RFC] Updating googletest to non-release tagged version

Thanks for the comments all. I've thought about it and I agree that I can
probably reduce the number of test cases, so that it is not combinatorial,
in my use case.

I guess the broader question still is there as to whether a) people would
be opposed to updating to a non-official-release version of googletest, and
b) whether it would be worthwhile doing so. As I noted previously, there is
at least one other place in the tests that said it would like to use
Combine, so it still seems like there would be some benefit in it, although
I admit that I haven't looked to make sure that this is a reasonable
request there.

James

On 20 March 2018 at 20:30, <paul.robinson at sony.com> wrote:
> >> In my particular case from https://reviews.llvm.org/D44560, I
currently
> >> test the following 3 different cases across the full set of DWARF
> >> versions and formats:
> >>  - Parsing a valid line table
> >>  - Emitting an error if the stated prologue length is greater than
the
> >>    actual length
> >>  - Emitting an error if the stated prologue length is shorter than
the
> >>    actual length
> >> The first is just testing the positive cases for each possible
input. I
> >> guess a single test for DWARF64 could be written for versions 2-4
and
> >> another for version 5 (where there is more stuff between the two
length
> >> fields so this becomes interesting), similarly for DWARF32. To
> summarise,
> >> I think that these cases are interesting: V5 + 32, V5 + 64, V2 +
32/64,
> >> V3 + 32/64, V4 + 32/64. The biggest issue I have with cutting down
to
> >> just this set is that it makes the tests less specific, e.g. at a
> glance,
> >> what is important about the test - the fact that it is v4, or
DWARF64,
> >> or both independently, or the combination?
> >>
> >> Related aside: I've realised since earlier that there is scope
for
> >> version 2 tests, distinct from version 3: v2 tests test the lower
> >> boundary on valid versions, and v3 the upper boundary on versions
> >> without maximum_operations_per_instruction.
> >>
> >> The latter two test cases are important because a) the length
field has
> >> a different size for DWARF32/64 and therefore the prologue length
needs
> >> to be measured from a different point between the different
formats, and
> >> b) the contents of the prologue are different in each of version
3, 4,
> >> and 5, and thus the amount read will be different. We could test
each
>
>                       |
> >> individual version, independently of the format, but it is
theoretically
> >> possible for an error to sneak in whereby the two different
failure
> >> points cancel each other out. The benefit is admittedly small, but
these
> >> tests are fast, so I don't think it hurts to have them.
> >
> > Still not sure I follow all of this - though perhaps the test design
> > discussion might be better in its own thread. But the broad
> subject/topic/
> > name of the stuff I'm interested in applying here is equivalence
> > partitioning: https://en.wikipedia.org/wiki/Equivalence_partitioning -
> > helps explain the general idea of reducing "test all
combinations/
> > permutations of cases" to "test representative cases from
each class".
> >
> > - Dave
>
> Equivalence classes are a fine way to reduce duplication in manually
> written tests.  They tend to depend on a white-box understanding of the
> system-under-test to define the equivalences reasonably.  LLVM regression
> tests tend to be written very much white-box, as it's the developer who
> is writing the test. :-)  The problem there is that as the code evolves,
> the original equivalence classes may change, but that doesn't
necessarily
> make any tests fail, it just ends up with reduced coverage.  Without good
> coverage analysis we can easily find tests no longer verifying everything
> we wanted them to.
>
> (And in fact back at HP I had an interesting chat with a product owner
> who was very much a fan of combinatoric testing; he said every single
> time he added a new dimension, he found new bugs, and didn't think he
> could have predicted which instances would have found the bugs.)
>
> That said, in the case at hand I think looking at the testing at the
> granularity of "parsing a line table" is too coarse.  There are
several
> different bits to it and we can reasonably anticipate that they will
> remain relatively independent bits:
> - detecting the 32/64 format
> - extracting the version-dependent sets of fields accurately
> - detecting mismatches between the stated length and the internally
>   consistent length (e.g., the header says length X, but the file
>   table has N entries which runs it to length > X).
>
> To the extent that new DWARF versions have different fields, we should
> test that the new fields are extracted and reported (the success cases).
> If the new fields are 32/64-sensitive, those should be tested in both
> formats to make sure that sensitivity is handled (more success cases).
> The notion of running past the stated length (conversely, finishing the
> parsing before reaching the stated length) intuitively isn't either
> version-sensitive or format-sensitive, i.e. all those cases look like
> they should be part of the same equivalence class, so you don't need
> to replicate those across all versions and formats.  With the caveat
> that if white-box examination shows that some case takes a very
> different path, it's worth replicating the test.  In particular the
> v5 directory/file tables are very different from prior versions, so
> it's worth having separate bad-length tests for pre-v5 and v5.
>
> --paulr
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180322/d3513994/attachment.html>

David Blaikie via llvm-dev

2018-Mar-22 15:03 UTC

head link

[llvm-dev] [RFC] Updating googletest to non-release tagged version

Doesn't sound like there's opposition - can't say whether or not
it's wort
hit, but if you're willing to give it a go/send a patch, probably can't
hurt. Might be a bit of work to get all the corners right - I think there
are a few local patches to the gtest in llvm, unfortunately, that you'd
have to port over, etc.

On Thu, Mar 22, 2018 at 3:59 AM James Henderson <
jh7370.2008 at my.bristol.ac.uk> wrote:
> Thanks for the comments all. I've thought about it and I agree that I
can
> probably reduce the number of test cases, so that it is not combinatorial,
> in my use case.
>
> I guess the broader question still is there as to whether a) people would
> be opposed to updating to a non-official-release version of googletest, and
> b) whether it would be worthwhile doing so. As I noted previously, there is
> at least one other place in the tests that said it would like to use
> Combine, so it still seems like there would be some benefit in it, although
> I admit that I haven't looked to make sure that this is a reasonable
> request there.
>
> James
>
> On 20 March 2018 at 20:30, <paul.robinson at sony.com> wrote:
>
>> >> In my particular case from https://reviews.llvm.org/D44560, I
>> currently
>> >> test the following 3 different cases across the full set of
DWARF
>> >> versions and formats:
>> >>  - Parsing a valid line table
>> >>  - Emitting an error if the stated prologue length is greater
than the
>> >>    actual length
>> >>  - Emitting an error if the stated prologue length is shorter
than the
>> >>    actual length
>> >> The first is just testing the positive cases for each possible
input. I
>> >> guess a single test for DWARF64 could be written for versions
2-4 and
>> >> another for version 5 (where there is more stuff between the
two length
>> >> fields so this becomes interesting), similarly for DWARF32. To
>> summarise,
>> >> I think that these cases are interesting: V5 + 32, V5 + 64, V2
+ 32/64,
>> >> V3 + 32/64, V4 + 32/64. The biggest issue I have with cutting
down to
>> >> just this set is that it makes the tests less specific, e.g.
at a
>> glance,
>> >> what is important about the test - the fact that it is v4, or
DWARF64,
>> >> or both independently, or the combination?
>> >>
>> >> Related aside: I've realised since earlier that there is
scope for
>> >> version 2 tests, distinct from version 3: v2 tests test the
lower
>> >> boundary on valid versions, and v3 the upper boundary on
versions
>> >> without maximum_operations_per_instruction.
>> >>
>> >> The latter two test cases are important because a) the length
field has
>> >> a different size for DWARF32/64 and therefore the prologue
length needs
>> >> to be measured from a different point between the different
formats,
>> and
>> >> b) the contents of the prologue are different in each of
version 3, 4,
>> >> and 5, and thus the amount read will be different. We could
test each
>>
>>                       |
>> >> individual version, independently of the format, but it is
>> theoretically
>> >> possible for an error to sneak in whereby the two different
failure
>> >> points cancel each other out. The benefit is admittedly small,
but
>> these
>> >> tests are fast, so I don't think it hurts to have them.
>> >
>> > Still not sure I follow all of this - though perhaps the test
design
>> > discussion might be better in its own thread. But the broad
>> subject/topic/
>> > name of the stuff I'm interested in applying here is
equivalence
>> > partitioning:
https://en.wikipedia.org/wiki/Equivalence_partitioning -
>> > helps explain the general idea of reducing "test all
combinations/
>> > permutations of cases" to "test representative cases
from each class".
>> >
>> > - Dave
>>
>> Equivalence classes are a fine way to reduce duplication in manually
>> written tests.  They tend to depend on a white-box understanding of the
>> system-under-test to define the equivalences reasonably.  LLVM
regression
>> tests tend to be written very much white-box, as it's the developer
who
>> is writing the test. :-)  The problem there is that as the code
evolves,
>> the original equivalence classes may change, but that doesn't
necessarily
>> make any tests fail, it just ends up with reduced coverage.  Without
good
>> coverage analysis we can easily find tests no longer verifying
everything
>> we wanted them to.
>>
>> (And in fact back at HP I had an interesting chat with a product owner
>> who was very much a fan of combinatoric testing; he said every single
>> time he added a new dimension, he found new bugs, and didn't think
he
>> could have predicted which instances would have found the bugs.)
>>
>> That said, in the case at hand I think looking at the testing at the
>> granularity of "parsing a line table" is too coarse.  There
are several
>> different bits to it and we can reasonably anticipate that they will
>> remain relatively independent bits:
>> - detecting the 32/64 format
>> - extracting the version-dependent sets of fields accurately
>> - detecting mismatches between the stated length and the internally
>>   consistent length (e.g., the header says length X, but the file
>>   table has N entries which runs it to length > X).
>>
>> To the extent that new DWARF versions have different fields, we should
>> test that the new fields are extracted and reported (the success
cases).
>> If the new fields are 32/64-sensitive, those should be tested in both
>> formats to make sure that sensitivity is handled (more success cases).
>> The notion of running past the stated length (conversely, finishing the
>> parsing before reaching the stated length) intuitively isn't either
>> version-sensitive or format-sensitive, i.e. all those cases look like
>> they should be part of the same equivalence class, so you don't
need
>> to replicate those across all versions and formats.  With the caveat
>> that if white-box examination shows that some case takes a very
>> different path, it's worth replicating the test.  In particular the
>> v5 directory/file tables are very different from prior versions, so
>> it's worth having separate bad-length tests for pre-v5 and v5.
>>
>> --paulr
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180322/4e9e09d1/attachment.html>

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Mar 2018 - [RFC] Updating googletest to non-release tagged version

[llvm-dev] [RFC] Updating googletest to non-release tagged version

[llvm-dev] [RFC] Updating googletest to non-release tagged version

[llvm-dev] [RFC] Updating googletest to non-release tagged version

Apparently Analagous Threads