Paul Robinson via llvm-dev
2018-Mar-20 20:30 UTC
[llvm-dev] [RFC] Updating googletest to non-release tagged version
>> In my particular case from https://reviews.llvm.org/D44560, I currently
>> test the following 3 different cases across the full set of DWARF
>> versions and formats:
>> - Parsing a valid line table
>> - Emitting an error if the stated prologue length is greater than the
>>   actual length
>> - Emitting an error if the stated prologue length is shorter than the
>>   actual length
>>
>> The first is just testing the positive cases for each possible input. I
>> guess a single test for DWARF64 could be written for versions 2-4 and
>> another for version 5 (where there is more stuff between the two length
>> fields so this becomes interesting), similarly for DWARF32. To summarise,
>> I think that these cases are interesting: V5 + 32, V5 + 64, V2 + 32/64,
>> V3 + 32/64, V4 + 32/64. The biggest issue I have with cutting down to
>> just this set is that it makes the tests less specific, e.g. at a glance,
>> what is important about the test - the fact that it is v4, or DWARF64,
>> or both independently, or the combination?
>>
>> Related aside: I've realised since earlier that there is scope for
>> version 2 tests, distinct from version 3: v2 tests test the lower
>> boundary on valid versions, and v3 the upper boundary on versions
>> without maximum_operations_per_instruction.
>>
>> The latter two test cases are important because a) the length field has
>> a different size for DWARF32/64 and therefore the prologue length needs
>> to be measured from a different point between the different formats, and
>> b) the contents of the prologue are different in each of versions 3, 4,
>> and 5, and thus the amount read will be different. We could test each
>> individual version, independently of the format, but it is theoretically
>> possible for an error to sneak in whereby the two different failure
>> points cancel each other out. The benefit is admittedly small, but these
>> tests are fast, so I don't think it hurts to have them.
>
> Still not sure I follow all of this - though perhaps the test design
> discussion might be better in its own thread. But the broad subject/topic/
> name of the stuff I'm interested in applying here is equivalence
> partitioning: https://en.wikipedia.org/wiki/Equivalence_partitioning -
> it helps explain the general idea of reducing "test all combinations/
> permutations of cases" to "test representative cases from each class".
>
> - Dave

Equivalence classes are a fine way to reduce duplication in manually
written tests. They tend to depend on a white-box understanding of the
system-under-test to define the equivalences reasonably. LLVM regression
tests tend to be written very much white-box, as it's the developer who
is writing the test. :-) The problem there is that as the code evolves,
the original equivalence classes may change, but that doesn't necessarily
make any tests fail; it just ends up with reduced coverage. Without good
coverage analysis we can easily find tests no longer verifying everything
we wanted them to.

(And in fact back at HP I had an interesting chat with a product owner
who was very much a fan of combinatoric testing; he said every single
time he added a new dimension, he found new bugs, and didn't think he
could have predicted which instances would have found the bugs.)

That said, in the case at hand I think looking at the testing at the
granularity of "parsing a line table" is too coarse.
There are several different bits to it and we can reasonably anticipate
that they will remain relatively independent bits:
- detecting the 32/64 format
- extracting the version-dependent sets of fields accurately
- detecting mismatches between the stated length and the internally
  consistent length (e.g., the header says length X, but the file
  table has N entries which runs it to length > X).

To the extent that new DWARF versions have different fields, we should
test that the new fields are extracted and reported (the success cases).
If the new fields are 32/64-sensitive, those should be tested in both
formats to make sure that sensitivity is handled (more success cases).
The notion of running past the stated length (conversely, finishing the
parsing before reaching the stated length) intuitively isn't either
version-sensitive or format-sensitive, i.e. all those cases look like
they should be part of the same equivalence class, so you don't need
to replicate those across all versions and formats. With the caveat
that if white-box examination shows that some case takes a very
different path, it's worth replicating the test. In particular the
v5 directory/file tables are very different from prior versions, so
it's worth having separate bad-length tests for pre-v5 and v5.

--paulr
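For reference, the googletest feature under discussion is ::testing::Combine,
which feeds the cross product of several parameter generators into a
value-parameterized test. A minimal sketch of how it could express the
version-times-format matrix follows; the fixture name, test name, and test
body are hypothetical and not taken from D44560:

  #include "gtest/gtest.h"

  #include <tuple>

  // Hypothetical fixture, parameterized over (DWARF version, DWARF64?).
  class LineTablePrologueTest
      : public ::testing::TestWithParam<std::tuple<unsigned, bool>> {};

  TEST_P(LineTablePrologueTest, ParsesValidPrologue) {
    unsigned Version;
    bool IsDWARF64;
    std::tie(Version, IsDWARF64) = GetParam();
    // Build a prologue for this (Version, IsDWARF64) combination and check
    // that it parses without error (details omitted in this sketch).
    EXPECT_TRUE(Version >= 2 && Version <= 5);
    (void)IsDWARF64;
  }

  // Combine() yields the full cross product: versions 2-5 x {DWARF32,
  // DWARF64}, i.e. eight instantiations of each TEST_P above.
  INSTANTIATE_TEST_CASE_P(
      AllVersionsAndFormats, LineTablePrologueTest,
      ::testing::Combine(::testing::Values(2u, 3u, 4u, 5u),
                         ::testing::Bool()));

The googletest copy bundled with LLVM at the time apparently did not support
Combine, which is what motivated the proposed update in this RFC.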
James Henderson via llvm-dev
2018-Mar-22 10:59 UTC
[llvm-dev] [RFC] Updating googletest to non-release tagged version
Thanks for the comments all. I've thought about it and I agree that I can
probably reduce the number of test cases in my use case, so that it is not
combinatorial.

I guess the broader question is still there as to whether a) people would
be opposed to updating to a non-official-release version of googletest,
and b) whether it would be worthwhile doing so. As I noted previously,
there is at least one other place in the tests that said it would like to
use Combine, so it still seems like there would be some benefit in it,
although I admit that I haven't looked to make sure that this is a
reasonable request there.

James

On 20 March 2018 at 20:30, <paul.robinson at sony.com> wrote:
> [...]
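By contrast, the equivalence-partitioning suggestion maps onto hand-picked
representative cases, which plain ::testing::Values can already express
without Combine. Another sketch, reusing the hypothetical
LineTablePrologueTest fixture from the earlier example; the particular
selection of representatives is illustrative only:

  // Hand-picked representatives of each equivalence class rather than the
  // full cross product; the choice of cases here is illustrative only.
  INSTANTIATE_TEST_CASE_P(
      RepresentativeCases, LineTablePrologueTest,
      ::testing::Values(
          std::make_tuple(2u, false),   // lower version bound, DWARF32
          std::make_tuple(3u, true),    // pre-v4 fields, DWARF64
          std::make_tuple(4u, false),   // v4 fields, DWARF32
          std::make_tuple(5u, false),   // v5 tables, DWARF32
          std::make_tuple(5u, true)));  // v5 tables, DWARF64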
David Blaikie via llvm-dev
2018-Mar-22 15:03 UTC
[llvm-dev] [RFC] Updating googletest to non-release tagged version
Doesn't sound like there's opposition - can't say whether or not it's
worth it, but if you're willing to give it a go/send a patch, probably
can't hurt. Might be a bit of work to get all the corners right - I think
there are a few local patches to the gtest in llvm, unfortunately, that
you'd have to port over, etc.

On Thu, Mar 22, 2018 at 3:59 AM James Henderson
<jh7370.2008 at my.bristol.ac.uk> wrote:
> [...]