Robinson, Paul via llvm-dev
2020-Jul-01 15:18 UTC
[llvm-dev] [RFC] Compiled regression tests.
The test as written is fragile because it requires a certain ordering. If the output order is not important, use CHECK-DAG rather than CHECK. This would be a failure to understand the testing tool.

My experience, over a 40-year career, is that good software developers are generally not very good test-writers. These are different skills, and good testing is frequently not taught. It's easy to write fragile tests: you make your patch, you see what comes out, and you write the test to expect exactly that output, using the minimum possible features of the testing tool. This is poor technique all around. We even have scripts that automatically generate such tests, used primarily in codegen tests. I devoutly hope that the people who produce those tests responsibly eyeball all those cases.

The proposal appears to be to migrate output-based tests (using ever-more-complicated FileCheck features) to executable tests, which makes them more like the software development people are used to instead of test-writing. But I don't see that necessarily solving the problem; it seems like it would be just as easy to write a program that doesn't do the right thing as to write a FileCheck test that doesn't do the right thing.

Hal's suggestion is more to the point: if the output we're generating is not appropriate to the kinds of tests we want to perform, it can be worthwhile to generate different kinds of output. MIR is a case in point; for a long time it was hard to introspect into the interval between IR and final machine code, but now it's a lot easier.

--paulr

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Michael Kruse via llvm-dev
Sent: Wednesday, July 1, 2020 1:40 AM
To: Michael Kruse <llvmdev at meinersbur.de>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] Compiled regression tests.

To illustrate some difficulties with FileCheck, let's make a non-semantic change in LLVM:

    --- a/llvm/lib/Analysis/VectorUtils.cpp
    +++ b/llvm/lib/Analysis/VectorUtils.cpp
    @@ -642,8 +642,8 @@ MDNode *llvm::uniteAccessGroups(MDNode *AccGroups1, MDNode *AccGroups2) {
         return AccGroups1;

       SmallSetVector<Metadata *, 4> Union;
    -  addToAccessGroupList(Union, AccGroups1);
       addToAccessGroupList(Union, AccGroups2);
    +  addToAccessGroupList(Union, AccGroups1);

       if (Union.size() == 0)
         return nullptr;

This changes the order of access groups when merging them. Some background: access groups are used to mark that a side effect (e.g. a memory access) does not limit the parallelization of a loop; they are added by e.g. #pragma clang loop vectorize(assume_safety). Therefore a loop can be assumed to be parallelizable without further dependency analysis if all its side-effect instructions are marked this way (and it has no loop-carried scalar dependencies). An access group represents the instructions marked for a specific loop. A single instruction can be parallel with regard to multiple loops, i.e. belong to multiple access groups. The order of the access groups is insignificant: whether `AccGroups1` or `AccGroups2` comes first, the instruction is marked parallel for both loops.
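The order-sensitivity described here comes from SmallSetVector, which deduplicates like a set but iterates in insertion order. A minimal standalone sketch of that behavior (with ints standing in for the Metadata pointers; the values are hypothetical):

    #include "llvm/ADT/SmallSetVector.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    int main() {
      // Hypothetical stand-ins for the two access-group lists.
      int AccGroups1[] = {1, 2};
      int AccGroups2[] = {2, 3};

      // SmallSetVector drops duplicates but keeps insertion order, so the
      // order of the two loops below determines the order of the result.
      SmallSetVector<int, 4> Union;
      for (int G : AccGroups2) // patched order: AccGroups2 first
        Union.insert(G);
      for (int G : AccGroups1) // the duplicate 2 is silently dropped
        Union.insert(G);

      // Prints "2 3 1"; with the original insertion order (AccGroups1
      // first) it would print "1 2 3". Same set, different order.
      for (int G : Union)
        outs() << G << " ";
      outs() << "\n";
      return 0;
    }

The union is the same either way; only the order in which the groups are later emitted as metadata changes, which is exactly what trips the test below.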
Even though the change should have no effect on correctness, `ninja check-llvm` fails:

    test/Transforms/Inline/parallel-loop-md-merge.ll
    string not found in input
    ; CHECK: ![[ACCESSES_INNER]] = !{!"llvm.loop.parallel_accesses", ![[ACCESS_GROUP_INNER]]}
             ^
    <stdin>:45:1: note: scanning from here
    !7 = !{!"llvm.loop.parallel_accesses", !5}
    ^
    <stdin>:45:1: note: with "ACCESSES_INNER" equal to "7"
    !7 = !{!"llvm.loop.parallel_accesses", !5}
    ^
    <stdin>:45:1: note: with "ACCESS_GROUP_INNER" equal to "4"
    !7 = !{!"llvm.loop.parallel_accesses", !5}
    ^

The line FileCheck points to has nothing to do with the changed order of access groups. (What is the point of annotating the `!7 = ...` line repeatedly? Why not point to the lines where ACCESSES_INNER/ACCESS_GROUP_INNER got their values?)

The CHECK lines, annotated with their line numbers from parallel-loop-md-merge.ll, are:

    68 ; CHECK: store double 4.200000e+01, {{.*}} !llvm.access.group ![[ACCESS_GROUP_LIST_3:[0-9]+]]
    69 ; CHECK: br label %for.cond.i, !llvm.loop ![[LOOP_INNER:[0-9]+]]
    70 ; CHECK: br label %for.cond, !llvm.loop ![[LOOP_OUTER:[0-9]+]]
    71
    72 ; CHECK: ![[ACCESS_GROUP_LIST_3]] = !{![[ACCESS_GROUP_INNER:[0-9]+]], ![[ACCESS_GROUP_OUTER:[0-9]+]]}
    73 ; CHECK: ![[ACCESS_GROUP_INNER]] = distinct !{}
    74 ; CHECK: ![[ACCESS_GROUP_OUTER]] = distinct !{}
    75 ; CHECK: ![[LOOP_INNER]] = distinct !{![[LOOP_INNER]], ![[ACCESSES_INNER:[0-9]+]]}
    76 ; CHECK: ![[ACCESSES_INNER]] = !{!"llvm.loop.parallel_accesses", ![[ACCESS_GROUP_INNER]]}
    77 ; CHECK: ![[LOOP_OUTER]] = distinct !{![[LOOP_OUTER]], ![[ACCESSES_OUTER:[0-9]+]]}
    78 ; CHECK: ![[ACCESSES_OUTER]] = !{!"llvm.loop.parallel_accesses", ![[ACCESS_GROUP_OUTER]]}

And the output of `FileCheck --dump-input -v` is:

    38: !0 = !{!1}
    39: !1 = distinct !{!1, !2, !"callee: %A"}
    40: !2 = distinct !{!2, !"callee"}
    41: !3 = !{!4, !5}
    check:72     ^~~~~~~~~~~~~~
    42: !4 = distinct !{}
    check:73     ^~~~~~~~~~~~~~~~~
    43: !5 = distinct !{}
    check:74     ^~~~~~~~~~~~~~~~~
    44: !6 = distinct !{!6, !7}
    check:75     ^~~~~~~~~~~~~~~~~~~~~~~
    45: !7 = !{!"llvm.loop.parallel_accesses", !5}
    check:76     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
    46: !8 = distinct !{!8, !9}
    check:76     ~~~~~~~~~~~~~~~~~~~~~~~
    47: !9 = !{!"llvm.loop.parallel_accesses", !4}
    check:76     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To fix this regression test, one needs to answer the following questions:

* Is `!0 = !{!1}` supposed to match anything?
* Which of the CHECK lines are the ones that are important for the regression test?
* The `![[SOMETHING]] = distinct !{}` line appears twice. Which one is ACCESS_GROUP_INNER and which is ACCESS_GROUP_OUTER?
* There are two lines with `llvm.loop.parallel_accesses` as well. The author of this regression test was kind enough to give the placeholders descriptive names, but do they match what FileCheck matched?
* LOOP_INNER and LOOP_OUTER are only distinguished by position and `.i`. Why does the inner loop come first?

Before the patch that modified the order, the output of --dump-input=always was:

    38: !0 = !{!1}
    39: !1 = distinct !{!1, !2, !"callee: %A"}
    40: !2 = distinct !{!2, !"callee"}
    41: !3 = !{!4, !5}
    42: !4 = distinct !{}
    43: !5 = distinct !{}
    44: !6 = distinct !{!6, !7}
    45: !7 = !{!"llvm.loop.parallel_accesses", !4}
    46: !8 = distinct !{!8, !9}
    47: !9 = !{!"llvm.loop.parallel_accesses", !5}

This seems to suggest that our change caused !4 and !5 in lines 45 and 47 to swap. But that is not what the patch did; it changed the order of two MDNode arguments.
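For reference, the placeholder mechanics these questions hinge on: `[[NAME:regex]]` captures whatever the regex matches into NAME, and a later bare `[[NAME]]` must match exactly the captured text. A hypothetical two-line example in the style of the test above:

    ; CHECK: ![[GROUP:[0-9]+]] = distinct !{}
    ; CHECK: !{!"llvm.loop.parallel_accesses", ![[GROUP]]}

Whether the test passes therefore depends on which concrete metadata number each placeholder happens to capture first, and that is determined by the emission order the patch changed.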
The only MDNode that has two other MDNodes as operands is the one on line 41 (!6 and !8 represent loops, as they reference themselves as their first argument; noalias scopes do this as well, but there is no noalias metadata in this example). A hasty fix is to change CHECK lines 76 and 78 to

    76 ; CHECK: ![[ACCESSES_INNER]] = !{!"llvm.loop.parallel_accesses", ![[ACCESS_GROUP_OUTER]]}
    78 ; CHECK: ![[ACCESSES_OUTER]] = !{!"llvm.loop.parallel_accesses", ![[ACCESS_GROUP_INNER]]}

However, this would mean that ACCESS_GROUP_OUTER belongs to ACCESSES_INNER and ACCESS_GROUP_INNER to ACCESSES_OUTER, which cannot be true. The reason is that MDNodes are emitted in the topological order of a depth-first search. The patch changed the order of ACCESS_GROUP_INNER and ACCESS_GROUP_OUTER in line 41, and because the emitted IR file lists the first argument of !3 before the second, ACCESS_GROUP_OUTER now ends up on line 42 and ACCESS_GROUP_INNER on line 43.

Even though this regression test uses proper regex-ing of metadata node numbers, it is still fragile in that it fails on irrelevant changes. Once it fails, it requires time to fix properly because it is non-obvious which CHECK lines are intended to match which output lines. The hasty fix would have mixed up ACCESS_GROUP_OUTER/ACCESS_GROUP_INNER, and I wouldn't expect someone who is working on the inliner to take the time to understand all the loop metadata. Set semantics of metadata occur often: noalias metadata, loop properties, tbaa metadata, etc. The difficulty is amplified when there are multiple functions in a regression test: the metadata of all functions is appended at the end of the file and MDNodes with the same content are uniqued, making it impossible to associate particular MDNodes with specific functions.

Ideally, the regression test would be robust and understandable, achievable with two asserts in a unittest:

    Loop &OuterLoop = **LI->begin();
    ASSERT_TRUE(OuterLoop.isAnnotatedParallel());

    Loop &InnerLoop = **OuterLoop.begin();
    ASSERT_TRUE(InnerLoop.isAnnotatedParallel());
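To make that suggestion concrete, here is a minimal sketch of how such a unittest could look. The harness around the two asserts is an assumption (the test name, function name, and IR string are hypothetical; a real test would use the inliner's output for parallel-loop-md-merge.ll), while parseAssemblyString, DominatorTree, LoopInfo, and Loop::isAnnotatedParallel are existing LLVM APIs:

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/AsmParser/Parser.h"
    #include "llvm/IR/Dominators.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/SourceMgr.h"
    #include "gtest/gtest.h"
    using namespace llvm;

    // Hypothetical stand-in for the post-inlining IR: a doubly-nested loop
    // whose store belongs to both access groups (!0 inner, !1 outer).
    static const char *ParallelLoopIR = R"IR(
    define void @func(double* %A) {
    entry:
      br label %outer
    outer:
      %i = phi i64 [ 0, %entry ], [ %i.next, %outer.latch ]
      br label %inner
    inner:
      %j = phi i64 [ 0, %outer ], [ %j.next, %inner ]
      store double 4.200000e+01, double* %A, !llvm.access.group !2
      %j.next = add i64 %j, 1
      %c.inner = icmp ult i64 %j.next, 4
      br i1 %c.inner, label %inner, label %outer.latch, !llvm.loop !3
    outer.latch:
      %i.next = add i64 %i, 1
      %c.outer = icmp ult i64 %i.next, 4
      br i1 %c.outer, label %outer, label %exit, !llvm.loop !5
    exit:
      ret void
    }
    !0 = distinct !{}
    !1 = distinct !{}
    !2 = !{!0, !1}
    !3 = distinct !{!3, !4}
    !4 = !{!"llvm.loop.parallel_accesses", !0}
    !5 = distinct !{!5, !6}
    !6 = !{!"llvm.loop.parallel_accesses", !1}
    )IR";

    TEST(ParallelLoopMDMerge, BothLoopsAnnotatedParallel) {
      LLVMContext Ctx;
      SMDiagnostic Err;
      std::unique_ptr<Module> M = parseAssemblyString(ParallelLoopIR, Err, Ctx);
      ASSERT_TRUE(M);

      Function &F = *M->getFunction("func");
      DominatorTree DT(F);
      LoopInfo LI(DT);

      // The two asserts from above: robust against MDNode numbering,
      // emission order, and uniquing.
      Loop &OuterLoop = **LI.begin();
      ASSERT_TRUE(OuterLoop.isAnnotatedParallel());
      Loop &InnerLoop = **OuterLoop.begin();
      ASSERT_TRUE(InnerLoop.isAnnotatedParallel());
    }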
Michael Kruse via llvm-dev
2020-Jul-01 16:33 UTC
[llvm-dev] [RFC] Compiled regression tests.
On Wed, Jul 1, 2020 at 10:18, Robinson, Paul <paul.robinson at sony.com> wrote:

> The test as written is fragile because it requires a certain ordering. If
> the output order is not important, use CHECK-DAG rather than CHECK. This
> would be a failure to understand the testing tool.

CHECK-DAG does not help here, since what changes is within a list on the same line, and we have no CHECK-SAME-DAG or CHECK-DAG-SAME. Even if we had it, the actual line that changed is textually the same, and FileCheck would need to backtrack deep into the following lines for alternative placeholder substitutions. It would look like

    CHECK-SAME-DAG: ![[ACCESS_GROUP_INNER:[0-9]+]]
    CHECK-SAME-DAG: ,
    CHECK-SAME-DAG: ![[ACCESS_GROUP_OUTER:[0-9]+]]

which allows the comma to appear anywhere and which I don't find readable. My (naive?) conclusion is that textual checking is not the right tool here.

> My experience, over a 40-year career, is that good software developers are
> generally not very good test-writers. These are different skills, and good
> testing is frequently not taught. It's easy to write fragile tests: you
> make your patch, you see what comes out, and you write the test to expect
> exactly that output, using the minimum possible features of the testing
> tool. This is poor technique all around. We even have scripts that
> automatically generate such tests, used primarily in codegen tests. I
> devoutly hope that the people who produce those tests responsibly eyeball
> all those cases.
>
> The proposal appears to be to migrate output-based tests (using
> ever-more-complicated FileCheck features) to executable tests, which makes
> them more like the software development people are used to instead of
> test-writing. But I don't see that necessarily solving the problem; it
> seems like it would be just as easy to write a program that doesn't do the
> right thing as to write a FileCheck test that doesn't do the right thing.

IMHO, having a tool that makes it possible to better express what is intended to be tested is already worth a lot. For instance, we usually don't care about SSA value names or MDNode numbers, but we have to put extra work into regex-ing away those names in FileCheck tests, and as a result, most tests we have still expect the exact number for metadata nodes. This is a problem if we want to emit new metadata nodes, in that all those tests need to be updated. This problem goes away if the test method by default ignored value names/MDNode numbers and software developers had to put in extra work if they actually wanted to verify them.

> Hal's suggestion is more to the point: if the output we're generating is
> not appropriate to the kinds of tests we want to perform, it can be
> worthwhile to generate different kinds of output. MIR is a case in point;
> for a long time it was hard to introspect into the interval between IR and
> final machine code, but now it's a lot easier.

Can you elaborate on what makes it easier?

Michael
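The point about ignoring MDNode numbers by default can already be approximated by checking metadata through the API rather than through text. A hedged sketch (findOptionMDForLoop is an existing helper declared in llvm/Analysis/LoopInfo.h; the wrapper function is hypothetical):

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/IR/Metadata.h"
    using namespace llvm;

    // True if the loop's llvm.loop metadata names at least one access group
    // under llvm.loop.parallel_accesses, regardless of how the MDNodes are
    // numbered or ordered in the textual IR.
    static bool hasParallelAccesses(const Loop *L) {
      MDNode *MD = findOptionMDForLoop(L, "llvm.loop.parallel_accesses");
      // Operand 0 is the property name string; the rest are access groups.
      return MD && MD->getNumOperands() > 1;
    }

A check written this way cannot be broken by renumbering or reordering metadata nodes, which is the failure mode shown earlier in the thread.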
Robinson, Paul via llvm-dev
2020-Jul-01 17:24 UTC
[llvm-dev] [RFC] Compiled regression tests.
What I actually meant re. CHECK-DAG is to take this

    72 ; CHECK: ![[ACCESS_GROUP_LIST_3]] = !{![[ACCESS_GROUP_INNER:[0-9]+]], ![[ACCESS_GROUP_OUTER:[0-9]+]]}
    73 ; CHECK: ![[ACCESS_GROUP_INNER]] = distinct !{}
    74 ; CHECK: ![[ACCESS_GROUP_OUTER]] = distinct !{}

and turn it into this

    72 ; CHECK: ![[ACCESS_GROUP_LIST_3]] = !{![[ACCESS_GROUP_INNER:[0-9]+]], ![[ACCESS_GROUP_OUTER:[0-9]+]]}
    73 ; CHECK-DAG: ![[ACCESS_GROUP_INNER]] = distinct !{}
    74 ; CHECK-DAG: ![[ACCESS_GROUP_OUTER]] = distinct !{}

except that I think I was misled by the "non-semantic" remark about the change. Did you mean that the order of the INNER and OUTER elements (line 72) has been swapped? That sounds semantic, as far as the structure of the metadata is concerned! But okay, let's call that a syntactic change, and a test relying on the order of the parameters will break. Which it did. And the correct fix is instead

    72 ; CHECK: ![[ACCESS_GROUP_LIST_3]] = !{![[ACCESS_GROUP_OUTER:[0-9]+]], ![[ACCESS_GROUP_INNER:[0-9]+]]}

is it not? To reflect the change in order?

But let's say I'm the one making this presumably innocuous change, I have no clue what I'm doing, and I don't know much about how FileCheck works (which is pretty typical of the community, I admit). You've shown issues with trying to diagnose the FileCheck results. How would the compiled regression test fail? How would it be easier to identify and repair the issue?

Re. how MIR makes testing easier: it is a serialization of the data that a machine-IR pass operates on, which makes it feasible to feed canned MIR through a single pass in llc and look at what exactly that pass did (see the example RUN line after this message). Prior to MIR, we had to go from IR to real machine code and infer what was going on in a pass after multiple levels of transformation had occurred. It was very black-box, and the black box was way too big.

Anyway, getting back to the original topic, it sounds like you're proposing two rather independent things: (1) an easier API for writing optimizer unittests, and (2) a way to push unittests into the main llvm/test tree. Can we separate these things?

Thanks,
--paulr
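For readers unfamiliar with that MIR workflow, a hypothetical single-pass RUN line (the choice of pass and triple is illustrative; llc's -run-pass flag is real):

    # RUN: llc -mtriple=x86_64-- -run-pass=machine-cse %s -o - | FileCheck %s

The test input is the serialized MIR from just before the pass under test, so the CHECK lines only have to describe what that one pass changed.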
David Greene via llvm-dev
2020-Jul-01 18:48 UTC
[llvm-dev] [RFC] Compiled regression tests.
"Robinson, Paul via llvm-dev" <llvm-dev at lists.llvm.org> writes:> We even have scripts that automatically generate such tests, used > primarily in codegen tests. I devoutly hope that the people who > produce those tests responsibly eyeball all those cases.I don't think people do because they are so noisy. Solving the problem was the primary motivation of my pile of test update script enhancements.> Hal’s suggestion is more to the point: If the output we’re generating > is not appropriate to the kinds of tests we want to perform, it can be > worthwhile to generate different kinds of output. MIR is a case in > point; for a long time it was hard to introspect into the interval > between IR and final machine code, but now it’s a lot easier.+1. I don't believe there's a silver bullet here. Different tools are appropriate for different kinds of tests. I happen to think Michael's proposal is another useful tool in the box, as is Hal's. -David
David Greene via llvm-dev
2020-Jul-01 18:58 UTC
[llvm-dev] [RFC] Compiled regression tests.
Michael Kruse via llvm-dev <llvm-dev at lists.llvm.org> writes:

> On Wed, Jul 1, 2020 at 10:18, Robinson, Paul <paul.robinson at sony.com> wrote:
>
>> The test as written is fragile because it requires a certain ordering. If
>> the output order is not important, use CHECK-DAG rather than CHECK. This
>> would be a failure to understand the testing tool.
>
> CHECK-DAG does not help here, since what changes is within a list on the
> same line, and we have no CHECK-SAME-DAG or CHECK-DAG-SAME. Even if we had
> it, the actual line that changed is textually the same, and FileCheck
> would need to backtrack deep into the following lines for alternative
> placeholder substitutions. It would look like
>
>     CHECK-SAME-DAG: ![[ACCESS_GROUP_INNER:[0-9]+]]
>     CHECK-SAME-DAG: ,
>     CHECK-SAME-DAG: ![[ACCESS_GROUP_OUTER:[0-9]+]]

Would this not work?

    CHECK-SAME: ![[ACCESS_GROUP_INNER:[0-9]+]]
    CHECK-SAME: ![[ACCESS_GROUP_OUTER:[0-9]+]]

I don't think CHECK-SAME is sensitive to order within the line. This works for me in my metadata tests, but maybe I've just been lucky.

> IMHO, having a tool that makes it possible to better express what is
> intended to be tested is already worth a lot.

Exactly.

> For instance, we usually don't care about SSA value names or MDNode
> numbers, but we have to put extra work into regex-ing away those names in
> FileCheck tests, and as a result, most tests we have still expect the
> exact number for metadata nodes. This is a problem if we want to emit new
> metadata nodes, in that all those tests need to be updated.

IME, many of our tests check way too much. Codegen tests are particularly egregious in this regard because many are (re-)generated and match whole functions of asm. One of the major problems with using FileCheck regexps in generated tests is that it's very difficult to take an update_*_test_checks.py output and whittle it down, because the FileCheck regexps define and use variables which need to be massaged while altering the test. I have some patches that help alleviate this.

> This problem goes away if the test method by default ignored value
> names/MDNode numbers and software developers had to put in extra work if
> they actually wanted to verify them.

If the test method is appropriate for the test, sure. But I suspect we'll have FileCheck tests for a good long while, and also making that experience better is well worth the effort.

-David
On 7/1/20 1:48 PM, David Greene via llvm-dev wrote:

> "Robinson, Paul via llvm-dev" <llvm-dev at lists.llvm.org> writes:
>
>> We even have scripts that automatically generate such tests, used
>> primarily in codegen tests. I devoutly hope that the people who
>> produce those tests responsibly eyeball all those cases.
>
> I don't think people do, because they are so noisy. Solving the problem
> was the primary motivation of my pile of test update script
> enhancements.
>
>> Hal's suggestion is more to the point: If the output we're generating
>> is not appropriate to the kinds of tests we want to perform, it can be
>> worthwhile to generate different kinds of output. MIR is a case in
>> point; for a long time it was hard to introspect into the interval
>> between IR and final machine code, but now it's a lot easier.
>
> +1. I don't believe there's a silver bullet here. Different tools are
> appropriate for different kinds of tests. I happen to think Michael's
> proposal is another useful tool in the box, as is Hal's.
>
> -David

To follow up on this, I also think that Michael's proposal is a potentially useful tool in our testing toolbox. Moreover, his motivation for proposing this points out a real issue that we have with our current set of tests.

Hindsight is 20/20, but the hardship associated with Michael's work updating the loop metadata tests probably would have been best addressed by having a good analysis output and checking that. It seems like a single bit of C++ code could, thus, be shared across many tests.

It seems to me that the use case Michael's proposal addresses is where we have a large number of test cases, each requiring complex verification logic, and logic not easily phrased in terms of checking a generic textual analysis summary. I'm not yet sure this particular use case exists.

His proposal also aims to do something else: make certain kinds of checks less fragile. As he points out, many useful concepts are easily expressed as CHECK-NOT, but because there's no binding between the text string in the check line and the code that might produce that text, it's easy for these to fall out of sync. His proposal, using C++ tests, provides better type checking in this regard. There may be other ways to address this concern (e.g., using FileCheck variables of some kind to positively match string name fragments before their use in CHECK-NOTs).

-Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
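Following up on the "good analysis output" suggestion, a minimal sketch of what such a shared summary printer could look like (the pass itself is hypothetical and its pass-builder registration is omitted; LoopAnalysis, PassInfoMixin, depth_first over loops, and Loop::isAnnotatedParallel are existing APIs):

    #include "llvm/ADT/DepthFirstIterator.h"
    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/PassManager.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    // Hypothetical printer pass: one stable line per loop, independent of
    // MDNode numbering, so tests can FileCheck this summary instead of
    // matching raw metadata nodes.
    struct ParallelLoopPrinterPass : PassInfoMixin<ParallelLoopPrinterPass> {
      PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
        LoopInfo &LI = FAM.getResult<LoopAnalysis>(F);
        for (Loop *TopLevel : LI)
          for (Loop *L : depth_first(TopLevel))
            outs() << F.getName() << ": loop " << L->getHeader()->getName()
                   << " depth=" << L->getLoopDepth() << " parallel="
                   << (L->isAnnotatedParallel() ? "yes" : "no") << "\n";
        return PreservedAnalyses::all();
      }
    };

With something like this, the metadata-merge test above would reduce to two CHECK lines over "parallel=yes", and the C++ verification logic would indeed be a single bit of code shared across many tests.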