Mehdi AMINI via llvm-dev
2019-Oct-09 19:38 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
On Wed, Oct 9, 2019 at 8:12 AM David Greene <dag at cray.com> wrote:

> Mehdi AMINI via cfe-dev <cfe-dev at lists.llvm.org> writes:
>
>>> I have a bit of concern about this sort of thing - worrying it'll lead to
>>> people being less cautious about writing the more isolated tests.
>>
>> I have the same concern. I really believe we need to be careful about
>> testing at the right granularity to keep things both modular and the
>> testing maintainable (for instance, checking vectorized ASM from a C++
>> source through clang has always been considered a bad FileCheck practice).
>> (Not saying that there is no space for better integration testing in some
>> areas.)
>
> I absolutely disagree about vectorization tests.  We have seen
> vectorization loss in clang even though related LLVM lit tests pass,
> because something else in the clang pipeline changed that caused the
> vectorizer to not do its job.

Of course, and as I mentioned I tried to add these tests (probably 4 or 5
years ago), but someone (I think Chandler?) was asking me at the time: does
it affect benchmark performance? If so, why isn't it tracked there? And if
not, does it matter?
The benchmark was presented as the actual way to check this invariant
(because you're only vectorizing to get performance, not for the sake of it).
So I never pursued it, even if I'm a bit puzzled that we don't have such tests.

> We need both kinds of tests.  There are many asm tests of value beyond
> vectorization, and they should include component as well as end-to-end
> tests.
>
>> For instance, I remember asking about implementing tests based on checking
>> whether some loops written in a C source file were properly vectorized by
>> the -O2 / -O3 pipeline, and it was deemed the kind of test that we don't
>> want to maintain: instead I was pointed at the test-suite to add better
>> benchmarks there for the end-to-end story. What is interesting is that the
>> test-suite is not gonna be part of the monorepo!
>
> And it shouldn't be.  It's much too big.  But there is a place for small
> end-to-end tests that live alongside the code.
>
>>>> We could, for example, create a top-level "test" directory and put
>>>> end-to-end tests there.  Some of the things that could be tested include:
>>>>
>>>> - Pipeline execution (debug-pass=Executions)
>>>> - Optimization warnings/messages
>>>> - Specific asm code sequences out of clang (e.g. ensure certain loops
>>>>   are vectorized)
>>>> - Pragma effects (e.g. ensure loop optimizations are honored)
>>>> - Complete end-to-end PGO (generate a profile and re-compile)
>>>> - GPU/accelerator offloading
>>>> - Debuggability of clang-generated code
>>>>
>>>> Each of these things is tested to some degree within their own
>>>> subprojects, but AFAIK there are currently no dedicated tests ensuring
>>>> such things work through the entire clang pipeline flow and with other
>>>> tools that make use of the results (debuggers, etc.).  It is relatively
>>>> easy to break the pipeline while the individual subproject tests
>>>> continue to pass.
>>
>> I'm not sure I really see much in your list that isn't purely about
>> testing clang itself here?
>
> Debugging and PGO involve other components, no?

I don't think you need anything other than LLVM core (which is a dependency
of clang) itself?
Things like PGO (unless you're using frontend instrumentation) don't even
have anything to do with clang, so we may get into the situation David
mentioned where we would rely on clang to test LLVM features, which I find
undesirable.

> If we want to put clang end-to-end tests in the clang subdirectory, that's
> fine with me.  But we need a place for tests that cut across components.
>
> I could also imagine llvm-mca end-to-end tests through clang.
>
>> Actually the first one seems more of a pure LLVM test.
>
> Definitely not.  It would test the pipeline as constructed by clang,
> which is very different from the default pipeline constructed by opt/llc.

I am not convinced it is "very" different (they both use the
PassManagerBuilder AFAIK); I am only aware of very subtle differences.
But more fundamentally: *should* they be different? I would want `opt -O3`
to be able to reproduce the clang pipeline 1-1.
Isn't it the role of LLVM's PassManagerBuilder to expose what the "-O3"
pipeline is?
If we see the PassManagerBuilder as the abstraction for the pipeline, then I
don't see what testing belongs to clang here; this seems like a layering
violation (and when maintaining the PassManagerBuilder in LLVM, I wouldn't
want to have to update the tests of all the subprojects using it, because
they retest the same feature).

> The old and new pass managers also construct different pipelines.  As we
> have seen with various mailing list messages, this is surprising to users.
> Best to document and check it with testing.

Yes: both old and new pass managers are LLVM components, so hopefully they
are documented and tested in LLVM :)

--
Mehdi
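For concreteness, the kind of end-to-end vectorization check being debated
above would look roughly like the lit/FileCheck sketch below. The file name,
RUN line, target triple, and CHECK pattern are illustrative assumptions
rather than an existing test; the point is only that the check goes from C++
source through the clang driver all the way to asm.

    // e2e-vectorize.cpp -- hypothetical end-to-end test that would live in a
    // top-level test/ directory; it checks that the pipeline as built by the
    // clang driver at -O3 still vectorizes a trivially vectorizable loop.
    //
    // RUN: %clang -O3 -target x86_64-unknown-linux-gnu -S -o - %s | FileCheck %s

    void axpy(float *__restrict y, const float *x, float a, int n) {
      for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
    }

    // Deliberately loose check: any packed single-precision multiply counts,
    // so the test documents *that* we vectorize, not the exact schedule.
    // CHECK: {{v?mulps}}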
David Greene via llvm-dev
2019-Oct-09 21:31 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> I absolutely disagree about vectorization tests.  We have seen
>> vectorization loss in clang even though related LLVM lit tests pass,
>> because something else in the clang pipeline changed that caused the
>> vectorizer to not do its job.
>
> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
> years ago), but someone (I think Chandler?) was asking me at the time: does
> it affect benchmark performance? If so, why isn't it tracked there? And if
> not, does it matter?
> The benchmark was presented as the actual way to check this invariant
> (because you're only vectorizing to get performance, not for the sake of it).
> So I never pursued it, even if I'm a bit puzzled that we don't have such tests.

Thanks for explaining.

Our experience is that relying solely on performance tests to uncover
such issues is problematic for several reasons:

- Performance varies from implementation to implementation.  It is
  difficult to keep tests up-to-date for all possible targets and
  subtargets.

- Partially as a result, but also for other reasons, performance tests
  tend to be complicated, either in code size or in the numerous code
  paths tested.  This makes such tests hard to debug when there is a
  regression.

- Performance tests don't focus on the why/how of vectorization.  They
  just check, "did it run fast enough?"  Maybe the test ran fast enough
  for some other reason but we still lost desired vectorization and
  could have run even faster.

With a small asm test one can document why vectorization is desired and
how it comes about right in the test.  Then when it breaks it's usually
pretty obvious what the problem is.

They don't replace performance tests; they complement each other, just
as end-to-end and component tests complement each other.

>> Debugging and PGO involve other components, no?
>
> I don't think you need anything other than LLVM core (which is a
> dependency of clang) itself?

What about testing that what clang produces is debuggable with lldb?
debuginfo tests do that now but AFAIK they are not end-to-end.

> Things like PGO (unless you're using frontend instrumentation) don't even
> have anything to do with clang, so we may get into the situation David
> mentioned where we would rely on clang to test LLVM features, which I find
> undesirable.

We would still expect component-level tests.  This would be additional
end-to-end testing, not replacing existing testing methodology.  I agree
the concern is valid, but shouldn't code review discover such problems?

>>> Actually the first one seems more of a pure LLVM test.
>>
>> Definitely not.  It would test the pipeline as constructed by clang,
>> which is very different from the default pipeline constructed by
>> opt/llc.
>
> I am not convinced it is "very" different (they both use the
> PassManagerBuilder AFAIK); I am only aware of very subtle differences.

I don't think clang exclusively uses the PassManagerBuilder, but it's been
a while since I looked at that code.

> But more fundamentally: *should* they be different? I would want `opt -O3`
> to be able to reproduce the clang pipeline 1-1.

Which clang pipeline?  -O3?  -Ofast?  opt currently can't do -Ofast.  I
don't *think* -Ofast affects the pipeline itself but I am not 100%
certain.

> Isn't it the role of LLVM's PassManagerBuilder to expose what the "-O3"
> pipeline is?

Ideally, yes.  In practice, it's not.

> If we see the PassManagerBuilder as the abstraction for the pipeline, then
> I don't see what testing belongs to clang here; this seems like a layering
> violation (and when maintaining the PassManagerBuilder in LLVM, I wouldn't
> want to have to update the tests of all the subprojects using it, because
> they retest the same feature).

If nothing else, end-to-end testing of the pipeline would uncover
layering violations. :)

>> The old and new pass managers also construct different pipelines.  As we
>> have seen with various mailing list messages, this is surprising to
>> users.  Best to document and check it with testing.
>
> Yes: both old and new pass managers are LLVM components, so hopefully they
> are documented and tested in LLVM :)

But we have nothing to guarantee that what clang does matches what opt
does.  Currently they do different things.

-David
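To make the "clang pipeline vs. opt pipeline" point concrete, a
cross-component test could dump the pass structure that the clang driver
actually builds and check it for the same anchor passes that
llvm/test/Other/opt-O3-pipeline.ll checks for `opt -O3`. A rough sketch,
with the caveat that the CHECK lines below are only example legacy
pass-manager pass names and not a maintained list:

    // clang-o3-pipeline.cpp -- hypothetical test pinning down the -O3 pass
    // structure as constructed by the clang driver (legacy pass manager).
    //
    // RUN: %clang -O3 -mllvm -debug-pass=Structure -S -o /dev/null %s 2>&1 \
    // RUN:   | FileCheck %s
    //
    // A couple of anchor passes, in pipeline order; a real test would mirror
    // the full list used by llvm/test/Other/opt-O3-pipeline.ll, so that any
    // divergence between the two pipelines shows up as a test failure.
    // CHECK: Loop Vectorization
    // CHECK: SLP Vectorizer

    int foo(int x) { return x + 1; }  // any translation unit will do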
Mehdi AMINI via llvm-dev
2019-Oct-09 23:43 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
On Wed, Oct 9, 2019 at 2:31 PM David Greene <dag at cray.com> wrote:

> Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> I absolutely disagree about vectorization tests.  We have seen
>>> vectorization loss in clang even though related LLVM lit tests pass,
>>> because something else in the clang pipeline changed that caused the
>>> vectorizer to not do its job.
>>
>> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
>> years ago), but someone (I think Chandler?) was asking me at the time:
>> does it affect benchmark performance? If so, why isn't it tracked there?
>> And if not, does it matter?
>> The benchmark was presented as the actual way to check this invariant
>> (because you're only vectorizing to get performance, not for the sake of
>> it). So I never pursued it, even if I'm a bit puzzled that we don't have
>> such tests.
>
> Thanks for explaining.
>
> Our experience is that relying solely on performance tests to uncover
> such issues is problematic for several reasons:
>
> - Performance varies from implementation to implementation.  It is
>   difficult to keep tests up-to-date for all possible targets and
>   subtargets.
>
> - Partially as a result, but also for other reasons, performance tests
>   tend to be complicated, either in code size or in the numerous code
>   paths tested.  This makes such tests hard to debug when there is a
>   regression.
>
> - Performance tests don't focus on the why/how of vectorization.  They
>   just check, "did it run fast enough?"  Maybe the test ran fast enough
>   for some other reason but we still lost desired vectorization and
>   could have run even faster.
>
> With a small asm test one can document why vectorization is desired and
> how it comes about right in the test.  Then when it breaks it's usually
> pretty obvious what the problem is.
>
> They don't replace performance tests; they complement each other, just
> as end-to-end and component tests complement each other.
>
>>> Debugging and PGO involve other components, no?
>>
>> I don't think you need anything other than LLVM core (which is a
>> dependency of clang) itself?
>
> What about testing that what clang produces is debuggable with lldb?
> debuginfo tests do that now but AFAIK they are not end-to-end.
>
>> Things like PGO (unless you're using frontend instrumentation) don't even
>> have anything to do with clang, so we may get into the situation David
>> mentioned where we would rely on clang to test LLVM features, which I
>> find undesirable.
>
> We would still expect component-level tests.  This would be additional
> end-to-end testing, not replacing existing testing methodology.  I agree
> the concern is valid, but shouldn't code review discover such problems?

Yes, I agree: this concern is not a blocker for doing end-to-end testing,
but more of a "we will need to be careful about scoping the role of
end-to-end testing versus component-level testing".

>>>> Actually the first one seems more of a pure LLVM test.
>>>
>>> Definitely not.  It would test the pipeline as constructed by clang,
>>> which is very different from the default pipeline constructed by
>>> opt/llc.
>>
>> I am not convinced it is "very" different (they both use the
>> PassManagerBuilder AFAIK); I am only aware of very subtle differences.
>
> I don't think clang exclusively uses the PassManagerBuilder, but it's been
> a while since I looked at that code.

Here is the code:
https://github.com/llvm/llvm-project/blob/master/clang/lib/CodeGen/BackendUtil.cpp#L545

All the extension points where passes are hooked in are likely places where
the pipeline would differ from LLVM's default.

>> But more fundamentally: *should* they be different? I would want `opt -O3`
>> to be able to reproduce the clang pipeline 1-1.
>
> Which clang pipeline?  -O3?  -Ofast?  opt currently can't do -Ofast.  I
> don't *think* -Ofast affects the pipeline itself but I am not 100%
> certain.

If -Ofast affects the pipeline, I would expose it on the PassManagerBuilder,
I think.

>> Isn't it the role of LLVM's PassManagerBuilder to expose what the "-O3"
>> pipeline is?
>
> Ideally, yes.  In practice, it's not.
>
>> If we see the PassManagerBuilder as the abstraction for the pipeline, then
>> I don't see what testing belongs to clang here; this seems like a layering
>> violation (and when maintaining the PassManagerBuilder in LLVM, I wouldn't
>> want to have to update the tests of all the subprojects using it, because
>> they retest the same feature).
>
> If nothing else, end-to-end testing of the pipeline would uncover
> layering violations. :)
>
>>> The old and new pass managers also construct different pipelines.  As we
>>> have seen with various mailing list messages, this is surprising to
>>> users.  Best to document and check it with testing.
>>
>> Yes: both old and new pass managers are LLVM components, so hopefully they
>> are documented and tested in LLVM :)
>
> But we have nothing to guarantee that what clang does matches what opt
> does.  Currently they do different things.

My point is that this should be guaranteed by refactoring and using the
right APIs, not by duplicating the testing. But I may also be
misunderstanding what it is that you would test on the clang side.
For instance, I wouldn't want to duplicate testing of the O3 pass pipeline,
which is covered here:
https://github.com/llvm/llvm-project/blob/master/llvm/test/Other/opt-O3-pipeline.ll
But testing that a specific pass is added with respect to a particular clang
option is fair, and actually this is *already* what we do, I believe, like
here:
https://github.com/llvm/llvm-project/blob/master/clang/test/CodeGen/thinlto-debug-pm.c#L11-L14

I don't think these particular tests are the most controversial though; this
is really still fairly "focused" testing. I'm much more curious about a
larger end-to-end scope: for instance, since you mention debug info and
LLDB, what about a test that would verify that LLDB can print the content of
a particular variable, starting from a source program? Such tests are
valuable in the absolute, but it isn't clear to me that we could in practice
block any commit that breaks one: a bug fix or an improvement in one of the
passes may be perfectly correct in isolation, yet make the test fail by
exposing a bug where we were already losing some debug info precision in a
totally unrelated part of the codebase.
I wonder how you see this managed in practice: would you gate any change to
InstCombine (or another mid-level pass) on not regressing any of the
debug-info quality tests for any backend, and from any frontend (not only
clang)? Or worse: a middle-end change that ends up producing a slightly
different DWARF construct for this particular test, which trips up LLDB but
not GDB (basically exposing a bug in LLDB). Should we require the
contributor of the InstCombine change to debug LLDB and fix it first?
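Concretely, the kind of end-to-end debug-info test being discussed might
look something like the sketch below, in the spirit of the debuginfo-tests
project. The %clangxx/%lldb substitutions, the breakpoint strategy, and the
CHECK line are assumptions for illustration; whether the value of `sum`
actually survives at -O2 is exactly the property such a test would pin down,
and exactly what makes it fragile, as noted above.

    // lldb-print-var.cpp -- hypothetical end-to-end debuggability test:
    // build an optimized binary with clang, then check that lldb can still
    // print a local variable at a breakpoint in optimized code.
    //
    // RUN: %clangxx -O2 -g %s -o %t
    // RUN: %lldb -b %t \
    // RUN:   -o 'breakpoint set -p "return s[u]m"' \
    // RUN:   -o run \
    // RUN:   -o 'frame variable sum' \
    // RUN:   | FileCheck %s
    // (the [u] in the regex keeps it from matching this RUN line itself)

    __attribute__((noinline)) int accumulate(const int *v, int n) {
      int sum = 0;
      for (int i = 0; i < n; ++i)
        sum += v[i];
      return sum;  // breakpoint lands here; sum should be 10
    }

    int main(int argc, char **) {
      int v[] = {argc, 2, 3, 4};  // runtime input so nothing constant-folds away
      return accumulate(v, 4) == 10 ? 0 : 1;
    }

    // CHECK: (int) sum = 10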
Best,

--
Mehdi
Florian Hahn via llvm-dev
2019-Oct-10 09:34 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
Hi David,

Thanks for kicking off a discussion on this topic!

> On Oct 9, 2019, at 22:31, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> I absolutely disagree about vectorization tests.  We have seen
>>> vectorization loss in clang even though related LLVM lit tests pass,
>>> because something else in the clang pipeline changed that caused the
>>> vectorizer to not do its job.
>>
>> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
>> years ago), but someone (I think Chandler?) was asking me at the time:
>> does it affect benchmark performance? If so, why isn't it tracked there?
>> And if not, does it matter?
>> The benchmark was presented as the actual way to check this invariant
>> (because you're only vectorizing to get performance, not for the sake of
>> it). So I never pursued it, even if I'm a bit puzzled that we don't have
>> such tests.
>
> Thanks for explaining.
>
> Our experience is that relying solely on performance tests to uncover
> such issues is problematic for several reasons:
>
> - Performance varies from implementation to implementation.  It is
>   difficult to keep tests up-to-date for all possible targets and
>   subtargets.

Could you expand a bit more on what you mean here? Are you concerned about
having to run the performance tests on different kinds of hardware? In what
way do the existing benchmarks require keeping up-to-date?

With tests checking ASM, wouldn't we end up with lots of checks for various
targets/subtargets that we need to keep up to date? Just considering AArch64
as an example, people might want to check the ASM for different architecture
versions and different vector extensions, and different vendors might want
to make sure that the ASM on their specific cores does not regress.

> - Partially as a result, but also for other reasons, performance tests
>   tend to be complicated, either in code size or in the numerous code
>   paths tested.  This makes such tests hard to debug when there is a
>   regression.

I am not sure they have to be. Have you considered adding the small test
functions/loops as micro-benchmarks using the existing google benchmark
infrastructure in the test-suite? I think that might be able to address the
points here relatively adequately. The separate micro-benchmarks would be
relatively small, and we should be able to track down regressions in a
similar fashion as if it were a stand-alone file we compile and then analyze
the ASM of. Plus, we can easily run it and verify the performance on actual
hardware.

> - Performance tests don't focus on the why/how of vectorization.  They
>   just check, "did it run fast enough?"  Maybe the test ran fast enough
>   for some other reason but we still lost desired vectorization and
>   could have run even faster.

If you add a new micro-benchmark, you can check that it produces the desired
result when adding it. The runtime tracking should cover cases where we lose
optimizations. I guess if the benchmarks are too big, additional
optimizations in one part could hide lost optimizations somewhere else. But
I would assume this to be relatively unlikely, as long as the benchmarks are
isolated.

Also, checking the assembly for vector code does not guarantee that the
vector code will actually be executed. So, for example, by just checking the
assembly for certain vector instructions, we might miss that we regressed
performance because we messed up the runtime checks guarding the vector
loop.

Cheers,
Florian
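To illustrate the micro-benchmark suggestion, such a kernel could be added
next to the existing google-benchmark-based MicroBenchmarks in the
test-suite. The function names and sizes below are made up for illustration;
the idea is that the loop is small and isolated enough that a lost
vectorization shows up directly in this one timing.

    // Hypothetical micro-benchmark in the style of the test-suite's
    // MicroBenchmarks directory (google benchmark harness).
    #include <benchmark/benchmark.h>
    #include <vector>

    static void BM_axpy(benchmark::State &state) {
      const int n = state.range(0);
      std::vector<float> x(n, 1.0f), y(n, 2.0f);
      const float a = 3.0f;
      for (auto _ : state) {
        for (int i = 0; i < n; ++i)
          y[i] += a * x[i];                  // the loop we expect to vectorize
        benchmark::DoNotOptimize(y.data());  // keep the result observable
        benchmark::ClobberMemory();
      }
      state.SetItemsProcessed(state.iterations() * n);
    }
    BENCHMARK(BM_axpy)->Arg(1 << 10)->Arg(1 << 16);

    BENCHMARK_MAIN();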