Mehdi AMINI via llvm-dev
2019-Oct-09 19:38 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
On Wed, Oct 9, 2019 at 8:12 AM David Greene <dag at cray.com> wrote:

> Mehdi AMINI via cfe-dev <cfe-dev at lists.llvm.org> writes:
>
>>> I have a bit of concern about this sort of thing - worrying it'll lead to
>>> people being less cautious about writing the more isolated tests.
>>
>> I have the same concern. I really believe we need to be careful about
>> testing at the right granularity to keep things both modular and the
>> testing maintainable (for instance, checking vectorized ASM from a C++
>> source through clang has always been considered a bad FileCheck practice).
>> (Not saying that there is no space for better integration testing in some
>> areas.)
>
> I absolutely disagree about vectorization tests.  We have seen
> vectorization loss in clang even though related LLVM lit tests pass,
> because something else in the clang pipeline changed that caused the
> vectorizer to not do its job.

Of course, and as I mentioned I tried to add these tests (probably 4 or 5
years ago), but someone (I think Chandler?) was asking me at the time: does
it affect benchmark performance? If so, why isn't it tracked there? And if
not, does it matter?
The benchmark was presented as the actual way to check this invariant
(because you're only vectorizing to get performance, not for the sake of it).
So I never pursued it, even if I'm a bit puzzled that we don't have such tests.

> We need both kinds of tests.  There are many asm tests of value beyond
> vectorization, and they should include component as well as end-to-end
> tests.
>
>> For instance, I remember asking about implementing tests based on checking
>> whether some loops written in a C source file were properly vectorized by
>> the -O2 / -O3 pipeline, and it was deemed the kind of test that we don't
>> want to maintain: instead I was pointed at the test-suite to add better
>> benchmarks there for the end-to-end story. What is interesting is that the
>> test-suite is not gonna be part of the monorepo!
>
> And it shouldn't be.  It's much too big.  But there is a place for small
> end-to-end tests that live alongside the code.
>
>>>> We could, for example, create a top-level "test" directory and put
>>>> end-to-end tests there.  Some of the things that could be tested include:
>>>>
>>>> - Pipeline execution (debug-pass=Executions)
>>>> - Optimization warnings/messages
>>>> - Specific asm code sequences out of clang (e.g. ensure certain loops
>>>>   are vectorized)
>>>> - Pragma effects (e.g. ensure loop optimizations are honored)
>>>> - Complete end-to-end PGO (generate a profile and re-compile)
>>>> - GPU/accelerator offloading
>>>> - Debuggability of clang-generated code
>>>>
>>>> Each of these things is tested to some degree within their own
>>>> subprojects, but AFAIK there are currently no dedicated tests ensuring
>>>> such things work through the entire clang pipeline flow and with other
>>>> tools that make use of the results (debuggers, etc.).  It is relatively
>>>> easy to break the pipeline while the individual subproject tests
>>>> continue to pass.
>>
>> I'm not sure I really see much in your list that isn't purely about
>> testing clang itself here?
>
> Debugging and PGO involve other components, no?

I don't think you need anything other than LLVM core (which is a dependency
of clang) itself?
Things like PGO (unless you're using frontend instrumentation) don't even
have anything to do with clang, so we may get into the situation David
mentioned where we would rely on clang to test LLVM features, which I find
undesirable.

> If we want to put clang end-to-end tests in the clang subdirectory, that's
> fine with me.  But we need a place for tests that cut across components.
>
> I could also imagine llvm-mca end-to-end tests through clang.
>
>> Actually the first one seems more of a pure LLVM test.
>
> Definitely not.  It would test the pipeline as constructed by clang,
> which is very different from the default pipeline constructed by opt/llc.

I am not convinced it is "very" different (they both use the
PassManagerBuilder AFAIK); I am only aware of very subtle differences.
But more fundamentally: *should* they be different? I would want `opt -O3`
to be able to reproduce the clang pipeline 1-1.
Isn't it the role of LLVM's PassManagerBuilder to expose what the "-O3"
pipeline is?
If we see the PassManagerBuilder as the abstraction for the pipeline, then I
don't see what testing belongs to clang here; this seems like a layering
violation (and when maintaining the PassManagerBuilder in LLVM, I wouldn't
want to have to update the tests of all the subprojects using it, because
they retest the same feature).

> The old and new pass managers also construct different pipelines.  As we
> have seen with various mailing list messages, this is surprising to users.
> Best to document and check it with testing.

Yes: both old and new pass managers are LLVM components, so hopefully they
are documented and tested in LLVM :)

--
Mehdi
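For concreteness, the kind of end-to-end vectorization check being debated
above would look roughly like the lit/FileCheck sketch below. The file name,
RUN line, target triple, and CHECK pattern are illustrative assumptions
rather than an existing test; the point is only that the check goes from C++
source through the clang driver all the way to asm.

    // e2e-vectorize.cpp -- hypothetical end-to-end test that would live in a
    // top-level test/ directory; it checks that the pipeline as built by the
    // clang driver at -O3 still vectorizes a trivially vectorizable loop.
    //
    // RUN: %clang -O3 -target x86_64-unknown-linux-gnu -S -o - %s | FileCheck %s

    void axpy(float *__restrict y, const float *x, float a, int n) {
      for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
    }

    // Deliberately loose check: any packed single-precision multiply counts,
    // so the test documents *that* we vectorize, not the exact schedule.
    // CHECK: {{v?mulps}}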
David Greene via llvm-dev
2019-Oct-09 21:31 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> I absolutely disagree about vectorization tests.  We have seen
>> vectorization loss in clang even though related LLVM lit tests pass,
>> because something else in the clang pipeline changed that caused the
>> vectorizer to not do its job.
>
> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
> years ago), but someone (I think Chandler?) was asking me at the time: does
> it affect benchmark performance? If so, why isn't it tracked there? And if
> not, does it matter?
> The benchmark was presented as the actual way to check this invariant
> (because you're only vectorizing to get performance, not for the sake of it).
> So I never pursued it, even if I'm a bit puzzled that we don't have such tests.

Thanks for explaining.

Our experience is that relying solely on performance tests to uncover
such issues is problematic for several reasons:

- Performance varies from implementation to implementation.  It is
  difficult to keep tests up-to-date for all possible targets and
  subtargets.

- Partially as a result, but also for other reasons, performance tests
  tend to be complicated, either in code size or in the numerous code
  paths tested.  This makes such tests hard to debug when there is a
  regression.

- Performance tests don't focus on the why/how of vectorization.  They
  just check, "did it run fast enough?"  Maybe the test ran fast enough
  for some other reason but we still lost desired vectorization and
  could have run even faster.

With a small asm test one can document why vectorization is desired and
how it comes about right in the test.  Then when it breaks it's usually
pretty obvious what the problem is.

They don't replace performance tests; they complement each other, just
as end-to-end and component tests complement each other.

>> Debugging and PGO involve other components, no?
>
> I don't think you need anything other than LLVM core (which is a
> dependency of clang) itself?

What about testing that what clang produces is debuggable with lldb?
debuginfo tests do that now but AFAIK they are not end-to-end.

> Things like PGO (unless you're using frontend instrumentation) don't even
> have anything to do with clang, so we may get into the situation David
> mentioned where we would rely on clang to test LLVM features, which I find
> undesirable.

We would still expect component-level tests.  This would be additional
end-to-end testing, not replacing existing testing methodology.  I agree
the concern is valid, but shouldn't code review discover such problems?

>>> Actually the first one seems more of a pure LLVM test.
>>
>> Definitely not.  It would test the pipeline as constructed by clang,
>> which is very different from the default pipeline constructed by
>> opt/llc.
>
> I am not convinced it is "very" different (they both use the
> PassManagerBuilder AFAIK); I am only aware of very subtle differences.

I don't think clang exclusively uses the PassManagerBuilder, but it's been
a while since I looked at that code.

> But more fundamentally: *should* they be different? I would want `opt -O3`
> to be able to reproduce the clang pipeline 1-1.

Which clang pipeline?  -O3?  -Ofast?  opt currently can't do -Ofast.  I
don't *think* -Ofast affects the pipeline itself but I am not 100%
certain.

> Isn't it the role of LLVM's PassManagerBuilder to expose what the "-O3"
> pipeline is?

Ideally, yes.  In practice, it's not.

> If we see the PassManagerBuilder as the abstraction for the pipeline, then
> I don't see what testing belongs to clang here; this seems like a layering
> violation (and when maintaining the PassManagerBuilder in LLVM, I wouldn't
> want to have to update the tests of all the subprojects using it, because
> they retest the same feature).

If nothing else, end-to-end testing of the pipeline would uncover
layering violations. :)

>> The old and new pass managers also construct different pipelines.  As we
>> have seen with various mailing list messages, this is surprising to
>> users.  Best to document and check it with testing.
>
> Yes: both old and new pass managers are LLVM components, so hopefully they
> are documented and tested in LLVM :)

But we have nothing to guarantee that what clang does matches what opt
does.  Currently they do different things.

-David
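To make the "clang pipeline vs. opt pipeline" point concrete, a
cross-component test could dump the pass structure that the clang driver
actually builds and check it for the same anchor passes that
llvm/test/Other/opt-O3-pipeline.ll checks for `opt -O3`. A rough sketch,
with the caveat that the CHECK lines below are only example legacy
pass-manager pass names and not a maintained list:

    // clang-o3-pipeline.cpp -- hypothetical test pinning down the -O3 pass
    // structure as constructed by the clang driver (legacy pass manager).
    //
    // RUN: %clang -O3 -mllvm -debug-pass=Structure -S -o /dev/null %s 2>&1 \
    // RUN:   | FileCheck %s
    //
    // A couple of anchor passes, in pipeline order; a real test would mirror
    // the full list used by llvm/test/Other/opt-O3-pipeline.ll, so that any
    // divergence between the two pipelines shows up as a test failure.
    // CHECK: Loop Vectorization
    // CHECK: SLP Vectorizer

    int foo(int x) { return x + 1; }  // any translation unit will do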
Mehdi AMINI via llvm-dev
2019-Oct-09 23:43 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
On Wed, Oct 9, 2019 at 2:31 PM David Greene <dag at cray.com> wrote:

> Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> I absolutely disagree about vectorization tests.  We have seen
>>> vectorization loss in clang even though related LLVM lit tests pass,
>>> because something else in the clang pipeline changed that caused the
>>> vectorizer to not do its job.
>>
>> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
>> years ago), but someone (I think Chandler?) was asking me at the time:
>> does it affect benchmark performance? If so, why isn't it tracked there?
>> And if not, does it matter?
>> The benchmark was presented as the actual way to check this invariant
>> (because you're only vectorizing to get performance, not for the sake of
>> it). So I never pursued it, even if I'm a bit puzzled that we don't have
>> such tests.
>
> Thanks for explaining.
>
> Our experience is that relying solely on performance tests to uncover
> such issues is problematic for several reasons:
>
> - Performance varies from implementation to implementation.  It is
>   difficult to keep tests up-to-date for all possible targets and
>   subtargets.
>
> - Partially as a result, but also for other reasons, performance tests
>   tend to be complicated, either in code size or in the numerous code
>   paths tested.  This makes such tests hard to debug when there is a
>   regression.
>
> - Performance tests don't focus on the why/how of vectorization.  They
>   just check, "did it run fast enough?"  Maybe the test ran fast enough
>   for some other reason but we still lost desired vectorization and
>   could have run even faster.
>
> With a small asm test one can document why vectorization is desired and
> how it comes about right in the test.  Then when it breaks it's usually
> pretty obvious what the problem is.
>
> They don't replace performance tests; they complement each other, just
> as end-to-end and component tests complement each other.
>
>>> Debugging and PGO involve other components, no?
>>
>> I don't think you need anything other than LLVM core (which is a
>> dependency of clang) itself?
>
> What about testing that what clang produces is debuggable with lldb?
> debuginfo tests do that now but AFAIK they are not end-to-end.
>
>> Things like PGO (unless you're using frontend instrumentation) don't even
>> have anything to do with clang, so we may get into the situation David
>> mentioned where we would rely on clang to test LLVM features, which I
>> find undesirable.
>
> We would still expect component-level tests.  This would be additional
> end-to-end testing, not replacing existing testing methodology.  I agree
> the concern is valid, but shouldn't code review discover such problems?

Yes, I agree: this concern is not a blocker for doing end-to-end testing,
but more of a "we will need to be careful about scoping the role of
end-to-end testing versus component-level testing".

>>>> Actually the first one seems more of a pure LLVM test.
>>>
>>> Definitely not.  It would test the pipeline as constructed by clang,
>>> which is very different from the default pipeline constructed by
>>> opt/llc.
>>
>> I am not convinced it is "very" different (they both use the
>> PassManagerBuilder AFAIK); I am only aware of very subtle differences.
>
> I don't think clang exclusively uses the PassManagerBuilder, but it's been
> a while since I looked at that code.

Here is the code:
https://github.com/llvm/llvm-project/blob/master/clang/lib/CodeGen/BackendUtil.cpp#L545

All the extension points where passes are hooked in are likely places where
the pipeline would differ from LLVM's default.

>> But more fundamentally: *should* they be different? I would want `opt -O3`
>> to be able to reproduce the clang pipeline 1-1.
>
> Which clang pipeline?  -O3?  -Ofast?  opt currently can't do -Ofast.  I
> don't *think* -Ofast affects the pipeline itself but I am not 100%
> certain.

If -Ofast affects the pipeline, I would expose it on the PassManagerBuilder,
I think.

>> Isn't it the role of LLVM's PassManagerBuilder to expose what the "-O3"
>> pipeline is?
>
> Ideally, yes.  In practice, it's not.
>
>> If we see the PassManagerBuilder as the abstraction for the pipeline, then
>> I don't see what testing belongs to clang here; this seems like a layering
>> violation (and when maintaining the PassManagerBuilder in LLVM, I wouldn't
>> want to have to update the tests of all the subprojects using it, because
>> they retest the same feature).
>
> If nothing else, end-to-end testing of the pipeline would uncover
> layering violations. :)
>
>>> The old and new pass managers also construct different pipelines.  As we
>>> have seen with various mailing list messages, this is surprising to
>>> users.  Best to document and check it with testing.
>>
>> Yes: both old and new pass managers are LLVM components, so hopefully they
>> are documented and tested in LLVM :)
>
> But we have nothing to guarantee that what clang does matches what opt
> does.  Currently they do different things.

My point is that this should be guaranteed by refactoring and using the
right APIs, not by duplicating the testing. But I may also be
misunderstanding what it is that you would test on the clang side.
For instance, I wouldn't want to duplicate testing of the O3 pass pipeline,
which is covered here:
https://github.com/llvm/llvm-project/blob/master/llvm/test/Other/opt-O3-pipeline.ll
But testing that a specific pass is added with respect to a particular clang
option is fair, and actually this is *already* what we do, I believe, like
here:
https://github.com/llvm/llvm-project/blob/master/clang/test/CodeGen/thinlto-debug-pm.c#L11-L14

I don't think these particular tests are the most controversial though; this
is really still fairly "focused" testing. I'm much more curious about a
larger end-to-end scope: for instance, since you mention debug info and
LLDB, what about a test that would verify that LLDB can print the content of
a particular variable, starting from a source program? Such tests are
valuable in the absolute, but it isn't clear to me that we could in practice
block any commit that breaks one: a bug fix or an improvement in one of the
passes may be perfectly correct in isolation, yet make the test fail by
exposing a bug where we were already losing some debug info precision in a
totally unrelated part of the codebase.
I wonder how you see this managed in practice: would you gate any change to
InstCombine (or another mid-level pass) on not regressing any of the
debug-info quality tests for any backend, and from any frontend (not only
clang)? Or worse: a middle-end change that ends up producing a slightly
different DWARF construct for this particular test, which trips up LLDB but
not GDB (basically exposing a bug in LLDB). Should we require the
contributor of the InstCombine change to debug LLDB and fix it first?
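Concretely, the kind of end-to-end debug-info test being discussed might
look something like the sketch below, in the spirit of the debuginfo-tests
project. The %clangxx/%lldb substitutions, the breakpoint strategy, and the
CHECK line are assumptions for illustration; whether the value of `sum`
actually survives at -O2 is exactly the property such a test would pin down,
and exactly what makes it fragile, as noted above.

    // lldb-print-var.cpp -- hypothetical end-to-end debuggability test:
    // build an optimized binary with clang, then check that lldb can still
    // print a local variable at a breakpoint in optimized code.
    //
    // RUN: %clangxx -O2 -g %s -o %t
    // RUN: %lldb -b %t \
    // RUN:   -o 'breakpoint set -p "return s[u]m"' \
    // RUN:   -o run \
    // RUN:   -o 'frame variable sum' \
    // RUN:   | FileCheck %s
    // (the [u] in the regex keeps it from matching this RUN line itself)

    __attribute__((noinline)) int accumulate(const int *v, int n) {
      int sum = 0;
      for (int i = 0; i < n; ++i)
        sum += v[i];
      return sum;  // breakpoint lands here; sum should be 10
    }

    int main(int argc, char **) {
      int v[] = {argc, 2, 3, 4};  // runtime input so nothing constant-folds away
      return accumulate(v, 4) == 10 ? 0 : 1;
    }

    // CHECK: (int) sum = 10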
Best,

--
Mehdi
Florian Hahn via llvm-dev
2019-Oct-10 09:34 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
Hi David,

Thanks for kicking off a discussion on this topic!

> On Oct 9, 2019, at 22:31, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> I absolutely disagree about vectorization tests.  We have seen
>>> vectorization loss in clang even though related LLVM lit tests pass,
>>> because something else in the clang pipeline changed that caused the
>>> vectorizer to not do its job.
>>
>> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
>> years ago), but someone (I think Chandler?) was asking me at the time:
>> does it affect benchmark performance? If so, why isn't it tracked there?
>> And if not, does it matter?
>> The benchmark was presented as the actual way to check this invariant
>> (because you're only vectorizing to get performance, not for the sake of
>> it). So I never pursued it, even if I'm a bit puzzled that we don't have
>> such tests.
>
> Thanks for explaining.
>
> Our experience is that relying solely on performance tests to uncover
> such issues is problematic for several reasons:
>
> - Performance varies from implementation to implementation.  It is
>   difficult to keep tests up-to-date for all possible targets and
>   subtargets.

Could you expand a bit more on what you mean here? Are you concerned about
having to run the performance tests on different kinds of hardware? In what
way do the existing benchmarks require keeping up-to-date?

With tests checking ASM, wouldn't we end up with lots of checks for various
targets/subtargets that we need to keep up to date? Just considering AArch64
as an example, people might want to check the ASM for different architecture
versions and different vector extensions, and different vendors might want
to make sure that the ASM on their specific cores does not regress.

> - Partially as a result, but also for other reasons, performance tests
>   tend to be complicated, either in code size or in the numerous code
>   paths tested.  This makes such tests hard to debug when there is a
>   regression.

I am not sure they have to be. Have you considered adding the small test
functions/loops as micro-benchmarks using the existing google benchmark
infrastructure in the test-suite? I think that might be able to address the
points here relatively adequately. The separate micro-benchmarks would be
relatively small, and we should be able to track down regressions in a
similar fashion as if it were a stand-alone file we compile and then analyze
the ASM of. Plus, we can easily run it and verify the performance on actual
hardware.

> - Performance tests don't focus on the why/how of vectorization.  They
>   just check, "did it run fast enough?"  Maybe the test ran fast enough
>   for some other reason but we still lost desired vectorization and
>   could have run even faster.

If you add a new micro-benchmark, you can check that it produces the desired
result when adding it. The runtime tracking should cover cases where we lose
optimizations. I guess if the benchmarks are too big, additional
optimizations in one part could hide lost optimizations somewhere else. But
I would assume this to be relatively unlikely, as long as the benchmarks are
isolated.

Also, checking the assembly for vector code does not guarantee that the
vector code will actually be executed. So, for example, by just checking the
assembly for certain vector instructions, we might miss that we regressed
performance because we messed up the runtime checks guarding the vector
loop.

Cheers,
Florian
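To illustrate the micro-benchmark suggestion, such a kernel could be added
next to the existing google-benchmark-based MicroBenchmarks in the
test-suite. The function names and sizes below are made up for illustration;
the idea is that the loop is small and isolated enough that a lost
vectorization shows up directly in this one timing.

    // Hypothetical micro-benchmark in the style of the test-suite's
    // MicroBenchmarks directory (google benchmark harness).
    #include <benchmark/benchmark.h>
    #include <vector>

    static void BM_axpy(benchmark::State &state) {
      const int n = state.range(0);
      std::vector<float> x(n, 1.0f), y(n, 2.0f);
      const float a = 3.0f;
      for (auto _ : state) {
        for (int i = 0; i < n; ++i)
          y[i] += a * x[i];                  // the loop we expect to vectorize
        benchmark::DoNotOptimize(y.data());  // keep the result observable
        benchmark::ClobberMemory();
      }
      state.SetItemsProcessed(state.iterations() * n);
    }
    BENCHMARK(BM_axpy)->Arg(1 << 10)->Arg(1 << 16);

    BENCHMARK_MAIN();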