David Greene via llvm-dev
2019-Oct-10 01:25 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
Philip Reames via cfe-dev <cfe-dev at lists.llvm.org> writes:

> A challenge we already have - as in, I've broken these tests and had to
> fix them - is that an end to end test which checks either IR or assembly
> ends up being extraordinarily fragile. Completely unrelated profitable
> transforms create small differences which cause spurious test failures.
> This is a very real issue today with the few end-to-end clang tests we
> have, and I am extremely hesitant to expand those tests without giving
> this workflow problem serious thought. If we don't, this could bring
> development on middle end transforms to a complete stop. (Not kidding.)

Do you have a pointer to these tests? We literally have tens of thousands
of end-to-end tests downstream and while some are fragile, the vast
majority are not. A test that, for example, checks the entire generated
asm for a match is indeed very fragile. A test that checks whether a
specific instruction/mnemonic was emitted is generally not, at least in
my experience. End-to-end tests require some care in construction. I
don't think update_llc_test_checks.py-type operation is desirable.

Still, you raise a valid point and I think present some good options
below.

> A couple of approaches we could consider:
>
> 1. Simply restrict end to end tests to crash/assert cases. (i.e. no
>    property of the generated code is checked, other than that it is
>    generated) This isn't as restrictive as it sounds when combined
>    w/coverage guided fuzzer corpuses.

I would be pretty hesitant to do this but I'd like to hear more about how
you see this working with coverage/fuzzing.

> 2. Auto-update all diffs, but report them to a human user for
>    inspection. This ends up meaning that tests never "fail" per se,
>    but that individuals who have expressed interest in particular tests
>    get an automated notification and a chance to respond on list with a
>    reduced example.

That's certainly workable.

> 3. As a variant on the former, don't auto-update tests, but only inform
>    the *contributor* of an end-to-end test of a failure. Responsibility
>    for determining failure vs false positive lies solely with them, and
>    normal channels are used to report a failure after it has been
>    confirmed/analyzed/explained.

I think I like this best of the three but it raises the question of what
happens when the contributor is no longer contributing. Who's responsible
for the test? Maybe it just sits there until someone else claims it.

> I really think this is a problem we need to have thought through and
> found a workable solution before end-to-end testing as proposed becomes
> a practically workable option.

Noted. I'm very happy to have this discussion and work the problem.

-David
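As an illustration of the "check one mnemonic" style of end-to-end test
Greene describes above, a lit test along the following lines only pins down
the single instruction that matters rather than the full asm; this is a
hypothetical sketch, and the function, FMA-related flags, and target triple
are illustrative assumptions, not an existing LLVM test.

    // RUN: %clang -O2 -mfma -ffp-contract=fast --target=x86_64-unknown-linux-gnu \
    // RUN:   -S %s -o - | FileCheck %s

    // CHECK-LABEL: fma_test:
    // CHECK: vfmadd
    // CHECK-NOT: mulsd
    double fma_test(double a, double b, double c) {
      return a * b + c;  // expect a fused multiply-add, not a separate mul and add
    }

Because the test only requires that some vfmadd form appears, unrelated
scheduling or register-allocation changes do not break it.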
Robinson, Paul via llvm-dev
2019-Oct-10 15:46 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
David Greene, will you be at the LLVM Dev Meeting? If so, could you sign
up for a Round Table session on this topic? Obviously lots to discuss and
concerns to be addressed.

In particular I think there are two broad categories of tests that would
have to be segregated just by the nature of their requirements:

(1) Executable tests. These obviously require an execution platform; for
feasibility reasons this means host==target and the guarantee of having a
linker (possibly but not necessarily LLD) and a runtime (possibly but not
necessarily including libcxx). Note that the LLDB tests and the
debuginfo-tests project already have this kind of dependency, and in the
case of debuginfo-tests, this is exactly why it's a separate project.

(2) Non-executable tests. These are near-identical in character to the
existing clang/llvm test suites and I'd expect lit to drive them. The
only material difference from the majority(*) of existing clang tests is
that they are free to depend on LLVM features/passes. The only difference
from the majority of existing LLVM tests is that they have [Obj]{C,C++}
as their input source language.

(*) I've encountered clang tests that I feel depend on too much within
LLVM, and it's common for new contributors to provide a C/C++ test that
needs to be converted to a .ll test. Some of them go in anyway.

More comments/notes below.

> -----Original Message-----
> From: lldb-dev <lldb-dev-bounces at lists.llvm.org> On Behalf Of David Greene
> via lldb-dev
> Sent: Wednesday, October 09, 2019 9:25 PM
> To: Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
> cfe-dev at lists.llvm.org; openmp-dev at lists.llvm.org; lldb-dev at lists.llvm.org
> Subject: Re: [lldb-dev] [cfe-dev] [llvm-dev] RFC: End-to-end testing
>
> Philip Reames via cfe-dev <cfe-dev at lists.llvm.org> writes:
>
> > A challenge we already have - as in, I've broken these tests and had to
> > fix them - is that an end to end test which checks either IR or assembly
> > ends up being extraordinarily fragile. Completely unrelated profitable
> > transforms create small differences which cause spurious test failures.
> > This is a very real issue today with the few end-to-end clang tests we
> > have, and I am extremely hesitant to expand those tests without giving
> > this workflow problem serious thought. If we don't, this could bring
> > development on middle end transforms to a complete stop. (Not kidding.)
>
> Do you have a pointer to these tests? We literally have tens of
> thousands of end-to-end tests downstream and while some are fragile, the
> vast majority are not. A test that, for example, checks the entire
> generated asm for a match is indeed very fragile. A test that checks
> whether a specific instruction/mnemonic was emitted is generally not, at
> least in my experience. End-to-end tests require some care in
> construction. I don't think update_llc_test_checks.py-type operation is
> desirable.

Sony likewise has a rather large corpus of end-to-end tests. I expect any
vendor would. When they break, we fix them or report/fix the compiler
bug. It has not been an intolerable burden on us, and I daresay if it
were at all feasible to put these upstream, it would not be an intolerable
burden on the community. (It's not feasible because host!=target and we'd
need to provide test kits to the community and our remote-execution tools.
We'd rather just run them internally.)
Philip, what I'm actually hearing from your statement is along the lines,
"Our end-to-end tests are really fragile, therefore any end-to-end test
will be fragile, and that will be an intolerable burden."

That's an understandable reaction, but I think the community literally
would not tolerate too-fragile tests. Tests that are too fragile will be
made more robust or removed. This has been community practice for a long
time. There's even an entire category of "noisy bots" that certain people
take care of and don't bother the rest of the community. The LLVM Project
as a whole would not tolerate a test suite that "could bring development
... to a complete stop" and I hope we can ease your concerns.

More comments/notes/opinions below.

> Still, you raise a valid point and I think present some good options
> below.
>
> > A couple of approaches we could consider:
> >
> > 1. Simply restrict end to end tests to crash/assert cases. (i.e. no
> >    property of the generated code is checked, other than that it is
> >    generated) This isn't as restrictive as it sounds when combined
> >    w/coverage guided fuzzer corpuses.
>
> I would be pretty hesitant to do this but I'd like to hear more about
> how you see this working with coverage/fuzzing.

I think this is way too restrictive.

> > 2. Auto-update all diffs, but report them to a human user for
> >    inspection. This ends up meaning that tests never "fail" per se,
> >    but that individuals who have expressed interest in particular tests
> >    get an automated notification and a chance to respond on list with a
> >    reduced example.
>
> That's certainly workable.

This is not different in principle from the "noisy bot" category, and if
it's a significant concern, the e2e tests can start out in that category.
Experience will tell us whether they are inherently fragile. I would not
want to auto-update tests.

> > 3. As a variant on the former, don't auto-update tests, but only inform
> >    the *contributor* of an end-to-end test of a failure. Responsibility
> >    for determining failure vs false positive lies solely with them, and
> >    normal channels are used to report a failure after it has been
> >    confirmed/analyzed/explained.
>
> I think I like this best of the three but it raises the question of what
> happens when the contributor is no longer contributing. Who's
> responsible for the test? Maybe it just sits there until someone else
> claims it.

This is *exactly* the "noisy bot" tactic, and bots are supposed to have
owners who are active.

> > I really think this is a problem we need to have thought through and
> > found a workable solution before end-to-end testing as proposed becomes
> > a practically workable option.
>
> Noted. I'm very happy to have this discussion and work the problem.
>
> -David
> _______________________________________________
> lldb-dev mailing list
> lldb-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
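For the executable category (1) that Robinson describes above, a test could
still be lit-driven: compile, link, run, and FileCheck the program's output.
The following is a hypothetical sketch assuming host==target and a working
linker and runtime; %clang and %t are standard lit substitutions, and the
program itself is illustrative.

    // RUN: %clang -O2 %s -o %t
    // RUN: %t | FileCheck %s

    // CHECK: sum = 4950
    #include <stdio.h>

    int main(void) {
      int sum = 0;
      for (int i = 0; i < 100; ++i)  // 0 + 1 + ... + 99 = 4950
        sum += i;
      printf("sum = %d\n", sum);
      return 0;
    }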
David Greene via llvm-dev
2019-Oct-11 14:01 UTC
[llvm-dev] [Openmp-dev] [cfe-dev] RFC: End-to-end testing
"Robinson, Paul via Openmp-dev" <openmp-dev at lists.llvm.org> writes:> David Greene, will you be at the LLVM Dev Meeting? If so, could you sign > up for a Round Table session on this topic? Obviously lots to discuss > and concerns to be addressed.That's a great idea. I will be there. I'm also hoping to help run a routable on complex types so we'll need times that don't conflict. What times work well for folks?> (1) Executable tests. These obviously require an execution platform; for > feasibility reasons this means host==target and the guarantee of having > a linker (possibly but not necessarily LLD) and a runtime (possibly but > not necessarily including libcxx). Note that the LLDB tests and the > debuginfo-tests project already have this kind of dependency, and in the > case of debuginfo-tests, this is exactly why it's a separate project.Ok. I'd like to learn more about debuginfo-tests and how they're set up.> (2) Non-executable tests. These are near-identical in character to the > existing clang/llvm test suites and I'd expect lit to drive them. The > only material difference from the majority(*) of existing clang tests is > that they are free to depend on LLVM features/passes. The only difference > from the majority of existing LLVM tests is that they have [Obj]{C,C++} as > their input source language.Right. These are the kinds of tests I've been thinking about. -David
Philip Reames via llvm-dev
2019-Oct-19 20:54 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
On 10/9/19 6:25 PM, David Greene wrote:

> Philip Reames via cfe-dev <cfe-dev at lists.llvm.org> writes:
>
>> A challenge we already have - as in, I've broken these tests and had to
>> fix them - is that an end to end test which checks either IR or assembly
>> ends up being extraordinarily fragile. Completely unrelated profitable
>> transforms create small differences which cause spurious test failures.
>> This is a very real issue today with the few end-to-end clang tests we
>> have, and I am extremely hesitant to expand those tests without giving
>> this workflow problem serious thought. If we don't, this could bring
>> development on middle end transforms to a complete stop. (Not kidding.)
>
> Do you have a pointer to these tests? We literally have tens of
> thousands of end-to-end tests downstream and while some are fragile, the
> vast majority are not. A test that, for example, checks the entire
> generated asm for a match is indeed very fragile. A test that checks
> whether a specific instruction/mnemonic was emitted is generally not, at
> least in my experience. End-to-end tests require some care in
> construction. I don't think update_llc_test_checks.py-type operation is
> desirable.

The couple I remember off hand were mostly vectorization tests, but it's
been a while, so I might be misremembering.

> Still, you raise a valid point and I think present some good options
> below.
>
>> A couple of approaches we could consider:
>>
>> 1. Simply restrict end to end tests to crash/assert cases. (i.e. no
>>    property of the generated code is checked, other than that it is
>>    generated) This isn't as restrictive as it sounds when combined
>>    w/coverage guided fuzzer corpuses.
>
> I would be pretty hesitant to do this but I'd like to hear more about
> how you see this working with coverage/fuzzing.

We've found end-to-end fuzzing from Java (which guarantees single-threaded
determinism and lack of UB), comparing two implementations, to be
extremely effective at catching regressions. A big chunk of the
regressions are assertion failures. Our ability to detect miscompiles by
comparing the output of two implementations (well, 2 or more for
tie-breaking purposes) has worked extremely well. However, once a problem
is identified, we're stuck manually reducing and reacting, which is a very
major time sink.

The key thing here, in the context of this discussion, is that there are
no IR checks of any form; we just check the end-to-end correctness of the
system and then reduce from there.

>> 2. Auto-update all diffs, but report them to a human user for
>>    inspection. This ends up meaning that tests never "fail" per se,
>>    but that individuals who have expressed interest in particular tests
>>    get an automated notification and a chance to respond on list with a
>>    reduced example.
>
> That's certainly workable.
>
>> 3. As a variant on the former, don't auto-update tests, but only inform
>>    the *contributor* of an end-to-end test of a failure. Responsibility
>>    for determining failure vs false positive lies solely with them, and
>>    normal channels are used to report a failure after it has been
>>    confirmed/analyzed/explained.
>
> I think I like this best of the three but it raises the question of what
> happens when the contributor is no longer contributing. Who's
> responsible for the test? Maybe it just sits there until someone else
> claims it.

I'd argue it should be deleted if no one is willing to actively step up.
It is not in the community's interest to assume unending responsibility
for any third party test suite given the high burden involved here.

>> I really think this is a problem we need to have thought through and
>> found a workable solution before end-to-end testing as proposed becomes
>> a practically workable option.
>
> Noted. I'm very happy to have this discussion and work the problem.
>
> -David
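The "compare two implementations, no IR checks" approach Reames describes
could be sketched in lit terms roughly as below; here the two
implementations are simply the same compiler at -O0 and -O2, the program is
deterministic and UB-free, and all flags and file names are illustrative
assumptions rather than an existing harness.

    // RUN: %clang -O0 %s -o %t.ref
    // RUN: %clang -O2 %s -o %t.opt
    // RUN: %t.ref > %t.ref.out
    // RUN: %t.opt > %t.opt.out
    // RUN: diff %t.ref.out %t.opt.out
    #include <stdio.h>

    int main(void) {
      unsigned acc = 0;
      for (unsigned i = 0; i < 1000; ++i)
        acc = acc * 33u + i;  // deterministic; unsigned wrap-around is well defined
      printf("%u\n", acc);
      return 0;
    }

A mismatch in program output flags a candidate miscompile without asserting
anything about the IR or asm the optimizer produced along the way.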