thr3ads.net - llvm dev - [llvm-dev] RFC: Are auto-generated assertions a good practice? [May 2018]

If this information is useful, please help other people find it:
Share via:

David Blaikie via llvm-dev

2018-May-04 17:30 UTC

[llvm-dev] RFC: Are auto-generated assertions a good practice?

On Fri, May 4, 2018 at 10:16 AM Sanjay Patel <spatel at rotateright.com>
wrote:
> I understand the overfit argument (but in most cases it just shows that a
> unit test isn't minimized)...
>
Even minimized tests sometimes need a few other things to setup the
circumstance (many DWARF tests, for example - produce the full DWARF
output, but maybe you only care about one part of it (maybe you care about
the encoding of instruction ranges, but not the way names are rendered -
but there's no option to produce only ranges without names (wouldn't
really
make sense)).

> but I don't see how the complete auto-generated assertions could be
worse
> at detecting a miscompile than incomplete manually-generated assertions?
>
Generally overfitting wouldn't result in being bad at detecting failures,
but excess false positives - if things in the output unrelated to the issue
change (in intentional/benign ways) & cause the tests to fail often &
become a burden for the project.

Not suggesting that's the case with these particular instances - but
it's a
tradeoff/concern to keep in mind.

- Dave

> The whole point of auto-generating complete checks is to catch
> miscompiles/regressions sooner. Ie, before they get committed and result in
> bug reports. I don't know how to mine the bugzilla stats on that, but
it
> feels like we have made good progress on that front with better regression
> tests. Some history of the motivation and scripts here:
> https://bugs.llvm.org/show_bug.cgi?id=22897
>
>
>
> On Fri, May 4, 2018 at 10:31 AM, David Blaikie via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Yep - all about balance.
>>
>> The main risk are tests that overfit (golden files being the worst case
-
>> checking that the entire output matches /exactly/ - this is what
FileCheck
>> is intended to help avoid) and maintainability. In the case of the
>> autogenerated FileCheck lines I've seen so far - they seem like
they still
>> walk a fairly good line of checking exactly what's intended. Though
I
>> sometimes wonder if they're checking full combinatorial expansions
that may
>> not be required/desirable - always a tradeoff of just how black/white
box
>> tests are.
>>
>> On Fri, May 4, 2018 at 2:56 AM Alex Bradbury via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> On 4 May 2018 at 10:25, Alexandros Lamprineas via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>> > Hello llvm-dev,
>>> >
>>> >
>>> > On a recent code review I was asked to auto-generate assertion
checks
>>> for my
>>> > unit test. I wasn't aware that this was even possible. I
am referring
>>> to the
>>> > python `update` scripts under `utils` directory. My first
reaction was
>>> wow!
>>> > I found it very practical and useful. It saves you significant
amount
>>> of
>>> > time when writing a regression test. So I gave it a try. The
generated
>>> > checks were satisfying enough, almost exactly what I wanted.
Then I
>>> got a
>>> > bit sceptical about them. I am worried that auto-generating
tests
>>> based on
>>> > the compiler output can be quite dangerous. The tests will
always pass
>>> > regardless of whether the compiler emits right or wrong code,
>>> therefore you
>>> > have to be certain that they impose the desired compiler
behaviour. I
>>> guess
>>> > the question here is how often we should be using those
scripts.
>>>
>>> Like many test-related issues, it comes down to personal judgement.
It
>>> is of course easy to create test/CodeGen/*/* tests that pass
>>> regardless of whether the compiler breaks broken code regardless of
>>> whether the test CHECK lines are generated by
>>> update_llc_test.checks.py or not.
>>>
>>> I find it very helpful to have auto-generated CHECK lines that pick
up
>>> any codegen change, but this can of course be problematic for very
>>> large test cases that are likely to see churn due to scheduling or
>>> regallloc changes. Being able to regenerate the CHECK lines and
view
>>> the diff is also incredibly helpful when rebasing or moving a patch
>>> between different branches.
>>>
>>> My policy for test/Codegen/RISCV is to use
update_llc_test_checks.py
>>> wherever possible, except in cases where there are so many CHECK
lines
>>> on the output that they obscure the property being tested,
indicating
>>> that a more limited hand-crafted pattern would be superior.
>>>
>>> Best,
>>>
>>> Alex
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180504/e2f5d759/attachment.html>

Sanjay Patel via llvm-dev

2018-May-04 17:53 UTC

head link

[llvm-dev] RFC: Are auto-generated assertions a good practice?

On Fri, May 4, 2018 at 11:30 AM, David Blaikie <dblaikie at gmail.com>
wrote:
>
>
> On Fri, May 4, 2018 at 10:16 AM Sanjay Patel <spatel at
rotateright.com>
> wrote:
>
>> I understand the overfit argument (but in most cases it just shows that
a
>> unit test isn't minimized)...
>>
>
> Even minimized tests sometimes need a few other things to setup the
> circumstance (many DWARF tests, for example - produce the full DWARF
> output, but maybe you only care about one part of it (maybe you care about
> the encoding of instruction ranges, but not the way names are rendered -
> but there's no option to produce only ranges without names
(wouldn't really
> make sense)).
>
>
>> but I don't see how the complete auto-generated assertions could be
worse
>> at detecting a miscompile than incomplete manually-generated
assertions?
>>
>
> Generally overfitting wouldn't result in being bad at detecting
failures,
> but excess false positives - if things in the output unrelated to the issue
> change (in intentional/benign ways) & cause the tests to fail often
&
> become a burden for the project.
>
> Not suggesting that's the case with these particular instances - but
it's
> a tradeoff/concern to keep in mind.
>
Yes, that's definitely something we've acknowledged for x86 codegen
tests.
If someone wants to make a big isel, scheduler, or RA change, it's going to
result in a *lot* of regression test diffs. On the plus side, it forces the
author (I've definitely been there) to consider a lot of patterns that they
probably didn't think about it initially.

Note that for IR assertions, the scripts reg-ex-ify value names:

define i1 @trueval_is_true(i1 %C, i1 %X) {
; CHECK-LABEL: @trueval_is_true(
; CHECK-NEXT:    [[R:%.*]] = or i1 [[C:%.*]], [[X:%.*]]
; CHECK-NEXT:    ret i1 [[R]]
;
  %R = select i1 %C, i1 true, i1 %X
  ret i1 %R
}

So this actually insulates the tests from cosmetic changes in value naming.
That's something that's frequently overlooked in hand-written tests
because
it takes a lot of effort to type all that out. :)



> - Dave
>
>
>> The whole point of auto-generating complete checks is to catch
>> miscompiles/regressions sooner. Ie, before they get committed and
result in
>> bug reports. I don't know how to mine the bugzilla stats on that,
but it
>> feels like we have made good progress on that front with better
regression
>> tests. Some history of the motivation and scripts here:
>> https://bugs.llvm.org/show_bug.cgi?id=22897
>>
>>
>>
>> On Fri, May 4, 2018 at 10:31 AM, David Blaikie via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Yep - all about balance.
>>>
>>> The main risk are tests that overfit (golden files being the worst
case
>>> - checking that the entire output matches /exactly/ - this is what
>>> FileCheck is intended to help avoid) and maintainability. In the
case of
>>> the autogenerated FileCheck lines I've seen so far - they seem
like they
>>> still walk a fairly good line of checking exactly what's
intended. Though I
>>> sometimes wonder if they're checking full combinatorial
expansions that may
>>> not be required/desirable - always a tradeoff of just how
black/white box
>>> tests are.
>>>
>>> On Fri, May 4, 2018 at 2:56 AM Alex Bradbury via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> On 4 May 2018 at 10:25, Alexandros Lamprineas via llvm-dev
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>> > Hello llvm-dev,
>>>> >
>>>> >
>>>> > On a recent code review I was asked to auto-generate
assertion checks
>>>> for my
>>>> > unit test. I wasn't aware that this was even possible.
I am referring
>>>> to the
>>>> > python `update` scripts under `utils` directory. My first
reaction
>>>> was wow!
>>>> > I found it very practical and useful. It saves you
significant amount
>>>> of
>>>> > time when writing a regression test. So I gave it a try.
The generated
>>>> > checks were satisfying enough, almost exactly what I
wanted. Then I
>>>> got a
>>>> > bit sceptical about them. I am worried that
auto-generating tests
>>>> based on
>>>> > the compiler output can be quite dangerous. The tests will
always pass
>>>> > regardless of whether the compiler emits right or wrong
code,
>>>> therefore you
>>>> > have to be certain that they impose the desired compiler
behaviour. I
>>>> guess
>>>> > the question here is how often we should be using those
scripts.
>>>>
>>>> Like many test-related issues, it comes down to personal
judgement. It
>>>> is of course easy to create test/CodeGen/*/* tests that pass
>>>> regardless of whether the compiler breaks broken code
regardless of
>>>> whether the test CHECK lines are generated by
>>>> update_llc_test.checks.py or not.
>>>>
>>>> I find it very helpful to have auto-generated CHECK lines that
pick up
>>>> any codegen change, but this can of course be problematic for
very
>>>> large test cases that are likely to see churn due to scheduling
or
>>>> regallloc changes. Being able to regenerate the CHECK lines and
view
>>>> the diff is also incredibly helpful when rebasing or moving a
patch
>>>> between different branches.
>>>>
>>>> My policy for test/Codegen/RISCV is to use
update_llc_test_checks.py
>>>> wherever possible, except in cases where there are so many
CHECK lines
>>>> on the output that they obscure the property being tested,
indicating
>>>> that a more limited hand-crafted pattern would be superior.
>>>>
>>>> Best,
>>>>
>>>> Alex
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180504/357031a1/attachment-0001.html>

David Blaikie via llvm-dev

2018-May-04 17:57 UTC

head link

[llvm-dev] RFC: Are auto-generated assertions a good practice?

On Fri, May 4, 2018 at 10:53 AM Sanjay Patel <spatel at rotateright.com>
wrote:
> On Fri, May 4, 2018 at 11:30 AM, David Blaikie <dblaikie at
gmail.com> wrote:
>
>>
>>
>> On Fri, May 4, 2018 at 10:16 AM Sanjay Patel <spatel at
rotateright.com>
>> wrote:
>>
>>> I understand the overfit argument (but in most cases it just shows
that
>>> a unit test isn't minimized)...
>>>
>>
>> Even minimized tests sometimes need a few other things to setup the
>> circumstance (many DWARF tests, for example - produce the full DWARF
>> output, but maybe you only care about one part of it (maybe you care
about
>> the encoding of instruction ranges, but not the way names are rendered
-
>> but there's no option to produce only ranges without names
(wouldn't really
>> make sense)).
>>
>>
>>> but I don't see how the complete auto-generated assertions
could be
>>> worse at detecting a miscompile than incomplete manually-generated
>>> assertions?
>>>
>>
>> Generally overfitting wouldn't result in being bad at detecting
failures,
>> but excess false positives - if things in the output unrelated to the
issue
>> change (in intentional/benign ways) & cause the tests to fail often
&
>> become a burden for the project.
>>
>> Not suggesting that's the case with these particular instances -
but it's
>> a tradeoff/concern to keep in mind.
>>
>
> Yes, that's definitely something we've acknowledged for x86 codegen
tests.
> If someone wants to make a big isel, scheduler, or RA change, it's
going to
> result in a *lot* of regression test diffs. On the plus side, it forces the
> author (I've definitely been there) to consider a lot of patterns that
they
> probably didn't think about it initially.
>
*nod* Though if we had good ways of isolating tests further to avoid this
churn, I think we would do so - even at the loss of that consideration.

>  Note that for IR assertions, the scripts reg-ex-ify value names:
>
*nod* that's one example of the sort of good thing that avoids tests being
noisier than they need to be, etc.

>
> define i1 @trueval_is_true(i1 %C, i1 %X) {
> ; CHECK-LABEL: @trueval_is_true(
> ; CHECK-NEXT:    [[R:%.*]] = or i1 [[C:%.*]], [[X:%.*]]
> ; CHECK-NEXT:    ret i1 [[R]]
> ;
>   %R = select i1 %C, i1 true, i1 %X
>   ret i1 %R
> }
>
> So this actually insulates the tests from cosmetic changes in value
> naming. That's something that's frequently overlooked in
hand-written tests
> because it takes a lot of effort to type all that out. :)
>
That's not been my experience - though it is a bit laborious, most/all of
the hand written tests I see these days tend to do a good job of using
matches for value names rather than hardcoding them. But I don't write/work
with optimization tests so much - maybe they're often small enough that the
names are rarely perturbed & so hardcoding can be reasonable. In debug info
and clang tests it's pretty necessary because there's lots of moving
parts.

In any case I'm not suggesting the autogenerated checks are bad/broken/need
to be fixed - just that there are tradeoffs to keep in mind when working
with these things.

- Dave

>
>
>
>
>> - Dave
>>
>>
>>> The whole point of auto-generating complete checks is to catch
>>> miscompiles/regressions sooner. Ie, before they get committed and
result in
>>> bug reports. I don't know how to mine the bugzilla stats on
that, but it
>>> feels like we have made good progress on that front with better
regression
>>> tests. Some history of the motivation and scripts here:
>>> https://bugs.llvm.org/show_bug.cgi?id=22897
>>>
>>>
>>>
>>> On Fri, May 4, 2018 at 10:31 AM, David Blaikie via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Yep - all about balance.
>>>>
>>>> The main risk are tests that overfit (golden files being the
worst case
>>>> - checking that the entire output matches /exactly/ - this is
what
>>>> FileCheck is intended to help avoid) and maintainability. In
the case of
>>>> the autogenerated FileCheck lines I've seen so far - they
seem like they
>>>> still walk a fairly good line of checking exactly what's
intended. Though I
>>>> sometimes wonder if they're checking full combinatorial
expansions that may
>>>> not be required/desirable - always a tradeoff of just how
black/white box
>>>> tests are.
>>>>
>>>> On Fri, May 4, 2018 at 2:56 AM Alex Bradbury via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> On 4 May 2018 at 10:25, Alexandros Lamprineas via llvm-dev
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>> > Hello llvm-dev,
>>>>> >
>>>>> >
>>>>> > On a recent code review I was asked to auto-generate
assertion
>>>>> checks for my
>>>>> > unit test. I wasn't aware that this was even
possible. I am
>>>>> referring to the
>>>>> > python `update` scripts under `utils` directory. My
first reaction
>>>>> was wow!
>>>>> > I found it very practical and useful. It saves you
significant
>>>>> amount of
>>>>> > time when writing a regression test. So I gave it a
try. The
>>>>> generated
>>>>> > checks were satisfying enough, almost exactly what I
wanted. Then I
>>>>> got a
>>>>> > bit sceptical about them. I am worried that
auto-generating tests
>>>>> based on
>>>>> > the compiler output can be quite dangerous. The tests
will always
>>>>> pass
>>>>> > regardless of whether the compiler emits right or
wrong code,
>>>>> therefore you
>>>>> > have to be certain that they impose the desired
compiler behaviour.
>>>>> I guess
>>>>> > the question here is how often we should be using
those scripts.
>>>>>
>>>>> Like many test-related issues, it comes down to personal
judgement. It
>>>>> is of course easy to create test/CodeGen/*/* tests that
pass
>>>>> regardless of whether the compiler breaks broken code
regardless of
>>>>> whether the test CHECK lines are generated by
>>>>> update_llc_test.checks.py or not.
>>>>>
>>>>> I find it very helpful to have auto-generated CHECK lines
that pick up
>>>>> any codegen change, but this can of course be problematic
for very
>>>>> large test cases that are likely to see churn due to
scheduling or
>>>>> regallloc changes. Being able to regenerate the CHECK lines
and view
>>>>> the diff is also incredibly helpful when rebasing or moving
a patch
>>>>> between different branches.
>>>>>
>>>>> My policy for test/Codegen/RISCV is to use
update_llc_test_checks.py
>>>>> wherever possible, except in cases where there are so many
CHECK lines
>>>>> on the output that they obscure the property being tested,
indicating
>>>>> that a more limited hand-crafted pattern would be superior.
>>>>>
>>>>> Best,
>>>>>
>>>>> Alex
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180504/c48258c4/attachment.html>

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - May 2018 - RFC: Are auto-generated assertions a good practice?

[llvm-dev] RFC: Are auto-generated assertions a good practice?

[llvm-dev] RFC: Are auto-generated assertions a good practice?

[llvm-dev] RFC: Are auto-generated assertions a good practice?

Possibly Parallel Threads