Florian Hahn via llvm-dev
2019-Oct-10 09:34 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
Hi David,

Thanks for kicking off a discussion on this topic!

> On Oct 9, 2019, at 22:31, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> I absolutely disagree about vectorization tests. We have seen
>>> vectorization loss in clang even though related LLVM lit tests pass,
>>> because something else in the clang pipeline changed that caused the
>>> vectorizer to not do its job.
>>
>> Of course, and as I mentioned I tried to add these tests (probably 4 or 5
>> years ago), but someone (I think Chandler?) was asking me at the time: does
>> it affect a benchmark's performance? If so, why isn't it tracked there? And
>> if not, does it matter?
>> The benchmark was presented as the actual way to check this invariant
>> (because you're only vectorizing to get performance, not for the sake of it).
>> So I never pursued it, even though I'm a bit puzzled that we don't have such
>> tests.
>
> Thanks for explaining.
>
> Our experience is that relying solely on performance tests to uncover
> such issues is problematic for several reasons:
>
> - Performance varies from implementation to implementation. It is
>   difficult to keep tests up-to-date for all possible targets and
>   subtargets.

Could you expand a bit more on what you mean here? Are you concerned about having to run the performance tests on different kinds of hardware? In what way do the existing benchmarks require keeping up-to-date?

With tests checking ASM, wouldn't we end up with lots of checks for various targets/subtargets that we need to keep up to date? Just considering AArch64 as an example, people might want to check the ASM for different architecture versions and different vector extensions, and different vendors might want to make sure that the ASM on their specific cores does not regress.

> - Partially as a result, but also for other reasons, performance tests
>   tend to be complicated, either in code size or in the numerous code
>   paths tested. This makes such tests hard to debug when there is a
>   regression.

I am not sure they have to be. Have you considered adding the small test functions/loops as micro-benchmarks using the existing Google Benchmark infrastructure in the test-suite? I think that might be able to address the points here relatively adequately. The separate micro-benchmarks would be relatively small, and we should be able to track down regressions in a similar fashion as if it were a stand-alone file we compile and then analyze the ASM. Plus, we can easily run it and verify the performance on actual hardware.

> - Performance tests don't focus on the why/how of vectorization. They
>   just check, "did it run fast enough?" Maybe the test ran fast enough
>   for some other reason but we still lost desired vectorization and
>   could have run even faster.

If you added a new micro-benchmark, you could check that it produces the desired result when adding it. The runtime tracking should cover cases where we lost optimizations. I guess if the benchmarks are too big, additional optimizations in one part could hide lost optimizations somewhere else, but I would assume this to be relatively unlikely as long as the benchmarks are isolated.

Also, checking the assembly for vector code does not guarantee that the vector code will actually be executed. For example, by just checking the assembly for certain vector instructions, we might miss that we regressed performance because we messed up the runtime checks guarding the vector loop.

Cheers,
Florian
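For illustration, here is a minimal sketch of what such a micro-benchmark could look like, assuming the Google Benchmark harness that llvm-test-suite already vendors for its MicroBenchmarks; the kernel, names, and sizes are hypothetical, not an existing test:

// Hypothetical micro-benchmark sketch in the style of
// llvm-test-suite/MicroBenchmarks, using the Google Benchmark harness.
#include <vector>
#include "benchmark/benchmark.h"

// A small loop we expect the loop vectorizer to handle.
static void BM_SAXPY(benchmark::State &State) {
  const size_t N = State.range(0);
  std::vector<float> X(N, 1.0f), Y(N, 2.0f);
  for (auto _ : State) {
    for (size_t I = 0; I < N; ++I)
      Y[I] += 3.0f * X[I];
    // Keep the result alive so the kernel is not optimized away.
    benchmark::DoNotOptimize(Y.data());
    benchmark::ClobberMemory();
  }
  State.SetItemsProcessed(State.iterations() * N);
}
BENCHMARK(BM_SAXPY)->Arg(1 << 10)->Arg(1 << 16);

BENCHMARK_MAIN();

A regression in vectorization of the kernel would then show up as a drop in the tracked runtime for that specific benchmark rather than somewhere inside a larger application.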
David Greene via llvm-dev
2019-Oct-10 21:21 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
Florian Hahn via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> - Performance varies from implementation to implementation. It is
>> difficult to keep tests up-to-date for all possible targets and
>> subtargets.
>
> Could you expand a bit more on what you mean here? Are you concerned
> about having to run the performance tests on different kinds of
> hardware? In what way do the existing benchmarks require keeping
> up-to-date?

We have to support many different systems and those systems are always changing (new processors, new BIOS, new OS, etc.). Performance can vary widely from day to day due to factors completely outside the compiler's control. As the performance changes, you have to keep updating the tests to expect the new performance numbers. Relying on performance measurements to ensure something like vectorization is happening just isn't reliable in our experience.

> With tests checking ASM, wouldn't we end up with lots of checks for
> various targets/subtargets that we need to keep up to date?

Yes, that's true. But the only thing that changes the asm generated is the compiler.

> Just considering AArch64 as an example, people might want to check the
> ASM for different architecture versions and different vector
> extensions, and different vendors might want to make sure that the ASM
> on their specific cores does not regress.

Absolutely. We do a lot of that sort of thing downstream.

>> - Partially as a result, but also for other reasons, performance tests
>> tend to be complicated, either in code size or in the numerous code
>> paths tested. This makes such tests hard to debug when there is a
>> regression.
>
> I am not sure they have to be. Have you considered adding the small test
> functions/loops as micro-benchmarks using the existing Google Benchmark
> infrastructure in the test-suite?

We have tried nightly performance runs using LNT/test-suite and have found them to be very unreliable, especially the microbenchmarks.

> I think that might be able to address the points here relatively
> adequately. The separate micro-benchmarks would be relatively small
> and we should be able to track down regressions in a similar fashion
> as if it were a stand-alone file we compile and then analyze the
> ASM. Plus, we can easily run it and verify the performance on actual
> hardware.

A few of my colleagues really struggled to get consistent results out of LNT. They asked for help and discussed it with a few upstream folks, but in the end were not able to get something reliable working. I've talked to a couple of other people off-list and they've had similar experiences. It would be great if we had a reliable performance suite. Please tell us how to get it working! :)

But even then, I still maintain there is a place for the kind of end-to-end testing I describe. Performance testing would complement it. Neither is a replacement for the other.

>> - Performance tests don't focus on the why/how of vectorization. They
>> just check, "did it run fast enough?" Maybe the test ran fast enough
>> for some other reason but we still lost desired vectorization and
>> could have run even faster.
>
> If you added a new micro-benchmark, you could check that it produces
> the desired result when adding it. The runtime tracking should cover
> cases where we lost optimizations. I guess if the benchmarks are too
> big, additional optimizations in one part could hide lost optimizations
> somewhere else, but I would assume this to be relatively unlikely as
> long as the benchmarks are isolated.

Even then, I have seen small performance tests vary widely in performance due to system issues (see above). Again, there is a place for them, but they are not sufficient.

> Also, checking the assembly for vector code does not guarantee that
> the vector code will actually be executed. For example, by just
> checking the assembly for certain vector instructions, we might miss
> that we regressed performance because we messed up the runtime checks
> guarding the vector loop.

Oh, absolutely. Presumably such checks would be included in the test or would be checked by a different test. As always, tests have to be constructed intelligently. :)

-David
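As a deliberately hedged illustration of the kind of end-to-end test being discussed, something along the following lines would compile a small kernel through the full clang pipeline and FileCheck the generated assembly; the target, flags, kernel, and CHECK patterns are assumptions that would need tuning per target, not an existing test in the tree:

// Hypothetical lit-style end-to-end test: run the whole clang pipeline and
// check the asm for packed (ymm) vector code. Triple, flags, and patterns
// are illustrative only.
// RUN: %clang -O2 -mavx2 -S --target=x86_64-unknown-linux-gnu -o - %s | FileCheck %s

// CHECK-LABEL: saxpy
// CHECK: %ymm
extern "C" void saxpy(float *__restrict Y, const float *X, float A, int N) {
  for (int I = 0; I < N; ++I)
    Y[I] += A * X[I];
}

A test like this fails as soon as any part of the clang pipeline stops producing vector code for the kernel, independently of how noisy the performance measurement on a given machine happens to be.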
Sean Silva via llvm-dev
2019-Oct-11 15:48 UTC
[llvm-dev] [cfe-dev] RFC: End-to-end testing
On Thu, Oct 10, 2019 at 2:21 PM David Greene via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> Florian Hahn via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> - Performance varies from implementation to implementation. It is
>>> difficult to keep tests up-to-date for all possible targets and
>>> subtargets.
>>
>> Could you expand a bit more on what you mean here? Are you concerned
>> about having to run the performance tests on different kinds of
>> hardware? In what way do the existing benchmarks require keeping
>> up-to-date?
>
> We have to support many different systems and those systems are always
> changing (new processors, new BIOS, new OS, etc.). Performance can vary
> widely from day to day due to factors completely outside the compiler's
> control. As the performance changes, you have to keep updating the tests
> to expect the new performance numbers. Relying on performance
> measurements to ensure something like vectorization is happening just
> isn't reliable in our experience.

Could you compare performance with vectorization turned on and off?
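One possible (again, hypothetical) way to get such a comparison inside a single micro-benchmark is to register the same kernel twice and pin one copy to scalar code with clang's loop pragmas, so the vectorized and unvectorized timings are tracked side by side:

// Hypothetical sketch: benchmark the same loop with vectorization left to
// clang's defaults and with it disabled via clang loop pragmas.
#include <vector>
#include "benchmark/benchmark.h"

static void BM_SAXPY_Default(benchmark::State &State) {
  std::vector<float> X(1 << 16, 1.0f), Y(1 << 16, 2.0f);
  for (auto _ : State) {
    for (size_t I = 0, N = X.size(); I < N; ++I)
      Y[I] += 3.0f * X[I];
    benchmark::DoNotOptimize(Y.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_SAXPY_Default);

static void BM_SAXPY_Scalar(benchmark::State &State) {
  std::vector<float> X(1 << 16, 1.0f), Y(1 << 16, 2.0f);
  for (auto _ : State) {
    // Force the scalar form so the two timings can be compared directly.
#pragma clang loop vectorize(disable) interleave(disable)
    for (size_t I = 0, N = X.size(); I < N; ++I)
      Y[I] += 3.0f * X[I];
    benchmark::DoNotOptimize(Y.data());
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_SAXPY_Scalar);

BENCHMARK_MAIN();

If the two variants ever converge in runtime, that by itself is a signal that the default build lost its vectorization, even before anyone inspects the assembly.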