thr3ads.net - llvm dev - [llvm-dev] Floating point variance in the test suite [Jun 2021]


Kaylor, Andrew via llvm-dev

2021-Jun-24 18:13 UTC

[llvm-dev] Floating point variance in the test suite

> If you truly want to benchmark LLVM, you should really be running specific
> benchmarks in specific ways and looking very carefully at the results, not
> relying on the test-suite.
This gets at my questions about which benchmarks are important and who considers
them to be important. I expect a lot of us have non-public testing going on for
the benchmarks that we consider to be critical. I see the test suite benchmarks
as more of a guard rail to catch changes that degrade performance early and in a
way that is convenient for other community members to address. So, to me, the
benchmarks don’t have to be perfect measures. On the other hand, if we just
disable things like fast-math and FMA, the benchmarks won’t tell us anything at
all about the impact of changes touching those optimizations.
> What we want is to make sure the program doesn't generate garbage, but
> garbage means different things for different tests, and having an external tool
> that knows what each of the tests think is garbage is not practical.
Yes, I agree. Your example in Bugzilla of NEON versus VFP instructions brings up
another issue. If I run a test with value-changing optimizations enabled, small
variations are acceptable, but if I run the same test with “precise” floating
point options, I shouldn’t see any differences from the expected results
(depending, of course, on library implementations).

So, I think we need a way for each test to indicate whether it can be run in
value-unsafe modes, to set different tolerances for different modes, and to be
built to run differently in different modes. For example, if I’m running the
Blur test in a value safe mode, there’s no need to perform an internal
comparison and a hashed output comparison can be used. If I’m running with
fp-contract=on or fast-math, I’d want an internal value check but those modes
might have different tolerances. Finally, I might want a way to run the test as
a benchmark with either fp-contract=on or fast-math without any check of the
results in order to get better performance data.
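A minimal sketch of what such a per-test declaration could look like in C. Every name here (`fp_mode`, `check_kind`, `test_policy`, `blur_policy`, `select_check`) is hypothetical and nothing like it exists in the test-suite today; the tolerance values are placeholders, not recommendations for Blur.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not an existing test-suite API. */
enum fp_mode { FP_PRECISE, FP_CONTRACT_ON, FP_FAST_MATH };
enum check_kind { CHECK_NONE, CHECK_HASH, CHECK_INTERNAL };

struct fp_check {
    enum check_kind kind;
    double rel_tol; /* only meaningful for CHECK_INTERNAL */
};

/* Per-test policy: one entry per floating-point mode. */
struct test_policy {
    bool allows_value_unsafe;
    struct fp_check by_mode[3];
};

/* Placeholder tolerances; the right values need someone who knows the test. */
static const struct test_policy blur_policy = {
    .allows_value_unsafe = true,
    .by_mode = {
        [FP_PRECISE]     = { CHECK_HASH,     0.0  },
        [FP_CONTRACT_ON] = { CHECK_INTERNAL, 1e-6 },
        [FP_FAST_MATH]   = { CHECK_INTERNAL, 1e-4 },
    },
};

/* Pick the check for one run. Returns false if the test may not be run in
   the requested mode. A benchmark-only run in a value-unsafe mode skips the
   result check entirely. */
static bool select_check(const struct test_policy *policy, enum fp_mode mode,
                         bool benchmark_only, struct fp_check *out)
{
    if (mode != FP_PRECISE && !policy->allows_value_unsafe)
        return false;
    *out = policy->by_mode[mode];
    if (benchmark_only && mode != FP_PRECISE)
        out->kind = CHECK_NONE;
    return true;
}
```

With this layout, a value-safe run of Blur would select the hash comparison, a fast-math test run would select the internal check with its own tolerance, and a benchmark-only run would select no check at all.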

As for updating the tests, I’m going to bring up test ownership again because I
don’t know what constitutes acceptable variation for any given test. I could
take a guess at it, but if I get it wrong, my wrong guess becomes semi-enshrined
in the test suite and may not be noticed by people who would know better.

For the blur example, the FMA is happening on this line:

          sum_in_current_frame += (inputImage[i + k][j + l] *
                                   gaussianFilter[k + offset][l + offset]);

That’s an accumulated result inside four nested loops. It looks like in practice
the differently rounded results with FMA must be getting averaged out most of
the time, which makes sense assuming a relatively consistent magnitude of
values, but I’d have to study the algorithm to understand exactly what’s
happening and how to check the results reliably for a range of inputs. I think
that’s too much to expect from someone who is just making some optimization
change that triggers a failure in the test.
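To illustrate how one multiply-add term can round differently, here is a small sketch, assuming `float` operands (the element type used by Blur is not stated here). Both helper names are hypothetical. The second helper is only a stand-in for a real fused multiply-add such as `fmaf` from `<math.h>`: it avoids rounding the product to `float` by doing the arithmetic in `double`, where the product of two floats is exact.

```c
/* Multiply, round the product to float, then add. The volatile store keeps
   the compiler from contracting the two operations into one FMA. */
static float unfused_mul_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Stand-in for a fused multiply-add (assumption: not a real FMA). The
   product of two floats is exact in double, so it is never rounded to
   float before the add; only the final conversion rounds to float. */
static float fused_like_mul_add(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

For example, with a = 1 + 2^-13, b = 1 - 2^-13 and c = -1, the exact product is 1 - 2^-26, which rounds to 1.0f as a float. The unfused form therefore returns 0, while the fused-like form returns -2^-26. Whether such per-term differences cancel or accumulate over the four nested loops depends on the data, which is why the acceptable tolerance is hard to guess.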

In the case that led me to start the discussion this week, Melanie was just
making the behavior of clang match its documentation. She didn’t even change any
optimizations. The failures that were exposed would always have happened if
certain compilation options were used. Naturally, she just wanted to not turn
any buildbots red. Then I started looking at the failing tests and ended up
opening this can of worms.

-Andy

From: Renato Golin <rengolin at gmail.com>
Sent: Thursday, June 24, 2021 1:06 PM
To: Kaylor, Andrew <andrew.kaylor at intel.com>
Cc: llvm-dev at lists.llvm.org; Michael Kruse <llvmdev at meinersbur.de>;
amykibm at gmail.com; Hubert Tong <hubert.reinterpretcast at gmail.com>
Subject: Re: [llvm-dev] Floating point variance in the test suite

Hi Andrew,

Sorry I didn't see this before. My reply on Bugzilla didn't take into
account the contents here, so it is probably moot.

On Thu, 24 Jun 2021 at 17:22, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:

> I don't agree that the result doesn't matter for benchmarks. It seems
> that the benchmarks are some of the best tests we have for exercising
> optimizations like this and if the result is wrong by a wide enough margin that
> could indicate a problem. But I understand Renato’s point that the performance
> measurement is the primary purpose of the benchmarks, and some numeric
> differences should be acceptable.
Yes, that's the point I was trying to make. You can't run a benchmark
without understanding what it does and what the results mean. Small variations
can be fine in one benchmark and totally unacceptable in others. However, what
we have in the test-suite are benchmark-turned-tests and tests-turned-benchmarks
in which the exact output matters a lot less, and matters more only if
it's totally different (e.g. error messages, NaNs). My comment applied just to
the subset we have in the test-suite, not benchmarks in general.

If you truly want to benchmark LLVM, you should really be running specific
benchmarks in specific ways and looking very carefully at the results, not
relying on the test-suite.

> In the previous discussion of this issue, Sebastian Pop proposed having the
> program run twice -- once with "precise" FP results, and once with the
> optimizations being tested. For the Blur test, the floating point results are
> only intermediate and the final (printed) results are a matrix of 8-bit
> integers. I’m not sure what would constitute an acceptable result for this
> program. For any given value, an off-by-one result seems acceptable, but if
> there are too many off-by-one values that would probably indicate a problem. In
> the Polybench tests, Sebastian modified the tests to do a comparison within the
> test itself. I don’t know if that’s practical for Blur or if it would be better
> to have two runs and use a custom comparison tool.
Given the point above about the difference between benchmarks and
test-suite-benchmarks, I think having comparisons inside the program itself is
probably the best way forward. I should have mentioned that on my list, as I did
that, too, in the test-suite.

The main problem with that, for benchmarks, is that they can add substantial
runtime and change the profile of the test. But that can be easily fixed by
iterating a few more times on the kernel (from the ground state).

What we want is to make sure the program doesn't generate garbage, but
garbage means different things for different tests, and having an external tool
that knows what each of the tests think is garbage is not practical.

The way I see it, there are only three types of comparison:
 * Text comparison, for tests that must be identical on every platform.
 * Hash comparison, for those above where the output is too big.
 * FP-comparison, for those where the text and integers must be identical but
the FP numbers can vary a bit.

The weird behaviour of fpcmp looking at hashes and comparing the numbers in them
is a bug, IMO. As is comparing integers and allowing wiggle room.

Using fpcmp for comparing text is fine, because what it does with text and
integers should be exactly the same thing as diff, and if the text has FP
output, then it also can change depending on precision and it's mostly fine
if it does.

To me, the path forward is to fix the tests that break with one of the
alternatives above, and make sure fpcmp doesn't identify hex, octal, binary
or integers as floating-point, and treat them all as text.
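As a rough sketch of that rule, the following C function decides whether a single token should get floating-point treatment. It is not fpcmp's actual implementation; `is_fp_token` is a hypothetical name, and the rule it encodes (a decimal number counts as floating point only if it has a fraction point or an exponent) is an assumption about what "treat them all as text" could mean in practice.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper, not part of fpcmp. Returns true only for decimal
   floating-point tokens such as "1.5", "-2.0e-3" or "1e5". Plain integers
   ("42", "0755"), hex ("0x1f"), binary ("0b101") and hash-like strings
   ("d41d8cd9") are rejected so the caller can compare them as text. */
static bool is_fp_token(const char *s)
{
    bool digits = false; /* saw at least one mantissa digit */
    bool marker = false; /* saw a fraction point or a valid exponent */

    if (*s == '+' || *s == '-')
        ++s;
    while (isdigit((unsigned char)*s)) { digits = true; ++s; }
    if (*s == '.') {
        marker = true;
        ++s;
        while (isdigit((unsigned char)*s)) { digits = true; ++s; }
    }
    if (digits && (*s == 'e' || *s == 'E')) {
        const char *e = s + 1;
        if (*e == '+' || *e == '-')
            ++e;
        if (isdigit((unsigned char)*e)) {
            marker = true;
            while (isdigit((unsigned char)*e))
                ++e;
            s = e;
        }
    }
    /* The whole token must have been consumed. */
    return digits && marker && *s == '\0';
}
```

A token-level rule like this cannot be perfect: a hash that happens to consist only of digits around a single `e` would still look like a number, which is one more reason to keep hashes out of outputs that go through an FP comparison.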

For the Blur test, a quick comparison between the two matrices inside the
program (with appropriate wiggle room) would suffice.
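A minimal sketch of such an in-program check, assuming both results are available as flat arrays of 8-bit values. The function name `images_match` and the `max_off_by_one` budget are hypothetical; the policy it encodes (any single value may be off by one, but not too many of them, and nothing may be off by more) follows the off-by-one criterion quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Compares a reference image (precise FP build) with a
   candidate image (optimized build). Fails if any pixel differs by more
   than 1, or if more than max_off_by_one pixels differ by exactly 1. */
static bool images_match(const uint8_t *reference, const uint8_t *candidate,
                         size_t count, size_t max_off_by_one)
{
    size_t off_by_one = 0;

    for (size_t i = 0; i < count; ++i) {
        int diff = (int)reference[i] - (int)candidate[i];
        if (diff < 0)
            diff = -diff;
        if (diff > 1)
            return false;
        if (diff == 1 && ++off_by_one > max_off_by_one)
            return false;
    }
    return true;
}
```

The budget would have to be chosen per test, for example as a small fraction of the pixel count; picking that fraction is exactly the judgment call that needs someone who understands the algorithm.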

Renato Golin via llvm-dev

2021-Jun-25 08:22 UTC


[llvm-dev] Floating point variance in the test suite

On Thu, 24 Jun 2021 at 19:13, Kaylor, Andrew <andrew.kaylor at intel.com>
wrote:
> This gets at my questions about which benchmarks are important and who
> considers them to be important. I expect a lot of us have non-public
> testing going on for the benchmarks that we consider to be critical. I see
> the test suite benchmarks as more of a guard rail to catch changes that
> degrade performance early and in a way that is convenient for other
> community members to address. So, to me, the benchmarks don’t have to be
> perfect measures. On the other hand, if we just disable things like
> fast-math and FMA, the benchmarks won’t tell us anything at all about the
> impact of changes touching those optimizations.
>
Right, that's why we ended up with the complicated fp-contract situation.

Moreover, the benchmarks we have in the test-suite weren't super tuned like
other commercial ones. Worse still, they have to "compare similar" on a
large number of platforms, with some potentially unstable output.

To make those benchmarks stable we'd need to:
 * Understand what the program is trying to generate
 * Only output relevant information (not like a dump of a huge intermediary
matrix)
 * Make sure the output is stable under varying conditions on different
architectures

Only then will comparing outputs, even with fpcmp, be meaningful. Right
now, we're at the stage of "let's just hope the output is the same", which
makes discussions like this one a recurrent theme.

>  So, I think we need a way for each test to indicate whether it can be run
> in value-unsafe modes, to set different tolerances for different modes, and
> to be built to run differently in different modes. For example, if I’m
> running the Blur test in a value safe mode, there’s no need to perform an
> internal comparison and a hashed output comparison can be used. If I’m
> running with fp-contract=on or fast-math, I’d want an internal value check
> but those modes might have different tolerances. Finally, I might want a
> way to run the test as a benchmark with either fp-contract=on or fast-math
> without any check of the results in order to get better performance data.
>
Yes, I think we may need different comparisons for different runs. For
example, on a test run, the FP delta must be really small, but on a
benchmark run, it can be larger or even ignore some numbers we know don't
change the overall result.

I believe this has to be in each benchmark's code, not in the comparison
tool, which has to be as dumb as possible.
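A small sketch of what that could look like inside a benchmark's own code, in C. The names `run_kind`, `tolerance_for` and `fp_close` are hypothetical, and the two tolerance values are placeholders rather than recommendations; each benchmark would need its own.

```c
#include <stdbool.h>

/* Hypothetical: how the program was built or invoked. */
enum run_kind { RUN_TEST, RUN_BENCHMARK };

/* Placeholder relative tolerances: tight for test runs, looser for
   benchmark runs. */
static double tolerance_for(enum run_kind kind)
{
    return kind == RUN_TEST ? 1e-9 : 1e-3;
}

static double abs_diff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* True if actual is within the run's tolerance of expected. The bound is
   relative for |expected| > 1 and absolute otherwise, so values near zero
   do not demand exact equality. A NaN in either argument fails the check,
   because every comparison involving NaN is false. */
static bool fp_close(double expected, double actual, enum run_kind kind)
{
    double scale = expected < 0 ? -expected : expected;
    double tol = tolerance_for(kind);
    double bound = scale > 1.0 ? tol * scale : tol;

    return abs_diff(expected, actual) <= bound;
}
```

The program can then print a plain pass/fail line, which keeps the external comparison tool as simple as a text diff.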

> As for updating the tests, I’m going to bring up test ownership again
> because I don’t know what constitutes acceptable variation for any given
> test. I could take a guess at it, but if I get it wrong, my wrong guess
> becomes semi-enshrined in the test suite and may not be noticed by people
> who would know better.
>
Unfortunately, there are no owners to the tests like that.

There are people who know more about certain tests than others, but I have
added tests and benchmarks to the test-suite without really knowing a lot
about them in the past, and I believe many other people have, too.

There's no way to know who knows more about a particular test other than by
asking, so I think the easiest way forward is to send an RFC to the list for each
benchmark we want to change with the proposal. If no one thinks that's a
bad idea, we go with it. If someone downstream raises issues, reverting the
commit to one single test/benchmark is easier than one that touches a lot
of different tests.

> That’s an accumulated result inside four nested loops. It looks like in
> practice the differently rounded results with FMA must be getting averaged
> out most of the time, which makes sense assuming a relatively consistent
> magnitude of values, but I’d have to study the algorithm to understand
> exactly what’s happening and how to check the results reliably for a range
> of inputs. I think that’s too much to expect from someone who is just
> making some optimization change that triggers a failure in the test.
>
Absolutely agreed.

The work that Melanie, Sebastian and many others have done on improving the
test-suite is an important but thankless job.

Unfortunately, no one has time to spend on cleaning up the test-suite for
more than a month (stretching) so it gets some attention then fades.

Many years ago I spent a good few months on it because running on Arm would
yield the wildest differences in output, and my goal was to make Arm
a first-class citizen in LLVM, so it had to run on Arm buildbots without
noise.

As you'd expect, I only touched the tests that broke on Arm (at the time,
too many!), but cleaning the test-suite wasn't my primary goal. Since then,
many people have done the same, for new targets, new optimisations, etc.
But all with the test-suite as a secondary goal, which brings us here.

There were notable exceptions, for example when the Benchmark mode was
added, when LNT had its interface revamped with good statistics,
Sebastian's fp-contract change, now Melanie's work, etc. But they were few
and far between.

We had a GSOC project to make the test-suite robust, but honestly, that's
not the sexiest GSOC project ever, so we're still waiting for some kind
soul to go through the painful task of understanding everything and
validating the output stability.

>  In the case that led me to start the discussion this week, Melanie was
> just making the behavior of clang match its documentation. She didn’t even
> change any optimizations. The failures that were exposed would always have
> happened if certain compilation options were used. Naturally, she just
> wanted to not turn any buildbots red. Then I started looking at the failing
> tests and ended up opening this can of worms.
>
I sincerely apologise. :D

I know we all have more important things to do (our jobs, for starters) than
to fix some spaghetti monster that should have been good enough from the
beginning. But the truth is, testing and benchmarking are really hard jobs.

I think this is not just an important part of the project to do, but I also
think less experienced developers should all go through an experience like
that some time in their early careers.

The work is painful, but it is also interesting. Understanding the
benchmarks teaches you a lot about their fields (ray tracing, physics
simulation, Fourier transforms, etc) as well as the numeric techniques
used.

You also learn about output stability, good software engineering, integer
and floating point arithmetic and all the pitfalls around them. It may not
make for the best CV item, but you do become a better programmer after
working on those hairy issues.

So, while we do what we can when we must, it really needs someone new, with
fresh eyes, looking at it as if everything is wrong, and coming up with a
much better solution than what we have today.

cheers,
--renato
