On 30 April 2014 07:50, Tobias Grosser <tobias at grosser.es> wrote:
> In general, I see such changes as a second step. First, we want to have a
> system in place that allows us to reliably detect if a benchmark is noisy or
> not; second, we want to increase the number of benchmarks that are not noisy
> and where we can use the results.

I personally use the test-suite for correctness, not performance, and
would not like to have its run time increased by any means. As discussed
in the BoF last year, if we could separate the test run from the
benchmark run before we make any change, I'd appreciate it. I want a
separate benchmark bot on the subset that makes sense as a benchmark,
but I don't want the noise of the rest.

> 1. Get 5-10 samples per run
> 2. Do the Wilcoxon/Mann-Whitney test

5-10 samples on an ARM board is not feasible. Currently it takes 1 hour
to run the whole set; making it run for 5-10 hours would reduce its
value to zero.

> I am a little sceptical on this. Machines should generally not be noisy.

ARM machines work at a much lower power level than Intel ones, the
scheduler is a lot more aggressive, and the quality of the peripherals
is *a lot* worse.

Even if you set up the board for benchmarks (fix the scheduler, put
everything up to 11), the quality of the external hardware (USB, SD,
eMMC, etc.) and its drivers does a lot of damage to any meaningful
number you may extract, if the moon is full and Jupiter is in
Sagittarius.

So...

> However, if for some reason there is noise on the machine, the noise is as
> likely to appear during this pre-noise-test as during the actual benchmark
> runs, maybe during both, but maybe also only during the benchmark. So I am
> afraid we might often run into the situation where this test says OK but
> the later test is still suffering noise.

...this is not entirely true on ARM. We may be getting server-quality
hardware for AArch64 any time now, but it's very unlikely that we'll
*ever* get quality 32-bit test boards.

cheers,
--renato
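For context, step 2 of the quoted proposal refers to a standard
non-parametric significance test. A minimal sketch of what it might look
like in Python, using SciPy's mannwhitneyu (the sample values and the
0.05 threshold are invented for illustration; this is not actual LNT
code):

    # Compare two sets of runtime samples for one benchmark with the
    # Mann-Whitney U test; a small p-value suggests a real change
    # rather than noise.
    from scipy.stats import mannwhitneyu

    baseline = [10.2, 10.4, 10.1, 10.3, 10.2]  # seconds, old compiler
    current  = [10.9, 11.1, 10.8, 11.0, 10.9]  # seconds, new compiler

    stat, p = mannwhitneyu(baseline, current, alternative='two-sided')
    if p < 0.05:
        print("significant change (p = %.4f)" % p)
    else:
        print("difference indistinguishable from noise (p = %.4f)" % p)

With only one sample per run (as on the ARM bots Renato describes),
there is nothing to feed such a test, which is what the 5-10 samples
requirement is about.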
Hi Renato,

On 30/04/2014 11:05, Renato Golin wrote:
> On 30 April 2014 07:50, Tobias Grosser <tobias at grosser.es> wrote:
>> In general, I see such changes as a second step. First, we want to have a
>> system in place that allows us to reliably detect if a benchmark is noisy or
>> not; second, we want to increase the number of benchmarks that are not noisy
>> and where we can use the results.
>
> I personally use the test-suite for correctness, not performance and
> would not like to have its run time increased by any means.

I agree, we should not complicate the current use as a correctness
test-suite, though I don't think any of the proposed changes do. I also
believe we should, as a first step, not touch the test-suite at all,
but just improve how LNT reports the results it gets.

> As discussed in the BoF last year, if we could separate test run from
> benchmark run before we do any change, I'd appreciate.

To my understanding, the first patches should just improve LNT to report
how reliable the results it reports are. So there is no way this can
affect the test-suite runs, which means I do not see why we would want
to delay such changes.

In fact, if we have a good idea which kernels are reliable and which
ones are not, we can probably use this information to mark benchmarks
that are known to be noisy.

> I want to have a separate benchmark bot on the subset that makes sense
> to work as benchmark, but I don't want the noise of the rest.

Right. There are two steps to get here:

1) Measure and show whether a benchmark result is reliable
2) Avoid running the known-to-be-noisy/unreliable benchmarks

>> 1. Get 5-10 samples per run
>> 2. Do the Wilcoxon/Mann-Whitney test
>
> 5-10 samples on an ARM board is not feasible. Currently it takes 1
> hour to run the whole set. Making it run for 5-10 hours will reduce
> its value to zero.

Reporting numbers that are not 100% reliable makes the results useless
as well. As ARM boards are cheap, you could just put 5 boxes in place
and we would get the samples we need. Even if this is not yet feasible,
I would rather run 5 samples of the benchmarks you really care about
than run everything once and get unreliable numbers.

>> I am a little sceptical on this. Machines should generally not be noisy.

Let me rephrase: "Machines on which you would like to run benchmarks
should have a consistent and low enough level of noise."

> ARM machines work at a much lower power level than Intel ones. The
> scheduler is a lot more aggressive and the quality of the peripherals
> is *a lot* worse.
>
> Even if you set up the board for benchmarks (fix the scheduler, put
> everything up to 11), the quality of the external hardware (USB, SD,
> eMMC, etc.) and its drivers does a lot of damage to any meaningful
> number you may extract, if the moon is full and Jupiter is in
> Sagittarius.
>
> So...
>
>> However, if for some reason there is noise on the machine, the noise is as
>> likely to appear during this pre-noise-test as during the actual benchmark
>> runs, maybe during both, but maybe also only during the benchmark. So I am
>> afraid we might often run into the situation where this test says OK but
>> the later test is still suffering noise.
>
> ...this is not entirely true, on ARM.

So do you think the benchmark.sh script proposed by Yi Kong is useful
for ARM?

Cheers,
Tobias
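Tobias's step 1 could be illustrated by scoring each benchmark's
historical spread. A hypothetical sketch (the history data, the 5% cut,
and all names are invented for illustration; LNT exposes no such API
as-is):

    # Score each benchmark's reliability from past samples (here, the
    # coefficient of variation) and mark the ones above a noise cut,
    # so step 2 can skip them.
    import statistics

    history = {
        "sqlite3": [10.1, 10.2, 10.1, 10.2, 10.1],
        "oggenc":  [ 8.0,  9.4,  7.1,  9.9,  8.3],
    }

    def noise_score(samples):
        # Relative spread of the samples (stdev / mean).
        return statistics.stdev(samples) / statistics.mean(samples)

    NOISE_CUT = 0.05  # flag anything varying more than 5% run-to-run

    for name, samples in history.items():
        score = noise_score(samples)
        status = "skip (noisy)" if score > NOISE_CUT else "benchmark"
        print("%-10s cv=%.3f -> %s" % (name, score, status))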
On 30 April 2014 10:21, Tobias Grosser <tobias at grosser.es> wrote:
> To my understanding, the first patches should just improve LNT to report
> how reliable the results it reports are. So there is no way this can
> affect the test-suite runs, which means I do not see why we would want
> to delay such changes.
>
> In fact, if we have a good idea which kernels are reliable and which
> ones are not, we can probably use this information to mark benchmarks
> that are known to be noisy.

Right, yes, that'd be a good first step. I just wanted to make sure we
don't just assume 10 runs is OK for everyone and consider it done.

> Reporting numbers that are not 100% reliable makes the results useless
> as well. As ARM boards are cheap, you could just put 5 boxes in place
> and we would get the samples we need. Even if this is not yet feasible,
> I would rather run 5 samples of the benchmarks you really care about
> than run everything once and get unreliable numbers.

That'd be another source of noise. You can't consider 5 boards' results
to be the same as 5 results from 1 board. They're cheap (as in quality),
and different boards (of the same brand and batch) have different
manufacturing defects that are only exposed when we crush them to death
with compiler tests and benchmarks. Nobody in the factory has ever
tested for that, since they only expect you to run light stuff like
media players, web servers, and routers.

> Let me rephrase: "Machines on which you would like to run benchmarks
> should have a consistent and low enough level of noise."

No 32-bit ARM machine I have tested so far fits that bill.

> So do you think the benchmark.sh script proposed by Yi Kong is useful
> for ARM?

I'm also sceptical about that. I don't think the noise during setup will
be any better or worse than the noise during the tests. The only way to
be sure is to run it every time, understand the curve, find a cut, and
warn on every noise level above the cut. Mind you, this cut will be
dynamic while the number of results grows, but once we have a few dozen
runs it should stabilise.

But that is not a replacement for running the tests multiple times or
for longer times. We need statistical significance.

cheers,
--renato
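Renato's "dynamic cut" could be sketched as a threshold derived from the
history of run-to-run deltas itself, tightening as results accumulate
and stabilising after a few dozen runs. All numbers and names below are
illustrative assumptions, not an actual LNT mechanism:

    # Derive the noise cut as a high percentile of past run-to-run
    # |deltas|; warn when a new delta exceeds it.
    def dynamic_cut(deltas, percentile=95):
        ordered = sorted(abs(d) for d in deltas)
        index = int(len(ordered) * percentile / 100)
        return ordered[min(index, len(ordered) - 1)]

    past_deltas = [0.01, -0.02, 0.015, 0.03, -0.01, 0.02, 0.01]
    cut = dynamic_cut(past_deltas)

    new_delta = 0.08  # latest run changed by 8%
    if abs(new_delta) > cut:
        print("warn: %.1f%% change exceeds noise cut of %.1f%%"
              % (new_delta * 100, cut * 100))

As Renato notes, such a cut only tells you when a single result looks
suspicious; it does not substitute for the repeated samples needed for
actual statistical significance.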