Dear all,

Following the Benchmarking BOF at the 2013 US dev meeting, I'd like to propose some improvements to the LNT performance tracking software.

The most significant issue with the current implementation is that the report is filled with extremely noisy values, which makes it hard to notice performance improvements or regressions.

After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each one.

- Increase the execution time of the benchmarks so they run long enough to avoid noisy results
  Currently there are two options for running benchmarks, namely small and large problem size. I propose adding a third option: adaptive. In adaptive mode, benchmarks scale their problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness cannot be verified for some benchmarks; the solution is to check correctness on a separate board using the small problem size. (A rough sketch of the calibration idea follows this message.)
  LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
  Test suite: [PATCH 1/2] Add support for adaptive problem size
              [PATCH 2/2] A subset of test suite programs modified for adaptive

- Show and graph total compile time
  There is no obvious way to scale up the compile time of individual benchmarks, so the total time is the best we can do to minimize error.
  LNT: [PATCH 1/3] Add Total to run view and graph plot

- Only show performance changes with high confidence in the summary report
  To investigate the correlation between a program's run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this, we can hide low-confidence results from the summary report; they remain available and are marked in colour in the detailed report for anyone interested. (A sketch of this calculation also follows below.)
  LNT: [PATCH 3/3] Ignore tests with very short run time

- Make sure the board has low background noise
  Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
  LNT: benchmark.sh

In my prototype implementation, the summary report becomes much more useful. There are almost no noisy readings, while small regressions are still detectable for long-running benchmark programs. The implementation is backwards compatible with older databases.

Screenshots from a sample run are attached.

Thanks for reading!
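As a rough illustration of the adaptive problem-size idea above: not the actual test-suite patch (the real changes live in the benchmarks themselves), just a Python sketch in which the names, the probe size, and the 10-second target handling are all hypothetical.

    # Hypothetical sketch of adaptive problem sizing: time a small probe
    # run, then scale the iteration count so the total run time lands near
    # a 10 second target. Not the actual test-suite patch.
    import time

    TARGET_SECONDS = 10.0

    def measure(workload, iterations):
        """Time `iterations` repetitions of `workload` and return seconds."""
        start = time.perf_counter()
        for _ in range(iterations):
            workload()
        return time.perf_counter() - start

    def adaptive_iterations(workload, probe_iterations=1000):
        """Estimate how many iterations reach roughly TARGET_SECONDS."""
        probe_time = measure(workload, probe_iterations)
        per_iteration = probe_time / probe_iterations
        return max(probe_iterations, int(TARGET_SECONDS / per_iteration))

    if __name__ == "__main__":
        # Toy stand-in for a benchmark kernel.
        workload = lambda: sum(i * i for i in range(1000))
        n = adaptive_iterations(workload)
        print("running %d iterations" % n)
        print("took %.2fs" % measure(workload, n))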
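And a minimal sketch of the "minimal difference for statistical significance" calculation under the normal-distribution model mentioned above; the z value and the example numbers are illustrative assumptions, not values taken from the patch.

    # Hypothetical sketch: if run times are modelled as normally distributed,
    # the smallest difference in means that is statistically significant at a
    # given confidence level can be estimated from the sample variances.
    import math

    Z_95 = 1.96  # two-sided 95% confidence (assumed threshold)

    def minimal_significant_difference(stddev_a, n_a, stddev_b, n_b, z=Z_95):
        """Smallest mean difference distinguishable from noise (normal model)."""
        return z * math.sqrt(stddev_a ** 2 / n_a + stddev_b ** 2 / n_b)

    def is_significant(mean_a, stddev_a, n_a, mean_b, stddev_b, n_b, z=Z_95):
        """True if the observed change exceeds the minimal significant difference."""
        return abs(mean_a - mean_b) > minimal_significant_difference(
            stddev_a, n_a, stddev_b, n_b, z)

    # Example: a 2% change on a noisy 0.05s test vs. on a stable 10s test.
    print(is_significant(0.050, 0.004, 3, 0.051, 0.004, 3))  # False: hidden
    print(is_significant(10.00, 0.02, 3, 10.20, 0.02, 3))    # True: reported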
Hi Yi Kong,

thanks for working on this. I think there is a lot we can improve here. I copied Mingxing Tan, who has worked on a couple of patches in this area before, and Chris, who is maintaining LNT.

On 30/04/2014 00:49, Yi Kong wrote:
> Dear all,
>
> Following the Benchmarking BOF at the 2013 US dev meeting, I'd like to propose some improvements to the LNT performance tracking software.
>
> The most significant issue with the current implementation is that the report is filled with extremely noisy values, which makes it hard to notice performance improvements or regressions.

Right.

> After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each one.
>
> - Increase the execution time of the benchmarks so they run long enough to avoid noisy results
>   Currently there are two options for running benchmarks, namely small and large problem size. I propose adding a third option: adaptive. In adaptive mode, benchmarks scale their problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness cannot be verified for some benchmarks; the solution is to check correctness on a separate board using the small problem size.
>   LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
>   Test suite: [PATCH 1/2] Add support for adaptive problem size
>               [PATCH 2/2] A subset of test suite programs modified for adaptive

I think it will be easier to review such patches one by one on the commit mailing lists, especially as this one is a little larger.

In general, I see such changes as a second step. First, we want to have a system in place that allows us to reliably detect if a benchmark is noisy or not; second, we want to increase the number of benchmarks that are not noisy and where we can use the results.

> - Show and graph total compile time
>   There is no obvious way to scale up the compile time of individual benchmarks, so the total time is the best we can do to minimize error.
>   LNT: [PATCH 1/3] Add Total to run view and graph plot

I did not see the effect of these changes in your images and also honestly do not fully understand what you are doing. What is the total compile time? Don't we already show the compile time in the run view? How is the total time different from this compile time? Maybe you can answer this in a separate patch email.

> - Only show performance changes with high confidence in the summary report
>   To investigate the correlation between a program's run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this, we can hide low-confidence results from the summary report; they remain available and are marked in colour in the detailed report for anyone interested.
>   LNT: [PATCH 3/3] Ignore tests with very short run time

I think this is the most important point, which we should address first. In fact, I would prefer to go even further and actually compute the confidence and make the confidence we require an option. This allows us to understand both how stable/noisy a machine is and how well the other changes you propose work in practice.

We had a longer discussion here on llvmdev named 'Questions about results reliability in LNT infrustructure'.
Anton suggested doing the following:

 1. Get 5-10 samples per run
 2. Do the Wilcoxon/Mann-Whitney test

I already set up -O3 buildbots that provide 10 runs per commit, and the noise for them is very low:

http://llvm.org/perf/db_default/v4/nts/25151?num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=median&compare_to=25149&submit=Update

If you are interested in performance data to test your changes, you can extract the results from the LLVM buildmaster at:

http://lab.llvm.org:8011/builders/polly-perf-O3/builds/2942/steps/lnt.nightly-test/logs/report.json

with 2942 being one of the latest successful builds. By going backwards or forwards you should find other builds, provided they have been successful.

There should be a standard function for the Wilcoxon/Mann-Whitney test in scipy, so in case you are interested, adding these reliability numbers as a first step seems to be a simple and purely beneficial commit.

> - Make sure the board has low background noise
>   Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
>   LNT: benchmark.sh

I am a little sceptical about this. Machines should generally not be noisy. However, if for some reason there is noise on the machine, the noise is as likely to appear during this pre-noise test as during the actual benchmark runs, maybe during both, but maybe also only during the benchmark. So I am afraid we might often run into the situation where this test says OK but the later run is still suffering from noise. I would probably prefer to make the previous point of reporting reliability work well, and then we can see for each test/benchmark whether there was noise involved or not.

All the best,
Tobias
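For reference, a minimal sketch of the suggested check using scipy's mannwhitneyu (available in scipy.stats; the 'alternative' keyword needs a reasonably recent scipy). The sample data and threshold below are made up for illustration, not values from LNT.

    # Hypothetical sketch of the suggested reliability check: given the
    # per-run samples for a test before and after a change, use the
    # Mann-Whitney U test from scipy to decide whether the difference is
    # statistically significant. Sample data below is made up.
    from scipy.stats import mannwhitneyu

    baseline  = [1.02, 1.01, 1.03, 1.02, 1.04, 1.01, 1.02, 1.03, 1.02, 1.01]
    candidate = [1.08, 1.07, 1.09, 1.08, 1.07, 1.08, 1.09, 1.08, 1.07, 1.08]

    stat, p_value = mannwhitneyu(baseline, candidate, alternative='two-sided')

    ALPHA = 0.05  # required confidence: 95% (illustrative choice)
    if p_value < ALPHA:
        print("significant change (p=%.4f)" % p_value)
    else:
        print("difference not distinguishable from noise (p=%.4f)" % p_value)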
On 30 April 2014 07:50, Tobias Grosser <tobias at grosser.es> wrote:
> In general, I see such changes as a second step. First, we want to have a
> system in place that allows us to reliably detect if a benchmark is noisy or
> not; second, we want to increase the number of benchmarks that are not noisy
> and where we can use the results.

I personally use the test-suite for correctness, not performance, and would not like to have its run time increased by any means. As discussed in the BoF last year, I'd appreciate it if we could separate the test run from the benchmark run before we make any changes. I want to have a separate benchmark bot for the subset that makes sense to use as benchmarks, but I don't want the noise of the rest.

> 1. Get 5-10 samples per run
> 2. Do the Wilcoxon/Mann-Whitney test

5-10 samples on an ARM board is not feasible. Currently it takes 1 hour to run the whole set; making it run for 5-10 hours would reduce its value to zero.

> I am a little sceptical about this. Machines should generally not be noisy.

ARM machines work at a much lower power level than Intel ones. The scheduler is a lot more aggressive and the quality of the peripherals is *a lot* worse. Even if you set up the board for benchmarks (fix the scheduler, put everything up to 11), the quality of the external hardware (USB, SD, eMMC, etc.) and their drivers do a lot of damage to any meaningful number you may extract, if the moon is full and Jupiter is in Sagittarius. So...

> However, if for some reason there is noise on the machine, the noise is as
> likely to appear during this pre-noise test as during the actual benchmark
> runs, maybe during both, but maybe also only during the benchmark. So I am
> afraid we might often run into the situation where this test says OK but the
> later run is still suffering from noise.

...this is not entirely true, on ARM. We may be getting server quality hardware for AArch64 any time now, but it's very unlikely that we'll *ever* get quality 32-bit test boards.

cheers,
--renato
On Apr 29, 2014, at 3:49 PM, Yi Kong <Yi.Kong at arm.com> wrote:
> [...]
> - Only show performance changes with high confidence in the summary report
>   To investigate the correlation between a program's run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this, we can hide low-confidence results from the summary report; they remain available and are marked in colour in the detailed report for anyone interested.
>   LNT: [PATCH 3/3] Ignore tests with very short run time

I think this is harder than it sounds. I just looked through some results from today, and I found a benchmark that showed a real regression of 0.01s in a benchmark running ~0.05s. That would have been filtered out by your patch. Do you have some intuition that tests with short run times are where the noise is coming from? I feel like it is not a problem that is unique to short runs, but rather one specific to particular benchmarks.

> - Make sure the board has low background noise
>   Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
>   LNT: benchmark.sh

I wrote a very similar Python script for checking system baselines. I think this is a great idea. My script ran several non-compiler-related tasks which *should* be stable on any machine; by "should" I mean they are long running and intentionally only test one aspect of the system. I did not gate results on these runs, but instead submitted the results to LNT and then allowed it to report on any anomalies it detected. So far this process has detected some problems on our testing machines. If there is interest, I can share that script.
> In my prototype implementation, the summary report becomes much more useful. There are almost no noisy readings, while small regressions are still detectable for long-running benchmark programs. The implementation is backwards compatible with older databases.
>
> Screenshots from a sample run are attached.
>
> Thanks for reading!
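Not the baseline script mentioned above, but a rough Python sketch of what such a pre-run check could look like; the reference time, tolerance, and workload here are all placeholder assumptions.

    # Hypothetical sketch of a pre-run baseline check, in the spirit of
    # benchmark.sh and the baseline script described above: time a fixed
    # CPU-bound workload, compare against a reference captured at machine
    # set-up, and refuse to start the benchmark run if the machine looks noisy.
    import sys
    import time

    REFERENCE_SECONDS = 2.50   # measured once during machine set-up (assumed)
    MAX_DEVIATION = 0.05       # abort/defer if more than 5% off the reference

    def baseline_workload():
        """Fixed CPU-bound task that should take a stable amount of time."""
        total = 0
        for i in range(5000000):
            total += i * i
        return total

    def measure_baseline(repeats=3):
        """Return the best (lowest-noise) of several timings."""
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            baseline_workload()
            times.append(time.perf_counter() - start)
        return min(times)

    if __name__ == "__main__":
        measured = measure_baseline()
        deviation = abs(measured - REFERENCE_SECONDS) / REFERENCE_SECONDS
        if deviation > MAX_DEVIATION:
            print("baseline off by %.1f%%, deferring run" % (deviation * 100))
            sys.exit(1)
        print("baseline OK (%.2fs), starting benchmark run" % measured)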
Hi Chris,

I think it would definitely be useful to share your script, so that we don't need to reinvent the wheel.

Thanks!

Kristof

> > - Make sure the board has low background noise
> >   Perform a system performance benchmark before each run and compare the value with a reference obtained during machine set-up. If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT; I will rewrite it in Python.
> >   LNT: benchmark.sh
>
> I wrote a very similar Python script for checking system baselines. I think this is a great idea. My script ran several non-compiler-related tasks which *should* be stable on any machine; by "should" I mean they are long running and intentionally only test one aspect of the system. I did not gate results on these runs, but instead submitted the results to LNT and then allowed it to report on any anomalies it detected. So far this process has detected some problems on our testing machines. If there is interest, I can share that script.