David Blaikie
2014-Jan-17 02:09 UTC
[LLVMdev] Why is the default LNT aggregation function min instead of mean
On Thu, Jan 16, 2014 at 5:32 PM, Tobias Grosser <tobias at grosser.es> wrote:

> On 01/17/2014 02:17 AM, David Blaikie wrote:
>
>> Right - you usually won't see a normal distribution in the noise of test
>> results. You'll see results clustered around the lower bound with a long
>> tail of slower and slower results. Depending on how many samples you do,
>> it might be appropriate to take the mean of the best 3, for example - but
>> the general approach of taking the fastest N does have some basis in any
>> case.
>>
>> Not necessarily the right answer, the only right answer, etc.
>
> Interesting. In fact I had the very same thoughts at the beginning.
>
> However, when looking at my test results, the common pattern looks like
> this example:
>
> http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.0=34.95.3&submit=Update
>
> The run time of a test case is very consistently one of several fixed
> values. The distribution of the different times is very consistent and
> seems to form, in fact, something like a normal distribution (more in the
> center, less at the borders).
>
> The explanation I have here is that the machine is by itself in fact not
> very noisy. Instead, changes in the execution context (e.g. due to
> allocation of memory at a different location) influence the performance.
> If we, by luck, have a run where all 'choices' have been optimal, we get
> the minimal run time. However, in the case of several independent factors,
> it is more likely that we get a non-optimal configuration that yields a
> value in the middle. Consequently, the minimum seems to be a non-optimal
> choice here.
>
> I understand that there may be some 'real' noise values, but as the median
> does not seem to be affected very much by 'extremal' values, I have the
> feeling it should be reasonably robust to such noise.
>
> Have you seen examples where the median value gives a wrong impression
> regarding performance?
I have - and I've also seen the kind of results you're seeing too. One of
the issues here is the quantization of results due to very short tests and
not very granular timing. This is perhaps the only reason the results even
/have/ a median (with finer-grained timing and longer tests I expect you'd
see fewer results with exactly the same time - yes, you might be in a
situation where the exact run times repeat due to very short tests being
wholly scheduled in one way or another, but in that case you'll get wide,
solid swings depending on that scheduling behavior, which is also
unhelpful).

It's one of the reasons I gave up on trying to do timing on Linux - I
couldn't get a machine quiet enough to look real. Though in the long run I
still did tend to get results for many tests that were clustered around a
minimum with outliers going upwards...

I'm perhaps rambling a bit here, and I'm by no means an authority on this
subject (I tried and failed - gave up & worked on other things instead),
but I think so long as the data is that noisy and quantized, I'm not sure
how useful it'll be, and not sure if it's the best data to be trying to
figure out data processing on. Maybe I'm wrong; perhaps this is as good as
that data can get and we do need an answer for how to handle it.

- David

> On Thu, Jan 16, 2014 at 5:05 PM, Chris Matthews <chris.matthews at apple.com> wrote:
>
>> I think the idea with min is that it would be the ideal fastest run. The
>> other runs were 'slowed' by system noise or something else.
>>
>> On Jan 16, 2014, at 5:03 PM, Tobias Grosser <tobias at grosser.es> wrote:
>>
>>> Hi,
>>>
>>> I am currently investigating how to ensure that LNT only shows relevant
>>> performance regressions for the -O3 performance tests I am running.
>>>
>>> One question that came up here is why the default aggregation function
>>> for LNT is 'min' instead of 'mean'. This looks a little surprising from
>>> the statistical point of view, but also from looking at my test results,
>>> picking 'min' seems to be an inferior choice.
>>>
>>> For all test runs I have looked at, picking the mean largely reduces the
>>> run-over-run changes reported due to noise.
>>>
>>> See this run e.g.: if we use the median, we get just one change reported:
>>>
>>> http://llvm.org/perf/db_default/v4/nts/20661?num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=median&compare_to=20659&submit=Update
>>>
>>> If we use min, we get eight reports, one claiming over 100% performance
>>> reduction for a case that is really just pure noise. I am planning to
>>> look into using better statistical methods. However, as a start, could
>>> we switch the default to 'mean'?
>>>
>>> Cheers,
>>> Tobias
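The min-vs-median-vs-mean question above can be explored with a small self-contained simulation. The plateau distribution and the 0.004s tick below are invented for illustration (they are not measured from the linked graph); the point is only to show how each aggregate behaves on quantized samples where mid-range "context configurations" are more likely than the extremes:

```python
import random
import statistics

random.seed(0)

def sample_run(n_samples, tick=0.004):
    """Simulate one benchmark run: a few context-dependent 'plateaus'
    around a true run time, quantized to a coarse timer tick."""
    true_time = 0.40
    samples = []
    for _ in range(n_samples):
        # The execution context picks one of several fixed offsets,
        # with mid-range configurations more likely than the extremes.
        offset = tick * random.choice([0, 1, 1, 2, 2, 2, 3, 3, 4])
        samples.append(round(true_time + offset, 3))
    return samples

aggregators = {
    "min": min,
    "median": statistics.median,
    "mean": statistics.mean,
}

# Run-over-run spread of each aggregate over many simulated runs:
# a smaller spread means fewer spurious "changes" get reported.
for name, fn in aggregators.items():
    values = [fn(sample_run(5)) for _ in range(200)]
    spread = max(values) - min(values)
    print(f"{name:>6}: spread of aggregate across runs = {spread:.4f}s")
```

With only 5 samples per run, 'min' frequently misses the lowest plateau, so its value jumps by whole ticks between runs; the median tends to sit on the most probable plateau and moves less.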
Chandler Carruth
2014-Jan-17 02:21 UTC
[LLVMdev] Why is the default LNT aggregation function min instead of mean
On Thu, Jan 16, 2014 at 6:09 PM, David Blaikie <dblaikie at gmail.com> wrote:

> [snip]

To jump into this thread midway, I just wanted to point out that this kind
of step function in timings is almost *always* a sign of an extremely
coarse timer. If we can't do better than 0.4ms (guessing from the graph) of
resolution in the timer, we're not going to be able to reasonably measure
changes in the CPU execution times of these tests.

I would really like to see us move toward counting cycles of an
un-throttled processor, and then normalizing that into seconds. If we can't
get very accurate (tick-granularity) timings, I don't think we can draw
reasonable conclusions without *very* long test runs.

I've long wanted the LNT test suite to run (on Linux at least) under 'perf
stat' or some other external measurement tool that has fine-grained and
accurate timing information available, in addition to cycle counts and
other things that are even more resilient to context switches, CPU
migrations, etc.
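Chandler's perf-stat idea can be prototyped with a thin Python wrapper. This is a sketch, not how LNT does it: it uses perf's machine-readable CSV mode (`-x,`), which writes counter lines to stderr, and the exact column layout of that output varies between perf versions, so the parser below is an assumption:

```python
import subprocess

def parse_cycles(stderr_text):
    """Extract the cycle count from `perf stat -x,` CSV output.

    The first CSV field is the counter value; the event name appears in
    a later field (the exact column varies across perf versions, so we
    search a small window).
    """
    for line in stderr_text.splitlines():
        fields = line.split(",")
        if "cycles" in fields[2:4]:
            return int(fields[0])
    raise ValueError("no cycles event found in perf output")

def measure_cycles(cmd, repeats=5):
    """Run `cmd` under `perf stat` and return one cycle count per repeat.

    Requires Linux with the perf tool installed and permission to read
    performance counters.
    """
    counts = []
    for _ in range(repeats):
        result = subprocess.run(
            ["perf", "stat", "-x,", "-e", "cycles"] + cmd,
            capture_output=True, text=True,
        )
        counts.append(parse_cycles(result.stderr))
    return counts

# Example (hypothetical benchmark binary):
#   measure_cycles(["./my_benchmark"], repeats=10)
```

Cycle counts sidestep the coarse wall-clock tick, though they still move with frequency scaling unless the processor is un-throttled, as Chandler notes.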
Tobias Grosser
2014-Jan-17 07:28 UTC
[LLVMdev] Why is the default LNT aggregation function min instead of mean
On 01/17/2014 03:09 AM, David Blaikie wrote:

> [snip]
>
> It's one of the reasons I gave up on trying to do timing on Linux - I
> couldn't get a machine quiet enough to look real. Though in the long run I
> still did tend to get results for many tests that were clustered around a
> minimum with outliers going upwards...

Interesting. How many tests did you run in general?

> I'm perhaps rambling a bit here, and I'm by no means an authority on this
> subject (I tried and failed - gave up & worked on other things instead),
> but I think so long as the data is that noisy and quantized, I'm not sure
> how useful it'll be, and not sure if it's the best data to be trying to
> figure out data processing on. Maybe I'm wrong; perhaps this is as good as
> that data can get and we do need an answer for how to handle it.

As a first step, my current goal is to ensure that we do not report
performance changes for test cases that have slightly noisy run-time
behavior. Like Chandler, I have doubts that we can use such changes to
measure performance reliably.

I am also no authority in general here. In fact, I just look at my
performance testers and try to draw conclusions from this very limited
picture. For the latest run, using the 'median' again reduces the 'noise'
nicely: I get 13 noise results with 'min' and only a single one with the
median.

http://llvm.org/perf/db_default/v4/nts/20686

It seems that, at least for my tester, this really helps to get the noise
down. Is there a way I can switch the default displaying and reporting for
my tester? Or, as you have seen kind of mixed results in both
configurations, might we even be able to switch the default for now?

Cheers,
Tobias
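Tobias's goal of not reporting changes for noisy tests can be sketched as a simple filter: require a run-over-run shift in the median to clear both a relative threshold and the within-run sample spread. The spread guard is a crude stand-in for a proper statistical test, and none of this is LNT's actual logic:

```python
import statistics

def is_significant_change(old_samples, new_samples, min_rel_change=0.01):
    """Flag a run-over-run change only if the shift in the median
    exceeds both a relative threshold and the within-run spread."""
    old = statistics.median(old_samples)
    new = statistics.median(new_samples)
    delta = new - old
    if abs(delta) < min_rel_change * old:
        return False  # below the relative threshold: ignore
    # Require the shift to clear the noise floor of either run.
    spread = max(max(old_samples) - min(old_samples),
                 max(new_samples) - min(new_samples))
    return abs(delta) > spread

# A quantized, noisy test: the medians differ by at most one timer tick.
print(is_significant_change([0.400, 0.404, 0.404], [0.404, 0.408, 0.404]))
# A genuine regression, well above the noise floor.
print(is_significant_change([0.400, 0.404, 0.404], [0.452, 0.456, 0.452]))
```

Comparing medians this way suppresses the single-tick jitter that dominates min-based comparisons, at the cost of needing at least a handful of samples per run.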
Chris Matthews
2014-Jan-17 07:58 UTC
[LLVMdev] Why is the default LNT aggregation function min instead of mean
If you have a 0.004s granularity, and you want to identify small (1%)
changes, you'll probably need benchmarks running at least 0.8s.

On Jan 16, 2014, at 6:21 PM, Chandler Carruth <chandlerc at google.com> wrote:

> [snip]
>
> To jump into this thread midway, I just wanted to point out that this kind
> of step function in timings is almost *always* a sign of an extremely
> coarse timer. If we can't do better than 0.4ms (guessing from the graph)
> of resolution in the timer, we're not going to be able to reasonably
> measure changes in the CPU execution times of these tests.
>
> I would really like to see us move toward counting cycles of an
> un-throttled processor, and then normalizing that into seconds. If we
> can't get very accurate (tick-granularity) timings, I don't think we can
> draw reasonable conclusions without *very* long test runs.
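Chris's estimate can be written out explicitly: to resolve a relative change r with a timer tick of g seconds, the run time t must satisfy r * t >= k * g for some safety margin of k ticks. The two-tick margin below is an assumption chosen to reproduce the 0.8s figure:

```python
def min_benchmark_duration(tick_s, rel_change, safety_ticks=2):
    """Shortest run time at which a relative change of `rel_change`
    spans at least `safety_ticks` timer ticks of `tick_s` seconds each."""
    return safety_ticks * tick_s / rel_change

# 0.004s ticks, 1% target change, 2-tick margin -> 0.8s
print(min_benchmark_duration(0.004, 0.01))
```

With a single-tick margin the same arithmetic gives 0.4s, but a 1% change would then be exactly one quantization step and indistinguishable from rounding jitter.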
Chris Matthews
2014-Jan-17 07:58 UTC
[LLVMdev] Why is the default LNT aggregation function min instead of mean
Is it the case that you converge on the min faster than the mean?

Right now there is no way to set a per-tester aggregation function.

I had spent a little time trying to detect regressions using k-means
clustering. It looked promising. That was outside LNT though.

On Jan 16, 2014, at 11:28 PM, Tobias Grosser <tobias at grosser.es> wrote:

> [snip]
>
> It seems at least for my tester, this really helps to get the noise down.
> Is there a way I can switch the default displaying and reporting for my
> tester? Or, as you have kind of seen mixed results in both configurations,
> might we even be able to switch the default for now?
>
> Cheers,
> Tobias
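Chris's k-means experiment was outside LNT, so the following is only a guess at the shape of the approach: cluster the scalar timing samples into k groups, so that the discrete "plateaus" discussed above become cluster centers, and a shift in the centers (or in membership) between runs becomes the regression signal. A minimal 1-D Lloyd's-algorithm k-means, in pure Python with no external dependencies:

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster scalar timing samples into k groups (Lloyd's algorithm).

    Returns the sorted cluster centers. Initialization simply spreads
    the centers evenly across the observed range.
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each sample to the nearest center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute centers; leave empty clusters where they are.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two timing 'plateaus' one timer tick apart, plus a single slow outlier.
samples = [0.400, 0.400, 0.404, 0.404, 0.404, 0.440]
print(kmeans_1d(samples, k=2))
```

One appeal over min/median/mean is that the outlier lands in its own cluster instead of dragging the aggregate, though choosing k and deciding when centers have "moved" are exactly the hard parts a real implementation would have to settle.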