Yi Kong
2014-May-20 20:00 UTC
[LLVMdev] Use perf tool for more accurate time measuring on Linux
On 20 May 2014 17:55, Tobias Grosser <tobias at grosser.es> wrote:
> On 20/05/2014 18:20, Yi Kong wrote:
>> On 20 May 2014 16:40, Tobias Grosser <tobias at grosser.es> wrote:
>>> On 20/05/2014 16:01, Yi Kong wrote:
>>>> I've set up a public LNT server to show the results of perf stat. There
>>>> is a huge improvement compared with the timeit tool.
>>>> http://parkas16.inria.fr:8000/
>>>
>>> Hi Yi Kong,
>>>
>>> thanks for testing these changes.
>>>
>>>> The patch is updated to pin the process to a single core, and the readings
>>>> are even more accurate. It's hard-coded to run everything on core 0, so
>>>> don't run parallel testing with it for now. The tool now depends on
>>>> Linux perf and schedtool.
>>>
>>> I think this sounds like a very good direction.
>>>
>>> How did you evaluate the improvements exactly? The following run shows
>>> e.g. two execution time changes:
>>
>> I sent a screenshot of the original results in the previous mail. We used
>> to have lots of noisy readings, both from small machine background noise
>> and from large noise in the timing tool. Now the noise from the timing
>> tool is eliminated and only a little machine background noise is left.
>> This makes manual investigation possible.
>
> I think we need to get this down to zero even at the cost of missing
> regressions. We have many commits and runs per day; having one or two noisy
> results per run means people will still not look at performance changes.
>
>>> http://parkas16.inria.fr:8000/db_default/v4/nts/9
>>>
>>> Are they expected? If I change e.g. the aggregation function to median
>>> they disappear. Similarly, the graph for one of them does not suggest an
>>> actual performance change:
>>
>> Yes, some false positives due to machine noise are expected. The median is
>> more tolerant to machine noise, therefore they disappear.
>
> Right.
>
> What I find interesting is that this change filters several results that
> seem to not be filtered out by our statistical test. Is this right?

Yes. The MWU test is nonparametric; it examines the order rather than the
actual values of the samples. However, eliminating with the median uses the
actual values (if the medians of two samples are close enough, we treat
them as equal).

> In the optimal case, we should be able to set the confidence level we
> require high enough to filter out these results as well. Is this right?

Yes. The lowest confidence we can set is still quite high (90%). We can
certainly add a lower confidence option, but I can't find any MWU table
lower than that on the Internet.

Also, we should modify the value analysis (based on how close the
medians/minimums are) to vary according to the confidence level as well.
However, this analysis is parametric; we need to know how the data is
actually distributed for every test. I don't think there is a
non-parametric test which does the same thing.

> Is there currently anything that blocks us from increasing the confidence
> level, further reducing the noise level at the cost of some missed
> regressions?
>
>> As suggested by Chandler, we should also lock the CPU frequency to
>> further reduce machine noise.
>
> I set it to 2667.000 MHz on parkas16. You can try if this improves
> something.

Sure. I'm shutting down the server to run the tests.

> Cheers,
> Tobias
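[For reference, a minimal sketch of the measurement setup discussed above:
pin the run to core 0 and collect timings under perf stat. This is not the
actual LNT patch; the benchmark name, sample count, and the parsing of the
perf summary file are invented for illustration, sched_setaffinity stands in
for schedtool, and perf must be installed.]

import os
import statistics
import subprocess

def timed_runs(cmd, samples=5):
    # Pin this process (and its children) to core 0; schedtool or
    # taskset could be used instead.
    os.sched_setaffinity(0, {0})
    times = []
    for _ in range(samples):
        # perf stat -o writes its counter summary to a text file; pull
        # the task-clock value (msec) out of that summary.
        subprocess.run(["perf", "stat", "-o", "perf.txt", "--"] + cmd,
                       check=True)
        with open("perf.txt") as f:
            for line in f:
                if "task-clock" in line:
                    times.append(float(line.split()[0].replace(",", "")))
    return times

if __name__ == "__main__":
    samples = timed_runs(["./nightly-test-binary"], samples=5)
    print("median task-clock (msec):", statistics.median(samples))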
Tobias Grosser
2014-May-20 21:21 UTC
[LLVMdev] Use perf tool for more accurate time measuring on Linux
On 20/05/2014 22:00, Yi Kong wrote:
> On 20 May 2014 17:55, Tobias Grosser <tobias at grosser.es> wrote:
>> What I find interesting is that this change filters several results that
>> seem to not be filtered out by our statistical test. Is this right?
>
> Yes. The MWU test is nonparametric; it examines the order rather than the
> actual values of the samples. However, eliminating with the median uses
> the actual values (if the medians of two samples are close enough, we
> treat them as equal).

I see. So some of the useful eliminations come from the fact that we
actually run a parametric test? So we _do_ in this case make some
assumptions about the distribution of the values, right?

>> In the optimal case, we should be able to set the confidence level we
>> require high enough to filter out these results as well. Is this right?
>
> Yes. The lowest confidence we can set is still quite high (90%). We can
> certainly add a lower confidence option, but I can't find any MWU table
> lower than that on the Internet.

Why the lowest confidence? I would be interested in maximal confidence to
reduce noise.

I found this table:

http://www.stat.purdue.edu/~bogdanm/wwwSTAT503_fall/Tables/Wilcoxon.pdf

I am not sure if those are the right values. Inside it says
Wilcoxon-Mann-Whitney U, but the filename suggests that the tables may be
for the Wilcoxon signed-rank test.

> Also, we should modify the value analysis (based on how close the
> medians/minimums are) to vary according to the confidence level as well.
> However, this analysis is parametric; we need to know how the data is
> actually distributed for every test. I don't think there is a
> non-parametric test which does the same thing.

What kind of problem could we get in case we assume a normal distribution
and the values are in fact not normally distributed?

Would we just fail to find a significant change? Or would we possibly let
non-significant changes through?

Under the assumption that there is a non-zero percentage of test cases
where the performance results are normally distributed, it may be OK for a
special low-noise configuration to only get results from these test cases,
but possibly ignore performance changes from the non-normally-distributed
cases.

Cheers,
Tobias
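[To illustrate the question above, a small synthetic experiment, assuming
scipy is installed; the noise model, slowdown size, and sample counts are
invented for the example. Timing noise that only ever adds to the run time
is right-skewed, and a test that assumes normality can react to that
differently than the rank-based MWU test.]

import random
from scipy import stats

random.seed(0)

# Synthetic timings: a real 2% slowdown, with right-skewed
# (exponential, never negative) noise added to every sample.
old = [1.00 + random.expovariate(50) for _ in range(10)]
new = [1.02 + random.expovariate(50) for _ in range(10)]

t_p = stats.ttest_ind(old, new).pvalue                              # assumes normality
u_p = stats.mannwhitneyu(old, new, alternative="two-sided").pvalue  # rank-based

print("t-test p-value:", round(t_p, 3))
print("MWU    p-value:", round(u_p, 3))
# A single large outlier inflates the t-test's variance estimate and can
# hide the shift (a false negative), while the MWU test looks only at the
# ordering of the samples and is unaffected by the outlier's magnitude.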
Yi Kong
2014-May-20 21:55 UTC
[LLVMdev] Use perf tool for more accurate time measuring on Linux
On 20 May 2014 22:21, Tobias Grosser <tobias at grosser.es> wrote:
> I see. So some of the useful eliminations come from the fact that we
> actually run a parametric test? So we _do_ in this case make some
> assumptions about the distribution of the values, right?

Yes. You can check get_value_status() in lnt/server/reporting/analysis.py
to see how we determine significance. I don't think making such an
assumption is a good idea, as some tests have very different distributions
to others.

> Why the lowest confidence? I would be interested in maximal confidence to
> reduce noise.

Ah... I got it the wrong way around. I agree with you.

> I found this table:
>
> http://www.stat.purdue.edu/~bogdanm/wwwSTAT503_fall/Tables/Wilcoxon.pdf
>
> I am not sure if those are the right values. Inside it says
> Wilcoxon-Mann-Whitney U, but the filename suggests that the tables may be
> for the Wilcoxon signed-rank test.

That's indeed for the Wilcoxon signed-rank test.

> What kind of problem could we get in case we assume a normal distribution
> and the values are in fact not normally distributed?

If the distribution is in fact skewed, we will get lots of false negatives.

> Would we just fail to find a significant change? Or would we possibly let
> non-significant changes through?
>
> Under the assumption that there is a non-zero percentage of test cases
> where the performance results are normally distributed, it may be OK for a
> special low-noise configuration to only get results from these test cases,
> but possibly ignore performance changes from the non-normally-distributed
> cases.

It's hard to test whether execution time is normally distributed. The
samples are definitely not normally distributed, because each measurement
is a guaranteed upper bound of the true execution time.

> Cheers,
> Tobias
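[For readers following along, a hedged sketch of the two filters being
discussed: the value analysis on medians plus the MWU significance test.
It is not the real get_value_status() from lnt/server/reporting/analysis.py;
the threshold and confidence defaults are placeholders, and scipy stands in
for LNT's own MWU table lookup.]

import statistics
from scipy import stats

def is_significant_change(old_samples, new_samples,
                          value_threshold=0.01, confidence=0.95):
    # Value analysis (uses the actual values): if the medians are close
    # enough, treat the runs as equal regardless of sample ordering.
    old_med = statistics.median(old_samples)
    new_med = statistics.median(new_samples)
    if abs(new_med - old_med) <= value_threshold * old_med:
        return False
    # MWU (nonparametric): only the ordering of the samples matters.
    _, p = stats.mannwhitneyu(old_samples, new_samples,
                              alternative="two-sided")
    return p <= (1.0 - confidence)

# Example: five baseline samples vs. five samples that are ~5% slower.
print(is_significant_change([1.00, 1.01, 1.00, 1.02, 1.01],
                            [1.05, 1.06, 1.05, 1.07, 1.06]))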
Bruce Hoult
2014-May-21 00:44 UTC
[LLVMdev] Use perf tool for more accurate time measuring on Linux
On Wed, May 21, 2014 at 9:21 AM, Tobias Grosser <tobias at grosser.es> wrote:
>> Also, we should modify the value analysis (based on how close the
>> medians/minimums are) to vary according to the confidence level as well.
>> However, this analysis is parametric; we need to know how the data is
>> actually distributed for every test. I don't think there is a
>> non-parametric test which does the same thing.
>
> What kind of problem could we get in case we assume a normal distribution
> and the values are in fact not normally distributed?

I haven't looked at this particular data, but I've done a lot of work in
general trying to detect small changes in performance.

My feeling is that there is usually a "true" execution time PLUS the sum of
some random amount of things that happened during the run. Nothing random
ever makes the code run faster than it should! (Which by itself makes the
normal distribution completely inappropriate, as it always has a finite
chance of negative values.)

Each individual random thing that might happen in a run probably actually
has a binomial or hypergeometric distribution, but p is so small and n so
large (and p*n constant) that you might as well call it a Poisson
distribution.

Note that while the sum of a large number of arbitrary independent random
variables is approximately normal (Central Limit Theorem), the sum of
independent Poisson variables is Poisson! And you only need one number to
characterise a Poisson distribution: the expected value (which also happens
to be the same as the variance).
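[A toy simulation of this model; every number below is invented purely for
illustration. Each run costs a fixed "true" time plus a per-event cost for
every noise event, where each of several independent noise sources
contributes a Poisson-distributed number of events. The combined event
count is again Poisson, so its mean and variance come out roughly equal,
and the minimum over many runs approaches the true time.]

import math
import random
import statistics

random.seed(1)

TRUE_TIME = 100.0                  # ms: the cost of the code itself
NOISE_RATES = [0.5, 1.2, 0.3]      # expected noise events per run, per source
EVENT_COST = 0.1                   # ms added by each noise event

def poisson(lam):
    # Knuth's method, to avoid pulling in numpy for a toy example.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def one_measurement():
    events = sum(poisson(lam) for lam in NOISE_RATES)
    return TRUE_TIME + EVENT_COST * events

samples = [one_measurement() for _ in range(10000)]
events = [(s - TRUE_TIME) / EVENT_COST for s in samples]

print("mean events:    ", statistics.mean(events))        # ~2.0 (= 0.5 + 1.2 + 0.3)
print("event variance: ", statistics.pvariance(events))   # ~2.0 as well
print("minimum sample: ", min(samples))                    # approaches TRUE_TIME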