Star Tan
2013-Jun-24 06:12 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Hi all,

When we compare two test runs, each of which is run with three samples, how does LNT show whether the comparison is reliable or not?

I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status: for example, if abs(self.delta) <= (self.stddev * confidence_interval), the status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without considering the huge stddev. I think one way is to normalize the performance improvement by the stddev, but I am not sure whether this has been implemented in LNT.

Could anyone give some suggestions on how I can find out whether the test results are reliable in LNT? Specifically, how can I get a normalized performance improvement/regression that takes the stderr into account?

Best wishes,
Star Tan.
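As a minimal sketch of the check described above and of a stddev-normalized delta (a hypothetical helper, not LNT's actual code; it assumes lower sample values mean better performance):

import statistics

UNCHANGED, IMPROVED, REGRESSED = "UNCHANGED", "IMPROVED", "REGRESSED"

def classify(prev_samples, cur_samples, confidence_interval=2.576):
    """Hypothetical sketch: classify a result and normalize the delta by stddev."""
    prev_mean = statistics.mean(prev_samples)
    cur_mean = statistics.mean(cur_samples)
    delta = cur_mean - prev_mean
    stddev = statistics.stdev(prev_samples) if len(prev_samples) > 1 else 0.0

    # The check quoted above: treat the result as noise when the delta lies
    # within the stddev scaled by the confidence-interval factor.
    if stddev and abs(delta) <= stddev * confidence_interval:
        return UNCHANGED, 0.0

    # Normalized delta: improvement/regression expressed in multiples of the
    # stddev, so a 60% delta paired with a 50% stddev no longer looks dramatic.
    normalized = delta / stddev if stddev else float("inf")
    return (IMPROVED if delta < 0 else REGRESSED), normalized

# Three noisy samples per run: the raw delta is 30%, but it lies well within
# the noise, so the result is reported as UNCHANGED.
print(classify([10.0, 15.0, 5.0], [16.0, 9.0, 14.0]))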
Tobias Grosser
2013-Jun-27 16:05 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/23/2013 11:12 PM, Star Tan wrote:
> Hi all,
>
> When we compare two test runs, each of which is run with three samples, how does LNT show whether the comparison is reliable or not?
>
> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status: for example, if abs(self.delta) <= (self.stddev * confidence_interval), the status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without considering the huge stddev. I think one way is to normalize the performance improvement by the stddev, but I am not sure whether this has been implemented in LNT.
>
> Could anyone give some suggestions on how I can find out whether the test results are reliable in LNT? Specifically, how can I get a normalized performance improvement/regression that takes the stderr into account?

Hi Daniel, Michael, Paul,

do you happen to have some insights on this? Basically, the stddev shown when a run is compared to a previous run does not seem to be useful for measuring the reliability of the results shown. We are looking for a good way/value to show the reliability of individual results in the UI. From your experience, what would be a good measure of the reliability of test results?

Thanks,
Tobias
Daniel Dunbar
2013-Jun-27 16:25 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
We don't really have a great answer yet. For now, the best we do is try to keep our testing machines as quiet as possible and then mostly look at the daily trend, not individual reports.

 - Daniel

On Jun 27, 2013, at 9:05, Tobias Grosser <tobias at grosser.es> wrote:

> On 06/23/2013 11:12 PM, Star Tan wrote:
>> Hi all,
>>
>> When we compare two test runs, each of which is run with three samples, how does LNT show whether the comparison is reliable or not?
>>
>> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status: for example, if abs(self.delta) <= (self.stddev * confidence_interval), the status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without considering the huge stddev. I think one way is to normalize the performance improvement by the stddev, but I am not sure whether this has been implemented in LNT.
>>
>> Could anyone give some suggestions on how I can find out whether the test results are reliable in LNT? Specifically, how can I get a normalized performance improvement/regression that takes the stderr into account?
>
> Hi Daniel, Michael, Paul,
>
> do you happen to have some insights on this? Basically, the stddev shown when a run is compared to a previous run does not seem to be useful for measuring the reliability of the results shown. We are looking for a good way/value to show the reliability of individual results in the UI. From your experience, what would be a good measure of the reliability of test results?
>
> Thanks,
> Tobias
Renato Golin
2013-Jun-27 16:27 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote:
> We are looking for a good way/value to show the reliability of individual results in the UI. From your experience, what would be a good measure of the reliability of test results?

Hi Tobi,

I had a look at this a while ago, but never got around to actually working on it. My idea was to never use point changes as an indication of progress/regression unless there was a significant change (2/3 sigma).

What we should do instead is compare the current moving average (over K runs) with both the last average and the (N-K)th moving average (to make sure previous values included in the current moving average are not toning it down/up), and keep the biggest difference as the final result.

We should also compare the current moving average with the M non-overlapping moving averages before it, and work out whether we are monotonically increasing, decreasing, or whether there is a difference of 2/3 sigma between the current moving average (N) and the (N-M)th moving average. That would give us an idea of the trend of each test.

cheers,
--renato
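A rough sketch of the moving-average comparison described above; the window size K, the number of past windows M, and the 2-sigma threshold are assumptions made for illustration, not values used by LNT:

import statistics
from typing import List, Optional

def moving_average(samples: List[float], k: int, end: int) -> Optional[float]:
    """Average of the k samples ending at index `end` (exclusive), if available."""
    if end - k < 0:
        return None
    return statistics.mean(samples[end - k:end])

def trend_change(samples: List[float], k: int = 5, m: int = 3, n_sigma: float = 2.0):
    """Compare the current moving average with earlier, non-overlapping ones.

    Returns the largest difference between the current k-run moving average
    and the m previous non-overlapping windows, flagged as significant only
    when it exceeds n_sigma times the stddev of the current window.
    """
    n = len(samples)
    current = moving_average(samples, k, n)
    if current is None:
        return None
    sigma = statistics.stdev(samples[n - k:]) if k > 1 else 0.0

    biggest = 0.0
    for i in range(1, m + 1):  # walk back over non-overlapping windows
        past = moving_average(samples, k, n - i * k)
        if past is None:
            break
        if abs(current - past) > abs(biggest):
            biggest = current - past

    significant = sigma > 0 and abs(biggest) > n_sigma * sigma
    return {"current": current, "delta": biggest, "significant": significant}

# Example: a noisy series with a late regression; the jump stands out
# against the stddev of the current window.
history = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.1,
           11.5, 11.6, 11.4, 11.5, 11.6]
print(trend_change(history))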
Tobias Grosser
2013-Jul-01 15:53 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/23/2013 11:12 PM, Star Tan wrote:
> Hi all,
>
> When we compare two test runs, each of which is run with three samples, how does LNT show whether the comparison is reliable or not?
>
> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status: for example, if abs(self.delta) <= (self.stddev * confidence_interval), the status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without considering the huge stddev. I think one way is to normalize the performance improvement by the stddev, but I am not sure whether this has been implemented in LNT.
>
> Could anyone give some suggestions on how I can find out whether the test results are reliable in LNT? Specifically, how can I get a normalized performance improvement/regression that takes the stderr into account?

Hi Star Tan,

I just attached some hacks I tried over the weekend. The attached patch prints the confidence intervals in LNT. If you like, you can take them as an inspiration (not copy them directly) to print those values in your lnt server. (The patches require scipy and numpy to be installed in your Python sandbox. This should be OK for our experiments, but we probably do not want to reimplement those functions before upstreaming.)

Also, as Anton suggested, it may make sense to rerun your experiments with a larger number of samples. As the machine is currently not loaded and we do not track individual commits, 10 samples should probably be good enough.

Cheers,
Tobias

(Attachment: 0001-My-confidence-measurement-hacks.patch, text/x-diff, 8799 bytes)
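For readers without the patch at hand, here is a small sketch of how per-result confidence intervals can be computed with scipy/numpy using a Student's t interval; this is an assumed approach for illustration, not the contents of the attached patch:

import numpy as np
from scipy import stats

def confidence_interval(samples, confidence=0.95):
    """Two-sided t-based confidence interval for the mean of a small sample.

    With only 3-10 samples per run, the Student's t distribution is the usual
    choice; a plain normal approximation would understate the interval width.
    """
    data = np.asarray(samples, dtype=float)
    mean = data.mean()
    sem = stats.sem(data)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2.0, df=len(data) - 1)
    return mean - half_width, mean + half_width

# Example: three samples of a test's execution time.
lo, hi = confidence_interval([1.02, 0.98, 1.05])
print("mean execution time is in [%.3f, %.3f] with 95%% confidence" % (lo, hi))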
Star Tan
2013-Jul-02 02:09 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
At 2013-07-01 23:53:00, "Tobias Grosser" <tobias at grosser.es> wrote:
> On 06/23/2013 11:12 PM, Star Tan wrote:
>> Hi all,
>>
>> When we compare two test runs, each of which is run with three samples, how does LNT show whether the comparison is reliable or not?
>>
>> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status: for example, if abs(self.delta) <= (self.stddev * confidence_interval), the status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without considering the huge stddev. I think one way is to normalize the performance improvement by the stddev, but I am not sure whether this has been implemented in LNT.
>>
>> Could anyone give some suggestions on how I can find out whether the test results are reliable in LNT? Specifically, how can I get a normalized performance improvement/regression that takes the stderr into account?
>
> Hi Star Tan,
>
> I just attached some hacks I tried over the weekend. The attached patch prints the confidence intervals in LNT. If you like, you can take them as an inspiration (not copy them directly) to print those values in your lnt server. (The patches require scipy and numpy to be installed in your Python sandbox. This should be OK for our experiments, but we probably do not want to reimplement those functions before upstreaming.)

Wonderful. I will integrate them into our lnt server.

> Also, as Anton suggested, it may make sense to rerun your experiments with a larger number of samples. As the machine is currently not loaded and we do not track individual commits, 10 samples should probably be good enough.

OK, I can rerun all tests with 10 samples tonight :-).

> Cheers,
> Tobias

Bests,
Star Tan.