Star Tan
2013-Jun-24 06:12 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
Hi all,

When we compare two test runs, each with three samples, how does LNT show whether the comparison is reliable?

I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status. For example, if abs(self.delta) <= (self.stddev * confidence_interval), the data status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without taking the huge stddev into account. I think one way is to normalize the performance improvements by the stddev, but I am not sure whether that has been implemented in LNT.

Could anyone suggest how I can find out whether the test results in LNT are reliable? Specifically, how can I get a performance improvement/regression normalized by the standard error?

Best wishes,
Star Tan.
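A minimal sketch, not LNT code, of the normalization asked about above: instead of comparing the delta against a single stddev, divide it by the standard error of the difference of the two means, so that a nominally large delta measured from noisy samples yields a small score. The function, sample data, and numbers below are illustrative assumptions only, using just the Python standard library.

    import math
    import statistics

    def normalized_delta(old_samples, new_samples):
        """Return (relative_delta, delta_in_stderr_units) for two sample sets."""
        old_mean = statistics.mean(old_samples)
        new_mean = statistics.mean(new_samples)
        delta = new_mean - old_mean

        # Standard error of the difference of the two means.
        stderr = math.sqrt(statistics.variance(old_samples) / len(old_samples) +
                           statistics.variance(new_samples) / len(new_samples))

        relative_delta = delta / old_mean
        score = delta / stderr if stderr > 0 else float("inf")
        return relative_delta, score

    # Example: a nominal 50% slowdown that is less than one standard error
    # away from zero, i.e. not a reliable change.
    print(normalized_delta([10.0, 22.0, 4.0], [6.0, 20.0, 28.0]))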
Tobias Grosser
2013-Jun-27 16:05 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/23/2013 11:12 PM, Star Tan wrote:
> Hi all,
>
> When we compare two test runs, each with three samples, how does LNT show whether the comparison is reliable?
>
> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status. For example, if abs(self.delta) <= (self.stddev * confidence_interval), the data status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without taking the huge stddev into account. I think one way is to normalize the performance improvements by the stddev, but I am not sure whether that has been implemented in LNT.
>
> Could anyone suggest how I can find out whether the test results in LNT are reliable? Specifically, how can I get a performance improvement/regression normalized by the standard error?

Hi Daniel, Michael, Paul,

do you happen to have some insights on this? Basically, the stddev shown when a run is compared to a previous run does not seem to be useful for measuring the reliability of the results shown. We are looking for a good way/value to show the reliability of individual results in the UI. Do you have any experience with what a good measure of the reliability of test results is?

Thanks,
Tobias
Daniel Dunbar
2013-Jun-27 16:25 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
We don't really have a great answer yet. For now the best we can do is try to keep our testing machines as quiet as possible and then mostly look at the daily trend rather than individual reports.

 - Daniel

On Jun 27, 2013, at 9:05, Tobias Grosser <tobias at grosser.es> wrote:

> On 06/23/2013 11:12 PM, Star Tan wrote:
>> Hi all,
>>
>> When we compare two test runs, each with three samples, how does LNT show whether the comparison is reliable?
>>
>> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status. For example, if abs(self.delta) <= (self.stddev * confidence_interval), the data status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without taking the huge stddev into account. I think one way is to normalize the performance improvements by the stddev, but I am not sure whether that has been implemented in LNT.
>>
>> Could anyone suggest how I can find out whether the test results in LNT are reliable? Specifically, how can I get a performance improvement/regression normalized by the standard error?
>
> Hi Daniel, Michael, Paul,
>
> do you happen to have some insights on this? Basically, the stddev shown when a run is compared to a previous run does not seem to be useful for measuring the reliability of the results shown. We are looking for a good way/value to show the reliability of individual results in the UI. Do you have any experience with what a good measure of the reliability of test results is?
>
> Thanks,
> Tobias
Renato Golin
2013-Jun-27 16:27 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote:
> We are looking for a good way/value to show the reliability of individual results in the UI. Do you have any experience with what a good measure of the reliability of test results is?

Hi Tobi,

I had a look at this a while ago, but never got around to actually working on it.

My idea was to never use point changes as an indication of improvements/regressions unless there is a significant change (2/3 sigma). What we should do is compare the current moving average (over the last K runs) both with the last average and with the (N-K)th moving average (to make sure previous values included in the current moving average are not toning it down/up), and keep the biggest difference as the final result.

We should also compare the current moving average with the M non-overlapping moving averages before it, and work out whether we are monotonically increasing or decreasing, or whether there is a 2/3 sigma difference between the current moving average (N) and the (N-M)th one. That would give us an idea of the trend of each test.

cheers,
--renato
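A rough sketch of the moving-average comparison described above, assuming the input is simply a list of per-run means, oldest first; the window size k, the 2-sigma threshold, and all names are illustrative choices, not LNT code.

    import statistics

    def moving_average_change(history, k=5, sigma_threshold=2.0):
        """Compare the moving average of the latest k runs against the
        previous, non-overlapping window of k runs."""
        if len(history) < 2 * k:
            return None  # not enough runs to form two windows

        current = history[-k:]
        previous = history[-2 * k:-k]

        delta = statistics.mean(current) - statistics.mean(previous)
        noise = statistics.stdev(previous)
        significant = noise > 0 and abs(delta) > sigma_threshold * noise
        return delta, significant

    # Example: a jump between two otherwise quiet windows is reported as
    # significant, even though no single point stands out on its own.
    runs = [1.00, 1.01, 0.99, 1.02, 1.00, 1.20, 1.22, 1.19, 1.21, 1.23]
    print(moving_average_change(runs, k=5))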
Tobias Grosser
2013-Jul-01 15:53 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 06/23/2013 11:12 PM, Star Tan wrote:
> Hi all,
>
> When we compare two test runs, each with three samples, how does LNT show whether the comparison is reliable?
>
> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status. For example, if abs(self.delta) <= (self.stddev * confidence_interval), the data status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without taking the huge stddev into account. I think one way is to normalize the performance improvements by the stddev, but I am not sure whether that has been implemented in LNT.
>
> Could anyone suggest how I can find out whether the test results in LNT are reliable? Specifically, how can I get a performance improvement/regression normalized by the standard error?

Hi Star Tan,

I just attached some hacks I tried over the weekend. The attached patch prints confidence intervals in LNT. If you like, you can take them as inspiration (not copy them directly) to print those values on your LNT server. (The patches require scipy and numpy to be installed in your Python sandbox. This should be OK for our experiments, but we probably do not want to reimplement those functions before upstreaming.)

Also, as Anton suggested, it may make sense to rerun your experiments with a larger number of samples. As the machine is currently not loaded and we do not track individual commits, 10 samples should probably be good enough.

Cheers,
Tobias

Attachment: 0001-My-confidence-measurement-hacks.patch (text/x-diff, 8799 bytes) <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130701/fab53a78/attachment.patch>
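The patch itself is only available through the link above. As a rough illustration of what computing such a confidence interval with scipy/numpy might look like (an assumption about the general approach, not the contents of the patch), a t-based interval around the mean of a small number of samples can be obtained as follows.

    import numpy as np
    from scipy import stats

    def confidence_interval(samples, confidence=0.95):
        """Return (mean, lower, upper) for a t-based confidence interval
        around the mean of a small sample set."""
        samples = np.asarray(samples, dtype=float)
        mean = samples.mean()
        sem = stats.sem(samples)  # standard error of the mean
        lower, upper = stats.t.interval(confidence, len(samples) - 1,
                                        loc=mean, scale=sem)
        return mean, lower, upper

    # Example: three execution-time samples from one test.
    print(confidence_interval([1.02, 1.10, 0.98]))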
Star Tan
2013-Jul-02 02:09 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
At 2013-07-01 23:53:00, "Tobias Grosser" <tobias at grosser.es> wrote:

> On 06/23/2013 11:12 PM, Star Tan wrote:
>> Hi all,
>>
>> When we compare two test runs, each with three samples, how does LNT show whether the comparison is reliable?
>>
>> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status. For example, if abs(self.delta) <= (self.stddev * confidence_interval), the data status is set to UNCHANGED. However, this is obviously not enough. For example, suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is slightly larger than self.stddev; LNT will then report a huge performance improvement to readers without taking the huge stddev into account. I think one way is to normalize the performance improvements by the stddev, but I am not sure whether that has been implemented in LNT.
>>
>> Could anyone suggest how I can find out whether the test results in LNT are reliable? Specifically, how can I get a performance improvement/regression normalized by the standard error?
>
> Hi Star Tan,
>
> I just attached some hacks I tried over the weekend. The attached patch prints confidence intervals in LNT. If you like, you can take them as inspiration (not copy them directly) to print those values on your LNT server. (The patches require scipy and numpy to be installed in your Python sandbox. This should be OK for our experiments, but we probably do not want to reimplement those functions before upstreaming.)

Wonderful. I will integrate them into our LNT server.

> Also, as Anton suggested, it may make sense to rerun your experiments with a larger number of samples. As the machine is currently not loaded and we do not track individual commits, 10 samples should probably be good enough.

OK, I can rerun all tests with 10 samples tonight :-).

> Cheers,
> Tobias

Best wishes,
Star Tan.