Renato Golin
2013-Jun-30 18:30 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 30 June 2013 10:14, Anton Korobeynikov <anton at korobeynikov.info> wrote:
> 1. Increasing sample size to at least 5-10

That's not feasible on slower systems. A single data point takes 1 hour on the fastest ARM board I can get (a Chromebook). Getting 10 samples at different commits will give you similar accuracy if behaviour doesn't change, and you can rely on 10-point blocks before and after each change having the same result.

What won't happen is one commit making it truly faster and the very next one slow again (or slow/fast). So all we need to measure, for each commit, is whether it was the one that made all subsequent runs slower/faster, and we can get that from several commits after the culprit, since the probability that another (unrelated) commit will change the behaviour is small.

This is why I proposed something like moving averages. Not because it's the best statistical model, but because it works around a concrete problem we have. I don't care which model/tool you use, as long as it doesn't mean I'll have to wait 10 hours for a result, or sift through hundreds of commits every time I see a regression in performance. What that will do, for sure, is make me ignore small regressions, since they won't be worth the massive work to find the real culprit.

If I had a team of 10 people just to look at regressions all day long, I'd ask them to make a proper statistical model and go do more interesting things...

cheers,
--renato
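A minimal sketch of the before/after block comparison described above (window size, threshold, and runtimes are made up for illustration; this is not LNT's actual analysis code):

```python
# Sketch: compare the mean runtime of a block of runs before each commit with
# the block after it, and flag commits where the later block is clearly slower.
# Window size, threshold, and sample data are illustrative only.

def detect_step(samples, window=5, threshold=0.05):
    """Return indices whose following block is > `threshold` slower than the
    block immediately before it."""
    suspects = []
    for i in range(window, len(samples) - window + 1):
        before = sum(samples[i - window:i]) / window
        after = sum(samples[i:i + window]) / window
        if after > before * (1 + threshold):
            suspects.append(i)  # candidate index of the offending commit
    return suspects

# Five runs around 1.00s, then five around 1.10s: the step is flagged at index 5.
runtimes = [1.00, 1.02, 0.99, 1.01, 1.00, 1.10, 1.11, 1.09, 1.12, 1.10]
print(detect_step(runtimes))  # -> [5]
```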
Anton Korobeynikov
2013-Jun-30 19:08 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
> Getting 10 samples at different commits will give you similar accuracy if
> behaviour doesn't change, and you can rely on 10-point blocks before and
> after each change to have the same result.

Right. But this way you will have a 10-commit delay, so you will need 3-4 additional test runs to pinpoint the offending commit in the worst case.

> This is why I proposed something like moving averages.

A moving average will "smooth" the result, so only really big changes will be caught by it.

--
With best regards, Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University
Chris Matthews
2013-Jul-01 01:02 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
This is probably another area where a bit of dynamic behavior could help. When we find a regression, kick off some runs to bisect back to where it manifests. This is what we would be doing manually anyway. We could just search back with the set of regressing benchmarks, meaning the whole suite does not have to be run (unless it is a global regression).

There are situations where we see commits which make things slower then faster again, but so far those seem to be from experimental features being switched on then off.

The problem with moving averages is that they really don't behave well when the benchmark is naturally bimodal. One thing that LNT is doing to help "smooth" the results for you is presenting the min of the data at a particular revision, which (hopefully) is approximating the actual runtime without noise. That works well with a lot of samples per revision, but not across revisions, where we really need the smoothing. One way to explore this is to turn

Ignoring small regressions is an interesting problem. Do it too many times and slowness creeps in. But you are correct, no one wants to fix a small regression. There is a bit of a value computation that we are all doing when we watch the results, which is not explicit in the software or documentation right now. Mine is along the lines of: a small regression in important benchmarks with certain flags matters, and bigger regressions in less important benchmarks and flags matter too, etc. We also lack any way to coordinate or annotate regressions, but that is a whole separate problem.

Another idea I have been toying with is building a "change of interest" model, where we can explicitly tag particular revisions as impacting performance, then test them preferentially. That could allow the effort to be focused on revisions where it might best have an effect. I don't know if that would play out well in reality though.

Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 30, 2013, at 11:30 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 30 June 2013 10:14, Anton Korobeynikov <anton at korobeynikov.info> wrote:
> > 1. Increasing sample size to at least 5-10
>
> That's not feasible on slower systems. A single data point takes 1 hour on the fastest ARM board I can get (Chromebook). Getting 10 samples at different commits will give you similar accuracy if behaviour doesn't change, and you can rely on 10-point blocks before and after each change to have the same result.
>
> What won't happen is one commit makes it truly faster and the very next slow again (or slow/fast), so all we need to measure is for each commit, if that was the one that made all next runs slower/faster, and that we can get with several commits after the culprit, since the probability that another (unrelated) commit will change the behaviour is small.
>
> This is why I proposed something like moving averages. Not because it's the best statistical model, but because it works around a concrete problem we have. I don't care which model/tool you use, as long as it doesn't mean I'll have to wait 10 hours for a result, or sift through hundreds of commits every time I see a regression in performance. What that will do, for sure, is make me ignore small regressions, since they won't be worth the massive work to find the real culprit.
>
> If I had a team of 10 people just to look at regressions all day long, I'd ask them to make a proper statistical model and go do more interesting things...
>
> cheers,
> --renato
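A rough sketch of the dynamic bisection idea from the start of this message (`build_and_run`, `revisions`, and `baseline` are hypothetical stand-ins, not LNT's actual API):

```python
# Sketch: given a known-good and known-bad revision, binary-search the commits
# in between, re-running only the benchmarks that regressed.

def build_and_run(revision, benchmarks):
    """Build `revision` and return {benchmark: runtime} for the given subset."""
    raise NotImplementedError("hook this up to the builder/runner")

def bisect_regression(revisions, regressing, baseline, threshold=0.05):
    """`revisions[0]` is known good, `revisions[-1]` known bad; return the
    first revision at which any regressing benchmark is slower than baseline."""
    lo, hi = 0, len(revisions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        results = build_and_run(revisions[mid], regressing)
        if any(results[b] > baseline[b] * (1 + threshold) for b in regressing):
            hi = mid   # regression already present at mid
        else:
            lo = mid   # still fast at mid; culprit is later
    return revisions[hi]
```

For a 10-commit window this is roughly ceil(log2(10)) = 4 extra runs of just the regressing subset, which lines up with Anton's 3-4 run estimate above.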
James Courtier-Dutton
2013-Jul-01 05:51 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On Jun 30, 2013 8:12 PM, "Anton Korobeynikov" <anton at korobeynikov.info> wrote:
> > Getting 10 samples at different commits will give you similar accuracy if
> > behaviour doesn't change, and you can rely on 10-point blocks before and
> > after each change to have the same result.
>
> Right. But this way you will have 10-commits delay. So, you will need
> 3-4 additional test runs to pinpoint the offending commit in the worst
> case.
>
> > This is why I proposed something like moving averages.
>
> Moving average will "smooth" the result. So, only really big changes
> will be caught by it.

Like any result in statistics, the result should be quoted together with a +/- figure derived from the statistical method used. Generally, a low sample size means a high +/-.

Another option is to take a deterministic approach to measurement. The code should execute the same CPU instructions every time it is run, so some method of measuring just those instructions should be attempted. Maybe processing qemu logs when llvm is run inside qemu might give a possible solution?

James
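As an illustration of the +/- idea, a small sketch of reporting a mean with a rough interval; the 1.96 factor is a normal approximation and the sample numbers are made up, so this is not what LNT computes:

```python
import statistics

def mean_with_interval(samples, z=1.96):
    """Return (mean, half-width of a ~95% interval). With only a handful of
    samples a t-quantile would be more honest than the normal approximation."""
    m = statistics.mean(samples)
    if len(samples) < 2:
        return m, float("inf")        # one sample says nothing about spread
    s = statistics.stdev(samples)     # sample standard deviation
    return m, z * s / len(samples) ** 0.5

# Few samples -> wide interval; more samples -> tighter interval.
print(mean_with_interval([1.02, 0.98, 1.07]))
print(mean_with_interval([1.02, 0.98, 1.07, 1.01, 0.99, 1.03, 1.00, 1.04, 0.97, 1.01]))
```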
Renato Golin
2013-Jul-01 07:47 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 30 June 2013 20:08, Anton Korobeynikov <anton at korobeynikov.info> wrote:
> > Getting 10 samples at different commits will give you similar accuracy if
> > behaviour doesn't change, and you can rely on 10-point blocks before and
> > after each change to have the same result.
>
> Right. But this way you will have 10-commits delay. So, you will need
> 3-4 additional test runs to pinpoint the offending commit in the worst
> case.

Well, 10 was an example, but yes, you'll always have a delay of N commit-groups. My assumption is that a few (say 5) commit-groups of delay is not a critical issue if it happens once in a while, as opposed to having to examine every hike in a range of several dozen commits.

> This is why I proposed something like moving averages.
>
> Moving average will "smooth" the result. So, only really big changes
> will be caught by it.

Absolutely. Smoothing is bad, but it's better than what we have, and at least it would catch big regressions. Today, not even the big ones are being caught.

You don't have to throw away the original data points; you just run a moving average to pinpoint big changes, where the confidence that a regression occurred is high. In parallel, you can still use the same data points to do more refined analysis, and even cross-reference the data from multiple analyses to gain even more confidence.

Anton and David, I could not agree with you more on what's necessary for a good analysis; I just wish we had something cruder but sooner while we develop the perfect statistical model. I believe Chris is doing that now. So, whatever is wrong with his analysis, let's just wait and see how it turns out, and how we can improve it further. For now, anything will be an improvement.

cheers,
--renato
Renato Golin
2013-Jul-01 16:41 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On 1 July 2013 02:02, Chris Matthews <chris.matthews at apple.com> wrote:
> One thing that LNT is doing to help "smooth" the results for you is by
> presenting the min of the data at a particular revision, which (hopefully)
> is approximating the actual runtime without noise.

That's an interesting idea but, as you said, only if you run multiple times on every revision. On ARM, every run takes *at least* 1h; other architectures might be a lot worse. On those architectures it'd be very important to be able to extract point information from group data, and min doesn't fit that model. You could take the min of a group of runs, but then that's no different from moving averages. Though "moving mins" might make more sense than "moving averages", for the reasons you exposed.

Also, on tests that take as long as the noise to run (0.010s or less on an A15), the minimum is not relevant, since the timer resolution will flatten everything under 0.010 onto 0.010, making the test always report 0.010, even when there are regressions. I really cannot see how you can statistically enhance data in a scenario where the measuring rod is larger than the signal. We need to change the wannabe-benchmarks to behave like proper benchmarks, move everything else into "Applications" for correctness, and specifically NOT time them. Less is more.

> That works well with a lot of samples per revision, but not across
> revisions, where we really need the smoothing. One way to explore this is
> to turn

I was really looking forward to hearing the end of that sentence... ;)

> We also lack any way to coordinate or annotate regressions, that is a whole
> separate problem though.

Yup. I'm having visions of tag clouds, bugzilla integration, cross-architectural regression detection, etc. But I'll ignore that for now; let's solve one big problem at a time. ;)

cheers,
--renato
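For what it's worth, a tiny sketch of what "moving min" versus "moving average" would do on a bimodal, noisy series (the window size and runtimes are made up; this is not LNT code):

```python
# Sketch: trailing "moving min" versus "moving average" on a bimodal series
# with occasional slow outliers and a real ~10% regression halfway through.

def trailing(samples, window, fn):
    """Apply `fn` (min, mean, ...) over a trailing window of the series."""
    return [fn(samples[i - window:i]) for i in range(window, len(samples) + 1)]

runs = [1.00, 1.31, 1.01, 0.99, 1.30, 1.00,   # fast mode ~1.00s, outliers ~1.30s
        1.10, 1.41, 1.09, 1.11, 1.40, 1.10]   # after the regression: ~1.10s

print(trailing(runs, 4, min))                        # min tracks the 1.00 -> 1.10 step cleanly
print(trailing(runs, 4, lambda w: sum(w) / len(w)))  # average is smeared by the outliers
```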
Jakob Stoklund Olesen
2013-Jul-01 18:13 UTC
[LLVMdev] [LNT] Question about results reliability in LNT infrastructure
On Jun 30, 2013, at 6:02 PM, Chris Matthews <chris.matthews at apple.com> wrote:
> This is probably another area where a bit of dynamic behavior could help. When we find a regression, kick off some runs to bisect back to where it manifests. This is what we would be doing manually anyway. We could just search back with the set of regressing benchmarks, meaning the whole suite does not have to be run (unless it is a global regression).
>
> There are situations where we see commits which make things slower then faster again, but so far those seem to be from experimental features being switched on then off.

This is an interesting paper: http://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf

"However, caches and branch predictors make performance dependent on machine-specific parameters and the exact layout of code, stack frames, and heap objects. A single binary constitutes just one sample from the space of program layouts, regardless of the number of runs. Since compiler optimizations and code changes also alter layout, it is currently impossible to distinguish the impact of an optimization from that of its layout effects."

"We find that the performance impact of -O3 over -O2 optimizations is indistinguishable from random noise."

Thanks,
/jakob