On 30/04/2014 16:20, Yi Kong wrote:
> Hi Tobias, Renato,
>
> Thanks for your attention to my RFC.
>
> On 30 April 2014 07:50, Tobias Grosser <tobias at grosser.es> wrote:
>>> - Show and graph total compile time
>>> There is no obvious way to scale up the compile time of individual
>>> benchmarks, so total time is the best thing we can do to minimize error.
>>> LNT: [PATCH 1/3] Add Total to run view and graph plot
>>
>> I did not see the effect of these changes in your images and also honestly
>> do not fully understand what you are doing. What is the total compile time?
>> Don't we already show the compile time in run view? How is the total time
>> different to this compile time?
>
> It is hard to spot minor improvements or regressions over a large number of
> tests amid independent machine noise. So I added a "total time" analysis to
> the run report and made it possible to graph its trend, hoping that the noise
> will cancel out and changes will be easier to spot. (Screenshot attached)

I understand the picture, but I still don't get how to compute "total time".
Is this a well-known term?

When looking at the plots of our existing -O3 testers, I also look for some
kind of less noisy line. The first thing coming to my mind would just be the
median of the set of run samples. Are you doing something similar? Or are you
computing a value across different runs?

> On 30 April 2014 07:50, Tobias Grosser <tobias at grosser.es> wrote:
>> I am a little sceptical on this. Machines should generally not be noisy.
>> However, if for some reason there is noise on the machine, the noise is as
>> likely to appear during this pre-noise-test as during the actual benchmark
>> runs, maybe during both, but maybe also only during the benchmark. So I am
>> afraid we might often run into the situation where this test says OK but
>> the later run is still suffering from noise.
>
> I agree that measuring before each run may not be very useful. The main
> purpose of it is for adaptive problem scaling.

I see. If it is OK with you, I would propose to first get your LNT
improvements in before we move on to adaptive problem scaling.

> On 30 April 2014 07:50, Tobias Grosser <tobias at grosser.es> wrote:
>> In general, I see such changes as a second step. First, we want to have a
>> system in place that allows us to reliably detect whether a benchmark is
>> noisy or not; second, we want to increase the number of benchmarks that are
>> not noisy and whose results we can use.
>
> Ok.

Obviously, as you already looked into this deeper, feel free to suggest
different priorities if necessary.

Tobias
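[Editor's note: a minimal sketch of the per-benchmark median Tobias mentions,
assuming each run records several samples per test; the data layout below is
made up for illustration and is not LNT's actual schema.]

    from statistics import median

    # Three runs of one benchmark, three samples per run; plotting the median
    # of each run's samples gives a much less noisy line than a single sample.
    runs = [[1.02, 1.01, 1.40], [1.03, 1.02, 1.01], [1.02, 1.35, 1.02]]
    trend = [median(samples) for samples in runs]
    print(trend)  # [1.02, 1.02, 1.02] -- the 1.3-1.4 s outlier samples vanish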
On 30/04/14 15:34, Tobias Grosser wrote:
> On 30/04/2014 16:20, Yi Kong wrote:
>> [...]
>> It is hard to spot minor improvements or regressions over a large number of
>> tests amid independent machine noise. So I added a "total time" analysis to
>> the run report and made it possible to graph its trend, hoping that the
>> noise will cancel out and changes will be easier to spot. (Screenshot
>> attached)
>
> I understand the picture, but I still don't get how to compute "total time".
> Is this a well-known term?
>
> When looking at the plots of our existing -O3 testers, I also look for some
> kind of less noisy line. The first thing coming to my mind would just be the
> median of the set of run samples. Are you doing something similar? Or are you
> computing a value across different runs?

That's the total time taken to compile/execute. Put another way, it is the sum
of the compile/execution times of all tests.

Cheers,

Yi Kong
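[Editor's note: for concreteness, a minimal sketch of that aggregate, assuming
per-test times are available as a plain mapping; the test names and data
layout are illustrative, not LNT's actual schema.]

    # "Total time": collapse one run into a single number by summing the
    # per-test compile (or execution) times, hoping that independent per-test
    # noise partially cancels in the sum.
    def total_time(times_by_test):
        return sum(times_by_test.values())

    run = {
        "SingleSource/Benchmarks/Misc/pi": 0.8,
        "MultiSource/Benchmarks/sim/sim": 12.4,
        "MultiSource/Applications/sqlite3": 45.1,
    }
    print(total_time(run))  # one value per run, plotted as a trend in the report

In this made-up example one long-running test contributes almost 80% of the
sum, which is exactly the weakness Chris Matthews points out below.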
I have so many comments about this thread! I will start here.

I think having a total compile time metric is a great idea. The summary report
code already does this. The one problem with this metric is that it does not
work well as the test suite evolves and we add and remove tests, so it should
be done on a fixed subset of the tests that is not going to change. I would
love to see that feature reported in the nightly reports.

In the past we have toyed with a total execution time metric (the sum of the
execution times of all benchmarks), and it has not worked well. There are some
benchmarks that run for so long that they alone can swing the metric, and all
the other little tests amount to nothing. How the SPEC benchmarks do their
calculations might be relevant here: they have a baseline run, and the metric
is the geometric mean of the ratios of current execution time to baseline
execution time. That fixes the problem of differently sized benchmarks.

On Apr 30, 2014, at 7:34 AM, Tobias Grosser <tobias at grosser.es> wrote:

> [...]
>
> I understand the picture, but I still don't get how to compute "total time".
> Is this a well-known term?
>
> When looking at the plots of our existing -O3 testers, I also look for some
> kind of less noisy line. The first thing coming to my mind would just be the
> median of the set of run samples. Are you doing something similar? Or are you
> computing a value across different runs?
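[Editor's note: a minimal sketch of the SPEC-style aggregate Chris describes,
assuming a fixed baseline run is available and taking the ratio as
current/baseline (so lower is better); the names and benchmark selection are
made up for the example and are not SPEC's or LNT's exact definitions.]

    import math

    # Geometric mean of per-benchmark ratios against a fixed baseline run.
    # Each benchmark contributes one ratio regardless of how long it runs, so
    # a single long-running test can no longer dominate the aggregate.
    def relative_score(current, baseline):
        ratios = [current[name] / baseline[name]
                  for name in baseline if name in current]
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    baseline = {"nbench": 10.0, "polybench/gemm": 2.0, "mibench/susan": 0.5}
    current  = {"nbench":  9.5, "polybench/gemm": 2.1, "mibench/susan": 0.5}
    print(relative_score(current, baseline))  # < 1.0: faster than the baseline overall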
On 30/04/2014 16:47, Yi Kong wrote:
> On 30/04/14 15:34, Tobias Grosser wrote:
>> [...]
>> I understand the picture, but I still don't get how to compute "total
>> time". Is this a well-known term?
>
> That's the total time taken to compile/execute. Put another way, it is the
> sum of the compile/execution times of all tests.

OK, I understand your intention now. I currently have little intuition about
whether this works or not; it seems you don't know yet either, do you?

My personal hope is that the reliability work allows us to get rid of almost
all noise, such that most runs would just report no performance changes at
all. If this is the case, the actual performance changes would stand out
nicely and we could highlight them better in LNT. If this does not work, some
aggregated performance numbers such as the ones you propose may be helpful.
The total time is a reasonable first metric, I suppose, but we may want to
check whether statistics gives us a better tool (Anton may be able to help).

Thanks again for your explanation,
Tobias
I agree with Chris. Following the approach of the SPEC benchmarks is a good
idea. We already have many benchmarks in the LLVM test suite, so why not
select some representative ones (e.g. NPB, MiBench, nBench, PolyBench, etc.)
and take a single run as the baseline? In that case we get a score for each
benchmark and can easily tell the relative performance.

As for the noise problem, I agree we should not change the test suite itself
for performance evaluation, since the test suite is primarily used for
correctness checking, but I believe we can change LNT to provide different
options for correctness checking and performance evaluation. Previously, I
(with Tobias) changed the LNT framework by simply:

1. running each benchmark 10 times;
2. adding some reliability tests (e.g. a t-test) to check reliability and
   dropping those benchmarks with low reliability;
3. dropping those benchmarks whose total runtime is too small (less than
   0.002 s).

A sketch of this filtering appears after this message. Of course, these
changes should only be applied for performance evaluation.

BTW, I like the idea of an "adaptive mode", but keep in mind it should be
enabled only for performance evaluation, not by default.

Best,
Star Tan

On Wed, Apr 30, 2014 at 10:56 AM, Chris Matthews <chris.matthews at apple.com> wrote:

> I think having a total compile time metric is a great idea. [...]
>
> How the SPEC benchmarks do their calculations might be relevant here: they
> have a baseline run, and the metric is the geometric mean of the ratios of
> current execution time to baseline execution time. That fixes the problem of
> differently sized benchmarks.
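[Editor's note: a minimal sketch of the filtering Star Tan describes. The
10-sample count and the 0.002 s cutoff come from the message above; the 0.05
significance level, the use of Welch's t-test specifically, and all names here
are assumptions, not the actual LNT patches.]

    from statistics import median
    from scipy.stats import ttest_ind

    MIN_RUNTIME = 0.002   # seconds; shorter tests are dropped as unreliable
    NUM_SAMPLES = 10      # samples collected per benchmark per run
    ALPHA = 0.05          # assumed significance level

    def significant_change(base_samples, curr_samples):
        """Report a change only for benchmarks that run long enough to measure
        and whose difference passes Welch's t-test."""
        if median(base_samples) < MIN_RUNTIME or median(curr_samples) < MIN_RUNTIME:
            return False  # drop sub-millisecond tests instead of reporting noise
        _, p_value = ttest_ind(base_samples, curr_samples, equal_var=False)
        return p_value < ALPHA

    base = [1.00, 1.01, 0.99, 1.02, 1.00, 1.01, 1.00, 0.99, 1.01, 1.00]
    curr = [1.05, 1.06, 1.04, 1.05, 1.07, 1.05, 1.06, 1.04, 1.05, 1.06]
    print(significant_change(base, curr))  # True: a real ~5% regression is reported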