Ghassan Shobaki
2013-Sep-19 16:25 UTC
[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
Hi Renato,

Please see my answers below.

Thanks,
-Ghassan

________________________________
From: Renato Golin <renato.golin at linaro.org>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com>
Cc: Andrew Trick <atrick at apple.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
Sent: Thursday, September 19, 2013 5:30 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3

On 17 September 2013 19:04, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> We have done an experimental evaluation of the different schedulers in LLVM 3.3 (source, BURR, ILP, fast, MI). The evaluation was done on x86-64 using SPEC CPU2006. We measured both the amount of spill code and the execution time, as detailed below.

Hi Ghassan,

This is an amazing piece of work, thanks for doing this. We need more benchmarks like yours, and more often, too.

> 3. The source scheduler is the second best scheduler in terms of spill code and execution time, and its performance is very close to that of BURR in both metrics. This result is surprising to me, because, as far as I understand, this scheduler is a conservative scheduler that tries to preserve the original program order. Does this result surprise you?

Well, SPEC is an old benchmark, from a time when code was written to accommodate the hardware's requirements, so preserving the code order might not be as big a deal on SPEC as it is on other types of code. So far, I haven't found SPEC to be very good for judging overall compiler performance, only specific micro-optimized features. Besides, hardware and software are designed nowadays based on some version of Dhrystone, EEMBC, SPEC or CoreMark, so it's not impossible to see a 50% increase in performance with little change in either.

Ghassan: You have made me curious to try other benchmarks in our future work. Most academic publications on CPU performance, though, use SPEC. You can even find some recent publications that still use SPEC CPU2000!
When I was at AMD in 2009, performance optimization and benchmarking was all about SPEC CPU2006. Have things changed so much in the past four years? And the more important question is: what specific features do these non-SPEC benchmarks have that are likely to affect the scheduler's register-pressure-reduction behavior?

> 4. The ILP scheduler has the worst execution times on FP2006 and the second worst spill counts, although it is the default on x86-64. Is this surprising?

BTW, DragonEgg sets the scheduler to source. On line 368 in Backend.cpp, we find:

  if (!flag_schedule_insns)
    Args.push_back("--pre-RA-sched=source");

This looks like someone ran a similar test and did the sensible thing. How that reflects on Clang, or how important it is to be the default, I don't know. This is the same discussion as the one about optimization levels and which passes should be included in which. It also depends on which scheduler will evolve faster or further over time, and on what kind of code you're compiling...

> This is not a perfectly accurate metric, but, given the large sample size (> 10K functions), the total number of spills across such a statistically significant sample is believed to give a very strong indication about each scheduler's performance at reducing register pressure.

I agree this is a good enough metric, but I'd be cautious about stating that there is a "very strong indication about each scheduler's performance". SPEC is, after all, a special case in the compiler/hardware world, and anything you see here might not happen anywhere else. Real-world, modern code (such as the LAMP stack, browsers, office suites, etc.) is written expecting the compiler to do magic, while old-school benchmarks weren't, and they have been optimized for decades by both compiler and hardware engineers.

Ghassan: Can you please give more specific features of these modern benchmarks that affect spill-code reduction? Note that our study included over ten thousand functions with spills.
Such a large sample is expected to cover many different kinds of behavior, and that's why I am calling it a "statistically significant" sample.

> The %Diff Max (Min) is the maximum (minimum) percentage difference on a single benchmark between each scheduler and the source scheduler. These numbers show that the differences on individual FP benchmarks can be quite significant.

I'm surprised that you didn't run "source" 5/9 times, too. Did you get the exact performance numbers multiple times? It would be good to have a more realistic geo-mean for source as well, so we could estimate how much the other geo-means vary in comparison to source's.

Ghassan: Sorry if I did not include a clear enough description of the numbers' meanings. Let me explain more precisely. First of all, the "source" scheduler was indeed run for 9 iterations (which took about 2 days), and that was our baseline. All the numbers in the execution-time table are percentage differences relative to "source". Of course, there were random variations in the numbers, but we followed the standard SPEC practice of taking the median. For most benchmarks, the random variation was not significant. There was one particular benchmark, though (libquantum), on which we thought the random variation was too large to make a meaningful comparison, and we therefore decided to exclude it.

The "% Diff Max" and "% Diff Min" numbers reported in our table are NOT random variations on an individual benchmark. Rather, the "% Diff Max" for a given heuristic is the percentage difference (in median scores) between that heuristic and the source heuristic on the benchmark where that heuristic gave its biggest *gain* relative to source. Similarly, the "% Diff Min" for a given heuristic is the percentage difference (in median scores) between that heuristic and the source heuristic on the benchmark where that heuristic gave its biggest *degradation* relative to source. So, they are for two different benchmarks.
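[Editor's note: to make the definitions above concrete, here is a minimal sketch of the procedure as described (medians over repeated runs, then per-benchmark percentage differences against the "source" baseline). The benchmark names and scores are invented for illustration; they are not the study's data.]

```python
from statistics import median

# Median SPEC-style scores over 9 runs per benchmark (invented numbers).
# Higher score = better.
baseline = {"bench_a": [10.1, 10.0, 10.2] * 3,   # the "source" scheduler
            "bench_b": [20.0, 19.8, 20.1] * 3}
candidate = {"bench_a": [10.6, 10.5, 10.7] * 3,  # some other heuristic
             "bench_b": [19.0, 19.2, 18.9] * 3}

# Percentage difference (in median scores) relative to "source", per benchmark.
diffs = {b: 100.0 * (median(candidate[b]) - median(baseline[b]))
            / median(baseline[b])
         for b in baseline}

# "% Diff Max" is taken on the benchmark with the biggest gain over source;
# "% Diff Min" on the benchmark with the biggest degradation. They generally
# come from two different benchmarks.
diff_max = max(diffs.values())
diff_min = min(diffs.values())
```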
The point in giving these numbers is to show that, although the geometric-mean differences may look small, the differences on individual benchmarks were quite significant. I can provide more detailed numbers for all benchmarks if people are interested. I can post those on our web site or on any benchmarking page that LLVM may have.

> Most of the above performance differences have been correlated with significant changes in spill counts in hot functions.

Which is a beautiful correlation between spill rate and performance, showing that your metrics are at least reasonably accurate, for all purposes.

> We should probably report this as a performance bug if ILP stays the default scheduler on x86-64.

You should, regardless of what's the default choice.

cheers,
--renato
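[Editor's note: the geo-means discussed here are SPEC-style geometric means over per-benchmark ratios. A minimal sketch, with invented ratios, of why sizable per-benchmark differences can almost cancel out in the aggregate:]

```python
import math

# Per-benchmark performance ratios of some heuristic relative to the
# "source" baseline (invented numbers; 1.0 means identical performance).
ratios = [1.02, 0.97, 1.01, 1.00]

# SPEC-style aggregate: the geometric mean of the per-benchmark ratios.
# Gains of ~2% on some benchmarks and losses of ~3% on others largely
# cancel, leaving a geo-mean very close to 1.0.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```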
Renato Golin
2013-Sep-19 17:27 UTC
[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
On 19 September 2013 17:25, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> Ghassan: You have made me curious to try other benchmarks in our future work. Most academic publications on CPU performance, though, use SPEC. You can even find some recent publications that still use SPEC CPU2000! When I was at AMD in 2009, performance optimization and benchmarking was all about SPEC CPU2006. Have things changed so much in the past four years?

Unfortunately, no. Most manufacturers still use SPEC (and others) to design, test and certify their products. This is not a problem per se, as SPEC is very good and reasonably generic, but no single benchmark can cover the wide range of applications a CPU is likely to undergo over its life. So, my grudge is that there isn't much effort put into understanding how to benchmark the different uses of a CPU, not necessarily against SPEC. I think SPEC is a good match for your project.

> And the more important question is: what specific features do these non-SPEC benchmarks have that are likely to affect the scheduler's register-pressure-reduction behavior?

No idea. ;) Mind you, I don't know of any decent benchmark that will give you the "general user" case, but there are a number of specific benchmarks (browsers, codecs, databases, and web servers all have benchmark features enabled). Also, for your project, you're only interested in a very specific behaviour of a very specific part of the compiler (spills), so any benchmark will give you a way to test it, but every one will have some form of bias.

What I recommend is not to spend much time running a plethora of benchmarks, only to find out that they all tell you the same story, but to try a benchmark that is completely different from SPEC (say, Browsermark or the MySQL benchmark suite) and see if the spill correlation is similar. If it is, ignore it. If not, just mention that this correlation may not be seen with other benchmarks.
;)

> Ghassan: Can you please give more specific features of these modern benchmarks that affect spill-code reduction? Note that our study included over ten thousand functions with spills. Such a large sample is expected to cover many different kinds of behavior, and that's why I am calling it a "statistically significant" sample.

I was being a bit pedantic in pointing out that 10K data points are only statistically relevant if they're independent, which they might not be if each individual test was created/crafted with the same intent in mind (similar function size, number of functions, number of temporaries, etc.). Most programmers don't pay that much attention to good code and end up writing horrible code that stresses specific parts of the compiler. If you have access to the PlumHall suite, I encourage you to compile the chapter 7.22 test as an example. Also, related to register pressure, different kinds of bad code will stress different algorithms, so you also have to be careful about stating that one algorithm is much better than the others based only on one badly-written program.

> Ghassan: Sorry if I did not include a clear enough description of the numbers' meanings. Let me explain more precisely. First of all, the "source" scheduler was indeed run for 9 iterations (which took about 2 days), and that was our baseline. All the numbers in the execution-time table are percentage differences relative to "source". Of course, there were random variations in the numbers, but we followed the standard SPEC practice of taking the median. For most benchmarks, the random variation was not significant.

I see, my mistake.

> There was one particular benchmark, though (libquantum), on which we thought the random variation was too large to make a meaningful comparison, and we therefore decided to exclude it.

Quite amusing, having libquantum behaving erratically. ;)

cheers,
--renato
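[Editor's note: the cross-suite sanity check Renato suggests, i.e. seeing whether spill counts still track execution time on a non-SPEC suite, could be sketched as a plain Pearson correlation. All numbers below are invented for illustration.]

```python
# Spill counts and runtimes for a handful of hot functions from some
# hypothetical non-SPEC workload (invented data).
spills   = [120, 95, 80, 200, 150]    # spills per hot function
runtimes = [4.1, 3.5, 3.2, 6.0, 4.9]  # seconds for the same functions

n = len(spills)
mean_s = sum(spills) / n
mean_r = sum(runtimes) / n

# Pearson correlation coefficient between spill counts and runtimes:
# a value near 1.0 means spill counts track performance on this suite too.
cov   = sum((s - mean_s) * (r - mean_r) for s, r in zip(spills, runtimes))
var_s = sum((s - mean_s) ** 2 for s in spills)
var_r = sum((r - mean_r) ** 2 for r in runtimes)
corr  = cov / (var_s * var_r) ** 0.5
```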
Ghassan Shobaki
2013-Sep-19 18:13 UTC
[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
I should note here that although SPEC provided us with a sufficiently large sample for our spill-count experiment, I don't think that SPEC has enough hot functions with spills to make our execution-time results statistically significant. That's because SPEC has many benchmarks with peaky profiles, where one or two functions dominate the execution time. So, if one heuristic gets very lucky (or unlucky) on a few hot functions, it may get a deceptively high (or low) score. That's why I think that if someone runs the same kind of test on a different benchmark suite of comparable size, they may get different execution-time results, but they will most likely get the same spill-count results that we got (in relative terms, of course).

-Ghassan

________________________________
From: Renato Golin <renato.golin at linaro.org>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com>
Cc: Andrew Trick <atrick at apple.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
Sent: Thursday, September 19, 2013 8:27 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3

[snip]