Ghassan Shobaki
2013-Sep-19 16:25 UTC
[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
Hi Renato,

Please see my answers below.

Thanks,
-Ghassan

________________________________
From: Renato Golin <renato.golin at linaro.org>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com>
Cc: Andrew Trick <atrick at apple.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
Sent: Thursday, September 19, 2013 5:30 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3

On 17 September 2013 19:04, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> We have done an experimental evaluation of the different schedulers in LLVM 3.3 (source, BURR, ILP, fast, MI). The evaluation was done on x86-64 using SPEC CPU2006. We measured both the amount of spill code and the execution time, as detailed below.

Hi Ghassan,

This is an amazing piece of work, thanks for doing this. We need more benchmarks like yours, and more often, too.

> 3. The source scheduler is the second best scheduler in terms of spill code and execution time, and its performance is very close to that of BURR in both metrics. This result is surprising to me, because, as far as I understand, this scheduler is a conservative scheduler that tries to preserve the original program order. Does this result surprise you?

Well, SPEC is an old benchmark, from a time when code was written to accommodate the hardware's requirements, so preserving the code order might not be as big a deal on SPEC as it is on other types of code. So far, I haven't found SPEC to be very good for judging overall compiler performance, only specific micro-optimized features. Besides, hardware and software are designed nowadays based on some version of Dhrystone, EEMBC, SPEC or CoreMark, so it's not impossible to see a 50% increase in performance with little change in either.

Ghassan: You have made me curious to try other benchmarks in our future work. Most academic publications on CPU performance, though, use SPEC. You can even find some recent publications that still use SPEC CPU2000!
When I was at AMD in 2009, performance optimization and benchmarking was all about SPEC CPU2006. Have things changed so much in the past four years? And the more important question is: what specific features do these non-SPEC benchmarks have that are likely to affect the scheduler's register-pressure-reduction behavior?

> 4. The ILP scheduler has the worst execution times on FP2006 and the second worst spill counts, although it is the default on x86-64. Is this surprising?

BTW, DragonEgg sets the scheduler to source. On line 368 in Backend.cpp, we find:

  if (!flag_schedule_insns)
    Args.push_back("--pre-RA-sched=source");

This looks like someone ran a similar test and did the sensible thing. How that reflects on Clang, or how important it is to be the default, I don't know. This is the same discussion as the one about optimization levels and which passes should be included in which. It also depends on which scheduler will evolve faster or further over time, and on what kind of code you're compiling...

> This is not a perfectly accurate metric, but, given the large sample size (> 10K functions), the total number of spills across such a statistically significant sample is believed to give a very strong indication about each scheduler's performance at reducing register pressure.

I agree this is a good enough metric, but I'd be cautious about stating that there is a "very strong indication about each scheduler's performance". SPEC is, after all, a special case in the compiler/hardware world, and anything you see here might not happen anywhere else. Real-world, modern code (such as the LAMP stack, browsers, office suites, etc.) is written expecting the compiler to do magic, while old-school benchmarks weren't, and they have been optimized for decades by both compiler and hardware engineers.

Ghassan: Can you please give more specific features of these modern benchmarks that affect spill-code reduction? Note that our study included over ten thousand functions with spills.
Such a large sample is expected to cover many different kinds of behavior, and that's why I am calling it a "statistically significant" sample.

> The %Diff Max (Min) is the maximum (minimum) percentage difference on a single benchmark between each scheduler and the source scheduler. These numbers show that the differences on individual FP benchmarks can be quite significant.

I'm surprised that you didn't run "source" 5/9 times, too. Did you get the exact performance numbers multiple times? It would be good to have a more realistic geo-mean for source as well, so we could estimate how much the other geo-means vary in comparison to source's.

Ghassan: Sorry if I did not include a clear enough description of the numbers' meanings. Let me explain more precisely. First of all, the "source" scheduler was indeed run for 9 iterations (which took about 2 days), and that was our baseline. All the numbers in the execution-time table are percentage differences relative to "source". Of course, there were random variations in the numbers, but we followed the standard SPEC practice of taking the median. For most benchmarks, the random variation was not significant. There was one particular benchmark, though (libquantum), on which we thought the random variation was too large to make a meaningful comparison, and we therefore decided to exclude it.

The "% Diff Max" and "% Diff Min" numbers reported in our table are NOT random variations on an individual benchmark. Rather, the "% Diff Max" for a given heuristic is the percentage difference (in median scores) between that heuristic and the source heuristic on the benchmark where that heuristic gave its biggest *gain* relative to source. Similarly, the "% Diff Min" for a given heuristic is the percentage difference (in median scores) between that heuristic and the source heuristic on the benchmark where that heuristic gave its biggest *degradation* relative to source. So, they are for two different benchmarks.
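[Editor's note: to make the definitions above concrete, here is a minimal sketch of the procedure as described (medians over repeated runs, then per-benchmark percentage differences against the "source" baseline). The benchmark names and scores are invented for illustration; they are not the study's data.]

```python
from statistics import median

# Median SPEC-style scores over 9 runs per benchmark (invented numbers).
# Higher score = better.
baseline = {"bench_a": [10.1, 10.0, 10.2] * 3,   # the "source" scheduler
            "bench_b": [20.0, 19.8, 20.1] * 3}
candidate = {"bench_a": [10.6, 10.5, 10.7] * 3,  # some other heuristic
             "bench_b": [19.0, 19.2, 18.9] * 3}

# Percentage difference (in median scores) relative to "source", per benchmark.
diffs = {b: 100.0 * (median(candidate[b]) - median(baseline[b]))
            / median(baseline[b])
         for b in baseline}

# "% Diff Max" is taken on the benchmark with the biggest gain over source;
# "% Diff Min" on the benchmark with the biggest degradation. They generally
# come from two different benchmarks.
diff_max = max(diffs.values())
diff_min = min(diffs.values())
```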
The point in giving these numbers is to show that, although the geometric-mean differences may look small, the differences on individual benchmarks were quite significant. I can provide more detailed numbers for all benchmarks if people are interested. I can post those on our web site or on any benchmarking page that LLVM may have.

> Most of the above performance differences have been correlated with significant changes in spill counts in hot functions.

Which is a beautiful correlation between spill rate and performance, showing that your metrics are at least reasonably accurate, for all purposes.

> We should probably report this as a performance bug if ILP stays the default scheduler on x86-64.

You should, regardless of what's the default choice.

cheers,
--renato
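[Editor's note: the geo-means discussed here are SPEC-style geometric means over per-benchmark ratios. A minimal sketch, with invented ratios, of why sizable per-benchmark differences can almost cancel out in the aggregate:]

```python
import math

# Per-benchmark performance ratios of some heuristic relative to the
# "source" baseline (invented numbers; 1.0 means identical performance).
ratios = [1.02, 0.97, 1.01, 1.00]

# SPEC-style aggregate: the geometric mean of the per-benchmark ratios.
# Gains of ~2% on some benchmarks and losses of ~3% on others largely
# cancel, leaving a geo-mean very close to 1.0.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```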
Renato Golin
2013-Sep-19 17:27 UTC
[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
On 19 September 2013 17:25, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> Ghassan: You have made me curious to try other benchmarks in our future work. Most academic publications on CPU performance, though, use SPEC. You can even find some recent publications that still use SPEC CPU2000! When I was at AMD in 2009, performance optimization and benchmarking was all about SPEC CPU2006. Have things changed so much in the past four years?

Unfortunately, no. Most manufacturers still use SPEC (and others) to design, test and certify their products. This is not a problem per se, as SPEC is very good and reasonably generic, but no single benchmark can cover the wide range of applications a CPU is likely to undergo over its life. So, my grudge is that there isn't much effort put into understanding how to benchmark the different uses of a CPU, not necessarily against SPEC. I think SPEC is a good match for your project.

> And the more important question is: what specific features do these non-SPEC benchmarks have that are likely to affect the scheduler's register-pressure-reduction behavior?

No idea. ;) Mind you, I don't know of any decent benchmark that will give you the "general user" case, but there are a number of specific benchmarks (browsers, codecs, databases, and web servers all have benchmark features enabled). Also, for your project, you're only interested in a very specific behaviour of a very specific part of the compiler (spills), so any benchmark will give you a way to test it, but every one will have some form of bias.

What I recommend is not to spend much time running a plethora of benchmarks, only to find out that they all tell you the same story, but to try a benchmark that is completely different from SPEC (say, Browsermark or the MySQL benchmark suite) and see if the spill correlation is similar. If it is, ignore it. If not, just mention that this correlation may not be seen with other benchmarks.
;)

> Ghassan: Can you please give more specific features of these modern benchmarks that affect spill-code reduction? Note that our study included over ten thousand functions with spills. Such a large sample is expected to cover many different kinds of behavior, and that's why I am calling it a "statistically significant" sample.

I was being a bit pedantic in pointing out that 10K data points are only statistically relevant if they're independent, which they might not be if each individual test was created/crafted with the same intent in mind (similar function size, number of functions, number of temporaries, etc.). Most programmers don't pay that much attention to good code and end up writing horrible code that stresses specific parts of the compiler. If you have access to the PlumHall suite, I encourage you to compile the chapter 7.22 test as an example. Also, related to register pressure, different kinds of bad code will stress different algorithms, so you also have to be careful about stating that one algorithm is much better than the others based only on one badly-written program.

> Ghassan: Sorry if I did not include a clear enough description of the numbers' meanings. Let me explain more precisely. First of all, the "source" scheduler was indeed run for 9 iterations (which took about 2 days), and that was our baseline. All the numbers in the execution-time table are percentage differences relative to "source". Of course, there were random variations in the numbers, but we followed the standard SPEC practice of taking the median. For most benchmarks, the random variation was not significant.

I see, my mistake.

> There was one particular benchmark, though (libquantum), on which we thought the random variation was too large to make a meaningful comparison, and we therefore decided to exclude it.

Quite amusing, having libquantum behaving erratically. ;)

cheers,
--renato
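[Editor's note: the cross-suite sanity check Renato suggests, i.e. seeing whether spill counts still track execution time on a non-SPEC suite, could be sketched as a plain Pearson correlation. All numbers below are invented for illustration.]

```python
# Spill counts and runtimes for a handful of hot functions from some
# hypothetical non-SPEC workload (invented data).
spills   = [120, 95, 80, 200, 150]    # spills per hot function
runtimes = [4.1, 3.5, 3.2, 6.0, 4.9]  # seconds for the same functions

n = len(spills)
mean_s = sum(spills) / n
mean_r = sum(runtimes) / n

# Pearson correlation coefficient between spill counts and runtimes:
# a value near 1.0 means spill counts track performance on this suite too.
cov   = sum((s - mean_s) * (r - mean_r) for s, r in zip(spills, runtimes))
var_s = sum((s - mean_s) ** 2 for s in spills)
var_r = sum((r - mean_r) ** 2 for r in runtimes)
corr  = cov / (var_s * var_r) ** 0.5
```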
Ghassan Shobaki
2013-Sep-19 18:13 UTC
[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
I should note here that although SPEC provided us with a sufficiently large sample for our spill-count experiment, I don't think that SPEC has enough hot functions with spills to make our execution-time results statistically significant. That's because SPEC has many benchmarks with peaky profiles, where one or two functions dominate the execution time. So, if one heuristic gets very lucky (or unlucky) on a few hot functions, it may get a deceptively high (or low) score. That's why I think that if someone runs the same kind of test on a different benchmark suite of comparable size, they may get different execution-time results, but they will most likely get the same spill-count results that we got (in relative terms, of course).

-Ghassan

________________________________
From: Renato Golin <renato.golin at linaro.org>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com>
Cc: Andrew Trick <atrick at apple.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
Sent: Thursday, September 19, 2013 8:27 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3

[snip]