Gratian Lup
2011-Mar-30 20:16 UTC
[LLVMdev] GSoC: Profile-guided inlining and block positioning
1. Summary

I will implement two optimizations in the LLVM compiler, both based on runtime profile information: an inlining mechanism that takes call frequency into consideration, and a better basic block placement algorithm.

2. The project

LLVM now includes an efficient framework [1] for instrumenting an application and collecting edge and path profile information. Profile-guided optimizations present an opportunity to improve application performance in a significant way, but until now no real optimization has been developed or adapted to take advantage of this profile information (there is a *BasicBlockPlacement* pass, but it can be considered naive).

The first part of my project consists of implementing a function inlining pass that takes advantage of function call frequency, if available. Function inlining is one of the most important optimizations today, reducing call overhead and enabling a series of classic optimizations that are usually inhibited by calls. Although it is desirable to inline as many calls as possible, a limit must be imposed, otherwise application performance might suffer as a result of excessive code size growth and instruction cache misses [2]. In making these decisions, the current heuristic can be improved by considering the number of times the function to be inlined was actually called:

- A function that was not called at all, or was called fewer times than a certain threshold, should not be inlined, or should at least be penalized.
- A function that is called frequently should always be inlined, or should at least get a bonus, so that it has a higher chance of being inlined.

I will use the algorithm presented in [3] because, according to the benchmarks in the paper, the performance improvement can be up to 57% compared to inlining without any profile information, all with a maximum of 10% code size growth.
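As a rough illustration of the bonus/penalty idea above (not the proposed pass itself), a call-count-based cost adjustment could look like the following standalone C++ sketch. The function name and the threshold constants are placeholders I chose for the example, not tuned values or real LLVM APIs:

```cpp
// Hypothetical thresholds: call counts below ColdThreshold penalize the
// call site; counts above HotThreshold earn a bonus. The numbers are
// placeholders for illustration only.
constexpr unsigned ColdThreshold = 10;
constexpr unsigned HotThreshold = 1000;

// Adjust a statically estimated inline cost by the profiled call count
// of the callee: cold calls become more expensive (less likely to be
// inlined), hot calls become cheaper (more likely to be inlined).
int adjustCostByProfile(int StaticCost, unsigned CallCount) {
    if (CallCount < ColdThreshold)
        return StaticCost * 2;  // penalize cold or never-executed calls
    if (CallCount > HotThreshold)
        return StaticCost / 2;  // reward frequently executed calls
    return StaticCost;          // leave warm calls unchanged
}
```

A real heuristic would likely scale the cost smoothly with the call count rather than using two hard cutoffs, but the sketch shows where profile data enters the decision.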
The second reason is that the algorithm takes into consideration the cost of performing inlining at a call site, functionality which is already provided by the *InlineCost* analysis. The algorithm will be implemented as a new optimization pass, which will perform the following sequence of operations:

1. Determine a maximum cost for the whole inlining process. This can be *S * 1.1*, where *S* is the estimated size of the application (this can be an estimate of the total number of instructions, the way *CodeMetrics::analyzeBasicBlock* does it). Multiplying by *1.1* allows a maximum code growth of 10%, which has been shown to be more than enough. If optimization for size is enabled, we could allow a code growth of only 1%, which still provides most of the benefits.

2. Determine the call sites, and for each one a *benefit/cost* ratio, where *benefit* is the number of times the callee was actually called, and *cost* is the estimated cost of the inlining, as computed by the *InlineCost* analysis. The *call site/ratio* pairs are then inserted into a max-heap, because the algorithm needs them in decreasing-ratio order, and also because the ratio of a call site may change due to an inlining decision, which can be handled efficiently in a heap (decrease-key).

3. Run the actual algorithm, which can be formulated as a knapsack-like problem: the call sites are chosen in a greedy fashion, based on the *benefit/cost* ratio, until the maximum cost is reached (or no more call sites are available). After a callee is inlined, all call sites that may be affected by the inlining decision must be updated.

The *benefit/cost* ratio formula most likely won't be that simple, because it must also take into consideration attributes like *alwaysinline*, *noinline*, and *optsize*. Also, functions with an extremely high cost should not be inlined, even if they are called frequently. This will be fine-tuned in the later stages of the project.
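The greedy, heap-driven selection in steps 1-3 could be sketched in standalone C++ as follows. *CallSite*, *selectInlineCandidates*, and the budget numbers are illustrative names and values I made up for the example; the sketch also omits the decrease-key updates after each inlining decision, which the real pass would need:

```cpp
#include <queue>
#include <utility>
#include <vector>

struct CallSite {
    int Id;          // identifies the call site
    double Benefit;  // profiled call count of the callee
    double Cost;     // estimated inlining cost (as InlineCost would give)
    double ratio() const { return Benefit / Cost; }
};

// Comparator so that std::priority_queue becomes a max-heap on the
// benefit/cost ratio.
struct ByRatio {
    bool operator()(const CallSite &A, const CallSite &B) const {
        return A.ratio() < B.ratio();
    }
};

// Greedy knapsack-style selection: inline the best-ratio call sites
// until the estimated application size would exceed the budget
// (e.g. 1.1 * AppSize for at most 10% growth).
std::vector<int> selectInlineCandidates(std::vector<CallSite> Sites,
                                        double AppSize,
                                        double GrowthFactor) {
    const double Budget = AppSize * GrowthFactor;
    double Used = AppSize;
    std::priority_queue<CallSite, std::vector<CallSite>, ByRatio> Heap(
        ByRatio{}, std::move(Sites));
    std::vector<int> Chosen;
    while (!Heap.empty()) {
        CallSite CS = Heap.top();
        Heap.pop();
        if (Used + CS.Cost > Budget)
            continue;  // this site would exceed the growth budget, skip it
        Used += CS.Cost;
        Chosen.push_back(CS.Id);
        // A real pass would now re-estimate the ratios of call sites
        // affected by this inlining decision (decrease-key in the heap).
    }
    return Chosen;
}
```

For example, with an application size of 1000 and a 10% budget, a site costing 100 with a low ratio is skipped once cheaper, hotter sites have consumed the remaining budget.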
The second optimization I will implement is a better basic block placement algorithm than the current one. The algorithm will be *Algo2* from [4], which, according to the paper, gives better results than *Algo1*, the one implemented now in the *BasicBlockPlacement* pass. Basic block placement using profile information is especially helpful for control-intensive functions, and for functions where many blocks are executed infrequently or not at all (debugging/error-handling code, for example). Reordering the basic blocks helps reduce branch penalties, which is very important on today's heavily pipelined CPUs. It also helps reduce instruction cache misses, because the instruction stream is denser and unlikely code is moved to a "cold" place in the function. The performance improvements are between 2% and 15% for most applications.

The basic idea of the algorithm is that it forms, in a bottom-up fashion, chains of basic blocks that should be placed as a group of straight-line code. After these chains are computed, the doubly-linked list of basic blocks is changed to reflect the order in the chains.

3. Benefits for LLVM

- The optimizations I propose provide tangible performance benefits, unlike other profile-guided optimizations that are worth performing only for a small number of applications, or that may degrade performance if they introduce code growth that cannot be compensated for by the additional optimization opportunities (superblock formation, for example).
- Taking advantage of the many opportunities opened by profile-guided optimizations.
- Inspiring other developers to implement other profile-guided optimizations.

4. Success criteria

- Implement both optimizations described above.
- Pass the test suite with both optimizations enabled.
- Performance with profile-guided inlining enabled should exceed that obtained with the standard inlining pass.
- Performance with the new block placement algorithm should exceed that obtained with the current placement algorithm.

5.
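Returning to the block placement algorithm for a moment: the bottom-up chain formation can be sketched in standalone C++ as below. *Edge*, *buildChains*, and the merging rules are a simplified illustration of the Pettis-Hansen idea (visit CFG edges in decreasing profile weight and glue chains end-to-start), not the actual pass:

```cpp
#include <algorithm>
#include <map>
#include <vector>

struct Edge {
    int From, To;     // basic block ids
    unsigned Weight;  // profiled execution count of this CFG edge
};

// Bottom-up chain formation in the spirit of Pettis & Hansen: consider
// edges in decreasing weight and merge two chains whenever the edge
// connects the tail of one chain to the head of another, so the hot
// path becomes straight-line code.
std::vector<std::vector<int>> buildChains(int NumBlocks,
                                          std::vector<Edge> Edges) {
    std::sort(Edges.begin(), Edges.end(),
              [](const Edge &A, const Edge &B) { return A.Weight > B.Weight; });

    std::vector<std::vector<int>> Chains(NumBlocks);
    std::map<int, int> ChainOf;  // block id -> index of its chain
    for (int B = 0; B < NumBlocks; ++B) {
        Chains[B] = {B};         // every block starts in its own chain
        ChainOf[B] = B;
    }

    for (const Edge &E : Edges) {
        int CF = ChainOf[E.From], CT = ChainOf[E.To];
        if (CF == CT)
            continue;                         // would form a cycle
        if (Chains[CF].back() != E.From)
            continue;                         // From must end its chain
        if (Chains[CT].front() != E.To)
            continue;                         // To must start its chain
        for (int B : Chains[CT]) {            // append CT's blocks to CF
            Chains[CF].push_back(B);
            ChainOf[B] = CF;
        }
        Chains[CT].clear();
    }

    std::vector<std::vector<int>> Result;
    for (auto &C : Chains)
        if (!C.empty())
            Result.push_back(std::move(C));
    return Result;
}
```

In the real pass, the resulting chain order would then be applied to the function's doubly-linked list of basic blocks, as described above.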
Road-map

- Weeks 1-6: I will implement the inlining optimization first, because it should bring greater benefits. First the cost limit and the call sites are identified, then the actual algorithm can be implemented.
- Weeks 7-9: Test the inlining optimization for correctness and performance improvements. I will adjust the heuristic to increase the performance gains and to take into consideration things like the *noinline* attribute and other heuristics.
- Weeks 10-13: Implement the basic block positioning optimization and test it for correctness and for performance improvements. If all works well, I will replace the current algorithm with the new one.
- Week 14: At the end I will do some final tests, improve the formatting and readability of the code, and write documents describing the implementation.

6. Biography

I'm a 3rd year Computer Science student from Romania, at the "Lucian Blaga" University. I've always been interested in low-level programming, especially optimization, compilers, and operating systems. I have spent the last 8 months developing a fully functional compiler for the *C99* language (full language support, except complex numbers, which could easily be added). In its current state it is capable of parsing and generating intermediate code for *SQLite3* and other open-source applications and libraries. The backend shares some design decisions with LLVM, like the use of SSA as the intermediate format and its high modularity. As it stands, it consists of about 90K lines of C++ code. While I haven't written an optimization pass for LLVM yet, I've read a lot of its code and most of the documentation, and I can say I'm quite familiar with the way it works. Of course, I will stop working on my compiler while I work on LLVM; there's more than a year until my dissertation. Last year I implemented a memory allocator optimized for multiprocessors.
In most tests it is about 15% faster than the allocator included with Intel Threading Building Blocks, while using about half the memory compared to the ones that come with glibc/Visual C++. This was tested on a 4-core machine, and now I hope to get the chance to test it on the 48-core supercomputer at my university, and maybe publish a paper. In high school I implemented an application for the secure deletion of files (most notably, it included a partial implementation of the NTFS file system). With this application I won first prize at a student competition at the university I'm attending now.

7. References

[1] Andreas Neustifter, *Efficient Profiling in the LLVM Compiler Infrastructure*, 2010
[2] Matthew Arnold, Stephen Fink, Vivek Sarkar, Peter F. Sweeney, *A Comparative Study of Static and Profile-Based Heuristics for Inlining*, 2000
[3] Pohua P. Chang, Scott A. Mahlke, William Y. Chen, Wen-mei W. Hwu, *Profile-guided Automatic Inline Expansion for C Programs*, 1992
[4] Karl Pettis, Robert C. Hansen, *Profile Guided Code Positioning*, 1990