Gratian Lup
2011-Mar-30  20:16 UTC
[LLVMdev] GSoC: Profile-guided inlining and block positioning
1.  Summary
I will implement two optimizations in the LLVM compiler, both based on
runtime profile information: an inlining mechanism that takes the call
frequency into consideration, and a better basic block placement algorithm.
2. The project
LLVM now includes an efficient framework [1] for instrumenting an
application and collecting edge and path profile information. Profile-guided
optimizations present an opportunity for improving the performance of the
application in a significant way, but until now no real optimization was
developed/adapted to take advantage of this profile information (there is a*
BasicBlockPlacement* pass, but it can be considered naïve).
The first part of my project consists in implementing a function inlining
pass that takes advantage of function call frequency, if available. Function
inlining is one of the most important optimizations nowadays, reducing call
overhead and enhancing a series of classic optimizations that are usually
inhibited by calls. Although it is desirable to inline as many of the calls
as possible, a limit must be imposed, otherwise application performance
might suffer as a result of excessive code size growth and instruction cache
misses [2]. In taking these decisions the current heuristic can be improved
by considering the number of times the function to be inlined was actually
called:
  -  A function that was not called at all, or was called below a certain
threshold should not be inlined, or at least be penalized.
  - A function that is called frequently should always be inlined, or at
least get a bonus, so it has a higher chance of being inlined.
I will use the algorithm presented in [3] because, according to the
benchmarks presented in the paper, the performance improvement can be up to
57% compared to inlining without any profile information, all this with a
maximum of 10% code size growth. The second reason is that the algorithm
takes into consideration the cost of performing inlining at a call site,
functionality which is already provided by the *InlineCost *analysis.**
The algorithm will be implemented as a new optimization pass, which will
perform the following sequence of operations:
1.  Determine a maximum cost for the whole inlining process. This can be*S *
1.1*, where *S *is the estimated size of the application (this can be an
estimate of the total number of instructions, the way*
CodeMetrics::analyzeBasicBlock* does it). Multiplying by* 1.1* means that a
code growth of maximum 10% is allowed, which has been shown to be more than
enough. If optimization for size is enabled we could allow a code growth of
only 1%, which still provides most of the benefits.
2.      Determine the call sites, and for each one a *benefit/cost* ratio,
where *benefit* is the number of times the calee was actually called, and *
cost*is the estimated cost of the inlining, as computed by the
*InlineCost *analysis.
The* call site/ratio* pairs are then inserted into a max-heap, because they
are needed by the algorithm in decreasing-ratio order, and also because the
ratio of a call site may change due to an inlining decision, which can be
implemented efficiently in a heap (decrease-key).
3.      Run the actual algorithm, which can be formulated as a knapsack-like
problem:  the call sites are chosen in a greedy fashion, based on the*
benefit/cost* ratio, until the maximum cost is achieved (or no more call
sites are available). After the calee is inlined, all call sites that may be
affected by the inlining decision must be updated.
The *benefit/cost ratio *formula* *most likely* *won’t be that simple,
because it must also take into consideration attributes like *alwaysinline*,
*noinline*, and *optsize*. Also, functions with an extremely high cost
should not be inlined, even if they are called frequently. This will be
fine-tuned in the later stages of the project.
The second optimization I will implement is a better basic block placement
algorithm compared to the current one. The algorithm will be *Algo2*from
[4], with better results than *Algo1*, the one implemented now in the*
BasicBlockPlacement *pass, according to the paper. Basic block placement
using profile information is very helpful especially for control-intensive
functions, and functions where many blocks are executed infrequently or not
at all (debugging/error handling code, for example). Reordering the basic
blocks helps reducing branch penalties, which is very important in today’s
heavily pipelined CPUs. It also helps reducing instruction cache misses,
because the instruction stream is denser and the unlikely code is moved to a
“cold” place of the function. The performance improvements are between 2-15%
for most applications.
The basic idea of the algorithm is that it forms chains of basic blocks that
should be placed as a group of straight-line code in a bottom-up fashion.
After these chains are computed, the doubly-linked list of basic blocks is
changed to reflect the order in the chains.
3.  Benefits for LLVM
- The optimizations I propose provide tangible performance benefits, unlike
other profile guided optimizations that are worth performing only for a
small number of applications, or may degrade performance if they introduced
code growth that can’t be accounted for with more optimization opportunities
(superblocks formation, for example).
- Taking advantage of the many opportunities opened by profile-guided
optimizations.
- Inspiring other developers to implement other profile-guided
optimizations.
4. Success criteria- Implement both optimizations described above.- Pass
test suite with both optimizations enabled.- Performance with profile-guided
inlining enabled should exceed the one obtained with the standard inlining
pas.
     - Performance with the new block placement algorithm should exceed the
one obtained with the current placement algorithm.
5. Road-map
     - Week 1 – 6: I will implement the inlining optimization first, because
it should bring greater benefits. First the cost limit and call sites are
identified, then the actual implementation of the algorithm can be done.
     - Week 7 – 9: Test the inlining optimization for correctness and
performance improvements. I will adjust the heuristic in order to increase
the performance gains and to take into consideration things like the *
noinline* attribute and other heuristics.
     - Week 10-13: Implement the basic block positioning optimization and
test for correctness and for any performance improvement. If all works well
I will replace the current algorithm with the new one.
§    -Week 14: At the end I will do some final tests, improve the formatting
and readability of the code, and write documents describing the
implementation.
6. Biography
I’m a 3rd year Computer Science student from Romania, at the “Lucian Blaga”
University. I’ve been always interested in low level programming, especially
optimization, compilers and operating systems. The last 8 months I spent
developing a fully functional compiler for the *C99* language (full language
support – except complex numbers, but could be easily added). In its current
status it’s capable of parsing and generating the intermediate code for *
SqlLite3* and other open-source applications and libraries. The backend
shares some design decisions with LLVM, like the use of SSA as the
intermediate format and the high modularity.  As it is now, it consists of
about 90K lines of C++ code. While I didn’t write an optimization pass for
LLVM yet, I’ve read a lot of its code and most of the documentation, and I
can say I’m quite familiar with the way it works. Of course I will stop
working on my compiler while I work for LLVM, there’s more than a year until
I have my dissertation.
Last year I implemented a memory allocator optimized for multiprocessors. In
most tests it is about 15% faster than the one included with Intel Thread
Building Blocks, while using about half the memory compared to the ones that
come with glibc/Visual C++. This was tested on a 4-core machine, and now I
hope to get the chance to test it on the 48 core supercomputer at my
university, and maybe publish a paper.
In high school I implemented an application for the secure deletion of files
(most notable is that it included a partial implementation of the NTFS file
system). With this application I won the first price at a student
competition at the university I’m attending now.
***
8. References*
[1] Andreas Neustifte, *Efficient Profiling in the LLVM Compiler
Infrastructure*, 2010
[2] Matthew Arnold, Stephen Fink, Vivek Sarkar, Peter F. Sweeney, *A
Comparative Study of Static and Profile-Based Heuristics for Inlining*, 2000
[3] Pohua P. Chang, Scott A. Mahlke, William Y. Chen, Wen-Mei W.
Hwu,*Profile-guided
Automatic Inline Expansion for C Programs*, 1992
[4] Karl Pettis, Robert C. Hansen, *Profile Guided Code Positioning*, 1990
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110330/f08432bb/attachment.html>
Possibly Parallel Threads
- [LLVMdev] GSOC Project Proposal: Profile-guided optimizations
- [LLVMdev] Strange behaviour with profile data loading
- [LLVMdev] Path forward on profile guided inlining?
- Regarding basic block layout/code placement optimizations of profile guided optimization (PGO)
- [LLVMdev] Path forward on profile guided inlining?
