Hi all,

I'm interested in adding code size optimizations as a GSoC project. This is necessary when compiling for embedded devices, where LLVM should optimize for size even at the expense of speed. I'm still working on my proposal, but I'd like some advice on the technical parts and the overall project plan.

First, I would add a way to determine which parts of the code should be optimized for size. This is currently done with an OptimizeForSize attribute on functions, which is added by clang with -Os. I would add a more flexible method that determines whether a given BB should be optimized for size, depending on profiling information (if available), -Os, and other options. I could put this in ProfileInfoT, since it would interact closely with profiling information.

The next step is to make optimization passes use this information. This means updating passes that currently check the OptimizeForSize attribute to use the new information instead. I would also make JumpThreading, InlineCost, and possibly other passes aware of size optimization.

After working on existing passes, I would add a new MergeBlocks pass. This is an IPO pass that would combine equivalent basic blocks, extracting them into new functions. Research has shown that this can decrease code size by 5-6%. The new pass would be based on CodeExtractor and MergeFunctions; it would create a hash table of every basic block, keyed on the number of instructions of different types, and then perform a detailed comparison on blocks with the same hash. I would first write an inefficient preliminary version that only finds obvious cases; this can be used to get an impression of the technique's effectiveness. I would then work on making it more efficient and able to merge blocks in more difficult cases.

Tentative schedule:

2-3 weeks: Add -Os to llc and opt. Use profiling information or -Os to determine which BBs should be optimized for size.
3-4 weeks: Make existing passes aware of size optimization.
2-3 weeks: Preliminary implementation of block merging.
2-3 weeks: Improve block merging algorithm speed.
1-4 weeks: Make block merging handle more difficult cases.

Any thoughts?

Thanks,
Sean Bartell (wtachi in #llvm)
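P.S. To make the hash-then-compare idea concrete, here is a toy sketch (in Python rather than the C++ the real pass would be; the (opcode, operands) block model and the names block_hash, blocks_equivalent, and find_mergeable_groups are all illustrative, not LLVM APIs):

```python
from collections import Counter, defaultdict

# Toy model: a "basic block" is a list of (opcode, operands) pairs.
# A real pass would walk llvm::BasicBlock instructions instead.

def block_hash(block):
    # Cheap hash keyed on how many instructions of each type the block
    # contains, ignoring operands and order. Blocks that could merge
    # must collide here; most unrelated blocks will not.
    return hash(frozenset(Counter(op for op, _ in block).items()))

def blocks_equivalent(a, b):
    # Detailed comparison, run only on hash collisions. Here it is
    # plain equality; a real pass would compare structurally, modulo
    # renamable values, the way MergeFunctions does.
    return a == b

def find_mergeable_groups(blocks):
    # Bucket blocks by the cheap hash, then resolve each bucket with
    # the expensive comparison, yielding groups of mergeable blocks.
    buckets = defaultdict(list)
    for bb in blocks:
        buckets[block_hash(bb)].append(bb)
    groups = []
    for bucket in buckets.values():
        while bucket:
            leader = bucket.pop()
            matches = [b for b in bucket if blocks_equivalent(leader, b)]
            bucket = [b for b in bucket if not blocks_equivalent(leader, b)]
            if matches:
                groups.append([leader] + matches)
    return groups
```

Each resulting group would then be handed to something like CodeExtractor: keep one extracted copy and rewrite the other blocks into calls.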
On Tue, Apr 5, 2011 at 8:29 PM, Sean Bartell <wingedtachikoma at gmail.com> wrote:
> Hi all,

Hi Sean,

Last year I wrote my master's thesis on a MergeBlocks optimization at the LLVM IR level (at St. Petersburg State Polytechnical University), supervised by Anton Korobeynikov. I'll comment on the MergeBlocks part of your proposal.

> After working on existing passes, I would add a new MergeBlocks pass. This is
> an IPO pass that would combine equivalent basic blocks, extracting them into
> new functions. Research has shown that this can decrease code size by 5-6%. The
> new pass will be based on CodeExtractor and MergeFunctions; it will create a
> hash table of every basic block, based on the number of instructions of
> different types, and then perform a detailed comparison on blocks with the same
> hash. I would first write an inefficient preliminary version that only finds
> obvious cases. This can be used to get an impression of the technique's
> effectiveness. I would then work on making it more efficient and able to merge
> blocks in more difficult cases.

My implementation of a MergeBasicBlocks pass (mergebb for short) managed to reduce code size by ~3% on some samples from the LLVM test suite, on both x86 and ARM. Detecting mergeable basic blocks is pretty easy and straightforward: you can introduce an order on them and detect all mergeable groups in N log N time (very fast in practice, even without any hashing). The hard part is deciding whether a particular group is worth extracting into a function. I ended up using a simple heuristic (roughly block_size * A - n_inputs * B - n_outputs * C) with hand-picked magic A, B, and C coefficients for each platform. This may not be the best way to do it; I believe one should use the code generator to actually measure the real difference that extracting each particular group of basic blocks would make. This, however, may be really slow.
To conclude, I'd say that, in my mind:

- LLVM IR is not well suited for a mergebb pass (at least not without using codegen to decide which blocks are worth merging). While it is very easy to detect equivalent blocks and extract them into functions, you have almost no control over code generation, which is crucial for this kind of optimization. Once again, without running codegen on each group of basic blocks, you have no good way to tell whether a particular group is worth extracting at all.
- Platform-specific low-level code generation tweaks and traditional -Os (-O3 without the things that blow the size up) have an order of magnitude bigger impact on the size of generated code, and are "the way to go".

Gregory