thr3ads.net - llvm dev - [llvm-dev] RFC] Shrink-wrapping improvement [Aug 2017]

If this information is useful, please help other people find it:
Share via:

Francis Visoiu Mistrih via llvm-dev

2017-Aug-02 23:35 UTC

[llvm-dev] RFC] Shrink-wrapping improvement

Hello,

During my internship, I worked on improving shrink-wrapping in LLVM.

As you might already know, we currently have a shrink-wrapping pass in-tree,
announced here: http://lists.llvm.org/pipermail/llvm-dev/2015-May/085220.html,
by Quentin Colombet. This pass is currently enabled by default at all
optimization levels (except O0, obviously) for targets that support it.

The current pass performs really good when the uses of callee-saved registers or
stack objects can efficiently share an unique saving point, and an unique
restoring point. Basically, what it does is "delay" the execution of
the prologue and the epilogue until it's really needed.
For that, it uses dominance and post-dominance properties of the CFG, which
performs great in terms of compile-time (we already have dominator trees) and
code-size (there is no "extra" code, just code motion).

Here are the main limitations of the current algorithm which motivate the
research of an improvement:

* Requiring an unique save / restore point can very easily result in the default
placement points, the entry block and the return blocks. It works great for
early returns but once the CFG gets more complicated we end up using the default
placement.

* Shrink-wrapping the whole state as a single unit is good because most of the
target-specific code in `emitPrologue` / `emitEpilogue` will still work. The
opportunity that we're missing here is shrink-wrapping every register
separately and exploiting the possibility to split up the prologue / epilogue by
having separate CSR operations, separate stack setup, etc.

My shrink-wrapping algorithm is inspired from Minimizing register usage penalty
at procedure calls - F. C. Chow (http://dl.acm.org/citation.cfm?id=53999), which
introduces the shrink-wrapping idea based on a dataflow analysis. This analysis
is expensive and if implemented as-is, will have a big compile-time impact.

About my implementation:

* We want to completely avoid placing save / restore instructions inside loops.
For that, I'm only placing save / restore code at SCC boundaries, which will
always end up outside of loops (even irreducible ones). So there is no need of
an expensive dataflow analysis, and we can end up with a simple linear traversal
of the CFG.

* This algorithm allows us to have multiple save / restore points for a set of
uses in a CFG.

* My implementation tracks all the different CSRs separately and choses as many
points as needed so that we *ONLY* save and restore when the register is used.

This algorithm has been generalized to be able to shrink-wrap any CFG with any
kind of input. For example, based on the points chosen for the save / restore of
the CSRs, I choose the best stack setup / destroy points where `emitPrologue` /
`emitEpilogue` is called. The same algorithm and interface is used, and can be
used for other things.

In order to use this interface, you have to describe what is a "use"
(ex: use of register, use of frame idx, etc). The only information it needs is:

* the number of separate elements we want to shrink-wrap, let's call it
`NumElt` (ex: number of callee saved registers)

* a mapping between a machine basic block number and a BitVector of size
`NumElt`. If there is a use of an element, the bit is set (ex: if the 3rd callee
saved register is used in the MBB#9, then `map[9].test(3) == true`)

The output of this is a mapping between a machine basic block number and a
BitVector of size `NumElt` (ex: if `saves[1].test(3) == true`, then the 3rd
callee saved register needs to be saved in MBB#1)

The patch implementing the algorithm and this interface is here: [CodeGen]
Provide an advanced shrink-wrapping interface (https://reviews.llvm.org/D36109).

There are a few limitations that need some more work:

* Exception handling support. In order to properly describe the state of the
function for every block, we now need more CFI directives. In order to get that
to the level of correctness, we need more logic when generating CFI
(https://reviews.llvm.org/D35844), and some unwinder fixes
(https://reviews.llvm.org/D34544). Also, compact unwinding can't work
anymore and needs to fallback on DWARF.

* Windows CFI support.

* Sanitizers, crash reporters, other things that assume there is a prologue and
don't understand DWARF.

There are some good results on AArch64 on the LLVM test suite + SPEC (2000 +
2006):

* On average, an estimation of 8% reduction of the executed save / restore
instructions based on the block frequency and the number of saved / restored
registers per block.

O3:
* Up to 8% execution time improvement
* 76 improved tests out of 322 tests.

* +0.6% compile-time impact.

* +2.5% code-size impact.

Os:
* Up to 6.5% execution time improvement
* 58 improved tests out of 322 tests.

* +0.3% compile-time impact.

* +2.8% code-size impact.
* We could think of an Os / Oz mode of the algorithm where we won't
duplicate the save / restore points.

There are also many regressions:

O3:
* 75 regressions out of 322 tests.
* 33 of them > 1%
Os:
* 98 regressions out of 322 tests.
* 22 of them > 1%

Here are some problems I noticed:

* Since we want to save / restore *ONLY* where a register is used, we can end up
splitting a pair of registers. Take the following example:

stp reg1, reg2
use reg1
use reg2
if (cond) {
// do stuff
} else {
// do stuff
}
ldp reg1, reg2
ret

This is good. But if reg2 is only used in the if.then block:

str reg1
use reg1
if (cond) {
str reg2
use reg2
ldr reg2
// do stuff
} else {
// do stuff
}
ldr reg1
ret

To fix this, instead of shrink-wrapping registers separately, we can think of
building pairs of registers and shrink-wrapping the pair as one single unit.

* Merging the sp adjustment with loads and stores. We sometimes can't do
that if the first registers to be saved / last to be restored are shrink-wrapped
away. Example:

stp reg1, reg2, [sp, #-32]! # sp -= 32; store reg1, store reg2;
stp reg3, reg4
use reg1, reg2, reg3, reg4
if (cond) {
// do stuff
} else {
// do stuff
}
ldp reg3, reg4
ldp reg1, reg2, [sp], #32 # load reg1, load reg2; sp += 32;

This is good. The LoadStoreOptimizer can build this easily even if we don't
generate this code in PEI. But if reg1, reg2 are used in the if.then block:

sub sp, sp, #32
stp reg3, reg4
use reg3, reg4
if (cond) {
stp reg1, reg2
use reg1, reg2
ldp reg1, reg2
// do stuff
} else {
// do stuff
}
ldp reg3, reg4
add sp, sp, #32

The LoadStoreOptimizer can't merge them because the stack slots are in the
wrong order. To fix this it would require to have more flexibility on the order
of the stack objects.

There is also more room for improvement:

* The register allocator problem described here:
https://bugs.llvm.org/show_bug.cgi?id=29154 and here:
https://reviews.llvm.org/D34608.

* Avoid having "artificial" uses from the target in the entry block
(or whatever we call the prologue here). At the time we shrink-wrap, we
don't know if we're going to use FP, if we're going to use the base
pointer, where it's going to be used, and where it should be initialized.

Also, all the target specific-code is very hacky. I guess I'm not the only
one noticing that the current FrameLowering process is quite messy and needs
some re-thinking. The main thing that bothered me was that the
PrologueEpilogueInserter is driving the whole FrameLowering process, and goes
back and forth with the TargetFrameLowering interface. Some hooks in that
interface are there to change the order things are done in the PEI pass. Some
targets have internal target-specific side effects in some of these hooks. One
solution here would be to have the whole FrameLowering process to be driven by
the target, by calling generic functions, which would allow this pass to
integrate much more nicely.

I managed to split this up into 4 (messy) commits. I will post these for review
once we have the general algorithm in, but if anyone is interested of trying
this out, here are the commits implementing this for AArch64, and a
less-advanced version of it for X86.

[PEI] Use the ShrinkWrapper interface for placing callee-saved registers
(https://github.com/thegameg/llvm/commit/57111b77a9536cad4a4b07ece2d29c784f5deeac)
[ShrinkWrapper] Add AArch64 support
(https://github.com/thegameg/llvm/commit/c74e92390126c1fc40e4bf0d4cdd53ab1b420f0a)
[ShrinkWrapper] Add X86 support
https://github.com/thegameg/llvm/commit/53c368fa1dd9655973b08e81392e5c0c1c9e5fe2
(https://github.com/thegameg/llvm/commit/53c368fa1dd9655973b08e81392e5c0c1c9e5fe2)
[ShrinkWrapper] Add AArch64 support for shrink-wrapping stack setup code
(https://github.com/thegameg/llvm/commit/ce787c040f22be9b359e7aa20c0e6525e93440ae)

Thanks,

-- 
Francis Visoiu Mistrih

Francis Visoiu Mistrih via llvm-dev

2017-Aug-03 00:26 UTC

head link

[llvm-dev] [RFC] Shrink-wrapping improvement

Also, I would like to add some more things:

* There is some more interest in this:
http://lists.llvm.org/pipermail/llvm-dev/2017-May/112623.html

* The current shrink-wrapping implementation is very compile-time efficient and
caught many opportunities so far. Since there have been sightings that there is
more performance potential, one of the main goals of my internship was to
explore the remaining performance headroom.

* I am looking for a way to have a mix between the current and the new
implementation, or some kind of fallback on the old one when the new one does
too much / not enough, or when things like code-size matter more.

* While it's the end of my internship, I plan to continue working on this
and keeping track of the patches until it becomes good enough.

Cheers,

-- 
Francis Visoiu Mistrih

llvm dev - Aug 2017 - RFC] Shrink-wrapping improvement

[llvm-dev] RFC] Shrink-wrapping improvement

[llvm-dev] [RFC] Shrink-wrapping improvement