As you may have noticed, I have been working on a fast register allocator in the
last week. The new allocator is intended to replace the local allocator for
debug builds.
Both allocators use a local allocation strategy. Each basic block is scanned
from top to bottom, and virtual registers are assigned to physical registers as
they appear. There are no live registers between blocks. Everything is spilled
at the end of each block. This strategy works well for unoptimized code where
live ranges are short.
The fast allocator uses a few tricks to run faster than the local allocator
while producing slightly better code.
When RALocal allocates a physical register, it first checks if any aliases are
in use to avoid clobbering a subregister. RAFast uses a working set strategy
where recently killed registers are kept around for quick allocation without the
alias scan.
RALocal uses a least-recently-used algorithm for allocating registers. RAFast
uses a whatever-I-can-get-my-hands-on-quickly algorithm.
RALocal folds spills and restores into other instructions. RAFast doesn't
bother because debug builds have very few spills.
RAFast uses more aggressive hinting by peeking at future instructions. This
helps reduce the number of register copies when setting up parameters for a
call:
%reg1028<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
%reg1029<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
%reg1030<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
%reg1031<def> = MOV8r0 %EFLAGS<imp-def,dead>
%RDI<def> = MOV64rr %reg1030
%ESI<def> = MOV32rr %reg1028
%EDX<def> = MOV32rr %reg1029
%AL<def> = MOV8rr %reg1031
CALL64pcrel32 <ga:@printf>, %RDI, %ESI, %EDX, %AL, %RAX<imp-def>,
%RDX<imp-def>, %RSI<imp-def>, %RDI<imp-def>,
%RSP<imp-use>, ...
When finding a register for %reg1028, RAFast sees that it is later copied to
%ESI. It is allocated %ESI and the copy disappears:
%ESI<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
%EDX<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
%RDI<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
%AL<def> = MOV8r0 %EFLAGS<imp-def,dead>
CALL64pcrel32 <ga:@printf>, %RDI<kill>, %ESI<kill>,
%EDX<kill>, %AL<kill>, %RAX<imp-def>, %RDX<imp-def>,
%RSI<imp-def>, %RDI<imp-def>, %RSP<imp-use>, ...
The aggressive coalescing also reduces register pressure, and there is a lot
less spilling.
In a debug build of SPECCPU2006/403.gcc it results in 8% less instructions:
Local Fast
------- -------
153623 258567 regalloc - Number of copies coalesced
24975 18381 regalloc - Number of loads added
19361 13864 regalloc - Number of stores added
1109880 1018144 asm-printer - Number of machine instrs printed
The fast allocator is about three times faster than the old local allocator.
It reduces the overall code generation time from 7.0s to 5.9s for 403.gcc:
RALocal: Total Execution Time: 6.9976 seconds (7.1717 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name
---
2.7376 ( 42.1%) 0.2554 ( 51.8%) 2.9931 ( 42.8%) 2.9930 ( 41.7%) X86
DAG->DAG Instruction Selection
1.4644 ( 22.5%) 0.0073 ( 1.5%) 1.4717 ( 21.0%) 1.4714 ( 20.5%) Local
Register Allocator
0.5407 ( 8.3%) 0.1298 ( 26.3%) 0.6705 ( 9.6%) 0.8463 ( 11.8%) X86
AT&T-Style Assembly Printer
RAFast: Total Execution Time: 5.9255 seconds (6.0734 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name
---
2.7223 ( 50.1%) 0.2574 ( 52.5%) 2.9796 ( 50.3%) 2.9796 ( 49.1%) X86
DAG->DAG Instruction Selection
0.5241 ( 9.6%) 0.1237 ( 25.2%) 0.6479 ( 10.9%) 0.8003 ( 13.2%) X86
AT&T-Style Assembly Printer
0.4526 ( 8.3%) 0.0063 ( 1.3%) 0.4589 ( 7.7%) 0.4584 ( 7.5%) Fast
Register Allocator
The fast allocator can be enabled for clang -O0 with a small patch:
diff --git a/lib/Frontend/CodeGenAction.cpp b/lib/Frontend/CodeGenAction.cpp
index 86005f2..cd2d2f0 100644
--- a/lib/Frontend/CodeGenAction.cpp
+++ b/lib/Frontend/CodeGenAction.cpp
@@ -322,7 +322,7 @@ bool BackendConsumer::AddEmitPasses() {
// Set register scheduler & allocation policy.
RegisterScheduler::setDefault(createDefaultScheduler);
- RegisterRegAlloc::setDefault(Fast ? createLocalRegisterAllocator :
+ RegisterRegAlloc::setDefault(Fast ? createFastRegisterAllocator :
createLinearScanRegisterAllocator);
// Create the code generator passes.
I have tested clang self hosting on x86-64 and x64 with this patch.
On the nightly test suite, -regalloc=fast performs as well as -regalloc=local.
See PR7154.
I expect we may see some problems with the register scavenger on ARM because
-regalloc=fast doesn't currently add <imp-def,dead> flags on call
clobbered registers. That increases register pressure quite a bit as seen by the
scavenger.
Share and Enjoy!
/jakob