thr3ads.net - llvm dev - [LLVMdev] Fast register allocation [May 2010]

If this information is useful, please help other people find it:
Share via:

Jakob Stoklund Olesen

2010-May-18 00:43 UTC

[LLVMdev] Fast register allocation

As you may have noticed, I have been working on a fast register allocator in the
last week. The new allocator is intended to replace the local allocator for
debug builds.

Both allocators use a local allocation strategy. Each basic block is scanned
from top to bottom, and virtual registers are assigned to physical registers as
they appear. There are no live registers between blocks. Everything is spilled
at the end of each block. This strategy works well for unoptimized code where
live ranges are short.

The fast allocator uses a few tricks to run faster than the local allocator
while producing slightly better code.

When RALocal allocates a physical register, it first checks if any aliases are
in use to avoid clobbering a subregister. RAFast uses a working set strategy
where recently killed registers are kept around for quick allocation without the
alias scan.

RALocal uses a least-recently-used algorithm for allocating registers. RAFast
uses a whatever-I-can-get-my-hands-on-quickly algorithm.

RALocal folds spills and restores into other instructions. RAFast doesn't
bother because debug builds have very few spills.

RAFast uses more aggressive hinting by peeking at future instructions. This
helps reduce the number of register copies when setting up parameters for a
call:

	%reg1028<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
	%reg1029<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
	ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
	%reg1030<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
	%reg1031<def> = MOV8r0 %EFLAGS<imp-def,dead>
	%RDI<def> = MOV64rr %reg1030
	%ESI<def> = MOV32rr %reg1028
	%EDX<def> = MOV32rr %reg1029
	%AL<def> = MOV8rr %reg1031
	CALL64pcrel32 <ga:@printf>, %RDI, %ESI, %EDX, %AL, %RAX<imp-def>,
%RDX<imp-def>, %RSI<imp-def>, %RDI<imp-def>,
%RSP<imp-use>, ...

When finding a register for %reg1028, RAFast sees that it is later copied to
%ESI. It is allocated %ESI and the copy disappears:

	%ESI<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
	%EDX<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
	ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
	%RDI<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
	%AL<def> = MOV8r0 %EFLAGS<imp-def,dead>
	CALL64pcrel32 <ga:@printf>, %RDI<kill>, %ESI<kill>,
%EDX<kill>, %AL<kill>, %RAX<imp-def>, %RDX<imp-def>,
%RSI<imp-def>, %RDI<imp-def>, %RSP<imp-use>, ...

The aggressive coalescing also reduces register pressure, and there is a lot
less spilling.
In a debug build of SPECCPU2006/403.gcc it results in 8% less instructions:

 Local    Fast
------- -------
 153623  258567 regalloc     - Number of copies coalesced
  24975   18381 regalloc     - Number of loads added
  19361   13864 regalloc     - Number of stores added
1109880 1018144 asm-printer  - Number of machine instrs printed

The fast allocator is about three times faster than the old local allocator.
It reduces the overall code generation time from 7.0s to 5.9s for 403.gcc:

RALocal: Total Execution Time: 6.9976 seconds (7.1717 wall clock)

---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name
---
2.7376 ( 42.1%)   0.2554 ( 51.8%)   2.9931 ( 42.8%)   2.9930 ( 41.7%)  X86
DAG->DAG Instruction Selection
1.4644 ( 22.5%)   0.0073 (  1.5%)   1.4717 ( 21.0%)   1.4714 ( 20.5%)  Local
Register Allocator
0.5407 (  8.3%)   0.1298 ( 26.3%)   0.6705 (  9.6%)   0.8463 ( 11.8%)  X86
AT&T-Style Assembly Printer

RAFast: Total Execution Time: 5.9255 seconds (6.0734 wall clock)

---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name
---
2.7223 ( 50.1%)   0.2574 ( 52.5%)   2.9796 ( 50.3%)   2.9796 ( 49.1%)  X86
DAG->DAG Instruction Selection
0.5241 (  9.6%)   0.1237 ( 25.2%)   0.6479 ( 10.9%)   0.8003 ( 13.2%)  X86
AT&T-Style Assembly Printer
0.4526 (  8.3%)   0.0063 (  1.3%)   0.4589 (  7.7%)   0.4584 (  7.5%)  Fast
Register Allocator


The fast allocator can be enabled for clang -O0 with a small patch:

diff --git a/lib/Frontend/CodeGenAction.cpp b/lib/Frontend/CodeGenAction.cpp
index 86005f2..cd2d2f0 100644
--- a/lib/Frontend/CodeGenAction.cpp
+++ b/lib/Frontend/CodeGenAction.cpp
@@ -322,7 +322,7 @@ bool BackendConsumer::AddEmitPasses() {
 
   // Set register scheduler & allocation policy.
   RegisterScheduler::setDefault(createDefaultScheduler);
-  RegisterRegAlloc::setDefault(Fast ? createLocalRegisterAllocator :
+  RegisterRegAlloc::setDefault(Fast ? createFastRegisterAllocator :
                                createLinearScanRegisterAllocator);
 
   // Create the code generator passes.

I have tested clang self hosting on x86-64 and x64 with this patch.
On the nightly test suite, -regalloc=fast performs as well as -regalloc=local.
See PR7154.

I expect we may see some problems with the register scavenger on ARM because
-regalloc=fast doesn't currently add <imp-def,dead> flags on call
clobbered registers. That increases register pressure quite a bit as seen by the
scavenger.

Share and Enjoy!
/jakob

Chris Lattner

2010-May-18 01:21 UTC

head link

[LLVMdev] Fast register allocation

On May 17, 2010, at 5:43 PM, Jakob Stoklund Olesen wrote:
> As you may have noticed, I have been working on a fast register allocator
in the last week. The new allocator is intended to replace the local allocator
for debug builds.
This is great work Jakob!  What is required and when do you think you'll be
killing off the local allocator?

-Chris

Jakob Stoklund Olesen

2010-May-18 14:49 UTC

head link

[LLVMdev] Fast register allocation

On May 17, 2010, at 6:21 PM, Chris Lattner wrote:
> 
> On May 17, 2010, at 5:43 PM, Jakob Stoklund Olesen wrote:
> 
>> As you may have noticed, I have been working on a fast register
allocator in the last week. The new allocator is intended to replace the local
allocator for debug builds.
> 
> This is great work Jakob!  What is required and when do you think
you'll be killing off the local allocator?
I think we should switch clang -O0 to the fast allocator right away. It is the
only way of smoking out the remaining bugs.

I am not sure if or when llvm-gcc wants to switch, but when there are no more
clients in the tree, we can kill off the local allocator.

/jakob

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - May 2010 - [LLVMdev] Fast register allocation

[LLVMdev] Fast register allocation

[LLVMdev] Fast register allocation

[LLVMdev] Fast register allocation

Possibly Parallel Threads