thr3ads.net - llvm dev - [llvm-dev] Register Allocation Graph Coloring algorithm and Others [Dec 2017]

If this information is useful, please help other people find it:
Share via:

Michael Clark via llvm-dev

2017-Dec-19 00:07 UTC

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

Hi Leslie,

I suggest adding these 3 papers to your reading list.

	Register allocation for programs in SSA-form
	Sebastian Hack, Daniel Grund, and Gerhard Goos
	http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf

	Simple and Efficient Construction of Static Single Assignment Form
	Matthias Braun , Sebastian Buchwald , Sebastian Hack , Roland Leißa , Christoph
Mallon , and Andreas Zwinkau
	https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf

	Optimal register allocation for SSA-form programs in polynomial time
	Sebastian Hack, Gerhard Goos
	http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf

A lot of the earlier literature regarding the register allocation problem
describes the general graph colouring problem as NP-complete, however previous
research in Register Allocation has been in the context of programs that were
not in SSA form. i.e. the Chaitin-Briggs paper states that register allocation
is NP-complete and proposes an iterative algorithm.

If one reads Hack Goos, there is the discovery that programs that are in SSA
form have chordal graphs, and the colouring algorithm for chordal graphs can be
completed in polynomial time. After the cost of building the interference graph,
it is possible to perform register allocation in a single pass. The key is in
not modifying the graph.

If one has frequency for each basic block, then one can sort basic blocks by
frequency, allocating the highest frequency blocks first, and where possible
assigning the same physcial register to all virtual registers in each phi node
(to avoid copies). At program points where the live set is larger than k, the
set of physical registers, one spills the the register that has the largest
distance between its next use. At least that is how I am thinking about this
problem. I am also a novice with regards to register allocation.

I intend to develop a register allocator for use in this JIT engine:

	rv8: a high performance RISC-V to x86 binary translator
	Michael Clark, Bruce Hoult
	https://carrv.github.io/papers/clark-rv8-carrv2017.pdf

Our JIT already performs almost twice as fast a QEMU and we are using a static
register allocation, and QEMU i believe has a register allocator. We are mapping
a 31 register RISC architecture to a 16 register CISC architecture. I have
statistics on the current slow-down, and it appears to mainly be L1 accesses to
spilled registers. I believe after we have implemented a register allocator in
our JIT we will be able to run RISC-V binaries at near native performance on
x86-64. In fact given we are performing runtime profile guided optimisation, it
may even be possible for some benchmarks to run faster than register allocations
that are based on static estimates of loop frequencies.

	https://anarch128.org/~mclark/rv8-slides.pdf
	https://rv8.io/bench

We’ve still got a long way to go to get to ~1:1 performance for RISC-V on
x86-64, but I think it is possible, at least for some codes…

Regards,
Michael.
> On 15/12/2017, at 4:18 PM, Leslie Zhai <lesliezhai at llvm.org.cn>
wrote:
> 
> Hi GCC and LLVM developers,
> 
> I am learning Register Allocation algorithms and I am clear that:
> 
> * Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg
(hard)
> 
> * Memory (20 - 100 cycles) is expensive than Register (1 cycle), but it has
to spill code when PhysReg is unavailable
> 
> * Folding spill code into instructions, handling register coallescing,
splitting live ranges, doing rematerialization, doing shrink wrapping are harder
than RegAlloc
> 
> * LRA and IRA is default Passes in RA for GCC:
> 
> $ /opt/gcc-git/bin/gcc hello.c
> DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
> DEBUG: ../../gcc/ira-build.c, ira_build, line 3409
> 
> * Greedy is default Pass for LLVM
> 
> But I have some questions, please give me some hint, thanks a lot!
> 
> * IRA is regional register allocator performing graph coloring on a
top-down traversal of nested regions, is it Global? compares with Local LRA
> 
> * The papers by Briggs and Chaiten contradict[2] themselves when examine
the text of the paper vs. the pseudocode provided?
> 
> * Why  interference graph is expensive to build[3]?
> 
> And I am practicing[4] to use HEA, developed by Dr. Rhydian Lewis, for LLVM
firstly.
> 
> 
> [1] https://reviews.llvm.org/D39712
> 
> [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html
> 
> [3] https://github.com/joaotavio/llvm-register-allocator
> 
> [4] https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol
> 
> -- 
> Regards,
> Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/
> 
> 
>

Leslie Zhai via llvm-dev

2017-Dec-19 02:10 UTC

head link

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

Hi Michael,

Thanks for your sharing!

I will read the papers as you suggested, and I asked Quantum Computing 
professionals about solving the NP-complete problem 
https://github.com/epiqc/ScaffCC/issues/14 but GCC's IRA and LRA proved 
that there is a solution in SSA form for classical computer.

Sorting MachineBasicBlock by frequency firstly is a good idea, and I 
only experienced some PASS, such as LoopUnroll 
http://lists.llvm.org/pipermail/llvm-dev/2017-October/118419.html but I 
will practice your idea in my solutionByHEA 
https://github.com/xiangzhai/llvm/blob/avr/lib/CodeGen/RegAllocGraphColoring.cpp#L327

I experienced binary translation from X86 to Mips, QEMU is expensive 
https://www.leetcode.cn/2017/09/tcg-ir-to-llvm-ir.html I will bookmark 
your JIT engine to my blog :)


在 2017年12月19日 08:07, Michael Clark 写道:> Hi Leslie,
>
> I suggest adding these 3 papers to your reading list.
>
> 	Register allocation for programs in SSA-form
> 	Sebastian Hack, Daniel Grund, and Gerhard Goos
> 	http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf
>
> 	Simple and Efficient Construction of Static Single Assignment Form
> 	Matthias Braun , Sebastian Buchwald , Sebastian Hack , Roland Leißa ,
Christoph Mallon , and Andreas Zwinkau
> 	https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf
>
> 	Optimal register allocation for SSA-form programs in polynomial time
> 	Sebastian Hack, Gerhard Goos
> 	http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf
>
> A lot of the earlier literature regarding the register allocation problem
describes the general graph colouring problem as NP-complete, however previous
research in Register Allocation has been in the context of programs that were
not in SSA form. i.e. the Chaitin-Briggs paper states that register allocation
is NP-complete and proposes an iterative algorithm.
>
> If one reads Hack Goos, there is the discovery that programs that are in
SSA form have chordal graphs, and the colouring algorithm for chordal graphs can
be completed in polynomial time. After the cost of building the interference
graph, it is possible to perform register allocation in a single pass. The key
is in not modifying the graph.
>
> If one has frequency for each basic block, then one can sort basic blocks
by frequency, allocating the highest frequency blocks first, and where possible
assigning the same physcial register to all virtual registers in each phi node
(to avoid copies). At program points where the live set is larger than k, the
set of physical registers, one spills the the register that has the largest
distance between its next use. At least that is how I am thinking about this
problem. I am also a novice with regards to register allocation.
>
> I intend to develop a register allocator for use in this JIT engine:
> 	
> 	rv8: a high performance RISC-V to x86 binary translator
> 	Michael Clark, Bruce Hoult
> 	https://carrv.github.io/papers/clark-rv8-carrv2017.pdf
>
> Our JIT already performs almost twice as fast a QEMU and we are using a
static register allocation, and QEMU i believe has a register allocator. We are
mapping a 31 register RISC architecture to a 16 register CISC architecture. I
have statistics on the current slow-down, and it appears to mainly be L1
accesses to spilled registers. I believe after we have implemented a register
allocator in our JIT we will be able to run RISC-V binaries at near native
performance on x86-64. In fact given we are performing runtime profile guided
optimisation, it may even be possible for some benchmarks to run faster than
register allocations that are based on static estimates of loop frequencies.
>
> 	https://anarch128.org/~mclark/rv8-slides.pdf
> 	https://rv8.io/bench
>
> We’ve still got a long way to go to get to ~1:1 performance for RISC-V on
x86-64, but I think it is possible, at least for some codes…
>
> Regards,
> Michael.
>
>> On 15/12/2017, at 4:18 PM, Leslie Zhai <lesliezhai at
llvm.org.cn> wrote:
>>
>> Hi GCC and LLVM developers,
>>
>> I am learning Register Allocation algorithms and I am clear that:
>>
>> * Unlimited VirtReg (pseudo) -> limited or fixed or alias[1] PhysReg
(hard)
>>
>> * Memory (20 - 100 cycles) is expensive than Register (1 cycle), but it
has to spill code when PhysReg is unavailable
>>
>> * Folding spill code into instructions, handling register coallescing,
splitting live ranges, doing rematerialization, doing shrink wrapping are harder
than RegAlloc
>>
>> * LRA and IRA is default Passes in RA for GCC:
>>
>> $ /opt/gcc-git/bin/gcc hello.c
>> DEBUG: ../../gcc/lra.c, lra_init_once, line 2441
>> DEBUG: ../../gcc/ira-build.c, ira_build, line 3409
>>
>> * Greedy is default Pass for LLVM
>>
>> But I have some questions, please give me some hint, thanks a lot!
>>
>> * IRA is regional register allocator performing graph coloring on a
top-down traversal of nested regions, is it Global? compares with Local LRA
>>
>> * The papers by Briggs and Chaiten contradict[2] themselves when
examine the text of the paper vs. the pseudocode provided?
>>
>> * Why  interference graph is expensive to build[3]?
>>
>> And I am practicing[4] to use HEA, developed by Dr. Rhydian Lewis, for
LLVM firstly.
>>
>>
>> [1] https://reviews.llvm.org/D39712
>>
>> [2] http://lists.llvm.org/pipermail/llvm-dev/2008-March/012940.html
>>
>> [3] https://github.com/joaotavio/llvm-register-allocator
>>
>> [4]
https://github.com/xiangzhai/llvm/tree/avr/include/llvm/CodeGen/GCol
>>
>> -- 
>> Regards,
>> Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/
>>
>>
>>
-- 
Regards,
Leslie Zhai - https://reviews.llvm.org/p/xiangzhai/

Vladimir Makarov via llvm-dev

2017-Dec-19 04:16 UTC

head link

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

On 12/18/2017 07:07 PM, Michael Clark wrote:> Hi Leslie,
>
> I suggest adding these 3 papers to your reading list.
>
> 	Register allocation for programs in SSA-form
> 	Sebastian Hack, Daniel Grund, and Gerhard Goos
> 	http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf
>
> 	Simple and Efficient Construction of Static Single Assignment Form
> 	Matthias Braun , Sebastian Buchwald , Sebastian Hack , Roland Leißa ,
Christoph Mallon , and Andreas Zwinkau
> 	https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf
>
> 	Optimal register allocation for SSA-form programs in polynomial time
> 	Sebastian Hack, Gerhard Goos
> 	http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf
>
> A lot of the earlier literature regarding the register allocation problem
describes the general graph colouring problem as NP-complete, however previous
research in Register Allocation has been in the context of programs that were
not in SSA form. i.e. the Chaitin-Briggs paper states that register allocation
is NP-complete and proposes an iterative algorithm.
>
>I am sceptical about SSA-register allocators for high quality code 
generation.  But I should acknowledge although I tried a lot of RA 
algorithms I never tried SSA one.

One task of RA is to deal with moves but when you are destroying SSA you 
are creating moves or swaps.  Of course there are optimizations to 
decrease its number when you are going out of SSA.  But IMHO a good RA 
should see all moves created before or during own work.

As I already wrote coloring is just one out of many task of RA.  All 
these tasks in a good RA should be solved in cooperation. Optimistic 
coloring is good and fast enough (although building a conflict graph for 
it takes a lot of time).  I think better results can be obtained by 
improving other tasks than by improving optimistic coloring.

For example, nobody mentioned irregular file architectures (with 
subregisters and register subclasses) like x86/x86-64.  Even regular 
file architectures with caller/callee saved hard registers have 
irregularity because it creates register subclasses, e.g. class of saved 
general registers is a subclass of the overall general register class.  
Usually hard registers are present in IR and if some pseudos conflict 
with the hard registers, it also create a lot (intersected) subclasses.

Taking such irregularity, e.g. in coloring criteria based on Kempe's 
heuristic, can improve the final RA significantly (as I remember GCC 
generated code for SPECInt2000 even on ppc64 with mostly regular 
register file was improved by 1% by taking such subclasses into account 
during coloring). A classical article about this 
https://www.researchgate.net/publication/2440424_Graph-Coloring_Register_Allocation_for_Irregular_Architectures
could be a start point for such implementation.

Matthias Braun via llvm-dev

2017-Dec-19 20:24 UTC

head link

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

> On Dec 18, 2017, at 8:16 PM, Vladimir Makarov via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> 
> 
> On 12/18/2017 07:07 PM, Michael Clark wrote:
>> Hi Leslie,
>> 
>> I suggest adding these 3 papers to your reading list.
>> 
>> 	Register allocation for programs in SSA-form
>> 	Sebastian Hack, Daniel Grund, and Gerhard Goos
>> 	http://www.rw.cdl.uni-saarland.de/~grund/papers/cc06-ra_ssa.pdf
>> 
>> 	Simple and Efficient Construction of Static Single Assignment Form
>> 	Matthias Braun , Sebastian Buchwald , Sebastian Hack , Roland Leißa ,
Christoph Mallon , and Andreas Zwinkau
>> 	https://www.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf
>> 
>> 	Optimal register allocation for SSA-form programs in polynomial time
>> 	Sebastian Hack, Gerhard Goos
>> 
http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoos-ipl06.pdf
>> 
>> A lot of the earlier literature regarding the register allocation
problem describes the general graph colouring problem as NP-complete, however
previous research in Register Allocation has been in the context of programs
that were not in SSA form. i.e. the Chaitin-Briggs paper states that register
allocation is NP-complete and proposes an iterative algorithm.
>> 
>> 
> I am sceptical about SSA-register allocators for high quality code
generation.  But I should acknowledge although I tried a lot of RA algorithms I
never tried SSA one.
> 
> One task of RA is to deal with moves but when you are destroying SSA you
are creating moves or swaps.  Of course there are optimizations to decrease its
number when you are going out of SSA.  But IMHO a good RA should see all moves
created before or during own work.
> 
> As I already wrote coloring is just one out of many task of RA.  All these
tasks in a good RA should be solved in cooperation. Optimistic coloring is good
and fast enough (although building a conflict graph for it takes a lot of time).
I think better results can be obtained by improving other tasks than by
improving optimistic coloring.
> 
> For example, nobody mentioned irregular file architectures (with
subregisters and register subclasses) like x86/x86-64.  Even regular file
architectures with caller/callee saved hard registers have irregularity because
it creates register subclasses, e.g. class of saved general registers is a
subclass of the overall general register class.  Usually hard registers are
present in IR and if some pseudos conflict with the hard registers, it also
create a lot (intersected) subclasses.
I’m not sure this is the best place to start a discussion about register
allocation in general, but I’ll bait :)

Having worked with SSA regalloc in the past (and with LLVM now) I have to defend
it: You can (and should) apply post-realloc copy coalescing algorithms for SSA
allocators. The additional copies are a good thing, they are the reason for a
less constrained graph with reduced register pressure meaning you can get by
with less spills and reloads. Trading spills/reloads for copies is a good deal
on most modern architectures.

And I would take your statement that “a good RA should see all moves created
before or during own work” as an endorsement of SSA regallocs: PHIs are just
another form of COPY and very visible and SSA allocators cannot do pre-RA
coalescing (of the phi induced copies) so they will see all the copies.

SSA regalloc can handle classes/constraints just fine (you need to insert some
more copies, but that is also true for most traditional reallocs). You are right
in that SSA regalloc cannot directly deal with sub registers. I implemented
support for that via a pre-pass in the past that merges subvalues before
regalloc to get a regular regalloc problem again, which may be a bit tricky to
pull off when your target has additional register constraints at the same time
(which wasn’t the case for me). But for typical CPUs SSA regalloc is just fine:
Libfirm where a lot of the SSA regalloc work happened sports high quality
x86/x86_64/sparc code generators.

And I also agree that coloring is only part of the problem and while it is the
most researched talked about, I would also say that in practice spill code
placement, tweaks in your representation, how to deal with two address code and
other constraints, how good your dematerialization is etc. has a bigger impact
than your particular choice of coloring algorithm. And for regular architectures
I found SSA to make things easier :)

- Matthias
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171219/36e277a8/attachment.html>

Seemingly Similar Threads

Search for more possibly parallel threads

llvm dev - Dec 2017 - Register Allocation Graph Coloring algorithm and Others

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

[llvm-dev] Register Allocation Graph Coloring algorithm and Others

Seemingly Similar Threads