thr3ads.net - llvm dev - [llvm-dev] Improving the split heuristics for the Greedy Register Allocator [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Nemanja Ivanovic via llvm-dev

2017-Jan-13 14:28 UTC

[llvm-dev] Improving the split heuristics for the Greedy Register Allocator

I have an issue that I've been wrestling with for quite some time and
I'm
hoping that someone with a deeper understanding of the register allocator
can help me with.

Namely, I am trying to teach RA to split a live range rather than
allocating a CSR. I've attempted a very large number of tweaks to the costs
(both existing and experimental ones that I've added). However, despite all
of that, I can't seem to get RA to split the following:

  1 BB#0: derived from LLVM BB %entry
  2     Live Ins: %X3
  3         %vreg15<def> = COPY %X3; G8RC:%vreg15
  4         %vreg4<def> = CMPLDI %vreg15, 0; CRRC:%vreg4 G8RC:%vreg15
  5         %vreg11:sub_32<def,read-undef> = LI 0; G8RC:%vreg11
  6         BCC 68, %vreg4, <BB#1>; CRRC:%vreg4
  7     Successors according to CFG: BB#4(0x30000000 / 0x80000000 = 37.50%)
BB#1(0x50000000 / 0x80000000 = 62.50%)
  8
  9 BB#4:
 10     Predecessors according to CFG: BB#0
 11         B <BB#3>
 12     Successors according to CFG: BB#3(?%)
 13
 14 BB#1: derived from LLVM BB %if.end
 15     Predecessors according to CFG: BB#0
 16         %vreg6<def> = ADDIStocHA %X2, <ga:@a>;
G8RC_and_G8RC_NOX0:%vreg6
 17         %vreg7<def> = LDtocL <ga:@a>, %vreg6,
%X2<imp-use>;
mem:LD8[GOT] G8RC_and_G8RC_NOX0:%vreg7,%vreg6
 18         %vreg8<def> = LWA 0, %vreg7;
mem:LD4[@a](tbaa=!3)(dereferenceable) G8RC:%vreg8 G8RC_and_G8RC_NOX0:%vreg7
 19         %vreg9<def> = CMPLD %vreg8, %vreg15; CRRC:%vreg9
G8RC:%vreg8,%vreg15
 20         BCC 68, %vreg9, <BB#3>; CRRC:%vreg9
 21         B <BB#2>
 22     Successors according to CFG: BB#2(0x30000000 / 0x80000000 = 37.50%)
BB#3(0x50000000 / 0x80000000 = 62.50%)
 23
 24 BB#2: derived from LLVM BB %if.then2
 25     Predecessors according to CFG: BB#1
 26         ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use>
 27         %vreg16<def> = COPY %vreg15; G8RC:%vreg16,%vreg15
 28         BL8_NOP <ga:@callVoid>, <regmask **LONG LIST**>,
%X3<imp-def,dead>
 29         ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use>
 30         ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use> 31
%X3<def> = COPY %vreg16; G8RC:%vreg16
 32         BL8_NOP <ga:@callNonVoid>, <regmask **LONG LIST**>,
%X3<imp-use>, %X2<imp-use>, %R1<imp-def>, %X3<imp-def>
 33         ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use>
 34         %vreg11<def> = COPY %X3; G8RC:%vreg11 35     Successors
according to CFG: BB#3(?%)
 36
 37 BB#3: derived from LLVM BB %return
 38     Predecessors according to CFG: BB#1 BB#2 BB#4
 39         %vreg12<def> = EXTSW_32_64 %vreg11:sub_32;
G8RC:%vreg12,%vreg11
 40         %X3<def> = COPY %vreg12; G8RC:%vreg12
 41         BLR8 %LR8<imp-use>, %RM<imp-use>, %X3<imp-use>

No matter what I do, vreg15 will get a Callee-Saved Register assigned to
it. However, this is suboptimal. So what I am trying to accomplish is to
split the live range of vreg15 into the paths without the call and the path
with the call (BL8_NOP is a call). Then the physical register X3 can be
used in the paths BB#0 -> BB#1 -> BB#3 and BB#0 -> BB#4 -> BB#3 and
it can
be copied to a Callee-Saved Register in BB#2.

Without such a split, vreg15 is assigned a CSR for the entire live range
and there is no way to avoid having to save/restore the CSR in the
prologue/epilogue. If one of the two paths that did not actually have the
call turn out to be the hottest path through the function, there is a lot
of wasted cycles in the save/restore because we weren't able to shrink-wrap
this function due to the choice RA made.

If anyone can offer some ideas on what I should do here, I would truly
appreciate it.

Nemanja
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170113/5325ff20/attachment.html>

Philip Reames via llvm-dev

2017-Jan-14 05:33 UTC

head link

[llvm-dev] Improving the split heuristics for the Greedy Register Allocator

This doesn't directly address your question, but the context might be 
useful.

At the moment, we have two distinct ways of handling callee saved 
register spill/restore.

The first one is the one used by most calling conventions/targets and 
uses the PrologEpilogIinserter logic to insert spills in the prolog and 
epilog if the live range needs to be spilled.  The reasoning is that 
some exception propagation mechanisms and debuggers expect to be able to 
find the value of the CSR and this is the convention that arose.  Within 
this mechanism, an optimization was added to reduce spill range to just 
include callees, but I believe it still uses the globally assigned spill 
slots.  The key point here is that the register allocator doesn't see 
vregs for these and thus never really gets a chance to split them 
intelligently.

The second one is called "split CSR".  If you search for that term in 
the code you'll find it's rarely used.  The basic idea here is that CSRs
are just vregs with assigned physical constraints on entry and exit.  
This version goes through the normal register allocation logic.  (But, 
good luck getting it to work with dwarf, etc...)  This implementation is 
also just generally less mature, so there might be more bugs to be 
found/fixed.  Here's one cleanup I started, but haven't gotten around to
completing yet which might help you understand the code 
(https://reviews.llvm.org/D27293).

Personally, I'm of the belief that the second mechanism should be our 
default, but today, it is not.

Philip


On 01/13/2017 06:28 AM, Nemanja Ivanovic via llvm-dev
wrote:> I have an issue that I've been wrestling with for quite some time and 
> I'm hoping that someone with a deeper understanding of the register 
> allocator can help me with.
>
> Namely, I am trying to teach RA to split a live range rather than 
> allocating a CSR. I've attempted a very large number of tweaks to the 
> costs (both existing and experimental ones that I've added). However, 
> despite all of that, I can't seem to get RA to split the following:
>
>   1 BB#0: derived from LLVM BB %entry
>   2     Live Ins: %X3
>   3         %vreg15<def> = COPY %X3; G8RC:%vreg15
>   4         %vreg4<def> = CMPLDI %vreg15, 0; CRRC:%vreg4 G8RC:%vreg15
>   5         %vreg11:sub_32<def,read-undef> = LI 0; G8RC:%vreg11
>   6         BCC 68, %vreg4, <BB#1>; CRRC:%vreg4
>   7     Successors according to CFG: BB#4(0x30000000 / 0x80000000 = 
> 37.50%) BB#1(0x50000000 / 0x80000000 = 62.50%)
>   8
>   9 BB#4:
>  10     Predecessors according to CFG: BB#0
>  11         B <BB#3>
>  12     Successors according to CFG: BB#3(?%)
>  13
>  14 BB#1: derived from LLVM BB %if.end
>  15     Predecessors according to CFG: BB#0
>  16         %vreg6<def> = ADDIStocHA %X2, <ga:@a>; 
> G8RC_and_G8RC_NOX0:%vreg6
>  17         %vreg7<def> = LDtocL <ga:@a>, %vreg6,
%X2<imp-use>;
> mem:LD8[GOT] G8RC_and_G8RC_NOX0:%vreg7,%vreg6
>  18         %vreg8<def> = LWA 0, %vreg7; 
> mem:LD4[@a](tbaa=!3)(dereferenceable) G8RC:%vreg8 
> G8RC_and_G8RC_NOX0:%vreg7
>  19         %vreg9<def> = CMPLD %vreg8, %vreg15; CRRC:%vreg9 
> G8RC:%vreg8,%vreg15
>  20         BCC 68, %vreg9, <BB#3>; CRRC:%vreg9
>  21         B <BB#2>
>  22     Successors according to CFG: BB#2(0x30000000 / 0x80000000 = 
> 37.50%) BB#3(0x50000000 / 0x80000000 = 62.50%)
>  23
>  24 BB#2: derived from LLVM BB %if.then2
>  25     Predecessors according to CFG: BB#1
>  26         ADJCALLSTACKDOWN 96, %R1<imp-def,dead>,
%R1<imp-use>
>  27         %vreg16<def> = COPY %vreg15; G8RC:%vreg16,%vreg15
>  28         BL8_NOP <ga:@callVoid>, <regmask **LONG LIST**>, 
> %X3<imp-def,dead>
>  29         ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>,
%R1<imp-use>
>  30         ADJCALLSTACKDOWN 96, %R1<imp-def,dead>,
%R1<imp-use>
> 31         %X3<def> = COPY %vreg16; G8RC:%vreg16
>  32         BL8_NOP <ga:@callNonVoid>, <regmask **LONG LIST**>,
> %X3<imp-use>, %X2<imp-use>, %R1<imp-def>,
%X3<imp-def>
>  33         ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>,
%R1<imp-use>
>  34         %vreg11<def> = COPY %X3; G8RC:%vreg11 35     Successors 
> according to CFG: BB#3(?%)
>  36
>  37 BB#3: derived from LLVM BB %return
>  38     Predecessors according to CFG: BB#1 BB#2 BB#4
>  39         %vreg12<def> = EXTSW_32_64 %vreg11:sub_32; 
> G8RC:%vreg12,%vreg11
>  40         %X3<def> = COPY %vreg12; G8RC:%vreg12
>  41         BLR8 %LR8<imp-use>, %RM<imp-use>,
%X3<imp-use>
>
> No matter what I do, vreg15 will get a Callee-Saved Register assigned 
> to it. However, this is suboptimal. So what I am trying to accomplish 
> is to split the live range of vreg15 into the paths without the call 
> and the path with the call (BL8_NOP is a call). Then the physical 
> register X3 can be used in the paths BB#0 -> BB#1 -> BB#3 and BB#0
->
> BB#4 -> BB#3 and it can be copied to a Callee-Saved Register in BB#2.
>
> Without such a split, vreg15 is assigned a CSR for the entire live 
> range and there is no way to avoid having to save/restore the CSR in 
> the prologue/epilogue. If one of the two paths that did not actually 
> have the call turn out to be the hottest path through the function, 
> there is a lot of wasted cycles in the save/restore because we weren't 
> able to shrink-wrap this function due to the choice RA made.
>
> If anyone can offer some ideas on what I should do here, I would truly 
> appreciate it.
>
> Nemanja
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170113/31f88383/attachment-0001.html>

Nemanja Ivanovic via llvm-dev

2017-Jan-16 23:10 UTC

head link

[llvm-dev] Improving the split heuristics for the Greedy Register Allocator

Sorry Philip,
I am really hoping you can provide a bit more guidance here and I apologize
for taking up so much of your time.

I've spent considerable time reading and re-reading your email as well as
looking at the code and I can't really understand how I should go about
doing this if I wanted to implement it using this mechanism.
Here's what I understand and please correct me if I'm wrong (I suspect I
am
spectacularly wrong here):
- The target is queried for whether a specific function supports this
mechanism
- If so, the initialization function is called
- Finally, the insertCopiesSplitCSR() function is called to insert copies *out
of* CSR's in the entry blocks and *into* CSR's in all the exit blocks,
leaving the CSR's available for reuse by the register allocator.

This way all the CSR's have their values preserved (in vregs) throughout
the lifetime of the function. The register allocator then figures out how
to preserve them. I suppose the choices are:
- Spill them in entry and reload in exits
- Elide the copy as the CSR won't be allocated in the function
- Defer the spill to immediately before the register is needed

Also, is the idea with this mechanism to eliminate CSR save/restore in the
function prologue/epilogue altogether?

Finally, I know very little about DWARF and debug data overall. It isn't
clear to me why this is difficult to get to work correctly with DWARF.

Thank you once again for your time.
Nemanja

On Sat, Jan 14, 2017 at 6:33 AM, Philip Reames <listmail at
philipreames.com>
wrote:
> This doesn't directly address your question, but the context might be
> useful.
>
> At the moment, we have two distinct ways of handling callee saved register
> spill/restore.
>
> The first one is the one used by most calling conventions/targets and uses
> the PrologEpilogIinserter logic to insert spills in the prolog and epilog
> if the live range needs to be spilled.  The reasoning is that some
> exception propagation mechanisms and debuggers expect to be able to find
> the value of the CSR and this is the convention that arose.  Within this
> mechanism, an optimization was added to reduce spill range to just include
> callees, but I believe it still uses the globally assigned spill slots.
> The key point here is that the register allocator doesn't see vregs for
> these and thus never really gets a chance to split them intelligently.
>
> The second one is called "split CSR".  If you search for that
term in the
> code you'll find it's rarely used.  The basic idea here is that
CSRs are
> just vregs with assigned physical constraints on entry and exit.  This
> version goes through the normal register allocation logic.  (But, good luck
> getting it to work with dwarf, etc...)  This implementation is also just
> generally less mature, so there might be more bugs to be found/fixed.
> Here's one cleanup I started, but haven't gotten around to
completing yet
> which might help you understand the code (https://reviews.llvm.org/D27293).
>
>
> Personally, I'm of the belief that the second mechanism should be our
> default, but today, it is not.
>
> Philip
>
> On 01/13/2017 06:28 AM, Nemanja Ivanovic via llvm-dev wrote:
>
> I have an issue that I've been wrestling with for quite some time and
I'm
> hoping that someone with a deeper understanding of the register allocator
> can help me with.
>
> Namely, I am trying to teach RA to split a live range rather than
> allocating a CSR. I've attempted a very large number of tweaks to the
costs
> (both existing and experimental ones that I've added). However, despite
all
> of that, I can't seem to get RA to split the following:
>
>   1 BB#0: derived from LLVM BB %entry
>   2     Live Ins: %X3
>   3         %vreg15<def> = COPY %X3; G8RC:%vreg15
>   4         %vreg4<def> = CMPLDI %vreg15, 0; CRRC:%vreg4 G8RC:%vreg15
>   5         %vreg11:sub_32<def,read-undef> = LI 0; G8RC:%vreg11
>   6         BCC 68, %vreg4, <BB#1>; CRRC:%vreg4
>   7     Successors according to CFG: BB#4(0x30000000 / 0x80000000 >
37.50%) BB#1(0x50000000 / 0x80000000 = 62.50%)
>   8
>   9 BB#4:
>  10     Predecessors according to CFG: BB#0
>  11         B <BB#3>
>  12     Successors according to CFG: BB#3(?%)
>  13
>  14 BB#1: derived from LLVM BB %if.end
>  15     Predecessors according to CFG: BB#0
>  16         %vreg6<def> = ADDIStocHA %X2, <ga:@a>;
> G8RC_and_G8RC_NOX0:%vreg6
>  17         %vreg7<def> = LDtocL <ga:@a>, %vreg6,
%X2<imp-use>;
> mem:LD8[GOT] G8RC_and_G8RC_NOX0:%vreg7,%vreg6
>  18         %vreg8<def> = LWA 0, %vreg7;
mem:LD4[@a](tbaa=!3)(dereferenceable)
> G8RC:%vreg8 G8RC_and_G8RC_NOX0:%vreg7
>  19         %vreg9<def> = CMPLD %vreg8, %vreg15; CRRC:%vreg9
> G8RC:%vreg8,%vreg15
>  20         BCC 68, %vreg9, <BB#3>; CRRC:%vreg9
>  21         B <BB#2>
>  22     Successors according to CFG: BB#2(0x30000000 / 0x80000000 >
37.50%) BB#3(0x50000000 / 0x80000000 = 62.50%)
>  23
>  24 BB#2: derived from LLVM BB %if.then2
>  25     Predecessors according to CFG: BB#1
>  26         ADJCALLSTACKDOWN 96, %R1<imp-def,dead>,
%R1<imp-use>
>  27         %vreg16<def> = COPY %vreg15; G8RC:%vreg16,%vreg15
>  28         BL8_NOP <ga:@callVoid>, <regmask **LONG LIST**>,
> %X3<imp-def,dead>
>  29         ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>,
%R1<imp-use>
>  30         ADJCALLSTACKDOWN 96, %R1<imp-def,dead>,
%R1<imp-use>
> 31         %X3<def> = COPY %vreg16; G8RC:%vreg16
>  32         BL8_NOP <ga:@callNonVoid>, <regmask **LONG LIST**>,
> %X3<imp-use>, %X2<imp-use>, %R1<imp-def>,
%X3<imp-def>
>  33         ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>,
%R1<imp-use>
>  34         %vreg11<def> = COPY %X3; G8RC:%vreg11 35     Successors
> according to CFG: BB#3(?%)
>  36
>  37 BB#3: derived from LLVM BB %return
>  38     Predecessors according to CFG: BB#1 BB#2 BB#4
>  39         %vreg12<def> = EXTSW_32_64 %vreg11:sub_32;
G8RC:%vreg12,%vreg11
>  40         %X3<def> = COPY %vreg12; G8RC:%vreg12
>  41         BLR8 %LR8<imp-use>, %RM<imp-use>,
%X3<imp-use>
>
> No matter what I do, vreg15 will get a Callee-Saved Register assigned to
> it. However, this is suboptimal. So what I am trying to accomplish is to
> split the live range of vreg15 into the paths without the call and the path
> with the call (BL8_NOP is a call). Then the physical register X3 can be
> used in the paths BB#0 -> BB#1 -> BB#3 and BB#0 -> BB#4 -> BB#3
and it
> can be copied to a Callee-Saved Register in BB#2.
>
> Without such a split, vreg15 is assigned a CSR for the entire live range
> and there is no way to avoid having to save/restore the CSR in the
> prologue/epilogue. If one of the two paths that did not actually have the
> call turn out to be the hottest path through the function, there is a lot
> of wasted cycles in the save/restore because we weren't able to
shrink-wrap
> this function due to the choice RA made.
>
> If anyone can offer some ideas on what I should do here, I would truly
> appreciate it.
>
> Nemanja
>
>
> _______________________________________________
> LLVM Developers mailing listllvm-dev at
lists.llvm.orghttp://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170117/07407594/attachment-0001.html>

llvm dev - Jan 2017 - Improving the split heuristics for the Greedy Register Allocator

[llvm-dev] Improving the split heuristics for the Greedy Register Allocator

[llvm-dev] Improving the split heuristics for the Greedy Register Allocator

[llvm-dev] Improving the split heuristics for the Greedy Register Allocator