Wei Mi via llvm-dev
2017-Feb-09 04:15 UTC
[llvm-dev] Improving the split heuristics for the Greedy Register Allocator
On Wed, Feb 8, 2017 at 6:21 PM, Wei Mi <wmi at google.com> wrote:> I have an issue that I've been wrestling with for quite some time and I'm > hoping that someone with a deeper understanding of the register allocator > can help me with. > > Namely, I am trying to teach RA to split a live range rather than > allocating a CSR. I've attempted a very large number of tweaks to the costs > (both existing and experimental ones that I've added). However, despite all > of that, I can't seem to get RA to split the following: > > 1 BB#0: derived from LLVM BB %entry > 2 Live Ins: %X3 > 3 %vreg15<def> = COPY %X3; G8RC:%vreg15 > 4 %vreg4<def> = CMPLDI %vreg15, 0; CRRC:%vreg4 G8RC:%vreg15 > 5 %vreg11:sub_32<def,read-undef> = LI 0; G8RC:%vreg11 > 6 BCC 68, %vreg4, <BB#1>; CRRC:%vreg4 > 7 Successors according to CFG: BB#4(0x30000000 / 0x80000000 = 37.50%) > BB#1(0x50000000 / 0x80000000 = 62.50%) > 8 > 9 BB#4: > 10 Predecessors according to CFG: BB#0 > 11 B <BB#3> > 12 Successors according to CFG: BB#3(?%) > 13 > 14 BB#1: derived from LLVM BB %if.end > 15 Predecessors according to CFG: BB#0 > 16 %vreg6<def> = ADDIStocHA %X2, <ga:@a>; G8RC_and_G8RC_NOX0:%vreg6 > 17 %vreg7<def> = LDtocL <ga:@a>, %vreg6, %X2<imp-use>; > mem:LD8[GOT] G8RC_and_G8RC_NOX0:%vreg7,%vreg6 > 18 %vreg8<def> = LWA 0, %vreg7; > mem:LD4[@a](tbaa=!3)(dereferenceable) G8RC:%vreg8 G8RC_and_G8RC_NOX0:%vreg7 > 19 %vreg9<def> = CMPLD %vreg8, %vreg15; CRRC:%vreg9 > G8RC:%vreg8,%vreg15 > 20 BCC 68, %vreg9, <BB#3>; CRRC:%vreg9 > 21 B <BB#2> > 22 Successors according to CFG: BB#2(0x30000000 / 0x80000000 = 37.50%) > BB#3(0x50000000 / 0x80000000 = 62.50%) > 23 > 24 BB#2: derived from LLVM BB %if.then2 > 25 Predecessors according to CFG: BB#1 > 26 ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use> > 27 %vreg16<def> = COPY %vreg15; G8RC:%vreg16,%vreg15 > 28 BL8_NOP <ga:@callVoid>, <regmask **LONG LIST**>, > %X3<imp-def,dead> > 29 ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use> > 30 ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use> 31 > %X3<def> = COPY %vreg16; G8RC:%vreg16 > 32 BL8_NOP <ga:@callNonVoid>, <regmask **LONG LIST**>, > %X3<imp-use>, %X2<imp-use>, %R1<imp-def>, %X3<imp-def> > 33 ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use> > 34 %vreg11<def> = COPY %X3; G8RC:%vreg11 35 Successors > according to CFG: BB#3(?%) > 36 > 37 BB#3: derived from LLVM BB %return > 38 Predecessors according to CFG: BB#1 BB#2 BB#4 > 39 %vreg12<def> = EXTSW_32_64 %vreg11:sub_32; G8RC:%vreg12,%vreg11 > 40 %X3<def> = COPY %vreg12; G8RC:%vreg12 > 41 BLR8 %LR8<imp-use>, %RM<imp-use>, %X3<imp-use> > > No matter what I do, vreg15 will get a Callee-Saved Register assigned to > it. However, this is suboptimal. So what I am trying to accomplish is to > split the live range of vreg15 into the paths without the call and the path > with the call (BL8_NOP is a call). Then the physical register X3 can be > used in the paths BB#0 -> BB#1 -> BB#3 and BB#0 -> BB#4 -> BB#3 and it can > be copied to a Callee-Saved Register in BB#2. >Hi Nemanja, vreg15's live range is not across call. It is weird that it is always getting CSR. Maybe because of the hint of vreg16, i.e., the contribution of hint outweigh the cost of CSR first use? Is it possible to send out the testcase? I can try if there is anyway to allocate vreg15 to CSR. Thanks, Wei.> Without such a split, vreg15 is assigned a CSR for the entire live range > and there is no way to avoid having to save/restore the CSR in the > prologue/epilogue. If one of the two paths that did not actually have the > call turn out to be the hottest path through the function, there is a lot > of wasted cycles in the save/restore because we weren't able to shrink-wrap > this function due to the choice RA made. > > If anyone can offer some ideas on what I should do here, I would truly > appreciate it. > > Nemanja
Nemanja Ivanovic via llvm-dev
2017-Feb-09 06:17 UTC
[llvm-dev] Improving the split heuristics for the Greedy Register Allocator
Hmm... it would appear that the behaviour on that original test case changes with ToT. However, this test case will still allocate a CSR in the entry block even though it really does not need to. And adjusting the CSR first-time-use cost does not seem to have any effect on it. int *a; int callVoid(); int callNonVoid(int*); int test(int *b) { if (b == a) { callVoid(); return callNonVoid(b); } return 0; } On Thu, Feb 9, 2017 at 5:15 AM, Wei Mi <wmi at google.com> wrote:> On Wed, Feb 8, 2017 at 6:21 PM, Wei Mi <wmi at google.com> wrote: > > I have an issue that I've been wrestling with for quite some time and I'm > > hoping that someone with a deeper understanding of the register allocator > > can help me with. > > > > Namely, I am trying to teach RA to split a live range rather than > > allocating a CSR. I've attempted a very large number of tweaks to the > costs > > (both existing and experimental ones that I've added). However, despite > all > > of that, I can't seem to get RA to split the following: > > > > 1 BB#0: derived from LLVM BB %entry > > 2 Live Ins: %X3 > > 3 %vreg15<def> = COPY %X3; G8RC:%vreg15 > > 4 %vreg4<def> = CMPLDI %vreg15, 0; CRRC:%vreg4 G8RC:%vreg15 > > 5 %vreg11:sub_32<def,read-undef> = LI 0; G8RC:%vreg11 > > 6 BCC 68, %vreg4, <BB#1>; CRRC:%vreg4 > > 7 Successors according to CFG: BB#4(0x30000000 / 0x80000000 > 37.50%) > > BB#1(0x50000000 / 0x80000000 = 62.50%) > > 8 > > 9 BB#4: > > 10 Predecessors according to CFG: BB#0 > > 11 B <BB#3> > > 12 Successors according to CFG: BB#3(?%) > > 13 > > 14 BB#1: derived from LLVM BB %if.end > > 15 Predecessors according to CFG: BB#0 > > 16 %vreg6<def> = ADDIStocHA %X2, <ga:@a>; > G8RC_and_G8RC_NOX0:%vreg6 > > 17 %vreg7<def> = LDtocL <ga:@a>, %vreg6, %X2<imp-use>; > > mem:LD8[GOT] G8RC_and_G8RC_NOX0:%vreg7,%vreg6 > > 18 %vreg8<def> = LWA 0, %vreg7; > > mem:LD4[@a](tbaa=!3)(dereferenceable) G8RC:%vreg8 > G8RC_and_G8RC_NOX0:%vreg7 > > 19 %vreg9<def> = CMPLD %vreg8, %vreg15; CRRC:%vreg9 > > G8RC:%vreg8,%vreg15 > > 20 BCC 68, %vreg9, <BB#3>; CRRC:%vreg9 > > 21 B <BB#2> > > 22 Successors according to CFG: BB#2(0x30000000 / 0x80000000 > 37.50%) > > BB#3(0x50000000 / 0x80000000 = 62.50%) > > 23 > > 24 BB#2: derived from LLVM BB %if.then2 > > 25 Predecessors according to CFG: BB#1 > > 26 ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use> > > 27 %vreg16<def> = COPY %vreg15; G8RC:%vreg16,%vreg15 > > 28 BL8_NOP <ga:@callVoid>, <regmask **LONG LIST**>, > > %X3<imp-def,dead> > > 29 ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use> > > 30 ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use> 31 > > %X3<def> = COPY %vreg16; G8RC:%vreg16 > > 32 BL8_NOP <ga:@callNonVoid>, <regmask **LONG LIST**>, > > %X3<imp-use>, %X2<imp-use>, %R1<imp-def>, %X3<imp-def> > > 33 ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use> > > 34 %vreg11<def> = COPY %X3; G8RC:%vreg11 35 Successors > > according to CFG: BB#3(?%) > > 36 > > 37 BB#3: derived from LLVM BB %return > > 38 Predecessors according to CFG: BB#1 BB#2 BB#4 > > 39 %vreg12<def> = EXTSW_32_64 %vreg11:sub_32; > G8RC:%vreg12,%vreg11 > > 40 %X3<def> = COPY %vreg12; G8RC:%vreg12 > > 41 BLR8 %LR8<imp-use>, %RM<imp-use>, %X3<imp-use> > > > > No matter what I do, vreg15 will get a Callee-Saved Register assigned to > > it. However, this is suboptimal. So what I am trying to accomplish is to > > split the live range of vreg15 into the paths without the call and the > path > > with the call (BL8_NOP is a call). Then the physical register X3 can be > > used in the paths BB#0 -> BB#1 -> BB#3 and BB#0 -> BB#4 -> BB#3 and it > can > > be copied to a Callee-Saved Register in BB#2. > > > > Hi Nemanja, > > vreg15's live range is not across call. It is weird that it is always > getting CSR. Maybe because of the hint of vreg16, i.e., the > contribution of hint outweigh the cost of CSR first use? > > Is it possible to send out the testcase? I can try if there is anyway > to allocate vreg15 to CSR. > > Thanks, > Wei. > > > Without such a split, vreg15 is assigned a CSR for the entire live range > > and there is no way to avoid having to save/restore the CSR in the > > prologue/epilogue. If one of the two paths that did not actually have the > > call turn out to be the hottest path through the function, there is a lot > > of wasted cycles in the save/restore because we weren't able to > shrink-wrap > > this function due to the choice RA made. > > > > If anyone can offer some ideas on what I should do here, I would truly > > appreciate it. > > > > Nemanja >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170209/38662215/attachment.html>
Wei Mi via llvm-dev
2017-Feb-10 05:22 UTC
[llvm-dev] Improving the split heuristics for the Greedy Register Allocator
Hi Nemanja, Thanks for the testcase. I can reproduce your problem. splitting actually works as we expect but tryHintRecoloring coalesces the splitted vregs and revokes the effect of splitting. The simple way is to stop tryHintRecoloring for vregs colored with CSR registers, but I am still thinking if there is better way to solve it. Will keep you updated. Thanks, Wei. On Wed, Feb 8, 2017 at 10:17 PM, Nemanja Ivanovic <nemanja.i.ibm at gmail.com> wrote:> Hmm... it would appear that the behaviour on that original test case changes > with ToT. However, this test case will still allocate a CSR in the entry > block even though it really does not need to. And adjusting the CSR > first-time-use cost does not seem to have any effect on it. > > int *a; > int callVoid(); > int callNonVoid(int*); > int test(int *b) { > if (b == a) { > callVoid(); > return callNonVoid(b); > } > return 0; > } > > > On Thu, Feb 9, 2017 at 5:15 AM, Wei Mi <wmi at google.com> wrote: >> >> On Wed, Feb 8, 2017 at 6:21 PM, Wei Mi <wmi at google.com> wrote: >> > I have an issue that I've been wrestling with for quite some time and >> > I'm >> > hoping that someone with a deeper understanding of the register >> > allocator >> > can help me with. >> > >> > Namely, I am trying to teach RA to split a live range rather than >> > allocating a CSR. I've attempted a very large number of tweaks to the >> > costs >> > (both existing and experimental ones that I've added). However, despite >> > all >> > of that, I can't seem to get RA to split the following: >> > >> > 1 BB#0: derived from LLVM BB %entry >> > 2 Live Ins: %X3 >> > 3 %vreg15<def> = COPY %X3; G8RC:%vreg15 >> > 4 %vreg4<def> = CMPLDI %vreg15, 0; CRRC:%vreg4 G8RC:%vreg15 >> > 5 %vreg11:sub_32<def,read-undef> = LI 0; G8RC:%vreg11 >> > 6 BCC 68, %vreg4, <BB#1>; CRRC:%vreg4 >> > 7 Successors according to CFG: BB#4(0x30000000 / 0x80000000 >> > 37.50%) >> > BB#1(0x50000000 / 0x80000000 = 62.50%) >> > 8 >> > 9 BB#4: >> > 10 Predecessors according to CFG: BB#0 >> > 11 B <BB#3> >> > 12 Successors according to CFG: BB#3(?%) >> > 13 >> > 14 BB#1: derived from LLVM BB %if.end >> > 15 Predecessors according to CFG: BB#0 >> > 16 %vreg6<def> = ADDIStocHA %X2, <ga:@a>; >> > G8RC_and_G8RC_NOX0:%vreg6 >> > 17 %vreg7<def> = LDtocL <ga:@a>, %vreg6, %X2<imp-use>; >> > mem:LD8[GOT] G8RC_and_G8RC_NOX0:%vreg7,%vreg6 >> > 18 %vreg8<def> = LWA 0, %vreg7; >> > mem:LD4[@a](tbaa=!3)(dereferenceable) G8RC:%vreg8 >> > G8RC_and_G8RC_NOX0:%vreg7 >> > 19 %vreg9<def> = CMPLD %vreg8, %vreg15; CRRC:%vreg9 >> > G8RC:%vreg8,%vreg15 >> > 20 BCC 68, %vreg9, <BB#3>; CRRC:%vreg9 >> > 21 B <BB#2> >> > 22 Successors according to CFG: BB#2(0x30000000 / 0x80000000 >> > 37.50%) >> > BB#3(0x50000000 / 0x80000000 = 62.50%) >> > 23 >> > 24 BB#2: derived from LLVM BB %if.then2 >> > 25 Predecessors according to CFG: BB#1 >> > 26 ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use> >> > 27 %vreg16<def> = COPY %vreg15; G8RC:%vreg16,%vreg15 >> > 28 BL8_NOP <ga:@callVoid>, <regmask **LONG LIST**>, >> > %X3<imp-def,dead> >> > 29 ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use> >> > 30 ADJCALLSTACKDOWN 96, %R1<imp-def,dead>, %R1<imp-use> 31 >> > %X3<def> = COPY %vreg16; G8RC:%vreg16 >> > 32 BL8_NOP <ga:@callNonVoid>, <regmask **LONG LIST**>, >> > %X3<imp-use>, %X2<imp-use>, %R1<imp-def>, %X3<imp-def> >> > 33 ADJCALLSTACKUP 96, 0, %R1<imp-def,dead>, %R1<imp-use> >> > 34 %vreg11<def> = COPY %X3; G8RC:%vreg11 35 Successors >> > according to CFG: BB#3(?%) >> > 36 >> > 37 BB#3: derived from LLVM BB %return >> > 38 Predecessors according to CFG: BB#1 BB#2 BB#4 >> > 39 %vreg12<def> = EXTSW_32_64 %vreg11:sub_32; >> > G8RC:%vreg12,%vreg11 >> > 40 %X3<def> = COPY %vreg12; G8RC:%vreg12 >> > 41 BLR8 %LR8<imp-use>, %RM<imp-use>, %X3<imp-use> >> > >> > No matter what I do, vreg15 will get a Callee-Saved Register assigned to >> > it. However, this is suboptimal. So what I am trying to accomplish is to >> > split the live range of vreg15 into the paths without the call and the >> > path >> > with the call (BL8_NOP is a call). Then the physical register X3 can be >> > used in the paths BB#0 -> BB#1 -> BB#3 and BB#0 -> BB#4 -> BB#3 and it >> > can >> > be copied to a Callee-Saved Register in BB#2. >> > >> >> Hi Nemanja, >> >> vreg15's live range is not across call. It is weird that it is always >> getting CSR. Maybe because of the hint of vreg16, i.e., the >> contribution of hint outweigh the cost of CSR first use? >> >> Is it possible to send out the testcase? I can try if there is anyway >> to allocate vreg15 to CSR. >> >> Thanks, >> Wei. >> >> > Without such a split, vreg15 is assigned a CSR for the entire live range >> > and there is no way to avoid having to save/restore the CSR in the >> > prologue/epilogue. If one of the two paths that did not actually have >> > the >> > call turn out to be the hottest path through the function, there is a >> > lot >> > of wasted cycles in the save/restore because we weren't able to >> > shrink-wrap >> > this function due to the choice RA made. >> > >> > If anyone can offer some ideas on what I should do here, I would truly >> > appreciate it. >> > >> > Nemanja > >
Maybe Matching Threads
- Improving the split heuristics for the Greedy Register Allocator
- Machine Scheduler on Power PC: Latency Limit and Register Pressure
- Machine Scheduler on Power PC: Latency Limit and Register Pressure
- [LLVMdev] RegisterCoalescing pass crashes with ImplicitDef registers
- [LLVMdev] RegisterCoalescing Pass seems to ignore part of CFG.