Robert Lougher
2015-Jul-14 17:19 UTC
[LLVMdev] Poor register allocation (constants causing spilling)
Hi,

While investigating a performance issue with an internal codebase I came across what looks to be poor register allocation. I have constructed a small(ish) reproducer which demonstrates the issue (see test.ll attached).

I have spent some time going through the register allocator to understand what is happening, and I have experimented with some small changes to try and improve the allocation. Unfortunately, the full report is fairly long and detailed. In short, however, I found that not splitting rematerializable live ranges leads to significantly better register allocation, and an overall performance improvement of 3%.

*** The Problem

Compile the attached testcase as follows:

  llc -mcpu=btver2 test.ll

Examining the assembly in test.s we can see that a constant is loaded into %xmm8 (second instruction in foo). Tracing the constant we can see the following:

foo:
  ...
  vmovaps .LCPI0_0(%rip), %xmm8   # xmm8 = [6.366197e-01,6.366197e-01,...]
  ...
  vmulps  %xmm8, %xmm0, %xmm1     # first use of constant
  vmovaps %xmm8, %xmm9            # move constant into another register
  ...
  vmovaps %xmm0, -40(%rsp)        # 16-byte Spill
  vmovaps %xmm9, %xmm0            # move constant into vacated register
  ...
  vmulps  %xmm0, %xmm3, %xmm2     # second use of constant
  ...
  vmulps  %xmm0, %xmm5, %xmm3     # last use of constant
  ...

The constant is used 3 times within a live range of 86 instructions. Although the constant is clearly rematerializable, the allocator has gone to some length to keep it in a register, and it has spilled a value to the stack. It would have been cheaper to simply fold the constant load into the 3 uses.

This is not the only example. Later on we can see this:

  vmovaps .LCPI0_1(%rip), %xmm6   # xmm6 = [2147483648,2147483648,...]
  vxorps  %xmm6, %xmm2, %xmm3
  ...
  vandps  %xmm6, %xmm5, %xmm2
  ...
  vmovaps %xmm1, -56(%rsp)        # 16-byte Spill
  vmovaps %xmm6, %xmm1
  ...
  vmovaps -56(%rsp), %xmm0        # 16-byte Reload
  ...
  vxorps  %xmm1, %xmm3, %xmm4
  ...

Here, we have a spill and reload to keep the constant in a register for a single use!

*** Live-Range Splitting

Looking at the first constant, we can see that the copies are inserted by the greedy register allocator when it splits the live range. The main entry point is selectOrSplit. In brief:

1) The constant is assigned and evicted several times. Eventually selectOrSplit decides to split the live range. This splits the interval in two, creating two new intervals (A and B). At the boundary of the intervals a copy is inserted that copies between the virtual registers.

2) The virtual register for the first interval (A) is assigned to XMM8.

3) The virtual register for the second interval (B) is assigned and evicted several times before selectOrSplit decides to split again (C and D).

4) The virtual register for interval C is assigned to XMM9.

5) The virtual register for interval D is assigned to XMM0.

At the boundary between A and C we have a copy from XMM8 to XMM9, and at the boundary between C and D we have a copy from XMM9 to XMM0.

The spill is the result of several evictions. To assign interval D to XMM0 another virtual register was evicted. This virtual register had previously caused an eviction, which was subsequently spilled.

*** Spill Weights

The main question is why the register allocator decides to split the constant live range and evict other, non-rematerializable registers. This comes down to the calculation of the spill weight, which is used to determine whether one interval should evict another. If the live interval is rematerializable, the spill weight is halved (multiplied by 0.5).
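For reference, this adjustment is made in the spill weight calculation in CalcSpillWeights.cpp; paraphrasing and simplifying, the logic looks roughly like this:

  // Paraphrased sketch of the check in CalcSpillWeights.cpp (details
  // simplified): an interval only gets the discount if the instruction
  // defining each of its values is trivially rematerializable.
  static bool isRematerializable(const LiveInterval &LI,
                                 const LiveIntervals &LIS,
                                 const TargetInstrInfo &TII) {
    for (const VNInfo *VNI : LI.valnos) {
      if (VNI->isUnused())
        continue;
      if (VNI->isPHIDef())
        return false;
      MachineInstr *MI = LIS.getInstructionFromIndex(VNI->def);
      if (!MI || !TII.isTriviallyReMaterializable(MI, LIS.getAliasAnalysis()))
        return false;
    }
    return true;
  }

  // ...and in the spill weight calculation itself:
  if (isRematerializable(li, LIS, *MF.getSubtarget().getInstrInfo()))
    totalWeight *= 0.5F;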
From what I can see, this calculation is the only place in the greedy register allocator (as opposed to the spiller) where rematerializability is considered. So rematerializable intervals should be less important than non-rematerializable intervals, assuming a similar number of uses, size of range, etc. However, there appears to be a flaw:

* A rematerializable interval once split is no longer rematerializable *

The isRematerializable check in CalcSpillWeights.cpp uses the target instruction info to check that the machine instruction for the live interval definition is trivially rematerializable. In the case of interval A above, the definition is a load from the constant-pool. This is trivially rematerializable and the spill weight is halved. However, in the remainder intervals B, C and D the definition is a copy. This is not trivially rematerializable, so the spill weight is unadjusted. This means these ranges are considered as expensive as non-rematerializable ranges.

*** Experiment One

In contrast to the spill weight calculation, the inline spiller can trace rematerializability back through the copies resulting from live range splitting (what it calls sibling values). The first experiment modified the greedy register allocator so that it would not split a rematerializable interval more than once (i.e. instead of splitting interval B into C and D, it is spilled). The purpose of this experiment was to confirm that the inline spiller was able to follow the copy and rematerialize the load from the constant-pool. In this case, the load was folded into the two remaining uses (in fact, the use in interval A was also folded as it was the single remaining use).

*** Experiment Two

This experiment modified the greedy register allocator so that the spill weights for split rematerializable ranges were adjusted correctly (a hack rather than a proper fix). The hope was that with the "correct" spill weight the relative priorities of the live ranges would result in better allocation (see end for full statistics).

Unmodified:

  Spilled 3 values to the stack
  124 instructions in foo

Modified:

  Spilled 1 value to the stack
  113 instructions in foo

Unfortunately, the performance showed only minimal improvement.

*** Experiment Three

The code gives no explanation for the spill weight adjustment value of 0.5. This experiment was designed to determine what adjustment factor was needed to obtain no spills. This point was reached with an adjustment factor for rematerializable intervals of 0.3.

Modified:

  No spills to the stack
  111 instructions in foo

Again, the performance showed only minimal improvement.

*** Experiment Four

In the previous experiments splitting was still occurring, leading to register copies. In this final experiment the register allocator was modified so that rematerializable ranges are not split at all, but are spilled if they cannot be assigned (obviously, the spiller will rematerialize the values).

Modified:

  No spills to the stack
  106 instructions in foo

This version contains considerably more folded loads than the previous experiments, and shows a 3% speed improvement.
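For concreteness, the change amounts to something like the following near the top of the split/spill logic in RAGreedy (an illustrative sketch only, not the actual patch; it assumes a helper equivalent to the isRematerializable() check in CalcSpillWeights.cpp is made visible to the allocator):

  // Illustrative sketch only: never split a trivially rematerializable
  // live range; spill it instead and let the inline spiller
  // rematerialize the definition at each use.
  if (isRematerializable(VirtReg, *LIS, *TII)) {
    LiveRangeEdit LRE(&VirtReg, NewVRegs, *MF, *LIS, VRM, this);
    spiller().spill(LRE);
    return 0;   // no physical register assigned
  }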
*** Conclusion

I had hoped to get to this point by fixing the spill weight calculation rather than having to make changes to the register allocator logic, as that seems a more intrusive change. However, results from SPEC CPU2006 show the performance difference of most tests to be in the noise (<1%). Depending on architecture (btver2 or Haswell), libquantum and mcf show minor improvements. The only consistent slow-down (on both btver2 and Haswell) is lbm, which shows a decrease of 1-2%. Here, I can see that not spilling rematerializable ranges produces an extra spill in the hot loop (5 rather than 4). However, the greedy register allocator allocates half the number of spill slots (5 rather than 10); but when splitting, the spiller appears to be able to remove spills as redundant.

Obviously much more work needs to be done before a patch based on these experimental results can be submitted. For example, I have done no testing on non-X86 targets. However, at this point I would welcome opinions and suggestions!

Thanks,
Rob.

*** Register Allocator Statistics

Unmodified:

   32 regalloc - Number of copies inserted for splitting
    2 regalloc - Number of folded loads
    3 regalloc - Number of folded stack accesses
   19 regalloc - Number of identity moves eliminated after rewriting
   18 regalloc - Number of instructions deleted by DCE
  158 regalloc - Number of interferences evicted
  280 regalloc - Number of new live ranges queued
  311 regalloc - Number of registers assigned
  173 regalloc - Number of registers unassigned
    2 regalloc - Number of reloads inserted
    8 regalloc - Number of rematerialized defs for spilling
    3 regalloc - Number of rematerialized defs for splitting
    7 regalloc - Number of single use loads folded after DCE
    3 regalloc - Number of spill slots allocated
   13 regalloc - Number of spilled live ranges
    3 regalloc - Number of spills inserted
   31 regalloc - Number of split local live ranges
   31 regalloc - Number of splits finished
   29 regalloc - Number of splits that were simple

Experiment Two:

   26 regalloc - Number of copies inserted for splitting
    9 regalloc - Number of folded loads
   10 regalloc - Number of folded stack accesses
   12 regalloc - Number of identity moves eliminated after rewriting
   21 regalloc - Number of instructions deleted by DCE
   88 regalloc - Number of interferences evicted
  181 regalloc - Number of new live ranges queued
  224 regalloc - Number of registers assigned
  102 regalloc - Number of registers unassigned
    4 regalloc - Number of rematerialized defs for spilling
    8 regalloc - Number of single use loads folded after DCE
    1 regalloc - Number of spill slots allocated
   10 regalloc - Number of spilled live ranges
    4 regalloc - Number of spilled snippets
    1 regalloc - Number of spills inserted
   26 regalloc - Number of split local live ranges
   26 regalloc - Number of splits finished
   26 regalloc - Number of splits that were simple

Experiment Three:

   26 regalloc - Number of copies inserted for splitting
   10 regalloc - Number of folded loads
   10 regalloc - Number of folded stack accesses
   13 regalloc - Number of identity moves eliminated after rewriting
   20 regalloc - Number of instructions deleted by DCE
   83 regalloc - Number of interferences evicted
  174 regalloc - Number of new live ranges queued
  220 regalloc - Number of registers assigned
   98 regalloc - Number of registers unassigned
    3 regalloc - Number of rematerialized defs for spilling
    7 regalloc - Number of single use loads folded after DCE
    9 regalloc - Number of spilled live ranges
    4 regalloc - Number of spilled snippets
   26 regalloc - Number of split local live ranges
   26 regalloc - Number of splits finished
   26 regalloc - Number of splits that were simple

Experiment Four:

   21 regalloc - Number of folded loads
   21 regalloc - Number of folded stack accesses
    2 regalloc - Number of identity moves eliminated after rewriting
    7 regalloc - Number of instructions deleted by DCE
   42 regalloc - Number of interferences evicted
   49 regalloc - Number of new live ranges queued
  148 regalloc - Number of registers assigned
   42 regalloc - Number of registers unassigned
    7 regalloc - Number of spilled live ranges

--
Robert Lougher
SN Systems - Sony Computer Entertainment Group

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.ll
Type: application/octet-stream
Size: 10693 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150714/9759562e/attachment.obj>
Quentin Colombet
2015-Jul-14 17:48 UTC
[LLVMdev] Poor register allocation (constants causing spilling)
Hi Rob,

Thanks for the detailed explanation and investigation.

> On Jul 14, 2015, at 10:19 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>
> * A rematerializable interval once split is no longer rematerializable *
>
> The isRematerializable check in CalcSpillWeights.cpp uses the target
> instruction info to check that the machine instruction for the live
> interval definition is trivially rematerializable. In the case of
> interval A above, the definition is a load from the constant-pool.
> This is trivially rematerializable and the spill weight is halved.
> However, in the remainder intervals B, C and D the definition is a
> copy. This is not trivially rematerializable, so the spill weight is
> unadjusted. This means these ranges are considered as expensive as
> non-rematerializable ranges.

This is the first thing to fix. As long as a live-range is rematerializable, we should reflect that properly in the spill weight. Could you clean up your patch to fix that?

> *** Experiment Two
>
> This experiment modified the greedy register allocator so that the
> spill weights for split rematerializable ranges were adjusted
> correctly (a hack rather than a proper fix).
> [...]
> Unfortunately, the performance showed only minimal improvement.

Like I said, this should be the first thing we should fix.

> *** Experiment Three
>
> The code gives no explanation for the spill weight adjustment value of
> 0.5. This experiment was designed to determine what adjustment factor
> was needed to obtain no spills. This point was reached with an
> adjustment factor for rematerializable intervals of 0.3.

This is interesting, because I think the weight of a rematerializable value compared to a "regular" spill (i.e., what this constant is supposed to adjust for) can be target dependent. The bottom line is that I'm not surprised 0.5 does not work well for all architectures.
Maybe a hook for that value would be interesting.
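For instance, something along these lines (a hypothetical sketch only; no such hook exists today and the name is made up):

  // Hypothetical hook, added to TargetRegisterInfo (or TargetInstrInfo);
  // the name and location are illustrative only.

  /// Return the factor applied to the spill weight of a trivially
  /// rematerializable live interval.  Smaller values make such
  /// intervals cheaper to spill (i.e. rematerialize) relative to
  /// ordinary live ranges.
  virtual float getRematSpillWeightFactor() const { return 0.5f; }

  // CalcSpillWeights.cpp would then use, e.g.:
  //   if (isRematerializable(li, LIS, *MF.getSubtarget().getInstrInfo()))
  //     totalWeight *= TRI.getRematSpillWeightFactor();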
> *** Experiment Four
>
> In the previous experiments splitting was still occurring, leading to
> register copies. In this final experiment the register allocator was
> modified so that rematerializable ranges are not split at all, but are
> spilled if they cannot be assigned (obviously, the spiller will
> rematerialize the values).

I don't think no splitting is the right answer generally speaking.

Thanks,
-Quentin
Philip Reames
2015-Jul-15 04:52 UTC
[LLVMdev] Poor register allocation (constants causing spilling)
Robert,

Thanks for sharing your findings and analysis. For the moment, I'll leave the technical discussion for Quentin and those more familiar with the regalloc code, but I wanted to express interest in getting this fixed and say thanks.

Has there been a bug filed for this yet?

Philip
Robert Lougher
2015-Jul-15 22:31 UTC
[LLVMdev] Poor register allocation (constants causing spilling)
Hi Philip,

On 15 July 2015 at 05:52, Philip Reames <listmail at philipreames.com> wrote:

> Has there been a bug filed for this yet?

I have just created a bug for it:

https://llvm.org/bugs/show_bug.cgi?id=24139

I wasn't sure how much of my analysis to put in it. In the end I put everything in (bar the conclusion) as it might be useful.

I'm working on a patch for the spill weight (I'll reply to Quentin about it).

Thanks,
Rob.
Robert Lougher
2015-Jul-15 23:22 UTC
[LLVMdev] Poor register allocation (constants causing spilling)
Hi Quentin,

Sorry for the delay, I've been bogged down with other things today.

On 14 July 2015 at 18:48, Quentin Colombet <qcolombet at apple.com> wrote:

> This is the first thing to fix. As long as a live-range is
> rematerializable, we should reflect that properly in the spill weight.
> Could you clean up your patch to fix that?

Unfortunately my change was an ugly hack that was good enough for my experiment but not the right approach for a proper fix (it added an is_remat bool to the interval class, and adjusted the weights and propagated the flag after local splitting - as they're in the same basic block they should remain rematerializable).

I'm now looking at a proper fix, although it may take some time...
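Roughly, what I have in mind is to make isRematerializable() in CalcSpillWeights.cpp chase the full copies introduced by splitting back to the original definition, along these lines (an untested sketch; the details are illustrative, and the VirtRegMap would need to be passed into the spill weight calculation):

  // Untested sketch: follow the full copies inserted by live range
  // splitting back to the original def, so that split ranges keep
  // their rematerializability discount.  Assumes LI, VNI, MI, LIS and
  // TII as in the existing isRematerializable(), plus access to VRM.
  unsigned Reg = LI.reg;
  unsigned Original = VRM.getOriginal(Reg);
  while (MI->isFullCopy()) {
    // The copy destination must be the register we are following.
    if (MI->getOperand(0).getReg() != Reg)
      return false;

    // Follow the source of the copy; it must be another virtual
    // register that came from splitting the same original register.
    Reg = MI->getOperand(1).getReg();
    if (!TargetRegisterInfo::isVirtualRegister(Reg) ||
        VRM.getOriginal(Reg) != Original)
      return false;

    // Find the value live into the copy and its defining instruction,
    // and keep walking.
    const LiveInterval &SrcLI = LIS.getInterval(Reg);
    VNI = SrcLI.Query(VNI->def).valueIn();
    if (!VNI || VNI->isPHIDef())
      return false;
    MI = LIS.getInstructionFromIndex(VNI->def);
    if (!MI)
      return false;
  }
  // Finally, the non-copy def must itself be trivially rematerializable.
  if (!TII.isTriviallyReMaterializable(MI, LIS.getAliasAnalysis()))
    return false;

Thanks,
Rob.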