On May 3, 2011, at 12:03 PM, David A. Greene wrote:

>> The greedy allocator is trying to pick registers so inner loops are as
>> small as possible, but that is not always the right thing to do.
>
> How does it balance that against spill cost?

I added the CostPerUse field to the register descriptions. The allocator will try to minimize the spill weight assigned to registers with a CostPerUse. It does it by swapping physical register assignments, it won't do it if it requires extra spilling.

This is actually the cause of the n-body regression. The benchmark has nested loops:

  %vreg1 = const pool load
header1:
  ; large blocks with lots of floating point ops
header2:
  ; small loop using %vreg1
  jnz header2
  ...
  jnz header1

The def of %vreg1 has been hoisted by LICM so it is live across a block with lots of floating point code. The allocator uses the low xmm registers for the large block, and %xmm8 is left for %vreg1 which has a low spill weight. This significantly improves code size, but the small loop suffers.

A low xmm register could be used for %vreg1, but would need to be rematerialized. The allocator won't go that far just to use cheaper registers.

In this case it might have helped to split the live range and rematerialize, but usually that won't be the case.

/jakob
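
The register-swapping idea described in the message above can be illustrated with a toy model. The Python sketch below is not LLVM's greedy allocator: the register names, the weights, and the helpers `weighted_cost` and `reassign_by_weight` are all hypothetical. It assumes every register is interchangeable, so permuting register names never creates new interference and never requires a spill.

```python
# Toy model only; not LLVM's implementation. All names and numbers are hypothetical.

# Assumed per-register cost: the high register costs 1 per use, the low ones 0.
COST_PER_USE = {"xmm0": 0, "xmm1": 0, "xmm8": 1}


def weighted_cost(assignment, spill_weight, cost_per_use=COST_PER_USE):
    """Sum of the spill weight that sits on registers with a nonzero cost."""
    return sum(spill_weight[v] * cost_per_use[r] for v, r in assignment.items())


def reassign_by_weight(assignment, spill_weight, cost_per_use=COST_PER_USE):
    """Permute physical registers so the heaviest contents get the cheapest names.

    Every virtual register that shared a physical register before still
    shares one afterwards, so no new interference (and no spill) can appear.
    """
    # Total spill weight currently carried by each physical register.
    load = {reg: 0.0 for reg in cost_per_use}
    for vreg, reg in assignment.items():
        load[reg] += spill_weight[vreg]

    heaviest_first = sorted(load, key=lambda r: -load[r])
    cheapest_first = sorted(cost_per_use, key=lambda r: cost_per_use[r])
    rename = dict(zip(heaviest_first, cheapest_first))
    return {vreg: rename[reg] for vreg, reg in assignment.items()}


# Made-up weights: %vreg1 is rematerializable, so its weight is low.
spill_weight = {"fp_a": 8.0, "fp_b": 6.0, "vreg1": 0.5}
before = {"fp_a": "xmm8", "fp_b": "xmm1", "vreg1": "xmm0"}
after = reassign_by_weight(before, spill_weight)

print("before:", before, "cost", weighted_cost(before, spill_weight))
print("after: ", after, "cost", weighted_cost(after, spill_weight))
```

With these made-up weights the low-weight %vreg1 ends up in the costly register, which mirrors the outcome described for the n-body benchmark.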
Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:

>>> The greedy allocator is trying to pick registers so inner loops are as
>>> small as possible, but that is not always the right thing to do.
>>
>> How does it balance that against spill cost?
>
> I added the CostPerUse field to the register descriptions. The
> allocator will try to minimize the spill weight assigned to registers
> with a CostPerUse. It does it by swapping physical register
> assignments, it won't do it if it requires extra spilling.

CostPerUse models the encoding size of the register?

> This is actually the cause of the n-body regression. The benchmark has nested loops:
>
>   %vreg1 = const pool load
> header1:
>   ; large blocks with lots of floating point ops
> header2:
>   ; small loop using %vreg1
>   jnz header2
>   ...
>   jnz header1
>
> The def of %vreg1 has been hoisted by LICM so it is live across a
> block with lots of floating point code. The allocator uses the low xmm
> registers for the large block, and %xmm8 is left for %vreg1 which has
> a low spill weight. This significantly improves code size, but the
> small loop suffers.

Why does %xmm8 have a low spill weight? It's used in an inner loop.

> In this case it might have helped to split the live range and
> rematerialize, but usually that won't be the case.

That was my initial reaction. Splitting should have at least
rematerialized the value just before header2. That should significantly
improve things. This is a classic motivational case for live range
splitting.

Another way to approach this is to add a register pressure heuristic to
LICM so it doesn't spill so much stuff out over such a large loop body.

                                -Dave
On May 3, 2011, at 3:23 PM, David A. Greene wrote:

> Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:
>
>>>> The greedy allocator is trying to pick registers so inner loops are as
>>>> small as possible, but that is not always the right thing to do.
>>>
>>> How does it balance that against spill cost?
>>
>> I added the CostPerUse field to the register descriptions. The
>> allocator will try to minimize the spill weight assigned to registers
>> with a CostPerUse. It does it by swapping physical register
>> assignments, it won't do it if it requires extra spilling.
>
> CostPerUse models the encoding size of the register?

Yes, something like that.

>> This is actually the cause of the n-body regression. The benchmark has nested loops:
>>
>>   %vreg1 = const pool load
>> header1:
>>   ; large blocks with lots of floating point ops
>> header2:
>>   ; small loop using %vreg1
>>   jnz header2
>>   ...
>>   jnz header1
>>
>> The def of %vreg1 has been hoisted by LICM so it is live across a
>> block with lots of floating point code. The allocator uses the low xmm
>> registers for the large block, and %xmm8 is left for %vreg1 which has
>> a low spill weight. This significantly improves code size, but the
>> small loop suffers.
>
> Why does %xmm8 have a low spill weight? It's used in an inner loop.

Because it is rematerializable and live across a big block where it isn't used.

>> In this case it might have helped to split the live range and
>> rematerialize, but usually that won't be the case.
>
> That was my initial reaction. Splitting should have at least
> rematerialized the value just before header2. That should significantly
> improve things. This is a classic motivational case for live range
> splitting.

Well, not really. Note that there are plenty of registers available and no spilling is necessary. It's just that a REX prefix is required on some instructions when %xmm8 is used.

Is it worth it to undo LICM just for that? In this case, probably. In general, no.

/jakob
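
The trade-off discussed in this thread (smaller code versus a slower inner loop) can be put in rough numbers. The Python sketch below is only an illustration: the instruction counts and execution counts are made up, and `LiveValue`, `static_rex_bytes`, and `dynamic_rex_bytes` are hypothetical names. It assumes one extra REX byte for each instruction that references a high xmm register and would not otherwise need the prefix, and it uses "prefix bytes fetched" as a crude stand-in for run-time cost.

```python
from dataclasses import dataclass


@dataclass
class LiveValue:
    name: str
    instructions: int  # static instructions that reference the value
    executions: int    # how often each of those instructions runs (made up)


def static_rex_bytes(value):
    """Code-size cost of placing the value in a register that needs REX."""
    return value.instructions


def dynamic_rex_bytes(value):
    """Prefix bytes fetched at run time for the same placement."""
    return value.instructions * value.executions


# Hypothetical numbers: many uses in the large block, few in the hot inner loop.
big_block_value = LiveValue("fp temp in large block", 40, 1_000)
inner_loop_value = LiveValue("%vreg1 in small loop", 4, 100_000)

for candidate in (big_block_value, inner_loop_value):
    print(
        f"{candidate.name}: static={static_rex_bytes(candidate)} "
        f"dynamic={dynamic_rex_bytes(candidate)}"
    )
```

Under these assumptions, placing %vreg1 in the high register costs the fewest static bytes but the most dynamic ones, which is the tension between code size and inner-loop speed raised above.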