On May 3, 2011, at 3:23 PM, David A. Greene wrote:

> Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:
>
>>>> The greedy allocator is trying to pick registers so inner loops are as
>>>> small as possible, but that is not always the right thing to do.
>>>
>>> How does it balance that against spill cost?
>>
>> I added the CostPerUse field to the register descriptions. The
>> allocator will try to minimize the spill weight assigned to registers
>> with a CostPerUse. It does it by swapping physical register
>> assignments; it won't do it if it requires extra spilling.
>
> CostPerUse models the encoding size of the register?

Yes, something like that.

>> This is actually the cause of the n-body regression. The benchmark has
>> nested loops:
>>
>> %vreg1 = const pool load
>> header1:
>> ; large blocks with lots of floating point ops
>> header2:
>> ; small loop using %vreg1
>> jnz header2
>> ...
>> jnz header1
>>
>> The def of %vreg1 has been hoisted by LICM so it is live across a
>> block with lots of floating point code. The allocator uses the low xmm
>> registers for the large block, and %xmm8 is left for %vreg1 which has
>> a low spill weight. This significantly improves code size, but the
>> small loop suffers.
>
> Why does %xmm8 have a low spill weight? It's used in an inner loop.

Because it is rematerializable and live across a big block where it
isn't used.

>> In this case it might have helped to split the live range and
>> rematerialize, but usually that won't be the case.
>
> That was my initial reaction. Splitting should have at least
> rematerialized the value just before header2. That should significantly
> improve things. This is a classic motivational case for live range
> splitting.

Well, not really. Note that there are plenty of registers available and
no spilling is necessary. It's just that an REX prefix is required on
some instructions when %xmm8 is used. Is it worth it to undo LICM just
for that? In this case, probably. In general, no.

/jakob
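To make the encoding-size point concrete, here is a minimal sketch (plain Python, not LLVM code) that counts how many extra bytes an inner loop pays when a value lands in %xmm8 or higher. It relies on one x86-64 fact: legacy SSE encodings can only address %xmm8..%xmm15 with a REX prefix, which adds one byte to the instruction. The helper names and the example loop bodies are hypothetical, and the sketch assumes no instruction already carries a REX prefix for some other reason.

```python
def needs_rex(xmm_regs):
    """True if any operand is %xmm8..%xmm15, which legacy SSE encodings
    can only reach with a REX prefix."""
    return any(8 <= r <= 15 for r in xmm_regs)


def extra_rex_bytes(loop_body):
    """Count the REX bytes added to a loop body.

    loop_body is a list of instructions, each given as a tuple of the xmm
    register numbers it uses. Assumption: an instruction has no REX prefix
    unless a high xmm register forces one, so each such instruction grows
    by exactly one byte.
    """
    return sum(1 for regs in loop_body if needs_rex(regs))


# Hypothetical inner loop: three instructions read the hoisted constant.
with_xmm8 = [(0, 8), (1, 8), (2, 8), (0, 1)]  # constant assigned to %xmm8
with_xmm3 = [(0, 3), (1, 3), (2, 3), (0, 1)]  # constant assigned to %xmm3

print(extra_rex_bytes(with_xmm8))  # 3
print(extra_rex_bytes(with_xmm3))  # 0
```

The same register choice that costs nothing in a large block executed once per outer iteration costs one byte per use in the small loop, which is the trade-off discussed above.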
Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:

>> That was my initial reaction. Splitting should have at least
>> rematerialized the value just before header2. That should significantly
>> improve things. This is a classic motivational case for live range
>> splitting.
>
> Well, not really. Note that there are plenty of registers available
> and no spilling is necessary.

Oh, I misunderstood then. Thanks for clarifying.

> It's just that an REX prefix is required on some instructions when
> %xmm8 is used. Is it worth it to undo LICM just for that? In this
> case, probably. In general, no.

Ah, so you're saying the regression is due to the inner loop icache
footprint increasing. Ok, that makes total sense to me. I agree this
is a difficult thing to get right in a general sort of way. Perhaps the
CostPerUse (or whatever heuristics use it) can factor in the loop body
size so that tight loops are favored for smaller encodings.

-Dave
On May 3, 2011, at 4:08 PM, David A. Greene wrote:

>> It's just that an REX prefix is required on some instructions when
>> %xmm8 is used. Is it worth it to undo LICM just for that? In this
>> case, probably. In general, no.
>
> Ah, so you're saying the regression is due to the inner loop icache
> footprint increasing. Ok, that makes total sense to me. I agree this
> is a difficult thing to get right in a general sort of way. Perhaps the
> CostPerUse (or whatever heuristics use it) can factor in the loop body
> size so that tight loops are favored for smaller encodings.

It is almost certainly that the inner loop doesn't fit in the
processor's predecode loop buffer. Modern Intel x86 chips have a buffer
that can hold a very small number of instructions and is bound by
instruction count, code size, and sometimes the number of cache lines.
If a loop fits in this buffer, the processor can turn off the decoder
completely for the loop, a significant power and performance win.

I don't know how realistic it is to model the loop buffer in the
register allocator, but this would be a very interesting thing to try
to optimize for in a later pass. If an inner loop "almost" fits, then
it would probably be worth heroic effort to try to reduce the size of
it to shave off a few bytes.

-Chris
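As a rough illustration of the three bounds mentioned above (instruction count, code size, number of lines), the following Python sketch checks whether a loop would fit in such a buffer. Everything here is an assumption made for illustration: the `LoopBufferLimits` type, the function names, and especially the numeric limits, which are placeholders and not the parameters of any particular processor. It is not how LLVM models anything.

```python
from dataclasses import dataclass


@dataclass
class LoopBufferLimits:
    """Hypothetical loop-buffer parameters; real values vary by chip."""
    max_instrs: int   # maximum number of instructions
    max_bytes: int    # maximum total code size in bytes
    max_lines: int    # maximum number of aligned fetch lines touched
    line_bytes: int   # size of one aligned fetch line in bytes


def lines_touched(start_addr, size, line_bytes):
    """Number of aligned lines covered by [start_addr, start_addr + size)."""
    if size == 0:
        return 0
    first = start_addr // line_bytes
    last = (start_addr + size - 1) // line_bytes
    return last - first + 1


def fits(start_addr, instr_sizes, limits):
    """True if the loop satisfies all three bounds at once."""
    total = sum(instr_sizes)
    return (len(instr_sizes) <= limits.max_instrs
            and total <= limits.max_bytes
            and lines_touched(start_addr, total, limits.line_bytes)
                <= limits.max_lines)


# Placeholder limits, chosen only to make the example concrete.
limits = LoopBufferLimits(max_instrs=18, max_bytes=64, max_lines=4,
                          line_bytes=16)

small_loop = [4] * 16                 # 16 instructions, 64 bytes
with_rex = [5] * 3 + [4] * 13         # three instructions gained a REX byte

print(fits(0, small_loop, limits))    # True
print(fits(0, with_rex, limits))      # False: 67 bytes exceeds 64
print(fits(8, small_loop, limits))    # False: misaligned, touches 5 lines
```

With these placeholder limits, three extra prefix bytes are enough to push a loop that exactly fits over the size bound, and misalignment alone can do the same through the line bound. That is the "almost fits" situation where shaving a few bytes could matter.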