thr3ads.net - llvm dev - [LLVMdev] Greedy register allocation [May 2011]

Jakob Stoklund Olesen

2011-May-03 22:28 UTC

[LLVMdev] Greedy register allocation

On May 3, 2011, at 3:23 PM, David A. Greene wrote:
> Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:
> 
>>>> The greedy allocator is trying to pick registers so inner loops are as
>>>> small as possible, but that is not always the right thing to do.
>>> 
>>> How does it balance that against spill cost?
>> 
>> I added the CostPerUse field to the register descriptions. The
>> allocator will try to minimize the spill weight assigned to registers
>> with a CostPerUse. It does so by swapping physical register
>> assignments; it won't do so if that requires extra spilling.
> 
> CostPerUse models the encoding size of the register?

Yes, something like that.
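A rough sketch of the swapping idea described above, under strong simplifying assumptions: every virtual register interferes with every other one (so any swap is legal and none can cause spilling), and the quantity being minimized is spill weight times CostPerUse. The function names, weights, and register costs are made up for illustration; this is not LLVM's code.

```python
def total_cost(assignment, weights, cost_per_use):
    """Sum of spill weight times CostPerUse over all assigned registers."""
    return sum(weights[v] * cost_per_use[r] for v, r in assignment.items())


def improve_by_swapping(assignment, weights, cost_per_use):
    """Swap physical registers between pairs of virtual registers whenever
    the heavier one currently sits in the costlier register.

    Assumes all virtual registers are mutually interfering, so a swap never
    needs an extra register and therefore never forces a spill.
    """
    result = dict(assignment)
    vregs = list(result)
    changed = True
    while changed:
        changed = False
        for i, a in enumerate(vregs):
            for b in vregs[i + 1:]:
                heavy, light = (a, b) if weights[a] > weights[b] else (b, a)
                if (weights[heavy] > weights[light]
                        and cost_per_use[result[heavy]] > cost_per_use[result[light]]):
                    result[heavy], result[light] = result[light], result[heavy]
                    changed = True
    return result


# Hypothetical numbers: %vreg1 has a low spill weight, %xmm8 has a CostPerUse.
weights = {"%vreg1": 0.1, "%vreg2": 5.0, "%vreg3": 4.0}
cost_per_use = {"%xmm0": 0, "%xmm1": 0, "%xmm8": 1}
initial = {"%vreg1": "%xmm0", "%vreg2": "%xmm8", "%vreg3": "%xmm1"}

final = improve_by_swapping(initial, weights, cost_per_use)
print(final)
print(total_cost(initial, weights, cost_per_use), total_cost(final, weights, cost_per_use))
```

In this toy setting the low-weight register ends up in %xmm8, which mirrors what happens to %vreg1 in the n-body case discussed below.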
>> This is actually the cause of the n-body regression. The benchmark has
>> nested loops:
>> 
>> 	%vreg1 = const pool load
>> header1:
>> 	; large blocks with lots of floating point ops
>> header2:
>> 	; small loop using %vreg1
>> 	jnz header2
>> ...
>> 	jnz header1
>> 
> 
>> The def of %vreg1 has been hoisted by LICM so it is live across a
>> block with lots of floating point code. The allocator uses the low xmm
>> registers for the large block, and %xmm8 is left for %vreg1 which has
>> a low spill weight. This significantly improves code size, but the
>> small loop suffers.
> 
> Why does %xmm8 have a low spill weight?  It's used in an inner loop.

Because it is rematerializable and live across a big block where it isn't
used.
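To make the two effects concrete, here is a toy spill-weight model. It is only an assumption-laden sketch, not LLVM's actual formula: use frequencies are summed, divided by the live range size, and scaled down for rematerializable values. The 0.5 factor, the frequencies, and the sizes are all invented for illustration.

```python
def spill_weight(use_freqs, live_range_size, rematerializable, remat_factor=0.5):
    """Toy spill weight: use density over the live range, discounted when the
    value can be recomputed cheaply instead of being reloaded.

    use_freqs        -- estimated execution frequency of each def/use
    live_range_size  -- number of instructions the value is live across
    remat_factor     -- hypothetical discount for rematerializable values
    """
    if live_range_size <= 0:
        raise ValueError("live_range_size must be positive")
    weight = sum(use_freqs) / live_range_size
    if rematerializable:
        weight *= remat_factor
    return weight


# A hoisted constant: one def outside the loops, one hot use in the inner
# loop, but live across a large block where it is never touched.
hoisted = spill_weight([1.0, 100.0], live_range_size=400, rematerializable=True)

# A short-lived temporary inside the large floating point block.
temp = spill_weight([10.0, 10.0], live_range_size=4, rematerializable=False)

print(hoisted < temp)
```

Even though the hoisted value is used in the hottest loop, the long idle stretch and the rematerialization discount leave it with the lower weight in this model.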
>> In this case it might have helped to split the live range and
>> rematerialize, but usually that won't be the case.
> 
> That was my initial reaction.  Splitting should have at least
> rematerialized the value just before header2.  That should significantly
> improve things.  This is a classic motivational case for live range
> splitting.

Well, not really. Note that there are plenty of registers available and no
spilling is necessary.

It's just that an REX prefix is required on some instructions when %xmm8 is
used. Is it worth it to undo LICM just for that? In this case, probably. In
general, no.
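For a concrete sense of the size difference, the sketch below encodes the register-register form of `addsd` by hand. The choice of `addsd` is only an example (the thread does not say which instructions are in the loop), and the helper name is made up. Registers %xmm8 through %xmm15 need an extra REX byte.

```python
def encode_addsd(dst, src):
    """Encode `addsd xmm<src>, xmm<dst>` (AT&T order), register-register form.

    Layout: F2 [REX] 0F 58 ModRM. The REX byte is emitted only when one of
    the registers is xmm8..xmm15.
    """
    if not (0 <= dst < 16 and 0 <= src < 16):
        raise ValueError("xmm register index must be in 0..15")
    # REX.R (bit 2) extends ModRM.reg (dst); REX.B (bit 0) extends ModRM.rm (src).
    rex = 0x40 | ((dst >> 3) << 2) | (src >> 3)
    modrm = 0xC0 | ((dst & 7) << 3) | (src & 7)
    out = bytearray([0xF2])
    if rex != 0x40:
        out.append(rex)
    out += bytes([0x0F, 0x58, modrm])
    return bytes(out)


low = encode_addsd(dst=0, src=1)   # addsd %xmm1, %xmm0
high = encode_addsd(dst=0, src=8)  # addsd %xmm8, %xmm0
print(low.hex(), len(low))    # f20f58c1 4
print(high.hex(), len(high))  # f2410f58c0 5
```

One byte per instruction is small, but it adds up across every use of %xmm8 in a tight loop.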

/jakob

David A. Greene

2011-May-03 23:08 UTC

[LLVMdev] Greedy register allocation

Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:
>> That was my initial reaction.  Splitting should have at least
>> rematerialized the value just before header2.  That should significantly
>> improve things.  This is a classic motivational case for live range
>> splitting.
>
> Well, not really. Note that there are plenty of registers available
> and no spilling is necessary.

Oh, I misunderstood then.  Thanks for clarifying.

> It's just that an REX prefix is required on some instructions when
> %xmm8 is used. Is it worth it to undo LICM just for that? In this
> case, probably. In general, no.

Ah, so you're saying the regression is due to the inner loop icache
footprint increasing.  Ok, that makes total sense to me.  I agree this
is a difficult thing to get right in a general sort of way.  Perhaps the
CostPerUse (or whatever heuristics use it) can factor in the loop body
size so that tight loops are favored for smaller encodings.

                               -Dave

Chris Lattner

2011-May-04 12:17 UTC

[LLVMdev] Greedy register allocation

On May 3, 2011, at 4:08 PM, David A. Greene wrote:
>> 
>> It's just that an REX prefix is required on some instructions when
>> %xmm8 is used. Is it worth it to undo LICM just for that? In this
>> case, probably. In general, no.
> 
> Ah, so you're saying the regression is due to the inner loop icache
> footprint increasing.  Ok, that makes total sense to me.  I agree this
> is a difficult thing to get right in a general sort of way.  Perhaps the
> CostPerUse (or whatever heuristics use it) can factor in the loop body
> size so that tight loops are favored for smaller encodings.

It is almost certainly because the inner loop doesn't fit in the processor's
predecode loop buffer.  Modern Intel x86 chips have a buffer that can hold a
very small number of instructions and is bounded by instruction count, code
size, and sometimes the number of cache lines.  If a loop fits in this buffer,
the processor can turn off the decoder completely for the loop, a significant
power and performance win.

I don't know how realistic it is to model the loop buffer in the register
allocator, but this would be a very interesting thing to try to optimize for
in a later pass.  If an inner loop "almost" fits, then it would probably be
worth a heroic effort to reduce its size and shave off a few bytes.
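As a sketch of what such a later pass might check, the function below tests a loop body against the three kinds of limits mentioned above. The limit values in the example are placeholders invented for illustration, not figures for any real processor, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopBufferLimits:
    """Hypothetical loop buffer limits; real values vary by microarchitecture."""
    max_instructions: int
    max_bytes: int
    max_lines: int
    line_size: int


def fits_loop_buffer(instr_sizes, start_addr, limits):
    """Return True if a loop body with the given instruction sizes (in bytes),
    starting at start_addr, stays within all three limits."""
    total = sum(instr_sizes)
    if total == 0:
        return True
    first_line = start_addr // limits.line_size
    last_line = (start_addr + total - 1) // limits.line_size
    lines = last_line - first_line + 1
    return (len(instr_sizes) <= limits.max_instructions
            and total <= limits.max_bytes
            and lines <= limits.max_lines)


# Placeholder limits, for illustration only.
limits = LoopBufferLimits(max_instructions=18, max_bytes=64, max_lines=4, line_size=16)

small = [4] * 15               # 15 instructions, 60 bytes
with_rex = [5] * 5 + [4] * 10  # same loop, five instructions grow by a REX byte

print(fits_loop_buffer(small, 0, limits))
print(fits_loop_buffer(with_rex, 0, limits))
print(fits_loop_buffer(small, 8, limits))  # same bytes, different alignment
```

With these made-up limits, five extra prefix bytes are enough to push the loop over the byte limit, and a misaligned start pushes the unchanged loop over the line limit, which is the kind of "almost fits" case described above.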

-Chris
