On May 3, 2011, at 4:08 PM, David A. Greene wrote:

>> It's just that a REX prefix is required on some instructions when
>> %xmm8 is used. Is it worth it to undo LICM just for that? In this
>> case, probably. In general, no.
>
> Ah, so you're saying the regression is due to the inner loop icache
> footprint increasing. Ok, that makes total sense to me. I agree this
> is a difficult thing to get right in a general sort of way. Perhaps the
> CostPerUse (or whatever heuristics use it) can factor in the loop body
> size so that tight loops are favored for smaller encodings.

It is almost certainly because the inner loop doesn't fit in the processor's predecode loop buffer. Modern Intel x86 chips have a buffer that can hold a very small number of instructions and is bounded by instruction count, code size, and sometimes the number of cache lines. If a loop fits in this buffer, the processor can turn off the decoder completely for the loop, which is a significant power and performance win.

I don't know how realistic it is to model the loop buffer in the register allocator, but this would be a very interesting thing to try to optimize for in a later pass. If an inner loop "almost" fits, then it would probably be worth heroic effort to try to reduce its size to shave off a few bytes.

-Chris
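To make the size question concrete, here is a minimal C++ sketch of the kind of model such a pass might consult. The buffer limits, struct layout, and function names below are illustrative assumptions only; they are not the real constraints of any particular Intel core, and they are not LLVM's API.

#include <vector>

// Hypothetical summary of one already-encoded machine instruction.
struct InstrInfo {
  unsigned EncodedBytes;   // size of this instruction's encoding in bytes
  bool NeedsREX;           // true if a REX prefix is part of that encoding
                           // (e.g. an operand uses %xmm8-%xmm15 or %r8-%r15)
};

// Placeholder limits for the predecode loop buffer.  The real constraints
// differ per microarchitecture; these numbers are illustrative only.
struct LoopBufferModel {
  unsigned MaxInstrs = 28;   // assumed bound on instruction count
  unsigned MaxBytes  = 256;  // assumed bound on total code size
};

// Returns true if the loop body is small enough to be served from the loop
// buffer under this toy model, i.e. the decoder could be shut off.
static bool fitsInLoopBuffer(const std::vector<InstrInfo> &Body,
                             const LoopBufferModel &Model) {
  unsigned Bytes = 0;
  for (const InstrInfo &I : Body)
    Bytes += I.EncodedBytes;
  return Body.size() <= Model.MaxInstrs && Bytes <= Model.MaxBytes;
}

// Bytes that could be saved if every REX-carrying instruction were rewritten
// to use a low register and drop the prefix (a REX prefix is one byte).
static unsigned rexBytesSavable(const std::vector<InstrInfo> &Body) {
  unsigned Saved = 0;
  for (const InstrInfo &I : Body)
    if (I.NeedsREX)
      Saved += 1;
  return Saved;
}

A late pass could compare fitsInLoopBuffer() before and after a candidate rewrite and only bother with the "heroic effort" when rexBytesSavable() is enough to move the loop from "almost fits" to "fits".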
On May 4, 2011, at 5:17 AM, Chris Lattner <clattner at apple.com> wrote:

> On May 3, 2011, at 4:08 PM, David A. Greene wrote:
>
>>> It's just that a REX prefix is required on some instructions when
>>> %xmm8 is used. Is it worth it to undo LICM just for that? In this
>>> case, probably. In general, no.
>>
>> Ah, so you're saying the regression is due to the inner loop icache
>> footprint increasing. Ok, that makes total sense to me. I agree this
>> is a difficult thing to get right in a general sort of way. Perhaps the
>> CostPerUse (or whatever heuristics use it) can factor in the loop body
>> size so that tight loops are favored for smaller encodings.
>
> It is almost certainly because the inner loop doesn't fit in the processor's predecode loop buffer. Modern Intel x86 chips have a buffer that can hold a very small number of instructions and is bounded by instruction count, code size, and sometimes the number of cache lines. If a loop fits in this buffer, the processor can turn off the decoder completely for the loop, which is a significant power and performance win.
>
> I don't know how realistic it is to model the loop buffer in the register allocator, but this would be a very interesting thing to try to optimize for in a later pass. If an inner loop "almost" fits, then it would probably be worth heroic effort to try to reduce its size to shave off a few bytes.

Jakob and I have talked about this briefly. The idea is to insert a copy from the larger register to the smaller register in the loop preheader and change the references inside the loop. However, the reality is that this is a lot harder than it sounds.

1. When is this profitable? We can model the size of the loop buffer, but profitability also depends on loop alignment. We may have to align small loops on a 32-byte boundary to get this right.

2. When should we do this? I was thinking of a late pass, but it's harder to reason about dependencies once register allocation is done. Perhaps it should be done before the virtual registers are rewritten?

3. How restrictive is the optimization? What if the register is updated inside the loop? We have to be careful about instructions requiring specific register inputs. What if the register is updated and live out of the loop? We would need to copy it back at the loop exits.

We should start by tackling the restrictive forms of the problem. That should fix a number of cases where linear scan got lucky. I am sure there are a lot more interesting cases, though.

Evan
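The restrictive form Evan describes can be sketched roughly as below. The data structures, register names, and the movaps opcode string are invented for illustration and do not correspond to LLVM's actual MachineInstr representation; the sketch only handles the "not redefined in the loop" check, and a real pass would also have to verify the register is not live out and that no instruction demands that specific physical register.

#include <string>
#include <vector>

// Toy representation of a machine instruction: an opcode plus register
// operand names such as "xmm8" or "xmm2".  Purely illustrative.
struct MInstr {
  std::string Opcode;
  std::vector<std::string> Uses;   // registers read
  std::vector<std::string> Defs;   // registers written
};

struct MLoop {
  std::vector<MInstr> Preheader;   // executed once before the loop
  std::vector<MInstr> Body;        // the inner loop itself
};

// The "easy and obvious" case only: reject the rewrite if HighReg is
// written anywhere inside the loop body.
static bool isSafeToRewrite(const MLoop &L, const std::string &HighReg) {
  for (const MInstr &I : L.Body)
    for (const std::string &D : I.Defs)
      if (D == HighReg)
        return false;            // updated inside the loop: skip for now
  return true;
}

// Insert "copy LowReg <- HighReg" in the preheader and retarget all uses of
// HighReg inside the body to LowReg, which needs no REX prefix.
static void hoistToLowRegister(MLoop &L, const std::string &HighReg,
                               const std::string &LowReg) {
  if (!isSafeToRewrite(L, HighReg))
    return;
  L.Preheader.push_back({"movaps", {HighReg}, {LowReg}});
  for (MInstr &I : L.Body)
    for (std::string &U : I.Uses)
      if (U == HighReg)
        U = LowReg;
}

The copy only pays off if LowReg is actually free across the whole loop, which is why running this before the virtual registers are rewritten, or with accurate liveness in a late pass, matters.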
On May 4, 2011, at 8:23 AM, Evan Cheng wrote:

>> I don't know how realistic it is to model the loop buffer in the register allocator, but this would be a very interesting thing to try to optimize for in a later pass. If an inner loop "almost" fits, then it would probably be worth heroic effort to try to reduce its size to shave off a few bytes.
>
> Jakob and I have talked about this briefly. The idea is to insert a copy from the larger register to the smaller register in the loop preheader and change the references inside the loop. However, the reality is that this is a lot harder than it sounds.
>
> 1. When is this profitable? We can model the size of the loop buffer, but profitability also depends on loop alignment. We may have to align small loops on a 32-byte boundary to get this right.
>
> 2. When should we do this? I was thinking of a late pass, but it's harder to reason about dependencies once register allocation is done. Perhaps it should be done before the virtual registers are rewritten?
>
> 3. How restrictive is the optimization? What if the register is updated inside the loop? We have to be careful about instructions requiring specific register inputs. What if the register is updated and live out of the loop? We would need to copy it back at the loop exits.
>
> We should start by tackling the restrictive forms of the problem. That should fix a number of cases where linear scan got lucky. I am sure there are a lot more interesting cases, though.

Yep, doing this as a late pass and having it only handle the "easy and obvious" cases would make a lot of sense.

I bet that some non-trivial number of the major speedups that Jakob is seeing are due to "greedy getting lucky". The bad thing about this is that relying on luck makes performance analysis extremely difficult. It is much better to actively control these things where possible so that we both get good performance all the time and have more useful and comparable perf analysis. I know you already know this :)

-Chris
Chris Lattner <clattner at apple.com> writes:

> I don't know how realistic it is to model the loop buffer in the
> register allocator, but this would be a very interesting thing to try to
> optimize for in a later pass. If an inner loop "almost" fits, then it
> would probably be worth heroic effort to try to reduce its size
> to shave off a few bytes.

I like this approach as well, since it could cover a number of peephole-type optimizations.

-Dave
Evan Cheng <evan.cheng at apple.com> writes:

> 1. When is this profitable? We can model the size of the loop buffer, but
> profitability also depends on loop alignment. We may have to align small
> loops on a 32-byte boundary to get this right.

We almost always want to do this anyway on the newest processors.

-Dave