Johnson, Nicholas Paul via llvm-dev
2017-Jan-09 22:55 UTC
[llvm-dev] Tweaking the Register Allocator's spill placement
Hello,

My target features some very-high-latency instructions that access an on-chip network (we'll call them FXLV). In one important kernel (snippet below), register allocation needs to spill values resulting from FXLV. The spiller is unaware of FXLV's latency, and thus naively inserts those spills immediately after the FXLV, incurring huge and unnecessary data stalls.

    FXLV r10, 0(r3.0)
    SV   r10, 0(r63.1)    # spill to stack slot
    FXLV r10, 16(r3.0)
    SV   r10, 16(r63.1)   # spill to stack slot
    FXLV r10, 32(r3.0)
    SV   r10, 32(r63.1)   # spill to stack slot
    FXLV r10, 48(r3.0)
    SV   r10, 48(r63.1)   # spill to stack slot
    ...

Note also how the register allocator unfortunately re-uses register r10. This prevents post-RA scheduling from helping.

A better sequence of spills would hide the latency thus:

    FXLV r10, 0(r3.0)
    FXLV r11, 16(r3.0)
    FXLV r12, 32(r3.0)
    FXLV r13, 48(r3.0)
    ...
    SV   r10, 0(r63.1)    # spill to stack slot
    SV   r11, 16(r63.1)   # spill to stack slot
    SV   r12, 32(r63.1)   # spill to stack slot
    SV   r13, 48(r63.1)   # spill to stack slot
    ...

Any suggestions of how I could achieve this better spill sequence? Do I need to write a target-specific implementation of InlineSpiller?

Thanks,
Nick Johnson
D. E. Shaw Research
Quentin Colombet via llvm-dev
2017-Jan-09 23:56 UTC
[llvm-dev] Tweaking the Register Allocator's spill placement
Hi Nick,

> On Jan 9, 2017, at 2:55 PM, Johnson, Nicholas Paul via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> [...]
>
> Any suggestions of how I could achieve this better spill sequence? Do I need to write a target-specific implementation of InlineSpiller?

Have you tried running the post-RA scheduler? The problem with spilling is that you want to keep the live range in a register for as short a time as possible, because otherwise you may need to spill more.

Cheers,
-Quentin

> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
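For reference, a target opts in to the post-RA scheduler through its subtarget. A minimal sketch, where MyTargetSubtarget and its generated base class are placeholder names for your target; the overridden hook is declared in llvm::TargetSubtargetInfo:

```cpp
// Sketch only: fragment of a hypothetical MyTargetSubtarget class.
class MyTargetSubtarget : public MyTargetGenSubtargetInfo {
public:
  // Run the post-RA scheduler for this subtarget.
  bool enablePostRAScheduler() const override { return true; }
};
```

By default this hook defers to the scheduling model, so a target with long-latency instructions may need to opt in explicitly.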
Matthias Braun via llvm-dev
2017-Jan-10 00:34 UTC
[llvm-dev] Tweaking the Register Allocator's spill placement
> On Jan 9, 2017, at 2:55 PM, Johnson, Nicholas Paul via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> [...]
>
> Note also how the register allocator unfortunately re-uses register r10. This prevents Post-RA scheduling from helping.

The old post-RA list scheduler (PostRASchedulerList) used the {Aggressive|Critical}AntiDepBreaker classes to improve this sort of situation. They are not used by the PostMachineScheduler (I can only guess, but possibly out of frustration with bugs and hard-to-understand code in the *DepBreakers). You could experiment with using those.

- Matthias
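For what it's worth, the anti-dependence breakers are selected through subtarget hooks; a sketch of what the opt-in might look like (MyTarget names are placeholders, and GPRRegClass is a hypothetical register class; the hooks and the AntiDepBreakMode enum come from llvm::TargetSubtargetInfo):

```cpp
// Sketch only: fragment of a hypothetical MyTargetSubtarget class.
// In the old post-RA scheduler, ANTIDEP_CRITICAL selects the
// CriticalAntiDepBreaker and ANTIDEP_ALL the AggressiveAntiDepBreaker.
AntiDepBreakMode getAntiDepBreakMode() const override {
  return ANTIDEP_ALL; // aggressively rename to break anti/output deps
}

// Restrict aggressive breaking to register classes on the critical path.
void getCriticalPathRCs(RegClassVector &CriticalPathRCs) const override {
  CriticalPathRCs.clear();
  CriticalPathRCs.push_back(&MyTarget::GPRRegClass); // hypothetical class
}
```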
Hal Finkel via llvm-dev
2017-Jan-10 00:38 UTC
[llvm-dev] Tweaking the Register Allocator's spill placement
On 01/09/2017 06:34 PM, Matthias Braun via llvm-dev wrote:
>> On Jan 9, 2017, at 2:55 PM, Johnson, Nicholas Paul via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> [...]
>>
>> Note also how the register allocator unfortunately re-uses register r10. This prevents Post-RA scheduling from helping.
> The old post-ra-scheduler-list was using the {Aggressive|Critical}AntiDepBreaker classes to improve this sort of situation. They are not used for the MachinePostScheduler (I can only guess but possibly out of frustration of bugs and hard to understand code in the *DepBreakers). You could experiment with using those.

FWIW, we should fix this at some point if we want to be able to retire the old post-RA scheduler. Anti-dependence breaking still gives measurable performance gains, on PowerPC at least. Last I discussed this with Andy, there was some hope we could do without it; if not, we just needed someone to do the integration work.

 -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Johnson, Nicholas Paul via llvm-dev
2017-Jan-10 15:43 UTC
[llvm-dev] Tweaking the Register Allocator's spill placement
>> On Jan 9, 2017, at 2:55 PM, Johnson, Nicholas Paul via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Note also how the register allocator unfortunately re-uses register r10. This prevents Post-RA scheduling from helping.

Matthias wrote:
> The old post-ra-scheduler-list was using the {Aggressive|Critical}AntiDepBreaker classes to improve this sort of situation. They are not used for the MachinePostScheduler (I can only guess but possibly out of frustration of bugs and hard to understand code in the *DepBreakers). You could experiment with using those.

If I understand correctly, these dep breakers look for anti-dependences, then try to break them after the fact by renaming a register (assuming there is an available register). I am investigating...

It seems that one could instead fix this before the fact, that is, discourage the register allocator from repeatedly choosing the same register (again assuming there are additional available registers). With luck, there would be fewer anti- and output-dependences, because adjacent live ranges would not share a common register. I imagine people have tried this and there is a good reason not to...

Thanks,
Nick Johnson
D. E. Shaw Research