Momchil Velikov via llvm-dev
2020-Apr-14 14:17 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
Hi,

> +  MachineInstr *callMI = static_cast<MachineInstr *> (MIB);

Use MIB.getInstr() here.

~chill
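A minimal sketch of what the review comment means (illustrative only, not taken from the posted patch; the basic block, insertion point, debug location and instruction description are assumed to be in scope in the real code). MachineInstrBuilder already exposes the instruction it wraps:

    // Illustrative only: retrieve the MachineInstr from a MachineInstrBuilder
    // via its accessor rather than by casting the builder itself.
    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/MachineInstrBuilder.h"
    using namespace llvm;

    static MachineInstr *emitCall(MachineBasicBlock &MBB,
                                  MachineBasicBlock::iterator InsertPt,
                                  const DebugLoc &DL, const MCInstrDesc &Desc) {
      MachineInstrBuilder MIB = BuildMI(MBB, InsertPt, DL, Desc);
      return MIB.getInstr(); // instead of static_cast<MachineInstr *>(MIB)
    }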
John Brawn via llvm-dev
2020-Apr-14 22:06 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
> Could you please point out what am I doing wrong in the patch ?

It's because you're getting the function name by doing
  callee->getName().str().c_str()
The str() call generates a temporary copy of the name which ceases to exist outside of this
expression, causing the c_str() to return a pointer to no-longer-valid memory.

I'm not sure what the correct way of getting the name as a char* is here. Doing
  callee->getName().data()
appears to work, though I don't know if we can rely on the StringRef that getName() returns
being appropriately null-terminated.

> However for above case, IIUC, we would want all calls to be converted to bl ?

It would be better, yes, though I'm not sure how you'd go about making that happen. It's
probably not worth worrying too much about, though, as the new behaviour is still better than
the old.

John
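The lifetime problem described above can be reproduced in isolation. The two helpers below are illustrative only (they are not from the patch) and just contrast the dangling pointer with the not-necessarily-terminated alternative; StringRef::str() returns a std::string by value, so its buffer dies at the end of the full expression:

    #include "llvm/ADT/StringRef.h"
    #include <string>

    const char *getNameBroken(llvm::StringRef Name) {
      // str() materialises a temporary std::string; c_str() points into it.
      // The temporary is destroyed at the end of this statement, so the
      // returned pointer dangles.
      return Name.str().c_str();
    }

    const char *getNameRisky(llvm::StringRef Name) {
      // data() points at the original storage, but a StringRef is not
      // guaranteed to be null-terminated, as noted above.
      return Name.data();
    }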
Prathamesh Kulkarni via llvm-dev
2020-Apr-15 00:44 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
Hi,
I have attached a WIP patch for adding foldMemoryOperand to Thumb1InstrInfo.
For the following case:

void f(int x, int y, int z)
{
  void bar(int, int, int);

  bar(x, y, z);
  bar(x, z, y);
  bar(y, x, z);
  bar(y, y, x);
}

it calls foldMemoryOperand twice, and thus converts two calls from blx to bl.
callMI->dump() shows the function name "bar" correctly, however in the generated
assembly the call to bar is garbled
(compiled with -Oz --target=arm-linux-gnueabi -march=armv6-m):

        add     r7, sp, #16
        mov     r6, r2
        mov     r5, r1
        mov     r4, r0
        bl      "<90>w\n "
        mov     r1, r2
        mov     r2, r5
        bl      "<90>w\n "
        mov     r0, r5
        mov     r1, r4
        mov     r2, r6
        ldr     r6, .LCPI0_0
        blx     r6
        mov     r0, r5
        mov     r1, r5
        mov     r2, r4
        blx     r6

The regalloc dump (attached) shows:

Inline spilling tGPR:%9 [80r,152r:0) 0 at 80r weight:3.209746e-03
From original %3
also spill snippet %8 [152r,232r:0) 0 at 152r weight:2.104167e-03
    tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2
folded:  144r  tBL 14, $noreg, &"\E0\9C\06\A0\FC\7F", implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 :: (load 4 from constant-pool)
remat:   228r  %10:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool)
         232e  %7:tgpr = COPY killed %10:tgpr

Could you please point out what am I doing wrong in the patch ?

Also, I guess, it only converted two calls to bl because further spilling wasn't
necessary. However for the above case, IIUC, we would want all calls to be
converted to bl ? Since:
4 bl == 16 bytes
2 bl + 2 blx + 1 ldr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4 bytes (litpool) == 18 bytes

Thanks,
Prathamesh

On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni
<prathamesh.kulkarni at linaro.org> wrote:
>
> Hi John,
> Thanks for the suggestions! I will start looking at adding
> foldMemoryOperand to ARMInstrInfo.
>
> Thanks,
> Prathamesh
>
> On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at arm.com> wrote:
> >
> > If I'm understanding what's going on in this test correctly, what's happening is:
> > * ARMTargetLowering::LowerCall prefers indirect calls when a function is called at least 3 times in minsize
> > * In thumb 1 (without -fno-omit-frame-pointer) we have effectively only 3 callee-saved registers (r4-r6)
> > * The function has three arguments, so those three plus the register we need to hold the function address is more than our callee-saved registers
> > * Therefore something needs to be spilt
> > * The function address can be rematerialized, so we spill that and insert an LDR before each call
> >
> > If we didn't have this spilling happening (e.g. if the function had one less argument) then the code size of using BL vs BLX is:
> > * BL: 3*4-byte BL = 12 bytes
> > * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12 bytes
> > (So maybe even not considering spilling, LowerCall should be adjusted to do this for functions called 4 or more times)
> >
> > When we have to spill, if we compare spilling the function's address vs spilling an argument:
> > * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte litpool = 16 bytes
> > * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes
> > So just changing the spilling heuristic won't work.
> >
> > The two ways I see of fixing this:
> > * In LowerCall only prefer an indirect call if the number of integer register arguments is less than the number of callee-saved registers.
> > * When the load of the function address is spilled, instead of just rematerializing the load, convert the BX back into BL.
> >
> > The first of these would be easier, but there will be situations where we need to use less than three callee-saved registers (e.g. arguments are loaded from a pointer) and there are situations where we will spill the function address for reasons entirely unrelated to the function arguments (e.g. if we have enough live local variables).
> >
> > For the second, looking at InlineSpiller.cpp it does have the concept of rematerializing by folding a memory operand into another instruction, so I think we could make use of that to do this. It looks like it would involve adding a foldMemoryOperand function to ARMInstrInfo and then have this fold an LDR into a BX by turning it into a BL.
> >
> > John
> >
> > ________________________________
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org>
> > Sent: 07 April 2020 21:07
> > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
> >
> > > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org> wrote:
> > > >
> > > > Hi,
> > > > Compiling the attached test-case, which is a reduced version of
> > > > uECC_shared_secret from the tinycrypt library [1], with
> > > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S
> > > > results in reloading of the register holding the function's address before
> > > > every call to blx:
> > > >
> > > >         ldr     r3, .LCPI0_0
> > > >         blx     r3
> > > >         mov     r0, r6
> > > >         mov     r1, r5
> > > >         mov     r2, r4
> > > >         ldr     r3, .LCPI0_0
> > > >         blx     r3
> > > >         ldr     r3, .LCPI0_0
> > > >         mov     r0, r6
> > > >         mov     r1, r5
> > > >         mov     r2, r4
> > > >         blx     r3
> > > >
> > > > .LCPI0_0:
> > > >         .long   foo
> > > >
> > > > From the dump of regalloc (attached), AFAIU, what seems to happen during
> > > > the greedy allocator is that all virt regs %0 to %3 are live across the first
> > > > two calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4
> > > > respectively, and %3, which holds foo's address, doesn't have any
> > > > register left.
> > > > Since its live-range has the least weight, it does not evict any existing
> > > > interval, and gets split. Eventually we have the following allocation:
> > > >
> > > > [%0 -> $r6] tGPR
> > > > [%1 -> $r5] tGPR
> > > > [%2 -> $r4] tGPR
> > > > [%6 -> $r3] tGPR
> > > > [%11 -> $r3] tGPR
> > > > [%16 -> $r3] tGPR
> > > > [%17 -> $r3] tGPR
> > > >
> > > > where %6, %11, %16 and %17 are all derived from %3.
> > > > And since r3 is a call-clobbered register, the compiler is forced to
> > > > reload foo's address each time before blx.
> > > >
> > > > To fix this, I thought of the following approaches:
> > > > (a) Disable the heuristic to prefer an indirect call when there are at
> > > > least 3 calls to the same function in a basic block in
> > > > ARMTargetLowering::LowerCall for the Thumb-1 ISA.
> > > >
> > > > (b) In ARMTargetLowering::LowerCall, put another constraint like the
> > > > number of arguments, as a proxy for register pressure for Thumb-1, but
> > > > that's bound to trip other cases.
> > > >
> > > > (c) Give higher priority to allocating the virt reg used for indirect calls ?
> > > > However, if that results in spilling of some other register, it would defeat
> > > > the purpose of saving code-size. I suppose ideally we want to trigger the
> > > > heuristic of using an indirect call only when we know beforehand that it
> > > > will not result in spilling.
But I am not sure if it's possible to > > > estimate that during isel ? > > > > > > I would be grateful for suggestions on how to proceed further. > > ping ? > > > > Thanks, > > Prathamesh > > > > > > [1] https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139 > > > > > > Thanks, > > > Prathamesh > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- Computing live-in reg-units in ABI blocks. 0B %bb.0 R0#0 R1#0 R2#0 Created 3 new intervals. ********** INTERVALS ********** R0 [0B,48r:0)[96r,144r:4)[192r,240r:3)[288r,336r:2)[384r,432r:1) 0 at 0B-phi 1 at 384r 2 at 288r 3 at 192r 4 at 96r R1 [0B,32r:0)[112r,144r:4)[208r,240r:3)[304r,336r:2)[400r,432r:1) 0 at 0B-phi 1 at 400r 2 at 304r 3 at 208r 4 at 112r R2 [0B,16r:0)[128r,144r:4)[224r,240r:3)[320r,336r:2)[416r,432r:1) 0 at 0B-phi 1 at 416r 2 at 320r 3 at 224r 4 at 128r %0 [48r,416r:0) 0 at 48r weight:0.000000e+00 %1 [32r,400r:0) 0 at 32r weight:0.000000e+00 %2 [16r,320r:0) 0 at 16r weight:0.000000e+00 %3 [80r,432r:0) 0 at 80r weight:0.000000e+00 RegMasks: 144r 240r 336r 432r ********** MACHINEINSTRS ********** # Machine code for function f: NoPHIs, TracksLiveness Constant Pool: cp#0: @bar, align=4 Function Live Ins: $r0 in %0, $r1 in %1, $r2 in %2 0B bb.0.entry: liveins: $r0, $r1, $r2 16B %2:tgpr = COPY $r2 32B %1:tgpr = COPY $r1 48B %0:tgpr = COPY $r0 64B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 80B %3:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) 96B $r0 = COPY %0:tgpr 112B $r1 = COPY %1:tgpr 128B $r2 = COPY %2:tgpr 144B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 160B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 176B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 192B $r0 = COPY %0:tgpr 208B $r1 = COPY %2:tgpr 224B $r2 = COPY %1:tgpr 240B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 256B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 272B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 288B $r0 = COPY %1:tgpr 304B $r1 = COPY %0:tgpr 320B $r2 = COPY %2:tgpr 336B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 352B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 368B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 384B $r0 = COPY %1:tgpr 400B $r1 = COPY %1:tgpr 416B $r2 = COPY %0:tgpr 432B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, 
implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 448B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 464B tBX_RET 14, $noreg # End machine code for function f. ********** SIMPLE REGISTER COALESCING ********** ********** Function: f ********** JOINING INTERVALS *********** entry: 16B %2:tgpr = COPY $r2 Considering merging %2 with $r2 Can only merge into reserved registers. 32B %1:tgpr = COPY $r1 Considering merging %1 with $r1 Can only merge into reserved registers. 48B %0:tgpr = COPY $r0 Considering merging %0 with $r0 Can only merge into reserved registers. 96B $r0 = COPY %0:tgpr Considering merging %0 with $r0 Can only merge into reserved registers. 112B $r1 = COPY %1:tgpr Considering merging %1 with $r1 Can only merge into reserved registers. 128B $r2 = COPY %2:tgpr Considering merging %2 with $r2 Can only merge into reserved registers. 192B $r0 = COPY %0:tgpr Considering merging %0 with $r0 Can only merge into reserved registers. 208B $r1 = COPY %2:tgpr Considering merging %2 with $r1 Can only merge into reserved registers. 224B $r2 = COPY %1:tgpr Considering merging %1 with $r2 Can only merge into reserved registers. 288B $r0 = COPY %1:tgpr Considering merging %1 with $r0 Can only merge into reserved registers. 304B $r1 = COPY %0:tgpr Considering merging %0 with $r1 Can only merge into reserved registers. 320B $r2 = COPY %2:tgpr Considering merging %2 with $r2 Can only merge into reserved registers. 384B $r0 = COPY %1:tgpr Considering merging %1 with $r0 Can only merge into reserved registers. 400B $r1 = COPY %1:tgpr Considering merging %1 with $r1 Can only merge into reserved registers. 416B $r2 = COPY %0:tgpr Considering merging %0 with $r2 Can only merge into reserved registers. 96B $r0 = COPY %0:tgpr Considering merging %0 with $r0 Can only merge into reserved registers. 112B $r1 = COPY %1:tgpr Considering merging %1 with $r1 Can only merge into reserved registers. 128B $r2 = COPY %2:tgpr Considering merging %2 with $r2 Can only merge into reserved registers. 192B $r0 = COPY %0:tgpr Considering merging %0 with $r0 Can only merge into reserved registers. 208B $r1 = COPY %2:tgpr Considering merging %2 with $r1 Can only merge into reserved registers. 224B $r2 = COPY %1:tgpr Considering merging %1 with $r2 Can only merge into reserved registers. 288B $r0 = COPY %1:tgpr Considering merging %1 with $r0 Can only merge into reserved registers. 304B $r1 = COPY %0:tgpr Considering merging %0 with $r1 Can only merge into reserved registers. 320B $r2 = COPY %2:tgpr Considering merging %2 with $r2 Can only merge into reserved registers. 384B $r0 = COPY %1:tgpr Considering merging %1 with $r0 Can only merge into reserved registers. 400B $r1 = COPY %1:tgpr Considering merging %1 with $r1 Can only merge into reserved registers. 416B $r2 = COPY %0:tgpr Considering merging %0 with $r2 Can only merge into reserved registers. Trying to inflate 0 regs. 
********** INTERVALS ********** R0 [0B,48r:0)[96r,144r:4)[192r,240r:3)[288r,336r:2)[384r,432r:1) 0 at 0B-phi 1 at 384r 2 at 288r 3 at 192r 4 at 96r R1 [0B,32r:0)[112r,144r:4)[208r,240r:3)[304r,336r:2)[400r,432r:1) 0 at 0B-phi 1 at 400r 2 at 304r 3 at 208r 4 at 112r R2 [0B,16r:0)[128r,144r:4)[224r,240r:3)[320r,336r:2)[416r,432r:1) 0 at 0B-phi 1 at 416r 2 at 320r 3 at 224r 4 at 128r %0 [48r,416r:0) 0 at 48r weight:0.000000e+00 %1 [32r,400r:0) 0 at 32r weight:0.000000e+00 %2 [16r,320r:0) 0 at 16r weight:0.000000e+00 %3 [80r,432r:0) 0 at 80r weight:0.000000e+00 RegMasks: 144r 240r 336r 432r ********** MACHINEINSTRS ********** # Machine code for function f: NoPHIs, TracksLiveness Constant Pool: cp#0: @bar, align=4 Function Live Ins: $r0 in %0, $r1 in %1, $r2 in %2 0B bb.0.entry: liveins: $r0, $r1, $r2 16B %2:tgpr = COPY $r2 32B %1:tgpr = COPY $r1 48B %0:tgpr = COPY $r0 64B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 80B %3:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) 96B $r0 = COPY %0:tgpr 112B $r1 = COPY %1:tgpr 128B $r2 = COPY %2:tgpr 144B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 160B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 176B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 192B $r0 = COPY %0:tgpr 208B $r1 = COPY %2:tgpr 224B $r2 = COPY %1:tgpr 240B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 256B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 272B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 288B $r0 = COPY %1:tgpr 304B $r1 = COPY %0:tgpr 320B $r2 = COPY %2:tgpr 336B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 352B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 368B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 384B $r0 = COPY %1:tgpr 400B $r1 = COPY %1:tgpr 416B $r2 = COPY %0:tgpr 432B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 448B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 464B tBX_RET 14, $noreg # End machine code for function f. 
********** GREEDY REGISTER ALLOCATION ********** ********** Function: f ********** INTERVALS ********** R0 [0B,48r:0)[96r,144r:4)[192r,240r:3)[288r,336r:2)[384r,432r:1) 0 at 0B-phi 1 at 384r 2 at 288r 3 at 192r 4 at 96r R1 [0B,32r:0)[112r,144r:4)[208r,240r:3)[304r,336r:2)[400r,432r:1) 0 at 0B-phi 1 at 400r 2 at 304r 3 at 208r 4 at 112r R2 [0B,16r:0)[128r,144r:4)[224r,240r:3)[320r,336r:2)[416r,432r:1) 0 at 0B-phi 1 at 416r 2 at 320r 3 at 224r 4 at 128r %0 [48r,416r:0) 0 at 48r weight:6.575521e-03 %1 [32r,400r:0) 0 at 32r weight:7.890625e-03 %2 [16r,320r:0) 0 at 16r weight:5.738636e-03 %3 [80r,432r:0) 0 at 80r weight:3.324468e-03 RegMasks: 144r 240r 336r 432r ********** MACHINEINSTRS ********** # Machine code for function f: NoPHIs, TracksLiveness Constant Pool: cp#0: @bar, align=4 Function Live Ins: $r0 in %0, $r1 in %1, $r2 in %2 0B bb.0.entry: liveins: $r0, $r1, $r2 16B %2:tgpr = COPY $r2 32B %1:tgpr = COPY $r1 48B %0:tgpr = COPY $r0 64B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 80B %3:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) 96B $r0 = COPY %0:tgpr 112B $r1 = COPY %1:tgpr 128B $r2 = COPY %2:tgpr 144B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 160B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 176B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 192B $r0 = COPY %0:tgpr 208B $r1 = COPY %2:tgpr 224B $r2 = COPY %1:tgpr 240B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 256B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 272B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 288B $r0 = COPY %1:tgpr 304B $r1 = COPY %0:tgpr 320B $r2 = COPY %2:tgpr 336B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 352B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 368B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 384B $r0 = COPY %1:tgpr 400B $r1 = COPY %1:tgpr 416B $r2 = COPY %0:tgpr 432B tBLXr 14, $noreg, %3:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 448B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 464B tBX_RET 14, $noreg # End machine code for function f. 
selectOrSplit tGPR:%0 [48r,416r:0) 0 at 48r weight:6.575521e-03 w=6.575521e-03 AllocationOrder(tGPR) = [ $r0 $r1 $r2 $r3 $r4 $r5 $r6 ] hints: $r0 $r1 $r2 missed hint $r0 assigning %0 to $r4: R4 [48r,416r:0) 0 at 48r selectOrSplit tGPR:%1 [32r,400r:0) 0 at 32r weight:7.890625e-03 w=7.890625e-03 hints: $r1 $r0 $r2 missed hint $r1 assigning %1 to $r5: R5 [32r,400r:0) 0 at 32r selectOrSplit tGPR:%2 [16r,320r:0) 0 at 16r weight:5.738636e-03 w=5.738636e-03 hints: $r2 $r1 missed hint $r2 assigning %2 to $r6: R6 [16r,320r:0) 0 at 16r selectOrSplit tGPR:%3 [80r,432r:0) 0 at 80r weight:3.324468e-03 w=3.324468e-03 RS_Assign Cascade 0 wait for second round queuing new interval: %3 [80r,432r:0) 0 at 80r weight:3.324468e-03 selectOrSplit tGPR:%3 [80r,432r:0) 0 at 80r weight:3.324468e-03 w=3.324468e-03 RS_Split Cascade 0 Analyze counted 5 instrs in 1 blocks, through 0 blocks. tryLocalSplit: 80r 144r 240r 336r 432r 4 regmasks in block: 144r:80r-144r 144r:144r-240r 240r:240r-336r 336r:336r-432r $r0 80r-144r i=INF extend $r0 144r-240r i=INF extend $r0 240r-336r i=INF extend $r0 336r-432r i=INF end $r1 80r-144r i=INF extend $r1 144r-240r i=INF extend $r1 240r-336r i=INF extend $r1 336r-432r i=INF end $r2 80r-144r i=INF extend $r2 144r-240r i=INF extend $r2 240r-336r i=INF extend $r2 336r-432r i=INF end $r3 80r-144r i=INF extend $r3 144r-240r i=INF extend $r3 240r-336r i=INF extend $r3 336r-432r i=INF end $r4 80r-144r i=6.575521e-03 w=6.250000e-03 extend $r4 144r-240r i=6.575521e-03 w=7.575758e-03 (best) extend $r4 144r-336r i=6.575521e-03 w=8.012821e-03 (best) extend $r4 144r-432r i=6.575521e-03 w=7.102273e-03 end $r5 80r-144r i=7.890625e-03 w=6.250000e-03 extend $r5 144r-240r i=7.890625e-03 w=7.575758e-03 extend $r5 240r-336r i=7.890625e-03 w=7.575758e-03 extend $r5 336r-432r i=7.890625e-03 w=5.859375e-03 end $r6 80r-144r i=5.738636e-03 w=6.250000e-03 extend $r6 80r-240r i=5.738636e-03 w=6.944444e-03 extend $r6 80r-336r i=5.738636e-03 w=7.440476e-03 (best) extend $r6 80r-432r i=5.738636e-03 all Best local split range: 80r-336r, 1.667770e-03, 4 instrs enterIntvBefore 80r: not live leaveIntvAfter 336r: valno 0 useIntv [80B;344r): [80B;344r):1 blit [80r,432r:0): [80r;344r)=1(%5):0 [344r;432r)=0(%4):0 rewr %bb.0 80r:1 %5:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) rewr %bb.0 144B:1 tBLXr 14, $noreg, %5:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 240B:1 tBLXr 14, $noreg, %5:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 336B:1 tBLXr 14, $noreg, %5:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 432B:0 tBLXr 14, $noreg, %4:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, 
implicit $r2, implicit-def $sp rewr %bb.0 344B:1 %4:tgpr = COPY %5:tgpr Tagging non-progress ranges: %5 queuing new interval: %4 [344r,432r:0) 0 at 344r weight:2.069672e-03 queuing new interval: %5 [80r,344r:0) 0 at 80r weight:3.802711e-03 selectOrSplit tGPR:%5 [80r,344r:0) 0 at 80r weight:3.802711e-03 w=3.802711e-03 RS_Split2 Cascade 0 Analyze counted 5 instrs in 1 blocks, through 0 blocks. tryLocalSplit: 80r 144r 240r 336r 344r 4 regmasks in block: 144r:80r-144r 144r:144r-240r 240r:240r-336r 336r:336r-344r $r0 80r-144r i=INF extend $r0 144r-240r i=INF extend $r0 240r-336r i=INF extend $r0 336r-344r i=INF end $r1 80r-144r i=INF extend $r1 144r-240r i=INF extend $r1 240r-336r i=INF extend $r1 336r-344r i=INF end $r2 80r-144r i=INF extend $r2 144r-240r i=INF extend $r2 240r-336r i=INF extend $r2 336r-344r i=INF end $r3 80r-144r i=INF extend $r3 144r-240r i=INF extend $r3 240r-336r i=INF extend $r3 336r-344r i=INF end $r4 80r-144r i=6.575521e-03 w=6.250000e-03 extend $r4 144r-240r i=6.575521e-03 w=7.575758e-03 (best) extend $r4 144r-336r i=6.575521e-03 shrink $r4 240r-336r i=6.575521e-03 w=7.575758e-03 (best) extend $r4 240r-344r i=6.575521e-03 w=7.692308e-03 (best) end $r5 80r-144r i=7.890625e-03 w=6.250000e-03 extend $r5 144r-240r i=7.890625e-03 w=7.575758e-03 extend $r5 240r-336r i=7.890625e-03 w=7.575758e-03 extend $r5 336r-344r i=7.890625e-03 w=7.075472e-03 end $r6 80r-144r i=5.738636e-03 w=6.250000e-03 extend $r6 80r-240r i=5.738636e-03 w=6.944444e-03 (best) extend $r6 80r-336r i=5.738636e-03 shrink $r6 144r-336r i=5.738636e-03 shrink $r6 240r-336r i=5.738636e-03 w=7.575758e-03 (best) extend $r6 240r-344r i=5.738636e-03 w=7.692308e-03 (best) end Best local split range: 240r-344r, 1.914560e-03, 3 instrs enterIntvBefore 240r: valno 0 leaveIntvAfter 344r: not live useIntv [232r;352B): [232r;352B):1 blit [80r,344r:0): [80r;232r)=0(%6):0 [232r;344r)=1(%7):0 rewr %bb.0 80r:0 %6:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) rewr %bb.0 144B:0 tBLXr 14, $noreg, %6:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 240B:1 tBLXr 14, $noreg, %7:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 336B:1 tBLXr 14, $noreg, %7:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 344B:1 %4:tgpr = COPY %7:tgpr rewr %bb.0 232B:0 %7:tgpr = COPY %6:tgpr queuing new interval: %6 [80r,232r:0) 0 at 80r weight:2.744565e-03 queuing new interval: %7 [232r,344r:0) 0 at 232r weight:3.945312e-03 selectOrSplit tGPR:%6 [80r,232r:0) 0 at 80r weight:2.744565e-03 w=2.744565e-03 RS_Assign Cascade 0 wait for second round queuing new interval: %6 [80r,232r:0) 0 at 80r weight:2.744565e-03 selectOrSplit tGPR:%7 [232r,344r:0) 0 at 232r weight:3.945312e-03 w=3.945312e-03 RS_Assign Cascade 0 wait for second round queuing new interval: %7 [232r,344r:0) 0 at 232r weight:3.945312e-03 selectOrSplit tGPR:%4 
[344r,432r:0) 0 at 344r weight:2.069672e-03 w=2.069672e-03 assigning %4 to $r3: R3 [344r,432r:0) 0 at 344r selectOrSplit tGPR:%6 [80r,232r:0) 0 at 80r weight:2.744565e-03 w=2.744565e-03 RS_Split Cascade 0 Analyze counted 3 instrs in 1 blocks, through 0 blocks. tryLocalSplit: 80r 144r 232r 4 regmasks in block: 144r:80r-144r 144r:144r-232r $r0 80r-144r i=INF extend $r0 144r-232r i=INF end $r1 80r-144r i=INF extend $r1 144r-232r i=INF end $r2 80r-144r i=INF extend $r2 144r-232r i=INF end $r3 80r-144r i=INF extend $r3 144r-232r i=INF end $r4 80r-144r i=6.575521e-03 w=6.250000e-03 extend $r4 144r-232r i=6.575521e-03 w=5.952381e-03 end $r5 80r-144r i=7.890625e-03 w=6.250000e-03 extend $r5 144r-232r i=7.890625e-03 w=5.952381e-03 end $r6 80r-144r i=5.738636e-03 w=6.250000e-03 (best) extend $r6 80r-232r i=5.738636e-03 all Best local split range: 80r-144r, 5.011263e-04, 2 instrs enterIntvBefore 80r: not live leaveIntvAfter 144r: valno 0 useIntv [80B;152r): [80B;152r):1 blit [80r,232r:0): [80r;152r)=1(%9):0 [152r;232r)=0(%8):0 rewr %bb.0 80r:1 %9:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) rewr %bb.0 144B:1 tBLXr 14, $noreg, %9:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 232B:0 %7:tgpr = COPY %8:tgpr rewr %bb.0 152B:1 %8:tgpr = COPY %9:tgpr Tagging non-progress ranges: %9 queuing new interval: %8 [152r,232r:0) 0 at 152r weight:2.104167e-03 queuing new interval: %9 [80r,152r:0) 0 at 80r weight:3.209746e-03 selectOrSplit tGPR:%9 [80r,152r:0) 0 at 80r weight:3.209746e-03 w=3.209746e-03 RS_Split2 Cascade 0 Analyze counted 3 instrs in 1 blocks, through 0 blocks. tryLocalSplit: 80r 144r 152r 4 regmasks in block: 144r:80r-144r 144r:144r-152r $r0 80r-144r i=INF extend $r0 144r-152r i=INF end $r1 80r-144r i=INF extend $r1 144r-152r i=INF end $r2 80r-144r i=INF extend $r2 144r-152r i=INF end $r3 80r-144r i=INF extend $r3 144r-152r i=INF end $r4 80r-144r i=6.575521e-03 extend $r4 144r-152r i=6.575521e-03 end $r5 80r-144r i=7.890625e-03 extend $r5 144r-152r i=7.890625e-03 end $r6 80r-144r i=5.738636e-03 extend $r6 144r-152r i=5.738636e-03 end Inline spilling tGPR:%9 [80r,152r:0) 0 at 80r weight:3.209746e-03 From original %3 also spill snippet %8 [152r,232r:0) 0 at 152r weight:2.104167e-03 tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 folded: 144r tBL 14, $noreg, &"\E0\9C\06\A0\FC\7F", implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 :: (load 4 from constant-pool) remat: 228r %10:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) 232e %7:tgpr = COPY killed %10:tgpr All defs dead: dead %9:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) All defs dead: dead %8:tgpr = COPY %9:tgpr Remat created 2 dead defs. Deleting dead def 152r dead %8:tgpr = COPY %9:tgpr Deleting dead def 80r dead %9:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) 0 registers to spill after remat. 
queuing new interval: %10 [228r,232r:0) 0 at 228r weight:INF selectOrSplit tGPR:%10 [228r,232r:0) 0 at 228r weight:INF w=INF assigning %10 to $r3: R3 [228r,232r:0) 0 at 228r Dropping unused %8 EMPTY weight:2.104167e-03 selectOrSplit tGPR:%7 [232r,344r:0) 0 at 232r weight:3.945312e-03 w=3.945312e-03 hints: $r3 RS_Split Cascade 0 Analyze counted 4 instrs in 1 blocks, through 0 blocks. tryLocalSplit: 232r 240r 336r 344r 4 regmasks in block: 240r:232r-240r 240r:240r-336r 336r:336r-344r $r3 232r-240r i=INF extend $r3 240r-336r i=INF extend $r3 336r-344r i=INF end $r0 232r-240r i=INF extend $r0 240r-336r i=INF extend $r0 336r-344r i=INF end $r1 232r-240r i=INF extend $r1 240r-336r i=INF extend $r1 336r-344r i=INF end $r2 232r-240r i=INF extend $r2 240r-336r i=INF extend $r2 336r-344r i=INF end $r4 232r-240r i=6.575521e-03 w=7.075472e-03 (best) extend $r4 232r-336r i=6.575521e-03 w=7.692308e-03 (best) extend $r4 232r-344r i=6.575521e-03 all $r5 232r-240r i=7.890625e-03 w=7.075472e-03 extend $r5 240r-336r i=7.890625e-03 w=7.575758e-03 extend $r5 336r-344r i=7.890625e-03 w=7.075472e-03 end $r6 232r-240r i=5.738636e-03 w=7.075472e-03 (best) extend $r6 232r-336r i=5.738636e-03 w=7.692308e-03 (best) extend $r6 232r-344r i=5.738636e-03 all Best local split range: 232r-336r, 1.914560e-03, 3 instrs enterIntvBefore 232r: not live leaveIntvAfter 336r: valno 0 useIntv [232B;340r): [232B;340r):1 blit [232r,344r:0): [232r;340r)=1(%13):0 [340r;344r)=0(%12):0 rewr %bb.0 232r:1 %13:tgpr = COPY %10:tgpr rewr %bb.0 240B:1 tBLXr 14, $noreg, %13:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 336B:1 tBLXr 14, $noreg, %13:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 344B:0 %4:tgpr = COPY %12:tgpr rewr %bb.0 340B:1 %12:tgpr = COPY %13:tgpr Tagging non-progress ranges: %13 queuing new interval: %12 [340r,344r:0) 0 at 340r weight:INF queuing new interval: %13 [232r,340r:0) 0 at 232r weight:3.976378e-03 selectOrSplit tGPR:%13 [232r,340r:0) 0 at 232r weight:3.976378e-03 w=3.976378e-03 hints: $r3 RS_Split2 Cascade 0 Analyze counted 4 instrs in 1 blocks, through 0 blocks. 
tryLocalSplit: 232r 240r 336r 340r 4 regmasks in block: 240r:232r-240r 240r:240r-336r 336r:336r-340r $r3 232r-240r i=INF extend $r3 240r-336r i=INF extend $r3 336r-340r i=INF end $r0 232r-240r i=INF extend $r0 240r-336r i=INF extend $r0 336r-340r i=INF end $r1 232r-240r i=INF extend $r1 240r-336r i=INF extend $r1 336r-340r i=INF end $r2 232r-240r i=INF extend $r2 240r-336r i=INF extend $r2 336r-340r i=INF end $r4 232r-240r i=6.575521e-03 w=7.075472e-03 (best) extend $r4 232r-336r i=6.575521e-03 shrink $r4 240r-336r i=6.575521e-03 extend $r4 336r-340r i=6.575521e-03 w=7.142857e-03 (best) end $r5 232r-240r i=7.890625e-03 w=7.075472e-03 extend $r5 240r-336r i=7.890625e-03 extend $r5 336r-340r i=7.890625e-03 w=7.142857e-03 end $r6 232r-240r i=5.738636e-03 w=7.075472e-03 (best) extend $r6 232r-336r i=5.738636e-03 shrink $r6 240r-336r i=5.738636e-03 extend $r6 336r-340r i=0.000000e+00 w=7.142857e-03 (best) end Best local split range: 336r-340r, 6.999861e-03, 2 instrs enterIntvBefore 336r: valno 0 leaveIntvAfter 340r: not live useIntv [328r;344B): [328r;344B):1 blit [232r,340r:0): [232r;328r)=0(%14):0 [328r;340r)=1(%15):0 rewr %bb.0 232r:0 %14:tgpr = COPY %10:tgpr rewr %bb.0 240B:0 tBLXr 14, $noreg, %14:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 336B:1 tBLXr 14, $noreg, %15:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 340B:1 %12:tgpr = COPY %15:tgpr rewr %bb.0 328B:0 %15:tgpr = COPY %14:tgpr queuing new interval: %14 [232r,328r:0) 0 at 232r weight:3.054435e-03 queuing new interval: %15 [328r,340r:0) 0 at 328r weight:3.677184e-03 selectOrSplit tGPR:%14 [232r,328r:0) 0 at 232r weight:3.054435e-03 w=3.054435e-03 hints: $r3 RS_Assign Cascade 0 wait for second round queuing new interval: %14 [232r,328r:0) 0 at 232r weight:3.054435e-03 selectOrSplit tGPR:%12 [340r,344r:0) 0 at 340r weight:INF w=INF hints: $r3 assigning %12 to $r3: R3 [340r,344r:0) 0 at 340r selectOrSplit tGPR:%15 [328r,340r:0) 0 at 328r weight:3.677184e-03 w=3.677184e-03 hints: $r3 assigning %15 to $r6: R6 [328r,340r:0) 0 at 328r selectOrSplit tGPR:%14 [232r,328r:0) 0 at 232r weight:3.054435e-03 w=3.054435e-03 hints: $r3 $r6 RS_Split Cascade 0 Analyze counted 3 instrs in 1 blocks, through 0 blocks. 
tryLocalSplit: 232r 240r 328r 4 regmasks in block: 240r:232r-240r 240r:240r-328r $r3 232r-240r i=INF extend $r3 240r-328r i=INF end $r6 232r-240r i=5.738636e-03 w=7.075472e-03 (best) extend $r6 232r-328r i=5.738636e-03 all $r0 232r-240r i=INF extend $r0 240r-328r i=INF end $r1 232r-240r i=INF extend $r1 240r-328r i=INF end $r2 232r-240r i=INF extend $r2 240r-328r i=INF end $r4 232r-240r i=6.575521e-03 w=7.075472e-03 extend $r4 232r-328r i=6.575521e-03 all $r5 232r-240r i=7.890625e-03 w=7.075472e-03 extend $r5 240r-328r i=7.890625e-03 w=5.952381e-03 end Best local split range: 232r-240r, 1.310072e-03, 2 instrs enterIntvBefore 232r: not live leaveIntvAfter 240r: valno 0 useIntv [232B;248r): [232B;248r):1 blit [232r,328r:0): [232r;248r)=1(%17):0 [248r;328r)=0(%16):0 rewr %bb.0 232r:1 %17:tgpr = COPY %10:tgpr rewr %bb.0 240B:1 tBLXr 14, $noreg, %17:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp rewr %bb.0 328B:0 %15:tgpr = COPY %16:tgpr rewr %bb.0 248B:1 %16:tgpr = COPY %17:tgpr Tagging non-progress ranges: %17 queuing new interval: %16 [248r,328r:0) 0 at 248r weight:2.104167e-03 queuing new interval: %17 [232r,248r:0) 0 at 232r weight:3.641827e-03 selectOrSplit tGPR:%17 [232r,248r:0) 0 at 232r weight:3.641827e-03 w=3.641827e-03 hints: $r3 RS_Split2 Cascade 0 Analyze counted 3 instrs in 1 blocks, through 0 blocks. tryLocalSplit: 232r 240r 248r 4 regmasks in block: 240r:232r-240r 240r:240r-248r $r3 232r-240r i=INF extend $r3 240r-248r i=INF end $r0 232r-240r i=INF extend $r0 240r-248r i=INF end $r1 232r-240r i=INF extend $r1 240r-248r i=INF end $r2 232r-240r i=INF extend $r2 240r-248r i=INF end $r4 232r-240r i=6.575521e-03 extend $r4 240r-248r i=6.575521e-03 end $r5 232r-240r i=7.890625e-03 extend $r5 240r-248r i=7.890625e-03 end $r6 232r-240r i=5.738636e-03 extend $r6 240r-248r i=5.738636e-03 end Inline spilling tGPR:%17 [232r,248r:0) 0 at 232r weight:3.641827e-03 From original %3 also spill snippet %10 [228r,232r:0) 0 at 228r weight:INF also spill snippet %16 [248r,328r:0) 0 at 248r weight:2.104167e-03 tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 folded: 240r tBL 14, $noreg, &"\E0\9C\06\A0\FC\7F", implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 :: (load 4 from constant-pool) remat: 324r %18:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) 328e %15:tgpr = COPY killed %18:tgpr All defs dead: dead %17:tgpr = COPY %10:tgpr All defs dead: dead %10:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) All defs dead: dead %16:tgpr = COPY %17:tgpr Remat created 3 dead defs. Deleting dead def 248r dead %16:tgpr = COPY %17:tgpr Deleting dead def 228r dead %10:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) unassigning %10 from $r3: R3 Deleting dead def 232r dead %17:tgpr = COPY %10:tgpr Shrink: %10 EMPTY weight:INF Shrunk: %10 EMPTY weight:INF 0 registers to spill after remat. queuing new interval: %18 [324r,328r:0) 0 at 324r weight:INF selectOrSplit tGPR:%18 [324r,328r:0) 0 at 324r weight:INF w=INF hints: $r6 assigning %18 to $r6: R6 [324r,328r:0) 0 at 324r Dropping unused %16 EMPTY weight:2.104167e-03 Dropping unused %10 EMPTY weight:INF Trying to reconcile hints for: %0($r4) %0($r4) is recolorable. 
Trying to reconcile hints for: %1($r5) %1($r5) is recolorable. Trying to reconcile hints for: %2($r6) %2($r6) is recolorable. ********** REWRITE VIRTUAL REGISTERS ********** ********** Function: f ********** REGISTER MAP ********** [%0 -> $r4] tGPR [%1 -> $r5] tGPR [%2 -> $r6] tGPR [%4 -> $r3] tGPR [%12 -> $r3] tGPR [%15 -> $r6] tGPR [%18 -> $r6] tGPR 0B bb.0.entry: liveins: $r0, $r1, $r2 16B %2:tgpr = COPY $r2 32B %1:tgpr = COPY $r1 48B %0:tgpr = COPY $r0 64B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 96B $r0 = COPY %0:tgpr 112B $r1 = COPY %1:tgpr 128B $r2 = COPY %2:tgpr 144B tBL 14, $noreg, &"\94p\10\09", implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 :: (load 4 from constant-pool) 160B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 176B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 192B $r0 = COPY %0:tgpr 208B $r1 = COPY %2:tgpr 224B $r2 = COPY %1:tgpr 240B tBL 14, $noreg, &"", implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 :: (load 4 from constant-pool) 256B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 272B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 288B $r0 = COPY %1:tgpr 304B $r1 = COPY %0:tgpr 320B $r2 = COPY killed %2:tgpr 324B %18:tgpr = tLDRpci %const.0, 14, $noreg :: (load 4 from constant-pool) 328B %15:tgpr = COPY killed %18:tgpr 336B tBLXr 14, $noreg, %15:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 340B %12:tgpr = COPY killed %15:tgpr 344B %4:tgpr = COPY killed %12:tgpr 352B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 368B ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 384B $r0 = COPY %1:tgpr 400B $r1 = COPY killed %1:tgpr 416B $r2 = COPY killed %0:tgpr 432B tBLXr 14, $noreg, killed %4:tgpr, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp 448B ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp 464B tBX_RET 14, $noreg> renamable $r6 = COPY $r2 > renamable $r5 = COPY $r1 > renamable $r4 = COPY $r0 > ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > $r0 = COPY renamable $r4 > $r1 = COPY renamable $r5 > $r2 = COPY renamable $r6 > tBL 14, $noreg, &"\06", implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 :: (load 4 from constant-pool) > ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > $r0 = COPY renamable $r4 > $r1 = COPY renamable $r6 > $r2 = COPY renamable $r5 > tBL 14, $noreg, &"\06", implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed $r1, implicit killed $r2 :: (load 4 from constant-pool) > ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > $r0 = COPY renamable $r5 > $r1 = COPY renamable $r4 > $r2 = COPY killed renamable $r6 > renamable $r6 = tLDRpci %const.0, 14, $noreg :: 
(load 4 from constant-pool) > renamable $r6 = COPY killed renamable $r6Identity copy: renamable $r6 = COPY killed renamable $r6 deleted.> tBLXr 14, $noreg, renamable $r6, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp > renamable $r3 = COPY killed renamable $r6 > renamable $r3 = COPY killed renamable $r3Identity copy: renamable $r3 = COPY killed renamable $r3 deleted.> ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > ADJCALLSTACKDOWN 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > $r0 = COPY renamable $r5 > $r1 = COPY killed renamable $r5 > $r2 = COPY killed renamable $r4 > tBLXr 14, $noreg, killed renamable $r3, <regmask $lr $d8 $d9 $d10 $d11 $d12 $d13 $d14 $d15 $q4 $q5 $q6 $q7 $r4 $r5 $r6 $r7 $r8 $r9 $r10 $r11 $s16 $s17 $s18 $s19 $s20 $s21 $s22 $s23 $s24 $s25 $s26 $s27 and 35 more...>, implicit-def dead $lr, implicit $sp, implicit $r0, implicit $r1, implicit $r2, implicit-def $sp > ADJCALLSTACKUP 0, 0, 14, $noreg, implicit-def dead $sp, implicit $sp > tBX_RET 14, $noreg-------------- next part -------------- A non-text attachment was scrubbed... Name: llvm-611-2.diff Type: text/x-patch Size: 2620 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200415/aa64cfbd/attachment-0001.bin>
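For reference, one possible shape of the fold being discussed is sketched below. This is illustrative only, not the attached llvm-611-2.diff; the exact checks and operand handling in the real patch may differ. It assumes the TargetInstrInfo::foldMemoryOperandImpl overload that takes the load instruction, which is the hook InlineSpiller uses when it rematerializes by folding. It needs llvm/CodeGen/MachineConstantPool.h, llvm/CodeGen/MachineInstrBuilder.h and llvm/IR/Function.h, which Thumb1InstrInfo.cpp would pull in.

    // Sketch: fold "rN = tLDRpci <cp entry holding @callee>; tBLXr rN" into "tBL @callee".
    MachineInstr *Thumb1InstrInfo::foldMemoryOperandImpl(
        MachineFunction &MF, MachineInstr &MI, ArrayRef<unsigned> Ops,
        MachineBasicBlock::iterator InsertPt, MachineInstr &LoadMI,
        LiveIntervals *LIS) const {
      // Only an indirect call whose target register is fed by a literal-pool load.
      if (MI.getOpcode() != ARM::tBLXr || LoadMI.getOpcode() != ARM::tLDRpci)
        return nullptr;

      // The constant-pool entry must directly hold the callee (cp#0: @bar above).
      const MachineConstantPool *MCP = MF.getConstantPool();
      unsigned CPI = LoadMI.getOperand(1).getIndex();
      const MachineConstantPoolEntry &CPE = MCP->getConstants()[CPI];
      if (CPE.isMachineConstantPoolEntry())
        return nullptr;
      const auto *Callee = dyn_cast<Function>(CPE.Val.ConstVal);
      if (!Callee)
        return nullptr;

      // Rebuild the call as a direct tBL, keeping the implicit operands
      // (argument registers, regmask, lr/sp) of the original tBLXr.
      MachineInstrBuilder MIB =
          BuildMI(*MI.getParent(), InsertPt, MI.getDebugLoc(), get(ARM::tBL))
              .add(predOps(ARMCC::AL))
              .addGlobalAddress(Callee);
      for (const MachineOperand &MO : MI.implicit_operands())
        MIB.add(MO);
      return MIB.getInstr();
    }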
Prathamesh Kulkarni via llvm-dev
2020-Apr-15 23:37 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
On Tue, 14 Apr 2020 at 19:48, Momchil Velikov <momchil.velikov at gmail.com> wrote:
>
> Hi,
>
> > +  MachineInstr *callMI = static_cast<MachineInstr *> (MIB);
>
> Use MIB.getInstr() here.

Fixed, thanks for pointing it out!

Thanks,
Prathamesh

>
> ~chill
>
Prathamesh Kulkarni via llvm-dev
2020-Apr-15 23:39 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
On Wed, 15 Apr 2020 at 03:36, John Brawn <John.Brawn at arm.com> wrote:
>
> > Could you please point out what am I doing wrong in the patch ?
>
> It's because you're getting the function name by doing
>   callee->getName().str().c_str()
> The str() call generates a temporary copy of the name which ceases to exist outside of this
> expression, causing the c_str() to return a pointer to no-longer-valid memory.

Ah indeed, thanks for pointing it out!

> I'm not sure what the correct way of getting the name as a char* is here. Doing
>   callee->getName().data()
> appears to work, though I don't know if we can rely on the StringRef that getName() returns
> being appropriately null-terminated.

Using MachineFunction::createExternalSymbolName seems to work.

> > However for above case, IIUC, we would want all calls to be converted to bl ?
>
> It would be better, yes, though I'm not sure how you'd go about making that happen. It's
> probably not worth worrying too much about, though, as the new behaviour is still better
> than the old.

OK, thanks for the clarification. I will reg-test and submit an updated patch soon.

Thanks,
Prathamesh

> John
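A sketch of the approach mentioned above, assuming the same MF, MI, InsertPt and Callee as in the earlier foldMemoryOperandImpl sketch (illustrative only; the actual updated patch may wire this up differently). MachineFunction::createExternalSymbolName copies the string into storage owned by the MachineFunction, so the resulting char* stays valid and null-terminated, unlike getName().str().c_str():

    // Sketch only: obtain a stable, null-terminated name for the callee and
    // feed it to the direct call being built.
    const char *Name = MF.createExternalSymbolName(Callee->getName());
    MachineInstrBuilder MIB =
        BuildMI(*MI.getParent(), InsertPt, MI.getDebugLoc(), get(ARM::tBL))
            .add(predOps(ARMCC::AL))
            .addExternalSymbol(Name);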