thr3ads.net - llvm dev - [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call [Apr 2020]

If this information is useful, please help other people find it:
Share via:

Prathamesh Kulkarni via llvm-dev

2020-Apr-15 23:39 UTC

[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

On Wed, 15 Apr 2020 at 03:36, John Brawn <John.Brawn at arm.com>
wrote:>
> > Could you please point out what am I doing wrong in the patch ?
>
> It's because you're getting the function name by doing
>   callee->getName().str().c_str()
> The str() call generates a temporary copy of the name which ceases to exist
outside of this expression
> causing the c_str() to return a pointer to no-longer-valid memory.
Ah indeed, thanks for pointing it out!>
> I'm not sure what the correct way of getting the name as a char* is
here. Doing
>   callee->getName().data()
> appears to work, though I don't know if we can rely on the StringReg
that getName() returns being
> appropriately null-terminated.Using MachineFunction::createExternalSymbolName seems to
work.>
> > However for above case, IIUC, we would want all calls to be converted
to bl  ?
> It would be better yes, though I'm not sure how you'd go about
making that happen. It's probably not
> worth worrying too much about though, as the new behaviour is still better
than the old.OK, thanks for the clarification.
I will reg-test and submit an updated patch soon.

Thanks,
Prathamesh>
> John
>
> ________________________________
> From: Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org>
> Sent: 15 April 2020 01:44
> To: John Brawn <John.Brawn at arm.com>
> Cc: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces
register reload before each call
>
> Hi,
> I have attached WIP patch for adding foldMemoryOperand to Thumb1InstrInfo.
> For the following case:
>
> void f(int x, int y, int z)
> {
>   void bar(int, int, int);
>
>   bar(x, y, z);
>   bar(x, z, y);
>   bar(y, x, z);
>   bar(y, y, x);
> }
>
> it calls foldMemoryOperand twice, and thus converts two calls from blx to
bl.
> callMI->dump() shows the function name "bar" correctly,
however in
> generated assembly call to bar is garbled:
> (compiled with -Oz --target=arm-linux-gnueabi -marcha=armv6-m):
>
>         add     r7, sp, #16
>         mov     r6, r2
>         mov     r5, r1
>         mov     r4, r0
>         bl      "<90>w\n        "
>         mov     r1, r2
>         mov     r2, r5
>         bl      "<90>w\n        "
>         mov     r0, r5
>         mov     r1, r4
>         mov     r2, r6
>         ldr     r6, .LCPI0_0
>         blx     r6
>         mov     r0, r5
>         mov     r1, r5
>         mov     r2, r4
>         blx     r6
>
> regalloc dump (attached) shows:
> Inline spilling tGPR:%9 [80r,152r:0)  0 at 80r weight:3.209746e-03
> From original %3
>         also spill snippet %8 [152r,232r:0)  0 at 152r weight:2.104167e-03
>   tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit
> killed $r0, implicit killed $r1, implicit killed $r2
>         folded:   144r  tBL 14, $noreg,
&"\E0\9C\06\A0\FC\7F",
> implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed
> $r1, implicit killed $r2 :: (load 4 from constant-pool)
>         remat:  228r    %10:tgpr = tLDRpci %const.0, 14, $noreg ::
> (load 4 from constant-pool)
>                 232e    %7:tgpr = COPY killed %10:tgpr
>
> Could you please point out what am I doing wrong in the patch ?
>
> Also, I guess, it only converted two calls to bl because further
> spilling wasn't necessary.
> However for above case, IIUC, we would want all calls to be converted to bl
?
> Since,
> 4 bl == 16 bytes
> 2 bl + 2 blx + 1 lr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4
> bytes (litpool) == 18 bytes
>
> Thanks,
> Prathamesh
>
>
>
>
> On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni
> <prathamesh.kulkarni at linaro.org> wrote:
> >
> > Hi John,
> > Thanks for the suggestions! I will start looking at adding
> > foldMemoryOperand to ARMInstrInfo.
> >
> > Thanks,
> > Prathamesh
> >
> > On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at arm.com>
wrote:
> > >
> > > If I'm understanding what's going on in this test
correctly, what's happening is:
> > >  * ARMTargetLowering::LowerCall prefers indirect calls when a
function is called at least 3 times in minsize
> > >  * In thumb 1 (without -fno-omit-frame-pointer) we have
effectively only 3 callee-saved registers (r4-r6)
> > >  * The function has three arguments, so those three plus the
register we need to hold the function address is more than our callee-saved
registers
> > >  * Therefore something needs to be spilt
> > >  * The function address can be rematerialized, so we spill that
and insert and LDR before each call
> > >
> > > If we didn't have this spilling happening (e.g. if the
function had one less argument) then the code size of using BL vs BLX
> > >  * BL: 3*4-byte BL = 12 bytes
> > >  * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12 bytes
> > > (So maybe even not considering spilling, LowerCall should be
adjusted to do this for functions called 4 or more times)
> > >
> > > When we have to spill, if we compare spilling the functions
address vs spilling an argument:
> > >  * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte litpool
= 16 bytes
> > >  * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool
+ 1*2-byte STR + 2*2-byte LDR = 18 bytes
> > > So just changing the spilling heuristic won't work.
> > >
> > > The two ways I see of fixing this:
> > >  * In LowerCall only prefer an indirect call if the number of
integer register arguments is less than the number of callee-saved registers.
> > >  * When the load of the function address is spilled, instead of
just rematerializing the load instead convert the BX back into BL.
> > >
> > > The first of these would be easier, but there will be situations
where we need to use less than three callee-saved registers (e.g. arguments are
loaded from a pointer) and there are situations where we will spill the function
address for reasons entirely unrelated to the function arguments (e.g. if we
have enough live local variables).
> > >
> > > For the second, looking at InlineSpiller.cpp it does have the
concept of rematerializing by folding a memory operand into another instruction,
so I think we could make use of that to do this. It looks like it would involve
adding a foldMemoryOperand function to ARMInstrInfo and then have this fold a
LDR into a BX by turning it into a BL.
> > >
> > > John
> > >
> > > ________________________________
> > > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on
behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org>
> > > Sent: 07 April 2020 21:07
> > > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb
forces register reload before each call
> > >
> > > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni
> > > <prathamesh.kulkarni at linaro.org> wrote:
> > > >
> > > > Hi,
> > > > Compiling attached test-case, which is reduced version of of
> > > > uECC_shared_secret from tinycrypt library [1], with
> > > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S
> > > > results in reloading of register holding function's
address before
> > > > every call to blx:
> > > >
> > > >         ldr       r3, .LCPI0_0
> > > >         blx      r3
> > > >         mov    r0, r6
> > > >         mov    r1, r5
> > > >         mov    r2, r4
> > > >         ldr       r3, .LCPI0_0
> > > >         blx       r3
> > > >         ldr        r3, .LCPI0_0
> > > >         mov     r0, r6
> > > >         mov     r1, r5
> > > >         mov     r2, r4
> > > >         blx       r3
> > > >
> > > > .LCPI0_0:
> > > >         .long   foo
> > > >
> > > > From dump of regalloc (attached), AFAIU, what seems to
happen during
> > > > greedy allocator is, all virt regs %0 to %3 are live across
first two
> > > > calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4
> > > > respectively, and %3 which holds foo's address
doesn't have any
> > > > register left.
> > > > Since it's live-range has least weight, it does not
evict any existing interval,
> > > > and gets split. Eventually we have the following allocation:
> > > >
> > > > [%0 -> $r6] tGPR
> > > > [%1 -> $r5] tGPR
> > > > [%2 -> $r4] tGPR
> > > > [%6 -> $r3] tGPR
> > > > [%11 -> $r3] tGPR
> > > > [%16 -> $r3] tGPR
> > > > [%17 -> $r3] tGPR
> > > >
> > > > where %6, %11, %16 and %17 all are derived from %3.
> > > > And since r3 is a call-clobbered register, the compiler is
forced to
> > > > reload foo's address
> > > > each time before blx.
> > > >
> > > > To fix this, I thought of following approaches:
> > > > (a) Disable the heuristic to prefer indirect call when there
are at
> > > > least 3 calls to
> > > > same function in basic block in ARMTargetLowering::LowerCall
for Thumb-1 ISA.
> > > >
> > > > (b) In ARMTargetLowering::LowerCall, put another constraint
like
> > > > number of arguments, as a proxy for register pressure for
Thumb-1, but
> > > > that's bound to trip another cases.
> > > >
> > > > (c) Give higher priority to allocate vrit reg used for
indirect calls
> > > > ? However, if that
> > > > results in spilling of some other register, it would defeat
the
> > > > purpose of saving code-size. I suppose ideally we want to
trigger the
> > > > heuristic of using indirect call only when we know
beforehand that it
> > > > will not result in spilling. But I am not sure if it's
possible to
> > > > estimate that during isel ?
> > > >
> > > > I would be grateful for suggestions on how to proceed
further.
> > > ping ?
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > [1]
https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > llvm-dev at lists.llvm.org
> > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Prathamesh Kulkarni via llvm-dev

2020-Apr-18 00:42 UTC

head link

[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

On Wed, 15 Apr 2020 at 16:39, Prathamesh Kulkarni
<prathamesh.kulkarni at linaro.org> wrote:>
> On Wed, 15 Apr 2020 at 03:36, John Brawn <John.Brawn at arm.com>
wrote:
> >
> > > Could you please point out what am I doing wrong in the patch ?
> >
> > It's because you're getting the function name by doing
> >   callee->getName().str().c_str()
> > The str() call generates a temporary copy of the name which ceases to
exist outside of this expression
> > causing the c_str() to return a pointer to no-longer-valid memory.
> Ah indeed, thanks for pointing it out!
> >
> > I'm not sure what the correct way of getting the name as a char*
is here. Doing
> >   callee->getName().data()
> > appears to work, though I don't know if we can rely on the
StringReg that getName() returns being
> > appropriately null-terminated.
> Using MachineFunction::createExternalSymbolName seems to work.
> >
> > > However for above case, IIUC, we would want all calls to be
converted to bl  ?
> > It would be better yes, though I'm not sure how you'd go about
making that happen. It's probably not
> > worth worrying too much about though, as the new behaviour is still
better than the old.
> OK, thanks for the clarification.
> I will reg-test and submit an updated patch soon.Hi, I have attached patch after reg-testing (make check-llvm shows no
unexpected failures).
Does it look OK ?

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > John
> >
> > ________________________________
> > From: Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org>
> > Sent: 15 April 2020 01:44
> > To: John Brawn <John.Brawn at arm.com>
> > Cc: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces
register reload before each call
> >
> > Hi,
> > I have attached WIP patch for adding foldMemoryOperand to
Thumb1InstrInfo.
> > For the following case:
> >
> > void f(int x, int y, int z)
> > {
> >   void bar(int, int, int);
> >
> >   bar(x, y, z);
> >   bar(x, z, y);
> >   bar(y, x, z);
> >   bar(y, y, x);
> > }
> >
> > it calls foldMemoryOperand twice, and thus converts two calls from blx
to bl.
> > callMI->dump() shows the function name "bar" correctly,
however in
> > generated assembly call to bar is garbled:
> > (compiled with -Oz --target=arm-linux-gnueabi -marcha=armv6-m):
> >
> >         add     r7, sp, #16
> >         mov     r6, r2
> >         mov     r5, r1
> >         mov     r4, r0
> >         bl      "<90>w\n        "
> >         mov     r1, r2
> >         mov     r2, r5
> >         bl      "<90>w\n        "
> >         mov     r0, r5
> >         mov     r1, r4
> >         mov     r2, r6
> >         ldr     r6, .LCPI0_0
> >         blx     r6
> >         mov     r0, r5
> >         mov     r1, r5
> >         mov     r2, r4
> >         blx     r6
> >
> > regalloc dump (attached) shows:
> > Inline spilling tGPR:%9 [80r,152r:0)  0 at 80r weight:3.209746e-03
> > From original %3
> >         also spill snippet %8 [152r,232r:0)  0 at 152r
weight:2.104167e-03
> >   tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit
> > killed $r0, implicit killed $r1, implicit killed $r2
> >         folded:   144r  tBL 14, $noreg,
&"\E0\9C\06\A0\FC\7F",
> > implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed
> > $r1, implicit killed $r2 :: (load 4 from constant-pool)
> >         remat:  228r    %10:tgpr = tLDRpci %const.0, 14, $noreg ::
> > (load 4 from constant-pool)
> >                 232e    %7:tgpr = COPY killed %10:tgpr
> >
> > Could you please point out what am I doing wrong in the patch ?
> >
> > Also, I guess, it only converted two calls to bl because further
> > spilling wasn't necessary.
> > However for above case, IIUC, we would want all calls to be converted
to bl  ?
> > Since,
> > 4 bl == 16 bytes
> > 2 bl + 2 blx + 1 lr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4
> > bytes (litpool) == 18 bytes
> >
> > Thanks,
> > Prathamesh
> >
> >
> >
> >
> > On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni
> > <prathamesh.kulkarni at linaro.org> wrote:
> > >
> > > Hi John,
> > > Thanks for the suggestions! I will start looking at adding
> > > foldMemoryOperand to ARMInstrInfo.
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > > On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at
arm.com> wrote:
> > > >
> > > > If I'm understanding what's going on in this test
correctly, what's happening is:
> > > >  * ARMTargetLowering::LowerCall prefers indirect calls when
a function is called at least 3 times in minsize
> > > >  * In thumb 1 (without -fno-omit-frame-pointer) we have
effectively only 3 callee-saved registers (r4-r6)
> > > >  * The function has three arguments, so those three plus the
register we need to hold the function address is more than our callee-saved
registers
> > > >  * Therefore something needs to be spilt
> > > >  * The function address can be rematerialized, so we spill
that and insert and LDR before each call
> > > >
> > > > If we didn't have this spilling happening (e.g. if the
function had one less argument) then the code size of using BL vs BLX
> > > >  * BL: 3*4-byte BL = 12 bytes
> > > >  * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12
bytes
> > > > (So maybe even not considering spilling, LowerCall should be
adjusted to do this for functions called 4 or more times)
> > > >
> > > > When we have to spill, if we compare spilling the functions
address vs spilling an argument:
> > > >  * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte
litpool = 16 bytes
> > > >  * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte
litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes
> > > > So just changing the spilling heuristic won't work.
> > > >
> > > > The two ways I see of fixing this:
> > > >  * In LowerCall only prefer an indirect call if the number
of integer register arguments is less than the number of callee-saved registers.
> > > >  * When the load of the function address is spilled, instead
of just rematerializing the load instead convert the BX back into BL.
> > > >
> > > > The first of these would be easier, but there will be
situations where we need to use less than three callee-saved registers (e.g.
arguments are loaded from a pointer) and there are situations where we will
spill the function address for reasons entirely unrelated to the function
arguments (e.g. if we have enough live local variables).
> > > >
> > > > For the second, looking at InlineSpiller.cpp it does have
the concept of rematerializing by folding a memory operand into another
instruction, so I think we could make use of that to do this. It looks like it
would involve adding a foldMemoryOperand function to ARMInstrInfo and then have
this fold a LDR into a BX by turning it into a BL.
> > > >
> > > > John
> > > >
> > > > ________________________________
> > > > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on
behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org>
> > > > Sent: 07 April 2020 21:07
> > > > To: llvm-dev at lists.llvm.org <llvm-dev at
lists.llvm.org>
> > > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb
forces register reload before each call
> > > >
> > > > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni at linaro.org> wrote:
> > > > >
> > > > > Hi,
> > > > > Compiling attached test-case, which is reduced version
of of
> > > > > uECC_shared_secret from tinycrypt library [1], with
> > > > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S
> > > > > results in reloading of register holding function's
address before
> > > > > every call to blx:
> > > > >
> > > > >         ldr       r3, .LCPI0_0
> > > > >         blx      r3
> > > > >         mov    r0, r6
> > > > >         mov    r1, r5
> > > > >         mov    r2, r4
> > > > >         ldr       r3, .LCPI0_0
> > > > >         blx       r3
> > > > >         ldr        r3, .LCPI0_0
> > > > >         mov     r0, r6
> > > > >         mov     r1, r5
> > > > >         mov     r2, r4
> > > > >         blx       r3
> > > > >
> > > > > .LCPI0_0:
> > > > >         .long   foo
> > > > >
> > > > > From dump of regalloc (attached), AFAIU, what seems to
happen during
> > > > > greedy allocator is, all virt regs %0 to %3 are live
across first two
> > > > > calls to foo. Thus %0, %1 and %2 get assigned r6, r5
and r4
> > > > > respectively, and %3 which holds foo's address
doesn't have any
> > > > > register left.
> > > > > Since it's live-range has least weight, it does not
evict any existing interval,
> > > > > and gets split. Eventually we have the following
allocation:
> > > > >
> > > > > [%0 -> $r6] tGPR
> > > > > [%1 -> $r5] tGPR
> > > > > [%2 -> $r4] tGPR
> > > > > [%6 -> $r3] tGPR
> > > > > [%11 -> $r3] tGPR
> > > > > [%16 -> $r3] tGPR
> > > > > [%17 -> $r3] tGPR
> > > > >
> > > > > where %6, %11, %16 and %17 all are derived from %3.
> > > > > And since r3 is a call-clobbered register, the compiler
is forced to
> > > > > reload foo's address
> > > > > each time before blx.
> > > > >
> > > > > To fix this, I thought of following approaches:
> > > > > (a) Disable the heuristic to prefer indirect call when
there are at
> > > > > least 3 calls to
> > > > > same function in basic block in
ARMTargetLowering::LowerCall for Thumb-1 ISA.
> > > > >
> > > > > (b) In ARMTargetLowering::LowerCall, put another
constraint like
> > > > > number of arguments, as a proxy for register pressure
for Thumb-1, but
> > > > > that's bound to trip another cases.
> > > > >
> > > > > (c) Give higher priority to allocate vrit reg used for
indirect calls
> > > > > ? However, if that
> > > > > results in spilling of some other register, it would
defeat the
> > > > > purpose of saving code-size. I suppose ideally we want
to trigger the
> > > > > heuristic of using indirect call only when we know
beforehand that it
> > > > > will not result in spilling. But I am not sure if
it's possible to
> > > > > estimate that during isel ?
> > > > >
> > > > > I would be grateful for suggestions on how to proceed
further.
> > > > ping ?
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > [1]
https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > llvm-dev at lists.llvm.org
> > > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm-611-3.diff
Type: text/x-patch
Size: 2542 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200418/de306dd8/attachment.bin>

Prathamesh Kulkarni via llvm-dev

2020-Apr-24 18:42 UTC

head link

[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

On Fri, 17 Apr 2020 at 17:42, Prathamesh Kulkarni
<prathamesh.kulkarni at linaro.org> wrote:>
> On Wed, 15 Apr 2020 at 16:39, Prathamesh Kulkarni
> <prathamesh.kulkarni at linaro.org> wrote:
> >
> > On Wed, 15 Apr 2020 at 03:36, John Brawn <John.Brawn at arm.com>
wrote:
> > >
> > > > Could you please point out what am I doing wrong in the
patch ?
> > >
> > > It's because you're getting the function name by doing
> > >   callee->getName().str().c_str()
> > > The str() call generates a temporary copy of the name which
ceases to exist outside of this expression
> > > causing the c_str() to return a pointer to no-longer-valid
memory.
> > Ah indeed, thanks for pointing it out!
> > >
> > > I'm not sure what the correct way of getting the name as a
char* is here. Doing
> > >   callee->getName().data()
> > > appears to work, though I don't know if we can rely on the
StringReg that getName() returns being
> > > appropriately null-terminated.
> > Using MachineFunction::createExternalSymbolName seems to work.
> > >
> > > > However for above case, IIUC, we would want all calls to be
converted to bl  ?
> > > It would be better yes, though I'm not sure how you'd go
about making that happen. It's probably not
> > > worth worrying too much about though, as the new behaviour is
still better than the old.
> > OK, thanks for the clarification.
> > I will reg-test and submit an updated patch soon.
> Hi, I have attached patch after reg-testing (make check-llvm shows no
> unexpected failures).
> Does it look OK ?ping http://lists.llvm.org/pipermail/llvm-dev/2020-April/141048.html

Thanks,
Prathamesh>
> Thanks,
> Prathamesh
>
> >
> > Thanks,
> > Prathamesh
> > >
> > > John
> > >
> > > ________________________________
> > > From: Prathamesh Kulkarni <prathamesh.kulkarni at
linaro.org>
> > > Sent: 15 April 2020 01:44
> > > To: John Brawn <John.Brawn at arm.com>
> > > Cc: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb
forces register reload before each call
> > >
> > > Hi,
> > > I have attached WIP patch for adding foldMemoryOperand to
Thumb1InstrInfo.
> > > For the following case:
> > >
> > > void f(int x, int y, int z)
> > > {
> > >   void bar(int, int, int);
> > >
> > >   bar(x, y, z);
> > >   bar(x, z, y);
> > >   bar(y, x, z);
> > >   bar(y, y, x);
> > > }
> > >
> > > it calls foldMemoryOperand twice, and thus converts two calls
from blx to bl.
> > > callMI->dump() shows the function name "bar"
correctly, however in
> > > generated assembly call to bar is garbled:
> > > (compiled with -Oz --target=arm-linux-gnueabi -marcha=armv6-m):
> > >
> > >         add     r7, sp, #16
> > >         mov     r6, r2
> > >         mov     r5, r1
> > >         mov     r4, r0
> > >         bl      "<90>w\n        "
> > >         mov     r1, r2
> > >         mov     r2, r5
> > >         bl      "<90>w\n        "
> > >         mov     r0, r5
> > >         mov     r1, r4
> > >         mov     r2, r6
> > >         ldr     r6, .LCPI0_0
> > >         blx     r6
> > >         mov     r0, r5
> > >         mov     r1, r5
> > >         mov     r2, r4
> > >         blx     r6
> > >
> > > regalloc dump (attached) shows:
> > > Inline spilling tGPR:%9 [80r,152r:0)  0 at 80r
weight:3.209746e-03
> > > From original %3
> > >         also spill snippet %8 [152r,232r:0)  0 at 152r
weight:2.104167e-03
> > >   tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp,
implicit
> > > killed $r0, implicit killed $r1, implicit killed $r2
> > >         folded:   144r  tBL 14, $noreg,
&"\E0\9C\06\A0\FC\7F",
> > > implicit-def $lr, implicit $sp, implicit killed $r0, implicit
killed
> > > $r1, implicit killed $r2 :: (load 4 from constant-pool)
> > >         remat:  228r    %10:tgpr = tLDRpci %const.0, 14, $noreg
::
> > > (load 4 from constant-pool)
> > >                 232e    %7:tgpr = COPY killed %10:tgpr
> > >
> > > Could you please point out what am I doing wrong in the patch ?
> > >
> > > Also, I guess, it only converted two calls to bl because further
> > > spilling wasn't necessary.
> > > However for above case, IIUC, we would want all calls to be
converted to bl  ?
> > > Since,
> > > 4 bl == 16 bytes
> > > 2 bl + 2 blx + 1 lr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4
> > > bytes (litpool) == 18 bytes
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > >
> > >
> > >
> > > On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni
> > > <prathamesh.kulkarni at linaro.org> wrote:
> > > >
> > > > Hi John,
> > > > Thanks for the suggestions! I will start looking at adding
> > > > foldMemoryOperand to ARMInstrInfo.
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > > > On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at
arm.com> wrote:
> > > > >
> > > > > If I'm understanding what's going on in this
test correctly, what's happening is:
> > > > >  * ARMTargetLowering::LowerCall prefers indirect calls
when a function is called at least 3 times in minsize
> > > > >  * In thumb 1 (without -fno-omit-frame-pointer) we have
effectively only 3 callee-saved registers (r4-r6)
> > > > >  * The function has three arguments, so those three
plus the register we need to hold the function address is more than our
callee-saved registers
> > > > >  * Therefore something needs to be spilt
> > > > >  * The function address can be rematerialized, so we
spill that and insert and LDR before each call
> > > > >
> > > > > If we didn't have this spilling happening (e.g. if
the function had one less argument) then the code size of using BL vs BLX
> > > > >  * BL: 3*4-byte BL = 12 bytes
> > > > >  * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12
bytes
> > > > > (So maybe even not considering spilling, LowerCall
should be adjusted to do this for functions called 4 or more times)
> > > > >
> > > > > When we have to spill, if we compare spilling the
functions address vs spilling an argument:
> > > > >  * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR +
4-byte litpool = 16 bytes
> > > > >  * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR +
4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes
> > > > > So just changing the spilling heuristic won't work.
> > > > >
> > > > > The two ways I see of fixing this:
> > > > >  * In LowerCall only prefer an indirect call if the
number of integer register arguments is less than the number of callee-saved
registers.
> > > > >  * When the load of the function address is spilled,
instead of just rematerializing the load instead convert the BX back into BL.
> > > > >
> > > > > The first of these would be easier, but there will be
situations where we need to use less than three callee-saved registers (e.g.
arguments are loaded from a pointer) and there are situations where we will
spill the function address for reasons entirely unrelated to the function
arguments (e.g. if we have enough live local variables).
> > > > >
> > > > > For the second, looking at InlineSpiller.cpp it does
have the concept of rematerializing by folding a memory operand into another
instruction, so I think we could make use of that to do this. It looks like it
would involve adding a foldMemoryOperand function to ARMInstrInfo and then have
this fold a LDR into a BX by turning it into a BL.
> > > > >
> > > > > John
> > > > >
> > > > > ________________________________
> > > > > From: llvm-dev <llvm-dev-bounces at
lists.llvm.org> on behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at
lists.llvm.org>
> > > > > Sent: 07 April 2020 21:07
> > > > > To: llvm-dev at lists.llvm.org <llvm-dev at
lists.llvm.org>
> > > > > Subject: Re: [llvm-dev] [ARM] Register pressure with
-mthumb forces register reload before each call
> > > > >
> > > > > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni at linaro.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > > Compiling attached test-case, which is reduced
version of of
> > > > > > uECC_shared_secret from tinycrypt library [1],
with
> > > > > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S
> > > > > > results in reloading of register holding
function's address before
> > > > > > every call to blx:
> > > > > >
> > > > > >         ldr       r3, .LCPI0_0
> > > > > >         blx      r3
> > > > > >         mov    r0, r6
> > > > > >         mov    r1, r5
> > > > > >         mov    r2, r4
> > > > > >         ldr       r3, .LCPI0_0
> > > > > >         blx       r3
> > > > > >         ldr        r3, .LCPI0_0
> > > > > >         mov     r0, r6
> > > > > >         mov     r1, r5
> > > > > >         mov     r2, r4
> > > > > >         blx       r3
> > > > > >
> > > > > > .LCPI0_0:
> > > > > >         .long   foo
> > > > > >
> > > > > > From dump of regalloc (attached), AFAIU, what
seems to happen during
> > > > > > greedy allocator is, all virt regs %0 to %3 are
live across first two
> > > > > > calls to foo. Thus %0, %1 and %2 get assigned r6,
r5 and r4
> > > > > > respectively, and %3 which holds foo's address
doesn't have any
> > > > > > register left.
> > > > > > Since it's live-range has least weight, it
does not evict any existing interval,
> > > > > > and gets split. Eventually we have the following
allocation:
> > > > > >
> > > > > > [%0 -> $r6] tGPR
> > > > > > [%1 -> $r5] tGPR
> > > > > > [%2 -> $r4] tGPR
> > > > > > [%6 -> $r3] tGPR
> > > > > > [%11 -> $r3] tGPR
> > > > > > [%16 -> $r3] tGPR
> > > > > > [%17 -> $r3] tGPR
> > > > > >
> > > > > > where %6, %11, %16 and %17 all are derived from
%3.
> > > > > > And since r3 is a call-clobbered register, the
compiler is forced to
> > > > > > reload foo's address
> > > > > > each time before blx.
> > > > > >
> > > > > > To fix this, I thought of following approaches:
> > > > > > (a) Disable the heuristic to prefer indirect call
when there are at
> > > > > > least 3 calls to
> > > > > > same function in basic block in
ARMTargetLowering::LowerCall for Thumb-1 ISA.
> > > > > >
> > > > > > (b) In ARMTargetLowering::LowerCall, put another
constraint like
> > > > > > number of arguments, as a proxy for register
pressure for Thumb-1, but
> > > > > > that's bound to trip another cases.
> > > > > >
> > > > > > (c) Give higher priority to allocate vrit reg used
for indirect calls
> > > > > > ? However, if that
> > > > > > results in spilling of some other register, it
would defeat the
> > > > > > purpose of saving code-size. I suppose ideally we
want to trigger the
> > > > > > heuristic of using indirect call only when we know
beforehand that it
> > > > > > will not result in spilling. But I am not sure if
it's possible to
> > > > > > estimate that during isel ?
> > > > > >
> > > > > > I would be grateful for suggestions on how to
proceed further.
> > > > > ping ?
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > [1]
https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > _______________________________________________
> > > > > LLVM Developers mailing list
> > > > > llvm-dev at lists.llvm.org
> > > > >
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

llvm dev - Apr 2020 - [ARM] Register pressure with -mthumb forces register reload before each call

[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call

[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call