Prathamesh Kulkarni via llvm-dev
2020-Apr-15 23:39 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
On Wed, 15 Apr 2020 at 03:36, John Brawn <John.Brawn at arm.com> wrote:> > > Could you please point out what am I doing wrong in the patch ? > > It's because you're getting the function name by doing > callee->getName().str().c_str() > The str() call generates a temporary copy of the name which ceases to exist outside of this expression > causing the c_str() to return a pointer to no-longer-valid memory.Ah indeed, thanks for pointing it out!> > I'm not sure what the correct way of getting the name as a char* is here. Doing > callee->getName().data() > appears to work, though I don't know if we can rely on the StringReg that getName() returns being > appropriately null-terminated.Using MachineFunction::createExternalSymbolName seems to work.> > > However for above case, IIUC, we would want all calls to be converted to bl ? > It would be better yes, though I'm not sure how you'd go about making that happen. It's probably not > worth worrying too much about though, as the new behaviour is still better than the old.OK, thanks for the clarification. I will reg-test and submit an updated patch soon. Thanks, Prathamesh> > John > > ________________________________ > From: Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org> > Sent: 15 April 2020 01:44 > To: John Brawn <John.Brawn at arm.com> > Cc: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call > > Hi, > I have attached WIP patch for adding foldMemoryOperand to Thumb1InstrInfo. > For the following case: > > void f(int x, int y, int z) > { > void bar(int, int, int); > > bar(x, y, z); > bar(x, z, y); > bar(y, x, z); > bar(y, y, x); > } > > it calls foldMemoryOperand twice, and thus converts two calls from blx to bl. > callMI->dump() shows the function name "bar" correctly, however in > generated assembly call to bar is garbled: > (compiled with -Oz --target=arm-linux-gnueabi -marcha=armv6-m): > > add r7, sp, #16 > mov r6, r2 > mov r5, r1 > mov r4, r0 > bl "<90>w\n " > mov r1, r2 > mov r2, r5 > bl "<90>w\n " > mov r0, r5 > mov r1, r4 > mov r2, r6 > ldr r6, .LCPI0_0 > blx r6 > mov r0, r5 > mov r1, r5 > mov r2, r4 > blx r6 > > regalloc dump (attached) shows: > Inline spilling tGPR:%9 [80r,152r:0) 0 at 80r weight:3.209746e-03 > From original %3 > also spill snippet %8 [152r,232r:0) 0 at 152r weight:2.104167e-03 > tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit > killed $r0, implicit killed $r1, implicit killed $r2 > folded: 144r tBL 14, $noreg, &"\E0\9C\06\A0\FC\7F", > implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed > $r1, implicit killed $r2 :: (load 4 from constant-pool) > remat: 228r %10:tgpr = tLDRpci %const.0, 14, $noreg :: > (load 4 from constant-pool) > 232e %7:tgpr = COPY killed %10:tgpr > > Could you please point out what am I doing wrong in the patch ? > > Also, I guess, it only converted two calls to bl because further > spilling wasn't necessary. > However for above case, IIUC, we would want all calls to be converted to bl ? > Since, > 4 bl == 16 bytes > 2 bl + 2 blx + 1 lr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4 > bytes (litpool) == 18 bytes > > Thanks, > Prathamesh > > > > > On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni > <prathamesh.kulkarni at linaro.org> wrote: > > > > Hi John, > > Thanks for the suggestions! I will start looking at adding > > foldMemoryOperand to ARMInstrInfo. > > > > Thanks, > > Prathamesh > > > > On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at arm.com> wrote: > > > > > > If I'm understanding what's going on in this test correctly, what's happening is: > > > * ARMTargetLowering::LowerCall prefers indirect calls when a function is called at least 3 times in minsize > > > * In thumb 1 (without -fno-omit-frame-pointer) we have effectively only 3 callee-saved registers (r4-r6) > > > * The function has three arguments, so those three plus the register we need to hold the function address is more than our callee-saved registers > > > * Therefore something needs to be spilt > > > * The function address can be rematerialized, so we spill that and insert and LDR before each call > > > > > > If we didn't have this spilling happening (e.g. if the function had one less argument) then the code size of using BL vs BLX > > > * BL: 3*4-byte BL = 12 bytes > > > * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12 bytes > > > (So maybe even not considering spilling, LowerCall should be adjusted to do this for functions called 4 or more times) > > > > > > When we have to spill, if we compare spilling the functions address vs spilling an argument: > > > * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte litpool = 16 bytes > > > * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes > > > So just changing the spilling heuristic won't work. > > > > > > The two ways I see of fixing this: > > > * In LowerCall only prefer an indirect call if the number of integer register arguments is less than the number of callee-saved registers. > > > * When the load of the function address is spilled, instead of just rematerializing the load instead convert the BX back into BL. > > > > > > The first of these would be easier, but there will be situations where we need to use less than three callee-saved registers (e.g. arguments are loaded from a pointer) and there are situations where we will spill the function address for reasons entirely unrelated to the function arguments (e.g. if we have enough live local variables). > > > > > > For the second, looking at InlineSpiller.cpp it does have the concept of rematerializing by folding a memory operand into another instruction, so I think we could make use of that to do this. It looks like it would involve adding a foldMemoryOperand function to ARMInstrInfo and then have this fold a LDR into a BX by turning it into a BL. > > > > > > John > > > > > > ________________________________ > > > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org> > > > Sent: 07 April 2020 21:07 > > > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> > > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call > > > > > > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni > > > <prathamesh.kulkarni at linaro.org> wrote: > > > > > > > > Hi, > > > > Compiling attached test-case, which is reduced version of of > > > > uECC_shared_secret from tinycrypt library [1], with > > > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S > > > > results in reloading of register holding function's address before > > > > every call to blx: > > > > > > > > ldr r3, .LCPI0_0 > > > > blx r3 > > > > mov r0, r6 > > > > mov r1, r5 > > > > mov r2, r4 > > > > ldr r3, .LCPI0_0 > > > > blx r3 > > > > ldr r3, .LCPI0_0 > > > > mov r0, r6 > > > > mov r1, r5 > > > > mov r2, r4 > > > > blx r3 > > > > > > > > .LCPI0_0: > > > > .long foo > > > > > > > > From dump of regalloc (attached), AFAIU, what seems to happen during > > > > greedy allocator is, all virt regs %0 to %3 are live across first two > > > > calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4 > > > > respectively, and %3 which holds foo's address doesn't have any > > > > register left. > > > > Since it's live-range has least weight, it does not evict any existing interval, > > > > and gets split. Eventually we have the following allocation: > > > > > > > > [%0 -> $r6] tGPR > > > > [%1 -> $r5] tGPR > > > > [%2 -> $r4] tGPR > > > > [%6 -> $r3] tGPR > > > > [%11 -> $r3] tGPR > > > > [%16 -> $r3] tGPR > > > > [%17 -> $r3] tGPR > > > > > > > > where %6, %11, %16 and %17 all are derived from %3. > > > > And since r3 is a call-clobbered register, the compiler is forced to > > > > reload foo's address > > > > each time before blx. > > > > > > > > To fix this, I thought of following approaches: > > > > (a) Disable the heuristic to prefer indirect call when there are at > > > > least 3 calls to > > > > same function in basic block in ARMTargetLowering::LowerCall for Thumb-1 ISA. > > > > > > > > (b) In ARMTargetLowering::LowerCall, put another constraint like > > > > number of arguments, as a proxy for register pressure for Thumb-1, but > > > > that's bound to trip another cases. > > > > > > > > (c) Give higher priority to allocate vrit reg used for indirect calls > > > > ? However, if that > > > > results in spilling of some other register, it would defeat the > > > > purpose of saving code-size. I suppose ideally we want to trigger the > > > > heuristic of using indirect call only when we know beforehand that it > > > > will not result in spilling. But I am not sure if it's possible to > > > > estimate that during isel ? > > > > > > > > I would be grateful for suggestions on how to proceed further. > > > ping ? > > > > > > Thanks, > > > Prathamesh > > > > > > > > [1] https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139 > > > > > > > > Thanks, > > > > Prathamesh > > > _______________________________________________ > > > LLVM Developers mailing list > > > llvm-dev at lists.llvm.org > > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Prathamesh Kulkarni via llvm-dev
2020-Apr-18 00:42 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
On Wed, 15 Apr 2020 at 16:39, Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org> wrote:> > On Wed, 15 Apr 2020 at 03:36, John Brawn <John.Brawn at arm.com> wrote: > > > > > Could you please point out what am I doing wrong in the patch ? > > > > It's because you're getting the function name by doing > > callee->getName().str().c_str() > > The str() call generates a temporary copy of the name which ceases to exist outside of this expression > > causing the c_str() to return a pointer to no-longer-valid memory. > Ah indeed, thanks for pointing it out! > > > > I'm not sure what the correct way of getting the name as a char* is here. Doing > > callee->getName().data() > > appears to work, though I don't know if we can rely on the StringReg that getName() returns being > > appropriately null-terminated. > Using MachineFunction::createExternalSymbolName seems to work. > > > > > However for above case, IIUC, we would want all calls to be converted to bl ? > > It would be better yes, though I'm not sure how you'd go about making that happen. It's probably not > > worth worrying too much about though, as the new behaviour is still better than the old. > OK, thanks for the clarification. > I will reg-test and submit an updated patch soon.Hi, I have attached patch after reg-testing (make check-llvm shows no unexpected failures). Does it look OK ? Thanks, Prathamesh> > Thanks, > Prathamesh > > > > John > > > > ________________________________ > > From: Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org> > > Sent: 15 April 2020 01:44 > > To: John Brawn <John.Brawn at arm.com> > > Cc: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call > > > > Hi, > > I have attached WIP patch for adding foldMemoryOperand to Thumb1InstrInfo. > > For the following case: > > > > void f(int x, int y, int z) > > { > > void bar(int, int, int); > > > > bar(x, y, z); > > bar(x, z, y); > > bar(y, x, z); > > bar(y, y, x); > > } > > > > it calls foldMemoryOperand twice, and thus converts two calls from blx to bl. > > callMI->dump() shows the function name "bar" correctly, however in > > generated assembly call to bar is garbled: > > (compiled with -Oz --target=arm-linux-gnueabi -marcha=armv6-m): > > > > add r7, sp, #16 > > mov r6, r2 > > mov r5, r1 > > mov r4, r0 > > bl "<90>w\n " > > mov r1, r2 > > mov r2, r5 > > bl "<90>w\n " > > mov r0, r5 > > mov r1, r4 > > mov r2, r6 > > ldr r6, .LCPI0_0 > > blx r6 > > mov r0, r5 > > mov r1, r5 > > mov r2, r4 > > blx r6 > > > > regalloc dump (attached) shows: > > Inline spilling tGPR:%9 [80r,152r:0) 0 at 80r weight:3.209746e-03 > > From original %3 > > also spill snippet %8 [152r,232r:0) 0 at 152r weight:2.104167e-03 > > tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit > > killed $r0, implicit killed $r1, implicit killed $r2 > > folded: 144r tBL 14, $noreg, &"\E0\9C\06\A0\FC\7F", > > implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed > > $r1, implicit killed $r2 :: (load 4 from constant-pool) > > remat: 228r %10:tgpr = tLDRpci %const.0, 14, $noreg :: > > (load 4 from constant-pool) > > 232e %7:tgpr = COPY killed %10:tgpr > > > > Could you please point out what am I doing wrong in the patch ? > > > > Also, I guess, it only converted two calls to bl because further > > spilling wasn't necessary. > > However for above case, IIUC, we would want all calls to be converted to bl ? > > Since, > > 4 bl == 16 bytes > > 2 bl + 2 blx + 1 lr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4 > > bytes (litpool) == 18 bytes > > > > Thanks, > > Prathamesh > > > > > > > > > > On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni > > <prathamesh.kulkarni at linaro.org> wrote: > > > > > > Hi John, > > > Thanks for the suggestions! I will start looking at adding > > > foldMemoryOperand to ARMInstrInfo. > > > > > > Thanks, > > > Prathamesh > > > > > > On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at arm.com> wrote: > > > > > > > > If I'm understanding what's going on in this test correctly, what's happening is: > > > > * ARMTargetLowering::LowerCall prefers indirect calls when a function is called at least 3 times in minsize > > > > * In thumb 1 (without -fno-omit-frame-pointer) we have effectively only 3 callee-saved registers (r4-r6) > > > > * The function has three arguments, so those three plus the register we need to hold the function address is more than our callee-saved registers > > > > * Therefore something needs to be spilt > > > > * The function address can be rematerialized, so we spill that and insert and LDR before each call > > > > > > > > If we didn't have this spilling happening (e.g. if the function had one less argument) then the code size of using BL vs BLX > > > > * BL: 3*4-byte BL = 12 bytes > > > > * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12 bytes > > > > (So maybe even not considering spilling, LowerCall should be adjusted to do this for functions called 4 or more times) > > > > > > > > When we have to spill, if we compare spilling the functions address vs spilling an argument: > > > > * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte litpool = 16 bytes > > > > * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes > > > > So just changing the spilling heuristic won't work. > > > > > > > > The two ways I see of fixing this: > > > > * In LowerCall only prefer an indirect call if the number of integer register arguments is less than the number of callee-saved registers. > > > > * When the load of the function address is spilled, instead of just rematerializing the load instead convert the BX back into BL. > > > > > > > > The first of these would be easier, but there will be situations where we need to use less than three callee-saved registers (e.g. arguments are loaded from a pointer) and there are situations where we will spill the function address for reasons entirely unrelated to the function arguments (e.g. if we have enough live local variables). > > > > > > > > For the second, looking at InlineSpiller.cpp it does have the concept of rematerializing by folding a memory operand into another instruction, so I think we could make use of that to do this. It looks like it would involve adding a foldMemoryOperand function to ARMInstrInfo and then have this fold a LDR into a BX by turning it into a BL. > > > > > > > > John > > > > > > > > ________________________________ > > > > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org> > > > > Sent: 07 April 2020 21:07 > > > > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> > > > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call > > > > > > > > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni > > > > <prathamesh.kulkarni at linaro.org> wrote: > > > > > > > > > > Hi, > > > > > Compiling attached test-case, which is reduced version of of > > > > > uECC_shared_secret from tinycrypt library [1], with > > > > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S > > > > > results in reloading of register holding function's address before > > > > > every call to blx: > > > > > > > > > > ldr r3, .LCPI0_0 > > > > > blx r3 > > > > > mov r0, r6 > > > > > mov r1, r5 > > > > > mov r2, r4 > > > > > ldr r3, .LCPI0_0 > > > > > blx r3 > > > > > ldr r3, .LCPI0_0 > > > > > mov r0, r6 > > > > > mov r1, r5 > > > > > mov r2, r4 > > > > > blx r3 > > > > > > > > > > .LCPI0_0: > > > > > .long foo > > > > > > > > > > From dump of regalloc (attached), AFAIU, what seems to happen during > > > > > greedy allocator is, all virt regs %0 to %3 are live across first two > > > > > calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4 > > > > > respectively, and %3 which holds foo's address doesn't have any > > > > > register left. > > > > > Since it's live-range has least weight, it does not evict any existing interval, > > > > > and gets split. Eventually we have the following allocation: > > > > > > > > > > [%0 -> $r6] tGPR > > > > > [%1 -> $r5] tGPR > > > > > [%2 -> $r4] tGPR > > > > > [%6 -> $r3] tGPR > > > > > [%11 -> $r3] tGPR > > > > > [%16 -> $r3] tGPR > > > > > [%17 -> $r3] tGPR > > > > > > > > > > where %6, %11, %16 and %17 all are derived from %3. > > > > > And since r3 is a call-clobbered register, the compiler is forced to > > > > > reload foo's address > > > > > each time before blx. > > > > > > > > > > To fix this, I thought of following approaches: > > > > > (a) Disable the heuristic to prefer indirect call when there are at > > > > > least 3 calls to > > > > > same function in basic block in ARMTargetLowering::LowerCall for Thumb-1 ISA. > > > > > > > > > > (b) In ARMTargetLowering::LowerCall, put another constraint like > > > > > number of arguments, as a proxy for register pressure for Thumb-1, but > > > > > that's bound to trip another cases. > > > > > > > > > > (c) Give higher priority to allocate vrit reg used for indirect calls > > > > > ? However, if that > > > > > results in spilling of some other register, it would defeat the > > > > > purpose of saving code-size. I suppose ideally we want to trigger the > > > > > heuristic of using indirect call only when we know beforehand that it > > > > > will not result in spilling. But I am not sure if it's possible to > > > > > estimate that during isel ? > > > > > > > > > > I would be grateful for suggestions on how to proceed further. > > > > ping ? > > > > > > > > Thanks, > > > > Prathamesh > > > > > > > > > > [1] https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139 > > > > > > > > > > Thanks, > > > > > Prathamesh > > > > _______________________________________________ > > > > LLVM Developers mailing list > > > > llvm-dev at lists.llvm.org > > > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- A non-text attachment was scrubbed... Name: llvm-611-3.diff Type: text/x-patch Size: 2542 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200418/de306dd8/attachment.bin>
Prathamesh Kulkarni via llvm-dev
2020-Apr-24 18:42 UTC
[llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call
On Fri, 17 Apr 2020 at 17:42, Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org> wrote:> > On Wed, 15 Apr 2020 at 16:39, Prathamesh Kulkarni > <prathamesh.kulkarni at linaro.org> wrote: > > > > On Wed, 15 Apr 2020 at 03:36, John Brawn <John.Brawn at arm.com> wrote: > > > > > > > Could you please point out what am I doing wrong in the patch ? > > > > > > It's because you're getting the function name by doing > > > callee->getName().str().c_str() > > > The str() call generates a temporary copy of the name which ceases to exist outside of this expression > > > causing the c_str() to return a pointer to no-longer-valid memory. > > Ah indeed, thanks for pointing it out! > > > > > > I'm not sure what the correct way of getting the name as a char* is here. Doing > > > callee->getName().data() > > > appears to work, though I don't know if we can rely on the StringReg that getName() returns being > > > appropriately null-terminated. > > Using MachineFunction::createExternalSymbolName seems to work. > > > > > > > However for above case, IIUC, we would want all calls to be converted to bl ? > > > It would be better yes, though I'm not sure how you'd go about making that happen. It's probably not > > > worth worrying too much about though, as the new behaviour is still better than the old. > > OK, thanks for the clarification. > > I will reg-test and submit an updated patch soon. > Hi, I have attached patch after reg-testing (make check-llvm shows no > unexpected failures). > Does it look OK ?ping http://lists.llvm.org/pipermail/llvm-dev/2020-April/141048.html Thanks, Prathamesh> > Thanks, > Prathamesh > > > > > Thanks, > > Prathamesh > > > > > > John > > > > > > ________________________________ > > > From: Prathamesh Kulkarni <prathamesh.kulkarni at linaro.org> > > > Sent: 15 April 2020 01:44 > > > To: John Brawn <John.Brawn at arm.com> > > > Cc: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> > > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call > > > > > > Hi, > > > I have attached WIP patch for adding foldMemoryOperand to Thumb1InstrInfo. > > > For the following case: > > > > > > void f(int x, int y, int z) > > > { > > > void bar(int, int, int); > > > > > > bar(x, y, z); > > > bar(x, z, y); > > > bar(y, x, z); > > > bar(y, y, x); > > > } > > > > > > it calls foldMemoryOperand twice, and thus converts two calls from blx to bl. > > > callMI->dump() shows the function name "bar" correctly, however in > > > generated assembly call to bar is garbled: > > > (compiled with -Oz --target=arm-linux-gnueabi -marcha=armv6-m): > > > > > > add r7, sp, #16 > > > mov r6, r2 > > > mov r5, r1 > > > mov r4, r0 > > > bl "<90>w\n " > > > mov r1, r2 > > > mov r2, r5 > > > bl "<90>w\n " > > > mov r0, r5 > > > mov r1, r4 > > > mov r2, r6 > > > ldr r6, .LCPI0_0 > > > blx r6 > > > mov r0, r5 > > > mov r1, r5 > > > mov r2, r4 > > > blx r6 > > > > > > regalloc dump (attached) shows: > > > Inline spilling tGPR:%9 [80r,152r:0) 0 at 80r weight:3.209746e-03 > > > From original %3 > > > also spill snippet %8 [152r,232r:0) 0 at 152r weight:2.104167e-03 > > > tBL 14, $noreg, &bar, implicit-def $lr, implicit $sp, implicit > > > killed $r0, implicit killed $r1, implicit killed $r2 > > > folded: 144r tBL 14, $noreg, &"\E0\9C\06\A0\FC\7F", > > > implicit-def $lr, implicit $sp, implicit killed $r0, implicit killed > > > $r1, implicit killed $r2 :: (load 4 from constant-pool) > > > remat: 228r %10:tgpr = tLDRpci %const.0, 14, $noreg :: > > > (load 4 from constant-pool) > > > 232e %7:tgpr = COPY killed %10:tgpr > > > > > > Could you please point out what am I doing wrong in the patch ? > > > > > > Also, I guess, it only converted two calls to bl because further > > > spilling wasn't necessary. > > > However for above case, IIUC, we would want all calls to be converted to bl ? > > > Since, > > > 4 bl == 16 bytes > > > 2 bl + 2 blx + 1 lr == 2 * 4 (bl) + 2 * 2 (blx) + 1 * 2 (ldr) + 4 > > > bytes (litpool) == 18 bytes > > > > > > Thanks, > > > Prathamesh > > > > > > > > > > > > > > > On Fri, 10 Apr 2020 at 04:22, Prathamesh Kulkarni > > > <prathamesh.kulkarni at linaro.org> wrote: > > > > > > > > Hi John, > > > > Thanks for the suggestions! I will start looking at adding > > > > foldMemoryOperand to ARMInstrInfo. > > > > > > > > Thanks, > > > > Prathamesh > > > > > > > > On Tue, 7 Apr 2020 at 23:55, John Brawn <John.Brawn at arm.com> wrote: > > > > > > > > > > If I'm understanding what's going on in this test correctly, what's happening is: > > > > > * ARMTargetLowering::LowerCall prefers indirect calls when a function is called at least 3 times in minsize > > > > > * In thumb 1 (without -fno-omit-frame-pointer) we have effectively only 3 callee-saved registers (r4-r6) > > > > > * The function has three arguments, so those three plus the register we need to hold the function address is more than our callee-saved registers > > > > > * Therefore something needs to be spilt > > > > > * The function address can be rematerialized, so we spill that and insert and LDR before each call > > > > > > > > > > If we didn't have this spilling happening (e.g. if the function had one less argument) then the code size of using BL vs BLX > > > > > * BL: 3*4-byte BL = 12 bytes > > > > > * BX: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool = 12 bytes > > > > > (So maybe even not considering spilling, LowerCall should be adjusted to do this for functions called 4 or more times) > > > > > > > > > > When we have to spill, if we compare spilling the functions address vs spilling an argument: > > > > > * BX with spilt fn: 3*2-byte BX + 3*2-byte LDR + 4-byte litpool = 16 bytes > > > > > * BX with spilt arg: 3*2-byte BX + 1*2-byte LDR + 4-byte litpool + 1*2-byte STR + 2*2-byte LDR = 18 bytes > > > > > So just changing the spilling heuristic won't work. > > > > > > > > > > The two ways I see of fixing this: > > > > > * In LowerCall only prefer an indirect call if the number of integer register arguments is less than the number of callee-saved registers. > > > > > * When the load of the function address is spilled, instead of just rematerializing the load instead convert the BX back into BL. > > > > > > > > > > The first of these would be easier, but there will be situations where we need to use less than three callee-saved registers (e.g. arguments are loaded from a pointer) and there are situations where we will spill the function address for reasons entirely unrelated to the function arguments (e.g. if we have enough live local variables). > > > > > > > > > > For the second, looking at InlineSpiller.cpp it does have the concept of rematerializing by folding a memory operand into another instruction, so I think we could make use of that to do this. It looks like it would involve adding a foldMemoryOperand function to ARMInstrInfo and then have this fold a LDR into a BX by turning it into a BL. > > > > > > > > > > John > > > > > > > > > > ________________________________ > > > > > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Prathamesh Kulkarni via llvm-dev <llvm-dev at lists.llvm.org> > > > > > Sent: 07 April 2020 21:07 > > > > > To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org> > > > > > Subject: Re: [llvm-dev] [ARM] Register pressure with -mthumb forces register reload before each call > > > > > > > > > > On Tue, 31 Mar 2020 at 22:03, Prathamesh Kulkarni > > > > > <prathamesh.kulkarni at linaro.org> wrote: > > > > > > > > > > > > Hi, > > > > > > Compiling attached test-case, which is reduced version of of > > > > > > uECC_shared_secret from tinycrypt library [1], with > > > > > > --target=arm-linux-gnueabi -march=armv6-m -Oz -S > > > > > > results in reloading of register holding function's address before > > > > > > every call to blx: > > > > > > > > > > > > ldr r3, .LCPI0_0 > > > > > > blx r3 > > > > > > mov r0, r6 > > > > > > mov r1, r5 > > > > > > mov r2, r4 > > > > > > ldr r3, .LCPI0_0 > > > > > > blx r3 > > > > > > ldr r3, .LCPI0_0 > > > > > > mov r0, r6 > > > > > > mov r1, r5 > > > > > > mov r2, r4 > > > > > > blx r3 > > > > > > > > > > > > .LCPI0_0: > > > > > > .long foo > > > > > > > > > > > > From dump of regalloc (attached), AFAIU, what seems to happen during > > > > > > greedy allocator is, all virt regs %0 to %3 are live across first two > > > > > > calls to foo. Thus %0, %1 and %2 get assigned r6, r5 and r4 > > > > > > respectively, and %3 which holds foo's address doesn't have any > > > > > > register left. > > > > > > Since it's live-range has least weight, it does not evict any existing interval, > > > > > > and gets split. Eventually we have the following allocation: > > > > > > > > > > > > [%0 -> $r6] tGPR > > > > > > [%1 -> $r5] tGPR > > > > > > [%2 -> $r4] tGPR > > > > > > [%6 -> $r3] tGPR > > > > > > [%11 -> $r3] tGPR > > > > > > [%16 -> $r3] tGPR > > > > > > [%17 -> $r3] tGPR > > > > > > > > > > > > where %6, %11, %16 and %17 all are derived from %3. > > > > > > And since r3 is a call-clobbered register, the compiler is forced to > > > > > > reload foo's address > > > > > > each time before blx. > > > > > > > > > > > > To fix this, I thought of following approaches: > > > > > > (a) Disable the heuristic to prefer indirect call when there are at > > > > > > least 3 calls to > > > > > > same function in basic block in ARMTargetLowering::LowerCall for Thumb-1 ISA. > > > > > > > > > > > > (b) In ARMTargetLowering::LowerCall, put another constraint like > > > > > > number of arguments, as a proxy for register pressure for Thumb-1, but > > > > > > that's bound to trip another cases. > > > > > > > > > > > > (c) Give higher priority to allocate vrit reg used for indirect calls > > > > > > ? However, if that > > > > > > results in spilling of some other register, it would defeat the > > > > > > purpose of saving code-size. I suppose ideally we want to trigger the > > > > > > heuristic of using indirect call only when we know beforehand that it > > > > > > will not result in spilling. But I am not sure if it's possible to > > > > > > estimate that during isel ? > > > > > > > > > > > > I would be grateful for suggestions on how to proceed further. > > > > > ping ? > > > > > > > > > > Thanks, > > > > > Prathamesh > > > > > > > > > > > > [1] https://github.com/intel/tinycrypt/blob/master/lib/source/ecc_dh.c#L139 > > > > > > > > > > > > Thanks, > > > > > > Prathamesh > > > > > _______________________________________________ > > > > > LLVM Developers mailing list > > > > > llvm-dev at lists.llvm.org > > > > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev