thr3ads.net - llvm dev - [LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets [Jan 2015]

Bjoern Haase

2015-Jan-11 00:31 UTC

[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets

Hello,

Please find enclosed a first patch adding tail call optimization for
thumb1 targets.
I assume that this list is the right place for publishing patches for
review?

Since this is my first proposal for LLVM, I'd very much appreciate your 
feedback.

What the patch is meant to do:

For tail calls identified during DAG generation, the target address is
loaded into a register via the constant pool. If R3 is used for argument
passing, the target address is forced into hard register R12 in order to
work around limitations of the thumb1 register allocator with respect to
the upper registers.
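
The decision described above can be modelled outside of LLVM. The following is a minimal sketch, not the patch itself: `CallInfo`, `AddressPlan` and `planCallAddress` are hypothetical names introduced for illustration, and the fields stand in for `EnableARMLongCalls`, `isTailCall`, `Subtarget->isThumb1Only()` and `RegsToPass.size()` from the diff below.

```cpp
#include <cassert>

// Hypothetical, simplified inputs. In the patch these come from the
// subtarget and from the call lowering state.
struct CallInfo {
  bool isTailCall;
  bool isThumb1Only;
  bool longCallsEnabled; // stands in for EnableARMLongCalls
  unsigned numRegArgs;   // stands in for RegsToPass.size()
};

struct AddressPlan {
  bool addressInRegister; // callee address must be loaded into a register
  bool forceR12;          // that register must be r12 because r0-r3 are taken
};

inline AddressPlan planCallAddress(const CallInfo &c) {
  const bool thumb1TailCall = c.isTailCall && c.isThumb1Only;
  const bool r3UsedForArgs = c.numRegArgs >= 4;

  AddressPlan plan;
  plan.addressInRegister = c.longCallsEnabled || thumb1TailCall;
  plan.forceR12 = thumb1TailCall && r3UsedForArgs;
  return plan;
}
```

With four register arguments on a thumb1 tail call, both flags come out true; with three, the address still goes through a register, but any free low register may hold it.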

I decided to fetch the target address into a register by a constant pool
lookup because, when analyzing the code, I found that the mechanisms are
also prepared for situations where parameters are passed both in registers
and on the stack. This would not be possible when using a BL // pop {pc}
sequence within the epilogue, since this would change the stack offsets.

During epilogue generation, the spilled registers are restored within
the emitEpilogue function. If LR happens to be spilled on the stack by
the prologue, it is restored by use of a scratch register just before
restoring the other registers.
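
To make the LR restore sequence concrete, here is a small stand-alone model of the stack traffic. It is only a sketch under simplifying assumptions: the `Machine` type and its helpers are hypothetical, memory is word-addressed, and only r4, lr and sp are tracked. It replays `push {r4, lr}` followed by `ldr r4, [sp, #4]; mov lr, r4; pop {r4}; add sp, #4`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical toy machine: a word-addressed, full-descending stack.
struct Machine {
  std::array<std::uint32_t, 16> mem{};
  unsigned sp = 16; // index of the last pushed word; 16 means empty
  std::uint32_t r4 = 0;
  std::uint32_t lr = 0;

  void push(std::uint32_t v) { assert(sp > 0); mem[--sp] = v; }
  std::uint32_t pop() { assert(sp < mem.size()); return mem[sp++]; }
  std::uint32_t loadSpRel(unsigned slot) const { return mem[sp + slot]; }
};

// push {r4, lr}: the lowest-numbered register ends up at the lowest address.
inline void prologue(Machine &m) {
  m.push(m.lr);
  m.push(m.r4);
}

// Epilogue used before the tail jump, with r4 as scratch register.
inline void tailCallEpilogue(Machine &m) {
  m.r4 = m.loadSpRel(1); // ldr r4, [sp, #4]: slot 1 holds the saved lr
  m.lr = m.r4;           // mov lr, r4
  m.r4 = m.pop();        // pop {r4}: restore the real r4
  m.sp += 1;             // add sp, #4: drop the slot that held lr
}
```

After `prologue` and `tailCallEpilogue` have both run, r4, lr and sp hold their values from function entry, which is the state the tail-called function needs to see.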

So far I have tested the code by hand with a number of tests, analyzing
the generated assembly. In the lit test suite I get 4 failures, which at
first analysis I attribute to the fact that the generated code for tail
calls differs from the output the tests expect.

Yours,

Björn
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sampleTests.tar.gz
Type: application/x-gzip
Size: 1119 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150111/57000648/attachment.bin>
-------------- next part --------------
Index: Thumb1FrameLowering.cpp
===================================================================
--- Thumb1FrameLowering.cpp	(revision 225589)
+++ Thumb1FrameLowering.cpp	(working copy)
@@ -323,11 +323,18 @@
 }
 
 void Thumb1FrameLowering::emitEpilogue(MachineFunction &MF,
-                                   MachineBasicBlock &MBB) const {
+                                       MachineBasicBlock &MBB) const {
   MachineBasicBlock::iterator MBBI = MBB.getLastNonDebugInstr();
   assert((MBBI->getOpcode() == ARM::tBX_RET ||
-          MBBI->getOpcode() == ARM::tPOP_RET) &&
-         "Can only insert epilog into returning blocks");
+          MBBI->getOpcode() == ARM::tPOP_RET ||
+          MBBI->getOpcode() == ARM::TCRETURNri)
+          && "Can only insert epilog into returning blocks "
+             "and tail calls with address in regs.");
+
+  bool IsTailCallReturn = false;
+  if (MBBI->getOpcode() == ARM::TCRETURNri)
+	  IsTailCallReturn = true;
+
   DebugLoc dl = MBBI->getDebugLoc();
   MachineFrameInfo *MFI = MF.getFrameInfo();
   ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
@@ -351,8 +358,8 @@
     if (NumBytes - ArgRegsSaveSize != 0)
       emitSPUpdate(MBB, MBBI, TII, dl, *RegInfo, NumBytes - ArgRegsSaveSize);
   } else {
-    // Unwind MBBI to point to first LDR / VLDRD.
-    if (MBBI != MBB.begin()) {
+    // Unwind MBBI to point to first LDR / VLDRD. Not for tail call returns!
+    if ((MBBI != MBB.begin()) && (!IsTailCallReturn)) {
       do
         --MBBI;
       while (MBBI != MBB.begin() && isCSRestore(MBBI, CSRegs));
@@ -395,12 +402,119 @@
     }
   }
 
-  bool IsV4PopReturn = false;
-  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo())
+  bool IsLRPartOfCalleeSavedInfo = false;
+
+  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
     if (CSI.getReg() == ARM::LR)
-      IsV4PopReturn = true;
+    	IsLRPartOfCalleeSavedInfo = true;
+    }
+
+  bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
   IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();
 
+  if (IsTailCallReturn) {
+    MBBI = MBB.getLastNonDebugInstr();
+
+    // First restore callee saved registers. Unlike for normal returns
+    // this is *not* done in restoreCalleeSavedRegisters.
+    const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
+
+    bool IsR4IncludedInCSI = false;
+    bool IsLRIncludedInCSI = false;
+    for (unsigned i = CSI.size(); i != 0; --i) {
+      unsigned Reg = CSI[i-1].getReg();
+      if (Reg == ARM::R4)
+        IsR4IncludedInCSI = true;
+      if (Reg == ARM::LR)
+        IsLRIncludedInCSI = true;
+    }
+
+    MachineFunction &MF = *MBB.getParent();
+    const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
+
+    // We need to additionally push/pop R4 in case that LR reconstruction
+    // for tail calls requires R4 as scratch register.
+    bool IsR4ToBeAdditionallyAddedToPopIns = false;
+
+    if (IsLRIncludedInCSI) {
+      // We need to restore LR and need a scratch register for this purpose
+      int StackSlotForSavedLR = CSI.size() - 1;
+      assert (StackSlotForSavedLR >= 0 && "Wrong Stack slot for LR.");
+
+      // Make sure that R4 may be used as scratch. Add an additional tPUSH (R4)
+      // if necessary.
+      if (!IsR4IncludedInCSI) {
+        IsR4ToBeAdditionallyAddedToPopIns = true;
+
+        AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH))
+            .addReg(ARM::R4,RegState::Kill));
+
+        StackSlotForSavedLR ++;
+      }
+
+      AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tLDRspi))
+        .addReg(ARM::R4, RegState::Define)
+        .addReg(ARM::SP)
+        .addImm(StackSlotForSavedLR));
+
+      AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tMOVr))
+        .addReg(ARM::LR, RegState::Define)
+        .addReg(ARM::R4, RegState::Kill));
+    }
+
+    MachineInstrBuilder MIB = BuildMI(MF, dl, TII.get(ARM::tPOP));
+    AddDefaultPred(MIB);
+
+    bool NumRegs = false;
+    for (unsigned i = CSI.size(); i != 0; --i) {
+      unsigned Reg = CSI[i-1].getReg();
+
+      if (Reg == ARM::LR)
+        continue;
+
+      MIB.addReg(Reg, getDefRegState(true));
+      NumRegs = true;
+    }
+
+    if (IsR4ToBeAdditionallyAddedToPopIns) {
+      MIB.addReg(ARM::R4, getDefRegState(true));
+      NumRegs = true;
+    }
+
+    // It's illegal to emit pop instruction without operands.
+    if (NumRegs)
+      MBB.insert(MBBI, &*MIB);
+    else
+      MF.DeleteMachineInstr(MIB);
+
+    if (IsLRIncludedInCSI) {
+      const Thumb1RegisterInfo *RegInfo =
+          static_cast<const Thumb1RegisterInfo *>
+                      (MF.getSubtarget().getRegisterInfo());
+
+      // Re-adjust stack pointer for LR content still residing on the stack.
+      emitSPUpdate(MBB, MBBI, TII, dl, *RegInfo, 4);
+    }
+
+    MachineOperand &JumpTarget = MBBI->getOperand(0);
+
+    assert (MBBI->getOpcode() == ARM::TCRETURNri);
+    DebugLoc dl = MBBI->getDebugLoc();
+
+    BuildMI(MBB, MBBI, dl,
+            TII.get(ARM::tTAILJMPr))
+     .addReg(JumpTarget.getReg(), RegState::Kill);
+
+    MachineInstr *NewMI = std::prev(MBBI);
+    for (unsigned i = 1, e = MBBI->getNumOperands(); i != e; ++i)
+      NewMI->addOperand(MBBI->getOperand(i));
+
+    // Delete the pseudo instruction TCRETURN.
+    MBB.erase(MBBI);
+    MBBI = NewMI;
+    return;
+  }
+
   // Unlike T2 and ARM mode, the T1 pop instruction cannot restore
   // to LR, and we can't pop the value directly to the PC since
   // we need to update the SP after popping the value. So instead
@@ -501,15 +615,25 @@
                             MachineBasicBlock::iterator MI,
                             const std::vector<CalleeSavedInfo> &CSI,
                             const TargetRegisterInfo *TRI) const {
+
+  MachineBasicBlock::iterator MBBI = MBB.getLastNonDebugInstr();
+  bool IsTailCallReturn = false;
+  if(MBBI->getOpcode() == ARM::TCRETURNri)
+    IsTailCallReturn = true;
+
   if (CSI.empty())
     return false;
 
+  // We will handle callee saving in Epilogue generation and not here.
+  if (IsTailCallReturn)
+    return true;
+
   MachineFunction &MF = *MBB.getParent();
   ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
   const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
-
   bool isVarArg = AFI->getArgRegsSaveSize() > 0;
   DebugLoc DL = MI->getDebugLoc();
+
   MachineInstrBuilder MIB = BuildMI(MF, DL, TII.get(ARM::tPOP));
   AddDefaultPred(MIB);
 
@@ -517,12 +641,15 @@
   for (unsigned i = CSI.size(); i != 0; --i) {
     unsigned Reg = CSI[i-1].getReg();
     if (Reg == ARM::LR) {
+
       // Special epilogue for vararg functions. See emitEpilogue
       if (isVarArg)
         continue;
-      // ARMv4T requires BX, see emitEpilogue
-      if (STI.hasV4TOps() && !STI.hasV5TOps())
+
+      // ARMv4T and tail call returns require BX, see emitEpilogue
+      if ((STI.hasV4TOps() && !STI.hasV5TOps()))
         continue;
+
       Reg = ARM::PC;
       (*MIB).setDesc(TII.get(ARM::tPOP_RET));
       MIB.copyImplicitOps(&*MI);
Index: ARMSubtarget.cpp
===================================================================
--- ARMSubtarget.cpp	(revision 225589)
+++ ARMSubtarget.cpp	(working copy)
@@ -262,7 +262,7 @@
     SupportsTailCall = !isTargetIOS() || !getTargetTriple().isOSVersionLT(5, 0);
   } else {
     IsR9Reserved = ReserveR9;
-    SupportsTailCall = !isThumb1Only();
+    SupportsTailCall = true;
   }
 
   if (Align == DefaultAlign) {
Index: ARMISelLowering.cpp
===================================================================
--- ARMISelLowering.cpp	(revision 225589)
+++ ARMISelLowering.cpp	(working copy)
@@ -1671,6 +1671,26 @@
     InFlag = SDValue();
   }
 
+  // For thumb1 targets, if R3 is used for argument passing, we need
+  // to place the call target address in IP (i.e. R12).
+  bool IsR3UsedForArgumentPassing = false;
+  if (RegsToPass.size() >= 4) {
+    IsR3UsedForArgumentPassing = true;
+  }
+
+  bool IsCallAddressMoveToRegisterRequired = false;
+  bool CallAdressShallBeForcedToHardRegR12 = false;
+
+  if  (EnableARMLongCalls || (isTailCall && Subtarget->isThumb1Only() ))
+  {
+	  IsCallAddressMoveToRegisterRequired = true;
+
+	  if (isTailCall
+	      && IsR3UsedForArgumentPassing
+	      && Subtarget->isThumb1Only() )
+		  CallAdressShallBeForcedToHardRegR12 = true;
+  }
+
   // If the callee is a GlobalAddress/ExternalSymbol node (quite common, every
   // direct call is) turn it into a TargetGlobalAddress/TargetExternalSymbol
   // node so that legalize doesn't hack it.
@@ -1679,10 +1699,12 @@
   bool isLocalARMFunc = false;
   ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
 
-  if (EnableARMLongCalls) {
+  if (IsCallAddressMoveToRegisterRequired) {
     assert((Subtarget->isTargetWindows() ||
+            (isTailCall && Subtarget->isThumb1Only()) ||
             getTargetMachine().getRelocationModel() == Reloc::Static) &&
            "long-calls with non-static relocation model!");
+
     // Handle a global address or an external symbol. If it's not one of
     // those, the target's already in a register, so we don't need to do
     // anything extra.
@@ -1695,11 +1717,14 @@
 
       // Get the address of the callee into a register
       SDValue CPAddr = DAG.getTargetConstantPool(CPV, getPointerTy(), 4);
+
       CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
+
       Callee = DAG.getLoad(getPointerTy(), dl,
                            DAG.getEntryNode(), CPAddr,
                            MachinePointerInfo::getConstantPool(),
                            false, false, false, 0);
+
     } else if (ExternalSymbolSDNode *S=dyn_cast<ExternalSymbolSDNode>(Callee)) {
       const char *Sym = S->getSymbol();
 
@@ -1785,6 +1810,12 @@
     }
   }
 
+  if (CallAdressShallBeForcedToHardRegR12) {
+	  Chain = DAG.getCopyToReg(Chain, dl, ARM::R12,
+                             Callee,Chain.getValue(1));
+    Callee = DAG.getRegister (ARM::R12,getPointerTy());
+  }
+
   // FIXME: handle tail calls differently.
   unsigned CallOpc;
   bool HasMinSizeAttr = MF.getFunction()->getAttributes().hasAttribute(
@@ -2000,26 +2031,6 @@
   if (isCalleeStructRet || isCallerStructRet)
     return false;
 
-  // FIXME: Completely disable sibcall for Thumb1 since Thumb1RegisterInfo::
-  // emitEpilogue is not ready for them. Thumb tail calls also use t2B, as
-  // the Thumb1 16-bit unconditional branch doesn't have sufficient relocation
-  // support in the assembler and linker to be used. This would need to be
-  // fixed to fully support tail calls in Thumb1.
-  //
-  // Doing this is tricky, since the LDM/POP instruction on Thumb doesn't take
-  // LR.  This means if we need to reload LR, it takes an extra instructions,
-  // which outweighs the value of the tail call; but here we don't know yet
-  // whether LR is going to be used.  Probably the right approach is to
-  // generate the tail call here and turn it back into CALL/RET in
-  // emitEpilogue if LR is used.
-
-  // Thumb1 PIC calls to external symbols use BX, so they can be tail calls,
-  // but we need to make sure there are enough registers; the only valid
-  // registers are the 4 used for parameters.  We don't currently do this
-  // case.
-  if (Subtarget->isThumb1Only())
-    return false;
-
   // Externally-defined functions with weak linkage should not be
   // tail-called on ARM when the OS does not support dynamic
   // pre-emption of symbols, as the AAELF spec requires normal calls
@@ -2365,7 +2376,7 @@
   if (!CI->isTailCall() || getTargetMachine().Options.DisableTailCalls)
     return false;
 
-  return !Subtarget->isThumb1Only();
+  return true;
 }
 
 // ConstantPool, JumpTable, GlobalAddress, and ExternalSymbol are lowered as

John Brawn

2015-Jan-12 14:50 UTC

Some comments on the general approach:
> For Tail calls identified during DAG generation, the target address
> will
> be loaded into a register
> by use of the constant pool. If R3 is used for argument passing, the
> target address is forced
> to hard reg R12 in order to overcome limitations thumb1 register
> allocator with respect to
> the upper registers.
For the case when r3 is not used for argument passing doing this is
definitely a good idea, but if we have to BX via r12 then we have
to first do an LDR to some other register as there's no instruction
to LDR directly to r12. With current trunk LLVM and your patch for

  int calledfn(int,int,int,int);
  int test() {
    return calledfn(1,2,3,4);
  }

I get

test:
        push    {r4, lr}
        movs    r0, #1
        movs    r1, #2
        movs    r2, #3
        movs    r3, #4
        ldr     r4, .LCPI0_0
        mov     r12, r4
        ldr     r4, [sp, #4]
        mov     lr, r4
        pop     {r4}
        add     sp, #4
        bx      r12

r4 has been used to load the target address, which means we have to save
and restore r4 which means we're no better off than not tailcalling. And
you can also see the next problem:
> During epilog generation, spill register restoring will be done within
> the emit epilogue function.
> If LR happens to be spilled on the stack by the prologue, it's restored
> by use of a scratch register
> just before restoring the other registers.
POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the
stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1
(ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles.
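
The arithmetic above can be checked mechanically. This is a minimal sketch that takes the per-instruction costs exactly as quoted in this mail (they are assumptions of the example, not values looked up in a datasheet); all identifiers are illustrative.

```cpp
#include <cassert>

// Costs as stated in the mail: POP is 1+N cycles, LDR is 2 cycles,
// MOV and ADD sp are 1 cycle each.
constexpr unsigned popCycles(unsigned numRegs) { return 1 + numRegs; }
constexpr unsigned kLdrCycles = 2;
constexpr unsigned kMovCycles = 1;
constexpr unsigned kAddSpCycles = 1;

// ldr r4, [sp, #4]; pop {r4}; mov lr, r4; add sp, #4
constexpr unsigned tailCallEpilogueCycles() {
  return kLdrCycles + popCycles(1) + kMovCycles + kAddSpCycles;
}

// A single pop that restores two registers at once.
constexpr unsigned twoRegisterPopCycles() { return popCycles(2); }

static_assert(tailCallEpilogueCycles() == 6, "2 + (1+1) + 1 + 1");
static_assert(twoRegisterPopCycles() == 3, "1 + 2");
```

Under these costs the scratch-register sequence takes twice as long as one two-register pop, which is the gap the mail points at.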

So I think tailcalling is only worthwhile if the function does not save
lr and r3 is free to hold the target address. Also needing consideration
is what happens if callee-saved registers other than lr need to be saved,
but I haven't looked into this.


A few comments on the patch itself (I've only given it a quick look over):

+  bool IsLRPartOfCalleeSavedInfo = false;
+
+  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
     if (CSI.getReg() == ARM::LR)
-      IsV4PopReturn = true;
+    	IsLRPartOfCalleeSavedInfo = true;
+    }
+
+  bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
   IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();
 
+  if (IsTailCallReturn) {
+    MBBI = MBB.getLastNonDebugInstr();
+
+    // First restore callee saved registers. Unlike for normal returns
+    // this is *not* done in restoreCalleeSavedRegisters.
+    const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
+
+    bool IsR4IncludedInCSI = false;
+    bool IsLRIncludedInCSI = false;
+    for (unsigned i = CSI.size(); i != 0; --i) {
+      unsigned Reg = CSI[i-1].getReg();
+      if (Reg == ARM::R4)
+        IsR4IncludedInCSI = true;
+      if (Reg == ARM::LR)
+        IsLRIncludedInCSI = true;
+    }

You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work
in setting IsLRIncludedInCSI.

+      // ARMv4T and tail call returns require BX, see emitEpilogue
+      if ((STI.hasV4TOps() && !STI.hasV5TOps()))

The comment says 'tail call returns require BX', but the code
doesn't do that.


John

Jonathan Roelofs

2015-Jan-12 15:18 UTC

Bjoern,

Thanks for the patch, this will be a really nice optimization to have :)

Would you mind using Phabricator for this? http://reviews.llvm.org/ It 
will make reviewing a bit easier on our end. It also helps if there's a 
lot of context in the diff, so `git diff -U999` is the way to go.

A few style nits:
     Please use 2-spaces instead of tabs for indentation.
     The new variable names seem a bit long compared to the rest of LLVM.

Please write some LIT test cases for this (including at least cases 
where R4 is and is not in CSI).

Correctness-wise, I'm a little concerned about forcing the call address 
into R12, as that might clash with the register scavenger (take that 
with a healthy dose of skepticism; I don't *know* that there's a problem
here, rather I'm just uncertain how the two will interact).


Cheers,

Jon

-- 
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded

Bjoern Haase

2015-Jan-12 22:43 UTC

Hello John,

thank you for your feedback.

On 12.01.2015 at 15:50, John Brawn wrote:
> Some comments on the general approach:
>
>> For Tail calls identified during DAG generation, the target address
>> will
>> be loaded into a register
>> by use of the constant pool. If R3 is used for argument passing, the
>> target address is forced
>> to hard reg R12 in order to overcome limitations thumb1 register
>> allocator with respect to
>> the upper registers.
> For the case when r3 is not used for argument passing doing this is
> definitely a good idea, but if we have to BX via r12 then we have
> to first do an LDR to some other register as there's no instruction
> to LDR directly to r12. With current trunk LLVM and your patch for
>
>    int calledfn(int,int,int,int);
>    int test() {
>      return calledfn(1,2,3,4);
>    }
>
> I get
>
> test:
>          push    {r4, lr}
>          movs    r0, #1
>          movs    r1, #2
>          movs    r2, #3
>          movs    r3, #4
>          ldr     r4, .LCPI0_0
>          mov     r12, r4
>          ldr     r4, [sp, #4]
>          mov     lr, r4
>          pop     {r4}
>          add     sp, #4
>          bx      r12
>
> r4 has been used to load the target address, which means we have to save
> and restore r4 which means we're no better off than not tailcalling.

Concerning speed: Yes. I agree.
For this specific extremely short function: Yes. I agree.

The benefit to expect (in my opinion) is *not* speed or code size. It's
stack usage. On a typical cortex-m0 with, say, 64k flash, 8k RAM and
functions with a stack frame of, say, 128 bytes, it may easily make a
difference whether the stack frame is released or not when returning from
a function and calling the next one. Actually, I am just working on
communication stack code that does exactly this very frequently and,
unfortunately, some tail calls occurring in the biggest stack-suckers
happen to use 4 parameters :-(.
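
The stack argument can be illustrated with a toy model of a call chain. This is a sketch under simplifying assumptions: each function has a fixed frame size, calls exactly one successor, and stack-passed arguments are ignored. `Frame` and `peakStackBytes` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One function in a linear call chain f0 -> f1 -> f2 -> ...
struct Frame {
  unsigned bytes;           // stack frame size of this function
  bool callsNextAsTailCall; // true if the call to the successor is a tail call
};

// Peak number of stack bytes that are live at any point along the chain.
inline unsigned peakStackBytes(const std::vector<Frame> &chain) {
  unsigned live = 0;
  unsigned peak = 0;
  for (const Frame &f : chain) {
    live += f.bytes;             // frame is set up on entry
    peak = std::max(peak, live);
    if (f.callsNextAsTailCall)
      live -= f.bytes;           // frame is released before the jump
  }
  return peak;
}
```

For three chained functions with 128-byte frames, the model yields a peak of 384 bytes with ordinary calls and 128 bytes when every call is a tail call.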

> And you can also see the next problem:
>> During epilog generation, spill register restoring will be done within
>> the emit epilogue function.
>> If LR happens to be spilled on the stack by the prologue, it's restored
>> by use of a scratch register
>> just before restoring the other registers.
> POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the
> stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1
> (ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles.
>
> So I think tailcalling is only worthwhile if the function does not save
> lr and r3 is free to hold the target address. Also needing consideration
> is what happens if callee-saved registers other than lr need to be saved,
> but I haven't looked into this.

I am aware of that, and I fully agree with you that, with respect to
speed and code size, the code will not improve and may get worse.

Why I still believe tail optimizations to be beneficial in many 
occasions is my personal experience on real-world projects. The last 15 
work on projects on "v6m-scale" microcontroller systems in different 
companies did teach me that mostly the true issue is not program memory 
and speed but RAM and stack usage. With respect to a typical RAM / flash 
ratio of 8k/64k, I'd like to say that 1 byte of RAM spared is roughly 
"worth" 8 bytes of flash memory. (I know and admit that whether the tail 
calls would actually save RAM strongly depends on whether the biggest 
stack frame shows up due to tail calls or inner calls.)
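
The 8:1 weighting follows directly from the quoted memory sizes. A minimal sketch of how such a weighting could be applied to a single trade-off; the function `worthTheFlash` and its inputs are hypothetical and only illustrate the argument.

```cpp
#include <cassert>

// Memory sizes of the example part quoted above.
constexpr unsigned kFlashBytes = 64 * 1024;
constexpr unsigned kRamBytes = 8 * 1024;

// How many bytes of flash one byte of RAM is "worth" at this ratio.
constexpr unsigned kFlashPerRamByte = kFlashBytes / kRamBytes;
static_assert(kFlashPerRamByte == 8, "64k / 8k");

// True if saving ramBytesSaved of stack justifies flashBytesAdded of code.
constexpr bool worthTheFlash(unsigned ramBytesSaved, unsigned flashBytesAdded) {
  return ramBytesSaved * kFlashPerRamByte > flashBytesAdded;
}
```

Under this weighting, releasing a 128-byte frame would justify a few extra epilogue instructions, whereas saving a single byte of RAM would not justify 8 bytes of additional code.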

If I were asked, I'd strongly advocate an "optimize for RAM" policy
for v6m, even if that means some limited compromises on code size and
speed (see also my other mail from yesterday). I don't think that there
is a right or wrong answer to this question. It is risky to base the
design choice on one single person's bias, so I'd suggest asking others
for their opinion. Maybe there is also some useful benchmarking code around.

If I knew how to implement it, I'd suggest some heuristics at DAG
generation for enabling tail call optimization if the expected stack
frame size gets larger than, say, 32 bytes, or if a specific compile
switch is set.
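
Such a heuristic might look roughly as follows. This is purely a sketch of the idea stated above: `TailCallPolicy` and `enableThumb1TailCall` are hypothetical names, no such option exists in the patch, and how the expected frame size would be obtained at DAG generation time is left open.

```cpp
#include <cassert>

// Hypothetical policy knobs: a frame-size threshold and an explicit switch.
struct TailCallPolicy {
  unsigned minFrameBytes = 32; // threshold suggested in the mail
  bool forceEnable = false;    // stands in for a compile switch
};

// Enable the tail call if the switch is set or the frame exceeds the threshold.
inline bool enableThumb1TailCall(unsigned expectedFrameBytes,
                                 const TailCallPolicy &policy = TailCallPolicy()) {
  return policy.forceEnable || expectedFrameBytes > policy.minFrameBytes;
}
```

With the default policy, a 128-byte frame would qualify and a 32-byte frame would not; setting `forceEnable` overrides the threshold.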
> A few comments on the patch itself (I've only given it a quick look over):
>
> +  bool IsLRPartOfCalleeSavedInfo = false;
> +
> +  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
>       if (CSI.getReg() == ARM::LR)
> -      IsV4PopReturn = true;
> +    	IsLRPartOfCalleeSavedInfo = true;
> +    }
> +
> +  bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
>     IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();
>   
> +  if (IsTailCallReturn) {
> +    MBBI = MBB.getLastNonDebugInstr();
> +
> +    // First restore callee saved registers. Unlike for normal returns
> +    // this is *not* done in restoreCalleeSavedRegisters.
> +    const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
> +
> +    bool IsR4IncludedInCSI = false;
> +    bool IsLRIncludedInCSI = false;
> +    for (unsigned i = CSI.size(); i != 0; --i) {
> +      unsigned Reg = CSI[i-1].getReg();
> +      if (Reg == ARM::R4)
> +        IsR4IncludedInCSI = true;
> +      if (Reg == ARM::LR)
> +        IsLRIncludedInCSI = true;
> +    }
>
> You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work
> in setting IsLRIncludedInCSI.

Thank you for pointing this out. You are right. That stemmed from copy
and paste. I previously had this part not in the epilogue generation but
in the register restoring function, where I did need the duplication.
>
> +      // ARMv4T and tail call returns require BX, see emitEpilogue
> +      if ((STI.hasV4TOps() && !STI.hasV5TOps()))
>
> The comment says 'tail call returns require BX', but the code doesn't do that.

Yes. In fact, I used the tail jump instruction, which is ultimately
implemented as BX, but I agree it would be better rephrased as
"require branch address in register".

As an additional question: when doing some execution testing in the
simulator, I stumbled over an issue with predicates in the
instructions.

+        AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH))
+            .addReg(ARM::R4,RegState::Kill));

It seems that the predicate information needs to be placed at a very 
specific position within the instruction operand list in order to 
prevent internal compiler errors. For push/pop, it seems that the 
predicate is required to be the first parameter, for others, there seems 
to be the requirement to place it directly after the last explicit 
operand. Is there some documentation on where to place the predicates?

Yours,

Björn.
