Bjoern Haase
2015-Jan-11 00:31 UTC
[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets
Hello, find enclosed a first patch for adding tail call optimizations for thumb1 targets. I assume that this list is the right place for publishing patches for review? Since this is my first proposal for LLVM, I'd very much appreciate your feedback. What the patch is meant to do: For Tail calls identified during DAG generation, the target address will be loaded into a register by use of the constant pool. If R3 is used for argument passing, the target address is forced to hard reg R12 in order to overcome limitations thumb1 register allocator with respect to the upper registers. I decided to fetch the target address to a register by a constant pool lookup because when analyzing the code I found out, that the mechanisms are prepared also for situations, where parameters are both passed in regs and on the stack. This would not be possible when using a BL // pop {pc} sequence within the epilogue since this would change the stack offsets. During epilog generation, spill register restoring will be done within the emit epilogue function. If LR happens to be spilled on the stack by the prologue, it's restored by use of a scratch register just before restoring the other registers. I have so far tested the code by hand with a number of tests by analyzing generated assembly. In the lit testsuite I get 4 failures which I attribute at a first analysis to the fact that the generated code for tail calls results in different output that no longer matches the expectation strings. Yours, Björn -------------- next part -------------- A non-text attachment was scrubbed... Name: sampleTests.tar.gz Type: application/x-gzip Size: 1119 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150111/57000648/attachment.bin> -------------- next part -------------- Index: Thumb1FrameLowering.cpp ==================================================================--- Thumb1FrameLowering.cpp (Revision 225589) +++ Thumb1FrameLowering.cpp (Arbeitskopie) @@ -323,11 +323,18 @@ } void Thumb1FrameLowering::emitEpilogue(MachineFunction &MF, - MachineBasicBlock &MBB) const { + MachineBasicBlock &MBB) const { MachineBasicBlock::iterator MBBI = MBB.getLastNonDebugInstr(); assert((MBBI->getOpcode() == ARM::tBX_RET || - MBBI->getOpcode() == ARM::tPOP_RET) && - "Can only insert epilog into returning blocks"); + MBBI->getOpcode() == ARM::tPOP_RET || + MBBI->getOpcode() == ARM::TCRETURNri) + && "Can only insert epilog into returning blocks " + "and tail calls with address in regs."); + + bool IsTailCallReturn = false; + if (MBBI->getOpcode() == ARM::TCRETURNri) + IsTailCallReturn = true; + DebugLoc dl = MBBI->getDebugLoc(); MachineFrameInfo *MFI = MF.getFrameInfo(); ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>(); @@ -351,8 +358,8 @@ if (NumBytes - ArgRegsSaveSize != 0) emitSPUpdate(MBB, MBBI, TII, dl, *RegInfo, NumBytes - ArgRegsSaveSize); } else { - // Unwind MBBI to point to first LDR / VLDRD. - if (MBBI != MBB.begin()) { + // Unwind MBBI to point to first LDR / VLDRD. Not for tail call returns! + if ((MBBI != MBB.begin()) && (!IsTailCallReturn)) { do --MBBI; while (MBBI != MBB.begin() && isCSRestore(MBBI, CSRegs)); @@ -395,12 +402,119 @@ } } - bool IsV4PopReturn = false; - for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) + bool IsLRPartOfCalleeSavedInfo = false; + + for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) { if (CSI.getReg() == ARM::LR) - IsV4PopReturn = true; + IsLRPartOfCalleeSavedInfo = true; + } + + bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo; IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps(); + if (IsTailCallReturn) { + MBBI = MBB.getLastNonDebugInstr(); + + // First restore callee saved registers. Unlike for normal returns + // this is *not* done in restoreCalleeSavedRegisters. + const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo()); + + bool IsR4IncludedInCSI = false; + bool IsLRIncludedInCSI = false; + for (unsigned i = CSI.size(); i != 0; --i) { + unsigned Reg = CSI[i-1].getReg(); + if (Reg == ARM::R4) + IsR4IncludedInCSI = true; + if (Reg == ARM::LR) + IsLRIncludedInCSI = true; + } + + MachineFunction &MF = *MBB.getParent(); + const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo(); + + // We need to additionally push/pop R4 in case that LR reconstruction + // for tail calls requires R4 as scratch register. + bool IsR4ToBeAdditionallyAddedToPopIns = false; + + if (IsLRIncludedInCSI) { + // We need to restore LR and need a scratch register for this purpose + int StackSlotForSavedLR = CSI.size() - 1; + assert (StackSlotForSavedLR >= 0 && "Wrong Stack slot for LR."); + + // Make sure that R4 may be used as scratch. Add an additional tPUSH (R4) + // if necessary. + if (!IsR4IncludedInCSI) { + IsR4ToBeAdditionallyAddedToPopIns = true; + + AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH)) + .addReg(ARM::R4,RegState::Kill)); + + StackSlotForSavedLR ++; + } + + AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tLDRspi)) + .addReg(ARM::R4, RegState::Define) + .addReg(ARM::SP) + .addImm(StackSlotForSavedLR)); + + AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tMOVr)) + .addReg(ARM::LR, RegState::Define) + .addReg(ARM::R4, RegState::Kill)); + } + + MachineInstrBuilder MIB = BuildMI(MF, dl, TII.get(ARM::tPOP)); + AddDefaultPred(MIB); + + bool NumRegs = false; + for (unsigned i = CSI.size(); i != 0; --i) { + unsigned Reg = CSI[i-1].getReg(); + + if (Reg == ARM::LR) + continue; + + MIB.addReg(Reg, getDefRegState(true)); + NumRegs = true; + } + + if (IsR4ToBeAdditionallyAddedToPopIns) { + MIB.addReg(ARM::R4, getDefRegState(true)); + NumRegs = true; + } + + // It's illegal to emit pop instruction without operands. + if (NumRegs) + MBB.insert(MBBI, &*MIB); + else + MF.DeleteMachineInstr(MIB); + + if (IsLRIncludedInCSI) { + const Thumb1RegisterInfo *RegInfo + static_cast<const Thumb1RegisterInfo *> + (MF.getSubtarget().getRegisterInfo()); + + // Re-adjust stack pointer for LR content still residing on the stack. + emitSPUpdate(MBB, MBBI, TII, dl, *RegInfo, 4); + } + + MachineOperand &JumpTarget = MBBI->getOperand(0); + + assert (MBBI->getOpcode() == ARM::TCRETURNri); + DebugLoc dl = MBBI->getDebugLoc(); + + BuildMI(MBB, MBBI, dl, + TII.get(ARM::tTAILJMPr)) + .addReg(JumpTarget.getReg(), RegState::Kill); + + MachineInstr *NewMI = std::prev(MBBI); + for (unsigned i = 1, e = MBBI->getNumOperands(); i != e; ++i) + NewMI->addOperand(MBBI->getOperand(i)); + + // Delete the pseudo instruction TCRETURN. + MBB.erase(MBBI); + MBBI = NewMI; + return; + } + // Unlike T2 and ARM mode, the T1 pop instruction cannot restore // to LR, and we can't pop the value directly to the PC since // we need to update the SP after popping the value. So instead @@ -501,15 +615,25 @@ MachineBasicBlock::iterator MI, const std::vector<CalleeSavedInfo> &CSI, const TargetRegisterInfo *TRI) const { + + MachineBasicBlock::iterator MBBI = MBB.getLastNonDebugInstr(); + bool IsTailCallReturn = false; + if(MBBI->getOpcode() == ARM::TCRETURNri) + IsTailCallReturn = true; + if (CSI.empty()) return false; + // We will handle callee saving in Epilogue generation and not here. + if (IsTailCallReturn) + return true; + MachineFunction &MF = *MBB.getParent(); ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>(); const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo(); - bool isVarArg = AFI->getArgRegsSaveSize() > 0; DebugLoc DL = MI->getDebugLoc(); + MachineInstrBuilder MIB = BuildMI(MF, DL, TII.get(ARM::tPOP)); AddDefaultPred(MIB); @@ -517,12 +641,15 @@ for (unsigned i = CSI.size(); i != 0; --i) { unsigned Reg = CSI[i-1].getReg(); if (Reg == ARM::LR) { + // Special epilogue for vararg functions. See emitEpilogue if (isVarArg) continue; - // ARMv4T requires BX, see emitEpilogue - if (STI.hasV4TOps() && !STI.hasV5TOps()) + + // ARMv4T and tail call returns require BX, see emitEpilogue + if ((STI.hasV4TOps() && !STI.hasV5TOps())) continue; + Reg = ARM::PC; (*MIB).setDesc(TII.get(ARM::tPOP_RET)); MIB.copyImplicitOps(&*MI); Index: ARMSubtarget.cpp ==================================================================--- ARMSubtarget.cpp (Revision 225589) +++ ARMSubtarget.cpp (Arbeitskopie) @@ -262,7 +262,7 @@ SupportsTailCall = !isTargetIOS() || !getTargetTriple().isOSVersionLT(5, 0); } else { IsR9Reserved = ReserveR9; - SupportsTailCall = !isThumb1Only(); + SupportsTailCall = true; } if (Align == DefaultAlign) { Index: ARMISelLowering.cpp ==================================================================--- ARMISelLowering.cpp (Revision 225589) +++ ARMISelLowering.cpp (Arbeitskopie) @@ -1671,6 +1671,26 @@ InFlag = SDValue(); } + // For thumb1 targets, if R3 is used for argument passing, we need + // to place the call target address in IP (i.e. R12). + bool IsR3UsedForArgumentPassing = false; + if (RegsToPass.size() >= 4) { + IsR3UsedForArgumentPassing = true; + } + + bool IsCallAddressMoveToRegisterRequired = false; + bool CallAdressShallBeForcedToHardRegR12 = false; + + if (EnableARMLongCalls || (isTailCall && Subtarget->isThumb1Only() )) + { + IsCallAddressMoveToRegisterRequired = true; + + if (isTailCall + && IsR3UsedForArgumentPassing + && Subtarget->isThumb1Only() ) + CallAdressShallBeForcedToHardRegR12 = true; + } + // If the callee is a GlobalAddress/ExternalSymbol node (quite common, every // direct call is) turn it into a TargetGlobalAddress/TargetExternalSymbol // node so that legalize doesn't hack it. @@ -1679,10 +1699,12 @@ bool isLocalARMFunc = false; ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>(); - if (EnableARMLongCalls) { + if (IsCallAddressMoveToRegisterRequired) { assert((Subtarget->isTargetWindows() || + (isTailCall && Subtarget->isThumb1Only()) || getTargetMachine().getRelocationModel() == Reloc::Static) && "long-calls with non-static relocation model!"); + // Handle a global address or an external symbol. If it's not one of // those, the target's already in a register, so we don't need to do // anything extra. @@ -1695,11 +1717,14 @@ // Get the address of the callee into a register SDValue CPAddr = DAG.getTargetConstantPool(CPV, getPointerTy(), 4); + CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr); + Callee = DAG.getLoad(getPointerTy(), dl, DAG.getEntryNode(), CPAddr, MachinePointerInfo::getConstantPool(), false, false, false, 0); + } else if (ExternalSymbolSDNode *S=dyn_cast<ExternalSymbolSDNode>(Callee)) { const char *Sym = S->getSymbol(); @@ -1785,6 +1810,12 @@ } } + if (CallAdressShallBeForcedToHardRegR12) { + Chain = DAG.getCopyToReg(Chain, dl, ARM::R12, + Callee,Chain.getValue(1)); + Callee = DAG.getRegister (ARM::R12,getPointerTy()); + } + // FIXME: handle tail calls differently. unsigned CallOpc; bool HasMinSizeAttr = MF.getFunction()->getAttributes().hasAttribute( @@ -2000,26 +2031,6 @@ if (isCalleeStructRet || isCallerStructRet) return false; - // FIXME: Completely disable sibcall for Thumb1 since Thumb1RegisterInfo:: - // emitEpilogue is not ready for them. Thumb tail calls also use t2B, as - // the Thumb1 16-bit unconditional branch doesn't have sufficient relocation - // support in the assembler and linker to be used. This would need to be - // fixed to fully support tail calls in Thumb1. - // - // Doing this is tricky, since the LDM/POP instruction on Thumb doesn't take - // LR. This means if we need to reload LR, it takes an extra instructions, - // which outweighs the value of the tail call; but here we don't know yet - // whether LR is going to be used. Probably the right approach is to - // generate the tail call here and turn it back into CALL/RET in - // emitEpilogue if LR is used. - - // Thumb1 PIC calls to external symbols use BX, so they can be tail calls, - // but we need to make sure there are enough registers; the only valid - // registers are the 4 used for parameters. We don't currently do this - // case. - if (Subtarget->isThumb1Only()) - return false; - // Externally-defined functions with weak linkage should not be // tail-called on ARM when the OS does not support dynamic // pre-emption of symbols, as the AAELF spec requires normal calls @@ -2365,7 +2376,7 @@ if (!CI->isTailCall() || getTargetMachine().Options.DisableTailCalls) return false; - return !Subtarget->isThumb1Only(); + return true; } // ConstantPool, JumpTable, GlobalAddress, and ExternalSymbol are lowered as
John Brawn
2015-Jan-12 14:50 UTC
[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets
Some comments on the general approach:> For Tail calls identified during DAG generation, the target address > will > be loaded into a register > by use of the constant pool. If R3 is used for argument passing, the > target address is forced > to hard reg R12 in order to overcome limitations thumb1 register > allocator with respect to > the upper registers.For the case when r3 is not used for argument passing doing this is definitely a good idea, but if we have to BX via r12 then we have to first do an LDR to some other register as there's no instruction to LDR directly to r12. With current trunk LLVM and your patch for int calledfn(int,int,int,int); int test() { return calledfn(1,2,3,4); } I get test: push {r4, lr} movs r0, #1 movs r1, #2 movs r2, #3 movs r3, #4 ldr r4, .LCPI0_0 mov r12, r4 ldr r4, [sp, #4] mov lr, r4 pop {r4} add sp, #4 bx r12 r4 has been used to load the target address, which means we have to save and restore r4 which means we're no better off than not tailcalling. And you can also see the next problem:> During epilog generation, spill register restoring will be done within > the emit epilogue function. > If LR happens to be spilled on the stack by the prologue, it's restored > by use of a scratch register > just before restoring the other registers.POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1 (ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles. So I think tailcalling is only worthwhile if the function does not save lr and r3 is free to hold the target address. Also needing consideration is what happens if callee-saved registers other than lr need to be saved, but I haven't looked into this. A few comments on the patch itself (I've only given it a quick look over): + bool IsLRPartOfCalleeSavedInfo = false; + + for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) { if (CSI.getReg() == ARM::LR) - IsV4PopReturn = true; + IsLRPartOfCalleeSavedInfo = true; + } + + bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo; IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps(); + if (IsTailCallReturn) { + MBBI = MBB.getLastNonDebugInstr(); + + // First restore callee saved registers. Unlike for normal returns + // this is *not* done in restoreCalleeSavedRegisters. + const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo()); + + bool IsR4IncludedInCSI = false; + bool IsLRIncludedInCSI = false; + for (unsigned i = CSI.size(); i != 0; --i) { + unsigned Reg = CSI[i-1].getReg(); + if (Reg == ARM::R4) + IsR4IncludedInCSI = true; + if (Reg == ARM::LR) + IsLRIncludedInCSI = true; + } You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work in setting IsLRIncludedInCSI. + // ARMv4T and tail call returns require BX, see emitEpilogue + if ((STI.hasV4TOps() && !STI.hasV5TOps())) The comment says 'tail call returns require BX', but the code doesn't do that. John
Jonathan Roelofs
2015-Jan-12 15:18 UTC
[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets
Bjoern, Thanks for the patch, this will be a really nice optimization to have :) Would you mind using Phabricator for this? http://reviews.llvm.org/ It will make reviewing a bit easier on our end. It also helps if there's a lot of context in the diff, so `git diff -U999` is the way to go. A few style nits: Please use 2-spaces instead of tabs for indentation. The new variable names seem a bit long compared to the rest of LLVM. Please write some LIT test cases for this (including at least cases where R4 is and is not in CSI). Correctness-wise, I'm a little concerned about forcing the call address into R12, as that might clash with the register scavenger (take that with a healthy dose of skepticism; I don't *know* that there's a problem here, rather I'm just uncertain how the two will interact). Cheers, Jon On 1/10/15 5:31 PM, Bjoern Haase wrote:> Hello, > > find enclosed a first patch for adding tail call optimizations for > thumb1 targets. > I assume that this list is the right place for publishing patches for > review? > > Since this is my first proposal for LLVM, I'd very much appreciate your > feedback. > > What the patch is meant to do: > > For Tail calls identified during DAG generation, the target address will > be loaded into a register > by use of the constant pool. If R3 is used for argument passing, the > target address is forced > to hard reg R12 in order to overcome limitations thumb1 register > allocator with respect to > the upper registers. > > I decided to fetch the target address to a register by a constant pool > lookup because when > analyzing the code I found out, that the mechanisms are prepared also > for situations, where > parameters are both passed in regs and on the stack. This would not be > possible when using > a BL // pop {pc} sequence within the epilogue since this would change > the stack offsets. > > During epilog generation, spill register restoring will be done within > the emit epilogue function. > If LR happens to be spilled on the stack by the prologue, it's restored > by use of a scratch register > just before restoring the other registers. > > I have so far tested the code by hand with a number of tests by > analyzing generated assembly. > In the lit testsuite I get 4 failures which I attribute at a first > analysis to the fact that the generated code for tail calls > results in different output that no longer matches the expectation strings. > > Yours, > > Björn > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >-- Jon Roelofs jonathan at codesourcery.com CodeSourcery / Mentor Embedded
Bjoern Haase
2015-Jan-12 22:43 UTC
[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets
Hello John, thank you for your feedback. Am 12.01.2015 um 15:50 schrieb John Brawn:> Some comments on the general approach: > >> For Tail calls identified during DAG generation, the target address >> will >> be loaded into a register >> by use of the constant pool. If R3 is used for argument passing, the >> target address is forced >> to hard reg R12 in order to overcome limitations thumb1 register >> allocator with respect to >> the upper registers. > For the case when r3 is not used for argument passing doing this is > definitely a good idea, but if we have to BX via r12 then we have > to first do an LDR to some other register as there's no instruction > to LDR directly to r12. With current trunk LLVM and your patch for > > int calledfn(int,int,int,int); > int test() { > return calledfn(1,2,3,4); > } > > I get > > test: > push {r4, lr} > movs r0, #1 > movs r1, #2 > movs r2, #3 > movs r3, #4 > ldr r4, .LCPI0_0 > mov r12, r4 > ldr r4, [sp, #4] > mov lr, r4 > pop {r4} > add sp, #4 > bx r12 > > r4 has been used to load the target address, which means we have to save > and restore r4 which means we're no better off than not tailcalling.Concerning speed: Yes. I agree. For this specific extremely short function: Yes. I agree. The benefit to expect (in my opinion) is *not* speed or code size. It's stack usage. On a typical cortex-m0 with say 64k Flash, 8k RAM and functions with stack frame of say 128 bytes it actually may easily make a difference if the stack frame is released or not when returning from a function and calling the next one. Actually, I am just working on a communication stack code that does do exactly this very frequently and unfortunately, some tail calls occuring in the biggest stack-suckers happen use 4 parameters :-(). And you can also see the next problem:>> During epilog generation, spill register restoring will be done within >> the emit epilogue function. >> If LR happens to be spilled on the stack by the prologue, it's restored >> by use of a scratch register >> just before restoring the other registers. > POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the > stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1 > (ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles. > > So I think tailcalling is only worthwhile if the function does not save > lr and r3 is free to hold the target address. Also needing consideration > is what happens if callee-saved registers other than lr need to be saved, > but I haven't looked into this.I am aware of that and I fully agree with you in that with respect to speed and code size code will not get improved or worse. Why I still believe tail optimizations to be beneficial in many occasions is my personal experience on real-world projects. The last 15 work on projects on "v6m-scale" microcontroller systems in different companies did teach me that mostly the true issue is not program memory and speed but RAM and stack usage. With respect to a typical RAM / flash ratio of 8k/64k, I'd like to say that 1 byte of RAM spared is roughly "worth" 8 bytes of flash memory. (I know and admit, that the question wether the tail calls would actually save RAM strongly depends on the question whether the biggest stack frame shows up due to tail calls or inner calls.) If I'd be asked, I'd strongly advocate for a "optimize for RAM" policy for v6m even if making some limited compromises for code size and speed. (see also my other mail from yesterday). I don't think that there is a wrong or right "answer" to this question. It is risky to base the design choice on one single person's bias, so I'd suggest to ask others for their opinion. Maybe there is also some useful benchmarking code around. If I knew how to do implement it, I'd be suggesting some heuristics at DAG generation for enabling tail call optimization if the expected stack frame size gets larger than, say 32 bytes or a specific compile switch.> A few comments on the patch itself (I've only given it a quick look over): > > + bool IsLRPartOfCalleeSavedInfo = false; > + > + for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) { > if (CSI.getReg() == ARM::LR) > - IsV4PopReturn = true; > + IsLRPartOfCalleeSavedInfo = true; > + } > + > + bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo; > IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps(); > > + if (IsTailCallReturn) { > + MBBI = MBB.getLastNonDebugInstr(); > + > + // First restore callee saved registers. Unlike for normal returns > + // this is *not* done in restoreCalleeSavedRegisters. > + const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo()); > + > + bool IsR4IncludedInCSI = false; > + bool IsLRIncludedInCSI = false; > + for (unsigned i = CSI.size(); i != 0; --i) { > + unsigned Reg = CSI[i-1].getReg(); > + if (Reg == ARM::R4) > + IsR4IncludedInCSI = true; > + if (Reg == ARM::LR) > + IsLRIncludedInCSI = true; > + } > > You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work > in setting IsLRIncludedInCSI.Thank you for pointing out this You are right. That did stem from copy and paste. I previously had this part in the epilogue generation but in the register restoring function, where I did need the duplication.> > + // ARMv4T and tail call returns require BX, see emitEpilogue > + if ((STI.hasV4TOps() && !STI.hasV5TOps())) > > The comment says 'tail call returns require BX', but the code doesn't do that.Yes. In fact, I did use the tailjump instruction, which actually finally is implemented BX, but I agree, It should better be re-phrased with "require branch address in register". As an additional question. When doing some execution testing in the simulator, I did stumble myself over an issue with predicates in the instructions. + AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH)) + .addReg(ARM::R4,RegState::Kill)); It seems that the predicate information needs to be placed at a very specific position within the instruction operand list in order to prevent internal compiler errors. For push/pop, it seems that the predicate is required to be the first parameter, for others, there seems to be the requirement to place it directly after the last explicit operand. Is there some documentation on where to place the predicates? Yours, Björn.