Jeffrey Yasskin
2009-Jun-10 19:17 UTC
[LLVMdev] Why does the x86-64 JIT emit stubs for external calls?
In X86CodeGen.cpp, the following code appears in the handler used for CALL64pcrel32 instructions:

    // Assume undefined functions may be outside the Small codespace.
    bool NeedStub =
      (Is64BitMode &&
       (TM.getCodeModel() == CodeModel::Large ||
        TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
      Opcode == X86::TAILJMPd;
    emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
                      MO.getOffset(), 0, NeedStub);

This causes every external call to be emitted as a call to a stub which then jumps to the real function.

I understand, thanks to the helpful folks on #llvm, that calls across more than 31 bits of address space need to be emitted as a "mov $ADDRESS, r10; call *r10" pair instead of the simple "call rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call pair emitted inline? And why are Darwin and TAILJMPs special?

Having this out of line seems to lose up to 2% performance on the Unladen Swallow benchmarks, so, while it's not urgent, it'd be nice to figure out how to avoid the stubs.

What kind of patch would be welcome to fix this?

Thanks,
Jeffrey
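The 31-bit reach mentioned above comes from the signed 32-bit displacement in the direct call encoding. A minimal sketch of that range check follows. `fitsInRel32` is a hypothetical helper written for illustration, not an LLVM function, and it assumes the 5-byte `E8 rel32` form of the call, whose displacement is measured from the end of the instruction.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of LLVM): returns true when a 5-byte
// "call rel32" placed at callSite can reach target directly.
bool fitsInRel32(std::uint64_t callSite, std::uint64_t target) {
  // The displacement is relative to the address of the next instruction:
  // one opcode byte (E8) plus a 4-byte displacement.
  const std::uint64_t nextInsn = callSite + 5;
  // Unsigned subtraction wraps around; reinterpret it as a signed distance.
  const std::int64_t disp = static_cast<std::int64_t>(target - nextInsn);
  return disp >= std::numeric_limits<std::int32_t>::min() &&
         disp <= std::numeric_limits<std::int32_t>::max();
}
```

When a check like this fails for a given call site and callee, the call needs either a stub or the mov+call pair.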
Evan Cheng
2009-Jun-11 19:54 UTC
[LLVMdev] Why does the x86-64 JIT emit stubs for external calls?
On Jun 10, 2009, at 12:17 PM, Jeffrey Yasskin wrote:

> I understand, thanks to the helpful folks on #llvm, that calls across more than 31 bits of address space need to be emitted as a "mov $ADDRESS, r10; call *r10" pair instead of the simple "call rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call pair emitted inline? And why are Darwin and TAILJMPs special?

This is needed because of lazy compilation: before the callee is resolved, it is just a JIT stub. The stub is heap allocated, so it may not be in the lower 4G even if the code size model is small. I know this is the case on Darwin x86_64; I am not sure about other targets. I forget why this is needed for tail calls, sorry.

In theory we can make the code generator emit the mov+call inline; in reality, it doesn't know whether it's jitting or not. Also, we really want to keep the code generation the same (as much as possible) whether it's jitting or compiling. One possible solution is to add a code size model specifically for the JIT, so the code generator can generate more efficient code in that configuration.

Evan
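To make the two alternatives in this exchange concrete, here is a rough sketch of the machine code involved: an out-of-line stub that loads the callee address into r10 and jumps, versus the same load followed by an indirect call emitted at the call site. The function names are hypothetical and this is not the JIT's actual emitter code. The byte values are the standard x86-64 encodings for `movabs $imm64, %r10` (49 BA imm64), `jmp *%r10` (41 FF E2) and `call *%r10` (41 FF D2).

```cpp
#include <cstdint>
#include <vector>

// Appends "movabs $target, %r10" to buf: REX.W+REX.B prefix, opcode B8+r
// with r = 2 (the low three bits of r10), then a little-endian imm64.
static void emitMovR10(std::vector<std::uint8_t> &buf, std::uint64_t target) {
  buf.push_back(0x49);
  buf.push_back(0xBA);
  for (int i = 0; i < 8; ++i)
    buf.push_back(static_cast<std::uint8_t>(target >> (8 * i)));
}

// Hypothetical: body of an out-of-line stub that tail-jumps to target.
std::vector<std::uint8_t> makeFarJumpStub(std::uint64_t target) {
  std::vector<std::uint8_t> buf;
  emitMovR10(buf, target);
  buf.insert(buf.end(), {0x41, 0xFF, 0xE2});  // jmp *%r10
  return buf;
}

// Hypothetical: the mov+call pair emitted directly at the call site.
std::vector<std::uint8_t> makeInlineFarCall(std::uint64_t target) {
  std::vector<std::uint8_t> buf;
  emitMovR10(buf, target);
  buf.insert(buf.end(), {0x41, 0xFF, 0xD2});  // call *%r10
  return buf;
}
```

Both sequences are 13 bytes. The stub route keeps the call site at a 5-byte `call rel32` but adds a second control transfer on every call; the inline route grows each call site to 13 bytes and removes that extra jump.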
Jeffrey Yasskin
2009-Jun-11 23:24 UTC
[LLVMdev] [unladen-swallow] Re: Why does the x86-64 JIT emit stubs for external calls?
On Thu, Jun 11, 2009 at 12:54 PM, Evan Cheng <evan.cheng at apple.com> wrote:

> This is needed because of lazy compilation: before the callee is resolved, it is just a JIT stub.

Even with lazy compilation, the contents of the stub get emitted (by JITEmitter::getPointerToGlobal) as a direct call to the function, not the compilation callback, because the function is an external declaration. You can watch this happen with the following program:

    declare i32 @rand()

    define i32 @main() nounwind {
    entry:
      %call = tail call i32 @rand()  ; <i32> [#uses=1]
      %add = add i32 %call, 2        ; <i32> [#uses=1]
      ret i32 %add
    }

and the command line `lli -debug-only=jit -march=x86-64 test.bc`.

With lazy compilation and a call to an internal function, the JITEmitter can emit a stub even if MachineRelocation::doesntNeedStub() (the field NeedStub gets passed into) returns true. Only returning false constrains the emitter.

> The stub is heap allocated, so it may not be in the lower 4G even if the code size model is small. I know this is the case on Darwin x86_64; I am not sure about other targets.

Oh, other targets can certainly allocate code above 4G too. sys::AllocateRWX just uses mmap with no constraints on the returned address, and I've got a Linux desktop where that always produces an address over 4G.

> I forget why this is needed for tail calls, sorry.
>
> In theory we can make the code generator emit the mov+call inline; in reality, it doesn't know whether it's jitting or not. Also, we really want to keep the code generation the same (as much as possible) whether it's jitting or compiling. One possible solution is to add a code size model specifically for the JIT, so the code generator can generate more efficient code in that configuration.

For non-JIT, the code generator doesn't ever need a stub, right? The linker does it using the relocation information? It must be ignoring the NeedStub parameter. ... But wait, is this code generator used for anything besides the JIT? Compiling uses the AsmPrinter until direct object code generation lands, and presumably they're redesigning this whole subsystem.

It sounds like I'd have to fully understand the whole structure of the code generator to fix this, and for <=2% performance, that's not really worth it. I'll probably wait for the direct object code people to get around to it. Thanks though.
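The observation about mmap placement is easy to check on a given machine. The sketch below is a hypothetical probe, not LLVM's allocator: it maps one anonymous region with no address hint and reports whether the kernel placed it at or above the 4 GiB mark. It assumes a POSIX system that provides `MAP_ANONYMOUS`, and it requests read/write rather than read/write/execute permissions so that it also runs where executable mappings are restricted.

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <cstdint>

// Hypothetical probe: returns 1 if an unconstrained anonymous mapping
// lands at or above 4 GiB, 0 if it lands below, and -1 if mmap fails.
int mappedAbove4G(std::size_t bytes) {
  // nullptr as the first argument means "no preferred address".
  void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return -1;
  const std::uint64_t addr =
      static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
  munmap(p, bytes);
  return addr >= (std::uint64_t{1} << 32) ? 1 : 0;
}
```

Calling `mappedAbove4G(4096)` from a small driver reports the placement for that process; the result depends on the operating system and its address-space layout policy, so no particular value should be expected.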
Aaron Gray
2009-Jun-12 00:56 UTC
[LLVMdev] Why does the x86-64 JIT emit stubs for external calls?
> In theory we can make the code generator emit the mov+call inline; in reality, it doesn't know whether it's jitting or not. Also, we really want to keep the code generation the same (as much as possible) whether it's jitting or compiling. One possible solution is to add a code size model specifically for the JIT, so the code generator can generate more efficient code in that configuration.

Since the CodeEmitters are now generically parameterized, they can be specialized for the JIT quite easily.

Aaron
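One way to read the suggestion about parameterized emitters is sketched below. Every name here is hypothetical and none of it is LLVM's actual class hierarchy; it only illustrates how a template parameter could select a JIT-specific call strategy while leaving other emitters unchanged.

```cpp
// Hypothetical tag types standing in for the emitter template argument.
struct JITEmitterTag {};
struct ObjectEmitterTag {};

// Default policy: far calls go through an out-of-line stub.
template <class CodeEmitter>
struct CallPolicy {
  static constexpr bool InlineFarCalls = false;
};

// Specialization for the JIT: emit the mov+call pair at the call site.
template <>
struct CallPolicy<JITEmitterTag> {
  static constexpr bool InlineFarCalls = true;
};

enum class CallForm { DirectRel32, ViaStub, InlineMovCall };

// Chooses how to emit a call, given whether the callee may be out of
// rel32 range. The emitter type decides between stub and inline forms.
template <class CodeEmitter>
CallForm chooseCallForm(bool targetMayBeFar) {
  if (!targetMayBeFar)
    return CallForm::DirectRel32;
  return CallPolicy<CodeEmitter>::InlineFarCalls ? CallForm::InlineMovCall
                                                 : CallForm::ViaStub;
}
```

With this shape, `chooseCallForm<JITEmitterTag>(true)` selects the inline pair and `chooseCallForm<ObjectEmitterTag>(true)` keeps the stub, so the shared code path stays identical for near calls.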