Jakob Stoklund Olesen
2011-Jan-04  07:30 UTC
[LLVMdev] Is PIC code defeating the branch predictor?
I noticed that we generate code like this for i386 PIC:

        calll   L0$pb
    L0$pb:
        popl    %eax
        movl    %eax, -24(%ebp)         ## 4-byte Spill

I worry that this defeats the return address prediction for returns in the function because calls and returns no longer are matched.

From Intel's Optimization Reference Manual:

"The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may degrade. [...]

To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low. [...]

Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine to be called is not recommended since it creates a mismatch in calls and returns."

Is this a known issue or a non-issue?

An alternative approach would be:

        calll   get_eip
        movl    %eax, -24(%ebp)         ## 4-byte Spill
        ...
    get_eip:
        movl    (%esp), %eax
        ret

More here: http://software.intel.com/en-us/blogs/2010/10/25/zero-length-calls-can-tank-atom-processor-performance/
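To make the concern concrete, here is a minimal Python sketch of a return address stack that pushes on every call and pops on every return. It is a toy model, not a description of any real processor: the class, the `simulate` helper, the `ret_N` address labels, and the complete lack of misprediction recovery are all assumptions made for illustration.

```python
from collections import deque


class ReturnAddressStack:
    """Toy return address stack: push on call, pop and compare on ret."""

    def __init__(self, depth=16):
        # A bounded deque silently drops the oldest entry on overflow.
        self.entries = deque(maxlen=depth)
        self.mispredicts = 0

    def call(self, return_addr):
        self.entries.append(return_addr)

    def ret(self, actual_addr):
        predicted = self.entries.pop() if self.entries else None
        if predicted != actual_addr:
            self.mispredicts += 1


def simulate(nesting, style):
    """Run `nesting` nested calls; the innermost function fetches its PIC base."""
    ras = ReturnAddressStack()
    for level in range(nesting):
        ras.call(f"ret_{level}")  # hypothetical return address labels
    if style == "call_pop":
        # calll L0$pb pushes a predictor entry; popl only touches the memory stack.
        ras.call("L0$pb")
    elif style == "get_eip":
        # calll get_eip ... ret is a matched call/return pair.
        ras.call("after_get_eip")
        ras.ret("after_get_eip")
    for level in reversed(range(nesting)):
        ras.ret(f"ret_{level}")
    return ras.mispredicts
```

In this model, `simulate(3, "call_pop")` counts three mispredicted returns, because the stale `L0$pb` entry shifts every later prediction by one, while `simulate(3, "get_eip")` counts none. Real hardware may resynchronize or treat the sequence specially, so this only illustrates the worst case being asked about.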
On Jan 3, 2011, at 11:30 PM, Jakob Stoklund Olesen wrote:

> I noticed that we generate code like this for i386 PIC:
>
>         calll   L0$pb
>     L0$pb:
>         popl    %eax
>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>
> I worry that this defeats the return address prediction for returns in the function because calls and returns no longer are matched.

Yes, this will defeat the processor's return address stack predictor. That said, I suspect it's not much of an issue on "desktop" processors: the reissue of the pop is an Atom-specific issue, so you only need to worry about the branch misprediction caused on the next return. Assuming these sequences aren't too frequent, the more elaborate tournament predictors in more powerful processors may be able to compensate for it.

That said, the alternative sequence you propose seems like it would be an improvement on any processor with a multiple issue pipeline (unless ret does a lot more work than I think it does), though it doesn't fix the reissued pop problem on Atom.

--Owen
On 04 Jan 2011, at 08:30, Jakob Stoklund Olesen wrote:

> I noticed that we generate code like this for i386 PIC:
>
>         calll   L0$pb
>     L0$pb:
>         popl    %eax
>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>
> I worry that this defeats the return address prediction for returns
> in the function because calls and returns no longer are matched.

According to benchmarks by Apple, it's nevertheless faster on modern x86 processors than the trampoline-based alternative (except maybe on Atom, as mentioned in another reply):

http://lists.apple.com/archives/perfoptimization-dev/2007/Nov/msg00005.html

At the time of that post, Apple's version of GCC still generated trampolines (hence the remark). They switched that to the above pattern afterwards.

Jonas
On Jan 4, 2011, at 4:57 AM, Jonas Maebe wrote:

> On 04 Jan 2011, at 08:30, Jakob Stoklund Olesen wrote:
>
>> I noticed that we generate code like this for i386 PIC:
>>
>>         calll   L0$pb
>>     L0$pb:
>>         popl    %eax
>>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>>
>> I worry that this defeats the return address prediction for returns
>> in the function because calls and returns no longer are matched.
>
> According to benchmarks by Apple, it's nevertheless faster on modern
> x86 processors than the trampoline-based alternative (except maybe on
> Atom, as mentioned in another reply):
> http://lists.apple.com/archives/perfoptimization-dev/2007/Nov/msg00005.html
>
> At the time of that post, Apple's version of GCC still generated
> trampolines (hence the remark). They switched that to the above
> pattern afterwards.

Right. All modern X86 processors other than Atom that I'm aware of special case this sequence so it doesn't push an entry onto the return stack predictor.

-Chris
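The special-casing described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: it guesses that a predictor could recognize the sequence as a call whose target equals its own return address (a zero-length call), and the class name, flag, and address labels are hypothetical. How real processors detect the pattern is not stated in this thread.

```python
class ZeroLengthAwareRAS:
    """Toy return address stack that can skip calls to the next instruction."""

    def __init__(self, skip_zero_length_calls):
        self.skip_zero_length_calls = skip_zero_length_calls
        self.entries = []
        self.mispredicts = 0

    def call(self, target, return_addr):
        # Assumed detection rule: a call to its own return address is the
        # PIC base idiom, so no predictor entry is pushed for it.
        if self.skip_zero_length_calls and target == return_addr:
            return
        self.entries.append(return_addr)

    def ret(self, actual_addr):
        predicted = self.entries.pop() if self.entries else None
        if predicted != actual_addr:
            self.mispredicts += 1


def count_mispredicts(nesting, skip_zero_length_calls):
    """Nested calls, then calll L0$pb / popl in the innermost function."""
    ras = ZeroLengthAwareRAS(skip_zero_length_calls)
    for level in range(nesting):
        ras.call(f"func_{level}", f"ret_{level}")  # hypothetical labels
    # The call targets the very next instruction, so target == return address.
    ras.call("L0$pb", "L0$pb")
    for level in reversed(range(nesting)):
        ras.ret(f"ret_{level}")
    return ras.mispredicts
```

With the special case enabled, `count_mispredicts(3, True)` reports no mispredicted returns; with it disabled, `count_mispredicts(3, False)` reports three, since this model has no recovery mechanism.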
Jakob Stoklund Olesen
2011-Jan-04  17:47 UTC
[LLVMdev] Is PIC code defeating the branch predictor?
On Jan 4, 2011, at 12:37 AM, Owen Anderson wrote:

> On Jan 3, 2011, at 11:30 PM, Jakob Stoklund Olesen wrote:
>
>> I noticed that we generate code like this for i386 PIC:
>>
>>         calll   L0$pb
>>     L0$pb:
>>         popl    %eax
>>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>>
>> I worry that this defeats the return address prediction for returns in the function because calls and returns no longer are matched.
>
> Yes, this will defeat the processor's return address stack predictor. That said, I suspect it's not much of an issue on "desktop" processors: the reissue of the pop is an Atom-specific issue, so you only need to worry about the branch misprediction caused on the next return. Assuming these sequences aren't too frequent, the more elaborate tournament predictors in more powerful processors may be able to compensate for it.
>
> That said, the alternative sequence you propose seems like it would be an improvement on any processor with a multiple issue pipeline (unless ret does a lot more work than I think it does), though it doesn't fix the reissued pop problem on Atom.

Since PIC was around when the current Intel microarchitecture was designed, one could speculate that it can recognize a zero-length call and knows to ignore it for branch prediction. I think the call+pop sequence is quite normal.

Strangely, the optimization reference lists both code snippets in the Atom section, but doesn't recommend one over the other.

I think the matched call+ret is best if we could stick some more instructions in there. Transform this:

    BB1:
        foo
        bar
        %eax = pic_base
        baz

Into this:

    BB1:
        call BBx
        baz
        ...
    BBx:
        foo
        bar
        movl (%esp), %eax
        ret

I don't know if it is worth it. The code appears in 32-bit PIC functions that access globals.

/jakob