Jakob Stoklund Olesen
2011-Jan-04 07:30 UTC
[LLVMdev] Is PIC code defeating the branch predictor?
I noticed that we generate code like this for i386 PIC:

        calll   L0$pb
L0$pb:
        popl    %eax
        movl    %eax, -24(%ebp)         ## 4-byte Spill

I worry that this defeats the return address prediction for returns in the function because calls and returns no longer are matched.

From Intel's Optimization Reference Manual:

"The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may degrade. [...]

To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low. [...]

Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine to be called is not recommended since it creates a mismatch in calls and returns."

Is this a known issue or a non-issue?

An alternative approach would be:

        calll   get_eip
        movl    %eax, -24(%ebp)         ## 4-byte Spill
        ...
get_eip:
        movl    (%esp), %eax
        ret

More here: http://software.intel.com/en-us/blogs/2010/10/25/zero-length-calls-can-tank-atom-processor-performance/
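For context, a sketch of how the spilled pic base is typically used later in the same function; the global _g and the reload are illustrative, not from the original message:

        calll   L0$pb
L0$pb:
        popl    %eax                    ## %eax = address of L0$pb
        movl    %eax, -24(%ebp)         ## 4-byte Spill
        ...
        movl    -24(%ebp), %eax         ## reload the pic base
        movl    _g-L0$pb(%eax), %ecx    ## pic-base-relative load of a global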
On Jan 3, 2011, at 11:30 PM, Jakob Stoklund Olesen wrote:

> I noticed that we generate code like this for i386 PIC:
>
>         calll   L0$pb
> L0$pb:
>         popl    %eax
>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>
> I worry that this defeats the return address prediction for returns in the function because calls and returns no longer are matched.

Yes, this will defeat the processor's return address stack predictor. That said, I suspect it's not much of an issue on "desktop" processors: the reissue of the pop is an Atom-specific issue, so you only need to worry about the branch misprediction caused on the next return. Assuming these sequences aren't too frequent, the more elaborate tournament predictors in more powerful processors may be able to compensate for it.

That said, the alternative sequence you propose seems like it would be an improvement on any processor with a multiple-issue pipeline (unless ret does a lot more work than I think it does), though it doesn't fix the reissued-pop problem on Atom.

--Owen
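To spell out the mismatch Owen describes, here is the problematic sequence annotated with the predictor's view of it; the annotations are illustrative, following the behavior documented for the return address stack (RAS):

        calll   L0$pb           ## pushes L0$pb on the real stack *and*
                                ## on the return address stack (RAS)
L0$pb:
        popl    %eax            ## pops the real stack only; the stale
                                ## L0$pb entry stays on the RAS
        ...
        ret                     ## consumes the stale RAS entry, so the
                                ## return is predicted to go to L0$pb
                                ## instead of the real caller: one
                                ## mispredict (plus the reissued pop on
                                ## Atom)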
On 04 Jan 2011, at 08:30, Jakob Stoklund Olesen wrote:

> I noticed that we generate code like this for i386 PIC:
>
>         calll   L0$pb
> L0$pb:
>         popl    %eax
>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>
> I worry that this defeats the return address prediction for returns
> in the function because calls and returns no longer are matched.

According to benchmarks by Apple, it is nevertheless faster on modern x86 processors than the trampoline-based alternative (except maybe on Atom, as mentioned in another reply): http://lists.apple.com/archives/perfoptimization-dev/2007/Nov/msg00005.html

At the time of that post, Apple's version of GCC still generated trampolines (hence the remark). They switched to the above pattern afterwards.

Jonas
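For reference, the trampoline pattern being compared against is GCC's get_pc_thunk helper. Roughly, as emitted for ELF targets (Apple's Darwin variant differed in details):

        call    __i686.get_pc_thunk.bx
        addl    $_GLOBAL_OFFSET_TABLE_, %ebx    ## %ebx = GOT base
        ...
__i686.get_pc_thunk.bx:
        movl    (%esp), %ebx                    ## return address = pic base
        ret                                     ## call and ret stay matched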
On Jan 4, 2011, at 4:57 AM, Jonas Maebe wrote:

> On 04 Jan 2011, at 08:30, Jakob Stoklund Olesen wrote:
>
>> I noticed that we generate code like this for i386 PIC:
>>
>>         calll   L0$pb
>> L0$pb:
>>         popl    %eax
>>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>>
>> I worry that this defeats the return address prediction for returns
>> in the function because calls and returns no longer are matched.
>
> According to benchmarks by Apple, it's nevertheless faster on modern
> x86 processors than the trampoline-based alternative (except maybe on
> Atom, as mentioned in another reply): http://lists.apple.com/archives/perfoptimization-dev/2007/Nov/msg00005.html
>
> At the time of that post, Apple's version of GCC still generated
> trampolines (hence the remark). They switched to the above
> pattern afterwards.

Right. All modern x86 processors other than Atom that I'm aware of special-case this sequence so it doesn't push an entry onto the return stack predictor.

-Chris
Jakob Stoklund Olesen
2011-Jan-04 17:47 UTC
[LLVMdev] Is PIC code defeating the branch predictor?
On Jan 4, 2011, at 12:37 AM, Owen Anderson wrote:

> On Jan 3, 2011, at 11:30 PM, Jakob Stoklund Olesen wrote:
>
>> I noticed that we generate code like this for i386 PIC:
>>
>>         calll   L0$pb
>> L0$pb:
>>         popl    %eax
>>         movl    %eax, -24(%ebp)         ## 4-byte Spill
>>
>> I worry that this defeats the return address prediction for returns in the function because calls and returns no longer are matched.
>
> Yes, this will defeat the processor's return address stack predictor. That said, I suspect it's not much of an issue on "desktop" processors: the reissue of the pop is an Atom-specific issue, so you only need to worry about the branch misprediction caused on the next return. Assuming these sequences aren't too frequent, the more elaborate tournament predictors in more powerful processors may be able to compensate for it.
>
> That said, the alternative sequence you propose seems like it would be an improvement on any processor with a multiple-issue pipeline (unless ret does a lot more work than I think it does), though it doesn't fix the reissued-pop problem on Atom.

Since PIC was around when the current Intel microarchitecture was designed, one could speculate that it recognizes a zero-length call and knows to ignore it for branch prediction. I think the call+pop sequence is quite common. Strangely, the optimization reference lists both code snippets in the Atom section, but doesn't recommend one over the other.

I think the matched call+ret would be best if we could stick some more instructions in there. Transform this:

BB1:
        foo
        bar
        %eax = pic_base
        baz

Into this:

BB1:
        call    BBx
        baz
        ...

BBx:
        foo
        bar
        movl    (%esp), %eax
        ret

I don't know if it is worth it. The code appears in 32-bit PIC functions that access globals.

/jakob
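For concreteness, a hypothetical rendering of the transformed block; the movl/xorl instructions stand in for "foo" and "bar", which must not touch %esp, since the return address is live on top of the stack inside BBx:

BB1:
        calll   BBx                     ## matched call; returns to baz
        baz
        ...
BBx:
        movl    %edi, %esi              ## "foo" (hoisted; no %esp use)
        xorl    %edx, %edx              ## "bar" (hoisted; no %esp use)
        movl    (%esp), %eax            ## %eax = return address = pic base
        ret                             ## matched return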