2016 quekong via llvm-dev
2018-Aug-14 21:09 UTC
[llvm-dev] Why did Intel change his static branch prediction mechanism during these years?
( I don't know if it's allowed to ask such a question here; if not, please remind me. )

I know Intel has implemented several static branch prediction mechanisms over the years:
* 80486 age: always not taken.
* Pentium 4 age: backwards taken / forwards not taken.
* PM, Core 2: no static prediction; the fall-back behaviour depends on whatever happens to be in the corresponding BTB entry (effectively random), according to Agner's optimization guide ¹.
* On newer CPUs such as Ivy Bridge and Haswell the behaviour has become increasingly hard to pin down, according to Matt G's experiment ².

And Intel does not seem to want to talk about it any more, because the latest material I could find in Intel's documentation was written about ten years ago.

I know static branch prediction is (far?) less important than dynamic prediction, but in quite a few situations the CPU is completely lost and the programmer (together with the compiler) is usually the best guide. Of course these situations are usually not performance bottlenecks, because once a branch is executed frequently, the dynamic predictor will capture it.

Since Intel no longer clearly states the static prediction mechanism in its documentation, GCC's __builtin_expect() can do nothing more than remove the unlikely branch from the hot path (or the reverse for a likely branch).

I am not familiar with CPU design and I don't know exactly what mechanism Intel uses nowadays for its static predictor. I just feel the best thing Intel could do is to clearly document, for each CPU, "where I plan to go when the dynamic predictor fails, forward or backward", because at that point the programmer is usually the best guide.

APPENDIX:
¹ Agner's optimization guide: https://www.agner.org/optimize/microarchitecture.pdf , section 3.5.
² Matt G's experiment: https://xania.org/201602/bpu-part-two
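P.S. For concreteness, here is a minimal sketch of the usual likely/unlikely wrappers around __builtin_expect(). The macro names and the example function are just illustrations, and the hint only influences the compiler's code layout (which path stays straight-line), not the hardware predictor itself.

#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The error check is marked unlikely, so the compiler keeps the summing
 * loop on the straight-line (hot) path and moves the early return out
 * of line. */
int sum_buffer(const int *buf, int n)
{
    if (unlikely(buf == NULL))
        return -1;

    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

Without documented static prediction rules, this is essentially all the compiler can promise: better layout, not a hardware hint.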
Bruce Hoult via llvm-dev
2018-Aug-14 21:28 UTC
[llvm-dev] Why did Intel change his static branch prediction mechanism during these years?
I think almost everyone now is using some variation on "gshare", or possibly a hybrid of gshare with another scheme (plus a prediction of which scheme is better). Maybe the best thing a compiler can do now with static information such as __builtin_expect() is to keep the high-probability path as straight-line code and move the low-probability path out of line.
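For reference, a toy sketch of the gshare idea: a table of 2-bit saturating counters indexed by (branch PC XOR global history). The sizes and names here are invented for illustration and are not any vendor's actual design.

#include <stdint.h>
#include <stdbool.h>

/* Toy gshare direction predictor.  Purely illustrative; real designs
 * differ in table size, hashing, and hybridization with other schemes. */
#define GSHARE_BITS 12u
#define GSHARE_SIZE (1u << GSHARE_BITS)

static uint8_t  counters[GSHARE_SIZE];   /* 0..3; >= 2 means predict taken */
static uint32_t ghist;                   /* recent taken/not-taken outcomes */

static bool gshare_predict(uint32_t pc)
{
    uint32_t idx = (pc ^ ghist) & (GSHARE_SIZE - 1u);
    return counters[idx] >= 2;
}

static void gshare_update(uint32_t pc, bool taken)
{
    uint32_t idx = (pc ^ ghist) & (GSHARE_SIZE - 1u);
    if (taken && counters[idx] < 3)
        counters[idx]++;
    else if (!taken && counters[idx] > 0)
        counters[idx]--;
    ghist = ((ghist << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1u);
}

Note that the counter table is just a bag of bits indexed by a hash of PC and history; two different branches can share an entry, so there is no notion of "seen this branch before" built into it.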
Fabian Giesen via llvm-dev
2018-Aug-14 23:13 UTC
[llvm-dev] Why did Intel change his static branch prediction mechanism during these years?
The 486 "assume not taken" strategy (also common in other in-order cores) is not a branch prediction mechanism as such. It can be interpreted as one (in that it's essentially equivalent to predicting all branches not taken), but first and foremost it's simply what happens when the instruction fetch stage just keeps going through memory sequentially unless redirected. In other words, it's what naturally happens in a pipelined processor in the _absence_ of any intentionally designed branch prediction mechanism. As for static vs. dynamic prediction, to be making explicit static predictions, you actually need to know whether you have seen a given branch before, and many branch predictors don't! That is, they simply have no way to tell reliably whether a given branch history (or branch target) entry is actually for the branch it's trying to predict. The various branch prediction structures are mostly just bags of bits indexed by a hash code. Now some do have a few tag bits that allow them to detect collisions, probabilistically with high likelihood anyway, but others don't. It's just a matter of whether the architects think it's worthwhile to spend the area on tag bits, or whether they have another thing they'd rather do that is a better trade-off for their target criteria. Netburst (Pentium 4) is an anomaly, AFAIK because of the way the branch prediction interacts with the trace cache. Namely, I believe that branch history etc. in Netburst was kept at the trace level. For trace cache entries, unlike branch history/target entries, there is a well-defined moment when they're new, namely when they're initialized after a trace cache miss (trace cache miss detection _has_ to do full tagging and be precise for the CPU to work); that's an opportunity to set up the new trace with the static prediction heuristic, and also branch hints. [But I'm no expert on this, and I bet there are Intel folks on this list that could be more precise.] Zooming out a bit, real designs don't generally have a single monolithic branch prediction mechanism; they have lots of them, in different places in the pipeline. Going roughly in pipeline order, some common mechanisms are: 1. Next I$ fetch line prediction. The address for the next instruction fetch needs to be determined before the current fetch even completes, for there to be no bubbles. 2. Early branch target prediction (indexed by instruction fetch address); instruction fetch is usually multiple cycles, and again you need to know the likely address to fetch next before you've even seen any instruction bytes or know where branches are! 3. A later, more precise target/direction predictor. The early predictors are "quick and dirty", and not very accurate. In parallel, you start working on a more accurate prediction; if the late predictor disagrees with the early predictor, you need to squash a few cycles worth of instructions, but this is still fairly early in the pipeline, so a lot cheaper than full mispredictions. 4. A return stack predictor for function returns, the most common type of indirect branch by far. Tracks function calls and returns and supplies what it thinks are likely targets for return statements. 5. A separate predictor for other indirect branch targets (e.g. through virtual or jump tables). BTBs can cover some of this but dedicated indirect predictors work better. Again, an indirect predictor can decide than an earlier-guessed branch target was bogus and initiate a partial flush. 5. 
There are sometimes also special loop predictors to catch counted loops that (almost) always execute the same number of iterations. Innermost loop nests are usually covered well by the regular global branch history, but loop predictors help with outer loop nests that are periodic on longer time scales than the global branch history captures. 6. Once instructions are decoded, branch target addresses (for direct branches) are verified. Branch target addresses in the BTBs and other structures are commonly stored with a reduced number of bits, and in certain (hopefully rare) cases the "decompressed" address will be incorrect. Again, if this is detected, a partial flush is initiated; we're now several cycles downstream from initial instruction fetch, but this is still far from a full branch misprediction penalty. CPU vendors are usually close-mouthed about what exactly they're doing (it's some of their secret sauce), but some are more open than others. One not-too-old CPU design that has a relatively thorough description of the branch prediction architecture (though not the nitty-gritty details) is the AMD Jaguar (low-power x86): http://support.amd.com/TechDocs/52128_16h_Software_Opt_Guide.zip Section 2.7.1 in the PDF has a (very) brief description of the various branch prediction mechanisms. -Fabian On 8/14/2018 2:09 PM, 2016 quekong via llvm-dev wrote:> ( I don't know if it's allowed to ask such question, if not, please remind me. ) > > I know Intel implemented several static branch prediction mechanisms > these years: > * 80486 age: Always-not-take > * Pentium4 age: Backwards Taken/Forwards Not-Taken > * PM, Core2: Didn't use static prediction, randomly depending on > what happens to be in corresponding BTB entry , according to agner's > optimization guide ¹. > * Newer CPUs like Ivy Bridge, Haswell have become increasingly > intangible, according to Matt G's experiment ². > > And Intel seems don't want to talk about it any more, because the > latest material I found within Intel Document was written about ten > years ago. > > I know static branch prediction is (far?) less important than dynamic, > but in quite a few situations, CPU will be completely lost and > programmers(with compiler) are usually the best guide. Of course these > situations are usually not performance bottleneck, because once a > branch is frequently executed, the dynamic predictor will capture it. > > Since Intel no longer clearly statements the dynamic prediction > mechanism in its document, the builtin_expect() of GCC can do nothing > more than removing the unlikely branch from hot path or reversely for > likely branch. > > I am not familiar with CPU design and I don't know what exactly > mechanism Intel use nowadays for its static predictor, I just feel the > best static mechanism for Intel should be to clearly document his CPU > "where I plan to go when dynamic predictor failed, forward or > backward", because usually the programmer is the best guide at that > time. > > > APPENDIX: > ¹ Agner's optimization guide: > https://www.agner.org/optimize/microarchitecture.pdf , section 3.5 > . > > ² Matt G's experiment: https://xania.org/201602/bpu-part-two > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >
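As an illustration of the return-stack predictor mentioned in item 4, here is a toy sketch; the depth, the names, and the handling of overflow/underflow and misspeculation are all simplifications, not any real core's design.

#include <stdint.h>

/* Toy return-address stack (RAS): push the return address on a call,
 * pop it to predict the target of the matching return instruction. */
#define RAS_DEPTH 16u

static uint64_t ras[RAS_DEPTH];
static unsigned ras_top;                 /* wraps, so old entries are overwritten */

static void ras_on_call(uint64_t return_addr)
{
    ras[ras_top % RAS_DEPTH] = return_addr;
    ras_top++;
}

static uint64_t ras_predict_return(void)
{
    ras_top--;                           /* underflow simply yields a stale entry */
    return ras[ras_top % RAS_DEPTH];
}

As long as calls and returns nest properly, this predicts return targets almost perfectly, which is why it is kept separate from the generic indirect-branch predictor.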
UE US via llvm-dev
2018-Aug-15 07:20 UTC
[llvm-dev] Why did Intel change his static branch prediction mechanism during these years?
From Matt's page: "The target program consists of 1000 back to back branches". Perhaps the processors got better at detecting when they may be running off into some kind of data? Wouldn't I want that kind of thing predicted as not taken? Just a thought...

-G
2016 quekong via llvm-dev
2018-Aug-16 04:45 UTC
[llvm-dev] Why did Intel change his static branch prediction mechanism during these years?
Hi Fabian Giesen, sorry for the late reply.

> As for static vs. dynamic prediction, to be making explicit static
> predictions, you actually need to know whether you have seen a given
> branch before, and many branch predictors don't! That is, they simply
> have no way to tell reliably whether a given branch history (or branch
> target) entry is actually for the branch it's trying to predict. The
> various branch prediction structures are mostly just bags of bits
> indexed by a hash code.

As you said, the branch (direction) predictor has no way to tell whether it has met a branch before, because the BHB is indexed by a hash value. But for the branch target predictor, the BTB on early Intel CPUs is organized like an ordinary cache, with set/way/tag identified by the full address of the branch.

So, I guess, is there a chance that the dynamic prediction phase gives up on a BTB miss? Namely, if the branch target predictor (rather than the direction predictor) fails to find a target, it concludes that it has never seen that branch before, decides there is not enough evidence for dynamic prediction, and hands control to the static predictor. Otherwise, how do early Intel CPUs before the Pentium 4 (Pentium I, II, III) support static prediction without a trace cache ²?

I found some words in the Intel document ¹ which may support my guess:

"Branches that do not have a history in the BTB (see Section 3.4.1, "Branch Prediction Optimization") are predicted using a static prediction algorithm..."

Though I believe this no longer applies to newer Intel CPUs, it should at least be evidence from the Pentium age.

And I found another interesting thing. Since the Core microarchitecture, Intel has stopped using the static prediction heuristic (still quoting from the Intel document, section 3.4.1.3):

"The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default."

The interesting point is: it was exactly with the Core microarchitecture that the organization of the BTB changed. Intel no longer uses the full address of a branch to identify a BTB entry, but only bits 0~21 of the address. And thus the branch target predictor can no longer tell precisely whether it has seen a branch before, just like the direction predictor.

Thanks again.

APPENDIX:
¹ Intel Document, section 3.4.1.3: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
² Agner's optimization guide, sections 3.2 and 3.3: https://www.agner.org/optimize/microarchitecture.pdf
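P.S. To make the set/way/tag picture concrete, here is a toy sketch of the kind of lookup I mean. All field widths below are invented for illustration, not Intel's actual BTB geometry. With a full-address tag, a hit means this exact branch has been seen before (so a miss could plausibly hand control to a static rule); with a tag built from only address bits 0~21, two branches 4 MiB apart produce the same set and tag and cannot be told apart.

#include <stdint.h>
#include <stdbool.h>

#define BTB_SET_BITS 9u                  /* 512 sets (made-up size) */
#define BTB_WAYS     4u

struct btb_entry {
    bool     valid;
    uint64_t tag;
    uint64_t target;
};

static struct btb_entry btb[1u << BTB_SET_BITS][BTB_WAYS];

/* Full-address tagging: set index from the low PC bits, tag from all the
 * remaining bits, so a hit identifies this exact branch. */
static bool btb_lookup_full(uint64_t pc, uint64_t *target)
{
    uint64_t set = pc & ((1u << BTB_SET_BITS) - 1u);
    uint64_t tag = pc >> BTB_SET_BITS;
    for (unsigned w = 0; w < BTB_WAYS; w++) {
        if (btb[set][w].valid && btb[set][w].tag == tag) {
            *target = btb[set][w].target;
            return true;
        }
    }
    return false;    /* miss: plausibly where a static rule would apply */
}

/* Reduced tag built from address bits 0..21 only: branches whose
 * addresses differ by a multiple of 4 MiB alias to the same set and tag. */
static uint64_t btb_reduced_tag(uint64_t pc)
{
    return (pc & 0x3FFFFFu) >> BTB_SET_BITS;
}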