Stefan Pintilie via llvm-dev
2017-Oct-25 17:50 UTC
[llvm-dev] Setting the value of MicroOpBufferSize correctly
Hi,

I've been trying to determine how to properly set the MicroOpBufferSize parameter. Based on the documentation in the header file:

// "> 1" means the processor is out-of-order. This is a machine independent
// estimate of highly machine specific characteristics such as the register
// renaming pool and reorder buffer.

Power 9 is out-of-order, so it makes sense to use a value greater than 1. However, we don't quite have a reorder buffer in the theoretical sense, where any instruction in the buffer can be picked to dispatch next. As a result, I'm looking for a better understanding of how this parameter is used in the scheduler.

I've noticed that the only place where it matters whether the value of MicroOpBufferSize is 2 or 200 is GenericScheduler::checkAcyclicLatency(). Everywhere else in the code we only care which of three categories the value falls into: 0, 1, or > 1.

Here is my understanding of what that function does; please correct me if I am wrong. A loop can have two critical paths. One is cyclic (use-def chains that cross loop iterations) and one is acyclic (everything can be computed within one iteration). If the acyclic path exceeds the cyclic path by more than the size of the instruction buffer, then we have to make sure we don't run out of instructions to dispatch from the instruction buffer. So we set the flag Rem.IsAcyclicLatencyLimited, and tryCandidate checks this flag when making decisions. Is this correct?

If that is what we are doing here, then I think the proper value for this parameter is the full size of the instruction buffer on the P9 hardware. All we really care about is whether or not we will run out of instructions to dispatch while waiting for an instruction. We don't really care about HOW the out-of-order dispatch is done, as long as we don't run out of instructions. Does my logic make sense? I want to set the parameter, but I want to make sure I understand what is going on.

Thank you,
Stefan
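P.S. To make the question concrete, here is the check as I read it, written out as a minimal standalone sketch. The function and parameter names below are mine for illustration, not the actual LLVM code:

// My reading of checkAcyclicLatency(): flag the loop as latency-limited
// when one iteration's acyclic critical path outruns the cyclic critical
// path by more than the instruction buffer can cover.
// Illustrative names only; not the in-tree implementation.
bool isAcyclicLatencyLimited(unsigned AcyclicPath, unsigned CyclicPath,
                             unsigned MicroOpBufferSize) {
  return AcyclicPath > CyclicPath + MicroOpBufferSize;
}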
Andrew Trick via llvm-dev
2017-Oct-25 20:11 UTC
[llvm-dev] Setting the value of MicroOpBufferSize correctly
> On Oct 25, 2017, at 10:50 AM, Stefan Pintilie <stefanp at ca.ibm.com> wrote:
>
> Hi,
>
> I've been trying to determine how to properly set the MicroOpBufferSize parameter. Based on the documentation in the header file:
>
> // "> 1" means the processor is out-of-order. This is a machine independent
> // estimate of highly machine specific characteristics such as the register
> // renaming pool and reorder buffer.
>
> Power 9 is out-of-order and so it makes sense to use a value greater than 1. However we don't quite have a reorder buffer in the theoretical sense where we can pick any instruction in the buffer to dispatch next. As a result, I'm looking for a better understanding of how this parameter is used in the scheduler.
>
> I've noticed that the only place where it matters if the value of MicroOpBufferSize is 2 or 200 is in GenericScheduler::checkAcyclicLatency(). In other places in the code we only care if we have values that fall in one of the three categories: 0, 1, or > 1.
> Here is my understanding of what that function does (and I could be very wrong on this one) and please correct me if I am wrong.

MicroOpBufferSize is a confusing machine parameter. Most backends only need to tell the generic machine scheduler whether to do strict VLIW-style in-order scheduling (= 0), heuristic in-order scheduling (= 1), or out-of-order scheduling (> 1). We should have some kind of alias for those modes so that backends don't need to explicitly mention the buffer size (patches please).

As for picking the buffer size value for OOO hardware, it is meant to estimate, very roughly, the number of instructions in flight before the CPU either runs out of physical registers or out of entries in some queue of instructions waiting for retirement (the reorder buffer). The generic machine scheduler doesn't try to simulate the reorder buffer, because it is pointless to do so within the scope of a single block on any modern hardware. The only case I've seen where the compiler could determine that OOO resources would be overutilized is a floating-point loop kernel characterized by long-latency operations and interesting cyclic dependencies.

Aside from the one heuristic where it's used, some backends may just want to express this level of detail about their microarchitecture for documentation purposes and future use. I imagine a separate static performance tool that *would* simulate the reservation stations and reorder buffer. It would be nice if someone would write such a thing based on LLVM.

Back to the scheduler... the generic machine scheduler has two purposes: (1) a quick way to bootstrap a backend with at least some reasonable (hopefully do-no-harm) scheduling, and (2) an example of how to piece together the shared scheduling infrastructure into a custom scheduler, with in-tree testing of those shared pieces. You're far beyond the point of using the generic scheduler as a black box. I suggest running benchmarks with and without the checkAcyclicLatency heuristic to see which loops it helps/hurts, and possibly writing your own heuristic here.
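To spell out the three modes mentioned above, here is the kind of alias I mean, as an illustrative sketch rather than anything in-tree:

// Sketch only: the three scheduling regimes that MicroOpBufferSize
// actually selects between today. Names are hypothetical.
enum class SchedPolicy { InOrderStrict, InOrderHeuristic, OutOfOrder };

SchedPolicy classifyBufferSize(unsigned MicroOpBufferSize) {
  if (MicroOpBufferSize == 0)
    return SchedPolicy::InOrderStrict;    // model each cycle; stall on hazards
  if (MicroOpBufferSize == 1)
    return SchedPolicy::InOrderHeuristic; // schedule to avoid stalls
  return SchedPolicy::OutOfOrder;         // exact value only matters to
                                          // checkAcyclicLatency()
}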
> We can have two critical paths through a loop. One is cyclic (use-def cross loop iterations) and one is acyclic (everything can be computed in one iteration). If the acyclic path is greater than the cyclic path by more than the size of the instruction buffer then we have to make sure that we don't run out of instructions to dispatch in the instruction buffer. So, we set a flag Rem.IsAcyclicLatencyLimited and then tryCandidate checks this flag when making decisions.
> Is this correct?
>
> If that is what we are doing here then I think the proper value for this parameter is the full size of the instruction buffer on the P9 hardware. All we really care about is whether or not we will run out of instructions to dispatch while waiting for an instruction. We don't really care about HOW the out-of-order dispatch is done as long as we don't run out of instructions.
> Does my logic make sense?
>
> I want to set the parameter but I want to make sure I understand what is going on.
>
> Thank you,
> Stefan

Yes, that's exactly the intention, but I need to clarify one point. It's very unlikely that the code within a single iteration would exceed the buffer size. Instead, we're looking for cases where the OOO engine can speculatively execute *many* loop iterations before retiring some instructions from the first iteration. Those are the cases where I've seen the OOO engine run out of resources. The heuristic estimates the number of iterations that can be speculatively executed in the "shadow" of the loop's acyclic path, then basically multiplies the number of in-flight iterations by the size of the loop:

/// CyclesPerIteration = max( CyclicPath, Loop-Resource-Height )
/// InFlightIterations = AcyclicPath / CyclesPerIteration
/// InFlightResources = InFlightIterations * LoopResources

Sorry, the implementation of the math is really hard to understand because of all the scaling that happens so that the heuristics can compute everything in integers (no floating point allowed in the compiler). I realize the comments aren't quite adequate. The comment above refers to "resources", but the only resource currently modeled here is the number of micro-ops being issued. You could conceivably model the per-functional-unit buffer sizes here, but I don't think those values are currently used at all.

If the number of in-flight micro-ops exceeds the buffer size, then it's important to hide latency within the loop. When the buffer runs out, the processor front-end stalls, so you want to have already issued as many independent long-latency instructions as possible when that happens.

-Andy
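P.S. In case it helps, here is that formula written out as straight-line code. This is a minimal standalone sketch with illustrative names and units; the in-tree version in GenericScheduler::checkAcyclicLatency() scales these quantities so the math stays in integers:

#include <algorithm>

// Sketch of the acyclic-latency heuristic, following the commented
// formula above. Illustrative names only. Units: paths and heights in
// cycles, loop size in micro-ops.
bool acyclicLatencyLimited(unsigned AcyclicPath, unsigned CyclicPath,
                           unsigned ResourceHeight, unsigned LoopMicroOps,
                           unsigned MicroOpBufferSize) {
  // CyclesPerIteration = max(CyclicPath, Loop-Resource-Height)
  unsigned CyclesPerIter = std::max(CyclicPath, ResourceHeight);
  if (CyclesPerIter == 0)
    return false; // no cyclic limit to execute iterations in the shadow of

  // InFlightIterations = AcyclicPath / CyclesPerIteration, rounded up.
  unsigned InFlightIters = (AcyclicPath + CyclesPerIter - 1) / CyclesPerIter;

  // InFlightResources = InFlightIterations * LoopResources; micro-ops
  // issued per iteration is the only resource modeled today.
  unsigned InFlightMicroOps = InFlightIters * LoopMicroOps;

  return InFlightMicroOps > MicroOpBufferSize;
}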