Stefan Pintilie via llvm-dev
2017-Oct-25 17:50 UTC
[llvm-dev] Setting the value of MicroOpBufferSize correctly
Hi,

I've been trying to determine how to properly set the MicroOpBufferSize parameter. Based on the documentation in the header file:

// "> 1" means the processor is out-of-order. This is a machine independent
// estimate of highly machine specific characteristics such as the register
// renaming pool and reorder buffer.

Power 9 is out-of-order, so it makes sense to use a value greater than 1. However, we don't quite have a reorder buffer in the theoretical sense, where any instruction in the buffer can be picked to dispatch next. As a result, I'm looking for a better understanding of how this parameter is used in the scheduler.

I've noticed that the only place where it matters whether the value of MicroOpBufferSize is 2 or 200 is GenericScheduler::checkAcyclicLatency(). Everywhere else in the code we only care which of three categories the value falls into: 0, 1, or > 1.

Here is my understanding of what that function does; please correct me if I am wrong. A loop can have two critical paths. One is cyclic (use-def chains that cross loop iterations) and one is acyclic (everything can be computed within one iteration). If the acyclic path exceeds the cyclic path by more than the size of the instruction buffer, then we have to make sure we don't run out of instructions to dispatch from the instruction buffer. So we set the flag Rem.IsAcyclicLatencyLimited, and tryCandidate checks this flag when making decisions. Is this correct?

If that is what we are doing here, then I think the proper value for this parameter is the full size of the instruction buffer on the P9 hardware. All we really care about is whether or not we will run out of instructions to dispatch while waiting for an instruction. We don't really care about HOW the out-of-order dispatch is done, as long as we don't run out of instructions. Does my logic make sense? I want to set the parameter, but I want to make sure I understand what is going on.

Thank you,
Stefan
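P.S. To make the question concrete, here is the check as I read it, written out as a minimal standalone sketch. The function and parameter names below are mine for illustration, not the actual LLVM code:

// My reading of checkAcyclicLatency(): flag the loop as latency-limited
// when one iteration's acyclic critical path outruns the cyclic critical
// path by more than the instruction buffer can cover.
// Illustrative names only; not the in-tree implementation.
bool isAcyclicLatencyLimited(unsigned AcyclicPath, unsigned CyclicPath,
                             unsigned MicroOpBufferSize) {
  return AcyclicPath > CyclicPath + MicroOpBufferSize;
}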
Andrew Trick via llvm-dev
2017-Oct-25 20:11 UTC
[llvm-dev] Setting the value of MicroOpBufferSize correctly
> On Oct 25, 2017, at 10:50 AM, Stefan Pintilie <stefanp at ca.ibm.com> wrote:
>
> Hi,
>
> I've been trying to determine how to properly set the MicroOpBufferSize parameter. Based on the documentation in the header file:
>
> // "> 1" means the processor is out-of-order. This is a machine independent
> // estimate of highly machine specific characteristics such as the register
> // renaming pool and reorder buffer.
>
> Power 9 is out-of-order and so it makes sense to use a value greater than 1. However we don't quite have a reorder buffer in the theoretical sense where we can pick any instruction in the buffer to dispatch next. As a result, I'm looking for a better understanding of how this parameter is used in the scheduler.
>
> I've noticed that the only place where it matters if the value of MicroOpBufferSize is 2 or 200 is in GenericScheduler::checkAcyclicLatency(). In other places in the code we only care if we have values that fall in one of the three categories: 0, 1, or > 1.
> Here is my understanding of what that function does (and I could be very wrong on this one) and please correct me if I am wrong.

MicroOpBufferSize is a confusing machine parameter. Most backends only need to tell the generic machine scheduler whether to do strict VLIW-style in-order scheduling (= 0), heuristic in-order scheduling (= 1), or out-of-order scheduling (> 1). We should have some kind of alias for those modes so that backends don't need to explicitly mention the buffer size (patches please).

As for picking the buffer size value for OOO hardware, it is meant to estimate, very roughly, the number of instructions in flight before the CPU either runs out of physical registers or out of entries in some queue of instructions waiting for retirement (the reorder buffer). The generic machine scheduler doesn't try to simulate the reorder buffer, because it is pointless to do so within the scope of a single block on any modern hardware. The only case I've seen where the compiler could determine that OOO resources would be overutilized is a floating-point loop kernel characterized by long-latency operations and interesting cyclic dependencies.

Aside from the one heuristic where it's used, some backends may just want to express this level of detail about their microarchitecture for documentation purposes and future use. I imagine a separate static performance tool that *would* simulate the reservation stations and reorder buffer. It would be nice if someone would write such a thing based on LLVM.

Back to the scheduler... the generic machine scheduler has two purposes: (1) a quick way to bootstrap a backend with at least some reasonable (hopefully do-no-harm) scheduling, and (2) an example of how to piece together the shared scheduling infrastructure into a custom scheduler, with in-tree testing of those shared pieces. You're far beyond the point of using the generic scheduler as a black box. I suggest running benchmarks with and without the checkAcyclicLatency heuristic to see which loops it helps/hurts, and possibly writing your own heuristic here.
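To spell out the three modes mentioned above, here is the kind of alias I mean, as an illustrative sketch rather than anything in-tree:

// Sketch only: the three scheduling regimes that MicroOpBufferSize
// actually selects between today. Names are hypothetical.
enum class SchedPolicy { InOrderStrict, InOrderHeuristic, OutOfOrder };

SchedPolicy classifyBufferSize(unsigned MicroOpBufferSize) {
  if (MicroOpBufferSize == 0)
    return SchedPolicy::InOrderStrict;    // model each cycle; stall on hazards
  if (MicroOpBufferSize == 1)
    return SchedPolicy::InOrderHeuristic; // schedule to avoid stalls
  return SchedPolicy::OutOfOrder;         // exact value only matters to
                                          // checkAcyclicLatency()
}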
> We can have two critical paths through a loop. One is cyclic (use-def cross loop iterations) and one is acyclic (everything can be computed in one iteration). If the acyclic path is greater than the cyclic path by more than the size of the instruction buffer then we have to make sure that we don't run out of instructions to dispatch in the instruction buffer. So, we set a flag Rem.IsAcyclicLatencyLimited and then tryCandidate checks this flag when making decisions.
> Is this correct?
>
> If that is what we are doing here then I think the proper value for this parameter is the full size of the instruction buffer on the P9 hardware. All we really care about is whether or not we will run out of instructions to dispatch while waiting for an instruction. We don't really care about HOW the out-of-order dispatch is done as long as we don't run out of instructions.
> Does my logic make sense?
>
> I want to set the parameter but I want to make sure I understand what is going on.
>
> Thank you,
> Stefan

Yes, that's exactly the intention, but I need to clarify one point. It's very unlikely that the code within a single iteration would exceed the buffer size. Instead, we're looking for cases where the OOO engine can speculatively execute *many* loop iterations before retiring some instructions from the first iteration. Those are the cases where I've seen the OOO engine run out of resources. The heuristic estimates the number of iterations that can be speculatively executed in the "shadow" of the loop's acyclic path, then basically multiplies the number of in-flight iterations by the size of the loop:

/// CyclesPerIteration = max( CyclicPath, Loop-Resource-Height )
/// InFlightIterations = AcyclicPath / CyclesPerIteration
/// InFlightResources = InFlightIterations * LoopResources

Sorry, the implementation of the math is really hard to understand because of all the scaling that happens so that the heuristics can compute everything in integers (no floating point allowed in the compiler). I realize the comments aren't quite adequate. The comment above refers to "resources", but the only resource currently modeled here is the number of micro-ops being issued. You could conceivably model the per-functional-unit buffer sizes here, but I don't think those values are currently used at all.

If the number of in-flight micro-ops exceeds the buffer size, then it's important to hide latency within the loop. When the buffer runs out, the processor front-end stalls, so you want to have already issued as many independent long-latency instructions as possible when that happens.

-Andy
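P.S. In case it helps, here is that formula written out as straight-line code. This is a minimal standalone sketch with illustrative names and units; the in-tree version in GenericScheduler::checkAcyclicLatency() scales these quantities so the math stays in integers:

#include <algorithm>

// Sketch of the acyclic-latency heuristic, following the commented
// formula above. Illustrative names only. Units: paths and heights in
// cycles, loop size in micro-ops.
bool acyclicLatencyLimited(unsigned AcyclicPath, unsigned CyclicPath,
                           unsigned ResourceHeight, unsigned LoopMicroOps,
                           unsigned MicroOpBufferSize) {
  // CyclesPerIteration = max(CyclicPath, Loop-Resource-Height)
  unsigned CyclesPerIter = std::max(CyclicPath, ResourceHeight);
  if (CyclesPerIter == 0)
    return false; // no cyclic limit to execute iterations in the shadow of

  // InFlightIterations = AcyclicPath / CyclesPerIteration, rounded up.
  unsigned InFlightIters = (AcyclicPath + CyclesPerIter - 1) / CyclesPerIter;

  // InFlightResources = InFlightIterations * LoopResources; micro-ops
  // issued per iteration is the only resource modeled today.
  unsigned InFlightMicroOps = InFlightIters * LoopMicroOps;

  return InFlightMicroOps > MicroOpBufferSize;
}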