From: Andrew Trick [mailto:atrick at apple.com]
Sent: 24 January 2014 21:52
To: Daniel Sanders
Cc: LLVM Developers Mailing List (llvmdev at cs.uiuc.edu)
Subject: Re: New machine model questions

> On Jan 24, 2014, at 2:21 AM, Daniel Sanders <Daniel.Sanders at imgtec.com> wrote:
>
>> Hi Andrew,
>>
>> I seem to be making good progress on the P5600 scheduler using the new machine model but I've got a few questions about it.
>
> Hi Daniel,
>
> These are really good questions. For future reference, I might provide better examples if you attach what you have so far for the model.
>
>> How would you represent an instruction that splits into two micro-ops and is dispatched to two different reservation stations?
>> For example, I have two reservation stations (AGQ and FPQ). An FPU load instruction is split into a load micro-op which is dispatched to AGQ and a writeback micro-op which is dispatched to FPQ. The AGQ micro-op is issued to a four-cycle latency pipeline called LDST. Three cycles after issue, the LDST pipeline wakes up the FPQ micro-op, which writes the result of the load back to the register file.
>
> This question illustrates the primary difference between the per-operand machine model and the itinerary. The itinerary directly models the stages of each pipeline independently. Some backend maintainers may still want to use itineraries if that level of precision is critical [1]. Another option is extending the new model [2].
>
> I will assume that each queue is fully pipelined (4 AGQ ops can be in-flight).
>
> Forcing all this information into a single SchedWriteRes def would look like this:
>
> def P5600FLD : SchedWriteRes<[P5600UnitAGQ, P5600UnitFP]> {
>   let Latency = 5; // 4 cycle load + 1 cycle FP writeback
>   let NumMicroOps = 2;
> }
>
> This is bad (for an in-order processor) because it prevents FPLoad + FPx from being scheduled in the same cycle and fails to detect a conflict on FP ops scheduled 5 cycles ahead.
> A better way to express it would be:
>
> def P5600LD : SchedWriteRes<[P5600UnitAGQ]> { let Latency = 4; }
> def P5600FP : SchedWriteRes<[P5600UnitFP]>;
> def P5600FLD : WriteSequence<[P5600LD, P5600FP]>;
>
> Unfortunately, the implementation currently aggregates the processor resources, ignoring the fact that they are used on different cycles. This is totally fixable [2]. However, I don't know why you would care, since an out-of-order processor doing its job will make the stalls unpredictable either way.

Thanks. I'll start with the WriteSequence method and see if testing shows that I need to go any further or not. The two reservation stations don't seem to be completely independent of each other for these split instructions. The wakeup signal used to wake up the second micro-op seems to be a demand that the micro-op issues in that cycle rather than permission to issue when it's convenient.

>> Is it possible to use other instructions already scheduled for the same cycle as part of the evaluation of a SchedPredicate in a SchedVariant?
>> I've got a class of instructions (mostly simple addition) that can dispatch to two different reservation stations (ALQ and AGQ), both of which have a suitable pipeline with the same latency. The dispatch stage can dispatch two instructions per cycle. When it has one instruction from this class it dispatches it to ALQ (this isn't strictly true but I'll come back to that), and when it has two it dispatches one to ALQ and the other to AGQ.
>
> No. The machine model is used to form a scheduling DAG independent of the original schedule. If it's important to be this precise, then I suggest you plug in a new MachineSchedStrategy where you can model stalls for any special cases during scheduling.
>
> You need a super-resource:
>
> def P5600A : ProcResource<2>;
> def P5600AGQ : ProcResource<1> { let Super = P5600A; }
> def P5600ALQ : ProcResource<1> { let Super = P5600A; }

I'll take a look at MachineSchedStrategy.
I don't know how important that precision is likely to be at the moment but I've generally found that the more accurate the machine description is, the harder it is to find one of the bad cases. That experience comes from a particular in-order scheduler in a proprietary compiler so I don't know if I can expect similar things from LLVM or not. I'm expecting out-of-order to help reduce the amount of precision that's needed for a good result but I don't know how much of a reduction I can expect at the moment.

I'm not sure I fully understand the super-resource suggestion. I've attached my WIP so you can take a look at the code in context but the relevant extracts are below.

def P5600IssueALU : ProcResource<1>;
def P5600IssueAL2 : ProcResource<1>;
def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> { let BufferSize = 16; }
def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
def P5600WriteEitherALU : SchedWriteVariant<
  [SchedVar<SchedPredicate<[{1}]>, [P5600WriteALU]>, // FIXME: Predicate
   SchedVar<SchedPredicate<[{0}]>, [P5600WriteAL2]>  // FIXME: Predicate
  ]>;

I believe you are suggesting that I change this to:

def P5600IssueEitherALU : ProcResource<2>;
def P5600IssueALU : ProcResource<1> { let Super = P5600IssueEitherALU; }
def P5600IssueAL2 : ProcResource<1> { let Super = P5600IssueEitherALU; }
def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> { let BufferSize = 16; }
def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
def P5600WriteEitherALU : SchedWriteRes<[P5600IssueEitherALU]>;

Instructions can then use P5600WriteEitherALU to pick between the two sub-resources at issue time.
One curious consequence of this is that by allowing it to pick which pipeline the instruction is issued to, it effectively allows the instruction to pick which reservation station to be dispatched to at issue time (which is backwards; normally dispatch determines the available subset of pipelines). That might not be a significant issue as far as the scheduler output is concerned but it seemed strange to me and it makes me doubt that I've fully understood it.

One thing about the attached WIP. I'm using ItinRW and InstRW at the moment but I'm planning on migrating the ItinRWs to InstRW. The reason I'm not using the Sched<> class on each instruction is that I'm not confident that there is a common set of SchedReadWrite defs that would make sense on the full range of MIPS processor implementations. I'm going to have another think about this once I'm nearer a complete scheduler for P5600.

>> Is it possible to use historical scheduling decisions as part of the evaluation of a SchedPredicate in a SchedVariant?
>> I'm fairly certain the answer to this one is 'no' (because scheduling can be performed in both directions) but I'll ask anyway. In the previous question, I said that when the dispatch stage has one instruction that can be dispatched to either ALQ or AGQ it always picks ALQ. The truth of the matter is that historical decisions are used to guess which one is most likely to stall and the dispatch stage picks the other one. I haven't established exactly what information it's using yet though so I can't give a good example.
>
> SchedVariant is really just for opcodes that can use different resources/latency depending on the value of some immediate. The kind of micro-architectural special rules/heuristics that you are describing are exactly why we have a pluggable MachineSchedStrategy.

That makes sense.

>> Is there an easy way to check I've covered every valid instruction?
>> I'm thinking it would be helpful if I could get build warnings from tablegen about valid instructions with no scheduling information. This would also prevent someone adding an instruction later and forgetting to add it to the scheduler.
>
> YES! Very good question. When implementing a new model, it's important to run TableGen with -debug-only=subtarget-emitter. You should be able to touch your .td, then grab the command via
>
> make TOOL_VERBOSE=1
>
> This is the line from ARM:
>
> llvm-tblgen -I /s/fix/lib/Target/ARM -I /s/fix/include -I /s/fix/include -I /s/fix/lib/Target -gen-subtarget -o ARMGenSubtargetInfo.inc /s/fix/lib/Target/ARM/ARM.td -debug-only=subtarget-emitter
>
> It will list all instructions and print "No machine model for <subtarget>"
>
> You will also get an assert in the scheduler, unless you add the following flag to your model:
>
> let CompleteModel = 0;

That's perfect, thanks.

>> Thanks
>> Daniel Sanders
>> Leading Software Design Engineer, MIPS Processor IP
>> Imagination Technologies Limited
>> www.imgtec.com
>
> [1] I added support for the itineraries into the new MI scheduler because I realized that some out-of-tree backend maintainers may still want that level of precision. I'm not sure yet whether you fall into that category. The new machine model was designed for out-of-order processors, but I also think it is sufficient for most in-order models. I would like to establish the new machine model as the preferred choice because it is simpler and more efficient, it will be easier for most backend developers to bring up a new subtarget, and we will then eventually have more consistency across targets. I also selfishly want more good in-tree examples of the new model so it will effectively be better documented and supported. I believe it is possible to handle special cases requiring the itinerary's precision without using an itinerary by either plugging custom logic into the MachineSchedStrategy, or extending the new machine model...
> [2] To model in-order pipeline resources we could:
>
> - Add a field to MCWriteProcResEntry:
>   + unsigned DelayCycles;
>
> - Modify the TableGen code in SubtargetEmitter to record the delay. We already do this:
>
>   // If this resource is already used in this sequence, add the current
>   // entry's cycles so that the same resource appears to be used
>   // serially, rather than multiple parallel uses. This is important for
>   // in-order machine where the resource consumption is a hazard.
>
>   But we could also add a delay to the resource cycles when the processor resource is unbuffered.
>
> - The code in SchedBoundary::bumpNode and SchedBoundary::checkHazard needs to be updated to increment the cycle accounting for DelayCycles.
>
> -Andy

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MipsScheduleP5600.td
Type: application/octet-stream
Size: 12634 bytes
Desc: MipsScheduleP5600.td
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140128/76c69b5b/attachment.obj>
On Jan 28, 2014, at 9:22 AM, Daniel Sanders <Daniel.Sanders at imgtec.com> wrote:

> llvm-tblgen -I /s/fix/lib/Target/ARM -I /s/fix/include -I /s/fix/include -I /s/fix/lib/Target -gen-subtarget -o ARMGenSubtargetInfo.inc /s/fix/lib/Target/ARM/ARM.td -debug-only=subtarget-emitter
>
> It will list all instructions and print "No machine model for <subtarget>"
> You will also get an assert in the scheduler, unless you add the following flag to your model:
>
> let CompleteModel = 0;
>
> That's perfect, thanks.

FYI: Someone just pointed out that when the instruction has an itinerary class, and the machine model has no ItinRW, you get a warning that the machine model is missing for the instruction, even if the instruction has a SchedRW list. (I really didn’t want to support this situation, but may end up fixing the warning anyway for x86.) I don’t think you’ll run into it since you’re not putting SchedRW lists on the instruction definitions themselves.

-Andy
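For readers following along, here is a minimal sketch of where the CompleteModel flag discussed above sits. It is an illustration, not the P5600 model from the attachment: the def names, the MicroOpBufferSize value, and the instregex pattern are placeholders chosen here, and the fragment assumes it lives in a target .td file that already pulls in llvm/Target/Target.td.

```tablegen
// Hypothetical fragment; all numeric values are placeholders, not P5600 data.
def P5600Model : SchedMachineModel {
  let IssueWidth = 2;          // the thread says dispatch handles two instructions per cycle
  let MicroOpBufferSize = 32;  // placeholder; a value greater than 1 denotes an out-of-order core
  let CompleteModel = 0;       // tolerate instructions that have no scheduling info yet
}

let SchedModel = P5600Model in {
  def P5600IssueALU : ProcResource<1>;
  def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;

  // Bind the write to opcodes by name, without touching the instruction defs.
  // The pattern is illustrative only.
  def : InstRW<[P5600WriteALU], (instregex "^ADDu$")>;
}
```

Once the -debug-only=subtarget-emitter output no longer reports instructions without a machine model, the CompleteModel line can presumably be dropped so that the scheduler assert guards against regressions.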
On Jan 28, 2014, at 9:22 AM, Daniel Sanders <Daniel.Sanders at imgtec.com> wrote:

> You need a super-resource:
>
> def P5600A : ProcResource<2>;
> def P5600AGQ : ProcResource<1> { let Super = P5600A; }
> def P5600ALQ : ProcResource<1> { let Super = P5600A; }
>
> I'll take a look at MachineSchedStrategy. I don't know how important that precision is likely to be at the moment but I've generally found that the more accurate the machine description is, the harder it is to find one of the bad cases. That experience comes from a particular in-order scheduler in a proprietary compiler so I don't know if I can expect similar things from LLVM or not. I'm expecting out-of-order to help reduce the amount of precision that's needed for a good result but I don't know how much of a reduction I can expect at the moment.
>
> I'm not sure I fully understand the super-resource suggestion. I've attached my WIP so you can take a look at the code in context but the relevant extracts are below.
>
> def P5600IssueALU : ProcResource<1>;
> def P5600IssueAL2 : ProcResource<1>;
> def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
> def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
>   let BufferSize = 16;
> }
> def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
> def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
> def P5600WriteEitherALU : SchedWriteVariant<
>   [SchedVar<SchedPredicate<[{1}]>, [P5600WriteALU]>, // FIXME: Predicate
>    SchedVar<SchedPredicate<[{0}]>, [P5600WriteAL2]>  // FIXME: Predicate
>   ]>;
>
> I believe you are suggesting that I change this to:
>
> def P5600IssueEitherALU : ProcResource<2>;
> def P5600IssueALU : ProcResource<1> { let Super = P5600IssueEitherALU; }
> def P5600IssueAL2 : ProcResource<1> { let Super = P5600IssueEitherALU; }
> def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
> def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
>   let BufferSize = 16;
> }
> def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
> def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
> def P5600WriteEitherALU : SchedWriteRes<[P5600IssueEitherALU]>;
>
> Instructions can then use P5600WriteEitherALU to pick between the two sub-resources at issue time. One curious consequence of this is that by allowing it to pick which pipeline the instruction is issued to, it effectively allows the instruction to pick which reservation station to be dispatched to at issue-time (which is backwards, normally dispatch determines the available subset of pipelines). That might not be a significant issue as far as the scheduler output is concerned but it seemed strange to me and it makes me doubt that I've fully understood it.

The scheduler does not model which dispatch queue (or is it issue queue?) the instructions reside in. For an OOO core, I think this is almost totally unpredictable anyway. We assume (hope) that the hardware can balance the queues.

With the itinerary-based hazard checker, I think the reservation station would force an instruction to use the first available resource. That has the opposite problem: it could unnecessarily prevent a later instruction from using that resource if the hardware can dynamically schedule.

I did not realize you were using processor groups. For many (relatively simple) cores the functional units can be expressed as a hierarchy. An instruction either needs a specific unit, or it can be issued to some broader class. You can do that without any groups. I added ProcResGroup for SandyBridge because instructions can issue to some subset of ports, and these subsets are overlapping. I think it is possible to use both groups and super resources in the same model, but it may cause confusion.
I was simply suggesting something like this, for example:

def P5600UnitA : ProcResource<2> { let BufferSize = 16; }
def P5600UnitAGQ : ProcResource<1> { let Super = P5600UnitA; }
def P5600UnitALQ : ProcResource<1> { let Super = P5600UnitA; }
def P5600WriteA : SchedWriteRes<[P5600UnitA]>;
def P5600WriteLd : SchedWriteRes<[P5600UnitAGQ, P5600UnitLdSt]>;

Where a load must issue on AGQ, and consumes one of the two ALU resources. A WriteA instruction simply uses one of the two ALU resources. We don’t model which one. The relationship between ALU2 and AGQ is not clear to me yet, so I’m not sure what’s intended in your example.

Note that when an instruction uses a ProcResGroup it may use any of the named resources but we don’t know which one. It does not use all resources in the group. If you want an instruction to use multiple resources, then you just list them in the SchedWriteRes entry. You can also compose SchedWrites using WriteSequence. (It would obviously be useful to work through a specific example here.)

FYI: BufferSize is a nice feature, but you can fairly safely omit it for an OOO core. The scheduler will by default assume an infinite dispatch queue and almost certainly generate the same schedule unless you have very large blocks! The scheduler does attempt to determine whether the OOO buffer will reach capacity across iterations of single-block loops, but it only looks at the model’s MicroOpBufferSize for this computation, not the per-resource buffer size.

-Andy
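As one possible worked illustration of the hierarchy described in this message, the fragment below combines the super-resource defs with a multi-resource write and a composed write. It is a sketch under stated assumptions: a SchedMachineModel def named P5600Model is assumed to exist elsewhere in the target, and P5600UnitLdSt, P5600UnitFP, P5600WriteFP and P5600WriteFLd are placeholder definitions added only so the fragment is complete. The 4-cycle load latency comes from the LDST pipeline description earlier in the thread.

```tablegen
// Hypothetical fragment; P5600Model is assumed to be defined by the target.
let SchedModel = P5600Model in {
  // Broad class: an instruction that names P5600UnitA may use either sub-unit.
  def P5600UnitA    : ProcResource<2>;
  def P5600UnitALQ  : ProcResource<1> { let Super = P5600UnitA; }
  def P5600UnitAGQ  : ProcResource<1> { let Super = P5600UnitA; }

  // Placeholder units for the load/store and FP pipelines.
  def P5600UnitLdSt : ProcResource<1>;
  def P5600UnitFP   : ProcResource<1>;

  // Uses one of the two ALU resources; the model does not record which one.
  def P5600WriteA   : SchedWriteRes<[P5600UnitA]>;

  // Uses multiple resources: AGQ specifically (which also consumes one unit
  // of its super-resource P5600UnitA) plus the load/store pipeline.
  def P5600WriteLd  : SchedWriteRes<[P5600UnitAGQ, P5600UnitLdSt]> {
    let Latency = 4;
  }

  // Composed write: the load followed by an FP writeback micro-op.
  def P5600WriteFP  : SchedWriteRes<[P5600UnitFP]>;
  def P5600WriteFLd : WriteSequence<[P5600WriteLd, P5600WriteFP]>;
}
```

No BufferSize is set here, in line with the note above that it can usually be omitted for an out-of-order core.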
> -----Original Message-----
> From: Andrew Trick [mailto:atrick at apple.com]
> Sent: 28 January 2014 23:10
> To: Daniel Sanders
> Cc: LLVM Developers Mailing List (llvmdev at cs.uiuc.edu)
> Subject: Re: New machine model questions
>
> On Jan 28, 2014, at 9:22 AM, Daniel Sanders <Daniel.Sanders at imgtec.com> wrote:
>
> > <snip>
>
> The scheduler does not model which dispatch queue (or is it issue queue?) the instructions reside in. For an OOO core, I think this is almost totally unpredictable anyway. We assume (hope) that the hardware can balance the queues.

That would explain some of my confusion. I think we ought to double-check our terminology just to be sure we are talking about the same things.

I'm using 'dispatch' to mean the last stage of the processor frontend (fetch/decode/dispatch possibly with other stages such as register renaming amongst them) which passes instructions on to split/unified reservation station(s). Dispatch is the last in-order stage before out-of-order execution begins. I'm then using 'issue' to mean an instruction being selected by a reservation station for execution in a pipeline and passed to it.

I would say that dispatching to reservation stations is fairly predictable since it's still in-order at that point (issue on the other hand is unpredictable). In the case of a unified reservation station, dispatch just passes the instructions to the only reservation station. For split reservation stations, it generally selects a reservation station for an instruction based on the opcode (e.g. adds/subs/shifts to one reservation station, loads/stores to another, fpu ops to another) and passes the instruction to it.

The P5600 is using split reservation stations and its dispatch is predictable in most cases. When the instructions are statically routed, the pipelines in which they can execute are all under the same reservation station.
It's not necessary to model which reservation station they were dispatched to in these cases because there's no choice. A small number of instructions are dynamically routed according to the number processed in a given cycle and the previous outcome of the decision. Once routed to a reservation station it is not possible to issue to all the pipelines that could potentially execute the instruction (e.g. AGQ cannot issue to ALU, only to AL2) and the decision cannot be reversed. For these dynamically routed instructions, it seems that the P5600 is in conflict with the current machine model. I'll look into resolving this with a MachineSchedStrategy first.

Suppose we have the following assembly (for the sake of this example, I'm going to ignore the use of history in making the dispatch decisions):

insn1: addu $1, $2, $3
insn2: addu $4, $5, $6
insn3: addu $1, $1, $7
insn4: addu $4, $4, $8

Dispatch would receive these two instructions at a time over two cycles. In cycle t+0 it checks the opcodes of insn1 and insn2 and notes that both can be dispatched to either ALQ or AGQ. It can't send both to any one of these so it sends insn1 to ALQ and insn2 to AGQ. In cycle t+1, it does the same thing and dispatches insn3 to ALQ and insn4 to AGQ.

ALQ receives insn1 at t+0. During t+1 it finds that insn1's dependencies are resolved and it is ready to issue. It issues insn1 to the only suitable pipeline under its control, ALU. Similarly, it receives insn3 at t+1 and issues it to ALU in t+2.

Meanwhile AGQ is doing the same thing with the instructions it receives. AGQ receives insn2 at t+0. During t+1 it finds that insn2's dependencies are resolved and it is ready to issue. It issues insn2 to the only suitable pipeline under its control, AL2. Similarly, it receives insn4 at t+1 and issues it to AL2 in t+2.

> <snip>
>
> I did not realize you were using processor groups. For many (relatively simple) cores the functional units can be expressed as a hierarchy. An instruction either needs a specific unit, or it can be issued to some broader class. You can do that without any groups. I added ProcResGroup for SandyBridge because instructions can issue to some subset of ports, and these subsets are overlapping. I think it is possible to use both groups and super resources in the same model, but it may cause confusion. I was simply suggesting something like this, for example:
>
> <snip>

Ok, I've switched to this method of defining the hierarchy. I was following Haswell's example but I don't need overlapping subsets.

> The relationship between ALU2 and AGQ is not clear to me yet, so I'm not sure what's intended in your example.

ALU2 is the issue port to one of the pipelines under the control of the AGQ reservation station. ALU2 is similar in principle to one of the HWPortX resources from the Haswell model; similarly, AGQ corresponds to HWPortAny (except it's one of three reservation stations and not the only one).

> FYI: BufferSize is a nice feature, but you can fairly safely omit it for an OOO core. The scheduler will by default assume an infinite dispatch queue and almost certainly generate the same schedule unless you have very large blocks! The scheduler does attempt to determine whether the OOO buffer will reach capacity across iterations of single-block loops, but it only looks at the model's MicroOpBufferSize for this computation, not the per-resource buffer size.
>
> -Andy

That's a good point, block sizes tend to be small in most code. I'll have to look into the effect on heavily unrolled and vectorized code such as FFT/DCT where the blocks are likely to be large.