thr3ads.net - llvm dev - [LLVMdev] New machine model questions [Jan 2014]

Daniel Sanders

2014-Jan-28 17:22 UTC

[LLVMdev] New machine model questions

From: Andrew Trick [mailto:atrick at apple.com]
Sent: 24 January 2014 21:52
To: Daniel Sanders
Cc: LLVM Developers Mailing List (llvmdev at cs.uiuc.edu)
Subject: Re: New machine model questions

On Jan 24, 2014, at 2:21 AM, Daniel Sanders <Daniel.Sanders at imgtec.com> wrote:

Hi Andrew,

I seem to be making good progress on the P5600 scheduler using the new machine
model but I've got a few questions about it.

Hi Daniel,

These are really good questions. For future reference, I might provide better
examples if you attach what you have so far for the model.

How would you represent an instruction that splits into two micro-ops and is
dispatched to two different reservation stations?
For example, I have two reservation stations (AGQ and FPQ). An FPU load
instruction is split into a load micro-op which is dispatched to AGQ and a
writeback micro-op which is dispatched to FPQ.
The AGQ micro-op is issued to a four-cycle latency pipeline called LDST. Three
cycles after issue, the LDST pipeline wakes up the FPQ micro-op, which writes
the result of the load back to the register file.

This question illustrates the primary difference between the per-operand machine
model and the itinerary. The itinerary directly models the stages of each
pipeline independently. Some backend maintainers may still want to use
itineraries if that level of precision is critical [1]. Another option is
extending the new model. [2]

I will assume that each queue is fully pipelined (4 AGQ ops can be in-flight).

Forcing all this information into a single SchedWriteRes def would look like
this:

def P5600FLD : SchedWriteRes <[P5600UnitAGQ, P5600UnitFP]> {
  let Latency = 5; // 4 cycle load + 1 cycle FP writeback
  let NumMicroOps = 2;
}

This is bad (for an in-order processor) because it prevents FPLoad + FPx from
being scheduled in the same cycle and fails to detect a conflict on FP ops 5
scheduled cycles ahead.

A better way to express it would be:

def P5600LD : SchedWriteRes<[P5600UnitAGQ]> { let Latency = 4; }
def P5600FP : SchedWriteRes<[P5600UnitFP]>;

def P5600FLD : WriteSequence<[P5600LD, P5600FP]>;

Unfortunately, the implementation currently aggregates the processor resources,
ignoring the fact that they are used on different cycles. This is totally
fixable [2]. However, I don't know why you would care, since an out-of-order
processor doing its job will make the stalls unpredictable either way.

Thanks. I'll start with the WriteSequence method and see if testing shows
that I need to go any further or not.

The two reservation stations don't seem to be completely independent of each
other for these split instructions. The wakeup signal used to wake up the second
micro-op seems to be a demand that the micro-op issue in that cycle rather than
permission to issue when it's convenient.

Is it possible to use other instructions already scheduled for the same cycle as
part of the evaluation of a SchedPredicate in a SchedVariant?
I've got a class of instructions (mostly simple addition) that can dispatch
to two different reservation stations (ALQ and AGQ), both of which have a
suitable pipeline with the same latency. The dispatch stage can dispatch two
instructions per cycle. When it has one instruction from this class it
dispatches it to ALQ (this isn't strictly true but I'll come back to
that), and when it has two it dispatches one to ALQ and the other to AGQ.

No. The machine model is used to form a scheduling DAG independent of the
original schedule. If it's important to be this precise, then I suggest you
plug in a new MachineSchedStrategy where you can model stalls for any special
cases during scheduling.

You need a super-resource:

def P5600A : ProcResource<2>;
def P5600AGQ : ProcResource<1> { let Super = P5600A; }
def P5600ALQ : ProcResource<1> { let Super = P5600A; }

I'll take a look at MachineSchedStrategy. I don't know how important
that precision is likely to be at the moment but I've generally found that
the more accurate the machine description is, the harder it is to find one of
the bad cases. That experience comes from a particular in-order scheduler in a
proprietary compiler so I don't know if I can expect similar things from
LLVM or not. I'm expecting out-of-order to help reduce the amount of
precision that's needed for a good result but I don't know how much of a
reduction I can expect at the moment.

I'm not sure I fully understand the super-resource suggestion. I've
attached my WIP so you can take a look at the code in context but the relevant
extracts are below.
def P5600IssueALU : ProcResource<1>;
def P5600IssueAL2 : ProcResource<1>;
def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
  let BufferSize = 16;
}
def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
def P5600WriteEitherALU : SchedWriteVariant<
  [SchedVar<SchedPredicate<[{1}]>, [P5600WriteALU]>, // FIXME: Predicate
   SchedVar<SchedPredicate<[{0}]>, [P5600WriteAL2]>  // FIXME: Predicate
  ]>;

I believe you are suggesting that I change this to:
def P5600IssueEitherALU : ProcResource<2>;
def P5600IssueALU : ProcResource<1> { let Super = P5600IssueEitherALU; }
def P5600IssueAL2 : ProcResource<1> { let Super = P5600IssueEitherALU; }
def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
  let BufferSize = 16;
}
def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
def P5600WriteEitherALU : SchedWriteRes<[P5600IssueEitherALU]>;

Instructions can then use P5600WriteEitherALU to pick between the two
sub-resources at issue time. One curious consequence of this is that by allowing
it to pick which pipeline the instruction is issued to, it effectively allows
the instruction to pick which reservation station to be dispatched to at
issue-time (which is backwards, normally dispatch determines the available
subset of pipelines). That might not be a significant issue as far as the
scheduler output is concerned but it seemed strange to me and it makes me doubt
that I've fully understood it.

One thing about the attached WIP. I'm using ItinRW and InstRW at the moment
but I'm planning on migrating the ItinRW's to InstRW. The reason I'm
not using the Sched<> class on each instruction is that I'm not
confident that there is a common set of SchedReadWrite def's that would make
sense on the full range of MIPS processor implementations. I'm going to have
another think about this once I'm nearer a complete scheduler for P5600.

Is it possible to use historical scheduling decisions as part of the evaluation
of a SchedPredicate in a SchedVariant?
I'm fairly certain the answer to this one is 'no' (because
scheduling can be performed in both directions) but I'll ask anyway. In the
previous question, I said that when the dispatch stage has one instruction that
can be dispatched to either ALQ or AGQ it always picks ALQ. The truth of the
matter is that historical decisions are used to guess which one is most likely
to stall and the dispatch stage picks the other one. I haven't established
exactly what information it's using yet though so I can't give a good
example.

SchedVariant is really just for opcodes that can use different resources/latency
depending on the value of some immediate.

The kind of micro-architectural special rules/heuristics that you are describing
are exactly why we have a pluggable MachineSchedStrategy.

That makes sense.

Is there an easy way to check I've covered every valid instruction? I'm
thinking it would be helpful if I could get build warnings from tablegen about
valid instructions with no scheduling information. This would also prevent
someone adding an instruction later and forgetting to add it to the scheduler.

YES! Very good question.

When implementing a new model, it's important to run llvm-tblgen with
-debug-only=subtarget-emitter.

You should be able to touch your .td, then grab the command via make
TOOL_VERBOSE=1

This is the line from ARM:

llvm-tblgen -I /s/fix/lib/Target/ARM -I /s/fix/include -I  /s/fix/include -I
/s/fix/lib/Target -gen-subtarget -o  ARMGenSubtargetInfo.inc
/s/fix/lib/Target/ARM/ARM.td -debug-only=subtarget-emitter

It will list all instructions and print "No machine model for <subtarget>"
You will also get an assert in the scheduler, unless you add the following flag
to your model:

  let CompleteModel = 0;

That's perfect, thanks.

Thanks

Daniel Sanders
Leading Software Design Engineer, MIPS Processor IP
Imagination Technologies Limited
www.imgtec.com

[1] I added support for the itineraries into the new MI scheduler because I
realized that some out-of-tree backend maintainers may still want that level of
precision. I'm not sure yet whether you fall into that category. The new
machine model was designed for out-of-order processors, but I also think it is
sufficient for most in-order models. I would like to establish the new machine
model as the preferred choice because it is simpler and more efficient, it will
be easier for most backend developers to bring up a new subtarget, and we will
then eventually have more consistency across targets. I also selfishly want more
good in-tree examples of the new model so it will effectively be better
documented and supported.

I believe it is possible to handle special cases requiring the itinerary's
precision without using an itinerary by either plugging custom logic into the
MachineSchedStrategy, or extending the new machine model...

[2] To model in-order pipeline resource we could

- add a field to MCWriteProcResEntry
  + unsigned DelayCycles;

- Modify the table gen code in SubtargetEmitter to record the delay.

  We already do this:
       // If this resource is already used in this sequence, add the current
       // entry's cycles so that the same resource appears to be used
       // serially, rather than multiple parallel uses. This is important for
       // in-order machine where the resource consumption is a hazard.

  But we could also add a delay to the resource cycles when the
  processor resource is unbuffered.

- The code in SchedBoundary::bumpNode and SchedBoundary::checkHazard
  needs to be updated to increment the cycle accounting for DelayCycles.
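
The effect of such a delay can be sketched outside LLVM. The Python below is only a toy model of in-order resource booking, not scheduler code: the function names, the tuple layout, and the choice of 4 cycles for the FP writeback delay are all assumptions made for illustration.

```python
# Toy model of in-order resource booking; not LLVM code.
# Each use is (resource, delay, cycles): the resource is busy for `cycles`
# cycles, starting `delay` cycles after the instruction issues.

def reserve(booked, issue_cycle, uses):
    """Find the first cycle >= issue_cycle where all uses fit, then book them."""
    cycle = issue_cycle
    while any((res, cycle + delay + c) in booked
              for res, delay, cycles in uses
              for c in range(cycles)):
        cycle += 1
    for res, delay, cycles in uses:
        for c in range(cycles):
            booked.add((res, cycle + delay + c))
    return cycle

# FP load with no delay information: both units are booked in the issue cycle.
FLD_AGGREGATED = [("AGQ", 0, 1), ("FP", 0, 1)]
# FP load with a hypothetical delay of 4 cycles on the FP writeback.
FLD_DELAYED = [("AGQ", 0, 1), ("FP", 4, 1)]
FP_OP = [("FP", 0, 1)]

def schedule(fld):
    """Issue an FP load at cycle 0, then FP ops that want cycle 0 and cycle 4."""
    booked = set()
    load = reserve(booked, 0, fld)
    same_cycle = reserve(booked, 0, FP_OP)
    later = reserve(booked, 4, FP_OP)
    return load, same_cycle, later
```

In this model the aggregated form pushes the FP op out of the load's issue cycle and misses the later conflict, while the delayed form lets the FP op share cycle 0 and moves the later one from cycle 4 to cycle 5.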

-Andy

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MipsScheduleP5600.td
Type: application/octet-stream
Size: 12634 bytes
Desc: MipsScheduleP5600.td
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140128/76c69b5b/attachment.obj>

Andrew Trick

2014-Jan-28 17:56 UTC

On Jan 28, 2014, at 9:22 AM, Daniel Sanders <Daniel.Sanders at imgtec.com>
wrote:
> llvm-tblgen -I /s/fix/lib/Target/ARM -I /s/fix/include -I  /s/fix/include
> -I /s/fix/lib/Target -gen-subtarget -o  ARMGenSubtargetInfo.inc
> /s/fix/lib/Target/ARM/ARM.td -debug-only=subtarget-emitter
>
> It will list all instructions and print "No machine model for <subtarget>"
> You will also get an assert in the scheduler, unless you add the following
> flag to your model:
>
>   let CompleteModel = 0;
>
> That's perfect, thanks.

FYI

Someone just pointed out that when the instruction has an itinerary class, and
the machine model has no ItinRW, you get a warning that the machine model is
missing for the instruction, even if the instruction has a SchedRW list. (I
really didn’t want to support this situation, but may end up fixing the warning
anyway for x86).

I don’t think you’ll run into it since you’re not putting SchedRW lists on the
instruction definitions themselves.

-Andy

Andrew Trick

2014-Jan-28 23:10 UTC

On Jan 28, 2014, at 9:22 AM, Daniel Sanders <Daniel.Sanders at imgtec.com>
wrote:
> You need a super-resource:
>  
> def P5600A : ProcResource<2>;
> def P5600AGQ : ProcResource<1> { let Super = P5600A; }
> def P5600ALQ : ProcResource<1> { let Super = P5600A; }
>  
> I'll take a look at MachineSchedStrategy. I don't know how important that
> precision is likely to be at the moment but I've generally found that the
> more accurate the machine description is, the harder it is to find one of
> the bad cases. That experience comes from a particular in-order scheduler in
> a proprietary compiler so I don't know if I can expect similar things from
> LLVM or not. I'm expecting out-of-order to help reduce the amount of
> precision that's needed for a good result but I don't know how much of a
> reduction I can expect at the moment.
>
> I'm not sure I fully understand the super-resource suggestion. I've attached
> my WIP so you can take a look at the code in context but the relevant
> extracts are below.
> def P5600IssueALU : ProcResource<1>;
> def P5600IssueAL2 : ProcResource<1>;
> def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
> def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
>   let BufferSize = 16;
> }
> def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
> def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
> def P5600WriteEitherALU : SchedWriteVariant<
>   [SchedVar<SchedPredicate<[{1}]>, [P5600WriteALU]>, // FIXME: Predicate
>    SchedVar<SchedPredicate<[{0}]>, [P5600WriteAL2]>  // FIXME: Predicate
>   ]>;
>  
> I believe you are suggesting that I change this to:
> def P5600IssueEitherALU : ProcResource<2>;
> def P5600IssueALU : ProcResource<1> { let Super = P5600IssueEitherALU; }
> def P5600IssueAL2 : ProcResource<1> { let Super = P5600IssueEitherALU; }
> def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
> def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
>   let BufferSize = 16;
> }
> def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
> def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
> def P5600WriteEitherALU : SchedWriteRes<[P5600IssueEitherALU]>;
>  
> Instructions can then use P5600WriteEitherALU to pick between the two
> sub-resources at issue time. One curious consequence of this is that by
> allowing it to pick which pipeline the instruction is issued to, it
> effectively allows the instruction to pick which reservation station to be
> dispatched to at issue-time (which is backwards, normally dispatch
> determines the available subset of pipelines). That might not be a
> significant issue as far as the scheduler output is concerned but it seemed
> strange to me and it makes me doubt that I've fully understood it.

The scheduler does not model which dispatch queue (or is it issue queue?) the
instructions reside in. For an OOO core, I think this is almost totally
unpredictable anyway. We assume (hope) that the hardware can balance the queues.

With the itinerary-based hazard checker, I think the reservation station would
force an instruction to use the first available resource which has the opposite
problem in that it could unnecessarily prevent a later instruction from using
that resource if the hardware can dynamically schedule.

I did not realize you were using processor groups. For many (relatively simple)
cores the functional units can be expressed as a hierarchy. An instruction
either needs a specific unit, or it can be issued to some broader class. You can
do that without any groups. I added ProcResGroup for SandyBridge because
instructions can issue to some subset of ports, and these subsets are
overlapping. I think it is possible to use both groups and super resources in
the same model, but it may cause confusion. I was simply suggesting something
like this, for example:

def P5600UnitA : ProcResource<2> { let BufferSize=16; }
def P5600UnitAGQ : ProcResource<1> { let Super = P5600UnitA; }
def P5600UnitALQ : ProcResource<1> { let Super = P5600UnitA; }

def P5600WriteA : SchedWriteRes<[P5600UnitA]>;
def P5600WriteLd : SchedWriteRes<[P5600UnitAGQ, P5600UnitLdSt]>;

Where a load must issue on AGQ, and consumes one of the two ALU resources. A
WriteA instruction simply uses one of the two ALU resources. We don’t model
which one.
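
The accounting described here can be illustrated with a small Python sketch. It is a toy model for a single cycle, not LLVM code: the dictionaries and the `expand`, `fits` and `consume` helpers are hypothetical, and the load/store unit from the example is left out to keep it short.

```python
# Toy model of super-resource accounting within one cycle; not LLVM code.
UNITS = {"A": 2, "AGQ": 1, "ALQ": 1}   # number of units per resource
SUPER = {"AGQ": "A", "ALQ": "A"}       # sub-resource -> super-resource

def expand(resources):
    """List each resource together with its chain of super-resources."""
    out = []
    for res in resources:
        while res is not None:
            out.append(res)
            res = SUPER.get(res)
    return out

def fits(cycle_uses, resources):
    """True if `resources` can still be consumed on top of `cycle_uses`."""
    needed = {}
    for res in expand(resources):
        needed[res] = needed.get(res, 0) + 1
    return all(cycle_uses.get(r, 0) + n <= UNITS[r] for r, n in needed.items())

def consume(cycle_uses, resources):
    """Record one use of each resource and of its super-resources."""
    for res in expand(resources):
        cycle_uses[res] = cycle_uses.get(res, 0) + 1

WRITE_A = ["A"]      # may use either ALU-capable unit; which one is not tracked
WRITE_LD = ["AGQ"]   # must use AGQ, which also takes one of the two A units
```

With this model a load followed by one WriteA fills the cycle, a second WriteA no longer fits, and two WriteA uses leave no room for a load.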

The relationship between ALU2 and AGQ is not clear to me yet, so I’m not sure
what’s intended in your example.

Note that when an instruction uses a ProcResGroup it may use any of the named
resources but we don’t know which one. It does not use all resources in the
group. If you want an instruction to use multiple resources, then you just list
them in the SchedWriteRes entry. You can also compose SchedWrites using
WriteSequence. (It would obviously be useful to work through a specific
example here).
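
As a rough illustration of that distinction, here is a toy Python sketch (not LLVM code). The helper names and the port names used in the usage note are made up, and picking the first free member is a simplification, since the real model does not decide which member is used.

```python
# Toy model contrasting "any one of a group" with "all listed resources".

def take_from_group(busy, group):
    """Occupy one free member of `group`; return it, or None if all are busy."""
    for unit in group:
        if unit not in busy:
            busy.add(unit)
            return unit
    return None

def take_all(busy, units):
    """Occupy every unit in `units`; take nothing and return False if any is busy."""
    if any(unit in busy for unit in units):
        return False
    busy.update(units)
    return True
```

For a hypothetical two-port group ["P0", "P1"], two group uses fit in the same cycle and a third does not, whereas a single write that lists both ports occupies them both at once.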

FYI: BufferSize is a nice feature, but you can fairly safely omit it for an OOO
core. The scheduler will by default assume an infinite dispatch queue and almost
certainly generate the same schedule unless you have very large blocks! The
scheduler does attempt to determine whether the OOO buffer will reach capacity
across iterations of single block loops, but it only looks at the model’s
MicroOpBufferSize for this computation, not the per-resource buffer size.

-Andy

Daniel Sanders

2014-Jan-30 13:17 UTC

> -----Original Message-----
> From: Andrew Trick [mailto:atrick at apple.com]
> Sent: 28 January 2014 23:10
> To: Daniel Sanders
> Cc: LLVM Developers Mailing List (llvmdev at cs.uiuc.edu)
> Subject: Re: New machine model questions
> 
> 
> On Jan 28, 2014, at 9:22 AM, Daniel Sanders <Daniel.Sanders at imgtec.com>
> wrote:
> 
> > <snip>
> 
> The scheduler does not model which dispatch queue (or is it issue queue?)
> the instructions reside in. For an OOO core, I think this is almost totally
> unpredictable anyway. We assume (hope) that the hardware can balance the
> queues.
That would explain some of my confusion.

I think we ought to double-check our terminology just to be sure we are talking
about the same things.
I'm using 'dispatch' to mean the last stage of the processor
frontend (fetch/decode/dispatch possibly with other stages such as register
renaming amongst them) which passes instructions on to split/unified reservation
station(s). Dispatch is the last in-order stage before out-of-order execution
begins. I'm then using 'issue' to mean an instruction being selected
by a reservation station for execution in a pipeline and passed to it.

I would say that dispatching to reservation stations is fairly predictable since
it's still in-order at that point (issue on the other hand is
unpredictable). In the case of a unified reservation station, dispatch just
passes the instructions to the only reservation station. For split reservation
stations, it generally selects a reservation station for an instruction based on
the opcode (e.g. adds/subs/shifts to one reservation station, loads/stores to
another, fpu ops to another) and passes the instruction to it.

The P5600 is using split reservation stations and its dispatch is predictable in
most cases. When the instructions are statically routed, the pipelines in which
they can execute are all under the same reservation station. It's not
necessary to model which reservation station they were dispatched to in these
cases because there's no choice. A small number of instructions are
dynamically routed according to the number processed in a given cycle and the
previous outcome of the decision. Once routed to a reservation station it is not
possible to issue to all the pipelines that could potentially execute the
instruction (e.g. AGQ cannot issue to ALU, only to AL2) and the decision cannot
be reversed. For these dynamically routed instructions, it seems that the P5600
is in conflict with the current machine model. I'll look into resolving this
with a MachineSchedStrategy first.

Suppose we have the following assembly (for the sake of this example, I'm
going to ignore the use of history in making the dispatch decisions):
insn1: addu $1, $2, $3
insn2: addu $4, $5, $6
insn3: addu $1, $1, $7
insn4: addu $4, $4, $8

Dispatch would receive these two instructions at a time over two cycles. In
cycle t+0 it checks the opcodes of insn1 and insn2 and notes that both can be
dispatched to either ALQ or AGQ. It can't send both to any one of these so
it sends insn1 to ALQ and insn2 to AGQ. In cycle t+1, it does the same thing and
dispatches insn3 to ALQ and insn4 to AGQ.

ALQ receives insn1 at t+0. During t+1 it finds that insn1's dependencies are
resolved and it is ready to issue. It issues insn1 to the only suitable pipeline
under its control, ALU. Similarly, it receives insn3 at t+1 and issues it to ALU
in t+2.

Meanwhile AGQ is doing the same thing with the instructions it receives. AGQ
receives insn2 at t+0. During t+1 it finds that insn2's dependencies are
resolved and it is ready to issue. It issues insn2 to the only suitable pipeline
under its control, AL2. Similarly, it receives insn4 at t+1 and issues it to AL2
in t+2.
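
The walk-through can be replayed with a small Python sketch. This is a toy model and not a P5600 simulator. It assumes that dispatch handles two instructions per cycle, sends the first of each pair to ALQ and the second to AGQ, ignores history, and that an instruction issues one cycle after it is dispatched or one cycle after its producers issue, whichever is later. The tuple layout and names are made up for the example.

```python
# Toy replay of the dispatch/issue example; cycle numbers are offsets from t.
PROGRAM = [
    ("insn1", "$1", ["$2", "$3"]),
    ("insn2", "$4", ["$5", "$6"]),
    ("insn3", "$1", ["$1", "$7"]),
    ("insn4", "$4", ["$4", "$8"]),
]
PIPELINE = {"ALQ": "ALU", "AGQ": "AL2"}  # the one pipeline used by each queue here

def simulate(program):
    """Return {name: (queue, pipeline, dispatch_cycle, issue_cycle)}."""
    producer_issue = {}                  # register -> issue cycle of last writer
    pipe_free = {"ALQ": 0, "AGQ": 0}     # first cycle each pipeline is free
    result = {}
    for index, (name, dest, sources) in enumerate(program):
        dispatch = index // 2                        # two instructions per cycle
        queue = "ALQ" if index % 2 == 0 else "AGQ"   # first of pair goes to ALQ
        ready = max([producer_issue.get(s, -1) + 1 for s in sources]
                    + [dispatch + 1])
        issue = max(ready, pipe_free[queue])
        pipe_free[queue] = issue + 1
        producer_issue[dest] = issue
        result[name] = (queue, PIPELINE[queue], dispatch, issue)
    return result
```

Under those assumptions insn1 and insn2 issue at t+1 on ALU and AL2, and insn3 and insn4 issue at t+2 on the same two pipelines.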
> <snip>
> 
> I did not realize you were using processor groups. For many (relatively
> simple) cores the functional units can be expressed as a hierarchy. An
> instruction either needs a specific unit, or it can be issued to some
> broader class. You can do that without any groups. I added ProcResGroup for
> SandyBridge because instructions can issue to some subset of ports, and
> these subsets are overlapping. I think it is possible to use both groups and
> super resources in the same model, but may cause to confusion. I was simply
> suggesting something like this, for example:
> 
> <snip>
Ok, I've switched to this method of defining the hierarchy. I was following
Haswell's example but I don't need overlapping subsets.
> The relationship between ALU2 and ACQ is not clear to me yet, so I'm not
> sure what's intended in your example.
ALU2 is the issue port to one of the pipelines under the control of the AGQ
reservation station. ALU2 is similar in principle to one of the HWPortX
resources from the Haswell model, similarly AGQ corresponds to HWPortAny (except
it's one of three reservation stations and not the only one).
> FYI: BufferSize is a nice feature, but you can fairly safely omit it for an
> OOO code. The scheduler will by default assume an infinite dispatch queue and
> almost certainly generate the same schedule unless you have very large
> blocks! The scheduler does attempt to determine whether the OOO buffer
> will reach capacity across iterations of single block loops, but it only
> looks at the model's MicroOpBufferSize for this computation, not the
> per-resource buffer size.
>
> -Andy
That's a good point, block sizes tend to be small in most code. I'll
have to look into the effect on heavily unrolled and vectorized code such as
FFT/DCT where the blocks are likely to be large.
